Pop Data

3 Tiers: What Infochimps and Netflix Have in Common

A recent article on Gigaom, “3 shades of latency: How Netflix built a data architecture around timeliness”, shines some light on how the best-in-class architecture for Big Data has 3 different levels, separated by the dimension of “timeliness”.

“Netflix knows that processing and serving up lots of data — some to customers, some for use on the backend — doesn’t have to happen either right away or never. It’s more like a gray area, and Netflix detailed the uses for three shades of gray — online, offline and nearline processing.”

Just as Netflix defined their “three shades of gray”, Infochimps defined the three shades through our three cloud services: Cloud::Streams (real-time processing / online), Cloud::Queries (near real-time processing / nearline), and Cloud::Hadoop (batch processing / offline). By satisfying all aspects along the time dimension, companies unlock the ability to handle virtually any use case. Collect data in real-time, or import it in batch. Process data and generate insights as it flows, or do it in large-scale historical jobs. Choose your Big Data analysis adventure by mixing and matching approaches.

The article highlights how this approach “is fairly common among web companies that understand that different applications can tolerate different latencies”. It mentions LinkedIn and Facebook as sharing the same general theory, and working with Infochimps gives you the benefits of a similar architecture, delivering the superior “3 tier approach” to Big Data.







Big Data’s Evolution: 5 Things That Might Surprise You

Over the past several years, Big Data has gone from being a somewhat obscure concept to a genuine business buzzword. As is often the case with buzzwords, when you dig a little deeper you find that many people have substantial misconceptions about what Big Data is, where it came from and where it is going.

Here are a few things that might surprise you about the evolution of Big Data:

  1. There are more “failures” out there than you’d think. We’re bombarded with the hype, but the reality is that this is still an early technology. As people are unfamiliar with the tech components of Big Data, they’re often prone to thinking that they can jump in and do everything themselves. However, the task of streaming and analyzing batch, near-real-time and real-time data in a comprehensible form is beyond the capabilities of most in-house IT departments, and will require outside expertise.

  2. It is an evolution, not a revolution. The topic of Big Data has exploded so quickly onto the media landscape that it’s easy to get the impression that it appeared from nowhere and is transforming business in a whirlwind. While the ultimate impact Big Data will have on business should not be underestimated, its ascension has been much more incremental than media coverage might lead you to believe. Its earliest stages began more than a decade ago with the increasing focus on unstructured data, and since then companies have been steadily experimenting with and building capabilities and best practices. It’s important to make the distinction between evolution and revolution because viewing Big Data as revolutionary may lead to the temptation to dive in headlong without a real plan. The smart course of action involves identifying a very specific business challenge that you’d like to address with Big Data, and then expanding and iterating your program step-by-step.

  3. Big Business doesn’t yet ‘get’ Big Data. You’d think big enterprises would have captured a 360-degree view of their customers by now. But they haven’t, and evidence of this abounds in the sloppy marketing outreach efforts that everyone experiences on a daily basis. Two essential changes need to happen in order for enterprises to truly get a handle on Big Data:  1) corporations need to break down departmental silos and freely share customer insights organization-wide; and 2) they must start bringing the apps to the data, rather than bringing their data to the apps. Companies have been reluctant to embrace the cloud for sensitive, proprietary data due to security and other reasons. However, we now have the ability to build apps in virtual private clouds that reside in tier-4 data centers, eliminating the need for the expensive, risk-laden migrations that have stood in the way of enterprises’ ability to adopt effective Big Data strategies.

  4. Housing your own data is cost-prohibitive. The old ways of doing things simply won’t work for Big Data: it’s just too big. While storing 10 TB on legacy infrastructure costs in excess of $1M, the data warehouse for any significant company is going to be way past 20 TB. The math isn’t difficult: housing your own data is super expensive. There’s no way that companies like Facebook and LinkedIn, for whom customer data is lifeblood, could have done it without leveraging the cloud. More and more, enterprises are discovering that they can achieve analytic insights from Big Data by deploying in the cloud.

  5. Hadoop alone won’t do it. Although Hadoop gets 80% of the Big Data attention, it’s only 20% of the solution. Predicting customer behavior is kind of like shooting a bullet with another bullet, and is going to require much more than a historical data perspective. Sure, Hadoop gets most of the press these days, but in order for enterprises to gain a truly customer-centric view they’ll need to tie together historical, real-time and near real-time data through a single, user-friendly interface that enables them to analyze and make decisions in-flight.

Dave Spenhoff is the VP of Marketing at Infochimps. He has helped emerging companies, including Inxight and MarkLogic, define and launch successful go-to-market strategies and has provided strategic marketing guidance to a number of technology companies.







There’s an app for that: Visualizing the Internet

“There’s an app for that.”

We’ve heard it many times, the spoken certainty that the necessities of the world are satisfied by an app.

We love cool apps as much as anyone, so FlowingData caught our attention again with this blog post: “App shows what the Internet looks like”.


“In a collaboration between PEER 1 Hosting, Steamclock Software, and Jeff Johnston, the Map of the Internet app provides a picture of what the physical Internet looks like. Users can view Internet service providers (ISPs), Internet exchange points, universities and other organizations through two view options — Globe and Network. The app also allows users to generate a trace route between where they are located to a destination node, search for where popular companies and domains are, as well as identify their current location on the map.”

Now that’s a cool app.

Read more details here >> and download the app for free on iTunes.

Thank you FlowingData for providing interesting posts for us data nerds.








Image source: FlowingData.com

Streaming Data, Aggregated Queries, Real-Time Dashboards

Some customers have a large volume of historical data that needs to be processed in our Cloud::Hadoop. Others are trying to power data-driven web or mobile applications with our Cloud::Queries powered by a scalable, NoSQL database such as HBase or Elasticsearch.

But there’s one use case that keeps popping up across our customers and industries: streaming aggregation of high-throughput data to power dynamic customer-facing applications and dashboards for internal business users.

Why is Streaming Aggregation Such a Challenge?

Here are a couple of example use cases that demand streaming aggregation:

  • Retail: You have 10s of millions of customers and millions of products. Your daily transaction volumes are enormous (e.g. up to 10M events per second for some of our bigger online retailers) but they’re also at a very fine level of detail. When reporting, you want to see data aggregated by product or by customer so you can do trending, correlation, market basket, and similar kinds of analyses.
  • Ad Tech: You generate 100s of millions of ads, pixels, impressions, clicks, conversions, each day. It’s uninteresting to track each event separately; you care about how a particular advertiser or campaign is doing. You need to provide real-time dashboards, which show performance and the value of your service to your advertisers over a dataset which can be queried ad-hoc or interactively.

Sound familiar? Do you:

  • Have 1M+ new records per day delivered continuously (~10K new records / sec)? This is when things begin to get interesting.
  • Aggregate the input data on a subset of its dimensions? Say 100 dimensions?
  • Store the aggregated inputs for several months or years? So that you can analyze trends over time?
  • And, demand the ability to create dashboards or use business intelligence tools to slice and dice this data in a variety of ways?
  • Would you like to have the original input records available when you need them? Just in case your questions change later?

If you answered yes to some or all of these questions, then you need to investigate the stream aggregation services offered by Infochimps Cloud for Big Data.

But before we get into the benefits of using Infochimps’ cloud services to solve this problem, let me first describe some other approaches and why they ultimately can fail (see our recent survey here>>).

The Traditional Approach


The traditional approach to solving a streaming aggregation problem leverages only the traditional 3-tier web application stack of web client (browser), web/application server, and SQL database.

Many organizations start out with this technology when their applications are still new. Their initial success leads to growth, which leads to more input data, which leads to users and partners demanding more transparency and insight into their product, so BI dashboards become an important aspect of managing the business effectively.

The traditional web stack provides enough data processing power during the early days, but as data volumes grow, the process of dumping raw records into your SQL database and aggregating them once nightly no longer scales.

24 hours of data from the previous day starts taking 3-4, then 7-8, then 12-13 hours to process. Ever experience this? In a recent survey, roughly 300 IT professionals told us about this problem: a nightly aggregation step that often leads to days of frustrating downtime or critical delays in the business. In the worst case, you can never fix the scaling issue and you reach what is referred to as the “horizon of futility”: the moment when the amount of time taken to aggregate a given amount of data is equal to the amount of time taken to generate that data.
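To make the “horizon of futility” concrete, here is a minimal Python sketch. It assumes a nightly job with a fixed processing capacity and a daily record volume that grows at a constant rate; the function names and all figures are hypothetical and are not taken from the survey.

```python
def hours_to_aggregate(records_per_day: float, capacity_per_hour: float) -> float:
    """Hours the nightly job needs to aggregate one day's worth of records."""
    return records_per_day / capacity_per_hour


def days_until_horizon(records_per_day: float,
                       daily_growth: float,
                       capacity_per_hour: float) -> int:
    """Days until aggregating one day of data takes 24 hours or more.

    Assumes volume grows by `daily_growth` (e.g. 0.01 for 1%) every day
    while the database's processing capacity stays fixed.
    """
    if daily_growth <= 0:
        raise ValueError("daily_growth must be positive, or the horizon is never reached")
    day = 0
    while hours_to_aggregate(records_per_day, capacity_per_hour) < 24:
        records_per_day *= 1 + daily_growth
        day += 1
    return day


# Hypothetical starting point: 10M records/day, 2.5M records/hour capacity.
print(hours_to_aggregate(10_000_000, 2_500_000))        # a 4-hour nightly job
print(days_until_horizon(10_000_000, 0.01, 2_500_000))  # days left at 1% daily growth
```

Under these assumptions, a job that comfortably fits in the nightly window today crosses the 24-hour mark once daily volume has grown sixfold, which is why the problem tends to arrive sooner than teams expect.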

Are you challenged by this scenario? Do you:

  • Rely on an SQL database like Oracle, SQL Server, MySQL, or PostgreSQL to store all your data?
  • Use this same database to calculate your aggregates in a (nightly?) batch process?
  • Grudgingly tolerate the slowdown in the performance of the application during periods in which your database is executing your batch jobs?
  • Have data losses or long delays between input data and output graphs?
  • Feel overburdened by the operations workload of pushing the 3-tier technology stack?

If so, maybe you’ve already taken some evolutionary steps using new webscale technologies such as Hadoop…

Half-Hearted Solution


Or should we say that your heart is in the right place, but the solution still falls short of expectations. Organizations confronted with streaming aggregation problems usually correctly identify one of the symptoms of lack of scalability in their infrastructure. Unfortunately, they often choose an approach to scale which is already known to them or easy to hire for: scale up your webservers and your existing SQL database(s), and then add Hadoop!

They make this choice because it is easy and it is incremental. Adding “just a little more RAM” to an SQL database may sound like the right approach, and may often work just fine in the truly early days, but it soon becomes unmanageable as the figure of merit, the speedup in batch job per dollar spent on RAM for the database (aka price-performance), becomes lower and lower as data volumes increase. This becomes even more costly as the organization needs to scale up resources to just “keep the lights on” with such an infrastructure.

Scaling of web services is often handled by spawning additional web servers (also referred to as ‘horizontally scaling’), which is a fine solution for the shared-nothing architecture of a web application. This approach, when applied to critical analytic data infrastructure, leads to the “SQL database master-slave replication and sharding” scenario that is supported by so many DBAs in the enterprise today.

What About Hadoop?

Confronted with some of these problems, organizations will often start attending Big Data conferences and learn about Hadoop, a batch processing technology at the very tip of the Big Data spear. This leads either to a search for talent, in which organizations quickly realize that Hadoop engineers and sys admins are incredibly rare resources, or to internal teams getting pulled from existing projects to build the “Hadoop cluster”. These are exciting times for internal staff. After a period of time, the organization has a functioning Hadoop cluster, albeit at a great internal operations cost and after many months of critical business delay. This Hadoop cluster may even work, happily calculating aggregate metrics from data collected in streams the prior day, and doing so orders of magnitude faster than with the Traditional Approach above.

Organizations that arrive at this point in their adoption of Big Data infrastructure then uneasily settle into believing they’ve solved their streaming aggregation problem with a newfangled batch-processing system, Hadoop. But many folks in the organization will then realize that:

  • They are spending too much time on operations and not enough time on product or business needs as engineering struggles with educating the organization on how to use these new technologies it doesn’t understand.
  • They are still stuck solving a fundamentally real-time problem with a batch solution.
  • Their sharded approach is only delaying the inevitable.

How Do Facebook, Twitter, LinkedIn, etc. Do It?


It’s not surprising that Hadoop is the first Big Data technology brought in by many organizations. Google and then Yahoo! set the stage. But what they didn’t tell you is that this is “yesterday’s approach”. So how do webscale companies like Facebook do things today? Yes, Hadoop is powerful and it’s been around longer than many other Big Data technologies, and it has great PR behind it. But Hadoop isn’t necessarily the (complete) answer to every Big Data problem.

The streaming aggregation problem is by its nature real-time.  An aggregation framework that works in real-time is the ideal solution.

Infochimps Cloud::Streams provides this real-time aggregation because it is built on top of leading stream processing frameworks used by leaders TODAY. Records can be ingested, processed, cleaned, joined, and (most importantly for all use cases) aggregated into time- and field-based bins in real-time: the bin for “this hour” or “this day” contains data from this second.

This approach is extremely powerful for solving the use cases defined above because:

  • Aggregated data is immediately available in downstream data stores and analysis (do you care to act on data now or hours, days, later?).
  • Raw data can be written to the same or a number of data stores for different kinds of processing to occur later. Not every data store is equal. You may need several to accommodate the organization’s needs.
  • NOT waiting for a batch job to complete means that data pipeline or analytics errors are IMMEDIATELY detected as they occur (and immediately recovered from) instead of potentially adding days of delay due to the failure of long-running batch jobs.
  • Ingestion and aggregation are decoupled from storing and serving historical data so applications are more robust.
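
The idea of aggregating records into time and field based bins as they arrive can be sketched in a few lines of Python. This is an illustration of the general technique only, not the Cloud::Streams API; the class name, the record fields (`ts`, `campaign`, `amount`), and the hourly bin format are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bin(ts: float) -> str:
    """Truncate a Unix timestamp to the start of its UTC hour."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:00Z")


class StreamingAggregator:
    """Keeps running count/total per (hour, field value) bin, updated per record."""

    def __init__(self, field: str):
        self.field = field
        self.bins = defaultdict(lambda: {"count": 0, "total": 0.0})

    def ingest(self, record: dict) -> tuple:
        """Fold one record into its bin immediately; no nightly batch step."""
        key = (hour_bin(record["ts"]), record[self.field])
        bin_ = self.bins[key]
        bin_["count"] += 1
        bin_["total"] += record["amount"]
        return key


# Hypothetical ad-tech events: two in the same hour, one in the next hour.
agg = StreamingAggregator(field="campaign")
for event in [
    {"ts": 0, "campaign": "spring", "amount": 1.5},
    {"ts": 3599, "campaign": "spring", "amount": 2.5},
    {"ts": 3600, "campaign": "spring", "amount": 4.0},
]:
    agg.ingest(event)

for key, value in sorted(agg.bins.items()):
    print(key, value)
```

Because each record updates its bin on arrival, a dashboard reading `agg.bins` sees the current hour's totals right away. A production system would also persist the bins and the raw records to downstream stores, which this sketch leaves out.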

Infochimps Cloud, with its streaming services, is more than just a point product: it’s a suite of data analytics services addressing your streaming, ad-hoc/interactive query, and batch analytics needs all in an integrated solution that you can take advantage of within 30 days. It is also offered as a private cloud service managed by dedicated support and operations engineers who are experts at Big Data. This means you get all the benefits of Big Data technologies without having to bear the tremendous operations burden they incur.

What Comes Next?

We covered how Hadoop isn’t a good solution to the streaming aggregation problem, but that doesn’t mean it isn’t useful. On the contrary, long-term historical analysis of raw data collected by a streaming aggregation service is crucial to developing deeper insights than are available in real-time.

That’s why the Infochimps Cloud for Big Data also includes Hadoop.  Collect and aggregate data in real-time and then spin up a dynamic Hadoop cluster every weekend to process weekly trends.  The combination of real-time responsiveness and insight from long-time-scale analysis creates a powerful approach to harnessing a high throughput stream of information for business value.

Dhruv Bansal is the Chief Science Officer and Co-Founder of Infochimps. He holds a B.A. in Math and Physics from Columbia University in New York and attended graduate school in Physics at the University of Texas at Austin. For more information, email Dhruv at dhruv@infochimps.com or follow him on Twitter at @dhruvbansal.







ZDNet Article Asks The Same Question: Why Wouldn’t You?

Toby Wolpe, senior reporter at ZDNet, recently wrote an article entitled “Big data: Why most businesses just don’t get it” highlighting findings from Gartner vice president and analyst Debra Logan.

Her quote about how acquiring big-data services from a third party could make sense caught our eye:

  • “If it is cheap, if big data turns out to be something you can get from someone else, you can rent the infrastructure, you can ship a bunch of your data and you can just see what happens, then why not? Why wouldn’t you do that?”

Why wouldn’t you do that? Why wouldn’t you want to benefit from the fastest way to develop and deploy Big Data environments with Infochimps?

Wolpe also states, “one of the main barriers to asking the right questions of big data is a lack of expertise and a shortage of data scientists”. The Infochimps team is made up of data scientists and cloud computing experts, available to help you effectively leverage Big Data, resulting in better data-driven decisions.







Customized, Intelligent, Vertical Applications – The Future of Big Data?


The Ideal Big Data Application Development Environment

Let’s assume that your entire organization has access to the following building blocks:

  • Data: All sources of data from the enterprise (at rest and in motion)
  • Analytics: Any/All Queries, Algorithms, Machine Learning Models
  • Application Business Logic: Domain specific use-cases / business problems
  • Actionable Insights: Knowledge of how to apply analytics against data through the use of application business logic to produce a positive impact to the business
  • Infrastructure Configuration: Highly scalable, distributed, enterprise-class infrastructure capable of combining data and analytics with app logic to produce actionable insights

Imagine if your entire organization was empowered to produce data-driven applications tailored specifically for your vertical use-cases?

Data-Driven Vertical Apps


You are a regional bank who is under heavier regulation, focused on risk management, and expanding your mobile offerings. You are seeking ways to get ahead of your competition through the use of Big Data by optimizing financial decisions and yields.

What if there was an easy and automated way to define new data sources, create new algorithms, apply these to gain better insight into your risk position, and ultimately operationalize all this by improving your ability to reject and accept loans?


You are a retailer who is being affected by the economic downturn, demographic shifts, and new competition from online sources. You are seeking ways of leveraging the fact that your customers are empowered by mobile and social by transforming the shopping experience through the use of Big Data.

What if there was an easy and automated way to capture all customer touch points, create new segmentation and customer experience analytics, apply these to create a customized cross-channel solution which integrates online shopping with social media, personalized promotions, and relevant content?


You are a fixed line operator, wireless network provider, or fixed broadband provider who is in the middle of convergence of both services and networks, and feeling price pressures of existing services. You are seeking ways to leverage cloud and Big Data to create smarter networks (autonomous and self-analyzing), smarter operations (improving working efficiency and capacity of day-to-day operations), and ways to leverage subscriber demographic data to create new data products and services to partners.

What if there was an easy and automated way to start by consuming additional data across the organization, and then deploy segmentation analytics to better target customers and increase ARPU?

It Starts With The “Infrastructure Recipe”

OK. You are a member of the application development team. All you have to do is create a data-driven application “deploy package.” It’s your recipe of all the data sources, analytics, and application logic needed to insert into this magical cloud service that produces your industry and use-case specific application. You don’t need to be an analytics expert. You don’t need to be a DBA, an ETL expert or even a Big Data technologist. All you need is a clear understanding of your business problem, and you can assemble the parts through a simple-to-use “recipe” which is abstracted from the details of the infrastructure used to execute on that recipe.

Any Data Source

Imagine an environment where your enterprise data is at your fingertips – no heavy ETL tools, no database exports, no Hadoop Flume or Sqoop jobs. Access to data is as simple as defining “nouns” in a sentence. Where your data lives is not a worry. You are equipped with the magic ability to simply define what the data source is and where it lives, and accessing it is automated. You also need not care whether the data is some large historic volume living in a relational database or real-time streaming event data.

Analytics Made Easy

Imagine a world where you can pick from literally thousands of algorithms and apply them to any of the above data sources in part or in combination. You create one algorithm and can apply it to years of historic data and/or a stream of live real-time data. Also, imagine a world where configuring your data in a format that your algorithms can consume is made seamless. Lastly, your algorithms execute on infrastructure in a parallel, distributed, highly scalable way. Getting excited yet?

Focus on Applications With Actionable Insights


Now let’s embody this combination of analytics and data in a way that can actually be consumed and acted upon. Imagine a world where you can produce your insights and report on them with your BI tool of choice. That’s kind of exciting.

But what’s even more exciting is the ability to deploy your insights operationally through an application that leverages your domain expertise and understanding of the business logic associated with the targeted use-case you are solving. Translation – you can code up a Java, Python, PHP, or Ruby application that is light, simple, and easy to build/maintain. Why? Because the underlying logic normally embedded in ETL tools, separate analytics software tools, MapReduce code, NoSQL queries and stream processing logic is pushed up into the hands of application developers. Drooling yet?  Wait, it gets better.

Big Data, Cloud and The Enterprise


Let’s take this entire application paradigm and automate it within an elastic cloud service purpose-built for the organization. You have the ability to submit your application “deploy packages” to be instantly processed without having to understand the compute infrastructure and, better yet, without having to understand the underlying data analytic services required to process your various data sources in real-time, near real-time or in batch modes.

Ok…if we had such an environment, we’d all be producing a ton of next-generation applications…data-driven, highly intelligent and specific to our industry and use-cases.

I’m ready…are you?

Jim Kaskade serves as CEO of Austin-based Infochimps, the leading Big Data Platform-as-a-Service provider. Jim is a visionary leader within both large as well as small company environments with over 25 years of experience building hi-tech businesses, leading startups in cloud computing enterprise software, software as a service (SaaS), online and mobile digital media, online and mobile advertising, and semiconductors from their founding to acquisition.







5 Questions Framing Data-Driven Decisions

While making data-driven decisions is nothing new (remember the rise of “decision support systems” and “business intelligence”?), it does seem that enterprises have a new urgency these days: enterprises that make data-driven decisions are gaining benefits ranging from better customer insights and higher sales to more efficient operations and lower costs. What’s not to like with that?

Today, the “volume, velocity and variety” of data that enterprises have at their disposal is mind-bendingly greater than just a few years ago. And, enterprises are embracing the kind of real-time decision making that does not just run the business, it runs the business smarter. Whether driving better customer engagement (and sales) or enabling more efficient operations, big data has become an essential asset for the modern enterprise.

Earlier in my career I was an operations research analyst – kind of an early-day data scientist. There were 5 questions I always made sure to answer regarding any project I undertook. These questions frame an analytic process that underlies making effective data-driven decisions, and I think they are as applicable today as ever.

  1. Do I understand the decision to be made, especially the business factors that make this decision important?
  2. Do I have a model that captures the decision process? I.e., do I have an analytic framework, mathematical description, appropriate algorithms, etc., that describe the decision to an appropriate degree of detail? Part of this is picking the right algorithms.
  3. Do I have the data? This is pretty obvious: if data is going to drive a decision, you need to have the data. Even in today’s environment of an overabundance of data, it’s still important to make sure you have data appropriate for the model and the decision.
  4. Do I have the necessary computational infrastructure? This used to mean, can I run this on my PC or do I need to get time in the data center? Today it means, how can I get a cluster of Hadoop servers pumping data into a NoSQL database to drive my analytics? Today’s infrastructure is much harder to master.
  5. Am I producing results that are driving the decision? If so, great. If not, maybe I got something wrong in #’s 1-4. Repeat 1-4 until satisfied.

Questions 1 and 5 are about the business. Since you know your business better than anyone, you’re pretty much on your own for these. Questions 2, 3 and 4 are about the data, analytics and computational infrastructure to get you the answers you need. There are plenty of companies that can help you here, in whole or in part. The important thing is to not get bogged down in the infrastructure. That’s where the Infochimps platform really shines. As quoted from a recent TechCrunch article, “Infochimps is one of a growing ecosystem of companies that are programming the knowledge of data scientists, statisticians and programmers into applications that businesspeople can use.”








Image Source: beafields.com

[Infographic] Taming Big Data from Wikibon

From opening with a Big Data market forecast to ending with a shout-out for all industries to embrace Big Data as the definitive source of competitive advantage, the following infographic from Wikibon personifies Big Data as a beast (data volumes are growing exponentially) that can be tamed (thanks to new approaches for processing, storing and analyzing). It includes real-world Big Data use cases, which I appreciated. I was most amazed by how “decoding the human genome used to take ten years, but can now be done in 7 days.”

The quote from Kevin Weil, the Director of Product for Revenue at Twitter, brings the benefit of valuable Big Data insights home: “It’s no longer hard to find the answer to a given question; the hard part is finding the right question and as questions evolve, we gain better insight into our ecosystem and our business.”

Scroll down, geek out on the infographic, and if you want more, check out an oldie but goodie article:  6 Illuminating Big Data Infographics

[Infographic: Taming Big Data, from Wikibon]

Did you notice the chimp within the Big Data forecast?

Thank you Wikibon for posting this!








The 3 Waypoints of a Data Exploration

Part of our goal is to unlock the big data stack for exploratory analytics.

How do you know when you’ve found the right questions? That you’ve gone deep enough to trust the answers? Here’s one sign.

The 3 Waypoints of a Data Exploration:

  • What you knew: is it validated by the data?
  • What you suspect: how do your hypotheses agree with reality?
  • What you would never have suspected: something unpredictable in advance?

In Practice:
A while back, a friend asked me about signals in the Twitter stream for things like “Spanglish”: multiple languages mixed in the same message. I did a simple exploration of tweets from around the world (simplifying at first to non-English languages) to see how easy such messages are to find.

I took 100 million tweets and looked for only those “non-keyboard” characters — é (e with acute accent) or 猿 (Kanji character meaning ‘ape’) or even ☃ (snowman).

Using all the cases where there were two non-keyboard characters in the same message, I assembled the following graph.
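The counting step can be sketched in a few lines of Python. This is only an illustration, not the code used for the exploration: it assumes “non-keyboard” simply means any non-ASCII, non-whitespace character, and the function names and sample messages are made up for the example.

```python
from collections import Counter
from itertools import combinations


def non_keyboard_chars(message):
    """Return the distinct non-ASCII, non-whitespace characters in a message."""
    return {ch for ch in message if ord(ch) > 127 and not ch.isspace()}


def cooccurrence_counts(messages):
    """Count how often each unordered pair of non-keyboard characters shares a message."""
    counts = Counter()
    for message in messages:
        # Sorting gives every pair one canonical (smaller, larger) ordering.
        chars = sorted(non_keyboard_chars(message))
        for pair in combinations(chars, 2):
            counts[pair] += 1
    return counts


# Hypothetical sample data standing in for the 100 million tweets.
sample = ["情報 and 猿", "猿 ☃", "café", "情報 猿 again"]
edges = cooccurrence_counts(sample)
```

Each key in `edges` is a pair of characters and each value is the number of messages in which they appeared together, which is all a graph tool needs as a weighted edge list.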

Imagine tying a little rubber band between every pair of characters, as strong as the number of times they were seen hanging out together; also, give every character the desire for a bit of personal space so they don’t just pile on top of each other. It’s a super-simple model that tools like Cytoscape or Gephi will do out-of-the-box.
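For the curious, here is a toy version of that rubber-band model in plain Python. It is a sketch under my own assumptions (the force constants, step size and sample weights are arbitrary), and it is not the algorithm Cytoscape or Gephi actually ship; those tools use more refined layouts.

```python
import math
import random


def spring_layout(nodes, weights, steps=500, step_size=0.1, seed=0):
    """Toy force-directed layout: weighted springs pull, every pair repels."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[b][0] - pos[a][0]
                dy = pos[b][1] - pos[a][1]
                dist = max(math.hypot(dx, dy), 1e-6)
                weight = weights.get((a, b), 0) + weights.get((b, a), 0)
                pull = 0.1 * weight * dist   # rubber band: stronger with more co-occurrences
                push = 0.05 / (dist * dist)  # personal space: strong only up close
                f = pull - push              # positive draws the pair together
                fx, fy = f * dx / dist, f * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy
        for n in nodes:
            fx, fy = force[n]
            # Cap each move at 0.1 so a near-collision cannot fling a node away.
            scale = min(step_size, 0.1 / max(math.hypot(fx, fy), 1e-9))
            pos[n][0] += scale * fx
            pos[n][1] += scale * fy
    return pos


# Hypothetical co-occurrence counts: two tight pairs joined by one weak link.
nodes = ["é", "ö", "情", "猿"]
weights = {("é", "ö"): 5, ("情", "猿"): 5, ("ö", "情"): 1}
layout = spring_layout(nodes, weights)
```

Strongly bound pairs settle close together, while weakly bound ones are held at arm's length.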

That gave this picture (I left out the edges for clarity and hand-arranged the clusters at the bottom):

[Image: 3 Waypoints]
This “map” of the world — the composition of each island, and the arrangement of the large central archipelago — popped out of this super-simplistic model. It had no information about human languages other than “sometimes, when a person says 情報 they also say 猿.” Any time the data is this dense and connected, I’ve found it speaks for itself.

Now let’s look at the 3 Waypoints.

What We Knew: What I really mean by “knew” is “if this isn’t the case, I’m going to suspect my methods much more strongly than the results”:

  • Most messages are in a single language, but there are some crossovers. After the fact, I colored each character by its “script” type from the Unicode standard (e.g., Hangul is in cyan). As you can see, most of the clouds have a single color.
  • Languages with large alphabets have tighter-bound clouds, because there are more “pairs” to find (e.g., the Hiragana character cloud is denser than the Arabic cloud).
  • Languages with smaller representation don’t show up as strongly (e.g., there are not as many Malayalam tweeters as Russian (Cyrillic) tweeters).
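The coloring step can be approximated as follows. Python’s standard library does not expose the Unicode Script property directly, so this sketch assumes the first word of a character’s Unicode name is a good enough stand-in. That holds for letters but not for symbols (☃ is simply named “SNOWMAN”). The palette is hypothetical apart from the two colors mentioned in this post.

```python
import unicodedata

# Only cyan for Hangul and blue for Katakana come from the picture;
# the fallback color is an arbitrary choice for this sketch.
PALETTE = {"HANGUL": "cyan", "KATAKANA": "blue"}


def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def color_for(ch):
    """Pick a display color for a character based on its rough script."""
    return PALETTE.get(rough_script(ch), "gray")
```

For example, `rough_script("é")` gives `"LATIN"` and `color_for("한")` gives `"cyan"`.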

What We Suspected:

First, about the clusters themselves:

  • Characters from Latin scripts (the accented versions of the characters English speakers are familiar with) do indeed cluster together, and group within that cluster. Many languages use ö, but only subsets of them use Å or ß. You can see rough groups for Scandinavian, Romance and Eastern-European scripts.
  • Japanese and Chinese are mashed together, because both use characters from the Han script.

Second, about the binds between languages. Clusters will arrange themselves in the large based on how many co-usages were found. A separated character dragged out in the open is especially interesting — somehow no single language “owns” that character.

Things we suspected about the connections:

  • Nearby countries will show more “mashups”.  Indeed, Greek and Cyrillic are tightly bound to each other, and loosely bound to European scripts; Korean has strong ties to European and Japanese/Chinese scripts. This initial assumption was partially incorrect though — Thai appears to have stronger ties to European than to Japanese/Chinese scripts.
  • Punctuation, Math and Music are universal. Look closely and you’ll see the fringe of brownish characters pulled out into “international waters”.

What We Never Suspected in Advance: There were two standouts that slapped me in the face when taking a closer look.

The first is the island in the lower right, off the coast of Europe. It’s a bizarre menagerie of Amharic, International Phonetic Alphabet and other scripts. What’s going on? These are characters that taken together look like upside-down English text: “¡pnolɔ ǝɥʇ uı ɐʇɐp ƃıq”. (Try it out yourself: http://www.revfad.com/flip.html) My friend Steve Watt’s reaction was, “so you’re saying that within the complexity of the designed-for-robots Unicode standard, people found some novel, human, way to communicate? Enterprises and Three Letter Agencies dedicate tons of resources for such findings”.

As soon as you’ve found a new question within your answers you’ve reached Waypoint 3 — a good sign for confidence in your results.

However, my favorite is the one single blue (Katakana) character that every language binds to (see close-up below). Why is Unicode code point U+30C4, the Katakana “Tsu” character, so fascinating?

[Image: 3 Waypoints Smiley]

Because ツ looks like a smiley face.
The common bond across all of humanity is a smile.




Is “Big Data” the Wrong Term?

It’s likely that, like me, you have heard again and again about “big data”, its 3 V’s, and the Hadoop brand. Yes, volume, velocity, and variety of data are making it difficult to use traditional data solutions like BI cubes, relational databases, and bespoke data pipelines. The world needs new superheroes like Hadoop, NoSQL, NewSQL, DevOps, etc. to solve our woes.


However, these new technologies and approaches have done much more than just solve the problems around petabytes of data and thousands of events per second. They are the right way to do data. That’s why I’m not convinced the term “big data” was a good choice for us to land on as an industry. It’s really “smart data” or “scalable data.” And despite my distaste for adding a version number to buzz phrases, even “Data 2.0” would be more apt.

Whether you are a CTO/CIO, system architect, manager, consultant, developer, sys admin, or simply an interested professional, my goal is to offer some initial points on why the big data approach is a good way to do data management and analytics, regardless of the speed and quantity of data.

Scalable Data: Multi-Node Architecture and Infrastructure-as-Code

Distributed, horizontally scalable multi-node systems are always the right way to do infrastructure, no matter the size of your data or the size of your IT team. This wasn’t always the case, but multi-node systems are now as easy to manage as single-node solutions. It’s so easy because monitoring, logging, management software, and more are all baked right in; systems come to life in a coordinated fashion that hides the complexity and scales as needed. You can test your infrastructure the same way you test program code. While manually testing a multi-node system may be difficult, testing a piece of code is straightforward.
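To make “test your infrastructure like code” concrete, here is a minimal sketch in Python. It does not use any real provisioning tool; the cluster definition, role names and the rule that every role must ship monitoring and logging are all assumptions invented for the example.

```python
# Hypothetical cluster definition; names and fields are illustrative only.
CLUSTER = {
    "name": "analytics",
    "roles": {
        "hadoop_worker": {"count": 5, "services": ["datanode", "monitoring", "logging"]},
        "database": {"count": 3, "services": ["datastore", "monitoring", "logging"]},
    },
}

# Assumed policy: these services must be baked into every role.
REQUIRED_SERVICES = {"monitoring", "logging"}


def validate(cluster):
    """Return a list of problems in a cluster definition (empty means OK)."""
    problems = []
    for role, spec in cluster["roles"].items():
        if spec["count"] < 1:
            problems.append(f"{role}: needs at least one node")
        missing = REQUIRED_SERVICES - set(spec["services"])
        if missing:
            problems.append(f"{role}: missing {sorted(missing)}")
    return problems


def scale(cluster, role, count):
    """Return a copy of the definition with one role resized."""
    roles = {name: dict(spec) for name, spec in cluster["roles"].items()}
    roles[role]["count"] = count
    return {**cluster, "roles": roles}
```

Growing the cluster becomes a one-line change such as `scale(CLUSTER, "hadoop_worker", 20)`, and `validate` can run in the same test suite as the rest of your code before anything is deployed.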

One of the worst things that can happen to an IT team is having to manage major architecture changes. Using open source, multi-node technologies with an infrastructure-as-code foundation lets organizations grow organically and swap tools and software in and out as needed. Simply modify your infrastructure definitions, test your code, and deploy. Additionally, this kind of framework works perfectly with the DevOps approach to system management. Code repositories are collaborative and iterative, empowering individual developers to manage infrastructure directly while keeping the safeguards and tests in place to ensure reliability and quality.

Smart Data: Machine Learning and Data Science

You don’t have to have petabytes of data to begin implementing smart algorithms. To run your business more efficiently, you need to be predictive. You must forecast business and market trends before they happen so you can anticipate how to steer your organization. The companies that win will be the ones analyzing and understanding as much data as possible, building data science as a key competency. The big data ecosystem is making it easier to work with data by providing tools like Mahout for machine learning, Hive for business intelligence queries, or R for statistical analysis, which can interface with Hadoop. Because of big data architecture, you can keep data fresh, use a larger swath of data, and use the newest, most powerful tools to perform the analysis and processing.

Agnostic Data: The Right Database for Each Job

New data pipelining frameworks enable real-time stream processing with multi-node scalability and the ability to fork or merge flows. What that means is, you can easily support multiple databases for multiple problems: columnar stores as primary data stores, relational databases for reporting, search databases for data exploration, graph databases for relationship data, document stores for unstructured data, etc. Because of data splitting/merging capabilities, and your DevOps infrastructure ensuring your databases have integrated monitoring and logging, the added burden of having more than one database is minimal. You just have to learn how to interface with the data through easy-to-use APIs and client libraries.
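Forking a flow is easy to picture in code. The sketch below is not tied to any particular pipelining framework: plain Python lists stand in for the databases, and the sink names, record fields and routing rules are hypothetical.

```python
from collections import defaultdict


def fork(records, routes):
    """Send each record to every sink whose predicate accepts it."""
    sinks = defaultdict(list)
    for record in records:
        for sink_name, accepts in routes.items():
            if accepts(record):
                sinks[sink_name].append(record)
    return dict(sinks)


# Hypothetical events and routing rules.
events = [
    {"user": "a", "text": "hello"},
    {"user": "b", "follows": "a"},
]
routes = {
    "primary_store": lambda r: True,            # everything lands in the primary store
    "search_index": lambda r: "text" in r,      # only text goes to the search database
    "graph_store": lambda r: "follows" in r,    # relationships go to the graph database
}
sinks = fork(events, routes)
```

The same record can land in several sinks, which is the point: each database only sees the slice of the flow it is good at.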

Holistic Data: Hadoop is Not The End All, Be All

Finally, let’s tackle Hadoop specifically. Hadoop is oriented around large-scale batch processing of data. But much of big data also involves databases, data integration/collection, real-time stream processing, and data exploration. Hadoop is not a one-trick pony, but it’s also not the answer to every data problem known to man.

Frameworks like Flume, Storm, and S4 are making it easier to perform stream processing, such as collecting hundreds of tweets per second or thousands of ad impressions per second, or processing data in near real-time as it flows to its destination (whether a database, Hadoop filesystem, etc.). New database technologies are providing more powerful ways of querying data and building applications. R, Hive, Mahout, and more are providing better data scientist tools. Tableau, Pentaho, GoodData, and others are pushing the envelope with data visualization and big data dashboarding.
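As a tiny illustration of what “events per second” processing means, here is a sketch in plain Python. It is not how Flume, Storm, or S4 work internally. It assumes events arrive as non-negative timestamps in seconds, and the function name and sample values are made up.

```python
from collections import Counter


def events_per_second(timestamps):
    """Bucket event timestamps (in seconds, non-negative) into whole seconds."""
    # Accepts any iterable, so a generator reading from a live feed works too.
    return Counter(int(ts) for ts in timestamps)


# Hypothetical arrival times for six events.
rates = events_per_second([0.1, 0.5, 1.2, 1.9, 1.95, 3.0])
```

A real streaming framework does the same kind of bucketing continuously and across many machines.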

So…

Big data software and frameworks are the right foundation for data + data integration and collection + data science + statistical analysis + infrastructure management and administration + IT scaling + data-centric applications + data exploration and visualization. Often regardless of data size.

Your organization benefits from adopting these best practices early and working with vendors that understand your company’s problem isn’t just “oh no, I have too much data”. It’s all about return on investment. The big data approach lowers overhead, enables faster and more efficient IT infrastructure management, generates better insights, and puts them to work in your organization.


