A couple of days ago, O’Reilly’s Edd Dumbill took a look at hot topics in data for the coming year. His five big data predictions for 2012 were absolutely spot on and illustrate the most important next steps in our ability to find insight in data: visualization, interactivity and abstraction. As CTO of a tech startup at the center of a lot of these conversations, I thought I’d share some of my thoughts on the matter as well. A full post with my 2012 big data predictions will be out in the next couple of weeks.
Moving Towards Insight, Not Data
Nobody wants data. It’s costly to store and bothersome to gather, parse, and make useful. What you do want is insight: the ability to see deeper and make better decisions. However, even these tools of the future can only take your internal data so far. At some point, it becomes important to widen your observation field to include explanatory variables that give your data context. In other words, at some point it’s useful to bring in outside data to give your own internal data more shape and meaning.
Explanatory variables give your data context, which leads you to insight.
Data marketplaces such as Infochimps, Factual and Microsoft Azure will take off as CIOs discover that BI tools are for generating questions, not answers. For example, pretend you are a camping equipment store based in Reno, NV with an e-commerce site. After years of business, you understand your yearly sales cycle (peaks in the early spring and early fall), who your repeat customers are, who your local competitors are, etc. However, you may uncover curiosities that cannot be answered by internal data: why do sales peak in early to mid-August and why do so many folks request holds during that time? Why does your customer spread suddenly change from mostly folks in the Nevada area to folks from all over the country? Your business intelligence tools may fall short here. Explanatory variables from outside sources, including weather, popularity of competitors (Foursquare check-ins, web traffic, etc) and major festivals in your area (oh HAI Burning Man!) may help you get to the bottom of why your store suddenly becomes flooded with neon furry leg-warmer wearing hippies around Labor Day.
In other words, your data creates questions, and outside data creates context and helps you answer them. In short, the answer to the too-much-data problem is more data.
In his article, Edd talks about the idea of streaming data processing. I haven’t gotten to play with Esper yet, but Ilya Grigorik has — as he puts it, Esper lets you “store queries not data, process each event in real-time, and emit results when some query criteria is met.” You can read his overview here.
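To make the “store queries not data” idea concrete, here is a toy sketch of the pattern in Ruby. This is not Esper (a JVM library with its own EPL query language); the class and method names below are invented purely to illustrate evaluating registered queries against each event as it arrives:

```ruby
# Toy sketch of the "store queries, not data" pattern: rather than
# persisting events and querying later, each registered query is
# evaluated against every incoming event in real time, and a handler
# fires when its criteria are met. (Illustrative only -- not Esper's API.)
class StreamProcessor
  def initialize
    @queries = []
  end

  # Register a named predicate and a handler to run when it matches.
  def register(name, predicate, &handler)
    @queries << { name: name, predicate: predicate, handler: handler }
  end

  # Process one event; emit results for every matching query, then
  # discard the event -- no raw data is stored.
  def process(event)
    @queries.each do |q|
      q[:handler].call(event) if q[:predicate].call(event)
    end
  end
end

processor = StreamProcessor.new
processor.register(:slow_requests, ->(e) { e[:latency_ms] > 500 }) do |e|
  puts "slow request: #{e[:path]} (#{e[:latency_ms]}ms)"
end

processor.process(path: '/checkout', latency_ms: 730) # fires the handler
processor.process(path: '/', latency_ms: 12)          # no output
```

Real engines like Esper add windowing, aggregation, and pattern matching on top of this basic shape, but the inversion — queries are long-lived, data is ephemeral — is the core idea.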
At a lower and simpler level, we here at Infochimps LOVE Flume. It does one thing, simply and well: gets data from over here to over there, perhaps doing things to it along the way. It was developed for reliable log handling, but there are so many wonderful ways to beautifully misuse Flume:
- We replaced a set of massive batch jobs to parse scraped data with a simple Flume decorator that turns raw JSON into database-ready records. The new data flow is not only near-real-time, but also simpler, more reliable, and more maintainable.
- If you’re shipping your weblogs around with Flume, ten lines of Ruby and a tiny little StatsD daemon are all you need to track in real-time the rate, response code and latency of every web request. Joining the Church of Graphs has never been easier.
- Trying to scale a distributed message queue (RabbitMQ, Resque, etc)? If you care about throughput more than latency, look carefully at Flume — you’re probably better off.
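The per-event transform inside a decorator like the one in the first bullet amounts to something like the following. Flume decorators themselves are written in Java; this Ruby version just sketches the record-munging step, and the field names are invented for the example:

```ruby
require 'json'
require 'time'

# Sketch of a decorator-style transform: raw scraped JSON in, a flat,
# database-ready record out. (Field names are hypothetical; the point is
# that the parse/normalize/flatten work happens per event, in-flight,
# instead of in a massive batch job later.)
def to_record(raw_json)
  event = JSON.parse(raw_json)
  {
    id:         event['id'],
    # Normalize timestamps to ISO 8601 so the database sorts them sanely.
    scraped_at: Time.parse(event['scraped_at']).utc.iso8601,
    # Flatten nested fields into plain columns.
    user_name:  event.fetch('user', {})['name'],
    text:       event['text'].to_s.strip
  }
rescue JSON::ParserError
  nil # drop malformed events rather than poisoning the flow
end

record = to_record('{"id":7,"scraped_at":"2011-12-20 08:30:00 UTC",' \
                   '"user":{"name":"chimpy"},"text":" hello "}')
```

Because each event is handled as it flows through, the pipeline stays near-real-time, and a bad event costs you one record, not a failed batch.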
By the way, if timeseries are your bread and butter, check out our friends at Eidosearch.
Data Science Workflows and Tools for Fast Iteration
Edd describes Hadoop’s batch-oriented processing as sufficient for many uses, but argues that batch processing isn’t adequate for online and real-time needs. Spark/Mesos has great promise for the kind of interactive exploration Edd describes. (It’s as close as anyone’s gotten to a REPL for big data exploration.)
At Infochimps, our workflow revolves around writing short, readable scripts. See our chapter in O’Reilly’s Hadoop: The Definitive Guide for a case study on how we developed a toolkit for Hadoop streaming in a Ruby environment.
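The Hadoop Streaming contract those scripts build on is tiny: Hadoop feeds raw input lines on STDIN and collects tab-separated key/value pairs from STDOUT. A minimal word-count mapper in plain Ruby (a stripped-down sketch of the contract, not the toolkit from the chapter) looks like this:

```ruby
# Minimal Hadoop Streaming mapper, sketched in plain Ruby. The method
# below is the whole per-line contract: take a raw input line, emit
# tab-separated key/value pairs for the reducer to aggregate.
def map_line(line)
  line.downcase.split(/\W+/).reject(&:empty?).map { |word| "#{word}\t1" }
end

# In a real job this loop reads STDIN:  ARGF.each_line { |l| puts map_line(l) }
puts map_line("Big data, big questions")
```

You run it with the streaming jar, something like `hadoop jar hadoop-streaming.jar -input logs/ -output counts/ -mapper mapper.rb -reducer reducer.rb`. Because the contract is just lines in, lines out, each script stays short, testable on the command line with plain pipes, and readable.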
Visualization is King
The most important machine learning algorithms all, at their heart, exploit a graph of one sort or another. Yet simply seeing, in any form, a graph of significant scale is still an oafish RAM-exhausting slog. Far and away the best tool we’ve used is Gephi, an open source interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.
So, how essential is visualization, especially graph visualization, to quality data science? LinkedIn, home of one of the top data science teams in the world, used its consistently-brilliant/outrageously-unfair hiring advantage to bring aboard Mathieu Bastian, Gephi’s author. LinkedIn derives so much value in just being able to see and understand their data better that they’re funding the full-time development of Gephi, to the benefit of us all.