There are a lot of people out there with a Terabyte problem but no Petabyte problem, yet they are forced to make use of a stack developed to address Facebook's, Yahoo's, and JP Morgan's Petabyte problems. Hadoop out of the box is oriented toward achieving 100% utilization of fixed-size clusters by analytics teams of 12, 50, or 100+ people. In contrast, the bulk of even forward-thinking enterprises are at the level of just having handed two PhD statisticians a copy of the elephant book, a mis-provisioned cluster, and a slap on the back with a directive to “go find us som’a that insight!”
I wanted to share a few observations we’ve made about these other customers and their distinct needs, and point to how we seek to address those needs with our own product.
Our first major observation is that while Hadoop might headline the bill, streaming data delivery is the opening act that moves the most merchandise. Most of our customers on initial contact mention Hadoop by name — yet universally the first-delivered and most necessary component has been streaming data delivery into a scalable database and/or Hadoop.
In fact, we’ve had clients who excitedly purchased and set up a Hadoop cluster and had plenty of data they wanted to analyze, yet none of it was in the cluster. It may seem obvious once pointed out, but you need a way to feed data into your cluster. Enter modern open source tools such as Flume and Storm. Indeed, Flume was originally created to feed hungry Hadoop clusters with streaming log data.
What people are now realizing, though, is just how powerful streaming data delivery tools like these are: you can get a surprising amount of analytical power (and visibility into the data as well) while the data is still in flight. These realizations have driven the accelerated adoption of many open source streaming technologies, like Esper, Flume, and Storm. I’ve been using Hadoop since ’08, and the demand for Storm is growing faster than even Hadoop’s did during its ascent.
Another important feature set we evangelize and see validated is what an underlying cloud infrastructure enables for the enterprise. Cloud-enabled elasticity makes exploratory analytics transformatively more powerful, as companies can scale their infrastructure up and down as needed.
In contrast to the Petabyte-companies, who focus on 100% cluster utilization, the target metric for a development cluster fit for the Terabyte-company is high downtime: the ability to go from 10 to 100 machines, back down to 10, and then rest at 0 machines over the course of a job. Hadoop out of the box doesn’t meet this target, and closing that gap was one of the most interesting engineering challenges we’ve solved.
So where else does the cloud fit in the Hadoop use case? Being able to safely grow, shrink, and stop/restart Hadoop isn’t just a slider UX control; it’s a fundamental change in developer mindset and capabilities. For example, when we were a 6-person team with an AWS bill that rivaled our payroll, we would run the parse stages of jobs on high-CPU instances, then slam the cluster shut mid-workflow and bring it back up on high-memory instances for the graph-heavy stages. As our platform matured, we moved to giving each developer their own cluster; too often Chimp A needed 30 machines for 2 hours, while Chimp B needed 6 machines all day. Most companies would have to compromise with a 30-machine cluster running all day. We’ve been able to reject that approach.
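To put rough numbers on the Chimp A / Chimp B comparison, here is a back-of-the-envelope sketch in Python. The 8-hour length of "all day" is an assumption made for illustration, not a figure from our billing, and the helper name `machine_hours` is made up for this example.

```python
# Back-of-the-envelope comparison of machine-hours:
# one shared fixed cluster vs. a right-sized cluster per developer.

WORKDAY_HOURS = 8  # assumption: "all day" means an 8-hour working day


def machine_hours(machines: int, hours: float) -> float:
    """Machine-hours consumed by a cluster of `machines` running for `hours`."""
    return machines * hours


# Per-developer elastic clusters, sized to each job.
chimp_a = machine_hours(30, 2)              # 30 machines for 2 hours
chimp_b = machine_hours(6, WORKDAY_HOURS)   # 6 machines all day
per_developer_total = chimp_a + chimp_b

# The compromise: one 30-machine cluster left running all day.
shared_total = machine_hours(30, WORKDAY_HOURS)

# Fraction of the shared cluster's capacity the two jobs actually need.
utilization = per_developer_total / shared_total

print(f"per-developer clusters: {per_developer_total} machine-hours")
print(f"shared 30-machine cluster: {shared_total} machine-hours")
print(f"share of the fixed cluster actually needed: {utilization:.0%}")
```

Under that assumption the two right-sized clusters consume well under half of what the always-on cluster would, and the gap widens if "all day" means 24 hours.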
Tuning a Hadoop job to your cluster is fiendishly difficult and time-consuming, while tuning your cluster to the job is comparatively straightforward. Data Scientists at the Terabyte-company shouldn’t be pinned down by the difficulties of working with technologies that weren’t designed for them. By enabling Hadoop in an elastic context (public or private cloud, internal or outsourced), Infochimps and others working on these challenges are a big part of bringing Hadoop to the larger market.