Some customers have a large volume of historical data that needs to be processed in our Cloud::Hadoop. Others are trying to power data-driven web or mobile applications with our Cloud::Queries powered by a scalable, NoSQL database such as HBase or Elasticsearch.
But there’s one use case that keeps popping up across our customers, industries and across nearly all use cases: streaming aggregation of high-throughput data to be used to power dynamic customer-facing applications and dashboards for internal business users.
Why is Streaming Aggregation Such a Challenge?
Here are a couple of example use cases that demand of the streaming aggregation use case:
- Retail: You have 10s of millions of customers and millions of products. Your daily transaction volumes are enormous (e.g. up to 10M events per second for some of our bigger online retailers) but they’re also at a very fine a level of detail. When reporting, you want to see data aggregated by product or by customer so you can do trending, correlation, market basket, etc., kinds of analyses.
- Ad Tech: You generate 100s of millions of ads, pixels, impressions, clicks, conversions, each day. It’s uninteresting to track each event separately; you care about how a particular advertiser or campaign is doing. You need to provide real-time dashboards, which show performance and the value of your service to your advertisers over a dataset which can be queried ad-hoc or interactively.
Sound familiar? Do you:
- Have more than 1 M+ new records per day delivered continuously (~10K new records / sec)? This is when things begin to get interesting.
- Aggregate the input data on a subset of its dimensions? Say 100 dimensions?
- Store the aggregated inputs for several months or years? So that you can analyze trends over time?
- And, demand the ability to create dashboards or use business intelligence tools to slice and dice this data in a variety of ways?
- Would you like to have the original input records available when you need them? Just in case your questions change later?
If you answered yes to some or all of these questions, then you need to investigate the stream aggregation services offered by Infochimps Cloud for Big Data.
But before we get into the benefits of using Infochimps’ cloud services to solve this problem, let me first describe some other approaches and why they ultimately can fail (see our recent survey here>>).
The Traditional Approach
The traditional approach to solving a streaming aggregation problem leverages only the traditional 3-tier web application stack of web client (browser), web/application server, and SQL database.
Many organizations start out with this technology when their applications are still new. Their initial success leads to growth, which leads to more input data, which leads to users and partners demanding more transparency and insight into their product: so BI dashboards become an important aspect of managing your business effectively.
The traditional web stack provides enough data processing power during the early days, but as data volumes grow, the process of dumping raw records into your SQL database and aggregating them once nightly no longer scales.
24 hours of data from the previous day starts taking 3-4 then 7-8, then 12-13 hours to process. Ever experience this? A problem almost over 300 IT professionals told us about in a recent survey, had to do with this issue of a nightly aggregation step that, many times, leads to many days of frustrating downtime or critical delays in the business or in the worst case a situation where you simply never can fix this scaling issue referred to as the “horizon of futility” — the moment when the amount of time taken to aggregate a given amount of data is equal to the amount of time taken to generate that data.
Are you using challenged with this scenario? Do you:
- Rely on an SQL database like Oracle, SQL Server, MySQL, or PostgreSQL to store all your data?
- Use this same database to calculate your aggregates in a (nightly?) batch process?
- Grudgingly tolerate the slow down in the performance of the application during periods in which your database is executing your batch jobs?
- Have data losses or long delays between input data and output graphs?
- Feel overburdened by the operations workload of pushing the 3-tier technology stack?
If so, maybe you’ve already taken some of evolutionary steps using new webscale technologies such as Hadoop…
Or should we say that your heart is in the right place, but the solution still falls short of expectations. Organizations confronted with streaming aggregation problems usually correctly identify one of the symptoms of lack of scalability in their infrastructure. Unfortunately, they often choose an approach to scale which is already known to them or easy to hire for: scale up your webservers and your existing SQL database(s), and then add Hadoop!
They make this choice because it is easy and it is incremental. Adding “just a little more RAM” to an SQL database may sound like the right approach, may often work just fine in the truly early days, but soon becomes unmanageable as the figure of merit — speedup in batch job per dollar spent (aka price-performance) on RAM for the database becomes lower and lower as data volumes increase. This becomes even more costly as the organization needs to scale up resources to just “keep the lights on” with such an infrastructure.
Scaling of web services is often handled by spawning additional web servers (also referred to as ‘horizontally scaling’), which is a fine solution for the shared-nothing architecture of a web application. This approach, when applied to critical analytic data infrastructure, leads to the “SQL database master-slave replication and sharding” scenario that is supported by so many DBAs in the enterprise today.
What About Hadoop?
Confronted with some of these problems, organizations will often start attending Big Data conferences and learn about Hadoop, a batch processing technology at the very tip of the Big Data spear. This leads either to a search for talent where organizations quickly realize that Hadoop engineers and sys admins are incredibly rare resources; or internal teams get pulled from existing projects to build the “Hadoop cluster”. These are exciting times for internal staff, until after a period of time where the organization has a functioning Hadoop cluster, albeit at a great internal operations cost and after many months of critical business delay. This Hadoop cluster may even work, happily calculating aggregate metrics from data collected in streams the prior day, and even at orders of magnitude faster than with the Traditional Approach above.
Organizations who arrive at this point in their adoption of Big Data infrastructure then uneasily settle into believing they’ve solved their streaming aggregation problem with a newfangled batch-processing system with Hadoop. But many folks in the organization will then realize that:
- They are spending too much time on operations and not enough time on product or business needs as engineering struggles with educating the organization on how to use these new technologies it doesn’t understand.
- They are still stuck solving a fundamentally real-time problem with a batch-solution.
- Their sharded approach is only delaying the inevitable.
How Does Facebook, Twitter, Linkedin, etc. Do It?
It’s not surprising that Hadoop is the first Big Data technology brought in by many organizations. Google and then Yahoo! set the stage. But what they didn’t tell you was that is “yesterday’s approach”. So how do webscale companies like Facebook and the like to things today? Yes, Hadoop is powerful and it’s been around longer than many other Big Data technologies, and it has great PR behind it. But Hadoop isn’t necessarily the (complete) answer to every Big Data problem.
The streaming aggregation problem is by its nature real-time. An aggregation framework that works in real-time is the ideal solution.
Infochimps Cloud::Streams provides this real-time aggregation because it is built on top of leading stream processing frameworks used by leaders TODAY. Records can be ingested, processed, cleaned, joined, and — most importantly for all use cases — aggregated into time and field based bins in real-time: the bin for “this hour” or “this day” contain data from this second.
This approach is extremely powerful for solving the use cases defined above because:
- Aggregated data is immediately available in downstream data stores and analysis (do you care to act on data now or hours, days, later?).
- Raw data can be written to the same or a number of data stores for different kinds of processing to occur later. Not every data store is equal. You may need several to accommodate the organizations needs.
- By NOT waiting for a batch job to complete means that data pipeline or analytics errors are IMMEDIATELY detected as they occur — and immediately recovered from — instead of potentially adding days of delay due to the failure of long-running batch jobs.
- Ingestion and aggregation are decoupled from storing and serving historical data so applications are more robust.
Infochimps Cloud and its streaming services is more than just a point product: it’s a suite of data analytics services addressing your streaming, ad-hoc/interactive query, and batch analytics needs all in an integrated solution that you can take advantage of within 30 days. It is also offered as a private cloud service managed by dedicated support and operations engineers who are experts at Big Data. This means you get all the benefits of Big Data technologies without having to bear the tremendous operations burden they incur.
What Comes Next?
We covered how Hadoop isn’t a good solution to the streaming aggregation problem but that doesn’t mean it isn’t useful. On the contrary, long-term historical analysis of raw data collected by a streaming aggregation service is crucial to developing deeper insights than are available in real-time.
That’s why the Infochimps Cloud for Big Data also includes Hadoop. Collect and aggregate data in real-time and then spin up a dynamic Hadoop cluster every weekend to process weekly trends. The combination of real-time responsiveness and insight from long-time-scale analysis creates a powerful approach to harnessing a high throughput stream of information for business value.
Dhruv Bansal is the Chief Science Officer and Co-Founder of Infochimps, He holds a B.A. in Math and Physics from Columbia University in New York and attended graduate school in Physics at the University of Texas at Austin. For more information, email Dhruv at firstname.lastname@example.org or follow him on Twitter at @dhruvbansal.