Monthly Archives March 2013

IE Invites: Big Data & Analytics for Retail

Big Data & Analytics for Retail:  June 20 – 21, 2013, Chicago
Gain Greater Customer Insight with Big Data & Analytics

Big Data & Analytics for Retail Summit brings together analytics executives and data scientists working in retail, eCommerce and consumer goods, offering unique insight into the innovations that are driving success in these industries.

Why you should attend:


Register Today >>

Image Source: IE


Big Data’s Evolution: 5 Things That Might Surprise You

Over the past several years, Big Data has gone from being a somewhat obscure concept to a genuine business buzzword. As is often the case with buzzwords, when you dig a little deeper you find that many people have substantial misconceptions about what Big Data is, where it came from and where it is going.

Here are a few things that might surprise you about the evolution of Big Data:

  1. There are more “failures” out there than you’d think. We’re bombarded with the hype, but the reality is that this is still an early technology. As people are unfamiliar with the tech components of Big Data, they’re often prone to thinking that they can jump in and do everything themselves. However, the task of streaming and analyzing batch, near-real-time and real-time data in a comprehensible form is beyond the capabilities of most in-house IT departments, and will require outside expertise.

  2. It is an evolution, not a revolution. The topic of Big Data has exploded so quickly onto the media landscape that it’s easy to get the impression that it appeared from nowhere and is transforming business in a whirlwind. While the ultimate impact Big Data will have on business should not be underestimated, its ascension has been much more incremental than media coverage might lead you to believe. Its earliest stages began more than a decade ago with the increasing focus on unstructured data, and since then companies have been steadily experimenting and building capabilities and best practices. It’s important to make the distinction between evolution and revolution because viewing Big Data as revolutionary may lead to the temptation to dive in headlong without a real plan. The smart course of action involves identifying a very specific business challenge that you’d like to address with Big Data, and then expanding and iterating your program step-by-step.

  3. Big Business doesn’t yet ‘get’ Big Data. You’d think big enterprises would have captured a 360-degree view of their customers by now. But they haven’t, and evidence of this abounds in the sloppy marketing outreach efforts that everyone experiences on a daily basis. Two essential changes need to happen in order for enterprises to truly get a handle on Big Data:  1) corporations need to break down departmental silos and freely share customer insights organization-wide; and 2) they must start bringing the apps to the data, rather than bringing their data to the apps. Companies have been reluctant to embrace the cloud for sensitive, proprietary data due to security and other reasons. However, we now have the ability to build apps in virtual private clouds that reside in tier-4 data centers, eliminating the need for the expensive, risk-laden migrations that have stood in the way of enterprises’ ability to adopt effective Big Data strategies.

  4. Housing your own data is cost-prohibitive. The old ways of doing things simply won’t work for Big Data: it’s just too big. While storing 10 TB on legacy infrastructure costs in excess of $1M, the data warehouse for any significant company is going to be way past 20 TB. The math isn’t difficult: housing your own data is super expensive. There’s no way that companies like Facebook and LinkedIn, for whom customer data is lifeblood, could have done it without leveraging the cloud. More and more, enterprises are discovering that they can achieve analytic insights from Big Data by deploying in the cloud.

  5. Hadoop alone won’t do it. Although Hadoop gets 80% of the Big Data attention, it’s only 20% of the solution. Predicting customer behavior is kind of like shooting a bullet with another bullet, and is going to require much more than a historical data perspective. Sure, Hadoop gets most of the press these days, but in order for enterprises to gain a truly customer-centric view they’ll need to tie together historical, real-time and near real-time data through a single, user-friendly interface that enables them to analyze and make decisions in-flight.

Dave Spenhoff is the VP of Marketing at Infochimps. He has helped emerging companies, including Inxight and MarkLogic, define and launch successful go-to-market strategies and has provided strategic marketing guidance to a number of technology companies.


There’s an app for that: Visualizing the Internet

“There’s an app for that.”

We’ve heard it many times, the spoken certainty that the necessities of the world are satisfied by an app.

We love cool apps as much as anyone, so FlowingData caught our attention again with this blog post: “App shows what the Internet looks like.”


“In a collaboration between PEER 1 Hosting, Steamclock Software, and Jeff Johnston, the Map of the Internet app provides a picture of what the physical Internet looks like. Users can view Internet service providers (ISPs), Internet exchange points, universities and other organizations through two view options — Globe and Network. The app also allows users to generate a trace route between where they are located to a destination node, search for where popular companies and domains are, as well as identify their current location on the map.”

Now that’s a cool app.

Read more details here >> and download the app for free on iTunes.

Thank you FlowingData for providing interesting posts for us data nerds.


SXSW Events: Free Food, Free Drinks, Free Swag

Here at Infochimps, we’re excited about the chaotic excitement that is SXSW. If you’re going to be in the great city of Austin, we’d love to meet up with you to talk Big Data. Join us at the following events:

When: Today, Thursday, March 7 @ 4-10p CT
Where: The Omni Building – 701 Brazos Street, 16th Floor, Austin, TX 78701
Details: Visit with some of the best in Austin Tech, enjoy free food and beverages, and kick back before the melee that is SXSW Interactive! Come talk Big Data with us and grab some chimpy swag at the event headquarters – Omni Building, see you there! This event is hosted by Capital Factory, free, open to registrants, and SXSW badges are not required.
Register Now >>

When: Tomorrow, Friday, March 8 @ 6-8p CT
Where: Opal Divines – 700 W. 6th Street, Austin, TX 78701
Details: Talk Big Data with Infochimps at this Big Data Love Happy Hour. This event is hosted by Infochimps, free with cash bar, open to the public, and registrations and badges are not required.
For More Info >>

When: Saturday, March 9 @ 9:30a-12:00p CT
Where: Empire Garage – 604 E. 7th Street, Austin, TX 78701
Details: Join Infochimps at ff MASSIVE, an event full of panel discussions between ff portfolio CEOs and Executives, a Pavilion for showcasing companies, and a party including a headlining DJ and an open bar provided by Shellback Caribbean Rum and New Amsterdam Vodka. Come see Infochimps as we present “What is Big Data?” at 11:00a, participate in the “Big Data in 2013” panel at 11:30a, and “Bringing Enterprise up to Speed” at 4:30p. This event is hosted by ff Ventures, free, open to the public, and registrations and badges are not required.
For More Info >>

When: Sunday, March 10 @ 8:00p-12:00a CT
Where: Copa Bar & Grill – 217 Congress Ave, Austin, TX 78701
Details: Join Infochimps at Copa Lounge to celebrate the midway mark of SXSW Interactive and the countdown to CDW’s 3rd annual Chicago conference. Come enjoy free food and beverages while stopping by our table to talk Big Data. This event is hosted by CDW, free, and open to SXSW badgeholders who are 21+.
For More Info >>

When: Tuesday, March 12 @ 9:30a-1:30p CT
Where: AT&T Conference Center – 1900 University Drive, Classroom 105, Austin, Texas 78705
Details: In this workshop, everyone will set up their own Ironfan deployment environment, stand up some simple servers to demonstrate the basics of Ironfan and Silverware (our core deployment libraries), and then dive into the deep end with a Hadoop deployment (and more, if time allows). This workshop is hosted by Infochimps, free, and open to SXSW badgeholders who RSVP.
Register Now >>

When: Saturday, March 9; Sunday, March 10; Tuesday, March 12 @ 9:30-10:30a CT
Where: Four Seasons, 98 San Jacinto Blvd, Austin, Texas 78701
Details: A new addition to the 2013 lineup, SXSW Home Rooms are where you get the most current information about all the presentations, panels, and networking opportunities for the day ahead. For these particular days at the Four Seasons, meet Hollyann Wood, Infochimps Office Manager. These Home Rooms are hosted by SXSWi, free, open to SXSW badgeholders, and registrations are not required.
For More Info >>

Looking forward to seeing you during SXSW!


Streaming Data, Aggregated Queries, Real-Time Dashboards

Some customers have a large volume of historical data that needs to be processed in our Cloud::Hadoop. Others are trying to power data-driven web or mobile applications with our Cloud::Queries powered by a scalable, NoSQL database such as HBase or Elasticsearch.

But there’s one use case that keeps popping up across nearly all of our customers and industries: streaming aggregation of high-throughput data to power dynamic customer-facing applications and dashboards for internal business users.

Why is Streaming Aggregation Such a Challenge?

Here are a couple of example use cases that demand streaming aggregation:

  • Retail: You have 10s of millions of customers and millions of products. Your daily transaction volumes are enormous (e.g. up to 10M events per second for some of our bigger online retailers), but they’re also at a very fine level of detail. When reporting, you want to see data aggregated by product or by customer so you can do trending, correlation, market basket and similar kinds of analyses.
  • Ad Tech: You generate 100s of millions of ads, pixels, impressions, clicks and conversions each day. It’s uninteresting to track each event separately; you care about how a particular advertiser or campaign is doing. You need to provide real-time dashboards that show performance and the value of your service to your advertisers, over a dataset that can be queried ad-hoc or interactively.

Sound familiar? Do you:

  • Have more than 1M new records per day delivered continuously (~10K new records / sec)? This is when things begin to get interesting.
  • Aggregate the input data on a subset of its dimensions? Say 100 dimensions?
  • Store the aggregated inputs for several months or years so that you can analyze trends over time?
  • Demand the ability to create dashboards or use business intelligence tools to slice and dice this data in a variety of ways?
  • Want the original input records available when you need them, just in case your questions change later?

If you answered yes to some or all of these questions, then you need to investigate the stream aggregation services offered by Infochimps Cloud for Big Data.

But before we get into the benefits of using Infochimps’ cloud services to solve this problem, let me first describe some other approaches and why they ultimately can fail (see our recent survey here>>).

The Traditional Approach


The traditional approach to solving a streaming aggregation problem leverages only the traditional 3-tier web application stack of web client (browser), web/application server, and SQL database.

Many organizations start out with this technology when their applications are still new. Their initial success leads to growth, which leads to more input data, which leads to users and partners demanding more transparency and insight into their product, so BI dashboards become an important aspect of managing the business effectively.

The traditional web stack provides enough data processing power during the early days, but as data volumes grow, the process of dumping raw records into your SQL database and aggregating them once nightly no longer scales.

The previous day’s 24 hours of data starts taking 3-4 hours to process, then 7-8, then 12-13. Ever experience this? In a recent survey, roughly 300 IT professionals told us about this problem: a nightly aggregation step that often leads to days of frustrating downtime or critical delays in the business. In the worst case, you simply can never fix the scaling issue. This is referred to as the “horizon of futility”: the moment when the amount of time taken to aggregate a given amount of data is equal to the amount of time taken to generate that data.
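The “horizon of futility” can be estimated with simple arithmetic. The following is a minimal sketch, not a model of any real system: the function names, the fixed hourly batch capacity and the constant daily growth rate are all illustrative assumptions.

```python
def hours_to_aggregate(records_per_day, capacity_per_hour):
    """Hours a nightly batch job needs to aggregate one day of records."""
    return records_per_day / capacity_per_hour


def days_until_horizon(records_per_day, daily_growth, capacity_per_hour):
    """Days until one day of data takes at least 24 hours to aggregate.

    Assumes (hypothetically) that volume grows by a constant fraction
    each day while the batch capacity stays fixed.
    """
    if daily_growth <= 0:
        raise ValueError("daily_growth must be positive")
    days = 0
    volume = records_per_day
    while hours_to_aggregate(volume, capacity_per_hour) < 24:
        volume *= 1 + daily_growth  # tomorrow's volume
        days += 1
    return days
```

For example, a job that takes 4 hours today and whose input doubles daily (`days_until_horizon(1_000_000, 1.0, 250_000)`) reaches the horizon in 3 days: 4, 8, 16, then 32 hours.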

Are you challenged by this scenario? Do you:

  • Rely on an SQL database like Oracle, SQL Server, MySQL, or PostgreSQL to store all your data?
  • Use this same database to calculate your aggregates in a (nightly?) batch process?
  • Grudgingly tolerate the slow down in the performance of the application during periods in which your database is executing your batch jobs?
  • Have data losses or long delays between input data and output graphs?
  • Feel overburdened by the operations workload of pushing the 3-tier technology stack?

If so, maybe you’ve already taken some evolutionary steps using new webscale technologies such as Hadoop…

Half-Hearted Solution


Or should we say that your heart is in the right place, but the solution still falls short of expectations. Organizations confronted with streaming aggregation problems usually correctly identify one of the symptoms of lack of scalability in their infrastructure. Unfortunately, they often choose an approach to scale which is already known to them or easy to hire for: scale up your webservers and your existing SQL database(s), and then add Hadoop!

They make this choice because it is easy and it is incremental. Adding “just a little more RAM” to an SQL database may sound like the right approach, and may often work just fine in the truly early days, but it soon becomes unmanageable as the figure of merit, speedup in batch job per dollar spent on RAM for the database (aka price-performance), becomes lower and lower as data volumes increase. This becomes even more costly as the organization needs to scale up resources just to “keep the lights on” with such an infrastructure.

Scaling of web services is often handled by spawning additional web servers (also referred to as ‘horizontally scaling’), which is a fine solution for the shared-nothing architecture of a web application. This approach, when applied to critical analytic data infrastructure, leads to the “SQL database master-slave replication and sharding” scenario that is supported by so many DBAs in the enterprise today.

What About Hadoop?

Confronted with some of these problems, organizations will often start attending Big Data conferences and learn about Hadoop, a batch processing technology at the very tip of the Big Data spear. This leads either to a search for talent, where organizations quickly realize that Hadoop engineers and sysadmins are incredibly rare resources, or to internal teams getting pulled from existing projects to build the “Hadoop cluster”. These are exciting times for internal staff, and after a period of time the organization has a functioning Hadoop cluster, albeit at a great internal operations cost and after many months of critical business delay. This Hadoop cluster may even work, happily calculating aggregate metrics from data collected in streams the prior day, and orders of magnitude faster than with the Traditional Approach above.

Organizations that arrive at this point in their adoption of Big Data infrastructure then uneasily settle into believing they’ve solved their streaming aggregation problem with a newfangled batch-processing system built on Hadoop. But many folks in the organization will then realize that:

  • They are spending too much time on operations and not enough time on product or business needs as engineering struggles with educating the organization on how to use these new technologies it doesn’t understand.
  • They are still stuck solving a fundamentally real-time problem with a batch-solution.
  • Their sharded approach is only delaying the inevitable.

How Do Facebook, Twitter, LinkedIn, etc. Do It?


It’s not surprising that Hadoop is the first Big Data technology brought in by many organizations. Google and then Yahoo! set the stage. But what they didn’t tell you was that this is “yesterday’s approach”. So how do webscale companies like Facebook and the like do things today? Yes, Hadoop is powerful, it’s been around longer than many other Big Data technologies, and it has great PR behind it. But Hadoop isn’t necessarily the (complete) answer to every Big Data problem.

The streaming aggregation problem is by its nature real-time.  An aggregation framework that works in real-time is the ideal solution.

Infochimps Cloud::Streams provides this real-time aggregation because it is built on top of leading stream processing frameworks used by leaders TODAY.  Records can be ingested, processed, cleaned, joined, and (most importantly for all use cases) aggregated into time- and field-based bins in real-time: the bin for “this hour” or “this day” contains data from this second.

This approach is extremely powerful for solving the use cases defined above because:

  • Aggregated data is immediately available in downstream data stores and analysis (do you care to act on data now or hours, days, later?).
  • Raw data can be written to the same data store or to a number of others for different kinds of processing to occur later. Not every data store is equal. You may need several to accommodate the organization’s needs.
  • Not waiting for a batch job to complete means that data pipeline or analytics errors are IMMEDIATELY detected as they occur, and immediately recovered from, instead of potentially adding days of delay due to the failure of long-running batch jobs.
  • Ingestion and aggregation are decoupled from storing and serving historical data so applications are more robust.
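To make the idea of time- and field-based bins concrete, here is a minimal in-memory sketch. It is not the Cloud::Streams API: the class name `StreamAggregator`, the event fields (`timestamp`, `value`) and the in-memory dictionaries standing in for downstream data stores are all assumptions made for illustration.

```python
from collections import defaultdict


class StreamAggregator:
    """Toy streaming aggregator: updates bins as each event arrives."""

    def __init__(self, dimensions, bin_seconds=3600):
        self.dimensions = tuple(dimensions)  # fields to group by
        self.bin_seconds = bin_seconds       # bin width (default: hourly)
        # Stand-in for an aggregate store, keyed by (bin_start, *dims).
        self.bins = defaultdict(lambda: {"count": 0, "total": 0.0})
        # Stand-in for a separate raw-event store kept for later analysis.
        self.raw = []

    def ingest(self, event):
        """Fold one event into its bin the moment it arrives."""
        ts = event["timestamp"]  # assumed to be epoch seconds
        bin_start = int(ts // self.bin_seconds) * self.bin_seconds
        key = (bin_start,) + tuple(event[d] for d in self.dimensions)
        bucket = self.bins[key]
        bucket["count"] += 1
        bucket["total"] += event.get("value", 0.0)
        self.raw.append(event)  # keep the original record as well

    def query(self, bin_start, *dims):
        """Read a bin; it already reflects every event ingested so far."""
        empty = {"count": 0, "total": 0.0}
        return dict(self.bins.get((bin_start,) + dims, empty))
```

Because each event updates its bin on arrival, a dashboard reading `query(...)` sees current totals without waiting for a nightly job, and the raw list remains available if the questions change later.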

Infochimps Cloud and its streaming services are more than just a point product: they are a suite of data analytics services addressing your streaming, ad-hoc/interactive query, and batch analytics needs, all in an integrated solution that you can take advantage of within 30 days. It is also offered as a private cloud service managed by dedicated support and operations engineers who are experts at Big Data.  This means you get all the benefits of Big Data technologies without having to bear the tremendous operations burden they incur.

What Comes Next?

We covered how Hadoop isn’t a good solution to the streaming aggregation problem but that doesn’t mean it isn’t useful.  On the contrary, long-term historical analysis of raw data collected by a streaming aggregation service is crucial to developing deeper insights than are available in real-time.

That’s why the Infochimps Cloud for Big Data also includes Hadoop.  Collect and aggregate data in real-time and then spin up a dynamic Hadoop cluster every weekend to process weekly trends.  The combination of real-time responsiveness and insight from long-time-scale analysis creates a powerful approach to harnessing a high throughput stream of information for business value.
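The batch half of this combination can be pictured as a periodic job over the stored raw records. The sketch below is plain Python rather than an actual Hadoop job, and the function name `weekly_totals` and the event fields (`timestamp`, `value`) are hypothetical; it only illustrates rolling raw events up into weekly totals per dimension.

```python
from collections import Counter
from datetime import datetime, timezone


def weekly_totals(raw_events, dimension):
    """Sum event values per (ISO year, ISO week, dimension value)."""
    totals = Counter()
    for event in raw_events:
        # Timestamps are assumed to be epoch seconds in UTC.
        when = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        iso_year, iso_week, _weekday = when.isocalendar()
        totals[(iso_year, iso_week, event[dimension])] += event.get("value", 0.0)
    return dict(totals)
```

Run weekly over the raw store, a job like this yields the long-time-scale trends, while the real-time bins continue to serve current dashboards.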

Dhruv Bansal is the Chief Science Officer and Co-Founder of Infochimps. He holds a B.A. in Math and Physics from Columbia University in New York and attended graduate school in Physics at the University of Texas at Austin. For more information, email Dhruv at or follow him on Twitter at @dhruvbansal.
