Big Data News

Big Data’s Evolution: 5 Things That Might Surprise You

Over the past several years, Big Data has gone from being a somewhat obscure concept to a genuine business buzzword. As is often the case with buzzwords, when you dig a little deeper you find that many people have substantial misconceptions about what Big Data is, where it came from and where it is going.

Here are a few things that might surprise you about the evolution of Big Data:

  1. There are more “failures” out there than you’d think. We’re bombarded with the hype, but the reality is that this is still an early technology. As people are unfamiliar with the tech components of Big Data, they’re often prone to thinking that they can jump in and do everything themselves. However, the task of streaming and analyzing batch, near-real-time and real-time data in a comprehensible form is beyond the capabilities of most in-house IT departments, and will require outside expertise.

  2. It is an evolution, not a revolution. The topic of Big Data has exploded so quickly onto the media landscape that it’s easy to get the impression that it appeared from nowhere and is transforming business in a whirlwind. While the ultimate impact Big Data will have on business should not be underestimated, its ascension has been much more incremental than media coverage might lead you to believe. Its earliest stages began more than a decade ago with the increasing focus on unstructured data, and since then companies have been steadily experimenting with and building capabilities and best practices. It’s important to make the distinction between evolution and revolution because viewing Big Data as revolutionary may lead to the temptation to dive in headlong without a real plan. The smart course of action involves identifying a very specific business challenge that you’d like to address with Big Data, and then expanding and iterating your program step-by-step.

  3. Big Business doesn’t yet ‘get’ Big Data. You’d think big enterprises would have captured a 360-degree view of their customers by now. But they haven’t, and evidence of this abounds in the sloppy marketing outreach efforts that everyone experiences on a daily basis. Two essential changes need to happen in order for enterprises to truly get a handle on Big Data:  1) corporations need to break down departmental silos and freely share customer insights organization-wide; and 2) they must start bringing the apps to the data, rather than bringing their data to the apps. Companies have been reluctant to embrace the cloud for sensitive, proprietary data due to security and other reasons. However, we now have the ability to build apps in virtual private clouds that reside in tier-4 data centers, eliminating the need for the expensive, risk-laden migrations that have stood in the way of enterprises’ ability to adopt effective Big Data strategies.

  4. Housing your own data is cost-prohibitive. The old ways of doing things simply won’t work for Big Data: it’s just too big. While 10TB of legacy infrastructure costs in excess of $1M to store, the data warehouse for any significant company is going to be way past 20 TB. The math isn’t difficult: housing your own data is super expensive. There’s no way that companies like Facebook and LinkedIn, for whom customer data is lifeblood, could have done it without leveraging the cloud. More and more, enterprises are discovering that they can achieve analytic insights from Big Data by deploying in the cloud.

  5. Hadoop alone won’t do it. Although Hadoop gets 80% of the Big Data attention, it’s only 20% of the solution. Predicting customer behavior is kind of like shooting a bullet with another bullet, and is going to require much more than a historical data perspective. Sure, Hadoop gets most of the press these days, but in order for enterprises to gain a truly customer-centric view they’ll need to tie together historical, real-time and near real-time data through a single, user-friendly interface that enables them to analyze and make decisions in-flight.

Dave Spenhoff is the VP of Marketing at Infochimps. He has helped emerging companies, including Inxight and MarkLogic, define and launch successful go-to-market strategies and has provided strategic marketing guidance to a number of technology companies.


Streaming Data, Aggregated Queries, Real-Time Dashboards

Some customers have a large volume of historical data that needs to be processed in our Cloud::Hadoop. Others are trying to power data-driven web or mobile applications with our Cloud::Queries powered by a scalable, NoSQL database such as HBase or Elasticsearch.

But there’s one use case that keeps popping up across our customers and industries: streaming aggregation of high-throughput data to power dynamic customer-facing applications and dashboards for internal business users.

Why is Streaming Aggregation Such a Challenge?

Here are a couple of example use cases that demand streaming aggregation:

  • Retail: You have tens of millions of customers and millions of products. Your daily transaction volumes are enormous (e.g. up to 10M events per second for some of our bigger online retailers) but they’re also at a very fine level of detail. When reporting, you want to see data aggregated by product or by customer so you can do trending, correlation, market basket, and similar kinds of analyses.
  • Ad Tech: You generate 100s of millions of ads, pixels, impressions, clicks, conversions, each day. It’s uninteresting to track each event separately; you care about how a particular advertiser or campaign is doing. You need to provide real-time dashboards, which show performance and the value of your service to your advertisers over a dataset which can be queried ad-hoc or interactively.

Sound familiar? Do you:

  • Have 1M+ new records per day delivered continuously (~10K new records / sec)? This is when things begin to get interesting.
  • Aggregate the input data on a subset of its dimensions? Say 100 dimensions?
  • Store the aggregated inputs for several months or years? So that you can analyze trends over time?
  • And, demand the ability to create dashboards or use business intelligence tools to slice and dice this data in a variety of ways?
  • Would you like to have the original input records available when you need them? Just in case your questions change later?

If you answered yes to some or all of these questions, then you need to investigate the stream aggregation services offered by Infochimps Cloud for Big Data.

But before we get into the benefits of using Infochimps’ cloud services to solve this problem, let me first describe some other approaches and why they ultimately can fail (see our recent survey here>>).

The Traditional Approach


The traditional approach to solving a streaming aggregation problem leverages only the traditional 3-tier web application stack of web client (browser), web/application server, and SQL database.

Many organizations start out with this technology when their applications are still new. Their initial success leads to growth, which leads to more input data, which leads to users and partners demanding more transparency and insight into their product: so BI dashboards become an important aspect of managing your business effectively.

The traditional web stack provides enough data processing power during the early days, but as data volumes grow, the process of dumping raw records into your SQL database and aggregating them once nightly no longer scales.

Processing the previous day’s 24 hours of data starts taking 3-4 hours, then 7-8, then 12-13. Ever experience this? A problem that came up in our recent survey of over 300 IT professionals is this nightly aggregation step, which often leads to days of frustrating downtime or critical delays in the business. In the worst case, you can never fix the scaling issue and you reach what is referred to as the “horizon of futility”: the moment when the time taken to aggregate a given amount of data equals the time taken to generate that data.
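To make the “horizon of futility” concrete, here is a minimal sketch of the arithmetic. It assumes the batch job’s throughput stays fixed while input volume compounds daily; the function names and every number in it are hypothetical, chosen only to match the 3-4 hour starting point mentioned above.

```python
def batch_hours(daily_records: float, records_per_hour: float) -> float:
    """Hours needed to aggregate one day's worth of records."""
    return daily_records / records_per_hour


def days_until_horizon(daily_records: float,
                       records_per_hour: float,
                       daily_growth: float) -> int:
    """Days until aggregating one day of data takes 24 hours or longer.

    All inputs are hypothetical. Input volume compounds once per day while
    the batch job's throughput stays fixed.
    """
    if daily_growth <= 0:
        raise ValueError("daily_growth must be positive")
    days = 0
    while batch_hours(daily_records, records_per_hour) < 24.0:
        daily_records *= 1.0 + daily_growth
        days += 1
    return days


if __name__ == "__main__":
    # Hypothetical starting point: the nightly job takes 3.5 hours today
    # and input volume grows 1% per day.
    print(days_until_horizon(35_000_000, 10_000_000, 0.01))
```

Under these assumptions the nightly job looks comfortable at first, yet steady compounding growth carries it to the 24-hour mark well within a year, after which each day’s batch can no longer finish before the next day’s data has arrived.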

Are you challenged by this scenario? Do you:

  • Rely on an SQL database like Oracle, SQL Server, MySQL, or PostgreSQL to store all your data?
  • Use this same database to calculate your aggregates in a (nightly?) batch process?
  • Grudgingly tolerate the slow down in the performance of the application during periods in which your database is executing your batch jobs?
  • Have data losses or long delays between input data and output graphs?
  • Feel overburdened by the operations workload of pushing the 3-tier technology stack?

If so, maybe you’ve already taken some evolutionary steps using new webscale technologies such as Hadoop…

Half-Hearted Solution


Or should we say that your heart is in the right place, but the solution still falls short of expectations. Organizations confronted with streaming aggregation problems usually correctly identify one of the symptoms of lack of scalability in their infrastructure. Unfortunately, they often choose an approach to scale which is already known to them or easy to hire for: scale up your webservers and your existing SQL database(s), and then add Hadoop!

They make this choice because it is easy and it is incremental. Adding “just a little more RAM” to an SQL database may sound like the right approach, and may often work just fine in the truly early days, but it soon becomes unmanageable as the figure of merit (speedup in batch job per dollar spent on RAM for the database, aka price-performance) becomes lower and lower as data volumes increase. This becomes even more costly as the organization needs to scale up resources to just “keep the lights on” with such an infrastructure.

Scaling of web services is often handled by spawning additional web servers (also referred to as ‘horizontally scaling’), which is a fine solution for the shared-nothing architecture of a web application. This approach, when applied to critical analytic data infrastructure, leads to the “SQL database master-slave replication and sharding” scenario that is supported by so many DBAs in the enterprise today.

What About Hadoop?

Confronted with some of these problems, organizations will often start attending Big Data conferences and learn about Hadoop, a batch processing technology at the very tip of the Big Data spear. This leads either to a search for talent, in which organizations quickly realize that Hadoop engineers and sysadmins are incredibly rare resources, or to internal teams being pulled from existing projects to build the “Hadoop cluster”. These are exciting times for internal staff, and after a while the organization has a functioning Hadoop cluster, albeit at a great internal operations cost and after many months of critical business delay. This Hadoop cluster may even work, happily calculating aggregate metrics from data collected in streams the prior day, orders of magnitude faster than the Traditional Approach above.

Organizations who arrive at this point in their adoption of Big Data infrastructure then uneasily settle into believing they’ve solved their streaming aggregation problem with a newfangled batch-processing system with Hadoop. But many folks in the organization will then realize that:

  • They are spending too much time on operations and not enough time on product or business needs as engineering struggles with educating the organization on how to use these new technologies it doesn’t understand.
  • They are still stuck solving a fundamentally real-time problem with a batch-solution.
  • Their sharded approach is only delaying the inevitable.

How Do Facebook, Twitter, LinkedIn, etc. Do It?


It’s not surprising that Hadoop is the first Big Data technology brought in by many organizations. Google and then Yahoo! set the stage. But what they didn’t tell you is that this is “yesterday’s approach”. So how do webscale companies like Facebook do things today? Yes, Hadoop is powerful and it’s been around longer than many other Big Data technologies, and it has great PR behind it. But Hadoop isn’t necessarily the (complete) answer to every Big Data problem.

The streaming aggregation problem is by its nature real-time.  An aggregation framework that works in real-time is the ideal solution.

Infochimps Cloud::Streams provides this real-time aggregation because it is built on top of leading stream processing frameworks used by leaders TODAY.  Records can be ingested, processed, cleaned, joined, and, most importantly for all use cases, aggregated into time and field based bins in real-time: the bin for “this hour” or “this day” contains data from this second.

This approach is extremely powerful for solving the use cases defined above because:

  • Aggregated data is immediately available in downstream data stores and analysis (do you care to act on data now or hours, days, later?).
  • Raw data can be written to the same or a number of data stores for different kinds of processing to occur later. Not every data store is equal. You may need several to accommodate the organization’s needs.
  • NOT waiting for a batch job to complete means that data pipeline or analytics errors are IMMEDIATELY detected as they occur (and immediately recovered from) instead of potentially adding days of delay due to the failure of long-running batch jobs.
  • Ingestion and aggregation are decoupled from storing and serving historical data so applications are more robust.
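The time and field based binning described above can be sketched in a few lines. This is a minimal single-process illustration and not the Cloud::Streams API: the `StreamAggregator` class and the event fields `ts`, `campaign` and `value` are hypothetical, and the in-memory dictionaries and list stand in for real downstream data stores.

```python
from collections import defaultdict
from datetime import datetime, timezone


class StreamAggregator:
    """Keeps running totals per (time bin, field value) as events arrive."""

    def __init__(self):
        self.hourly = defaultdict(float)  # (hour bin, campaign) -> total
        self.daily = defaultdict(float)   # (day bin, campaign) -> total
        self.raw = []                     # stand-in for a raw-record store

    def ingest(self, event: dict) -> None:
        """Update the hour and day bins the moment an event is received."""
        ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        hour_bin = ts.strftime("%Y-%m-%dT%H:00Z")
        day_bin = ts.strftime("%Y-%m-%d")
        key = event["campaign"]
        self.hourly[(hour_bin, key)] += event["value"]
        self.daily[(day_bin, key)] += event["value"]
        # Keep the original record too, in case the questions change later.
        self.raw.append(event)


if __name__ == "__main__":
    agg = StreamAggregator()
    for ev in [
        {"ts": 0, "campaign": "a", "value": 1.0},
        {"ts": 3599, "campaign": "a", "value": 2.0},
        {"ts": 3600, "campaign": "a", "value": 4.0},
    ]:
        agg.ingest(ev)
    print(dict(agg.hourly))
    print(dict(agg.daily))
```

Because each bin is updated on ingest, the totals for “this hour” and “this day” are readable at any moment without waiting for a nightly job, while the raw list preserves the original records for later batch analysis.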

Infochimps Cloud and its streaming services are more than just a point product: they are a suite of data analytics services addressing your streaming, ad-hoc/interactive query, and batch analytics needs all in an integrated solution that you can take advantage of within 30 days. It is also offered as a private cloud service managed by dedicated support and operations engineers who are experts at Big Data.  This means you get all the benefits of Big Data technologies without having to bear the tremendous operations burden they incur.

What Comes Next?

We covered how Hadoop isn’t a good solution to the streaming aggregation problem but that doesn’t mean it isn’t useful.  On the contrary, long-term historical analysis of raw data collected by a streaming aggregation service is crucial to developing deeper insights than are available in real-time.

That’s why the Infochimps Cloud for Big Data also includes Hadoop.  Collect and aggregate data in real-time and then spin up a dynamic Hadoop cluster every weekend to process weekly trends.  The combination of real-time responsiveness and insight from long-time-scale analysis creates a powerful approach to harnessing a high throughput stream of information for business value.

Dhruv Bansal is the Chief Science Officer and Co-Founder of Infochimps. He holds a B.A. in Math and Physics from Columbia University in New York and attended graduate school in Physics at the University of Texas at Austin. For more information, email Dhruv at or follow him on Twitter at @dhruvbansal.


CIOs & Big Data: What IT Teams Want Their CIOs to Know

It’s no secret that enterprises today face an increasingly competitive and erratic global business environment, and that Big Data is more than just another IT project – it’s truly a finger on the pulse of the business. To say that in 2013 Big Data is “mission critical” is to put it mildly – organizations that ignore the insights that Big Data can deliver are flying blind. So, it is all the more disconcerting that 55% of Big Data projects don’t get completed, and many others fall short of their objectives.

In order to understand the reasons for this, Infochimps partnered with, one of the largest enterprise technology-focused, community-driven sites and a source for answers to IT-related questions and professional growth for more than 570,000 members. Together we got survey responses from over 300 IT department staffers – 58% of whom have current Big Data projects underway – on what they most wanted their CIOs to know about the process of implementing Big Data projects.

Read the full report here. >>

Key findings are summarized in the following infographic:
[Infographic: survey findings]

While the findings reveal many reasons for Big Data project failure, undoubtedly one of the biggest factors is lack of communication between top managers, who provide the overall project vision, and the data scientists and other IT staff charged with actually implementing it. Far too frequently their opinions are taken as an afterthought, and consequently considered only when projects veer off-course.

Given the stakes, it’s imperative that CIOs have a 360-degree view of all that a Big Data project will involve – not just the various Big Data technologies that are so frequently at the forefront of Big Data discussions.

The insight we gleaned reveals much about both enterprise technology and enterprise culture. In order for companies to succeed with Big Data, executives will need to rethink long-held notions of how diverse departments should function together. In the past “breaking down silos” was a nice mantra. Now, it is imperative. Additionally, CIOs and other enterprise executives may find it necessary to educate their organizations on the advantages of new Big Data applications and processes that will give them better customer insights, make their jobs infinitely easier and give their departments the elasticity needed to meet virtually any business need in real-time.

We hope this report will serve not only as a source of insight, but also be a reminder to seek the invaluable perspective of IT staff as early as possible in the process of developing new, technology-intensive projects.

Read the press release here. >>


Intelligent Applications: The Big Data Theme for 2013

[Figure: Intelligent Applications, the Big Data theme for 2013]

My prediction for 2013 is that competitive advantage will translate into enterprises using sophisticated Big Data analytics to create a new breed of applications – Intelligent Applications.

“It’s more than just insights from MapReduce”, a CIO from a Fortune 100 company told me, “It’s about using data to make our customer touch points more engaging, more interactive, more intelligent.”

So when you hear about “Big Data solutions”, you need to translate that into a new category of “Intelligent Applications”. At the end of the day, it’s not about people poring over petabytes of data. It’s actually about how one turns the data into revenue (or profits).

This means that you MUST:

  1. Start with the business problem first (preferably one with revenue upside versus cost savings)
  2. Determine which data elements you can leverage AFTER #1
  3. Define an analytical three-tier architecture (as shown above)

Which Big Data market segments will grow the fastest in 2013?

Morgan Stanley named the top ten as follows:

  1. Healthcare
  2. Entertainment
  3. Com/Media
  4. Manufacturing
  5. Financial
  6. Business Services
  7. Transportation
  8. Web Tech
  9. Distribution
  10. Engineering

Many have predicted which industry is the most attractive (see McKinsey’s Quarterly for another). I personally like Ad-Tech and Financial Services for verticals, followed by Information Management, Health (if you can partner to speed up sales cycles), and Communications.

But what about market segments by technology?

[Chart: The Growth of Cloud-Based Big Data]

I predict that Data Analytics as a Service (also referred to as Big Data as a Service, or BDaaS) will have the highest growth (obviously building from a small base in revenue given its level of maturity). Business Intelligence as a Service is the next high-growth segment, given the need for easier ways to present and visualize data, followed by Logging as a Service.

But don’t take my word for this….my data comes from prominent research organizations. I’m just compiling and presenting their data in a slightly new way.

What challenges will end-user organizations struggle with the most in 2013?

End-users will continue to struggle with making sense out of the many technologies available. Is it EMC Greenplum connected to EMC Hadoop? Is it Cloudera Impala + Hadoop? Is it AsterData + Hortonworks? Is it MapR HBase + HDFS? I think one thing is definite: you have lots of options.

The biggest problem will be whether they are actually satisfying the needs of the business problem. Here are my leading predictions for end-user organizations:

  1. End users just want to solve problems, but will continue to fight IT over who owns the platform powering their much-needed data-driven applications
  2. Ultimately, end-users will be forced to chase “shiny objects” because IT groups will persuade them to wait for the “technology bake-offs” around the Big Data platform soon to be launched (24 months from now)
  3. In the end, many organizations will fail at creating value from Big Data due to a lack of focus on business problems, time-to-market, and in some cases the wrong technology choice

What are some of the key technologies that will dominate the Big Data market in 2013?

So many equate Big Data with Hadoop. But as you begin to see with announcements like Impala from Cloudera, it’s more than just Hadoop. It’s about servicing all the application response time requirements. It’s about volume, velocity, and variety but also time-to-value with your data analytics.

My prediction for 2013 is that you will need the following technology components:

  • Real-time stream processing
  • Ad-hoc near real-time analytics (see NoSQL and NewSQL data stores)
  • Batch Analytics

Not one, but all three!

What steps can customers take to maximize competitive advantage with Big Data in 2013?

Competitive advantage is ALL about time-to-market. I have no doubt that every Global 2000 company will launch their Big Data initiatives in 2013. The question is when they will turn those initiatives into additional revenue…how long will it take from the time that they hire Accenture, CSC, Capgemini, IBM or the like to implement their Big Data strategies, to launching an intelligent application?

My prediction for 2013:

Cloud will become a large part of big data deployment – established by a new cloud ecosystem.

This will be driven by the need for time-to-market and ultimately, competitive advantage. Cloud usually lags any disruption made behind the firewall by at least 12 months. In the case of Big Data, the launch of Apache Hadoop 1.0 in December of 2011 basically makes 2013 the year for Cloud-based Big Data.

That being said, large volumes of data, privacy and public cloud are not usually mentioned in the same paragraph by IT in a Global 2000 enterprise. That’s why we’re going to see elastic big data clouds behind the firewall and within trusted third party data center providers.


Forbes: The Next Big Data Acquisition and Getting Rid of Data Scientists


Today Gil Press, blogger at Forbes, What’s The Big Data?, and The Story of Information, published his thoughts on an interview with our new CEO Jim Kaskade, titled “Infochimps’ New CEO on the Next Big Data Acquisition and Getting Rid of Data Scientists.”

Some quotes:

  • “CIOs are ready to embrace open source big data software and that the established IT players, lacking open source experience, will have to buy their way into the market.”
  • “As an engineer with Teradata in the 1990s, he witnessed first-hand what I call the Small Big-Data Bang and as a result, can draw interesting parallels with today’s Big Big-Data Bang.”
  • “Get rid of the data scientists? ‘The politically correct way to say it,’ says Kaskade, ‘is that I will turn your business users and application developers into data scientists…’”

Read the article.

Interested in reading more about Jim’s vision of The Data Era? Jim’s first blog post with Infochimps, The Data Era – Moving from 1.0 to 2.0, provides an inside look into “why Infochimps is so well positioned to make a significant impact within the marketplace”.

See other media coverage:

Much gratitude to Gil Press and to Forbes.


The Impact of a Nationwide Drought


According to a recent report from the Wall Street Journal, more than half of the United States is dry.  Insufficient rainfall and soaring temperatures have left much of the country ravaged with severe crop damage.  The latest US Drought monitor indicates that 20% of the country is facing extreme or exceptional drought conditions, up 7% from just one week ago.  Perhaps it is time that the country as a whole take a hard look at solutions, such as Tom Mason’s Water Plan.

Big Data for Retail is a Hot Product


Check out this guest post in Forbes from the VP of Product Marketing at SAP.  With recent customer wins, including retail technology company BlackLocus, we are very familiar with the growing trend of retailers looking to Big Data to solve a variety of business challenges, including identifying lost sales, improving transport logistics, and better anticipating customer needs.

Information. Insight. Instantly: Check Out The Latest Version Of Our Big Data Platform!

An old 1978 ad slogan from Scrubbing Bubbles stated that, “We work hard so you don’t have to” – essentially promising customers that they would take care of the “dirty work” and let the customer reap the benefit of the clean, finished product.

The same holds true today here at Infochimps. Our mission is to do the heavy lifting and seamlessly handle your big data implementations –removing the requirement for expensive integration or specialists–allowing you to focus on generating insights from data, not managing Big Data infrastructure. We provide you the insights you need to make data-driven decisions, speed your application development, and, ultimately, improve your operational efficiencies and time to market.

We are proud to announce today the latest version of our Big Data Platform, a managed, fully optimized and hosted service for deploying Big Data environments and apps in the cloud, on-demand.

Key new features include:

New Data Delivery Service

  • Based on the open source Apache Flume project
  • Integrates with your existing data environment and data sources with a combination of out-of-the-box and custom connectors
  • High scalability and optimization of distributed ETL (Extract, Transform, Load) processes, able to handle many terabytes of data per day
  • Both distributed real-time analysis and distributed Hadoop batch analysis

Real-time, Data Streaming Framework

  • You can use familiar programming languages such as Ruby to vastly simplify performing real-time analytics and stream processing, all within the Infochimps Data Delivery Service
  • Extends Infochimps’ Wukong open source project, which lets developers use Ruby micro-scripts to perform Hadoop batch processing

We’ve bundled together everything you need to install the platform, making it faster than ever to get a big data project off the ground — a configured solution can be deployed in just a few hours.

The Infochimps Platform is capable of executing on hundreds of data sources and many terabytes of data throughput, delivering scalability to any type and quantity of database, file system or analytic system.

Check out our other new features in today’s press release.

Also, read GigaOm’s take on our news here.


The Big Data Playbook for Digital Agencies


Our CEO, Joseph Kelly, recently wrote a guest post for Mashable about how and why digital agencies should pursue Big Data opportunities.

For digital agencies, big data as a competitive advantage is still very nascent, somewhat terrifying, and not tangible at all. However, marketers are starting to hear that it’s the new secret sauce, and they’re scrambling to figure out how to use it. And for good reason. Given the current trajectory, there’s a large chance that big data will change the face of digital agencies in as little as five years.

If you’re a marketer, part of a digital agency, or just curious about how Big Data can and will shape the future of understanding customers, check out this article.


Big Data for Big Weather Predictions


The almost infinite applications of Big Data have been well documented in this blog, from identifying neighborhoods to empowering digital agencies to uncovering health trends to revolutionizing the dairy industry.  Last week, venture capitalists made a big bet on a new use for Big Data, an agricultural insurance policy protecting farmers from extreme weather, or more simply put, weather insurance.

Yup, we’ll write it again, just so you believe us – weather insurance.  The Climate Corporation just received $50 million in funding, in order to hire 50 data scientists, software engineers, and quantitative researchers.

The Climate Corp. sells something called Total Weather Insurance, which pays local farmers when they are impacted by weather events that affect their profits. The company uses a massive cloud-driven data analytics service to predict the possibility of extreme weather, along with the potential impact. It prices its insurance policies accordingly based on that information.