Products & Features

Democratizing Big Data: We Get It

We’ve been there.

Maybe you’re an enterprise with huge data sets, competing in a saturated market like telecommunications, healthcare or financial services.

Or maybe you’re a startup with lots of data but not the manpower to handle it.

Or maybe you’re a retailer moving from multichannel to omnichannel, but you’re struggling to synthesize data from disparate sources, like legacy point-of-sale systems and Foursquare check-in data.

Or maybe you’re something else entirely, for whom the promise of Big Data seems like a pipe dream because:

  • The infrastructure hurdle is a towering one:
    • How do you acquire, store and manage all that data?
    • How do you integrate tools like Hadoop, Storm, Kafka, NoSQL, and others in ways that produce transformational insights for your business?
    • How do you plan a technology stack with the elasticity to scale for use cases known and unknown?
  • You’re understaffed for the project you’re considering, and you can’t afford to poach experts from Big Data pedigree houses like Google, LinkedIn, Twitter, etc.
  • The clock is ticking. No one is giving you a year to go on a Big Data fishing expedition. You need quick initial results and even quicker iterations.

Ok, deep breaths. We get it. We’ve been there.

Infochimps’ growth from data marketplace to Platform as a Service (PaaS) makes us a highly evolved group of chimps: The kind who can take any kind of data, and do any kind of analytics with it, in any type of cloud. We’ve worked with every kind of database. We can produce batch, streaming and ad hoc analytics. And we can deploy from public, private and hybrid clouds.

And we do it quickly. While typical Big Data projects take over a year to yield results, we can have your first use case in production in 90 days, and complete subsequent projects in weeks.

Our approach to Big Data is built on Infochimps™ Cloud for Big Data: three essential cloud services that unleash the full analytic capabilities needed to solve any enterprise Big Data problem. Infochimps Cloud expedites and simplifies development and deployment of Big Data applications.

So if you’re late to the Big Data game or you’ve been beaten in it before, let’s talk. Infochimps can save your organization hardware and hiring costs, while accelerating results – enabling you to unlock insights that can transform your business.

Announcing Infochimps Cloud 3.2

Moving petabytes or even hundreds of terabytes of data to the public cloud can be costly and time consuming work. Since its conception, the goal of the Infochimps Cloud has been to provide the elasticity, scalability, and resiliency of cloud-based big data infrastructure, but in any environment you choose. That may mean the public cloud such as Amazon Web Services, but that may also mean a virtual private cloud, an outsourced data center such as Switch SuperNAP, or your own internal corporate data center.

With the latest release of the Infochimps Cloud, we’re excited to fully realize that vision: easily moving your data analytics solution to your data, not just moving your data to your analytics solution.

Full Private Cloud Support
Infochimps provides not only analytics cloud services but also virtualization integration. With this newest release, the Infochimps Cloud fully integrates with VMware® vSphere®. This integration empowers customers to deploy the full Infochimps stack internally, leveraging their own data center and their own hardware, and either their own VMware software or an integrated Infochimps + VMware solution.

This virtualization integration framework, powered by Ironfan and Chef, enables the Infochimps Cloud to deploy to any data center where hardware and virtualization are available. For example, Infochimps partner Switch has a Tier 4 facility in Las Vegas with a 100% data center uptime guarantee, where Infochimps can quickly and seamlessly deploy big data solutions that have unrivaled reliability and high availability.

Ultimate in Cloud Mobility
One of the amazing differentiators of utilizing Infochimps Cloud is the concept of cloud mobility. Start in one environment, such as Amazon Web Services, to quickly build your application and provide a development and testing platform for your team. At any time, you can quickly migrate both your cloud services infrastructure and your big data application logic to a different environment, such as SuperNAP or your internal data center, for your final production application.

This is enabled by both the Ironfan homebase and application Deploy Pack frameworks, which provide folder structures to encapsulate your infrastructure and application code, and seamlessly allow them to plug into different hardware and different cloud services nodes respectively.

While this capability makes a lot of sense for applications with sensitive data or security concerns, it is also extremely useful when customers want to get started as quickly as possible. Infochimps can turn over a completely configured Amazon Web Services environment in just a few days, developers and analysts can begin cranking away, and simultaneously a data center environment can be prepped for the eventual second stage of infrastructure deployment.

Improved Developer and Data Scientist Tools
We’ve also made major improvements to the user experience of working with the Infochimps Cloud platform.

Wukong 3.0 is the latest DSL and command line toolkit for rapid big data application development:

  • Updated wukong-hadoop for writing Hadoop Streaming jobs with simple micro-scripts
  • All new wukong-storm for taking your Wukong flows (stitching together data sources, “processors,” and data destinations) and deploying them as Storm topologies
  • All new wukong-deploy for quickly generating Deploy Packs for encapsulating your application logic that can be tested locally, then be deployed to your Infochimps Cloud solution
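The micro-script style these tools are built around can be sketched in plain Ruby. The example below is an illustrative word-count mapper and reducer in the Hadoop Streaming spirit — plain Ruby, not the verbatim Wukong DSL (see the Wukong docs for the real API):

```ruby
# A sketch of the Hadoop Streaming "micro-script" idea behind wukong-hadoop:
# a mapper emits (word, 1) pairs, a reducer folds them into counts.
# Illustrative plain Ruby, not the actual Wukong DSL.

def map_words(line)
  line.downcase.scan(/\w+/).map { |word| [word, 1] }
end

def reduce_counts(pairs)
  pairs.group_by(&:first)
       .map { |word, ps| [word, ps.sum { |_, n| n }] }
       .to_h
end

pairs = ["big data big", "data tools"].flat_map { |line| map_words(line) }
reduce_counts(pairs)  # => {"big"=>2, "data"=>2, "tools"=>1}
```

In a Streaming job, the mapper and reducer each run as separate scripts reading stdin and writing tab-separated pairs to stdout; Wukong's value is wrapping exactly this kind of micro-script so it runs unchanged locally and on a cluster.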

The Infochimps Cloud API has been enhanced for more cross-platform functionality:

  • Unified monitoring metrics are available for understanding what is happening within the platform
  • It’s even simpler to store configuration values and settings, which can be utilized by any of your applications across the various Infochimps cloud services

To learn more about the Infochimps Cloud and the latest enhancements, request a demo today.

Breaking Hadoop out to the Larger Market

There are a lot of people out there with a Terabyte problem but who lack a Petabyte problem — yet they are forced to try to make use of a stack developed to address the Petabyte problems of Facebook, Yahoo and JP Morgan. Hadoop out of the box is oriented toward achieving 100% utilization of fixed-size clusters by 12-, 50-, 100+-person analytics teams. In contrast, the bulk of even forward-thinking enterprises are at the level of just having handed two PhD statisticians a copy of the elephant book, a mis-provisioned cluster, and a slap on the back with a directive to “go find us som’a that insight!”.

There are a few observations we’ve made about these other customers and their differentiated needs that I wanted to share, and point to how we seek to address these with our own product.

Our first major observation is that while Hadoop might headline the bill, streaming data delivery is the opening act that moves the most merchandise.  Most of our customers on initial contact mention Hadoop by name — yet universally the first-delivered and most necessary component has been streaming data delivery into a scalable database and/or Hadoop.

In fact, we’ve had clients who excitedly purchased and set up a Hadoop cluster and had plenty of data they’d like to analyze, but no data in their Hadoop cluster. It may seem obvious once pointed out that you need a way to feed data into your cluster. Enter modern open source tools such as Flume and Storm. Indeed, Flume was originally created to feed hungry Hadoop clusters with streaming log data.

What people are now realizing, though, is just how powerful streaming data delivery tools like these are — that you can realize a surprising amount of analytical power (and visibility into your data) while the data is still in flight. These realizations have driven the accelerated adoption of many of these open source streaming technologies, like Esper, Flume, and Storm. I’ve been using Hadoop since ’08, and demand for Storm is outpacing even Hadoop’s ascent.

Another important feature set we evangelize and see validated is what an underlying cloud infrastructure enables for the enterprise.  Cloud-enabled elasticity makes exploratory analytics transformatively more powerful, as companies can scale their infrastructure up and down as needed.

Contrasted to the Petabyte-companies, who focus on 100% cluster utilization, the target metric for a development cluster fit for the Terabyte-company is high downtime — the ability to go from 10 to 100 machines, back down to 10, then rest at 0 machines over the course of a job. Hadoop out-of-the-box doesn’t meet this target, which was one of the most interesting engineering challenges we’ve solved.

So where else does the cloud fit in the Hadoop use case? Being able to safely grow, shrink, and stop/restart Hadoop isn’t just a slider UX control, it’s a fundamental change in developer mindset and capabilities. For example, when we were a 6-person team with an AWS bill that rivaled our payroll, we would run parse stages of jobs on high CPU instances, then slam it shut mid-workflow and bring the cluster up on high memory instances for the graph-heavy stages. As our platform matured, we moved to giving each developer their own cluster; too often Chimp A needed 30 machines for 2 hours, while Chimp B needed 6 machines all day. Most companies would have to compromise with a 30-machine cluster running all day – we’ve been able to reject that approach.
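The back-of-the-envelope machine-hours math makes that trade-off concrete. Assuming an 8-hour workday (the workday length is our assumption, for illustration):

```ruby
# Machine-hours for the two-developer scenario above, assuming an 8-hour
# workday (an assumption for illustration, not from the post).
chimp_a = 30 * 2                       # 30 machines for 2 hours
chimp_b = 6 * 8                        # 6 machines all day
elastic = chimp_a + chimp_b            # per-developer elastic clusters
shared  = 30 * 8                       # one shared 30-machine cluster all day
savings = 1.0 - elastic.to_f / shared  # fraction of machine-hours saved
# elastic = 108 vs shared = 240: the elastic approach uses ~55% fewer machine-hours
```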

Tuning a Hadoop job to your cluster is fiendishly difficult and time-consuming; tuning your cluster to the job is comparatively straightforward. Data Scientists at the Terabyte-company shouldn’t be pinned down by the difficulties of working with technologies that weren’t designed for them. By enabling Hadoop in an elastic context — public or private cloud, internal or outsourced — Infochimps and others working on these challenges are a big part of breaking it out to the larger market.

Social Media Schema Mapping: Increasing the Power of Data

Infochimps recently developed a unified system for six different social media schemas from Gnip and Moreover. Gnip normalizes data from Facebook, Twitter, and YouTube into Activity Streams. Moreover feeds of forums, blogs, and news reports are normalized as XML in the Atom Syndication Format. In this case study, I’ll illustrate that big data is not just terabytes of information; it also comes in a variety of structures and formats.

In research and case studies chronicling the integration of data and databases, problems with schema matching are consistently encountered. Schema matching is the process of mapping fields that share the same properties to one another. Even though the process can be automated, optimal results require thoughtful human arbitration. For example, take the integration of the following three raw feed snippets, and how we merged them and reconciled their similarities and differences.

Raw Feeds:


<title>The Data Era-Moving from 1.0 to 2.0</title>
<author><name>Infochimps Blog</name><url></url></author><link rel="alternate" href=""/>
<summary>…I describe it as Big Data 1.0 versus Big Data 2.0.</summary>


"body"=>"The Data Era – Moving from 1.0 to 2.0",
"summary"=>"The Data Era – Moving from 1.0 to 2.0",


<title>Infochimps posted a bookmark to Facebook</title>
<category term="BookmarkPosted" label="Bookmark Posted"/>
<link rel="alternate" type="html" href=""/>
<activity:object>
<activity:object-type></activity:object-type>
<title>Welcome Jim Kaskade, Infochimps’ new CEO</title>
<content>Our vision for Infochimps leverages the power of Big Data….</content>
<summary>It’s official! Welcome Jim Kaskade, Infochimps’ new CEO…</summary>
<link rel="alternate" type="html" href=""/>

Looking at the snippets above, a computer would most likely match the title in Moreover to the top-level title in Facebook. This seems like the right thing to do, right? No, it’s wrong. The mapping chart below and the snippets above illustrate the heart of the mapping process: taking raw data and making sense of it.

This is the kind of craziness you might encounter:

  • In Moreover, the title holds the name of the blog entry: “The Data Era-Moving from 1.0 to 2.0”
  • In Facebook,
    • The top-level “title” is the name of the activity: “Infochimps posted a bookmark to Facebook”, “Infochimps posted a note to Facebook”, or “Infochimps posted a photo to Facebook”
    • If someone posted a link, the “title” one level down (in Activity:Object.title) is the name of the link: “Welcome Jim Kaskade, Infochimps’ new CEO”; the case is different for a photo and for a note.
  • Meanwhile in the Twitter-ville stream, the idea of a “title” does not even exist

Mapping Chart

Unified Schema:


"title"=>"The Data Era-Moving from 1.0 to 2.0",
"text"=>"…I describe it as Big Data 1.0 versus Big Data 2.0.",
"provider"=>"Infochimps Blog",
"author"=>{"name"=>"", "url"=>""},


"text"=>"The Data Era – Moving from 1.0 to 2.0",
"provider"=>{"name"=>"Twitter", "url"=>""},
"author"=>{"name"=>"Infochimps", "url"=>""}


"description"=>"Our vision for Infochimps leverages the power of Big Data…",
"title"=>"Welcome Jim Kaskade, Infochimps’ new CEO",
"text"=>"It’s official! Welcome Jim Kaskade, Infochimps’ new CEO…",
"provider"=>{"name"=>"Facebook", "url"=>""},
"author"=>{"name"=>"Infochimps", "url"=>""}

To create the unified schema, I followed the vocabulary and structure for CreativeWork from Schema.org. The six feeds were molded around those properties, harking back to another project I worked on, the Infochimps Simple Schema (ICSS). ICSS was specifically developed to integrate different types of data such as Twitter, Foursquare, weather data, and Wikipedia. After matching data, I omitted redundant fields that would hinder the formation of a streamlined schema.
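The field mapping described above can be sketched in Ruby. The field names follow the unified schema examples in this post; the dispatch logic and record shapes are illustrative, not Infochimps’ production code:

```ruby
# Map three source shapes into the unified schema sketched above.
# Field names follow the examples in this post; the rest is illustrative.
def unify(source, record)
  case source
  when :moreover   # Atom feed: <title> holds the blog entry's name
    { "title"    => record["title"],
      "text"     => record["summary"],
      "provider" => record.dig("author", "name") }
  when :twitter    # no notion of a title; the tweet body becomes "text"
    { "text"     => record["body"],
      "provider" => { "name" => "Twitter" } }
  when :facebook   # activity stream: the nested object title is the real one
    { "title" => record.dig("activity_object", "title"),
      "text"  => record["summary"] }
  end
end

unify(:twitter, "body" => "The Data Era – Moving from 1.0 to 2.0")
# => {"text"=>"The Data Era – Moving from 1.0 to 2.0", "provider"=>{"name"=>"Twitter"}}
```

Notice the case analysis is exactly the human arbitration: a naive field-name matcher would never learn that Facebook’s meaningful title lives one level down while Twitter has no title at all.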

In addition to the semantic unification, there was the syntactic unification. We found JSON to be the best lingua franca for data exchange. Some of the data was XML-based, which usually implies complex processing, yet this was a relatively fast process, thanks not only to our tools but also to the Moreover and Gnip structures. Their tidy schemas allowed us to use a simpler library: in Ruby we use Crack, though anything in the XML::Simple family would work. With gorillib/model, available through the Gorillib library, my life was easier: raw documents became active, intelligent code objects instead of passive bags of data.
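For flavor, the same Atom-ish parse can be done with nothing but Ruby’s standard library — here REXML stands in for Crack, as a minimal sketch over the Moreover snippet shown earlier:

```ruby
require 'rexml/document'
require 'json'

# Parse an Atom-style snippet (as in the Moreover feed above) and re-emit it
# as JSON. Stdlib REXML is used here as a stand-in for Crack; a minimal sketch.
xml = <<~ATOM
  <entry>
    <title>The Data Era-Moving from 1.0 to 2.0</title>
    <summary>…I describe it as Big Data 1.0 versus Big Data 2.0.</summary>
  </entry>
ATOM

doc = REXML::Document.new(xml)
record = {
  "title" => doc.elements["entry/title"].text,
  "text"  => doc.elements["entry/summary"].text,
}
puts JSON.generate(record)
```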

This case study illustrates how easily data value can get lost when working with diverse data sources. Most importantly, it highlights the benefits of successfully solving the inherent challenges and the variety of tools and expertise necessary to do so. Merging six different schemas into one semantically-consistent structure dramatically increases the power of data. When data is unified, effective data integration and processing is possible. A recent blog post by our CEO Jim Kaskade, further highlights the advantages of unifying and integrating data: Big Data Means Leveraging All Customer Channels.

Information. Insight. Instantly: Check Out The Latest Version Of Our Big Data Platform!

A 1978 ad slogan for Scrubbing Bubbles promised, “We work hard so you don’t have to” – essentially promising customers that the product would take care of the “dirty work” and let them reap the benefit of the clean, finished result.

The same holds true today here at Infochimps. Our mission is to do the heavy lifting and seamlessly handle your big data implementations – removing the need for expensive integration work or specialists – allowing you to focus on generating insights from data, not on managing Big Data infrastructure. We provide you the insights you need to make data-driven decisions, speed your application development, and, ultimately, improve your operational efficiency and time to market.

We are proud to announce today the latest version of our Big Data Platform, a managed, fully optimized and hosted service for deploying Big Data environments and apps in the cloud, on-demand.

Key new features include:

New Data Delivery Service

  • Based on the open source Apache Flume project
  • Integrates with your existing data environment and data sources with a combination of out-of-the-box and custom connectors
  • High scalability and optimization of distributed ETL (Extract, Transform, Load) processes, able to handle many terabytes of data per day
  • Both distributed real-time analysis and distributed Hadoop batch analysis

Real-time, Data Streaming Framework

  • Use familiar programming languages such as Ruby to vastly simplify real-time analytics and stream processing, all within the Infochimps Data Delivery Service
  • Extends Infochimps’ Wukong open source project, which lets developers use Ruby micro-scripts to perform Hadoop batch processing
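The kind of in-flight analytic this enables can be sketched as a processor invoked once per event, carrying state between calls — illustrative Ruby in the micro-script spirit, not the platform’s actual API:

```ruby
# A streaming "processor" in the micro-script spirit: called once per event,
# it keeps a running aggregate that stays current while data is in flight.
# Illustrative Ruby, not the platform's actual API.
class HashtagCounter
  def initialize
    @counts = Hash.new(0)
  end

  def process(tweet)
    tweet.scan(/#\w+/) { |tag| @counts[tag] += 1 }
    @counts.dup  # emit the running totals downstream
  end
end

counter = HashtagCounter.new
["#bigdata rocks", "streaming #bigdata #hadoop"].map { |t| counter.process(t) }.last
# => {"#bigdata"=>2, "#hadoop"=>1}
```

The same processor shape works for batch and streaming alike, which is the point of unifying the two under one toolkit.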

We’ve bundled together everything you need to install the platform, making it faster than ever to get a big data project off the ground — a configured solution can be deployed in just a few hours.

The Infochimps Platform is capable of executing on hundreds of data sources and many terabytes of data throughput, delivering scalability to any type and quantity of database, file system or analytic system.

Check out our other new features in today’s press release.

Also, read GigaOm’s take on our news here.

Watch the Webcast: Real Time Analytics: The Future of Big Data in the Agency

A couple of weeks ago, our Chief Science Officer, Dhruv Bansal presented a webcast on how Big Data is changing the game for agencies looking to take their social media technologies and customer insight practices to the next level.  In this video, you’ll learn how agencies can build their own Big Data platform – enabling them to go from data sources to selling insights – in a fraction of the time expected, and at a fraction of the cost.

Watch the video and you can learn more about how the Infochimps Platform enables some of the world’s top agencies to broaden and scale their proprietary product offerings through:

  • Sentiment and Influencer Analysis
  • Client Customer Insights
  • Real-Time Social Media Analytics
  • Infographic and Report Generation
  • Meme and/or Topic Tracking
  • Cross-Channel Reporting
  • Campaign Personalization/Behavioral Insights

Why the American Community Survey is Important

The American Community Survey is an ongoing statistical survey that samples a small percentage of the population every year. It’s one of our most popular APIs in the Data Marketplace, and it provides the key data for the Digital Elements IP Intelligence Demographics API.

Learn more about the importance and usefulness of this annual supplement to the US Census.

(via Flowing Data)

Take a Tour of Our Big Data Platform

Sometimes, when we are trying to explain what Infochimps does, it can be tough to help folks understand the total package. To help with this, we put together a tour of the Infochimps Platform. Now, you can discover how we can work with your team to take data from the sources you need, make it useful, and deliver the insights you need to improve your business. Check it out!

Exploring Big Data as an Agency Product

↳ The Future of Big Data in the Agency

Join us for our free webinar on Real-Time Analytics for Agencies

Big Data is changing the game for companies of all shapes and sizes, including agencies looking to take their social media technologies and customer insight practices to the next level. But managing the massive velocity, volume, and variety of social media and other data sets at scale can be a huge challenge. Infochimps built the largest open marketplace of data sets in the world. Now we’ve opened up our platform and work with some of the world’s top digital, advertising, and PR agencies, which use the Infochimps platform to broaden and scale their proprietary data offerings through:

  • Sentiment and Influencer Analysis
  • Client Customer Insights
  • Real-Time Social Media Analytics
  • Infographic and Report Generation
  • Meme tracking

We’re hosting a webcast on Thursday, May 31 @ 11:00 CST, titled Real-Time Analytics: The Future of Big Data in the Agency. Infochimps’ co-founder Dhruv Bansal, one of the world’s leading data scientists, will present a quick demonstration of how agencies can build their own Big Data infrastructure and distribute costs across multiple clients while growing their product offerings with Big Data – in a fraction of the time you’d expect and for a fraction of the cost of Big Data talent, enterprise consultants and/or custom enterprise solutions. We’d love for you to attend and participate.

Learn More

Why Real-Time Analytics? [Free White Paper]

When you think Big Data, the first words that come to mind are often Hadoop and NoSQL, but what do these technologies actually mean for your business? Different Big Data technologies have different use cases where they work best. For real-time Big Data challenges, a very different class of tools must often be implemented.

In this free white paper, we’ll explore:

  • How to create a flexible architecture that allows you to use the best Big Data tools and technologies for the job at hand
  • Where Hadoop analysis and NoSQL databases work and where they can fall short
  • How Hadoop differs from real-time analytics and stream processing approaches
  • Visual representations of how real-time analytics works and real world use cases
  • How to leverage the Infochimps Platform to perform real-time analytics