Products & Features

Breaking Hadoop out to the Larger Market

There are a lot of people out there with a Terabyte problem who lack a Petabyte problem, yet they are forced to make use of a stack developed to address the Petabyte problems of Facebook, Yahoo, and JP Morgan. Hadoop out of the box is oriented toward achieving 100% utilization of fixed-size clusters by analytics teams of 12, 50, or 100+ people. In contrast, the bulk of even forward-thinking enterprises are at the level of just having handed two PhD statisticians a copy of the elephant book, a mis-provisioned cluster, and a slap on the back with a directive to “go find us som’a that insight!”.

There are a few observations we’ve made about these other customers and their differentiated needs that I wanted to share, and point to how we seek to address these with our own product.

Our first major observation is that while Hadoop might headline the bill, streaming data delivery is the opening act that moves the most merchandise.  Most of our customers on initial contact mention Hadoop by name — yet universally the first-delivered and most necessary component has been streaming data delivery into a scalable database and/or Hadoop.

In fact, we’ve had clients who excitedly purchased and set up a Hadoop cluster and had plenty of data they’d like to analyze, but had no data in their Hadoop cluster. It may seem obvious once pointed out that you need a way to feed data into your cluster. Enter modern open source tools such as Flume and Storm. Indeed, Flume was originally created to feed hungry Hadoop clusters with streaming log data.

What people are now realizing, though, is just how powerful streaming data delivery tools like these are: you can realize a surprising amount of analytical power (and visibility into the data as well) while the data is still in flight. These realizations have driven the accelerated adoption of open source streaming technologies like Esper, Flume, and Storm. I’ve been using Hadoop since ’08, and demand for Storm is outpacing even Hadoop’s ascent.
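
To make "analysis while the data is still in flight" a little more concrete, here is a minimal Ruby sketch. It is not Flume, Storm, or Esper code; the event shape, the `tally_in_flight` name, and the in-memory array standing in for the destination are all assumptions made for illustration. Each event is counted per source as it passes through, and is still delivered afterwards.

```ruby
# Hypothetical sketch: count events per source while they stream toward a sink.
# The event shape ({source:, body:}) and the method name are illustrative assumptions.
def tally_in_flight(events, sink)
  counts = Hash.new(0)
  events.each do |event|
    counts[event[:source]] += 1   # analysis happens here, while the event is in flight
    sink << event                 # delivery to the datastore (or Hadoop) still happens
  end
  counts
end

events = [
  { source: "twitter",  body: "The Data Era" },
  { source: "facebook", body: "Welcome Jim Kaskade" },
  { source: "twitter",  body: "Big Data 2.0" }
]
sink   = []   # stands in for a database or Hadoop
counts = tally_in_flight(events, sink)
puts counts.inspect
```

The counts are available the moment the last event passes, without waiting for a batch job to run over the stored data.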

Another important feature set we evangelize and see validated is what an underlying cloud infrastructure enables for the enterprise.  Cloud-enabled elasticity makes exploratory analytics transformatively more powerful, as companies can scale their infrastructure up and down as needed.

In contrast to the Petabyte-companies, who focus on 100% cluster utilization, the target metric for a development cluster fit for the Terabyte-company is high downtime: the ability to go from 10 to 100 machines, back down to 10, and then rest at 0 machines over the course of a job. Hadoop out of the box doesn’t meet this target, and closing that gap was one of the most interesting engineering challenges we’ve solved.

So where else does the cloud fit in the Hadoop use case? Being able to safely grow, shrink, and stop/restart Hadoop isn’t just a slider UX control; it’s a fundamental change in developer mindset and capabilities. For example, when we were a 6-person team with an AWS bill that rivaled our payroll, we would run the parse stages of jobs on high-CPU instances, then slam the cluster shut mid-workflow and bring it back up on high-memory instances for the graph-heavy stages. As our platform matured, we moved to giving each developer their own cluster; too often Chimp A needed 30 machines for 2 hours, while Chimp B needed 6 machines all day. Most companies would have to compromise with a 30-machine cluster running all day – we’ve been able to reject that approach.
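
A quick back-of-the-envelope calculation shows why the per-developer approach matters. The sketch below compares machine-hours for the two strategies, assuming "all day" means 24 hours (an assumption; change `hours_per_day` to 8 for a working day).

```ruby
# Machine-hours for the two provisioning strategies described above.
# Assumption: "all day" means 24 hours.
hours_per_day = 24

# One shared cluster sized for the largest job, running all day.
shared = 30 * hours_per_day

# One cluster per developer, each sized and timed to its own workload.
chimp_a = 30 * 2               # 30 machines for 2 hours
chimp_b = 6 * hours_per_day    # 6 machines all day
per_developer = chimp_a + chimp_b

savings = 1.0 - per_developer.to_f / shared
puts "shared:        #{shared} machine-hours"
puts "per-developer: #{per_developer} machine-hours"
puts format("savings:       %.0f%%", savings * 100)
```

Under that assumption the shared cluster burns 720 machine-hours against 204 for the two right-sized clusters, roughly a 72% reduction.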

Tuning a Hadoop job to your cluster is fiendishly difficult and time consuming, while tuning your cluster to the job is comparatively straightforward. Data Scientists at the Terabyte-company shouldn’t be pinned down by the difficulties of working with technologies that weren’t designed for them. By enabling Hadoop in an elastic context (public or private cloud, internal or outsourced), Infochimps and others working on these challenges are a big part of breaking it out to the larger market.








Social Media Schema Mapping: Increasing the Power of Data

Infochimps recently developed a unified system for six different social media schemas from Gnip and Moreover. Gnip normalizes data from Facebook, Twitter, and YouTube into Activity Streams. Moreover feeds of forums, blogs, and news reports are normalized as XML in the Atom Syndication Format. In this case study, I’ll illustrate that big data is not only composed of terabytes of information, but can also come in a variety of structures and formats.

In research and case studies chronicling the integration of data and databases, problems with schema matching are consistently encountered. Schema matching is the process of mapping fields that share the same properties to one another. Even though the process can be automated, optimal results require thoughtful human arbitration. For example, take the integration of the following three raw feed snippets, and how we merged them and reconciled their similarities and differences.

Raw Feeds:

moreover

<id>http://c.moreover.com/blog-1000</id>
<title>The Data Era-Moving from 1.0 to 2.0</title>
<author><name>Infochimps Blog</name><url>http://blog.infochimps.com</url></author>
<link rel="alternate" href="http://c.moreover.com/blog-1000"/>
<summary>…I describe it as Big Data 1.0 versus Big Data 2.0.</summary>
<modified>2012-08-28T20:23:00Z</modified>
<issued>2012-08-28T20:23:00Z</issued>

twitter

{"id"=>"tag:search.twitter.com,2005:220000000",
"objectType"=>"activity",
"verb"=>"post",
"postedTime"=>"2012-08-16T22:12:24.000Z",
"provider"=>{"objectType"=>"service","displayName"=>"Twitter",
"link"=>"http://www.twitter.com"},
"link"=>"http://twitter.com/infochimps/statuses/2200000000000000000",
"body"=>"The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
"object"=>{"objectType"=>"note",
"id"=>"object:search.twitter.com,2005:220000000",
"summary"=>"The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
"link"=>"http://twitter.com/infochimps/statuses/220000000"

facebook

<id>50000_30000000</id>
<created>2012-07-27T21:29:13+00:00</created>
<published>2012-07-27T21:29:13+00:00</published>
<updated>2012-07-27T21:29:43+00:00</updated>
<title>Infochimps posted a bookmark to Facebook</title>
<category term="BookmarkPosted" label="Bookmark Posted"/>
<link rel="alternate" type="html" href="http://www.facebook.com/50000/posts/30000000"/>
<service:provider>
<name>Facebook</name>
<uri>www.facebook.com</uri>
<icon/>
</service:provider>
<activity:object>
<activity:object-type>http://activitystrea.ms/schema/1.0/bookmark</activity:object-type>
<id>50000_30000000</id>
<title>Welcome Jim Kaskade, Infochimps’ new CEO</title>
<subtitle>infochim.ps</subtitle>
<content>Our vision for Infochimps leverages the power of Big Data….</content>
<summary>It’s official! Welcome Jim Kaskade, Infochimps’ new CEO…</summary>
<link rel="alternate" type="html" href="http://www.facebook.com/50000/posts/30000000"/>
</activity:object>

Looking at the snippets above, a computer would most likely match the title in Moreover and Facebook to the title in schema.org. This seems like the right thing to do, right? No, it’s wrong. The Mapping chart below and the snippets above illustrate the heart of the mapping process: taking raw data and making sense of it.  

This is the kind of craziness you might encounter:

  • In Moreover, the title holds the name of the blog entry: “The Data Era-Moving from 1.0 to 2.0”
  • In Facebook,
    • The top-level “title” is the name of the activity: “Infochimps posted a bookmark to Facebook”, “Infochimps posted a note to Facebook”, or “Infochimps posted a photo to Facebook”
    • If someone posted a link, the “title” one level down (in Activity:Object.title) is the name of the link, “Welcome Jim Kaskade, Infochimps’ new CEO”; the case is different for a photo and for a note.
  • Meanwhile in the Twitter-ville stream, the idea of a “title” does not even exist
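
The per-source rules in the list above can be sketched in a few lines of Ruby. This is only an illustration, not the code we ran: it assumes each record has already been parsed into a Hash, and the `unified_title` name, the source symbols, and the `"object"` key path are hypothetical stand-ins modeled on the snippets above. It handles the Facebook bookmark case only; photos and notes would need their own branches.

```ruby
# Hypothetical sketch: choose the unified "title" for a record that has already
# been parsed into a Ruby Hash. Source names and key paths are assumptions
# modeled on the raw feed snippets.
def unified_title(source, record)
  case source
  when :moreover
    record["title"].to_s                  # the title of the blog entry itself
  when :facebook
    # The top-level title only names the activity; the content's title is one level down.
    record.dig("object", "title").to_s
  when :twitter
    ""                                    # tweets have no notion of a title
  else
    raise ArgumentError, "unknown source: #{source}"
  end
end

moreover = { "title" => "The Data Era-Moving from 1.0 to 2.0" }
facebook = {
  "title"  => "Infochimps posted a bookmark to Facebook",
  "object" => { "title" => "Welcome Jim Kaskade, Infochimps' new CEO" }
}
twitter  = { "body" => "The Data Era - Moving from 1.0 to 2.0 http://bit.ly/SMGIMm" }

puts unified_title(:moreover, moreover)
puts unified_title(:facebook, facebook)
puts unified_title(:twitter, twitter).inspect
```

A naive matcher that copied every top-level "title" would have stored "Infochimps posted a bookmark to Facebook" as the title of the Facebook record, which is exactly the mistake the human arbitration step catches.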

Mapping Chart

Unified Schema:

moreover

"id"=>"http://c.moreover.com/blog-1000",
"name"=>"",
"description"=>"",
"date_published"=>"2012-08-28T20:23:00Z",
"title"=>"The Data Era-Moving from 1.0 to 2.0",
"link"=>"http://c.moreover.com/blog-1000",
"text"=>"…I describe it as Big Data 1.0 versus Big Data 2.0.",
"provider"=>"Infochimps Blog",
"author"=>{"name"=>"", "url"=>""}

twitter

"id"=>"tag:search.twitter.com,2005:22000000",
"name"=>"twitter_activity",
"description"=>"",
"date_published"=>"2012-08-28T22:12:24.000Z",
"title"=>"",
"link"=>"http://twitter.com/infochimps/statuses/22000000",
"text"=>"The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
"provider"=>{"name"=>"Twitter", "url"=>"http://www.twitter.com"},
"author"=>{"name"=>"Infochimps", "url"=>"http://www.twitter.com/infochimps"}

facebook

"id"=>"50000_30000000",
"name"=>"bookmarkposted",
"description"=>"Our vision for Infochimps leverages the power of Big Data…",
"date_published"=>"2012-07-27T21:29:13+00:00",
"title"=>"Welcome Jim Kaskade, Infochimps’ new CEO",
"link"=>"http://www.facebook.com/50000/posts/30000000",
"text"=>"It’s official! Welcome Jim Kaskade, Infochimps’ new CEO…",
"provider"=>{"name"=>"Facebook", "url"=>"http://www.facebook.com"},
"author"=>{"name"=>"Infochimps", "url"=>"https://www.facebook.com/infochimps"}

To create the unified schema, I followed the vocabulary and structure for CreativeWork from schema.org.  The six feeds were molded around those properties, harking back to another project I worked on, the Infochimps Simple Schema (ICSS). ICSS was specifically developed to integrate different types of data such as Twitter, Foursquare, Weather data, and Wikipedia. After matching data, I omitted redundant data that would hinder the formation of a streamlined schema.
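
As a rough illustration of that matching-and-omitting step, here is a minimal Ruby sketch that maps a Gnip-style Twitter activity onto the unified fields. The `unify_twitter` name is hypothetical, and the `"actor"` keys are an assumption, since the raw snippet earlier in this post is cut off before that part of the activity.

```ruby
# Hypothetical sketch: map a Gnip-style Twitter activity (already a Ruby Hash)
# onto the unified record. The "actor" keys are an assumption; the raw snippet
# in this post is truncated before that part of the activity.
def unify_twitter(activity)
  provider = activity.fetch("provider", {})
  actor    = activity.fetch("actor", {})
  {
    "id"             => activity["id"],
    "name"           => "twitter_activity",
    "description"    => "",
    "date_published" => activity["postedTime"],
    "title"          => "",   # tweets have no title
    "link"           => activity["link"],
    "text"           => activity["body"],
    "provider"       => { "name" => provider["displayName"], "url" => provider["link"] },
    "author"         => { "name" => actor["displayName"],    "url" => actor["link"] }
  }
end

activity = {
  "id"         => "tag:search.twitter.com,2005:220000000",
  "verb"       => "post",
  "postedTime" => "2012-08-16T22:12:24.000Z",
  "provider"   => { "displayName" => "Twitter", "link" => "http://www.twitter.com" },
  "link"       => "http://twitter.com/infochimps/statuses/220000000",
  "body"       => "The Data Era - Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
  "actor"      => { "displayName" => "Infochimps", "link" => "http://www.twitter.com/infochimps" }
}

unified = unify_twitter(activity)
puts unified["date_published"]
```

Fields such as `verb` never make it into the unified record, which is the "omit redundant data" step in miniature.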


In addition to the semantic unification, there was the syntactic unification. We found JSON to be the best lingua franca for data exchange. Some of the data was XML-based, which usually implies complex processing. In our case it was a relatively fast process, not only because of our tools, but also because of the Moreover and Gnip structures. Thanks to their tidy schemas, we were able to use a simpler library – in Ruby, we use Crack; anything in the XML::Simple family would work. With gorillib/model, available through the Gorillib library, my life was easier: it turns raw documents into active, intelligent code objects instead of passive bags of data.
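
The XML-to-JSON step can be sketched as follows. This is a stand-in rather than our actual code: it uses REXML (which ships with Ruby) in place of Crack so the example needs no extra gems, the wrapping `<entry>` element is an assumption added to make the snippet a well-formed document, and `entry_to_hash` is a hypothetical helper that only handles flat entries.

```ruby
require "rexml/document"
require "json"

# Hypothetical sketch: turn a flat Atom-style entry into a Hash, then into JSON.
# REXML stands in for Crack here; nested elements are not handled.
def entry_to_hash(xml)
  doc = REXML::Document.new(xml)
  doc.root.elements.each_with_object({}) do |element, hash|
    hash[element.name] = element.text.to_s.strip
  end
end

# The <entry> wrapper is an assumption; the post only shows the child elements.
xml = <<~XML
  <entry>
    <id>http://c.moreover.com/blog-1000</id>
    <title>The Data Era-Moving from 1.0 to 2.0</title>
    <modified>2012-08-28T20:23:00Z</modified>
  </entry>
XML

entry = entry_to_hash(xml)
puts JSON.generate(entry)
```

Because the feeds are tidy, a shallow element-to-key conversion like this gets most of the way there; messier XML would need a full parser and explicit handling of attributes and nesting.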

This case study illustrates how easily data value can get lost when working with diverse data sources. Most importantly, it highlights the benefits of successfully solving the inherent challenges, and the variety of tools and expertise necessary to do so. Merging six different schemas into one semantically consistent structure dramatically increases the power of data. When data is unified, effective data integration and processing are possible. A recent blog post by our CEO Jim Kaskade further highlights the advantages of unifying and integrating data: Big Data Means Leveraging All Customer Channels.


 

Information. Insight. Instantly: Check Out The Latest Version Of Our Big Data Platform!

A 1978 ad slogan from Scrubbing Bubbles promised, “We work hard so you don’t have to,” essentially telling customers that the product would take care of the “dirty work” and let them reap the benefit of the clean, finished result.

The same holds true today here at Infochimps. Our mission is to do the heavy lifting and seamlessly handle your big data implementations, removing the requirement for expensive integration or specialists and allowing you to focus on generating insights from data, not managing Big Data infrastructure. We provide you the insights you need to make data-driven decisions, speed your application development, and, ultimately, improve your operational efficiencies and time to market.

We are proud to announce today the latest version of our Big Data Platform, a managed, fully optimized and hosted service for deploying Big Data environments and apps in the cloud, on-demand.

Key new features include:

New Data Delivery Service

  • Based on the open source Apache Flume project
  • Integrates with your existing data environment and data sources with a combination of out-of-the-box and custom connectors
  • High scalability and optimization of distributed ETL (Extract, Transform, Load) processes, able to handle many terabytes of data per day
  • Both distributed real-time analysis and distributed Hadoop batch analysis

Real-time, Data Streaming Framework

  • You can use familiar programming languages such as Ruby to vastly simplify performing real-time analytics and stream processing, all within the Infochimps Data Delivery Service
  • Extends Infochimps’ Wukong open source project, which lets developers use Ruby micro-scripts to perform Hadoop batch processing

We’ve bundled together everything you need to install the platform, making it faster than ever to get a big data project off the ground — a configured solution can be deployed in just a few hours.

The Infochimps Platform is capable of executing on hundreds of data sources and many terabytes of data throughput, delivering scalability to any type and quantity of database, file system or analytic system.

Check out our other new features in today’s press release.

Also, read GigaOm’s take on our news here.


Watch the Webcast: Real Time Analytics: The Future of Big Data in the Agency

A couple of weeks ago, our Chief Science Officer, Dhruv Bansal, presented a webcast on how Big Data is changing the game for agencies looking to take their social media technologies and customer insight practices to the next level. In this video, you’ll learn how agencies can build their own Big Data platform – enabling them to go from data sources to selling insights – in a fraction of the time expected, and at a fraction of the cost.

Watch the video and you can learn more about how the Infochimps Platform enables some of the world’s top agencies to broaden and scale their proprietary product offerings through:

  • Sentiment and Influencer Analysis
  • Client Customer Insights
  • Real-Time Social Media Analytics
  • Infographic and Report Generation
  • Meme and/or Topic Tracking
  • Cross-Channel Reporting
  • Campaign Personalization/Behavioral Insights





Why the American Community Survey is Important

The American Community Survey is an ongoing statistical survey that samples a small percentage of the population every year. It’s one of our most popular APIs in the Data Marketplace and the data within it provides the key data for the Digital Elements IP Intelligence Demographics API.

Learn more about the importance and usefulness of this annual supplement to the US Census.

(via Flowing Data)

Take a Tour of Our Big Data Platform

Sometimes, when we are trying to explain what Infochimps does, it can be tough to help folks understand the total package. To help with this, we put together a tour of the Infochimps Platform. Now, you can discover how we can work with your team to take data from the sources you need, make it useful, and deliver the insights you need to improve your business. Check it out!


Exploring Big Data as an Agency Product

LEARN MORE ABOUT REAL-TIME ANALYTICS
↳ The Future of Big Data in the Agency

Join us for our free webinar on Real-Time Analytics for Agencies

Big Data is changing the game for companies of all shapes and sizes, including agencies looking to take their social media technologies and customer insight practices to the next level. But managing the massive velocity, volume, and variety of social media and other data sets at scale can be a huge challenge. Infochimps has built the largest open marketplace of data sets in the world. Now we’ve opened up our platform and work with some of the world’s top digital, advertising, and PR agencies, which use the Infochimps platform to broaden and scale their proprietary data offerings through:

  • Sentiment and Influencer Analysis
  • Client Customer Insights
  • Real-Time Social Media Analytics
  • Infographic and Report Generation
  • Meme tracking

We’re having a webcast on Thursday, May 31 @ 11:00 CST, titled Real-Time Analytics: The Future of Big Data in the Agency. Infochimps’ co-founder, Dhruv Bansal, one of the world’s leading data scientists, will present a quick demonstration on how agencies can build their own Big Data infrastructure and distribute costs across multiple clients while growing their product offerings with Big Data, in a fraction of the time you’d expect and for a fraction of the cost of Big Data talent, enterprise consultants, and/or custom enterprise solutions. We’d love for you to attend and participate.

Learn More

Why Real-Time Analytics? [Free White Paper]


When you think Big Data, the first words that come to mind are often Hadoop and NoSQL, but what do these technologies actually mean for your business? Different Big Data technologies have different use cases where they work best. For your real-time Big Data challenges, a very different class of tools must often be implemented.

In this free white paper, we’ll explore:

  • How to create a flexible architecture that allows you to use the best Big Data tools and technologies for the job at hand
  • Where Hadoop analysis and NoSQL databases work and where they can fall short
  • How Hadoop differs from real-time analytics and stream processing approaches
  • Visual representations of how real-time analytics works and real world use cases
  • How to leverage the Infochimps Platform to perform real-time analytics

Announcing Support for OpenStack and the Rackspace Cloud

Infochimps is happy to announce that we now support the next generation Rackspace Cloud, based on OpenStack. Through integration with the OpenStack API the Infochimps Platform can now power big data applications based in the Rackspace Cloud, expanding the reach of the Infochimps Platform and making the running of complex big data infrastructures quick and easy for a broader range of users.

Rackspace customers running the new OpenStack-based Rackspace Cloud Servers can quickly and easily spin up Hadoop clusters to power their big data applications in as little as 20 minutes with a single command using the Infochimps Platform. With the power of Ironfan, Infochimps’ open source provisioning tool, and Dashpot, Infochimps’ visualization and operations dashboard, customers can easily monitor and manage their Big Data operations on an ongoing basis, or leave it to Infochimps to manage it on the Rackspace Cloud for them.

Check out this demo of Infochimps Platform running in the Rackspace Cloud:

Why OpenStack and Rackspace?
From the beginning, the Infochimps Platform has been built on a foundation of open source tools for managing data, aimed at simplifying the experience of working with complex technologies such as Hadoop or Cassandra. Within the Infochimps Platform, Wukong, Ironfan, and Swineherd are major open source components of the stack. OpenStack supports our open source tradition with its strong open source ecosystem. It is used by and contributed to by not only Rackspace, but also organizations such as NASA, Canonical, Red Hat, Dell, HP, and AT&T, so its architecture serves a multitude of needs, rather than bending to the whims of a single provider.

OpenStack also encourages standardization among Infrastructure as a Service providers, which ultimately benefits everyone in the market. Clients can make (and remake) decisions based on their businesses’ current day to day needs, without needing to employ a crystal ball to try to predict which provider will be best for them in the long term. By sharing open and standard interfaces, cloud providers can compete on current quality and value, instead of fighting to lock-in customers based on promises.

The modular design of OpenStack is part of what makes standards possible without blocking innovation. There is a set of core APIs that every provider will support, plus extensions for added capabilities that not every provider will want to allow. The contracts these APIs provide can be (and often are) fulfilled by different back-end providers, letting each provider make different architectural choices without requiring customers to completely retool to take advantage of them. All of this allows apples-to-apples comparison of provider architectures, without making orange sales impossible.

What does OpenStack mean for Infochimps?
The work we’ve done to support this announcement has enabled us to provide a level of abstraction from the Amazon Web Services environment, and we can deploy our platform in a cloud agnostic way. Many of our customers have asked for implementations on their in-house cloud environments – our OpenStack support allows those implementations to be airlifted in using a common set of APIs that sit on top of whatever infrastructure already exists, instead of one-off installations that require more custom development and introduce brittleness.

Interested in learning more about Infochimps, Rackspace, and OpenStack? Contact us today for more information!

Announcing Dashpot, our Analytics & Operations Dashboard for the Infochimps Platform

Infochimps is happy to announce Dashpot, an easy-to-use analytics and operations dashboard that provides business metrics and visualization, cluster management capabilities, and system monitoring on top of the Infochimps Platform. Dashpot gives you real time visibility and control of your Big Data stack running with Infochimps, helping you go from input to insight faster, with our best-in-class Big Data infrastructure and tools.

Here are some of Dashpot’s key features:

  • Business Metrics – Dashpot’s in-stream visualization provides business users with the ability to capture and visualize business metrics on the fly as data is being ingested into their Infochimps Platform. By enabling data to be decorated in-stream through our Flume-based Data Delivery Service, Infochimps enables quick introspection on how a data or business process is performing. Organizations can view spikes or drops in key system or business metrics in near real-time, enabling quicker response to changing business conditions, saving time and helping ensure higher quality and more valuable information in the organization’s ultimate datastore. Infochimps business metrics are designed to provide an intermediate data visualization capability in conjunction with an organization’s existing investments in traditional business intelligence solutions.
  • Cluster Management – Built on the power of Ironfan, Dashpot offers simple Big Data system automation and management with a quick glance view into the servers and clusters currently running. Operations users can easily spin them up and down with a simple button click as their processing needs change, creating significant, easy-to-attain cost savings in machine usage.
  • Systems Monitoring – Dashpot provides integration with popular monitoring packages to give users at-a-glance views of Big Data system performance, availability, system integrity, and more. Dashpot is designed to integrate easily with any monitoring product; Infochimps has implemented the popular open source product Zabbix as its initial reference monitoring solution, integrating Zabbix graphs on system performance and availability into the Infochimps Dashpot dashboard.

Implementing and operating Big Data architectures can be difficult, requiring significant investment of resources and time. By choosing to use the Infochimps Platform, enterprises needn’t worry about the time and hassle of building and maintaining their own infrastructure. When combined with our tools, such as Ironfan and DDS, Dashpot’s simple visualizations and management tools help organizations keep their Big Data system humming, with little operational overhead. Best of all, Dashpot’s in-stream visualizations help provide the insights businesses need to get the most value out of their Big Data infrastructure investment.

Interested in talking about how we can help simplify your Big Data stack?  Contact us today for more information!