Monthly Archives: October 2012

Next Gen Real-time Streaming with Storm-Kafka Integration

At Infochimps, we are committed to embracing cutting edge technology, while ensuring that the latest Big Data innovations are enterprise-ready. Today, we are proud to deliver on that promise once again by announcing the integration of Storm and Kafka into the Cloud::Streams component of the Infochimps Cloud.


Cloud::Streams provides solutions for challenges involving:

  • Large-scale data collection – clickstream web data, social media and online monitoring, financial market data, machine-to-machine data, sensors, business transactions, listening to or polling application APIs and databases, etc.
  • Real-time stream processing – real-time alerting, tagging and filtering, real-time applications, fast analytical processing like fraud detection or sentiment analysis, data cleansing and transformation, real-time queries, distribution to multiple clients, etc.
  • Analytics system ETL – providing normalized/de-normalized data using customer-defined business logic for various analytics data stores and file systems including Hadoop HDFS, HBase, Elasticsearch, Cassandra, MongoDB, PostgreSQL, MySQL, etc.

Storm and Kafka

Recently in my guest blog post on TechCrunch, I mentioned why you should care about Storm and Kafka.

“With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message gets processed in real-time, reliably. In tandem, Storm and Kafka can handle data velocities of tens of thousands of messages every second.”

Ultimately, Storm and Kafka form the best enterprise-grade real-time ETL and streaming analytics solution on the market today. Our goal is to put the same technology that Twitter uses to process over 400 million tweets per day in your hands. Other companies that have adopted Storm in production include Groupon, Alibaba, The Weather Channel, and FullContact.

Nathan Marz, Storm creator and senior Twitter engineer, comments on Storm’s rapid growth:

“Storm has gained an enormous amount of traction in the past year due to its simplicity, robustness, and high performance. Storm’s tight integration with the queuing and database technologies that companies already use have made it easy to adopt for their stream computing needs.”

Storm solves a broad set of use cases, including “processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more.”

Apache Kafka, which was developed by LinkedIn to power its activity streams, provides an additional reliability guarantee, robust message queueing, and distributed publish-subscribe capabilities.


Cloud::Streams is fault-tolerant and linearly scalable, and performs enterprise data collection, transport, and complex in-stream processing. In much the same way that Hadoop provides batch ETL and large-scale batch analytical processing, Cloud::Streams provides real-time ETL and large-scale real-time analytical processing — the perfect complement to Hadoop (or in some cases, what you needed instead of Hadoop).

Cloud::Streams adds important enterprise-class enhancements to Storm and Kafka, including:

  • Integration Connectors to your existing tech environment for collecting required data from a huge variety of data sources in a way that is robust yet as non-invasive as possible
  • Optimizations for highly scalable, reliable data import and distributed ETL (extract, transform, load), fulfilling data transport needs
  • Developer Toolkit for rapid development of decorators, which perform the real-time stream processing
  • Guaranteed delivery framework and data failover snapshots to send processed data to analytics systems, databases, file systems, and applications with extreme reliability
  • Rapid solution development and deployment, along with our expert Big Data methodology and best practices

Infochimps has extensive experience implementing Cloud::Streams, both for clients and for our internal data flows including large-scale clickstream web data flows, massive Twitter scrapes, the Foursquare firehose, customer purchase data, product pricing data, and more.

Obviously, data failover and optimizations are key to enterprise readiness. Above and beyond that though, Cloud::Streams is a joy to work with because of its flexible Integration Connectors and the Developer Toolkit. No matter where your data is, you can access and ingest it with a variety of input methods. No matter what kind of work you need to perform (parse, transform, augment, split, fork, merge, analyze/process, …) you can quickly develop that processor unit, test it, and deploy it as a Cloud::Streams decorator.
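To make the idea of small, testable processing units concrete, here is a minimal Python sketch of a chain of stages (parse, transform, augment) applied to a stream of records. It is not the Cloud::Streams Developer Toolkit API; every name in it (`parse`, `transform`, `augment`, `run`) is hypothetical and chosen only for illustration.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, Sequence

# Hypothetical names throughout: this is NOT the Cloud::Streams API,
# only an illustration of small processing units chained over a stream.
Record = Dict[str, object]
Stage = Callable[[Iterator[Record]], Iterator[Record]]


def parse(lines: Iterable[str]) -> Iterator[Record]:
    """Parse raw JSON lines, skipping anything malformed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(records: Iterator[Record]) -> Iterator[Record]:
    """Normalize the text field to lowercase."""
    for rec in records:
        yield {**rec, "text": str(rec.get("text", "")).lower()}


def augment(records: Iterator[Record]) -> Iterator[Record]:
    """Add a derived field: the number of words in the text."""
    for rec in records:
        yield {**rec, "word_count": len(str(rec["text"]).split())}


def run(lines: Iterable[str], stages: Sequence[Stage]) -> list:
    """Feed parsed records through each stage in order."""
    stream = parse(lines)
    for stage in stages:
        stream = stage(stream)
    return list(stream)


raw = ['{"text": "Storm AND Kafka"}', "not json", '{"text": "Hello"}']
results = run(raw, [transform, augment])
for rec in results:
    print(rec)
```

Because each stage is an ordinary function over an iterator, it can be unit-tested on a handful of sample records before it is wired into a larger flow.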

One of our most recent customers was able to build an entire production application flow for large-scale social media data analysis using the Infochimps Cloud development framework in just 30 days with only 3 developers. That is both unheard of from an enterprise timeline perspective, as well as an amazing case of business ROI. Big Data is too important to spend months and months developing. Your business needs results now, and the Infochimps Cloud leverages the talent you have today for fast project success.

How much is it worth to you to launch your own revenue generating applications for your customers? Or for your internal stakeholders as part of a Big Data business intelligence initiative? How much value would launching 12 months sooner provide your organization? These are the questions whose answers we’re trying to make obvious.

Steve Blackmon, Director of Data Sciences at W2O Group, explains why they are working with Infochimps and Cloud::Streams:

“Storm and Kafka are excellent platforms for scalable real-time data processing. We are very pleased that Infochimps has embraced Storm and Kafka for Cloud::Streams. This new offering gives us the opportunity to supplement our listening and analytics products with Infochimps’ data sources, to integrate capabilities seamlessly with our partners who also use Storm, and to retain Infochimps’ unique technical team to support and optimize our data pipelines.”

More Information

Check out the full press release here, including quotes from CEO Jim Kaskade and co-founder and CTO Flip Kromer.

You can access additional resources from the Cloud::Streams web page or our general resources directory.

Lastly, check out our previous product announcements! In February, we launched the Infochimps Platform. In April we launched Dashpot as well as our support of OpenStack. In August, we announced the Platform’s newest release.


The 3 Waypoints of a Data Exploration

Part of our goal is to unlock the big data stack for exploratory analytics.

How do you know when you’ve found the right questions? That you’ve gone deep enough to trust the answers? Here’s one sign.

The 3 Waypoints of a Data Exploration:

  • What you knew: is it validated by the data?
  • What you suspect: how do your hypotheses agree with reality?
  • What you would have never suspected: is there something you could not have predicted in advance?

In Practice:
A while back, a friend asked me about signals in the Twitter stream for things like “Spanglish”, where multiple languages are mixed in the same message. I did a simple exploration of tweets from around the world (simplifying at first to non-English languages) to see how easy such messages are to find.

I took 100 million tweets and looked for only those “non-keyboard” characters — é (e with acute accent) or 猿 (Kanji character meaning ‘ape’) or even ☃ (snowman).

Using all the cases where there were two non-keyboard characters in the same message, I assembled the following graph.

Imagine tying a little rubber band between every pair of characters, as strong as the number of times they were seen hanging out together; also, give every character the desire for a bit of personal space so they don’t just pile on top of each other. It’s a super-simple model that tools like Cytoscape or Gephi will do out-of-the-box.

That gave this picture (I left out the edges for clarity and hand-arranged the clusters at the bottom):

[Image: 3 Waypoints]
This “map” of the world — the composition of each island, and the arrangement of the large central archipelago — popped out of this super-simplistic model. It had no information about human languages other than “sometimes, when a person says 情報 they also say 猿.” Any time the data is this dense and connected, I’ve found it speaks for itself.
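The counting step behind a map like this can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the original exploration: it assumes “non-keyboard” simply means “outside ASCII”, and the sample messages are made up. Each pair count is the strength of the “rubber band” between two characters.

```python
from collections import Counter
from itertools import combinations


def non_keyboard_chars(message: str) -> set:
    """Distinct non-ASCII, non-whitespace characters in a message.

    Treating "non-keyboard" as "outside ASCII" is an assumption.
    """
    return {ch for ch in message if ord(ch) > 127 and not ch.isspace()}


def cooccurrence_edges(messages) -> Counter:
    """Count how often each unordered pair of characters shares a message."""
    edges = Counter()
    for message in messages:
        # Sort so that (a, b) and (b, a) are counted as the same edge.
        chars = sorted(non_keyboard_chars(message))
        for pair in combinations(chars, 2):
            edges[pair] += 1
    return edges


# Made-up sample messages, for illustration only.
tweets = ["情報 and 猿", "猿 情報!", "caf\u00e9 ☃", "plain ascii only"]
edges = cooccurrence_edges(tweets)
for (a, b), weight in edges.most_common():
    print(a, b, weight)
```

The resulting weighted edge list is the kind of input a graph layout tool works from; the layout itself is left to the tool.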

Now let’s look at the 3 Waypoints.

What We Knew: What I really mean by “knew” is “if this isn’t the case, I’m going to suspect my methods much more strongly than the results”:

  • Most messages are in a single language, but there are some crossovers. After the fact, I colored each character by its “script” type from the Unicode standard (e.g. Hangul is in cyan). As you can see, most of the clouds have a single color.
  • Languages with large alphabets have tighter-bound clouds, because there are more “pairs” to find (e.g. the Hiragana character cloud is denser than the Arabic cloud).
  • Languages with smaller representation don’t show up as strongly (e.g. there are not as many Malayalam tweeters as Russian (Cyrillic) tweeters).
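Coloring by script needs a way to look up each character’s script. Python’s standard `unicodedata` module does not expose the Unicode Script property directly, so the sketch below approximates it from the first word of the character’s name. That heuristic is an assumption made for illustration (Han characters, for example, come back as “CJK”), and `rough_script` is a hypothetical helper, not part of any library.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Approximate a character's script from its Unicode name.

    Letters are labeled by the first word of their name; everything
    else falls back to its general category (punctuation, symbols...).
    """
    category = unicodedata.category(ch)
    if not category.startswith("L"):
        return "non-letter (" + category + ")"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


# e with acute, a Han character, a Hangul syllable, Katakana Tsu, snowman
for ch in "\u00e9猿\uac00\u30c4\u2603":
    print(ch, rough_script(ch))
```

A production version would more likely read the script from the Unicode Character Database’s Scripts data than rely on character names.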

What We Suspected:

First, about the clusters themselves:

  • Characters from Latin scripts (the accented versions of the characters English speakers are familiar with) do indeed cluster together, and group within that cluster. Many languages use ö, but only subsets of them use Å or ß. You can see rough groups for Scandinavian, Romance and Eastern-European scripts.
  • Japanese and Chinese are mashed together, because both use characters from the Han script.

Second, about the binds between languages. Clusters will arrange themselves in the large based on how many co-usages were found. A separated character dragged out in the open is especially interesting — somehow no single language “owns” that character.

Things we suspected about the connections:

  • Nearby countries will show more “mashups”. Indeed, Greek and Cyrillic are tightly bound to each other, and loosely bound to European scripts; Korean has strong ties to European and Japanese/Chinese scripts. This initial assumption was only partially correct, though: Thai appears to have stronger ties to European than to Japanese/Chinese scripts.
  • Punctuation, Math and Music are universal. Look closely and you’ll see the fringe of brownish characters pulled out into “international waters”.

What We Never Suspected in Advance: There were two standouts that slapped me in the face when taking a closer look.

The first is the island in the lower right, off the coast of Europe. It’s a bizarre menagerie of Amharic, International Phonetic Alphabet and other scripts. What’s going on? These are characters that taken together look like upside-down English text: “¡pnolɔ ǝɥʇ uı ɐʇɐp ƃıq”. (Try it out yourself.) My friend Steve Watt’s reaction was, “So you’re saying that within the complexity of the designed-for-robots Unicode standard, people found some novel, human way to communicate? Enterprises and Three Letter Agencies dedicate tons of resources for such findings.”
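To see why upside-down text pulls in characters from unrelated corners of Unicode, here is a small Python sketch that flips a phrase. The mapping is illustrative and covers only the letters in the sample phrase; real “flip text” generators use larger tables, and their exact character choices may differ from the ones assumed here.

```python
# Illustrative mapping only: covers just the letters in the sample phrase.
FLIP = {
    "a": "\u0250",  # LATIN SMALL LETTER TURNED A (IPA Extensions)
    "b": "q",
    "c": "\u0254",  # LATIN SMALL LETTER OPEN O (IPA Extensions)
    "d": "p",
    "e": "\u01dd",  # LATIN SMALL LETTER TURNED E
    "g": "\u0183",  # LATIN SMALL LETTER B WITH TOPBAR
    "h": "\u0265",  # LATIN SMALL LETTER TURNED H (IPA Extensions)
    "i": "\u0131",  # LATIN SMALL LETTER DOTLESS I
    "n": "u",
    "t": "\u0287",  # LATIN SMALL LETTER TURNED T (IPA Extensions)
    "u": "n",
    "!": "\u00a1",  # INVERTED EXCLAMATION MARK
}


def flip(text: str) -> str:
    """Reverse the text and swap each character for its upside-down twin."""
    return "".join(FLIP.get(ch, ch) for ch in reversed(text.lower()))


flipped = flip("big data in the cloud!")
print(flipped)

# Collect the characters a "non-keyboard" filter would pick up.
non_ascii = sorted({ch for ch in flipped if ord(ch) > 127})
print(len(non_ascii), "distinct non-ASCII characters")
```

Every flipped message uses the same small set of look-alike characters together, which is exactly the kind of dense co-occurrence that forms its own island on the map.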

As soon as you’ve found a new question within your answers you’ve reached Waypoint 3 — a good sign for confidence in your results.

However, my favorite is the one single blue (Katakana) character that every language binds to (see close-up below). Why is Unicode code point U+30C4, the Katakana “Tsu” character, so fascinating?

[Image: 3 Waypoints Smiley]

Because ツ looks like a smiley face.
The common bond across all of humanity is a smile.


Predictive Analytics Summit in New York


Our partners at *IE. would like to introduce you to the exclusive Predictive Analytics Summit for Banking & Financial Services, at the Conrad New York on December 6 & 7, 2012.

This summit will bring together the leaders and innovators from the banking industry for two days of unparalleled networking with like-minded professionals.

This event will combine keynote presentations with open discussion and interactive workshops in an event acclaimed for its innovative insight. It is a unique opportunity to share challenges and best practices with leaders in a collaborative environment.

Register Today.


Video: Making Sense of Big Data

Did you miss our Making Sense of Big Data: An Infochimps Thought Leadership Webinar? Or maybe you just want to watch the webinar again?

Well, you’re in luck. We recorded the webinar for your convenience. Watch it here.

Given by big data expert and Infochimps CEO Jim Kaskade, the recorded webinar will engage you in discussion about:

  • How to effectively manage, protect and leverage big data in your enterprise
  • How to become a data-driven organization
  • Best practices – from business problem definition to ROI
  • Compelling use cases for business transformation
  • How to develop data-centric applications – using Infochimps Big Data PaaS

Watch the Webinar

For more recorded webinars, including our High Speed Retail Analytics webinar, visit our Resources Page.


Case Study: Koupon Media + Infochimps

The Infochimps Big Data Platform provides customers with an affordable and repeatable architecture that helps them see a return from their big data efforts. Customers get to data insights fast, with the full power of Big Data technology and developer-friendly simplicity.

Our newest case study highlights Koupon Media, a Digital Campaign Management platform provider. Koupon Media uses the Infochimps Platform to build a real-time demographics dashboard reporting on coupon programs to their customers, a true competitive advantage.

Download the case study here to read about how Koupon Media gained their competitive advantage with Infochimps.

Want more? For more customer case studies, visit our Resources Page.


Is “Big Data” the Wrong Term?

It’s likely that, like me, you have heard again and again about “big data”, its 3 V’s, and the Hadoop brand. Yes, volume, velocity, and variety of data are making it difficult to use traditional data solutions like BI cubes, relational databases, and bespoke data pipelines. The world needs new superheroes like Hadoop, NoSQL, NewSQL, DevOps, etc. to solve our woes.


However, these new technologies and approaches have done much more than just solve the problems around petabytes of data and thousands of events per second. They are the right way to do data. That’s why I’m not convinced the term “big data” was a good choice for us to land on as an industry. It’s really “smart data” or “scalable data.” And despite my distaste for adding a version number to buzz phrases, even “Data 2.0” would be more apt.

If you are a CTO/CIO, system architect, manager, consultant, developer, sys admin, or simply an interested professional – my goal is to prompt some initial points on why big data constitutes a good approach to data management and analytics, regardless of the speed and quantity of data.

Scalable Data: Multi-Node Architecture and Infrastructure-as-Code

Distributed, horizontally scalable multi-node systems are always the right way to do infrastructure, no matter the size of your data or the size of your IT team. This wasn’t always the case, but multi-node systems are now as easy to manage as single-node solutions. Monitoring, logging, management software, and more are all baked right in; systems come to life in a coordinated fashion that hides the complexity and scales as needed. You can test your infrastructure in the same way you test programming code: while manually testing a multi-node system may be difficult, testing a piece of code is straightforward.

One of the worst things that can happen to an IT team is having to manage major architecture changes. Using open source, multi-node technologies with an infrastructure-as-code foundation lets organizations grow organically and swap tools and software in and out as needed. Simply modify your infrastructure definitions, test your code, and deploy. Additionally, this kind of framework works perfectly with the DevOps approach to system management. Code repositories are collaborative and iterative – giving individual developers empowerment to directly manage infrastructure, while having the safeguards and tests in place to ensure reliability and quality.
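As a rough illustration of “modify your infrastructure definitions, test your code, and deploy”, the Python sketch below treats a cluster definition as plain data and checks it with an ordinary function. The definition format, the field names, and the rules in `validate` are all hypothetical; real infrastructure-as-code tools have their own formats and test harnesses.

```python
# Hypothetical cluster definition; names and fields are invented for illustration.
CLUSTER = {
    "name": "streams-demo",
    "nodes": [
        {"role": "kafka", "count": 3, "monitoring": True, "logging": True},
        {"role": "storm-worker", "count": 5, "monitoring": True, "logging": True},
        {"role": "zookeeper", "count": 3, "monitoring": True, "logging": True},
    ],
}


def validate(cluster: dict) -> list:
    """Return a list of human-readable problems; empty means the definition passes."""
    problems = []
    for node in cluster["nodes"]:
        role = node["role"]
        if node["count"] < 1:
            problems.append(f"{role}: needs at least one instance")
        if not (node.get("monitoring") and node.get("logging")):
            problems.append(f"{role}: monitoring and logging must be enabled")
        # Example rule: keep coordination nodes at an odd count.
        if role == "zookeeper" and node["count"] % 2 == 0:
            problems.append(f"{role}: use an odd count so a majority quorum exists")
    return problems


def scale(cluster: dict, role: str, count: int) -> dict:
    """Return a modified copy of the definition with one role resized."""
    nodes = [
        dict(node, count=count) if node["role"] == role else dict(node)
        for node in cluster["nodes"]
    ]
    return {**cluster, "nodes": nodes}


print(validate(CLUSTER))                        # definition as written
print(validate(scale(CLUSTER, "zookeeper", 4)))  # a proposed change, checked before deploy
```

The point is the workflow rather than the specific rules: a proposed change is just a new version of the definition, and it can be rejected by a test before anything is deployed.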

Smart Data: Machine Learning and Data Science

You don’t have to have petabytes of data to begin implementing smart algorithms. To run your business more efficiently, you need to be predictive. You must forecast business and market trends before they happen so you can anticipate how to steer your organization. The companies that win will be the ones analyzing and understanding as much data as possible – building data science as a key competency. Big data tools are making it easier to work with data by providing tools like Mahout for machine learning, Hive for business intelligence queries, or R for statistical analysis, which can interface with Hadoop. Because of big data architecture, you can keep data fresh, use a larger swath of data, and use the newest, most powerful tools to perform the analysis and processing.

Agnostic Data: The Right Database for Each Job

New data pipelining frameworks enable real-time stream processing with multi-node scalability and the ability to fork or merge flows. What that means is, you can easily support multiple databases for multiple problems: columnar stores as primary data stores, relational databases for reporting, search databases for data exploration, graph databases for relationship data, document stores for unstructured data, etc. Because of data splitting/merging capabilities, and your DevOps infrastructure ensuring your databases have integrated monitoring and logging, the added burden of having more than one database is minimal. You just have to learn how to interface with the data through easy-to-use APIs and client libraries.
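Forking a flow to several stores can be pictured with a short Python sketch. The `Fork` class and the sink names below are hypothetical, and the “databases” are in-memory lists standing in for real clients; this only illustrates routing one stream to the right store for each job.

```python
from collections import defaultdict


class Fork:
    """Fan one stream of records out to several named sinks (hypothetical helper)."""

    def __init__(self):
        self.routes = []                # list of (predicate, sink name)
        self.sinks = defaultdict(list)  # in-memory stand-ins for real databases

    def route(self, sink, predicate=lambda record: True):
        """Send records matching the predicate to the named sink."""
        self.routes.append((predicate, sink))
        return self

    def send(self, record):
        """Deliver a record to every sink whose predicate accepts it."""
        for predicate, sink in self.routes:
            if predicate(record):
                self.sinks[sink].append(record)


fork = (
    Fork()
    .route("primary_store")                               # everything
    .route("search_index", lambda r: "text" in r)         # searchable content
    .route("graph_store", lambda r: "follows" in r)       # relationship data
)

events = [
    {"user": "a", "text": "hello"},
    {"user": "a", "follows": "b"},
]
for event in events:
    fork.send(event)

for name, records in fork.sinks.items():
    print(name, len(records))
```

Adding another database for another job then amounts to adding one more route, which is what keeps the burden of running several stores low.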

Holistic Data: Hadoop is Not The End All, Be All

Finally, let’s tackle Hadoop specifically. Hadoop is oriented around large-scale batch processing of data. But so much of what big data is includes databases, data integration/collection, real-time stream processing, and data exploration. Hadoop is not a one trick pony, but it’s also not the answer to every data problem known to man.

Frameworks like Flume, Storm, and S4 are making it easier to perform stream processing, such as collecting hundreds of tweets per second or thousands of ad impressions per second, or processing data in near real-time as it flows to its destination (whether a database, Hadoop filesystem, etc.). New database technologies are providing more powerful ways of querying data and building applications. R, Hive, Mahout, and more are providing better data scientist tools. Tableau, Pentaho, GoodData, and others are pushing the envelope with data visualization and big data dashboarding.


Big data software and frameworks are the right foundation for data + data integration and collection + data science + statistical analysis + infrastructure management and administration + IT scaling + data-centric applications + data exploration and visualization, often regardless of data size.

Your organization benefits from adopting these best practices early and working with vendors that understand your company’s problem isn’t just “oh no, I have too much data”. It’s all about return on investment. The big data approach lowers overhead, enables faster and more efficient IT infrastructure management, generates better insights, and puts them to work in your organization.


Big Data + Infochimps + Strata + Hadoop World

Big news about Big Data in the Big Apple. This year, Strata has joined forces with Hadoop World. Join us in our excitement as we support this opportunity to create the largest gathering of the Apache Hadoop community in the world!

What: Strata Conference + Hadoop World
When: Oct 23-25, 2012
Where: Hilton Hotel, New York, NY
Why: Learn the Tools and Techniques That Make Data Work
O’Reilly’s Strata Conference brings big data back to the Big Apple for its second year October 23-25, 2012, at the Hilton New York. Co-presented with Cloudera and joining forces with Hadoop World, Strata New York explores the changes brought to technology and business by big data, data science, and pervasive computing. Build a data-driven business, learn the latest on the skills & technologies you need to make data work, and join the largest gathering of Apache Hadoop users in the world.

Register Today and Save 20% with code INFO

Follow Strata at @strataconf and hashtag #strataconf on Twitter.


Big Data Projects, Short Survey, Enter to Win


When doing Big Data projects, what do you want your execs to know? 
We value your opinion! Take this short survey, which takes under 10 minutes, for a chance to win over $500 in Amazon gift cards and one of 5 memberships to one of the largest community-driven sites focused on providing relevant news and information on data management, collaboration software, development tools and cloud computing to help information technology (IT) professionals succeed in the field.

Take the Survey Here!



Live Webcast: Making Sense of Big Data


Title: Making Sense of Big Data: An Infochimps Thought Leadership Webinar
Date: Thursday, October 11, 2012
Time: 10a Pacific/12p Central/1p Eastern

Register Today!  

Big data is likely the most hyped term in tech of the past two years; however, amidst all the hype, we may have actually missed the point of having all of this data in the first place: to generate more value for businesses. Importantly, it’s this gap between hype and value that speaks to why we need to define how to leverage big data technology in the enterprise.

Register for this live webcast and listen to big data expert and Infochimps CEO, Jim Kaskade, explain where to start and how to effectively manage, protect and leverage the growing amounts of data in your enterprise. In addition, you will hear engaging discussion about:

  • How to become a data-driven organization
  • Best practices – from business problem definition to ROI
  • Compelling use cases for business transformation
  • How to develop data-centric applications – using Infochimps Big Data PaaS

Join the webcast here. Looking forward to seeing you Thursday, October 11th!