Monthly Archives December 2011

More Facebook Users than Buddhists

Check out this neat visualization of different gatherings of people. The size of each bubble represents population, and the colors represent different types of gatherings, including religions, countries, online services and mass fatalities. I’m curious to see the social ramifications as this chart changes. (What happens when India’s population overtakes China’s in 2030? Will the population of the internet overtake the population of Christians in the world?)

[Infographic: The Big Numbers]



Happy Holidays!

As most of the chimps have retreated back to the wilds of such foreign lands as Long Island, Cleveland, and Round Rock, things will be pretty quiet around here until Tuesday, December 27.  If you need us, we’ll still be around to pick up a bananaphone or answer emails.  Until we meet again, enjoy this Richard Feynman documentary…

Have You Lost LDB Yet?

[Infographic: Reasons To Get Away, The Most Stressful Christmas Music]

This Christmas infographic has been circulating for the past few days, so it’s entirely possible you’ve already seen it, but I did want to highlight two interesting points:
(1) It’s ironic that Little Drummer Boy comes in as the third most relaxing Christmas song, given the stress experienced when one loses the LDB game.
(2) This small study proves that no one finds Justin Bieber’s music “relaxing”, despite his massive Twitter popularity.

‘Tis the Season to Revitalize the Economy

[Infographic: Christmas]

This holiday season, I’ve found myself getting into the Christmas spirit much more than in recent years.  I’ve spent hours browsing toy stores and art fairs, as well as searching Amazon and Threadless for the perfect gifts.  This particular infographic brings up some interesting points about our changing view of holiday shopping.

In particular, I thought it was fascinating that online shoppers are predicted to spend 22% more than shoppers in physical stores. No wonder this “Last Day to Ship to Receive by Christmas” visualization has been so widely circulated over the past few weeks.

[Infographic: Last Day to Ship to Receive by Christmas, by visually]


How Big is Big Data?

This great infographic from GetSatisfaction explores the growing scope of Big Data.  If you need a primer for Big Data, this is a great place to start understanding the size and potential of the space.

[Infographic: Big Data]


Big Data Predictions for 2012 – Part One

A couple of days ago, O’Reilly’s Edd Dumbill took a look at hot topics in data for the coming year.  His five big data predictions for 2012 were absolutely spot on and illustrate the most important next steps in our ability to find insight in data: visualization, interactivity and abstraction.  As CTO of a tech startup at the center of a lot of these conversations, I thought I’d share some of my thoughts on the matter as well.  A full post with my 2012 big data predictions will be out in the next couple of weeks.

Moving Towards Insight, Not Data

Nobody wants data. It’s costly to store and bothersome to gather, parse and make useful. What you do want is insight: the ability to see deeper and make better decisions. However, even the tools of the future can take your internal data only so far. At some point, it becomes important to widen your field of observation to include explanatory variables that give your data context. In other words, it’s useful to bring in outside data to give your own internal data more shape and meaning.

[Image: table]

Explanatory variables give your data context, which leads you to insight.

Data marketplaces such as Infochimps, Factual and Microsoft Azure will take off as CIOs discover that BI tools are for generating questions, not answers. For example, pretend you are a camping equipment store based in Reno, NV with an e-commerce site. After years of business, you understand your yearly sales cycle (peaks in the early spring and early fall), who your repeat customers are, who your local competitors are, etc. However, you may uncover curiosities that cannot be answered by internal data: why do sales peak in early to mid-August, and why do so many folks request holds during that time? Why does your customer spread suddenly change from mostly folks in the Nevada area to folks from all over the country? Your business intelligence tools may fall short here. Explanatory variables from outside sources, including weather, popularity of competitors (Foursquare check-ins, web traffic, etc.) and major festivals in your area (oh HAI Burning Man!) may help you get to the bottom of why your store suddenly becomes flooded with neon furry leg-warmer wearing hippies around Labor Day.

In other words, your data creates questions, and outside data creates context and helps you answer them. In short, the answer to the too-much-data problem is more data.
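To make the camping store example a bit more concrete, here is a minimal Python sketch of the idea: flag the weeks where sales run well above the usual level, then look for an outside event that starts shortly afterwards. Everything in it is an assumption for illustration: the sales figures are invented, the event list is a stand-in for whatever outside dataset you bring in, and `explain_spikes`, `threshold` and `lead_days` are hypothetical names rather than part of any real tool.

```python
from datetime import date, timedelta
from statistics import median

# Internal data: invented weekly unit sales for the hypothetical store.
weekly_sales = {
    date(2011, 7, 4): 100,
    date(2011, 7, 18): 110,
    date(2011, 8, 1): 105,
    date(2011, 8, 15): 240,
    date(2011, 8, 22): 310,
    date(2011, 9, 12): 95,
}

# Outside data: illustrative event start dates (a stand-in for a real dataset).
outside_events = [
    ("Burning Man", date(2011, 8, 29)),
    ("Hypothetical Fall Expo", date(2011, 10, 15)),
]


def explain_spikes(sales, events, threshold=1.5, lead_days=21):
    """Map each unusually strong week to events starting within lead_days."""
    baseline = median(sales.values())
    explained = {}
    for week, units in sorted(sales.items()):
        if units <= threshold * baseline:
            continue  # an ordinary week, nothing to explain
        explained[week] = [
            name
            for name, start in events
            if timedelta(0) <= start - week <= timedelta(days=lead_days)
        ]
    return explained


if __name__ == "__main__":
    for week, names in explain_spikes(weekly_sales, outside_events).items():
        print(week.isoformat(), names or "no outside explanation found")
```

The internal data alone only tells you that mid-August is odd; the join against outside data is what suggests why. A spike that comes back with an empty list is a cue to go looking for another explanatory variable.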

Real-time Insight

In his article, Edd talks about the idea of streaming data processing. I haven’t gotten to play with Esper yet, but Ilya Grigorik has — as he puts it, Esper lets you “store queries not data, process each event in real-time, and emit results when some query criteria is met.”  You can read his overview here.
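The “store queries not data” idea can be sketched in a few lines of plain Python. This is not Esper and does not use its query language; it is only an illustration of the pattern, and the `StandingQuery` class, the three-event window and the 500 ms threshold are all assumptions made up for the example.

```python
from collections import deque


class StandingQuery:
    """A stored query: keeps a small sliding window, never the full history."""

    def __init__(self, window, threshold_ms):
        self.recent = deque(maxlen=window)  # older events fall off automatically
        self.threshold_ms = threshold_ms

    def on_event(self, latency_ms):
        """Process one event as it arrives; return a result only when the criteria are met."""
        self.recent.append(latency_ms)
        if len(self.recent) < self.recent.maxlen:
            return None  # window not full yet
        average = sum(self.recent) / len(self.recent)
        return average if average > self.threshold_ms else None


# Invented stream of request latencies, in milliseconds.
stream = [100, 120, 900, 950, 980, 110]

query = StandingQuery(window=3, threshold_ms=500)
alerts = [result for result in map(query.on_event, stream) if result is not None]

if __name__ == "__main__":
    print(alerts)
```

The point of the design is that memory use depends on the window size, not on how many events have flowed past.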

At a lower and simpler level, we here at Infochimps LOVE Flume. It does one thing, simply and well: gets data from over here to over there, perhaps doing things to it along the way. It was developed for reliable log handling, but there are so many wonderful ways to beautifully misuse Flume:

  • We replaced a set of massive batch jobs to parse scraped data with a simple Flume decorator that turns raw JSON into database-ready records. The new data flow is not only near-real-time, but also simpler, more reliable, and more maintainable.
  • If you’re shipping your weblogs around with Flume, ten lines of Ruby and a tiny little StatsD daemon are all you need to track, in real time, the rate, response code and latency of every web request. Joining the Church of Graphs has never been easier.
  • Trying to scale a distributed message queue (RabbitMQ, Resque, etc.)? If you care about throughput more than latency, look carefully at Flume; you’re probably better off.
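For a sense of how little code the StatsD trick needs, here is a rough sketch. It is in Python rather than the Ruby mentioned above, it is not our actual script, and it is not a Flume plugin: it only shows how one parsed web request could be turned into StatsD's plain-text metric lines and sent over UDP. The event fields (`status`, `latency_ms`), the `web` prefix and the function names are assumptions for illustration; port 8125 is StatsD's usual default.

```python
import socket


def statsd_lines(event, prefix="web"):
    """Turn one parsed web request into StatsD metric lines."""
    return [
        f"{prefix}.requests:1|c",                     # counter: request rate
        f"{prefix}.status.{event['status']}:1|c",     # counter per response code
        f"{prefix}.latency:{event['latency_ms']}|ms",  # timer: request latency
    ]


def send(lines, host="127.0.0.1", port=8125):
    """Fire-and-forget each metric line to a StatsD daemon over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # A made-up request, as it might look after parsing one weblog line.
    request = {"status": 200, "latency_ms": 42}
    send(statsd_lines(request))
```

StatsD aggregates the counters over each flush interval, so incrementing `web.requests` by one per request is enough to get a request rate on a graph.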

By the way, if timeseries are your bread and butter, check out our friends at Eidosearch.

Data Science Workflows and Tools for Fast Iteration

Edd describes Hadoop’s batch-oriented processing as sufficient for many uses, but argues that batch processing isn’t adequate for online and real-time needs. Spark/Mesos has great promise for the kind of interactive exploration Edd describes. (It’s as close as anyone’s gotten to a REPL for big data exploration.)

At Infochimps, our workflow revolves around writing short, readable scripts. See our chapter in O’Reilly’s Hadoop: The Definitive Guide for a case study on how we developed a toolkit for Hadoop streaming in a Ruby environment.

Visualization is King

The most important machine learning algorithms all, at their heart, exploit a graph of one sort or another. Yet simply seeing, in any form, a graph of significant scale is still an oafish RAM-exhausting slog. Far and away the best tool we’ve used is Gephi, an open source interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.

So, how essential is visualization, especially graph visualization, to quality data science? LinkedIn, home of one of the top data science teams in the world, used its consistently-brilliant/outrageously-unfair hiring advantage to bring aboard Mathieu Bastian, Gephi’s author. LinkedIn derives so much value from just being able to see and understand their data better that they’re funding the full-time development of Gephi, to the benefit of us all.

Married with Children… A Thing of the Past?

[Image: Married households with children, 1990. Source: The Washington Post]

Things were different in the 90s.  The internet was just becoming widely adopted and we had not yet heard of the phrase “social media”.  We made the switch from Walkmans to Discmans and hadn’t yet dreamed about all those ubiquitous products we can’t iLive without.  And… the idea of being “married with children” was still fairly common.

[Image: Married households with children, 2010. Source: The Washington Post]


Today, less than 21% of all US households are married couples with children under 18. In 1990, this number was 26.3%. The decline has been steady since the 1960s, when this number was 40.2%! (By comparison, married households without children have held relatively steady, only dropping from 30.3% to 28.0% between 1960 and 2000.)

Compare this to other stats and you’ll see a decline in married households in general, but a rise in single-parent households and folks electing to live alone. Our changing definition of “the American family” is increasingly evident as folks wait until they are older to get married (if they do at all), or forgo marriage altogether and choose to have children out of wedlock. Rather than waxing poetic about the potential social and economic implications of this shifting paradigm, I invite you to check out this great interactive visualization of Census data from the Washington Post and perhaps dig into our census data to discover your own insights.

We’re Still Living in the 1950s…

… at least right around the Christmas season. This great XKCD comic from last week shows us that our most popular “current” songs of the holiday season are in fact from the Christmases of Baby Boomers’ childhoods. I’m curious to see how this plays out for other holidays (can we consider “Thriller” the official song of Halloween?) and with other entertainment media.

[Comic: XKCD, Tradition]

Greek Debt Crisis? It’s Facebook’s Fault!

… just kidding. Of course the rise in Facebook users since 2005 did not cause the growth of Greece’s debt crisis, but the two are strangely correlated. So are the number of babies named Ava and the housing bubble, Michele Bachmann’s waning TV coverage and Staten Island Cakes going off the air, and more. This amusing chart reminds us of the difference between correlation and causation.

[Chart: correlation examples]
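If you want to see how easy it is to manufacture one of these, here is a small Python sketch. Any two series that both trend upward over the same period will show a high correlation coefficient, whether or not they have anything to do with each other. The two series below are invented numbers, not real Facebook or Greek debt figures, and `pearson` is just a hand-rolled version of the standard Pearson formula.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Two invented, unrelated series that both happen to grow year over year.
series_a = [1, 2, 4, 7, 11, 16, 22]
series_b = [10, 13, 15, 20, 24, 31, 36]

if __name__ == "__main__":
    # Prints a value close to 1, even though neither series drives the other.
    print(round(pearson(series_a, series_b), 3))
```

A coefficient near 1 here says only that both lines go up together. It says nothing about one causing the other, which is exactly the joke in the chart.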

What We Feed Our Kids in School

[Infographic: Do Students Eat Like Prisoners?, by GOOD]

Remember about a month ago when we showcased an infographic comparing the cost of prison versus the cost of going to Princeton? Today, we bring you a disturbing infographic from GOOD that compares the typical prison meal and the typical school cafeteria meal. According to the Washington Post:

While the U.S. Department of Agriculture writes guidelines for what school meals should look like, few schools actually follow them. Just 20 percent of schools served meals that met federal guidelines for fat content, according to a 2007 USDA audit.

Side Note: The same Washington Post article debunks the headline-du-jour of a few weeks ago stating that Congress has declared pizza sauce, or more specifically, tomato paste, a vegetable. The sad truth? One-eighth of a cup of tomato paste (two tablespoons, aka the amount needed to cover a slice of pizza) was already considered equivalent to a full serving (half a cup) of vegetables. The Obama administration’s new guidelines would have changed the amount of tomato paste required to equal a full serving of vegetables to half a cup. Congress blocked this change. So, while Congress did not “declare pizza a vegetable”, they did allow the average slice of pizza to continue to count towards a full serving of vegetables.