Monthly Archives: January 2011

Help Release Over 40,000 Songs with Lyrics

My friend Tahir Hemphill has built the Hip Hop Word Count, a searchable database of over 40,000 songs with lyrics and metadata – including dates and geolocation of the artists.  Check out Tahir talking about the project:

He was picked up by ReadWriteWeb recently, and he’s raised over $6,000 through his Kickstarter campaign (from the likes of Clay Shirky, no less) to launch the service publicly. He’s also started to share his data on Infochimps: you can now download a pack of Jay-Z lyrics, and you can find similar data by searching the music tag.

Show your support for another developer/artist that’s doing something cool with data, and contribute to his fundraising campaign. Tahir will be using the proceeds to release the data, and his tool, to the public.

Stay tuned next week for a release of data from the Million Song Dataset project, a massive dataset that catalogs the features of a million songs. Music data like this, and like the HHWC’s, helps create web services like Pandora, powers neat graphics about whether crunk was first used in the South, and makes the dreams of us data hobbyists come true.

Find Us at the Strata Conference February 1-3

The O’Reilly Strata Conference takes place in Santa Clara, California from February 1-3. According to the website,

Unprecedented computing power and connectivity are bringing new layers of experience to our lives: a change that brings both opportunity and the challenge of new technologies and skills. The future belongs to those who understand how to collect and use their data successfully.

Our CTO Flip Kromer will be giving a talk entitled “Big Data, Lean Startup: Data Science on a Shoestring”. At Infochimps, we’ve been able to craft a team of data scientists by drawing upon smart, enthusiastic hires from the nearby university in nontraditional areas such as non-linear dynamics and statistical physics, and equipping them with the process and tools that accelerate their transformation into bona fide data scientists. Drawing on his experience as a teacher and his deep programming and data background, Flip developed our methodology, which embraces failure and constantly pushes people outside of their comfort zone, ultimately resulting in better, smarter scientists.

Our Director of Product and Marketing Dennis Yang will also be giving a talk on Data Marketplaces. Does information really want to be free? While the Internet is full of open data, there’s plenty of data companies are willing to pay handsomely for — particularly if it’s timely and well aggregated. As a result, data marketplaces are a burgeoning business. This panel will look at the market for data, and where it’s headed.

The Strata Conference is sold out, but you can follow updates on Twitter. We’ll be tweeting the event from @infochimps, and the hashtag for this event is #strataconf.

Installing Wukong

You can find more posts on big data, Hadoop, Pig, and Wukong at my personal blog, Data Recipes.
(The Data Chef)

Wukong is hands down the simplest (and probably the most fun) tool to use with Hadoop. It especially excels at the following use case:

You’ve got a huge amount of data (let that be whatever size you think is
huge). You want to perform a simple operation on each record: parsing out
fields with a regular expression, adding two fields together, stuffing
those records into a data store, etc. These are called map-only jobs; they
do NOT require a reduce. Can you imagine writing a Java MapReduce program
just to add two fields together? Wukong gives you all the power of Ruby
backed by all the power (and parallelism) of Hadoop streaming. Before we
get into examples, and there will be plenty, let’s make sure you’ve got
Wukong installed and running.
Installing Wukong

First and foremost, you’ve got to have Ruby installed and running on your
machine. Most of the time you already have it. Try checking the version
in a terminal:

$: ruby --version
ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]

If that fails, then I bet Google can help you get Ruby installed on
whatever OS you happen to be using.

Next, make sure you’ve got Rubygems installed:

$: gem --version

Once again, Google can help you get it installed if you don’t have it.

Wukong is a rubygem, so we can just install it that way:

sudo gem install wukong
sudo gem install json
sudo gem install configliere

Notice we also installed a couple of other libraries to help us out (the
json gem and the configliere gem). If at any time you get weird errors
(LoadError: no such file to load -- somelibraryname) then you probably
just need to gem install somelibraryname.

An example

Moving on. You should be ready to test out running wukong locally now.
Here’s the most minimal working wukong script I can come up with that
illustrates a map only wukong script:

#!/usr/bin/env ruby

require 'rubygems'
require 'wukong'

class LineMapper < Wukong::Streamer::LineStreamer
  def process line
    yield line
  end
end

Wukong::Script.new(LineMapper, nil).run

Save that into a file called wukong_test.rb, make it executable (chmod a+x wukong_test.rb), and run it like so:

cat wukong_test.rb | ./wukong_test.rb --map

If everything works as expected then you should see exactly the contents
of your script dumped onto your terminal. Let’s examine what’s actually
going on here.

Boilerplate Ruby

First, we’re letting the interpreter know we want to use Ruby with the
first line (somewhat obvious). Next, we’re including the libraries we
need: rubygems and wukong.

The guts

Then we define a Ruby class for doing our map job called LineMapper.
This guy subclasses from Wukong’s LineStreamer class. All the
LineStreamer class does is read lines from stdin and hand them as
arguments to LineMapper’s process method. The process method then does
nothing more than yield the line back to the LineStreamer, which emits
it back to stdout.

The runner

Finally, we have to let Wukong know we intend to run our script, so we
create a new Script object with LineMapper as the mapper class and nil as the reducer class.

More succinctly, we’ve written our own cat program. When we ran the
above command we simply streamed our script, line by line, through
itself. Try streaming some real data through the program and adding
some more stuff to the process method. Perhaps parsing the line with a
regular expression and yielding numbers? Yielding words? Yielding
characters? The choice is yours. Have fun with it.
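For instance, a word-yielding process method might look like the sketch below. It’s written as a bare Ruby method (our illustration, not part of Wukong) so you can try it without a LineStreamer subclass; inside a real mapper class the body would be identical.

```ruby
# What a word-yielding process method could look like: split the line
# on non-word characters, drop empties, and yield each word back to
# the streamer.
def process line
  line.split(/\W+/).reject { |w| w.empty? }.each { |word| yield word }
end

# Simulate the streamer by passing a block that prints each yielded word:
process("the quick, brown fox") { |word| puts word }
```

Inside Wukong, the block is supplied by the streamer itself, which writes each yielded record back to stdout for Hadoop streaming to pick up.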

Meatier examples to come.

Graph Processing With Wukong and Hadoop

As a last (for now) tutorial oriented post on Wukong, let’s process a network graph.

Get Data

This airport data (airport edges) from Infochimps is one such network graph with over 35 million edges. It represents the number of flights and passengers transported between two domestic airports in a given month. Go ahead and download it.

Explore Data

We’ve got to actually look at the data before we can make any decisions about how to process it and what questions we’d like answered:

$: head data/flights_with_colnames.tsv | wu-lign
origin_airport  destin_airport  passengers  flights  month
MHK             AMW             21          1        200810
EUG             RDM             41          22       199011
EUG             RDM             88          19       199012
EUG             RDM             11          4        199010
MFR             RDM             0           1        199002
MFR             RDM             11          1        199003
MFR             RDM             2           4        199001
MFR             RDM             7           1        199009
MFR             RDM             7           2        199011

So it’s exactly what you’d expect: an adjacency list with (origin node, destination node, weight_1, weight_2, timestamp). There are thousands of data sets with similar characteristics…

Ask A Question

A simple question to ask (and probably the first question you should ask of a graph) is what the degree distribution is. Notice there are two flavors of degree in our graph:

1. Passenger Degree: For a given airport (node in the graph) the number of passengers in + the number of passengers out. Passengers in is called the ‘in degree’ and passengers out is (naturally) called the ‘out degree’.

2. Flights Degree: For a given airport the number of flights in + the number of flights out.
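To make the target concrete, here’s the same calculation sketched in plain Ruby over a couple of the sample rows above, with no Hadoop involved. The hash-based tally is just our illustration of what the map/reduce job will compute at scale:

```ruby
# Each edge: [origin, destin, passengers, flights, month], as in the
# sample rows above.
edges = [
  %w[EUG RDM 41 22 199011],
  %w[MFR RDM 11  1 199003],
]

# Tally passenger and flight in/out degrees per (airport, month) key.
degrees = Hash.new { |h, k| h[k] = Hash.new(0) }
edges.each do |origin, destin, passengers, flights, month|
  degrees[[origin, month]][:passengers_out] += passengers.to_i
  degrees[[origin, month]][:flights_out]    += flights.to_i
  degrees[[destin, month]][:passengers_in]  += passengers.to_i
  degrees[[destin, month]][:flights_in]     += flights.to_i
end

# Emit one record per (airport, month) with the total degrees.
degrees.each do |(airport, month), d|
  puts [airport, month,
        d[:passengers_in] + d[:passengers_out],
        d[:flights_in]    + d[:flights_out]].join("\t")
end
```

This fits in memory only because the sample is tiny; the whole point of the Wukong version is that Hadoop shards exactly this tally across reducers.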

Let’s write the question wukong style:

#!/usr/bin/env ruby

require 'rubygems'
require 'wukong'

class EdgeMapper < Wukong::Streamer::RecordStreamer
  # Yield both ways so we can sum (passengers in + passengers out) and
  # (flights in + flights out) individually in the reduce phase.
  def process origin_code, destin_code, passengers, flights, month
    yield [origin_code, month, "OUT", passengers, flights]
    yield [destin_code, month, "IN",  passengers, flights]
  end
end

class DegreeCalculator < Wukong::Streamer::AccumulatingReducer
  # What are we going to use as a key internally?
  def get_key airport, month, in_or_out, passengers, flights
    [airport, month]
  end

  def start! airport, month, in_or_out, passengers, flights
    @out_degree = {:passengers => 0, :flights => 0}
    @in_degree  = {:passengers => 0, :flights => 0}
  end

  def accumulate airport, month, in_or_out, passengers, flights
    case in_or_out
    when "IN" then
      @in_degree[:passengers]  += passengers.to_i
      @in_degree[:flights]     += flights.to_i
    when "OUT" then
      @out_degree[:passengers] += passengers.to_i
      @out_degree[:flights]    += flights.to_i
    end
  end

  # For every airport and month, calculate passenger and flight degrees
  def finalize
    # Passenger degrees (out, in, and total)
    passengers_out   = @out_degree[:passengers]
    passengers_in    = @in_degree[:passengers]
    passengers_total = passengers_in + passengers_out

    # Flight degrees (out, in, and total)
    flights_out   = @out_degree[:flights]
    flights_in    = @in_degree[:flights]
    flights_total = flights_in + flights_out

    yield [key, passengers_in, passengers_out, passengers_total, flights_in, flights_out, flights_total]
  end
end

# Need to use 2 fields for partition so every record with the same airport
# and month lands on the same reducer
Wukong::Script.new(
  EdgeMapper,
  DegreeCalculator,
  :partition_fields => 2 # use two fields to partition records
).run

Don’t panic. There’s a lot going on in this script so here’s the breakdown (real gentle like):


The mapper

Here we’re using Wukong’s RecordStreamer class, which reads lines from stdin and splits on tabs for us already. That’s how we know exactly what arguments the process method gets.

Next, as is often the case with low-level map-reduce, we’ve got to be a bit clever in the way we yield data in the map. Here we yield the edge both ways and attach an extra piece of information (“OUT” or “IN”) depending on whether the passengers and flights were going into the airport that month or out of it. This way we can distinguish between these two pieces of data in the reducer and process them independently.
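Concretely, a single edge fans out into two map records, one per endpoint. Here’s that logic as a bare method (map_edge is our name for illustration; in the script it lives in the process method):

```ruby
# Each edge record is emitted once keyed by its origin ("OUT") and
# once keyed by its destination ("IN").
def map_edge origin, destin, passengers, flights, month
  [[origin, month, "OUT", passengers, flights],
   [destin, month, "IN",  passengers, flights]]
end

# One sample edge from the data becomes two tab-separated records:
map_edge("EUG", "RDM", "41", "22", "199011").each do |record|
  puts record.join("\t")
end
```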

Finally, we’ve carefully rearranged our records so that (airport, month) is always the first two fields. We’ll partition on these as the key (we have to say that explicitly at the bottom of the script).


The reducer

We’ve seen all these methods before except for one. The reducer needs to know which fields to use as the key (it defaults to the first field). Here we’ve explicitly told it to use the airport and month as the key with the ‘get_key’ method.

* start! – Here we initialize the internal state of the reducer with two Ruby hashes. One, @out_degree, will count up all the passengers and flights out; the other, @in_degree, will do the same for passengers and flights in. (Let’s all take a moment and think about how awful and unreadable that would be in Java…)

* accumulate – Here we simply look at each record and decide which counters to increment depending on whether it’s “OUT” or “IN”.

* finalize – All we’re doing here is taking our accumulated counts, creating the record we care about, and yielding it out. Remember, the ‘key’ is just (airport,month).

Get An Answer

We know how to put the data on the HDFS and run the script by now, so we’ll skip that part. Here’s what the output looks like:

$: hdp-catd /data/domestic/flights/degree_distribution | head -n20 | wu-lign
1B1 200906 1 1   2   1 1  2
ABE 200705 0 83  83  0 3  3
ABE 199206 0 31  31  0 1  1
ABE 200708 0 904 904 0 20 20
ABE 200307 0 91  91  0 2  2
ABE 200703 0 36  36  0 1  1
ABE 199902 0 84  84  0 1  1
ABE 200611 0 753 753 0 18 18
ABE 199209 0 99  99  0 1  1
ABE 200702 0 54  54  0 1  1
ABE 200407 0 98  98  0 1  1
ABE 200705 0 647 647 0 15 15
ABE 200306 0 27  27  0 1  1
ABE 200703 0 473 473 0 11 11
ABE 200309 0 150 150 0 1  1
ABE 200702 0 313 313 0 8  8
ABE 200103 0 0   0   0 1  1
ABE 199807 0 105 105 0 1  1
ABE 199907 0 91  91  0 1  1
ABE 199501 0 50  50  0 1  1

This is the point where you might bring the output back down to your local file system, crack open a program like R, make some plots, and so on.

And we’re done for now. Hurray.

Data Visualization Prevents Curation Bias in Social Media

A lot of social media analysts are predicting that curation will help solve the issue of social media overload. Curation has been touted as “the chosen” social media buzzword du jour and the new form of search that will prove more useful than Google’s spammed result pages. Rather than paying attention to just anyone and everyone, we will defer to the nine percent of people who actively search for content, and  listen to them on networks like Twitter or Quora.

But this paints a very different perception of reality. After all, we will be listening to a very select set of sources and filtering out the inconvenient users of social media who may just so happen to disagree with us. We then listen to these same sources over and over. What happens when we encounter someone who either contradicts our life paradigm or is simply too unfamiliar with our priorities to even make conversation?

Visualizing social media data allows us to make sense of massive amounts of raw data in a very clear way. Rather than relying on someone to sift through the noise to find the useful nuggets of information, data visualization gives us a holistic view so that we can make sense of a lot of information within seconds. It also prevents us from shielding our eyes to the inconvenient truths provided by those who just so happen to be outside our social streams.

Rio Akasaka, a first-year Master’s student in Human-Computer Interaction at Stanford and an Infochimps user, created a good use case of how data visualization can help us make sense of what occurs via social media. Rio first downloaded an Infochimps data set of tweets pertaining to the Haiti earthquake that occurred a year ago. Using the Google Maps API, he plotted these tweets on a map to show when they occurred and where they came from.


You can actually see this data visualization in action here and learn more about how Rio created it here.

How would it alter someone’s perception to see only curated stories about the Haiti earthquake or the aftermath of the Gabrielle Giffords shooting, versus the bird’s-eye view Rio’s visualization provides?

Train to be a Hadoop Jedi Master for $50 at Data Day Austin

At Infochimps, we believe increasing the number of people familiar with handling and making sense of big data is good for the web community as a whole.  That’s why we are happy to contribute our expertise to Data Day Austin, an event put together by Lynn Bender at GeekAustin and our friends at Riptano.

Data Day Austin includes both basic and advanced training in Hadoop as well as Cassandra. It takes place on Saturday, January 29, 2011 at the Norris Conference Center. The speaker list is as follows:

Introduction to Cassandra for Java Developers
Nate McCall – Software Developer, Riptano

I Know Where You Are: an introduction to working with location data.
Sandeep Parikh
– Principal, Robotten Labs
Shaun Dubuque – Co-founder, Argia, Inc
Thinking of developing location-based apps? Sandeep and Shaun show you sources for location data and strategies for managing it.

Additional presentations and workshops to be announced shortly.

Hadoop Deep Dive includes:

It’s common to pay a few thousand dollars for a day of Hadoop training. We have Austin’s top Hadoop talent teaming up to give you a day of instruction as part of Data Day Austin. These are not mere presentations: if you so desire, you can leave Data Day Austin with a working knowledge of Hadoop.

Hadoop Introduction and Tutorial
Steve Watt (blog) – IBM Big Data Lead, IBM Software Strategy
This introduction includes MapReduce, the Hadoop Distributed File System, and the Hadoop ecosystem.

Higher Order Languages for Hadoop I – Wukong
Flip Kromer Founder and CTO, Infochimps
Wukong allows you to treat your dataset like:
* a stream of lines when it’s efficient to process by lines
* a stream of field arrays when it’s efficient to deal directly with fields
* a stream of lightweight objects when it’s efficient to deal with objects
No one knows more about Wukong than Flip Kromer.

Higher Order Languages for Hadoop II – Pig
Jacob Perkins – Hadoop Engineer, Infochimps
Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop’s simple scalability and reliability.

Web Crawling and Data Gathering with Apache Nutch
Steve Watt (blog) – IBM Big Data Lead, IBM Software Strategy
The first phase of any analytics pipeline is finding and loading the data. Apache Nutch is a Hadoop-based web crawler and an excellent tool for pulling down content from the web and loading it into the HDFS, making it available for Hadoop analytics. This session will teach you how to install and configure Nutch, how to use it to crawl and gather targeted content from the web, and how to fine-tune your crawls through the Nutch API.

Hadoop Analytics for the Business Professional
(BigSheets demonstration with multiple analytic scenarios)
Instructor to be announced shortly

Additional workshops/presentations to be announced…

Be sure to register soon as there is currently Early Bird pricing.  For comments, questions, or sponsorship opportunities, contact

Sharing the Love

Data visualizations are like houses and neighborhoods, monuments even, built on the foundation that Infochimps is laying with our big data gathering and processing. We love it when people do really cool things with the information that we have on our site, and we just wanted to share a recent example with you. One of our users, Kennedy Elliott (@kennelliott), found subway trend data on our site and used it to make a really cool holiday greeting card that she sent to us. :)

Welcome Kurt Bollacker to the Infochimps Team!


I’m excited to announce that Kurt Bollacker is joining the team here at Infochimps as our consulting Data Scientist. I recently had the pleasure of working with Kurt on our analysis of Republican and Democrat words project, so I’m looking forward to working with him more on some awesome projects.

Kurt is also the Digital Research Director at the Long Now Foundation, a much-respected group in San Francisco focused on long-term policy and thinking; Brian Eno, Esther Dyson, and Stewart Brand are amongst its board members. Previously, Kurt was the Chief Scientist at Metaweb Technologies, which was acquired by Google this year.

Kurt and Flip first met at our first Data Cluster meetup in Austin at SXSW in 2009. Since then, they’ve continued to discuss and collaborate over Wukong, our open source tool that allows data engineers to write Ruby scripts to run big data processing on Hadoop. Kurt received his Ph.D. in Computer Engineering from UT, so he’s no stranger to Austin.

Here at Infochimps, Kurt will be joining our growing data team to design and spec our data pipeline. We ingest lots of data from many different sources, including web pages, regular FTP uploads by suppliers, and APIs, and that data then needs to go through a chain of processes before it is ready for distribution on our site or API. At Metaweb, Kurt had this exact experience in shipping large amounts of data around with Freebase, so we look forward to his vast expertise accelerating our efforts in this area.

Kurt, welcome to the team; we’re all excited to have you on board.