Data Mine

A Marketer Learns How to Program

Okay, I have a confession to make. Though I’ve helped write example queries, created wireframes for webpages and heck, even toyed around with our API Explorer, I’ve never written a single line of code. Not terribly surprising for a marketer; however, after hanging out with the chimps for the past six months, I felt like it was finally about time to learn something. But where to start? Would I really have the time to commit to a regular weekly class? Could I make myself curl up with an O’Reilly book on PHP or Ruby with other more pressing projects looming? Clearly, the only way I could make this a priority was for it to be fun!


Meet Codecademy, which I’ve been nerding out on all morning. In fact, in the last hour, I’ve learned how to define variables and started working with strings, substrings, arrays and if/else. The best part of the whole thing? It’s been absolutely delightful. Much of this comes from the amazingly simple layout, which takes students through progressive lessons on the basics of programming and JavaScript in a command line setting. The initial concepts are easy to grok and they build beautifully on each other, until suddenly, whoa – did I just define variables, prompt a user for input and create different returns based on user input?

While this might not sound like much to the hardcore programmers out there, for a marketer who had never written a lick of code in her life as of an hour ago, Codecademy is a pretty rad resource. Share it with your programming-novice friends and see what they think. Or, contribute to their lesson plans and get more folks coding!

Outfielders: Step Back from the Centroid

My little league coaches drilled it into us outfielders: “It’s harder to run backward than forward, so stand where you think the batter will probably hit, and then take a few large steps back.” (Full disclosure: my little league career wasn’t so illustrious.)

A reader of my last blog post about Ichiro’s hit locations seems to disagree with my coaches:

“[Based on Ichiro's hit locations], this says to me that the traditional baseball positions are relatively optimal in terms of covering the field…”

To rehash, the June post, Clustering Baseball Data with Weka, gave an example of applying k-means clustering to Suzuki Ichiro’s 2006 x-y hit locations.


Left: The unprocessed hit locations—with outliers like home runs removed.
Right: The same set divided into six clusters using the k-means algorithm and the Euclidean distance. Centroids are red.

(more…)

Mapping Tools for Developers



This is a great time to be a geodeveloper. There are more spatial data, geo-processing tools, geo-enabled storage options and mapping tools than ever.

Let’s start with storage – not too long ago geo developers had two choices: file formats or proprietary object-relational databases. Today there are production-ready open source object-relational databases such as PostgreSQL/PostGIS and MySQL; even mobile devices have lightweight databases with spatial capabilities such as SQLite. In addition to traditional object-relational databases, NoSQL databases such as Cassandra, CouchDB, and MongoDB have spatial capabilities. Bigtable clones such as HBase can also store spatial data, and there is ongoing work on a spatial index to facilitate spatial queries and operations. Neo4j is a graph database that also handles spatial data. Finally, even full text search engines such as ElasticSearch provide geospatial search capabilities.

(more…)

Getting Started with the Geo API: Sample Queries

Yesterday, we launched our new Geo API, creating easy access to millions of geographic data points, organized into one consistent, unified schema. This new API allows you to ask questions of geo data the way you want; you can query by latitude/longitude/radius, zoom level, bounding box, quadkey and more!

Sounds great, but how do you get started?  We’ve pulled together some Example Queries to help our developers get started with the Geo API.  These sample queries are meant to help developers quickly and easily understand the structure of our API calls, so they can get to the fun part – getting answers to tough questions, building awesome visualizations & applications and having fun with our data!

Here are two examples of how to query our Geo API.  For more, please check out Example Queries and sign up for a free API key today!

Wikipedia articles

I’m building a day-tripper’s travel application. What are good sightseeing locales (based on places with Wikipedia articles) within a one hour drive (approximately 100km) of 105 E 5th St, Austin, TX 78701?

http://api.infochimps.com/encyclopedic/dbpedia/articles/search?g.latitude=30.2669444&g.longitude=-97.7427778&g.radius=100000&apikey=[YOUR API KEY HERE]

Geonames
For my aforementioned day-tripper travel application, I’m also interested in creating a kids’ section and the first thing I’d like to include are amusement parks near the Infochimps office in downtown Austin. (This example uses zoom level instead of radius.)

http://api.infochimps.com/geo/location/geonames/places/search?&g.latitude=30.273054&g.longitude=-97.757598&g.zoom_level=6&f._type=business.amusement_park&apikey=[YOUR API KEY HERE]
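If you would rather assemble these URLs in code than by hand, here is a minimal Ruby sketch. The host, paths and parameter names are copied from the two sample queries above; the helper name `geo_query_url` and the `YOUR_API_KEY` placeholder are made up for illustration, and the script only builds the URL strings without sending any request.

```ruby
require 'uri'

# Host taken from the sample queries above.
API_HOST = 'http://api.infochimps.com'

# Hypothetical helper: joins a dataset path and a hash of query
# parameters into a request URL, appending the API key last.
def geo_query_url(path, params, apikey)
  query = URI.encode_www_form(params.merge('apikey' => apikey))
  "#{API_HOST}/#{path}?#{query}"
end

# Wikipedia articles within 100 km (radius is given in meters).
wikipedia_url = geo_query_url(
  'encyclopedic/dbpedia/articles/search',
  { 'g.latitude' => '30.2669444', 'g.longitude' => '-97.7427778', 'g.radius' => '100000' },
  'YOUR_API_KEY'
)

# Amusement parks, using zoom level instead of radius.
parks_url = geo_query_url(
  'geo/location/geonames/places/search',
  { 'g.latitude' => '30.273054', 'g.longitude' => '-97.757598',
    'g.zoom_level' => '6', 'f._type' => 'business.amusement_park' },
  'YOUR_API_KEY'
)

puts wikipedia_url
puts parks_url
```

Keeping the parameters in a hash makes it easy to swap radius for zoom level, or to add a filter, without retyping the whole query string.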

We’re continually improving our API documentation. Please let us know in the comments if there’s anything in particular you’d like to see next. Maybe it’s a quadkey calculator or code examples for Ruby and PHP or inclusion of a specific data set. We’re happy to get you what you want – just let us know what we can do!
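For the curious, here is a rough Ruby sketch of what a quadkey calculator might look like. It assumes the API's quadkeys follow the widely used Bing Maps tile convention (Web Mercator tiles, one base-4 digit per zoom level), which this post does not confirm, so treat it as an illustration rather than the API's definition. The function names are made up for this example.

```ruby
# Convert latitude/longitude (degrees) to Web Mercator tile coordinates
# at a given zoom level. ASSUMPTION: standard "slippy map" tiling.
def lat_lng_to_tile(lat, lng, zoom)
  n = 2**zoom
  lat_rad = lat * Math::PI / 180.0
  x = ((lng + 180.0) / 360.0 * n).floor
  y = ((1.0 - Math.log(Math.tan(lat_rad) + 1.0 / Math.cos(lat_rad)) / Math::PI) / 2.0 * n).floor
  [x, y]
end

# Interleave the bits of the tile x and y into a base-4 string,
# most significant zoom level first (Bing Maps convention).
def tile_to_quadkey(tile_x, tile_y, zoom)
  zoom.downto(1).map do |level|
    mask = 1 << (level - 1)
    digit = 0
    digit += 1 if (tile_x & mask) != 0
    digit += 2 if (tile_y & mask) != 0
    digit.to_s
  end.join
end

def quadkey_for(lat, lng, zoom)
  x, y = lat_lng_to_tile(lat, lng, zoom)
  tile_to_quadkey(x, y, zoom)
end

puts tile_to_quadkey(3, 5, 3)   # tile (3, 5) at zoom 3
puts quadkey_for(0.0, 0.0, 1)   # the tile just south-east of the origin
```

Each extra digit splits the parent tile into four, so a quadkey's length is its zoom level and a shorter quadkey is always a prefix of the quadkeys of the tiles it contains.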

Clustering Baseball Data with Weka

This is a guest blog post from Peter Hauck, who works as a data analyst at Google.  His experience includes employee compensation optimization and dynamic pricing of live event tickets. He is a graduate of Cornell University with a B.A. in Mathematics and Physics and an M. Eng in Applied Physics.

Greetings sports fans and data nerds! Since 2004, Major League Baseball has published (x,y) “hit locations” of every at bat and for years, Sabermetric and actuarial analysts have turned this and other data into predictions of where individual sluggers will hit in the future. In hopes of optimally positioning players in the field, professional teams and sports commentators pay handsomely for this kind of forecasting.  The models I’ve seen employ data binning and statistical & probabilistic models to get these results.

In a twist, using the GUI software Weka, I applied k-means clustering to find patterns in single-season hits record holder Ichiro Suzuki’s (x,y) hit locations from 2006. For readers not familiar, clustering is a computational method of splitting a dataset into neighborhoods of similar points. Unlike most clustering work, using Weka avoids real programming; once the data was loaded into Weka, the computation required about ten mouse clicks. I think of this as a semi-scientific, exploratory method that offers quick insight and often reliable conclusions.
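For readers who want to see the idea in code rather than mouse clicks, here is a small Ruby sketch of plain k-means (Lloyd's algorithm) on (x,y) points. This is not Weka's code and the six points are invented for illustration, not Ichiro's hit data; the sketch seeds the centroids with the first k points so that the run is repeatable.

```ruby
# Euclidean distance between two [x, y] points.
def distance(a, b)
  Math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)
end

# Mean position of a non-empty list of points.
def centroid(points)
  n = points.size.to_f
  [points.sum { |p| p[0] } / n, points.sum { |p| p[1] } / n]
end

# Plain k-means: assign every point to its nearest centroid, move each
# centroid to the mean of its points, and repeat until nothing moves.
def kmeans(points, k, max_iterations = 100)
  centroids = points.first(k) # deterministic seeding, for illustration only
  max_iterations.times do
    clusters = points.group_by { |p| (0...k).min_by { |i| distance(p, centroids[i]) } }
    updated = (0...k).map { |i| clusters[i] ? centroid(clusters[i]) : centroids[i] }
    break if updated == centroids
    centroids = updated
  end
  centroids
end

# Made-up points forming two obvious groups.
hits = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [9.0, 8.0], [2.0, 1.0], [8.0, 9.0]]
centers = kmeans(hits, 2)
centers.each { |c| puts c.inspect }
```

Real tools usually seed the centroids randomly and may restart several times, so results on real data can vary from run to run.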
(more…)

Simple Flume Decorators with JRuby

Flume is a framework for moving chunks of data around on your network. Its primary mission is to move log data from where it is generated (perhaps a web server) to someplace where it can actually be used – like an HDFS file system where it can be crunched by Hadoop. Flume’s design is very flexible – the final destination for your data could also be a database like HBase or Cassandra, a search index system like Elastic Search, another file system like an S3 bucket, or any of a myriad of other configurations. Flume will also go to some efforts to make sure that your data is delivered reliably – it includes some tunable reliability features out of the box.

The Flume User Guide does a good job of explaining how its component architecture works. You can configure data flows by chaining together systems of “nodes” – each node is a data moving unit – each has an input (“source”) and an output (“sink”). Sources can conceivably be anything that produces data – Flume can tail sets of files on disk, listen to network ports, periodically run programs, etc. Sinks are a little more interesting – they can write data to disk, push data into a network connection, or into a database. Even more interesting, sinks can be composites – you can fan data out to many other sinks, or set up conditional sinks where if data fails to be accepted by the first sink, it will instead be routed to a second sink. Also, you can build “Decorators” that can modify the data as it moves down the path. Flume offers many sources, sinks, and decorators out of the box – but also gives you the ability to write your own through a Java-based plugin API.

Flume chops the data up into a series of “events”. For instance, if you are using flume to tail a log file, every line written to the file gets turned into a flume event. Each event carries with it the body of the data that produced it, as well as some meta-data: the machine that it was collected on, a time-stamp showing when the data was collected, the event’s priority, etc. You can even add your own key-value pairs to an event in its attributes. Flume sinks can store both the body data and the metadata of an event or in some cases, use the metadata to help ensure that the data lands in the right place – like with the “collectorSink” file bucketing mechanism.

To me, and to some of the other primates I work with at Infochimps, decorators are especially interesting. In a decorator, you get to do some processing on your data as it flies from wherever it was produced to its final destination(s). Flume comes with a handful of basic decorators that will allow you to do some small scale processing of flume events. For instance, the “value” decorator lets you set a value in the metadata of an event. The out-of-the-box decorators are not quite sufficient to handle my processing demands. I wanted a little more flexibility, so I wrote (in less time than it took me to write this blog entry) a quick interface to jRuby. Now I have access to my flume data in transit with a powerful scripting engine.
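To make the decorator idea concrete before touching Flume itself, here is a toy model in plain Ruby. None of these classes are Flume's; `ToyEvent`, `CollectingSink` and `AttributeDecorator` are invented stand-ins that only mimic the shape described above: a sink accepts events, and a decorator wraps a sink and changes each event on its way through.

```ruby
# Toy stand-in for an event: a body plus some metadata.
ToyEvent = Struct.new(:body, :host, :timestamp, :attrs)

# A sink accepts events; this one just collects them in memory.
class CollectingSink
  attr_reader :events

  def initialize
    @events = []
  end

  def append(event)
    @events << event
  end
end

# A decorator wraps another sink. This one sets a key-value pair in the
# event's attributes and hands the new event to the wrapped sink.
class AttributeDecorator
  def initialize(sink, key, value)
    @sink = sink
    @key = key
    @value = value
  end

  def append(event)
    tagged = ToyEvent.new(event.body, event.host, event.timestamp,
                          event.attrs.merge(@key => @value))
    @sink.append(tagged)
  end
end

sink = CollectingSink.new
pipeline = AttributeDecorator.new(sink, 'source', 'weblog')
pipeline.append(ToyEvent.new('GET /index.html', 'web01', Time.now.to_i, {}))
puts sink.events.first.attrs.inspect
```

Because a decorator exposes the same `append` method as a sink, decorators can be stacked on top of each other without the source knowing the difference.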

Enough with the intro – let’s jump in. The following steps will lead you down the road to processing data on the fly through the flume with a jRuby script:

1. Install flume. Cloudera has good documentation on setting up Flume. I run Ubuntu, so I just added the cloudera apt package repository to my apt sources, and used “apt-get” to install the packages flume, flume-node and flume-master.

2. Get jRuby. If you use apt-get, you will be getting a slightly out-of-date version, but it will do for the moment. The jRuby website has more details if you need them.

3. Get my jRubyPlugin. For it to work, you have to have it and jruby.jar in Flume’s classpath. You can make custom adjustments to Flume’s runtime environment, including Flume classpath changes, in the flume-env.sh script in flume/bin. The easy way is to just drop jruby-flume.jar in the /usr/lib/flume/lib directory (or wherever flume landed in your install process). Getting your jRuby environment completely set up so that you can see jruby gems and stuff is going to involve making adjustments to your environment, but for now, just having the jruby.jar on the classpath will work. I just created a symbolic link to /usr/lib/jruby/lib/jruby.jar in /usr/lib/flume/lib.

( Aside: I don’t know the full answer to getting everything jruby set up in an embedded mode. However, if you add the following to your flume-env.sh script, you will be at least part of the way there:
export UOPTS="-Djruby.home=/usr/lib/jruby -Djruby.lib=/usr/lib/jruby/lib -Djruby.script=jruby"
)

4. You have to tell flume explicitly what classes to load as plugins when the nodes start up. To do this, create or edit “flume-site.xml” in the flume/conf directory. It should contain at least the following:
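<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>flume.plugin.classes</name>
    <value>com.infochimps.flume.jruby.JRubyDecorator</value>
    <description>List of plugin classes to load.</description>
  </property>
</configuration>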

After you get this in place, restart your flume-master and flume-node. If everything went ok, the services will start up, and you can go to http://localhost:35871/masterext.jsp to see if the plugin loaded successfully. If you see “jRubyDecorator” listed under the decorators, you are in business.

5. Ok, now let’s build a decorator that does something simple. Create a directory somewhere to keep ruby scripts for flume – I like /usr/lib/flume/scripts. The files in this directory need to be readable by the user that flume is running as. Also, if you are in a distributed world, scripts are going to have to be available both on the master and on the machine that will house the logical node that will run the script.

Here is a simple script. Put it in /usr/lib/flume/scripts/reverse.rb:
# reverse.rb -- jRubyDecorator script
require 'java'
java_import 'com.cloudera.flume.core.EventSinkDecorator'
java_import 'com.cloudera.flume.core.Event'
java_import 'com.cloudera.flume.core.EventImpl'

class ReverseDecorator < EventSinkDecorator
  def append(e)
    body = String.from_java_bytes e.getBody
    super EventImpl.new( body.reverse.to_java_bytes, e.getTimestamp, e.getPriority, e.getNanos, e.getHost, e.getAttrs )
  end
end

ReverseDecorator.new(nil)

What does it do? Well, it defines a subclass of com.cloudera.flume.core.EventSinkDecorator which redefines the append method. Our special append method builds a new event from an appended event, except that the text of the "body" field is reversed. Not too much nonsense, but we do have to be a little careful with strings. Flume likes its data to be represented as arrays of bytes, but ruby would prefer to deal with strings as Strings, so I convert both ways: String.from_java_bytes() to get a string object, and the to_java_bytes() method on string-like objects to convert back. Hidden in there, is the ruby string method "reverse".

The last line of the append method shows off some of the power of jRuby. It creates a new instance of EventImpl and passes it off to EventSinkDecorator's implementation of append - basically letting the parent class handle all of the difficult work.

Finally, the last line of the script instantiates a new object of the (jRuby!) ReverseDecorator class and returns it to jRubyDecorator. jRubyDecorator is really a factory class for producing decorator instances. It passes off our stuff as a java object, and flume never suspects what has happened.

Does it work? Let’s see:
chris@basqueseed:~$ flume shell -c localhost
2011-02-23 17:35:03,785 [main] INFO conf.FlumeConfiguration: Loading configurations from /etc/flume/conf
Using default admin port: 35873
Using default report port: 45678
Connecting to Flume master localhost:35873:45678...
2011-02-23 17:35:03,993 [main] INFO util.AdminRPCThrift: Connected to master at localhost:35873
==================================================
FlumeShell v0.9.3-CDH3B4
Copyright (c) Cloudera 2010, All Rights Reserved
==================================================
Type a command to execute (hint: many commands
only work when you are connected to a master node)

You may connect to a master node by typing:
connect host[:adminport=35873[:reportport=45678]]

[flume localhost:35873:45678] exec config basqueseed console '{jRubyDecorator("/usr/lib/flume/scripts/reverse.rb")=>console}'
[id: 0] Execing command : config
Command succeeded
[flume localhost:35873:45678] quit
So far, so good – the master node has decided that everything is kosher. By the way, be careful with the single and double quotes in flume shell commands. The flume shell is very picky about its input. If you have any structure to your sources or sinks, you must single quote the declaration. Let’s now play with a node:
chris@basqueseed:~$ sudo /etc/init.d/flume-node stop
[sudo] password for chris:
Stopping Flume node daemon (flume-node): stopping node

chris@basqueseed:~$ sudo -u flume flume node_nowatch
2011-02-23 17:38:21,709 [main] INFO agent.FlumeNode: Flume 0.9.3-CDH3B4
2011-02-23 17:38:21,710 [main] INFO agent.FlumeNode: rev 822c62f0c13ab76921e96dd92e19f68007dbcbe2
2011-02-23 17:38:21,710 [main] INFO agent.FlumeNode: Compiled on Mon Feb 21 13:01:39 PST 2011
…{stuff deleted}…
2011-02-23 17:39:40,471 [logicalNode basqueseed-20] INFO console.JLineStdinSource: Opening stdin source
?was I tam a ti saW
2011-02-23 17:39:45,720 [logicalNode basqueseed-20] INFO debug.ConsoleEventSink: ConsoleEventSink( debug ) opened
basqueseed [INFO Wed Feb 23 17:39:45 CST 2011] Was it a mat I saw?
Holy chute! It works!

I think that is enough for today. Next time, I’ll try some more complicated scripts, deal with attributes and play some games with data flows.

Intro to Wukong, a Ruby Framework for Hadoop

As Flip Kromer was quoted at the Strata Conference, “Java has many many virtues, but joy is not one of them”. A lot of developers might not think they can use Hadoop simply because they never learned or refuse to use Java.

Wukong allows you to leverage the agility and ease of use of Ruby with Hadoop. The same program you write on your machine can be deployed to the cloud.

In this video at Data Day Austin, Infochimps CTO Flip Kromer walks through how you can get started with Wukong.

Many thanks to Lynn Bender at GeekAustin for filming, and DataStax for sponsoring. You can find more videos from Data Day at this Blip Channel.

Installing Wukong

You can find more posts on big data, Hadoop, Pig, and Wukong at my personal blog, Data Recipes.
–JACOB PERKINS
(The Data Chef)

Wukong is hands down the simplest (and probably the most fun) tool to use with hadoop. It especially excels at the following use case:

You’ve got a huge amount of data (let that be whatever size you think is
huge). You want to perform a simple operation on each record. For
example, parsing out fields with a regular expression, adding two fields
together, stuffing those records into a data store, etc etc. These are
called map only jobs. They do NOT require a reduce. Can you imagine
writing a java map reduce program to add two fields together? Wukong
gives you all the power of ruby backed by all the power (and
parallelism) of hadoop streaming. Before we get into examples, and there
will be plenty, let’s make sure you’ve got wukong installed and running
locally.

Installing Wukong

First and foremost you’ve got to have ruby installed and running on your
machine. Most of the time you already have it. Try checking the version
in a terminal:

$: ruby --version
ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]

If that fails then I bet google can help you get ruby installed on
whatever os you happen to be using.

Next is to make sure you’ve got rubygems installed

$: gem --version
1.3.7

Once again, google can help you get it installed if you don’t have it.

Wukong is a rubygem so we can just install it that way:

sudo gem install wukong
sudo gem install json
sudo gem install configliere

Notice we also installed a couple of other libraries to help us out (the
json gem and the configliere gem). If at any time you get weird errors
(LoadError: no such file to load -- somelibraryname), for instance for
the extlib gem, then you probably just need to gem install somelibraryname.

An example

Moving on. You should be ready to test out running wukong locally now.
Here’s the most minimal working wukong script I can come up with that
illustrates a map only wukong script:

#!/usr/bin/env ruby

require 'rubygems'
require 'wukong'

class LineMapper < Wukong::Streamer::LineStreamer
 def process line
   yield line
 end
end

Wukong::Script.new(LineMapper, nil).run

Save that into a file called wukong_test.rb and run it with the
following:

cat wukong_test.rb | ./wukong_test.rb --map

If everything works as expected then you should see exactly the contents
of your script dump onto your terminal. Let’s examine what’s actually
going on here.

Boilerplate ruby

First, we’re letting the interpreter know we want to use ruby with the
first line (somewhat obvious). Next, we’re including the libraries we
need.

The guts

Then we define a class in ruby for doing our map job called LineMapper.
This guy subclasses from the wukong LineStreamer class. All the
LineStreamer class does is simply read records from stdin and gives them as arguments to
the LineMapper’s process method. The process method then does nothing
more than yield the line back to the LineStreamer which emits the line
back to stdout.
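If it helps to see that read, process, emit loop spelled out, here is a simplified stand-in written in plain Ruby. It is not Wukong's actual source; `ToyLineStreamer` is a made-up class, and it reads from and writes to in-memory StringIO objects instead of stdin and stdout so that it can run anywhere.

```ruby
require 'stringio'

# Simplified stand-in for the loop described above: read each line,
# hand it to process, and write whatever process yields.
class ToyLineStreamer
  def initialize(input, output)
    @input = input
    @output = output
  end

  # Same shape as the LineMapper above: yield the line unchanged.
  def process(line)
    yield line
  end

  def stream
    @input.each_line do |line|
      process(line.chomp) { |record| @output.puts(record) }
    end
  end
end

input = StringIO.new("first line\nsecond line\n")
output = StringIO.new
ToyLineStreamer.new(input, output).stream
puts output.string
```

Swapping the StringIO objects for $stdin and $stdout gives the same pipe-friendly behavior that the cat command relies on.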

The runner

Finally, we have to let wukong know we intend to run our script. We
create a new script object with LineMapper as the mapper class and nil as the reducer class.

More succinctly, we’ve written our own cat program. When we ran the
above command we simply streamed our script, line by line, through the
program. Try streaming some real data through the program and adding
some more stuff to the process method. Perhaps parsing the line with a
regular expression and yielding numbers? Yielding words? Yielding
characters? The choice is yours. Have fun with it.
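As one possible starting point, here is a sketch of a process method that pulls the numbers out of a line with a regular expression. To keep it runnable without Hadoop or the wukong gem, the class below is a plain Ruby stand-in (the name `NumberMapper` is made up); in a real script the same method body would live in a subclass of Wukong::Streamer::LineStreamer like the one above.

```ruby
# Plain Ruby stand-in for a mapper: only the process method matters here.
class NumberMapper
  # Yield every run of digits in the line as an integer.
  def process(line)
    line.scan(/\d+/) { |digits| yield digits.to_i }
  end
end

found = []
NumberMapper.new.process('MFR RDM 11 1 199003') { |n| found << n }
puts found.inspect
```

Calling yield more than once per line is fine: a mapper may emit zero, one or many records for each record it reads.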

Meatier examples to come.

Graph Processing With Wukong and Hadoop

As a last (for now) tutorial oriented post on Wukong, let’s process a network graph.

Get Data

This airport data (airport edges) from Infochimps is one such network graph with over 35 million edges. It represents the number of flights and passengers transported between two domestic airports in a given month. Go ahead and download it.

Explore Data

We’ve got to actually look at the data before we can make any decisions about how to process it and what questions we’d like answered:


$: head data/flights_with_colnames.tsv | wu-lign
origin_airport destin_airport passengers flights month
MHK AMW 21 1 200810
EUG RDM 41 22 199011
EUG RDM 88 19 199012
EUG RDM 11 4 199010
MFR RDM 0 1 199002
MFR RDM 11 1 199003
MFR RDM 2 4 199001
MFR RDM 7 1 199009
MFR RDM 7 2 199011

So it’s exactly what you’d expect: an adjacency list with (origin node,destination node,weight_1,weight_2,timestamp). There are thousands of data sets with similar characteristics…

Ask A Question

A simple question to ask (and probably the first question you should ask of a graph) is what the degree distribution is. Notice there are two flavors of degree in our graph:

1. Passenger Degree: For a given airport (node in the graph) the number of passengers in + the number of passengers out. Passengers in is called the ‘in degree’ and passengers out is (naturally) called the ‘out degree’.

2. Flights Degree: For a given airport the number of flights in + the number of flights out.
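Before scaling this up, it can help to work the definitions through on a handful of rows. The sketch below is plain Ruby, with no Hadoop or Wukong involved, and uses three of the sample rows from the data preview above; the variable names are just for illustration.

```ruby
# Three rows from the sample above:
# origin, destination, passengers, flights, month
edges = [
  %w[EUG RDM 41 22 199011],
  %w[EUG RDM 88 19 199012],
  %w[MFR RDM 7 2 199011]
]

# One counter set per (airport, month) pair.
degrees = Hash.new do |hash, key|
  hash[key] = { passengers_in: 0, passengers_out: 0, flights_in: 0, flights_out: 0 }
end

edges.each do |origin, destin, passengers, flights, month|
  # Each edge counts as "out" for its origin and "in" for its destination.
  degrees[[origin, month]][:passengers_out] += passengers.to_i
  degrees[[origin, month]][:flights_out]    += flights.to_i
  degrees[[destin, month]][:passengers_in]  += passengers.to_i
  degrees[[destin, month]][:flights_in]     += flights.to_i
end

rdm = degrees[['RDM', '199011']]
puts "RDM 199011: #{rdm[:passengers_in]} passengers in, #{rdm[:flights_in]} flights in"
```

This in-memory hash works for a few rows; the map-reduce version exists because the full graph will not fit comfortably in one process.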

Let’s write the question wukong style:


#!/usr/bin/env ruby

require 'rubygems'
require 'wukong'

class EdgeMapper < Wukong::Streamer::RecordStreamer
  #
  # Yield both ways so we can sum (passengers in + passengers out) and (flights
  # in + flights out) individually in the reduce phase.
  #
  def process origin_code, destin_code, passengers, flights, month
    yield [origin_code, month, "OUT", passengers, flights]
    yield [destin_code, month, "IN", passengers, flights]
  end
end

class DegreeCalculator < Wukong::Streamer::AccumulatingReducer
  #
  # What are we going to use as a key internally?
  #
  def get_key airport, month, in_or_out, passengers, flights
    [airport, month]
  end

  def start! airport, month, in_or_out, passengers, flights
    @out_degree = {:passengers => 0, :flights => 0}
    @in_degree = {:passengers => 0, :flights => 0}
  end

  def accumulate airport, month, in_or_out, passengers, flights
    case in_or_out
    when "IN" then
      @in_degree[:passengers] += passengers.to_i
      @in_degree[:flights] += flights.to_i
    when "OUT" then
      @out_degree[:passengers] += passengers.to_i
      @out_degree[:flights] += flights.to_i
    end
  end

  #
  # For every airport and month, calculate passenger and flight degrees
  #
  def finalize

    # Passenger degrees (out, in, and total)
    passengers_out = @out_degree[:passengers]
    passengers_in = @in_degree[:passengers]
    passengers_total = passengers_in + passengers_out

    # Flight degrees (out, in, and total)
    flights_out = @out_degree[:flights]
    flights_in = @in_degree[:flights]
    flights_total = flights_in + flights_out

    yield [key, passengers_in, passengers_out, passengers_total, flights_in, flights_out, flights_total]
  end
end

#
# Need to use 2 fields for partition so every record with the same airport and
# month lands on the same reducer
#
Wukong::Script.new(
  EdgeMapper,
  DegreeCalculator,
  :partition_fields => 2 # use two fields to partition records
).run

Don’t panic. There’s a lot going on in this script so here’s the breakdown (real gentle like):

Mapper

Here we’re using wukong’s RecordStreamer class which reads lines from $stdin and splits on tabs for us already. That’s how we know exactly what arguments the process method gets.

Next, as is often the case with low level map-reduce, we’ve got to be a bit clever in the way we yield data in the map. Here we yield the edge both ways and attach an extra piece of information (“OUT” or “IN”) depending on whether the passengers and flights were going into the airport in a month or out. This way we can distinguish between these two pieces of data in the reducer and process them independently.

Finally, we’ve carefully rearranged our records such that (airport,month) is always the first two fields. We’ll partition on this as the key. (We have to say that explicitly at the bottom of the script)

Reducer

We’ve seen all these methods before except for one. The reducer needs to know what fields to use as the key (it defaults to the first field). Here we’ve explicitly told it to use the airport and month as the key with the ‘get_key’ method.

* start! – Here we initialize the internal state of the reducer with two ruby hashes. One, the @out_degree will count up all the passengers and flights out. The @in_degree will do the same but for passengers and flights in. (Let’s all take a moment and think about how awful and unreadable that would be in java…)

* accumulate – Here we simply look at each record and decide which counters to increment depending on whether it’s “OUT” or “IN”.

* finalize – All we’re doing here is taking our accumulated counts, creating the record we care about, and yielding it out. Remember, the ‘key’ is just (airport,month).
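To see how those three methods fit together, here is a toy driver in plain Ruby. It is not Wukong's implementation; `ToyReducer` and its `run` method are invented to mimic the life cycle described above, under the assumption that records arrive already sorted by key (which is what Hadoop's shuffle provides). To stay short it only totals flights.

```ruby
# Toy model of the start!/accumulate/finalize life cycle.
class ToyReducer
  attr_reader :key

  def get_key(airport, month, *_rest)
    [airport, month]
  end

  def start!(*_record)
    @flights = 0
  end

  def accumulate(_airport, _month, _in_or_out, _passengers, flights)
    @flights += flights.to_i
  end

  def finalize
    yield [*key, @flights]
  end

  # Walk records sorted by key: whenever the key changes, finalize the
  # previous group and start a fresh one.
  def run(sorted_records)
    results = []
    sorted_records.each do |record|
      record_key = get_key(*record)
      if record_key != @key
        finalize { |out| results << out } unless @key.nil?
        @key = record_key
        start!(*record)
      end
      accumulate(*record)
    end
    finalize { |out| results << out } unless @key.nil?
    results
  end
end

# Mapper-style records built from two rows of the sample data.
records = [
  ['EUG', '199011', 'OUT', '41', '22'],
  ['RDM', '199011', 'IN', '41', '22'],
  ['RDM', '199011', 'IN', '7', '2']
]
totals = ToyReducer.new.run(records)
totals.each { |row| puts row.join("\t") }
```

The reducer never holds more than one group's counters at a time, which is why sorted input matters: a key that showed up again later would be treated as a new group.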

Get An Answer

We know how to put the data on the hdfs and run the script by now so we’ll skip that part. Here’s what the output looks like:


$: hdp-catd /data/domestic/flights/degree_distribution | head -n20 | wu-lign
1B1 200906 1 1 2 1 1 2
ABE 200705 0 83 83 0 3 3
ABE 199206 0 31 31 0 1 1
ABE 200708 0 904 904 0 20 20
ABE 200307 0 91 91 0 2 2
ABE 200703 0 36 36 0 1 1
ABE 199902 0 84 84 0 1 1
ABE 200611 0 753 753 0 18 18
ABE 199209 0 99 99 0 1 1
ABE 200702 0 54 54 0 1 1
ABE 200407 0 98 98 0 1 1
ABE 200705 0 647 647 0 15 15
ABE 200306 0 27 27 0 1 1
ABE 200703 0 473 473 0 11 11
ABE 200309 0 150 150 0 1 1
ABE 200702 0 313 313 0 8 8
ABE 200103 0 0 0 0 1 1
ABE 199807 0 105 105 0 1 1
ABE 199907 0 91 91 0 1 1
ABE 199501 0 50 50 0 1 1

This is the point where you might bring the results back down to your local file system, crack open a program like R, make some plots, etc.

And we’re done for now. Hurray.

Access the Infochimps Query API via commandline

A tutorial on how to use chimps to access the Infochimps Query API via commandline.

  1. Sign up for the API
  2. When you get your API key, create your chimps dotfile: nano ~/.chimps
  3. Put this in your dotfile:
    :query:
      :username: your_api_name
      :key:      your_api_key

  4. Install chimps: sudo gem install chimps. (make sure you have gemcutter as a source otherwise it won’t find the gem: gem sources -a http://gemcutter.org)
  5. Run a query! % chimps query soc/net/tw/influence screen_name=infochimps

It should return with something like this:

{"replies_out":13,"account_age":602,"statuses":166,"id":15748351,"replies_in":22,"screen_name":"infochimps"}
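Since the result is plain JSON, it is easy to pick apart in a script. Here is a minimal Ruby sketch that parses the sample response above using the standard json library; the response string is pasted in by hand rather than fetched, and the variable names are just for illustration.

```ruby
require 'json'

# The sample response shown above, pasted in as a string.
response = '{"replies_out":13,"account_age":602,"statuses":166,"id":15748351,"replies_in":22,"screen_name":"infochimps"}'

profile = JSON.parse(response)

puts "#{profile['screen_name']} has posted #{profile['statuses']} statuses"
puts "replies in: #{profile['replies_in']}, replies out: #{profile['replies_out']}"
```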

That’s it!