- January 25, 2011
You can find more posts on big data, Hadoop, Pig, and Wukong at my personal blog, Data Recipes.
(The Data Chef)
Wukong is hands down the simplest (and probably the most fun) tool to use with hadoop. It especially excels at the following use case:
You’ve got a huge amount of data (let that be whatever size you think is
huge). You want to perform a simple operation on each record. For
example, parsing out fields with a regular expression, adding two fields
together, stuffing those records into a data store, and so on. These are
called map-only jobs. They do NOT require a reduce. Can you imagine
writing a java map reduce program to add two fields together? Wukong
gives you all the power of ruby backed by all the power (and
parallelism) of hadoop streaming. Before we get into examples, and there
will be plenty, let’s make sure you’ve got wukong installed and running.
First and foremost you’ve got to have ruby installed and running on your
machine. Most of the time you already have it. Try checking the version
in a terminal:
$: ruby --version
ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]
If that fails then I bet google can help you get ruby installed on
whatever os you happen to be using.
Next is to make sure you’ve got rubygems installed:
$: gem --version
1.3.7
Once again, google can help you get it installed if you don’t have it.
Wukong is a rubygem so we can just install it that way:
sudo gem install wukong
sudo gem install json
sudo gem install configliere
Notice we also installed a couple of other libraries to help us out (the
json gem and the configliere gem). If at any time you get weird errors
(LoadError: no such file to load -- somelibraryname), for instance for
the extlib gem, which is not in the commands above, then you probably
just need to gem install somelibraryname.
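If you would rather check for missing libraries up front than wait for a LoadError, here is a minimal sketch in plain ruby. The missing_libraries helper is a hypothetical name made up for this post, not part of wukong or rubygems, and it assumes each gem is named the same as the library you require, which is not true for every gem.

```ruby
# Hypothetical helper (not part of wukong or rubygems): returns the names
# of the libraries that cannot be loaded on this machine.
def missing_libraries(names)
  names.reject do |name|
    begin
      require name
      true              # loaded fine, so drop it from the result
    rescue LoadError
      false             # could not be loaded, so keep it in the result
    end
  end
end

# Assumes the gem name matches the library name for each of these.
missing_libraries(%w[wukong json configliere]).each do |name|
  puts "missing #{name}, try: sudo gem install #{name}"
end
```

If nothing is printed, all three libraries loaded.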
Moving on. You should be ready to test out running wukong locally now.
Here’s the most minimal working wukong script I can come up with that
illustrates a map only wukong script:
#!/usr/bin/env ruby

require 'rubygems'
require 'wukong'

class LineMapper < Wukong::Streamer::LineStreamer
  def process line
    yield line
  end
end

Wukong::Script.new(LineMapper, nil).run
Save that into a file called wukong_test.rb, make it executable
(chmod +x wukong_test.rb), and run it with the following:
cat wukong_test.rb | ./wukong_test.rb --map
If everything works as expected then you should see exactly the contents
of your script dumped onto your terminal. Let’s examine what’s actually
going on here.
Boiler plate ruby
First, we’re letting the interpreter know we want to use ruby with the
first line (somewhat obvious). Next, we’re including the libraries we
need (rubygems and wukong).
Then we define a class in ruby for doing our map job called LineMapper.
This guy subclasses the wukong LineStreamer class. All the
LineStreamer class does is read records from stdin and hand them, one
at a time, to the LineMapper’s process method. The process method then
does nothing more than yield the line back to the LineStreamer, which
emits the line to stdout.
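To make that read, process, emit loop concrete, here is a rough sketch of the same idea in plain ruby with no wukong involved. TinyLineStreamer and TinyLineMapper are hypothetical stand-ins made up for this post; this is not how wukong actually implements LineStreamer, just the general shape of it.

```ruby
require 'stringio'

# Hypothetical stand-in for a line streamer. NOT wukong's implementation.
class TinyLineStreamer
  def initialize(input = $stdin, output = $stdout)
    @input  = input
    @output = output
  end

  # Subclasses override this. The default passes each line through unchanged.
  def process(line)
    yield line
  end

  # Read one record per line, hand it to process, and print whatever
  # process yields back.
  def stream
    @input.each_line do |raw|
      process(raw.chomp) { |record| @output.puts(record) }
    end
  end
end

# Same shape as the LineMapper in the post, minus wukong.
class TinyLineMapper < TinyLineStreamer
  def process(line)
    yield line
  end
end

# Demo with in-memory IO so it runs without piping anything in.
out = StringIO.new
TinyLineMapper.new(StringIO.new("first\nsecond\n"), out).stream
print out.string
```

Because process yields rather than returns, a single input line can produce zero, one, or many output records.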
Finally, we have to let wukong know we intend to run our script. We
create a new script object with LineMapper as the mapper class and nil
as the reducer class, then call run on it.
More succinctly, we’ve written our own cat program. When we ran the
above command we simply streamed our script, line by line, through the
program. Try streaming some real data through the program and adding
some more stuff to the process method. Perhaps parsing the line with a
regular expression and yielding numbers? Yielding words? Yielding
characters? The choice is yours. Have fun with it.
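As a starting point for those experiments, here is a small sketch of the parsing logic on its own, in plain ruby. The each_number and each_word helpers are hypothetical names made up for this post; inside a wukong script the body of either one would go in the process method, with yield emitting each record.

```ruby
# Hypothetical helper: yields every integer found in a line.
def each_number(line)
  line.scan(/-?\d+/) { |digits| yield digits.to_i }
end

# Hypothetical helper: yields each word in a line, lowercased.
def each_word(line)
  line.downcase.scan(/[a-z']+/) { |word| yield word }
end

numbers = []
each_number("ruby 1.8.7 patchlevel 249") { |n| numbers << n }
p numbers   # prints [1, 8, 7, 249]

words = []
each_word("ruby 1.8.7 patchlevel 249") { |w| words << w }
p words     # prints ["ruby", "patchlevel"]
```

Note that a version string like 1.8.7 comes out as three separate numbers, since the regular expression only matches runs of digits.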
Meatier examples to come.