
The Data Era – Moving from 1.0 to 2.0

“This post is from Jim Kaskade, the newly appointed CEO of Infochimps. When we first met Jim, we were really impressed by him from multiple points of view. First, his initial questions about us were about our culture, something we pride ourselves on cultivating, and we would only want to work with an executive who shared that concern. Second, his understanding of the market and of technological solutions matched, and in some areas exceeded, our own. Third, Jim brings true leadership and CEO experience to the table, having been an executive at and led a number of startups after a career at Teradata. We are truly excited to have Jim aboard and look forward to working together for many years!”

-Flip Kromer, Dhruv Bansal, and Joseph Kelly, co-founders of Infochimps

Do you think they truly understood just how fast the data infrastructure marketplace was going to change?

That is the question that comes to mind when I think about Donald Feinberg and Mark Beyer at Gartner who, last year, wrote about how the data warehouse market is undergoing a transformation. Did they, or anyone for that matter, understand the significant change underway in the data center? I describe it as Big Data 1.0 versus Big Data 2.0.

Big Data 1.0


I was recently talking to friends at one of our largest banks about their Big Data projects under way. In less than one year, their Hadoop cluster has already far exceeded their Teradata enterprise data warehouse in size.

Is that a surprise? Not really. When you think about it, even a traditionally large data warehouse is measured in terabytes, not petabytes (well, unless you are eBay).

With the current “Enterprise Data Warehouse” (EDW) framework (shown here) we will always see the high-value structured data in the well-hardened, highly available and secure EDW RDBMS (aka Teradata).

In fact, Gartner defines a large EDW starting at 20TB. This is why I’ve held back from making comments like, “Teradata should be renamed to Yottadata.” After all, it is my “alma mater” after having spent 10 years learning Big Data 1.0 there. I highly respect the Teradata technology and more importantly the people.

Big Data 2.0

So with over two zettabytes of information being generated in 2012 alone, we can expect more “Big Data” systems to be stood up, new breakthroughs in large dataset analytics, and many more data-centric applications being developed for businesses.


However, many of the “new systems” will be driven by “Big Data 2.0” technology. The enterprise data warehouse framework itself doesn’t change much; however, many new players, mostly open source, have entered the scene.

Examples include:

  • Talend for ETL
  • Cloudera, Hortonworks, MapR for Hadoop
  • SymmetricDS for replication
  • HBase, Cassandra, Redis, Riak, Elasticsearch, etc. for NoSQL / NewSQL data stores
  • ’R’, Mahout, Weka, etc. for machine learning / analytics
  • Tableau, Jaspersoft, Pentaho, Datameer, Karmasphere, etc. for BI

There are so many new and disruptive technologies, each contributing to the evolution of the enterprise’s data infrastructure.

I haven’t yet mentioned one of the more controversial statements made in the adjacent graphic: Teradata is becoming a source alongside the new pool of unstructured data. Both the new and the old data are being aggregated into the “Big Data Warehouse”.

We may also be seeing much of what Hadoop does in ETL feeding back into the EDW. But I suspect that this will become less significant as compared to the new analytics architecture with Hadoop + NoSQL/NewSQL data stores at the core of the framework – especially as this new architecture becomes more hardened and enterprise class.

Infochimps’ Big Data Warehouse Framework


This leads us to why Infochimps is so well positioned to make a significant impact within the marketplace.

By leveraging four years of experience and technology development in cloud-based big data infrastructure, the company is now offering a suite of products that contribute to each part of the Big Data Warehouse Framework for enterprise customers.

DDS: With Infochimps’ Data Delivery Services (DDS), our customers’ application developers do not need to rely on sophisticated ETL tools. Instead, they can manipulate data streams of any volume or velocity through DDS using a simple, developer-friendly language referred to as Wukong. Wukong turns application developers into data scientists.

Ingress and egress can be handled directly by the application developer, uniquely bridging the gap between them and their data.

Wukong: Wukong is much more than a data-centric domain specific language (DSL). With standardized connectors to analytics from ‘R’, Mahout, Weka, and others, not only is data manipulation made easy, but so is the integration of sophisticated analytics with the most complicated data sources.

Hadoop & NoSQL/NewSQL Data Stores: At the center of the framework is not only an elastic, cloud-based Hadoop stack, but a selection of NoSQL/NewSQL data stores as well. This uniquely positions Infochimps to address both decision-support workloads, which are complex and batch in nature, and OLTP or more real-time workloads. The complexities of standing up, configuring, scaling, and managing these data stores are all automated.

Dashpot: The application developer is typically left out by many of the business intelligence tools offered today. This is because most tools are extremely powerful and built for special groups of business users and analysts. Infochimps has taken a slightly different approach, staying focused on the application developer. Dashpot is a reporting and analytics dashboard built for the developer, enabling quick iteration and insights into the data prior to production and prior to the deployment of more sophisticated BI tools.

Ironfan and Homebase: As the underpinning of the Infochimps solution, Ironfan and Homebase are the two solutions which essentially abstract any and all hardware and software deployment, configuration, and management. Ironfan is used to deploy the entire system into production. Homebase is used by application developers to create their end-to-end data flows and applications locally on their laptops or desktops before they are deployed into QA, staging, and/or production.

All-in-all Infochimps has taken a very innovative approach to enabling application developers with Big Data 2.0 technologies in a way that is not only comprehensive, but fast, simple, extensible, and safe.

Our vision for Infochimps leverages the power of Big Data, Cloud Computing, Open Source, and Platform as a Service – all extremely disruptive technology forces. We’re excited to be helping our customers address their mission critical questions, with high impact answers. And I personally look forward to executing on our vision to provide the simplest yet most powerful cloud-based and completely managed big data service for our enterprise customers.


Flip Kromer Interview at Strata NY 2011

Check out this interview with Flip Kromer, founder and CTO (yes, that’s a typo in the O’Reilly video) at Strata NY 2011.  He shares his thoughts on how to do data science on a shoestring, the tools of the trade and the future of DaaS (data as a service).

Intro to Wukong, a Ruby Framework for Hadoop

As Flip Kromer was quoted at the Strata Conference, “Java has many many virtues, but joy is not one of them.” A lot of developers might not think they can use Hadoop simply because they never learned Java or refuse to use it.

Wukong allows you to leverage the agility and ease of use of Ruby with Hadoop. The same program you write on your machine can be deployed to the cloud.

In this video at Data Day Austin, Infochimps CTO Flip Kromer walks through how you can get started with Wukong.

Many thanks to Lynn Bender at GeekAustin for filming, and DataStax for sponsoring. You can find more videos from Data Day at this Blip Channel.

Installing Wukong

You can find more posts on big data, Hadoop, Pig, and Wukong at my personal blog, Data Recipes (The Data Chef).

Wukong is hands down the simplest (and probably the most fun) tool to use with Hadoop. It especially excels at the following use case:

You’ve got a huge amount of data (let that be whatever size you think is huge). You want to perform a simple operation on each record: for example, parsing out fields with a regular expression, adding two fields together, stuffing those records into a data store, etc. These are called map-only jobs. They do NOT require a reduce. Can you imagine writing a Java MapReduce program to add two fields together? Wukong gives you all the power of Ruby backed by all the power (and parallelism) of Hadoop Streaming. Before we get into examples, and there will be plenty, let’s make sure you’ve got Wukong installed and running.

Installing Wukong

First and foremost, you’ve got to have Ruby installed and running on your machine. Most of the time you already have it. Try checking the version in a terminal:

$: ruby --version
ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]

If that fails, then I bet Google can help you get Ruby installed on whatever OS you happen to be using.

Next, make sure you’ve got RubyGems installed:

$: gem --version

Once again, Google can help you get it installed if you don’t have it.

Wukong is a rubygem, so we can just install it that way:

sudo gem install wukong
sudo gem install json
sudo gem install configliere

Notice we also installed a couple of other libraries to help us out (the json gem and the configliere gem). If at any time you get weird errors (LoadError: no such file to load -- somelibraryname) then you probably just need to gem install somelibraryname.

An example

Moving on. You should be ready to test out running Wukong locally now. Here’s the most minimal working example I can come up with that illustrates a map-only Wukong script:

#!/usr/bin/env ruby

require 'rubygems'
require 'wukong'

class LineMapper < Wukong::Streamer::LineStreamer
  def process line
    yield line
  end
end

Wukong::Script.new(LineMapper, nil).run

Save that into a file called wukong_test.rb, make it executable with chmod +x wukong_test.rb, and run it with the following:

cat wukong_test.rb | ./wukong_test.rb

If everything works as expected, then you should see exactly the contents of your script dumped onto your terminal. Let’s examine what’s actually going on here.

Boilerplate ruby

First, we’re letting the interpreter know we want to use Ruby with the first line (somewhat obvious). Next, we’re including the libraries we need with the two require statements.

The guts

Then we define a class in Ruby for doing our map job called LineMapper. This guy subclasses from Wukong’s LineStreamer class. All the LineStreamer class does is read records from stdin and hand them as arguments to the LineMapper’s process method. The process method then does nothing more than yield the line back to the LineStreamer, which emits the line back to stdout.

The runner

Finally, we have to let Wukong know we intend to run our script. We create a new Script object with LineMapper as the mapper class and nil as the reducer class.

More succinctly, we’ve written our own cat program. When we ran the above command, we simply streamed our script, line by line, through the program. Try streaming some real data through the program and adding some more stuff to the process method. Perhaps parsing the line with a regular expression and yielding numbers? Yielding words? Yielding characters? The choice is yours. Have fun with it.
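For instance, here’s a sketch of the “add two fields together” use case mentioned earlier. Note this snippet uses a tiny stand-in LineStreamer base class so it runs even without the wukong gem installed; in a real script you would subclass Wukong::Streamer::LineStreamer exactly as in wukong_test.rb. The tab-separated, two-column input format is an assumption for illustration.

```ruby
#!/usr/bin/env ruby

# Stand-in for Wukong::Streamer::LineStreamer, just so this sketch is
# self-contained: read each line from an input stream, pass it to
# process, and print whatever process yields.
class LineStreamer
  def run(io = $stdin)
    io.each_line { |line| process(line.chomp) { |record| puts record } }
  end
end

# A map-only job: parse two tab-separated numeric fields from each
# record and emit the original fields plus their sum.
class SumMapper < LineStreamer
  def process line
    a, b = line.split("\t")
    yield [a, b, a.to_i + b.to_i].join("\t")
  end
end
```

To wire this into a real Wukong run, you’d swap the stand-in base class for Wukong::Streamer::LineStreamer and launch it with Wukong::Script.new(SumMapper, nil).run, just like the LineMapper above; a record like 3 TAB 4 would come back out as 3 TAB 4 TAB 7.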

Meatier examples to come.