Data Mine

Why Real-Time Analytics? [Free White Paper]

When you think Big Data, the first words that come to mind are often Hadoop and NoSQL, but what do these technologies actually mean for your business? Different Big Data technologies have different use cases where they work best. Real-time Big Data challenges often call for a very different class of tools.

In this free white paper, we’ll explore:

  • How to create a flexible architecture that allows you to use the best Big Data tools and technologies for the job at hand
  • Where Hadoop analysis and NoSQL databases work and where they can fall short
  • How Hadoop differs from real-time analytics and stream processing approaches
  • Visual representations of how real-time analytics works and real world use cases
  • How to leverage the Infochimps Platform to perform real-time analytics

How to Build a Hadoop Cluster in 20 Minutes

If you’ve ever tried your hand at manually provisioning, configuring and deploying a Hadoop cluster, you know that it can take days or weeks to create a fully functional system. With tools like Chef, this time can be cut down to a matter of hours or days (depending on the size of the cluster). In this video, Dhruv Bansal, Chief Science Officer of Infochimps, builds a Hadoop cluster in 20 minutes with Ironfan.

Ironfan is the foundation for your Big Data stack, making provisioning and configuring your Big Data infrastructure simple. Spin up clusters when you need them, kill them when you don’t, so you can spend your time, money, and engineering focus on finding insights, not getting your machines ready. To learn more about how Ironfan enables The Infochimps Platform, check out our white paper.

Foursquare Venues, Wikipedia Articles, Census Data and More… All With Just an IP Address!

Greetings from deep in the Data Mine here at Infochimps. This week the team rolled out new features that combine one of our most popular APIs with our Geo API platform, unlocking the ability to geolocate based on an IP Address with any of our Geo APIs.

The idea is based on one of our more popular mashups, our MaxMind GeoLite IP to Census API, which blends IP geolocation functionality with Census data. This allows you to find out not just where an IP address maps to, but also some high level information about that area – ideal for websites that do geotargeting and for people looking for a deeper understanding of their visitor audience. The data it draws on has become a bit dated, though (it uses the 2000 Census), and it covers a relatively narrow band of properties. Enter our Geo API platform, which offers richer and more current data from a variety of sources.

A great advantage of our new Geo API platform is our ability to perform two-step queries internally, essentially converting a parameter into another parameter behind the scenes. It’s the key technology behind our ability to geolocate using an address: our geocoder first converts the address into latitude/longitude before making a secondary query against our data store to retrieve the response values.

By using the same principle with IP geolocation instead of address geocoding, we have unlocked the ability for our users to query any of our Geo APIs with an IP address as the geolocator, returning data as if the request had used a latitude/longitude. So now you can use an updated IP to Census API and also a more detailed drilldown version. Furthermore, you can now go from IP to Foursquare Venue, Zillow Neighborhood, Wikipedia Article, and so on.

To use the new IP-Geolocation feature, just pass in the parameter g.ip_address with an IP address, along with a g.radius. Check out this example query, which will help you locate banks and credit unions in our Foursquare database that are within 3 km (about 2 miles) of the Infochimps office in Austin, TX. [YOUR API KEY HERE]

For client-side geo application developers we’ve also added another feature along with g.ip_address. With any of these APIs you can now pass “g.get_ip_address=true” instead, and our Geo API will determine the IP address of the machine calling our API and use that IP address as the geolocator. This new flag makes it easy to ask questions of our API like “tell me about venues near me” without ever having to know what your longitude is or how to interpret a quadkey.
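As a rough illustration of the two parameter styles described above, here is a minimal Ruby sketch that only builds the query strings. The example URL from this post is not reproduced here, so the endpoint below is a placeholder; the `apikey` parameter name and the idea that `g.radius` is given in meters are assumptions as well. Only `g.ip_address`, `g.radius` and `g.get_ip_address` come from the post.

```ruby
require 'uri'

# Placeholder endpoint: the real URL of the example query is not given here.
BASE_URL = 'https://api.example.invalid/geo/venues'

# Build a query URL from a hash of parameters.
def geo_query_url(params)
  "#{BASE_URL}?#{URI.encode_www_form(params)}"
end

# Look up by an explicit IP address (203.0.113.7 is a documentation-only address).
by_ip = geo_query_url(
  'g.ip_address' => '203.0.113.7',
  'g.radius'     => 3000,            # assumed to be meters
  'apikey'       => 'YOUR_API_KEY'   # assumed parameter name
)

# Let the service use the caller's own IP address instead.
by_caller = geo_query_url(
  'g.get_ip_address' => 'true',
  'g.radius'         => 3000,
  'apikey'           => 'YOUR_API_KEY'
)

puts by_ip
puts by_caller
```

The second form is the one a client-side application would use, since it never needs to know its own address.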

All in the spirit of making Geo data more accessible and easy to use!

Where Does The Weather (Data) Come From? Visualizations of Worldwide Weather Stations

This post was written by Hohyon Ryu, who interned with us this past summer as a Catalog Engineer.  He’s currently pursuing a PhD at the University of Texas’ School of Information.

The idea for this project started from one of the simplest and most essential questions in computer science: how close is the nearest X to where I stand? To explore how to answer this question, I used our NCDC Weather Station API and attempted to answer, “What is the closest weather station to where I stand?”

[Image: map of worldwide weather stations]

A brute force algorithm that calculates the distance to every weather station will have to go through 2.5 million weather stations. It works, but it takes a long, long time.
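For reference, the brute force approach might look like the following Ruby sketch. It uses the haversine formula for great-circle distance, which is my choice for illustration rather than something the post specifies, and the station coordinates are approximate values I filled in, not records from the NCDC data.

```ruby
EARTH_RADIUS_KM = 6371.0

# Great-circle distance between two (lat, lon) points in degrees (haversine formula).
def haversine_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

# Brute force: measure the distance to every station and keep the smallest.
def nearest_station(stations, lat, lon)
  stations.min_by { |s| haversine_km(lat, lon, s[:lat], s[:lon]) }
end

# Approximate, illustrative coordinates (not taken from the NCDC data).
stations = [
  { name: 'Baton Rouge', lat: 30.45, lon: -91.19 },
  { name: 'Slidell',     lat: 30.28, lon: -89.78 }
]

new_orleans = { lat: 29.95, lon: -90.07 }
closest = nearest_station(stations, new_orleans[:lat], new_orleans[:lon])
puts closest[:name]
```

Every lookup touches every station, which is what makes this slow once the list holds millions of entries.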

[Image: map divided into grid tiles]

A better solution is to divide the earth into a grid. We may divide the globe into small tiles and find the closest station in the tile that I’m standing in. This solution is very fast, but there’s another problem. In the map below, let’s say I’m standing in New Orleans. There are 2 stations: one in Baton Rouge and one in Slidell. The closest station would be the one in Slidell, but it is in a different tile. So this algorithm would find the one in Baton Rouge as the closest point.
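The failure mode is easy to reproduce. The Ruby sketch below uses a flat toy plane with made-up coordinates and a made-up tile size rather than real latitude/longitude; it only illustrates how a tile-only lookup can miss a closer station just across a tile border.

```ruby
TILE_SIZE = 10.0 # width and height of one square tile, in arbitrary units

# Map a point to the integer coordinates of the tile that contains it.
def tile_of(x, y)
  [(x / TILE_SIZE).floor, (y / TILE_SIZE).floor]
end

# Group stations by tile so a lookup only scans one bucket.
def build_grid(stations)
  stations.group_by { |s| tile_of(s[:x], s[:y]) }
end

def distance(x1, y1, x2, y2)
  Math.hypot(x2 - x1, y2 - y1)
end

# Tile-only lookup: nearest station among those in the query point's own tile.
def nearest_in_tile(grid, x, y)
  candidates = grid.fetch(tile_of(x, y), [])
  candidates.min_by { |s| distance(x, y, s[:x], s[:y]) }
end

# Toy layout: the query point sits near the right edge of tile [0, 0].
stations = [
  { name: 'A', x: 1.0,  y: 5.0 }, # same tile as the query point, but far away
  { name: 'B', x: 10.5, y: 5.0 }  # next tile over, but much closer
]
grid = build_grid(stations)

me = { x: 9.5, y: 5.0 }
tile_answer = nearest_in_tile(grid, me[:x], me[:y])
true_answer = stations.min_by { |s| distance(me[:x], me[:y], s[:x], s[:y]) }

puts "tile-only lookup: #{tile_answer[:name]}"
puts "true nearest:     #{true_answer[:name]}"
```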

[Image: Voronoi diagram of weather stations]

So I came up with this solution: a Voronoi diagram for all the stations in the world! It looks like a very complex calculation should be involved to generate a map like this, but it takes only a few minutes to build the world-scale map with 2.5 million points. Each station has a polygon that indicates the range it covers.

The best solution for us was grid + Voronoi lattice. Now let’s go back to the New Orleans problem. We’re in New Orleans, which is in a grid tile that intersects the Voronoi polygons of Slidell and Baton Rouge. So now we know that we have 2 candidate stations, and the one in Slidell is the closest one.
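The lookup step of the grid + Voronoi idea can be sketched as follows. This is not the Voronoi Lattice library: the table that maps each tile to the stations whose Voronoi polygons overlap it is assumed to be precomputed and is written out by hand here, and the station positions are toy values on a flat plane.

```ruby
TILE = 10.0 # tile size in arbitrary units (illustrative)

def tile_key(x, y)
  [(x / TILE).floor, (y / TILE).floor]
end

# Toy positions on a flat plane, not real coordinates.
stations = {
  'Baton Rouge' => [1.0, 5.0],
  'Slidell'     => [10.5, 5.0]
}

# Assumed precomputed: for each tile, the stations whose Voronoi polygon overlaps it.
# With these two stations the polygon border is the line x = 5.75, so tile [0, 0]
# (x from 0 to 10) touches both polygons and tile [1, 0] touches only Slidell's.
candidates_by_tile = {
  [0, 0] => ['Baton Rouge', 'Slidell'],
  [1, 0] => ['Slidell']
}

# Look up the tile, then compare distances among that tile's few candidates only.
def closest_station(stations, candidates_by_tile, x, y)
  names = candidates_by_tile.fetch(tile_key(x, y), stations.keys)
  names.min_by do |name|
    sx, sy = stations[name]
    Math.hypot(sx - x, sy - y)
  end
end

puts closest_station(stations, candidates_by_tile, 9.5, 5.0)
```

Each query now scans a handful of candidates instead of every station, and a station in a neighboring tile is still considered whenever its polygon reaches into the query tile.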

Want to try making your own visualization of our weather station data? You can find the NCDC Weather Stations API here, and the Voronoi Lattice library, written in Python, is available on GitHub.

A Marketer Learns How to Program

Okay, I have a confession to make. Though I’ve helped write example queries, created wireframes for webpages and heck, even toyed around with our API Explorer, I’ve never written a single line of code. Not terribly surprising for a marketer; however, after hanging out with the chimps for the past six months, I felt like it was finally about time to learn something. But where to start? Would I really have the time to commit to a regular weekly class? Could I make myself curl up with an O’Reilly book on PHP or Ruby with other more pressing projects looming? Clearly, the only way I could make this a priority is for it to be fun!

Meet Codecademy, which I’ve been nerding out on all morning. In fact, in the last hour, I’ve learned how to define variables and started working with strings, substrings, arrays and if/else. The best part of the whole thing? It’s been absolutely delightful. Much of this comes from the amazingly simple layout, which takes students through progressive lessons on the basics of programming and JavaScript in a command line setting. The initial concepts are easy to grok and they build beautifully on each other, until suddenly, whoa – did I just define variables, prompt a user for input and create different returns based on user input?

While this might not sound like much to the hardcore programmers out there, for a marketer who had never written a lick of code in her life as of an hour ago, Codecademy is a pretty rad resource. Share it with your programming-novice friends and see what they think. Or, contribute to their lesson plans and get more folks coding!

Outfielders: Step Back from the Centroid

My little league coaches drilled it into us outfielders: “It’s harder to run backward than forward, so stand where you think the batter will probably hit, and then take a few large steps back.” (Full disclosure: my little league career wasn’t so illustrious.)

A reader of my last blog post about Ichiro’s hit locations seems to disagree with my coaches:

“[Based on Ichiro’s hit locations], this says to me that the traditional baseball positions are relatively optimal in terms of covering the field…”

To rehash, the June post, Clustering Baseball Data with Weka, gave an example of applying k-means clustering to Suzuki Ichiro’s 2006 x-y hit locations.

[Image: hit locations and clusters]

Left: The unprocessed hit locations—with outliers like home runs removed.
Right: The same set divided into six clusters using the k-means algorithm and the Euclidean distance. Centroids are red.


Mapping Tools for Developers

This is a great time to be a geodeveloper. There are more spatial data, geo-processing tools, geo-enabled storage and mapping tools than ever.

Let’s start with storage – not too long ago geo developers had two choices, file formats or proprietary object-relational databases. Today there are production-ready open source object-relational databases such as PostgreSQL/PostGIS and MySQL; even mobile devices have lightweight databases with spatial capabilities such as SQLite. In addition to traditional object-relational databases, NoSQL databases such as Cassandra, CouchDB, and MongoDB have spatial capabilities. Bigtable clones such as HBase can also store spatial data, and there is ongoing work on developing a spatial index that facilitates spatial queries and operations. Neo4j is a graph database that also handles spatial data. Finally, even full text search engines such as ElasticSearch provide geospatial search capabilities.


Getting Started with the Geo API: Sample Queries

Yesterday, we launched our new Geo API, creating easy access to millions of geographic data points, organized into one consistent, unified schema. This new API allows you to ask questions of geo data the way you want; you can query by latitude/longitude/radius, zoom level, bounding box, quadkey and more!

Sounds great, but how do you get started?  We’ve pulled together some Example Queries to help our developers get started with the Geo API.  These sample queries are meant to help developers quickly and easily understand the structure of our API calls, so they can get to the fun part – getting answers to tough questions, building awesome visualizations & applications and having fun with our data!

Here are two examples of how to query our Geo API.  For more, please check out Example Queries and sign up for a free API key today!

Wikipedia articles

I’m building a day-tripper’s travel application. What are good sightseeing locales (based on places with Wikipedia articles) within a one hour drive (approximately 100km) of 105 E 5th St, Austin, TX 78701?[YOUR API KEY HERE]

For my aforementioned day-tripper travel application, I’m also interested in creating a kid’s section and the first thing I’d like to include are amusement parks near the Infochimps office in downtown Austin. (This example uses zoom level instead of radius.)[YOUR API KEY HERE]

We’re continually improving our API documentation. Please let us know in the comments if there’s anything in particular you’d like to see next. Maybe it’s a quadkey calculator, code examples for Ruby and PHP, or the inclusion of a specific data set. We’re happy to get you what you want – just let us know what we can do!

Clustering Baseball Data with Weka

This is a guest blog post from Peter Hauck, who works as a data analyst at Google.  His experience includes employee compensation optimization and dynamic pricing of live event tickets. He is a graduate of Cornell University with a B.A. in Mathematics and Physics and an M. Eng in Applied Physics.

Greetings sports fans and data nerds! Since 2004, Major League Baseball has published (x,y) “hit locations” of every at bat and for years, Sabermetric and actuarial analysts have turned this and other data into predictions of where individual sluggers will hit in the future. In hopes of optimally positioning players in the field, professional teams and sports commentators pay handsomely for this kind of forecasting.  The models I’ve seen employ data binning and statistical & probabilistic models to get these results.

In a twist, using the GUI software Weka, I applied k-means clustering to find patterns in the 2006 (x,y) hit locations of single-season hits record holder Ichiro Suzuki. For readers not familiar, clustering is a computational method of splitting a dataset into neighborhoods of similar points. Unlike most clustering work, using Weka avoids real programming; once the data was loaded into Weka, the computation required about ten mouse clicks. I think of this as a semi-scientific, exploratory method that offers quick insight and often reliable conclusions.
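For readers who would like to see what k-means does under the hood, here is a minimal plain-Ruby sketch of the standard algorithm (Lloyd's iteration) with Euclidean distance. It is not Weka and not the analysis from this post: the points are made up, it uses two clusters instead of six, and the starting centroids are picked by hand so the run is repeatable.

```ruby
# Squared Euclidean distance between two [x, y] points.
def sq_dist(a, b)
  (a[0] - b[0])**2 + (a[1] - b[1])**2
end

# Mean position of a non-empty list of points.
def centroid(points)
  n = points.size.to_f
  [points.sum { |pt| pt[0] } / n, points.sum { |pt| pt[1] } / n]
end

# Assign each point to its nearest centroid, then move each centroid to the
# mean of its points; repeat until the assignments stop changing.
def kmeans(points, centroids, max_iter = 100)
  assignment = nil
  max_iter.times do
    new_assignment = points.map do |pt|
      (0...centroids.size).min_by { |i| sq_dist(pt, centroids[i]) }
    end
    break if new_assignment == assignment

    assignment = new_assignment
    centroids = centroids.each_index.map do |i|
      members = points.each_index.select { |j| assignment[j] == i }.map { |j| points[j] }
      members.empty? ? centroids[i] : centroid(members)
    end
  end
  [centroids, assignment]
end

# Made-up (x, y) hit locations forming two obvious groups.
hits = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centroids, labels = kmeans(hits, [hits[0], hits[3]])

p centroids
p labels
```

The result depends on the starting centroids, which is one reason to treat clustering output as exploratory rather than definitive.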

Simple Flume Decorators with JRuby

Flume is a framework for moving chunks of data around on your network. Its primary mission is to move log data from where it is generated (perhaps a web server) to someplace where it can actually be used – like an HDFS file system where it can be crunched by Hadoop. Flume’s design is very flexible – the final destination for your data could also be a database like HBase or Cassandra, a search index system like Elastic Search, another file system like an S3 bucket, or any of a myriad of other configurations. Flume will also go to some effort to make sure that your data is delivered reliably – it includes some tunable reliability features out of the box.

The Flume User Guide does a good job of explaining how its component architecture works. You can configure data flows by chaining together systems of “nodes” – each node is a data moving unit with an input (“source”) and an output (“sink”). Sources can conceivably be anything that produces data – Flume can tail sets of files on disk, listen to network ports, periodically run programs, etc. Sinks are a little more interesting – they can write data to disk, push data into a network connection, or into a database. Even more interesting, sinks can be composites – you can fan data out to many other sinks, or set up conditional sinks where if data fails to be accepted by the first sink, it will instead be routed to a second sink. Also, you can build “Decorators” that can modify the data as it moves down the path. Flume offers many sources, sinks, and decorators out of the box – but also gives you the ability to write your own through a Java-based plugin API.

Flume chops the data up into a series of “events”. For instance, if you are using flume to tail a log file, every line written to the file gets turned into a flume event. Each event carries with it the body of the data that produced it, as well as some meta-data: the machine that it was collected on, a time-stamp showing when the data was collected, the event’s priority, etc. You can even add your own key-value pairs to an event in its attributes. Flume sinks can store both the body data and the metadata of an event or in some cases, use the metadata to help ensure that the data lands in the right place – like with the “collectorSink” file bucketing mechanism.
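To make the moving parts concrete, here is a toy model in plain Ruby. None of these classes are Flume's API; the names (`ToyEvent`, `CollectingSink`, `BrokenSink`, `TagDecorator`, `FailoverSink`) are invented for illustration. It only mirrors the ideas above: an event carries a body plus metadata and attributes, a decorator touches the event on its way to a sink, and a composite sink falls back to a second sink when the first one refuses the data.

```ruby
# Toy stand-in for an event: a body plus metadata and free-form attributes.
ToyEvent = Struct.new(:body, :host, :timestamp, :priority, :attrs)

# A sink that just remembers what it was given.
class CollectingSink
  attr_reader :events

  def initialize
    @events = []
  end

  def append(event)
    @events << event
  end
end

# A sink that always refuses, to exercise the failover path.
class BrokenSink
  def append(_event)
    raise IOError, 'sink unavailable'
  end
end

# Decorator: sits in front of a sink and adds one attribute to each event.
class TagDecorator
  def initialize(sink, key, value)
    @sink = sink
    @key = key
    @value = value
  end

  def append(event)
    tagged = event.dup
    tagged.attrs = event.attrs.merge(@key => @value)
    @sink.append(tagged)
  end
end

# Composite sink: try the primary, fall back to the secondary on failure.
class FailoverSink
  def initialize(primary, secondary)
    @primary = primary
    @secondary = secondary
  end

  def append(event)
    @primary.append(event)
  rescue IOError
    @secondary.append(event)
  end
end

backup = CollectingSink.new
pipeline = TagDecorator.new(FailoverSink.new(BrokenSink.new, backup), 'env', 'test')
pipeline.append(ToyEvent.new('GET /index.html', 'web01', Time.now, :info, {}))

p backup.events.first.attrs
```

The event ends up in the backup sink with the extra attribute attached, even though the primary sink rejected it.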

To me, and to some of the other primates I work with at Infochimps, decorators are especially interesting. In a decorator, you get to do some processing on your data as it flies from wherever it was produced to its final destination(s). Flume comes with a handful of basic decorators that will allow you to do some small scale processing of flume events. For instance, the “value” decorator lets you set a value in the metadata of an event. The out-of-the-box decorators are not quite sufficient to handle my processing demands. I wanted a little more flexibility, so I wrote (in less time than it took me to write this blog entry) a quick interface to jRuby. Now I have access to my flume data in transit with a powerful scripting engine.

Enough with the intro – let’s jump in. The following steps will lead you down the road to processing data on the fly through the flume with a jRuby script:

1. Install Flume. Cloudera has good documentation on setting up Flume. I run Ubuntu, so I just added the Cloudera apt package repository to my apt sources, and used “apt-get” to install the packages flume, flume-node and flume-master.

2. Get jRuby. If you use apt-get, you will be getting a slightly out-of-date version, but it will do for the moment. The jRuby website has more details if you need them.

3. Get my jRubyPlugin. For it to work, you have to have it and jruby.jar in Flume’s classpath. You can make custom adjustments to Flume’s runtime environment, including Flume classpath changes, in the script in flume/bin. The easy way is to just drop jruby-flume.jar in the /usr/lib/flume/lib directory (or wherever Flume landed in your install process). Getting your jRuby environment completely set up so that you can see jruby gems and stuff is going to involve making adjustments to your environment, but for now, just having the jruby.jar on the classpath will work. I just created a symbolic link to /usr/lib/jruby/lib/jruby.jar in /usr/lib/flume/lib.

Aside: I don’t know the full answer to getting everything jruby set up in an embedded mode. However, if you add the following to your script, you will be at least part of the way there:
export UOPTS="-Djruby.home=/usr/lib/jruby -Djruby.lib=/usr/lib/jruby/lib -Djruby.script=jruby"

4. You have to tell flume explicitly what classes to load as plugins when the nodes start up. To do this, create or edit “flume-site.xml” in the flume/conf directory. It should contain at least the following:
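<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>flume.plugin.classes</name>
    <value>…</value>
    <description>List of plugin classes to load.</description>
  </property>
</configuration>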

After you get this in place, restart your flume-master and flume-node. If everything went ok, the services will start up, and you can go to http://localhost:35871/masterext.jsp to see if the plugin loaded successfully. If you see “jRubyDecorator” listed under the decorators, you are in business.

5. Ok, now let’s build a decorator that does something simple. Create a directory somewhere to keep ruby scripts for flume – I like /usr/lib/flume/scripts. The files in this directory need to be readable by the user that flume is running as. Also, if you are in a distributed world, scripts are going to have to be available both on the master and on the machine that will house the logical node that will run the script.

Here is a simple script. Put it in /usr/lib/flume/scripts/reverse.rb:
# reverse.rb - jRubyDecorator script
require 'java'
java_import 'com.cloudera.flume.core.EventSinkDecorator'
java_import 'com.cloudera.flume.core.Event'
java_import 'com.cloudera.flume.core.EventImpl'

class ReverseDecorator < EventSinkDecorator
  def append(e)
    # Flume hands us the body as Java bytes; turn it into a Ruby String.
    body = String.from_java_bytes e.getBody
    # Build a new event with the reversed body and let the parent class deliver it.
    super EventImpl.new(body.reverse.to_java_bytes, e.getTimestamp, e.getPriority, e.getNanos, e.getHost, e.getAttrs)
  end
end

# The value of the last line is what the script hands back to jRubyDecorator.
ReverseDecorator.new(nil)

What does it do? Well, it defines a subclass of com.cloudera.flume.core.EventSinkDecorator which redefines the append method. Our special append method builds a new event from an appended event, except that the text of the "body" field is reversed. Not too much nonsense, but we do have to be a little careful with strings. Flume likes its data to be represented as arrays of bytes, but ruby would prefer to deal with strings as Strings, so I convert both ways: String.from_java_bytes() to get a string object, and the to_java_bytes() method on string-like objects to convert back. Hidden in there is the ruby string method "reverse".

The last line of the append method shows off some of the power of jRuby. It creates a new instance of EventImpl and passes it off to EventSinkDecorator's implementation of append - basically letting the parent class handle all of the difficult work. Finally, the last line of the script instantiates a new object of the (jRuby!) ReverseDecorator class and returns it to jRubyDecorator. jRubyDecorator is really a factory class for producing decorator instances. It passes off our stuff as a java object, and flume never suspects what has happened.

Does it work? Let's see:
chris@basqueseed:~$ flume shell -c localhost
2011-02-23 17:35:03,785 [main] INFO conf.FlumeConfiguration: Loading configurations from /etc/flume/conf
Using default admin port: 35873
Using default report port: 45678
Connecting to Flume master localhost:35873:45678...
2011-02-23 17:35:03,993 [main] INFO util.AdminRPCThrift: Connected to master at localhost:35873
==================================================
FlumeShell v0.9.3-CDH3B4
Copyright (c) Cloudera 2010, All Rights Reserved
==================================================
Type a command to execute (hint: many commands only work when you are connected to a master node)
You may connect to a master node by typing: connect host[:adminport=35873[:reportport=45678]]
[flume localhost:35873:45678] exec config basqueseed console '{jRubyDecorator("/usr/lib/flume/scripts/reverse.rb")=>console}'
[id: 0] Execing command : config
Command succeeded
[flume localhost:35873:45678] quit
So far, so good – the master node has decided that everything is kosher. By the way, be careful with the single and double quotes in flume shell commands. The flume shell is very picky about its input. If you have any structure to your sources or sinks, you must single quote the declaration. Let’s now play with a node:
chris@basqueseed:~$ sudo /etc/init.d/flume-node stop
[sudo] password for chris:
Stopping Flume node daemon (flume-node): stopping node

chris@basqueseed:~$ sudo -u flume flume node_nowatch
2011-02-23 17:38:21,709 [main] INFO agent.FlumeNode: Flume 0.9.3-CDH3B4
2011-02-23 17:38:21,710 [main] INFO agent.FlumeNode: rev 822c62f0c13ab76921e96dd92e19f68007dbcbe2
2011-02-23 17:38:21,710 [main] INFO agent.FlumeNode: Compiled on Mon Feb 21 13:01:39 PST 2011
…{stuff deleted}…
2011-02-23 17:39:40,471 [logicalNode basqueseed-20] INFO console.JLineStdinSource: Opening stdin source
?was I tam a ti saW
2011-02-23 17:39:45,720 [logicalNode basqueseed-20] INFO debug.ConsoleEventSink: ConsoleEventSink( debug ) opened
basqueseed [INFO Wed Feb 23 17:39:45 CST 2011] Was it a mat I saw?
Holy chute! It works!

I think that is enough for today. Next time, I’ll try some more complicated scripts, deal with attributes and play some games with data flows.