The Infochimps Blog

Big data insights, news, and tips straight from the Data Mine

News & Interviews

Big Data Predictions for 2012 – Part One

A couple of days ago, O’Reilly’s Edd Dumbill took a look at hot topics in data for the coming year.  His five big data predictions for 2012 were absolutely spot on and illustrate the most important next steps in our ability to find insight in data: visualization, interactivity and abstraction.  As CTO of a tech startup at the center of a lot of these conversations, I thought I’d share some of my thoughts on the matter as well.  A full post with my 2012 big data predictions will be out in the next couple of weeks.

Moving Towards Insight, Not Data

Nobody wants data. It’s costly to store and bothersome to gather, parse and make useful. What you do want is insight: the ability to see deeper and make better decisions.  However, even these tools of the future can only take your internal data so far. At some point, it becomes important to widen your observation field to include explanatory variables to give your data context.  In other words, at some point, it’s useful to bring in outside data to give your own internal data more shape and meaning.

Explanatory variables give your data context, which leads you to insight.

Data marketplaces such as Infochimps, Factual and Microsoft Azure will take off as CIOs discover that BI tools are for generating questions, not answers.  For example, pretend you run a camping equipment store based in Reno, NV with an e-commerce site.  After years of business, you understand your yearly sales cycle (peaks in the early spring and early fall), who your repeat customers are, who your local competitors are, etc.  However, you may uncover curiosities that cannot be answered by internal data: why do sales peak in early to mid-August and why do so many folks request holds during that time?  Why does your customer spread suddenly change from mostly folks in the Nevada area to folks from all over the country? Your business intelligence tools may fall short here.  Explanatory variables from outside sources, including weather, popularity of competitors (Foursquare check-ins, web traffic, etc.) and major festivals in your area (oh HAI Burning Man!) may help you get to the bottom of why your store suddenly becomes flooded with neon furry leg-warmer wearing hippies around Labor Day.

In other words, your data creates questions, and outside data creates context and helps you answer them.  In short, the answer to the too-much-data problem is more data.
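To make the camping store example concrete, here is a minimal sketch of lining up internal sales with one outside explanatory variable. Everything in it is an assumption for illustration: the sales figures are made up, the event dates are placeholders, and the `annotate_sales` helper and its `lead_days` parameter are hypothetical names, not part of any Infochimps product.

```python
from datetime import date, timedelta

# Hypothetical internal data: units sold, keyed by the week's start date.
weekly_sales = {
    date(2011, 8, 1): 120,
    date(2011, 8, 15): 310,
    date(2011, 8, 22): 480,
    date(2011, 9, 12): 130,
}

# Hypothetical outside data: (event name, first day, last day).
regional_events = [
    ("Burning Man", date(2011, 8, 29), date(2011, 9, 5)),
]


def annotate_sales(sales, events, lead_days=14):
    """Tag each sales week with events that are under way or start within lead_days."""
    lead = timedelta(days=lead_days)
    rows = []
    for week_start in sorted(sales):
        nearby = [
            name
            for name, first_day, last_day in events
            if first_day - lead <= week_start <= last_day
        ]
        rows.append({"week": week_start, "units": sales[week_start], "events": nearby})
    return rows


if __name__ == "__main__":
    for row in annotate_sales(weekly_sales, regional_events):
        print(row["week"], row["units"], ", ".join(row["events"]) or "-")
```

The internal data alone only shows that something happens in late August. The joined column is what suggests a reason.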

Real-time Insight

In his article, Edd talks about the idea of streaming data processing. I haven’t gotten to play with Esper yet, but Ilya Grigorik has — as he puts it, Esper lets you “store queries not data, process each event in real-time, and emit results when some query criteria is met.”  You can read his overview here.
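The "store queries not data" idea can be sketched in a few lines of plain Python. This is not Esper's API (Esper is a Java library with its own query language). The `StandingQuery` class, its window size and its threshold are all hypothetical names chosen for illustration.

```python
from collections import deque


class StandingQuery:
    """Holds a query and a small sliding window, never the full event history."""

    def __init__(self, window_size, threshold):
        self.window = deque(maxlen=window_size)  # old events fall off automatically
        self.threshold = threshold

    def on_event(self, latency_ms):
        """Process one event; return the window average if it breaches the threshold."""
        self.window.append(latency_ms)
        if len(self.window) < self.window.maxlen:
            return None  # not enough events yet to evaluate the query
        average = sum(self.window) / len(self.window)
        return average if average > self.threshold else None


def run(query, events):
    """Feed events through the query one at a time and yield only the matches."""
    for event in events:
        result = query.on_event(event)
        if result is not None:
            yield result


if __name__ == "__main__":
    query = StandingQuery(window_size=3, threshold=200)
    for alert in run(query, [100, 150, 120, 400, 500]):
        print("average latency over threshold:", alert)
```

Memory use is bounded by the window size no matter how long the stream runs, which is the practical difference from a batch job that stores everything first and queries later.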

At a lower and simpler level, we here at Infochimps LOVE Flume. It does one thing, simply and well: gets data from over here to over there, perhaps doing things to it along the way. It was developed for reliable log handling, but there are so many wonderful ways to beautifully misuse Flume:

  • We replaced a set of massive batch jobs to parse scraped data with a simple Flume decorator that turns raw JSON into database-ready records. The new data flow is not only near-real-time, but also simpler, more reliable, and more maintainable.
  • If you’re shipping your weblogs around with Flume, ten lines of Ruby and a tiny little StatsD daemon are all you need to track in real-time the rate, response code and latency of every web request. Joining the Church of Graphs has never been easier.
  • Trying to scale a distributed message queue (RabbitMQ, Resque, etc.)? If you care about throughput more than latency, look carefully at Flume: you’re probably better off.
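As a rough illustration of the StatsD bullet above, here is what the metric-emitting side might look like in Python rather than Ruby. The log line layout (`METHOD PATH STATUS LATENCY_MS`), the `web` metric prefix and the function names are assumptions made for this sketch. The `name:value|type` packet format and UDP port 8125 are StatsD's standard conventions. The Flume side is not shown.

```python
import socket


def parse_log_line(line):
    """Split a hypothetical 'METHOD PATH STATUS LATENCY_MS' weblog line."""
    method, path, status, latency_ms = line.split()
    return method, path, int(status), int(latency_ms)


def statsd_packets(status, latency_ms, prefix="web"):
    """Build StatsD packets for one request: rate, response code and latency."""
    return [
        f"{prefix}.requests:1|c",             # counter: request rate
        f"{prefix}.status.{status}:1|c",      # counter: one per response code
        f"{prefix}.latency:{latency_ms}|ms",  # timer: request latency
    ]


def send(packets, host="127.0.0.1", port=8125):
    """Fire-and-forget the packets over UDP to a StatsD daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for packet in packets:
            sock.sendto(packet.encode("ascii"), (host, port))


if __name__ == "__main__":
    _, _, status, latency_ms = parse_log_line("GET /index.html 200 37")
    send(statsd_packets(status, latency_ms))
```

Because the transport is UDP, a missing or slow StatsD daemon does not block the log flow, which is one reason this kind of instrumentation is cheap to bolt on.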

By the way, if timeseries are your bread and butter, check out our friends at Eidosearch.

Data Science Workflows and Tools for Fast Iteration

Edd describes Hadoop’s batch-oriented processing as sufficient for many uses, but argues that batch processing isn’t adequate for online and real-time needs.  Spark/Mesos has great promise for the kind of interactive exploration Edd describes. (It’s as close as anyone’s gotten to a REPL for big data exploration.)

At Infochimps, our workflow revolves around writing short, readable scripts.  See our chapter in O’Reilly’s Hadoop: The Definitive Guide for a case study on how we developed a toolkit for Hadoop streaming in a Ruby environment.

Visualization is King

The most important machine learning algorithms all, at their heart, exploit a graph of one sort or another. Yet simply seeing, in any form, a graph of significant scale is still an oafish RAM-exhausting slog. Far and away the best tool we’ve used is Gephi, an open source interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.

So, how essential is visualization, especially graph visualization, to quality data science?  LinkedIn, home of one of the top data science teams in the world, used its consistently-brilliant/outrageously-unfair hiring advantage to bring aboard Mathieu Bastian, Gephi’s author. LinkedIn derives so much value from just being able to see and understand their data better that they’re funding the full-time development of Gephi, to the benefit of us all.

On Being Wrong In Paris: Finding Truth in Wrong Answers

Here’s a problem that’s harder than it seems: Where is Paris? Any simple response proves more ambiguous and brittle than you would expect. But across an ocean of data lies a new way to discover answers, one that accommodates complexity because it is sourced in complexity.

A reasonable answer is “the political boundary of the city of Paris, France”. If you were walking through France, and were prone to making silly graphs, you might draw this plot of “How much am I in Paris?”

(more…)

Flip Kromer Interview at Strata NY 2011

Check out this interview with Flip Kromer, founder and CTO (yes, that’s a typo in the O’Reilly video) at Strata NY 2011.  He shares his thoughts on how to do data science on a shoestring, the tools of the trade and the future of DaaS (data as a service).

Infochimps Acquires Keepstream

We are excited to announce that Keepstream, a social media curation and analytics company, will be joining the Infochimps family!  Jim England, Huston Hoburg, and I (Tim Gasper) are excited to now be Chimps. Together we’ll continue to develop exciting new data products and rock the world of Big Data.

With this announcement, we have two pieces of product news. Firstly, the Keepstream hand-curation product will slowly close down. The website will be set to “read-only mode” on September 30th. New user registration will be turned off and existing users will no longer be able to create or edit their collections. However, all collections will still be hosted online and be accessible at keepstream.com for viewing. So don’t worry, those links you’ve shared will still work just fine. We will be exploring options for exporting collections or integrating with similar curation services. If you’d be interested in something like this, leave me a note at tim[at]keepstream[dot]com.

Secondly, Keepstream Reports, located at http://reports.keepstream.com, will continue forward. It has the potential to be both a more automated way to archive social media as well as a way to create beautiful, actionable social media analytics reports. It’s currently in private beta.

We are really excited about what we can accomplish together, and you can look forward to many exciting developments soon to come!  Also, Jim and I will be attending TechCrunch Disrupt in San Francisco next week.  If you’ll be there too, we’d love to meet you.  Just look for us in our brand new Infochimps shirts or shoot us a tweet at @JimEngland or @TimGasper.

The Howler Project: Get Loud!

For howler monkeys, as their name suggests, loud vocal communication is an integral part of their social behavior. Howler monkeys are widely considered to be the loudest land animal and, according to the Guinness Book of World Records, their vocalizations can be heard clearly for 3 miles (4.8 km). Curious what they sound like? You can hear a clip of the howler monkey’s signature low growl here.

Wanna get loud?

We’re launching a beta testing program to pull in feedback from our most vocal fellow primates on our newest APIs.  We’re calling it the Howler Project and you can apply to become one of our first Howlers by clicking here.  We’ll email you periodically to let you know when we’re releasing new APIs and how to access them before we announce them to the general public.  Each API will have a short series of tests/questions we’d like you to answer, though you’re welcome to try to break it in any way you’d like.  The more feedback we get from you, the better we can make our products.  And the better our APIs, the easier and faster it is to start building amazing things on top of our data.  Everyone wins and you can help us get there!

So what’s in it for you, our furry friends?

Well, in true chimp spirit, if you pick bugs off our backs, you get to enjoy tasty grubs.  Help us test just one API and we’ll send you some fun chimpy swag.  Become a regular tester and the incentives grow to include free account upgrades on infochimps.com, sweet toys from ThinkGeek, custom Startup: The Hackering cards with your info and more.

Apply today and get loud!

If Search Still Sucks, Here’s How to Fix It

In a controversial post that was countered by Matt Cutts of Google, Michael Arrington compared the modern search experience to what it was like before Google was created, claiming, “It’s a sh*t show of layer upon layer of SEO madness vying for my click”.

“Yes, search is very hard. But Silicon Valley is really good at doing hard things,” claims Arrington.

Here at Infochimps, we have never shied away from difficult problems. We are a company of 12 people handling data stores that companies 100 times our size never see. Soon, Infochimps will launch hundreds more data sets and data APIs. So what do data APIs have to do with Google not being effective?

In the post, Arrington explains that when he knows what he is looking for, such as a vacation, he goes straight to a vacation website such as TripAdvisor or Gogobot. Sites like TripAdvisor or Yelp start with data, though, acquired either through mass aggregation or subscription to expensive services. Aggregation is not easy. It means hours of scraping, cleaning, parsing, and updating data, with scraping scripts that could break along the way. It also means potentially thousands of dollars of hosting costs to house and process that data. Building search is neither a time-effective nor a cheap exercise.

If you can launch an app and plug into a data API in minutes, suddenly creating targeted vertical search engines becomes easy and more affordable. You are in essence sharing a data set with others, which makes creating intelligent algorithms with that data a lot easier and more affordable.

There will always be a place for a major search engine like Google. However, vertical search engines are content-rich and can become “the search within the search”. We are doing what we can to make those more accessible to everyone, so hopefully we can make Mr. Arrington and others happy in their quest to find things.

By the way, we solve difficult problems here in Austin too. ;-)

Interesting Article on Factual (With Nod to Infochimps)

Factual is very ambitious and we share their desire to “liberate the world’s data”. That being said, they are building an open-source database and we are building a frictionless data marketplace. These are two different things, and don’t preclude us from working together towards our shared desire. If we are successful in disrupting the $100 billion data services market, maybe the first sentence in the article below will some day contain names like Jacob Perkins, Joe Kelly, Dhruv Bansal, Flip Kromer, Hollyann Wood, Jesse Crouch, Kurt Bollacker, Michelle Greer, Dennis Yang, Chris Howe, Adam Seever, or heck, maybe even Nick Ducoff.

Read more about Factual at the Wall Street Journal’s website.

Infochimps Founder Flip Kromer’s Interview on FounderBuzz

Infochimps has gone through quite a lot since beginning as a simple idea of becoming “the SourceForge of data”. We’ve graduated from a group working out of founder Flip Kromer’s house to a downtown Austin company with fifteen employees in two states. Learn more about Infochimps’s beginnings in this interview of Flip with Scott Olson from FounderBuzz:

Welcome Kurt Bollacker to the Infochimps Team!

I’m excited to announce that Kurt Bollacker is joining the team here at Infochimps as our consulting Data Scientist. I recently had the pleasure of working with Kurt on our analysis of Republican and Democrat words project, so I’m looking forward to working with him more on some awesome projects.

Kurt is also the Digital Research Director at the Long Now Foundation, which is a much respected group in San Francisco, focused on long term policy and thinking. Brian Eno, Esther Dyson and Stewart Brand are amongst its board members. Previously, Kurt was the Chief Scientist at Metaweb Technologies, which was acquired by Google this year.

Kurt and Flip first met at our first Data Cluster meetup in Austin at SXSW in 2009. Since then, they’ve continued to discuss and collaborate over Wukong, our open source tool that allows data engineers to write Ruby scripts to run big data processing on Hadoop. Kurt received his Ph.D. in Computer Engineering from UT, so he’s no stranger to Austin.

Here at Infochimps, Kurt will be joining our growing data team to design and spec our data pipeline. We ingest lots of data from many different sources, including web pages, regular FTP uploads by suppliers, and APIs, and that data then needs to go through a chain of processes before it is ready for distribution on our site or API. At Metaweb, Kurt had this exact experience in shipping large amounts of data around with Freebase, so we look forward to his vast expertise accelerating our efforts in this area.

Kurt, welcome to the team; we’re all excited to have you on board.

Infochimps Acquires DataMarketplace.com

We’re pleased to announce that Infochimps has acquired DataMarketplace.com. We’ve been admiring what they have been doing for a while now, so we jumped at the opportunity when it presented itself. DataMarketplace.com is a Y Combinator-funded company, founded by Steve DeWald and Matt Hodan at the beginning of this year, with the original vision to be the “Amazon for structured information.”

When I met with Steve to chat about his vision, it was apparent that our two companies shared many of the same philosophies and visions for the future of Big Data. Hell, even our platforms are built upon many similar foundations and tools, like Ruby on Rails and Heroku. So, the transition has been smooth, and the site is already running on Infochimps servers.

From the Co-Founder and CTO of DataMarketplace.com, Steve DeWald:

Data is one of the most valuable assets in the world. We use it for decisions every day, and enormous industries are built around compiling and organizing it. It costs almost nothing to share, but despite that there is no single pervasive marketplace for buying and selling data. That’s the problem we tried solving with Data Marketplace.

It’s a problem because the fragmented nature of data creates friction for those wanting to share it. As a seller of data, there’s no easy or standardized way to monetize it. Oftentimes the expectation is to sell it in an expensive research report and have the raw data separately available by request. That’s fine, but that’s only capturing a fraction of a fraction of a percent of all the useful data that people could be selling. Likewise there’s a lot of data people want to be selling that potential buyers can’t find. As a consumer of data, I often search on Google for the data I’m looking for, though frequently the data I want is behind a pay-wall and keywords are not being properly indexed for search. All these problems could be solved for the betterment of humanity with a standardized and open marketplace for data.

Although Matt and I have moved on to other projects (I’m selling custom made suits online), I am happy to be putting our work in the hands of the talented team at InfoChimps, which has built the world’s largest open marketplace for data.

Thanks for the kind words, Steve.

We’re excited to integrate DataMarketplace.com into Infochimps. As Nick Ducoff, Infochimps CEO, says:

Just as Salesforce recently extended their brand with database.com, we’re extending ours with DataMarketplace.com, which fits well into our overarching strategy to be the destination on the web for data and data services.



Q&A about the acquisition:

If I have uploaded my data to DataMarketplace.com, what’s going to happen with it? Will it still be available for purchase, and will I receive my royalties?

All datasets that are available on DataMarketplace.com will soon be available through Infochimps.com, and will continue to live on DataMarketplace.com. Customers will still be able to browse and purchase the data, and we will ensure that you receive your royalties from sales of that data.

What will happen to my user account on DataMarketplace.com?

Your account will still survive on DataMarketplace.com, and you will soon receive an email with details on how to log in at Infochimps.com. It’s important for us to maintain the DataMarketplace.com community, and we will notify you of any changes to your account in as few emails as possible.

What will happen to the Data Requests on DataMarketplace.com?

We will continue to support the data requests feature on DataMarketplace.com, and we do not plan to remove or change any of the requests that are on the site at present. We will notify requesters of changes to their requests or their account.
