News & Interviews

Interesting Article on Factual (With Nod to Infochimps)

Factual is very ambitious and we share their desire to “liberate the world’s data”. That being said, they are building an open-source database and we are building a frictionless data marketplace. These are two different things, and don’t preclude us from working together towards our shared desire. If we are successful in disrupting the $100 billion data services market, maybe the first sentence in the article below will some day contain names like Jacob Perkins, Joe Kelly, Dhruv Bansal, Flip Kromer, Hollyann Wood, Jesse Crouch, Kurt Bollacker, Michelle Greer, Dennis Yang, Chris Howe, Adam Seever, or heck, maybe even Nick Ducoff.

Read more about Factual at Wall Street Journal’s website.

Infochimps Founder Flip Kromer’s Interview on FounderBuzz

Infochimps has gone through quite a lot since beginning as a simple idea of becoming “the SourceForge of data”. We’ve graduated from a group working out of founder Flip Kromer’s house to a downtown Austin company with fifteen employees in two states. Learn more about Infochimps’s beginnings in this interview of Flip with Scott Olson from FounderBuzz:

Welcome Kurt Bollacker to the Infochimps Team!


I’m excited to announce that Kurt Bollacker is joining the team here at Infochimps as our consulting Data Scientist. I recently had the pleasure of working with Kurt on our analysis of Republican and Democrat words project, so I’m looking forward to working with him more on some awesome projects.

Kurt is also the Digital Research Director at the Long Now Foundation, which is a much respected group in San Francisco, focused on long term policy and thinking. Brian Eno, Esther Dyson and Stewart Brand are amongst its board members. Previously, Kurt was the Chief Scientist at Metaweb Technologies, which was acquired by Google this year.

Kurt and Flip first met at our first Data Cluster meetup in Austin at SXSW in 2009. Since then, they’ve continued to discuss and collaborate over Wukong, our open source tool that allows data engineers to write Ruby scripts to run big data processing on Hadoop. Kurt received his Ph.D. in Computer Engineering from UT, so he’s no stranger to Austin.

Here at Infochimps, Kurt will be joining our growing data team to design and spec our data pipeline. We ingest lots of data from many different sources, including web pages, regular FTP uploads by suppliers, and APIs, and that data then needs to go through a chain of processes before it is ready for distribution on our site or API. At Metaweb, Kurt had this exact experience in shipping large amounts of data around with Freebase, so we look forward to his vast expertise accelerating our efforts in this area.

Kurt, welcome to the team; we’re all excited to have you on board.

Infochimps Acquires

We’re pleased to announce that Infochimps has acquired We’ve been admiring what they have been doing for a while now, so we jumped at the opportunity when it presented itself. is a Y Combinator-funded company, founded by Steve DeWald and Matt Hodan at the beginning of this year, with the original vision to be the “Amazon for structured information.”

When I met with Steve to chat about his vision, it was apparent that our two companies shared many of the same philosophies and visions for the future of Big Data. Hell, even our platforms are built upon many similar foundations and tools, like Ruby on Rails and Heroku. So, the transition has been smooth, and the site is already running on Infochimps servers.

From the Co-Founder and CTO of, Steve DeWald:

Data is one of the most valuable assets in the world. We use it for decisions every day, and enormous industries are built around compiling and organizing it. It costs almost nothing to share, but despite that there is no single pervasive marketplace for buying and selling data. That’s the problem we tried solving with Data Marketplace.

It’s a problem because the fragmented nature of data creates friction for those wanting to share it. As a seller of data, there’s no easy or standardized way to monetize it. Oftentimes the expectation is to sell it in an expensive research report and have the raw data separately available by request. That’s fine, but that’s only capturing a fraction of a fraction of a percent of all the useful data that people could be selling. Likewise, there’s a lot of data people want to be selling that potential buyers can’t find. As a consumer of data, I often search on Google for the data I’m looking for, though frequently the data I want is behind a pay-wall and keywords are not being properly indexed for search. All these problems could be solved for the betterment of humanity with a standardized and open marketplace for data.

Although Matt and I have moved on to other projects (I’m selling custom-made suits online), I am happy to be putting our work in the hands of the talented team at Infochimps, which has built the world’s largest open marketplace for data.

Thanks for the kind words, Steve.

We’re excited to integrate into Infochimps. As Nick Ducoff, Infochimps CEO, says:

Just as Salesforce recently extended their brand with, we’re extending ours with, which fits well into our overarching strategy to be the destination on the web for data and data services.

Q&A relating to the acquisition:

If I have uploaded my data to, what’s going to happen with it? Will it still be available for purchase, and will I receive my royalties?

All datasets that are available on will soon be available through, and will continue to live on Customers will still be able to browse and purchase the data, and we will ensure that you receive your royalties from sales of that data.

What will happen to my user account on

Your account will still survive on, and you will soon receive an email with details on how to login at It’s important for us to maintain the community, and we will notify you of any changes to your account in as few emails as possible.

What will happen to the Data Requests on

We will continue to support the data requests feature on, and we do not plan to remove or change any of the requests that are on the site at present. We will notify requesters of changes to their requests or their account.

Hadoop World 2010 & New Propaganda

Yay! Infochimps is going to Hadoop World 2010. Watch out New York! I (flip) am giving a talk titled “Millionfold Mashups” — I’ll talk about how we store, process and analyze massively numerous datasets and datasets of massive size.

We’re going to order propaganda stickers to give out, and we want to get your feedback on which to print.

Favorites? Terrible puns of your own to add? Want us to send you a set? Let us know in the comments!

  • Live Fast and Leave a Beautiful Corpus at
  • Where Hot Singles come to Dataset
  • Upload Yours.
  • Hadoop-de-doo for you
  • Dammit, No, the Other NLP
  • I’m Consistently Available. Want to see my Partition?
  • Intoxication by Miners is OK at
  • Fit your Curves at
  • Head in the Clouds?
  • Expose your Bits at
  • Support Vector Machines!
  • Free Variables
  • Everyone at our Datacenter has a Nice Rack
  • Bayesians Against Discrimination
  • Map Reduce, Map Reuse, Map Recycle
  • PAXOS in our time
  • Pro Axiom of Choice
  • Big Chimpin’
  • We have the most Cunning Linguists
  • P = NP
  • P != NP

Several of the slogans were shamelessly stolen from this protest by CMU Machine Learning researchers, which I love so much it hurts.

5 Interesting Data Articles

Inspired by Pete Warden’s Five Short Links, we decided to put up a post about the most interesting data articles we’ve come across in recent months.

Data, Data Everywhere: The Economist ran a pretty comprehensive and accessible special report on data with a series of articles covering the different implications – both good and bad – of the growing amount of data in existence. Make sure you click on the links “In this special report” to read the rest of the articles.

The Data-Driven Life: This New York Times article shows that even the most mundane-seeming data can be useful. Shared are stories about people who collect personal data using tools, applications and processes, bringing home the point that all this tracking isn’t merely creepy – it gathers data that can help us make better informed decisions.

Informavore: The Future of Data Privacy: Here the author explores the extent to which social network data should be private. Citing various case studies and Danah Boyd’s talk during SXSWi earlier this year, she highlights many points of debate and provides readers with some food for thought.

Four Ways of Looking at Twitter: Jeff Clark is a data viz enthusiast and has taken Twitter data and created four interesting visualizations. The four are just a few of the visualizations that have come about in the past few months, and are great examples of what people can do with the rich data source of social networks.

On Twitter, Followers Don’t Equal Influence: This research is a great example for why is a better measure of influence than followers count, and goes much more in-depth in explaining why. This press release from Carnegie Mellon University has some similar validations. Scientists there determined that Twitter could be as good at determining public opinion as a Gallup Poll.

Freebase Hack Day & Updates

Our friends at Freebase are having another Hack Day in San Francisco this July. It’s only two weeks away now, and the remaining tickets can go fast, so get involved.

Learn about the many cool things that Freebase is doing with their data, and the tools that can be built using their platform.

On a side note, has gotten a facelift.  We’d love feedback on it:   We hope your browsing experience is better, and we will be happy to roll out new features soon!

@mrflip's OpenGov Talk: Data Commons and Transparent Government

Here is my (mrflip’s) SxSW OpenGov talk, “How Open Data will help build Open Government”:


There is nothing more painful than watching yourself talk. So I haven’t gone all the way through this video; if you see me, don’t give away the ending. Huge thanks to Silona Bonewald (League of Technical Voters) for organizing this, and to Terry Walhus for taping and copying and editing and uploading the videos.

I love it when a plan comes together…

So Simon Willison (@simonw), one of the architects of The Guardian’s Open Platform and co-creator of a modestly popular web framework, is here at SxSW and gave an informal talk (on Zeppelins, of course – what else?). Freebase community manager Kirrily Robert (@skud) saw my tweet and proposed a meetup. After iteratively solving the three-body problem, we put out the word on Sunday morning for a meetup on Sunday evening… SemWebAustin @juansequeda and Freebase @jameshome each pinged their 1-neighborhood, and next thing you know I’m sitting next to Jure Cuhalev of Zemanta and machine learning machine @Nikete, trying to orchestrate overflow seating for 25+ data geeks.

The reason for the gossip-column style of this post is to show the size and breadth of the data geek crowd. James Home and I agree that we need to turn out this Cyrus’ army of data geeks to take over a much larger part of SxSW next year. We need talks on column-store databases and Hadoop, on linked data and the construction of the data commons, on how NLP and machine learning can power inspiring audience-driven websites, on the developing grammar of Information Visualization, and on Processing and Prefuse and R. Pete Skomoroch, Mike Driscoll and Christian Chabot all ended up skipping SxSW this year; we need them leading a panel discussion on how to visualize >10M point datasets with limited-bandwidth desktop and web interfaces. I’d like to hear Deepak Singh and one of the @cloudera’ns drop science about scalable cloud computing.

The evening was just informal mingling and conversation, but at the request of @mndoci and @dataspora, here is our name-droppy slice of the whirlwind:

@mrflip: Learned about how Zemanta is already putting Linked Data and NLP together to make blogging better. Jure is excited about infinite monkeywrench and might be brave enough to pre-alpha its inchoate HTML munger. Got to hear what Blaine Cook of Osmosoft is doing to solve the fractured twitter/facebook/’ve-never-heard-of ecosystem, and he gave some great feedback on our upcoming Twitter Census. Also got to learn, after pontificating that OAuth is hard, that I was talking to its architect; a great discussion with Blaine and ENTP Uruguay Evan Henshaw-Plath followed about the Rails authorization/identity/authentication stack.

Mike Migurski of @stamen is going to get together with infochimp @dhruvbansal to push the OpenStreetMap dataset into the Amazon Public Data Sets collection. Harper Reed of Threadless was running off for a 6am (ugh) flight to babysit servers in Chicago by the time we chatted, but pointed towards his Chicago Transit API project. His post on Hidden APIs is a great read, BTW. Ran into @Slicehost Matt Tanase at a party after; Rackspace is getting much Cloud-ier, including a 1.5 cents/hour pay-as-you-go 256MB slice offering. I’m hoping to talk later about our MachetEC2 project and get his thoughts about how to put open data on tap in the cloud. Jon Pierce and I discussed the Mets’ chances this year and what he sees for big data startup possibilities. Only got to briefly intersect with Andrew Turner about open geocommons, and was chagrined to learn I was shoulder to shoulder with one of gnip but didn’t get to chat. Hope to fix that later.

This meeting alone made SxSW worth it, and I’m looking forward to more discussion later. You can stalk me on twitter as @mrflip or at By the way, I’m giving a lightning talk on Open Data in government at Fiddler’s Hearth, 301 Barton Springs Rd at 12:30 — drop by or catch the webcast later.