Monthly Archives: December 2010

Infochimps Acquires Data Marketplace

We’re pleased to announce that Infochimps has acquired Data Marketplace. We’ve been admiring what they have been doing for a while now, so we jumped at the opportunity when it presented itself. Data Marketplace is a Y-Combinator-funded company, founded by Steve DeWald and Matt Hodan at the beginning of this year, with the original vision to be the “Amazon for structured information.”

When I met with Steve to chat about his vision, it was apparent that our two companies shared many of the same philosophies and visions for the future of Big Data. Hell, even our platforms are built upon many similar foundations and tools, like Ruby on Rails and Heroku. So, the transition has been smooth, and the site is already running on Infochimps servers.

From Data Marketplace’s Co-Founder and CTO, Steve DeWald:

Data is one of the most valuable assets in the world. We use it for decisions every day, and enormous industries are built around compiling and organizing it. It costs almost nothing to share, but despite that there is no single pervasive marketplace for buying and selling data. That’s the problem we tried solving with Data Marketplace.

It’s a problem because the fragmented nature of data creates friction for those wanting to share it. As a seller of data, there’s no easy or standardized way to monetize it. Oftentimes the expectation is to sell it in an expensive research report and make the raw data separately available by request. That’s fine, but that captures only a fraction of a fraction of a percent of all the useful data that people could be selling. Likewise, there’s a lot of data people want to sell that potential buyers can’t find. As a consumer of data, I often search on Google for the data I’m looking for, but frequently the data I want is behind a paywall and its keywords are not properly indexed for search. All these problems could be solved, for the betterment of humanity, with a standardized and open marketplace for data.

Although Matt and I have moved on to other projects (I’m selling custom-made suits online), I am happy to be putting our work in the hands of the talented team at Infochimps, which has built the world’s largest open marketplace for data.

Thanks for the kind words, Steve.

We’re excited to integrate Data Marketplace into Infochimps. As Nick Ducoff, Infochimps CEO, says:

Just as Salesforce recently extended their brand with Database.com, we’re extending ours with Data Marketplace, which fits well into our overarching strategy to be the destination on the web for data and data services.

Q&A relating to the acquisition:

If I have uploaded my data to Data Marketplace, what’s going to happen with it?

Will it still be available for purchase, and will I receive my royalties? All datasets that are available on Data Marketplace will soon be available through Infochimps as well, and will continue to live on Data Marketplace. Customers will still be able to browse and purchase the data, and we will ensure that you receive your royalties from sales of that data.

What will happen to my user account on Data Marketplace?

Your account will survive on Data Marketplace, and you will soon receive an email with details on how to log in. It’s important for us to maintain the community, and we will notify you of any changes to your account in as few emails as possible.

What will happen to the Data Requests on Data Marketplace?

We will continue to support the data requests feature on Data Marketplace, and we do not plan to remove or change any of the requests currently on the site. We will notify requesters of changes to their requests or their account.

Measuring online influence: The case for Big Data

Measuring influence online is in its infancy. Unrefined ‘metrics’ dominate the space, and much of what exists today is of little value and carries little statistical meaning. Most measure only what is easy to measure: number of Twitter followers, number of times a word was mentioned in the last week, Facebook ‘likes’, and bizarre – often undisclosed – methods of calculating someone’s online social ‘rank’. There are even gimmicky schemes to produce ‘measurements’, like Fast Company’s “Influence Project”*.

Why do we do things this way? Because it is easy and because dealing with big data is hard.

The most valuable measurements that come out of this space live in the analysis of big data. To be effective, one needs a global perspective and all the connections – not just of the 100 million active Twitter users, but the 4 billion connections between them, and even more: the billions of additional connections implied by mentions, retweets and replies.

A simple search of Twitter yields a small sample of unfiltered tweets in a short time frame. A count of followers is as easy as visiting a user’s profile page.
Those metrics miss what is important. Which tweets were most significant? Which users made the most impact? Are someone’s followers actual people or just spam bots? Are they just part of an auto-follow-back Ponzi scheme initiated by some live-for-3-days-and-milk-AdWords-for-all-it’s-worth web app?
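The questions above hint at signals that are computable from raw account data. As a rough illustration – the signals and thresholds here are invented for the example, not any service's actual method – a spam-bot heuristic might look like:

```python
# Hypothetical heuristic for flagging spam-like followers.
# Signals and thresholds are illustrative assumptions only.
def looks_like_spam_bot(followers, following, tweets, account_age_days):
    """Return True if the account shows common spam-bot signals."""
    if account_age_days < 3 and following > 500:
        return True  # brand-new account mass-following others
    if followers > 0 and following / followers > 20:
        return True  # follows far more accounts than follow it back
    if tweets == 0 and following > 100:
        return True  # silent account that exists only to follow
    return False

print(looks_like_spam_bot(followers=5, following=800, tweets=0, account_age_days=2))        # True
print(looks_like_spam_bot(followers=300, following=280, tweets=1200, account_age_days=900)) # False
```

Real bot detection would, of course, learn such signals from data rather than hard-code them, but even toy rules like these require per-account fields that a simple search or follower count never surfaces.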

But what should we measure?

Last week, HP published a study on what makes a tweet influential and the problems around measuring “influence”. Several bloggers responded (1, 2, 3). It is generally agreed that “influence” is still not fully defined and that retweets alone are not the definition. Retweets are one way to measure; links are another. “Engagement” as a whole is an intersecting issue. Clearly, though, there is interest in measuring all of these things – if only we could define them.

How should measurements be delivered?

Companies such as Klout give composite numbers that, while useful, fall short for a wider range of use cases such as spam filtering, topic relevancy and understanding relationships in one’s network. “Influence” is a broader topic that spans much more than global rank or one’s own reach. Transparency is an issue as well. While a single number is easy to act on, it is also a set of magically combined factors that comes with no clue as to how they are combined and weighted. People should know where their data came from and how it was produced if they are to trust it.
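To make the transparency point concrete, here is a sketch of what an auditable composite score could look like – the factor names and weights are invented for the example, not Klout’s or anyone’s actual formula:

```python
# A transparent composite influence score: every input factor and its
# weight is visible, so the final number can be audited.
# Factor names and weights are illustrative assumptions.
WEIGHTS = {"reach": 0.4, "retweet_rate": 0.35, "reply_rate": 0.25}

def composite_score(factors):
    """Weighted sum of normalized factors (each in [0, 1]), scaled to 0-100."""
    score = sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
    return round(score * 100, 1)

def explain(factors):
    """Show each factor's contribution, making the combination auditable."""
    return {name: round(WEIGHTS[name] * factors[name] * 100, 1) for name in WEIGHTS}

user = {"reach": 0.6, "retweet_rate": 0.3, "reply_rate": 0.8}
print(composite_score(user))  # 54.5
print(explain(user))          # {'reach': 24.0, 'retweet_rate': 10.5, 'reply_rate': 20.0}
```

Publishing something like the `explain` breakdown alongside the single number is what turns a magic score into one people can reason about and trust.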

Solving for Influence

The data is out there and the tools exist to analyze it, but the world is still unsure of what to tell those tools to do.

Infochimps uses the full friend graph and historical tweets of Twitter back to 2006 to produce metrics similar to those HP discusses, such as Sway (how much a user gets retweeted), trstrank (a global ranking of influence, in the style of Google’s PageRank) and Enthusiasm, a reverse measure of what HP refers to as “passivity” – that is, how often someone retweets others.
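A “PageRank-style” ranking over a follow graph can be sketched in a few lines. This is a generic power-iteration example on a tiny made-up graph – the actual trstrank methodology is not described here:

```python
# Generic PageRank-style rank over a tiny follow graph, in the spirit
# of trstrank. Illustrative only; not Infochimps' actual algorithm.
def pagerank(follows, damping=0.85, iterations=50):
    """follows maps user -> list of users they follow (every user must
    appear as a key). Rank flows from follower to followed, so being
    followed by highly ranked users raises your own rank."""
    users = sorted(follows)
    rank = {u: 1.0 / len(users) for u in users}
    for _ in range(iterations):
        new = {u: (1 - damping) / len(users) for u in users}
        for follower, followed in follows.items():
            if followed:
                share = damping * rank[follower] / len(followed)
                for target in followed:
                    new[target] += share
            else:
                # dangling node: spread its rank evenly over everyone
                for u in users:
                    new[u] += damping * rank[follower] / len(users)
        rank = new
    return rank

graph = {"alice": ["carol"], "bob": ["carol"], "carol": []}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # carol
```

On the real Twitter graph the same idea runs over billions of edges, which is exactly why this is a big-data problem rather than a profile-page lookup.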

There exists a very basic problem in nomenclature.  What are the words we should be using and what are the human behaviors they represent?

More importantly, HP, Infochimps and others are still discovering just what should be analyzed. What is needed is for people in other fields to fill in those gaps. In sociology and marketing: what is the difference between fame, interestingness, influence and even infamy? What constitutes the ‘humanness’ of a user (are they real, or just a bunch of employees tweeting for a celebrity)? What are influence and engagement? What should we be looking for in relationships? In the social CRM space: how should you identify relationship networks? What are the best ways to find a route to new leads?

What are the use cases in each of these fields for the data?  The big data world has the tools to quantify the data; it just needs to know which questions to answer.  Once we know what behaviors to look for, they can be translated into signatures that are identifiable in data.

Finally, we also need a methodology for combining the data into actionable metrics.

Measuring what is difficult instead of what is easy is a game changer. In fact, not just a game or a ballpark, but a whole sport. The knowledge is in the data.

* If you must visit it, here’s the address:

Banana Bread (of doom?)

I’ve been posting data to the Infochimps site for a little while now, learning as I go, and I keep coming across things that are really interesting – reports and the like – that just don’t have a real home in our repository because they’re translations, visualizations, interpretations, or manipulations of datasets. If datasets are bananas, then some of what I’m finding is kinda like banana bread. :)

Anyway, I thought it’d be cool to share some of the tasty banana bread I come across.

Maybe this one is morbid to start off with, but I’ve been working on posting political and legal datasets for the last week or so and today I was posting links to datasets from the Death Penalty Information Center when I found this recently updated (11/29/2010) Fact Sheet about the Death Penalty that I thought I’d share:

[Screenshot of the Death Penalty Fact Sheet, captured 2010-12-03]

Death Row Exonerations by State 1973-present.