Products & Features

How to create datasets that the rest of the world needs

We recently created a dataset for the web site that maps IP addresses to zip codes and census demographic information. The work involved is representative of the type of community we want to have involved with Infochimps in the future. The people who will find this dataset useful – web site owners, internet advertisers – are not always going to be the same people who can create such a dataset. This division of labor can only happen when experts at data gathering can share their data in a place where people who want to use the data can find it.

Our social media expert Maegan recently interviewed Carl, a member of our data team, to talk about this dataset creation process. You can find the IP-Census data he’s talking about here: http://infochimps.org/collections/ip-address-to-us-census-data.

M: Hi Carl, would you start by introducing yourself and telling us what you do for Infochimps?

C: I’m a member of the data team here at Infochimps. Basically, we’re the team in charge of gathering data that’s available on the web, cleaning it up and making it more useful for other people out there that are looking for this sort of data.

M: I can imagine how appealing that data is to a lot of people. Speaking of useful data, I heard that you recently came up with a collection of datasets that link IP addresses to Census information. Can you tell me more about it?

C: Well, we heard from a few people that that sort of thing might be interesting. There are a lot of people out there who want to know more about the people that come to their website. Using this dataset, they can get demographic details by using the IP address of their visitors. That way they can improve their understanding of their audience and target the content on their website better. The dataset that we have links IP addresses to zip codes, and then zip codes to all sorts of demographic data from the Census.

M: I saw that you have so many different types of information from the Census. Where did you go to find the data to mash together?

C: For the Census data, that’s a fairly well-known source. The US government has a Census website, Factfinder.census.gov, where you can go to download all sorts of information. As far as the IP to geolocation data, there are a lot of datasets available. We were looking for one that had good coverage of IP addresses, was available for free, and had a license that allowed us to take that data, do what we wanted with it and make it available on our site.

M: Is this a new kind of dataset? Or is it available elsewhere?

C: The IP to geolocation dataset is available from where we got it – at MaxMind. Linking that to the Census data is something that I don’t think we’ve seen elsewhere.

M: How did the process work once you had the data?

C: The Census data is divided into a lot of different geographic segments – national, state, city, county and all those sorts of things, but the IP geolocation data only uses zip codes. We wanted just the data from the Census that’s associated with the zip codes, so I had to comb through the Census data and pull out just the lines of the data that are associated with zip codes and then use that to match up to the IP addresses in the geolocation data.

M: Is it just how they’re organized?

C: Yeah, it’s more of how it’s organized. The Census data is organized into a few different files. You have one file that lists all the different breakdowns of how the data is divided up – like how I was saying, by state, city, zip code or the country. Each of those breakdowns was associated with this logical record number. Then, the actual Census data files have the logical record number at the beginning and then all the numbers associated with the different fields in the rest of the file. I had to pick out just all the logical record numbers that were associated with the zip codes in the first files and then pull all those out of the Census data to match it to the zip codes from the IP addresses.
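The two-step filter Carl describes can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the file layouts, column order, the "zip" label and every value in the sample data are made up for the example, and the real Census files are laid out differently.

```python
import csv
import io

# Hypothetical, simplified layouts; all values below are invented sample data.
# Geography file: logical record number, geography type, geography name
GEO_FILE = """\
0000001,state,Texas
0000002,county,Travis County
0000003,zip,78701
0000004,zip,78702
"""
# Data file: logical record number, then the demographic values
DATA_FILE = """\
0000001,20851820,32.3
0000002,812280,30.5
0000003,3855,31.0
0000004,22534,29.8
"""

def zip_record_numbers(geo_lines):
    """Map logical record number -> zip code, for zip-level rows only."""
    return {
        logrecno: name
        for logrecno, geo_type, name in csv.reader(geo_lines)
        if geo_type == "zip"
    }

def zip_rows(data_lines, recno_to_zip):
    """Read the data file line by line, keeping only zip-level records."""
    for row in csv.reader(data_lines):
        logrecno, values = row[0], row[1:]
        if logrecno in recno_to_zip:
            yield recno_to_zip[logrecno], values

recno_to_zip = zip_record_numbers(io.StringIO(GEO_FILE))
census_by_zip = dict(zip_rows(io.StringIO(DATA_FILE), recno_to_zip))
```

Only the small table of record numbers has to stay in memory; the large data file is read one line at a time.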

M: I would imagine that Census data would involve big files – did this make them difficult to manage?

C: Yeah, the Census data files are really large and so it took a lot of space to load everything into memory. Then, I made a list of what data we needed from the Census data files and searched through them line by line to match zip codes to demographic information.

M: That sounds like a lot of work. Did you have to do anything else to process the data?

C: The other thing that I did was figure out the column headings to make it more useful. The way it was presented by the US Census bureau is that each column of data has a column heading that is just a code that you look up somewhere else to figure out what it actually meant. I went through and did a lot of manual editing to make the column headings more readable. Now if you just look at it, you have a better idea of what’s actually going on and it’s not just meaningless code.

M: How did you find data with licenses that actually let you mash them?

C: We were looking for specific datasets that had the licenses with certain properties that let you freely download, mash and mix up the data with other datasets, and sell it on your own site or do anything commercial with it. Of course, most of these licenses have attribution requirements, so we made sure to list all our sources in the dataset. The final dataset that we have available clearly says that this data originally came from the US Census Bureau and this MaxMind website.

M: In the end, what licenses did you put on the dataset that you made?

C: The license that is on there now is a very open license that lets users use the data for whatever they need. It is the Open Database License.

M: Are there any other difficulties you faced?

C: One of the issues that we wanted to make sure was cleared up was that the IP address data that we got was reliable and would cover a lot of IP addresses. It needed to have broad coverage of general IP addresses. We did a quick test and used the logs from our own website, took IP addresses from 6 months’ worth of page visits, and ran all those IP addresses through the IP address database. It turned out that it matched over 90% of the IP addresses that we had, and so that was a pretty good indication that the IP address dataset we had was fairly complete and had very good coverage compared to others, which we heard would have only 50% coverage.
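A coverage check like this one might look something like the sketch below. The range table, its (first address, last address, zip code) layout and the log addresses are hypothetical sample data, and the lookup assumes the ranges do not overlap.

```python
import bisect
import ipaddress

# Hypothetical sample of an IP-range table: (first address, last address, zip code)
RANGES = [
    ("192.0.2.0", "192.0.2.255", "78701"),
    ("198.51.100.0", "198.51.100.127", "10001"),
]

# Convert addresses to integers and sort by range start (ranges assumed non-overlapping)
TABLE = sorted(
    (int(ipaddress.ip_address(lo)), int(ipaddress.ip_address(hi)), zip_code)
    for lo, hi, zip_code in RANGES
)
STARTS = [lo for lo, _, _ in TABLE]

def lookup_zip(ip):
    """Return the zip code whose range contains ip, or None if no range matches."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(STARTS, n) - 1  # last range starting at or before n
    if i >= 0 and n <= TABLE[i][1]:
        return TABLE[i][2]
    return None

def coverage(ips):
    """Fraction of distinct addresses that the range table can place."""
    unique = set(ips)
    matched = sum(1 for ip in unique if lookup_zip(ip) is not None)
    return matched / len(unique)

# Made-up addresses standing in for the ones pulled from web server logs
log_ips = ["192.0.2.10", "198.51.100.5", "198.51.100.200", "192.0.2.10"]
print(f"coverage: {coverage(log_ips):.0%}")
```

Counting distinct addresses rather than page visits keeps one very active visitor from skewing the figure; whether the original test did the same is not stated.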

M: Is the availability of the IP addresses a privacy concern?

C: I don’t think it’s a privacy concern because it’s not matching it up to a specific address, but it’s matching it up to a zip code. Since zip codes have a very large number of people, it’s hard to determine if that IP address is coming from one specific person or even one specific household.

M: Ok, thank you very much, Carl.

Data.gov import

Infochimps is pleased to announce a recent import of all of the data from Data.gov!  Data.gov was one of the more exciting things to happen last year for the world community and it has had a big impact in the US and internationally by setting a precedent for government data sharing.  We hope that including these datasets in our collection increases their visibility and makes them more useful to the world at large.

The fact that users can edit this data makes it much more usable and interesting.  Unlike Data.gov, users on Infochimps can upload datasets and even upload different versions of datasets to the site.  So when a dataset comes from the government in some messy, incomprehensible format, you can do what Infochimps user Ganglion did and upload a better version.  This type of Wikipedia-style curation of datasets is where Infochimps got its name.  Because data drudge work (column titles, formatting issues, etc.) is fit for a chimp, this type of work should only be done once.  And may the result live on Infochimps!

Take a look at the Data.gov collection to get started.

Twitter data, open questions to Developers, Academics, and Data Geeks

We are excited to announce the re-release of the Twitter datasets, and a discount to the Twitter API Map dataset.  Again, the datasets are:

and

Conversation Metrics, with Token Count of:

This time the data is being released with Twitter’s approval.  We are talking with them about how we can increase access to more and more bulk data, and need your help in showing them how useful this data really is.

We want to make clear to people with privacy concerns that we absolutely hear and respect your points, and so does Twitter.  These datasets contain NO personally identifiable information, they do NOT contain whole tweets, and they meet the guidelines laid out in this EFF document (on personally id’able info).

We encourage everybody to take advantage of this weekend’s discount and go build great things with this data.  Let’s show Twitter and the world what is possible when one has access to bulk data:

  • Data geeks and Visualization studs: what would you do if you could run jobs across our massive crawl (or the full Twitter graph)?
  • App devs: what data do you want those nerds to extract?  How would it improve the experience of Twitter or enable new things?
  • Businesses: how can this data improve your services?  How can this data make you money?
  • Academic researchers: what amazing things will you uncover by exploring the social network’s deep structure?

Reach out to us in the comments or send us ideas at info@infochimps.org

Twitter data update

Our launch of the Twitter data was a great success, and we thank Marshall Kirkpatrick at ReadWriteWeb (also) and Jordan Golson at GigaOm for their coverage. The community reaction has been overwhelming and energizing. We accomplished our two main goals: crack open some issues close to our hearts and kick-start the conversation about sharing data online.

Twitter has advanced some reasonable concerns, however, and has asked us to take the datasets down. We have temporarily disabled downloads while we discuss licensing terms. The outcome of discussions will, we hope, encourage more internet services to open up and share data in bulk. The two biggest issues this data release highlighted are third party redistribution and user privacy.

Redistribution rights. Twitter maintains a legendarily open API:

“Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services.
“We encourage and permit broad re-use of Content. The Twitter API exists to enable this.” [highlighting added by us]

However, Twitter wants to more closely control who has access to data at massive scale and to prevent its malicious use. We understand this concern — innovation is always a double-edged sword. The applications and services that can use this data to make the world a better place far outnumber those with bad intentions, however, and good people need better access to this type of data. The best solution is to apply a reasonable license to the data. We are addressing this in our talks with Twitter, and we expect to have a resolution soon.

User privacy. What little criticism we heard from the community concerned the potential for a breach of user privacy. This is an issue with many types of internet data, and one we take seriously. We ensured that the datasets released posed no such dangers. The Token Count data contained no personally identifying information, only what the entire mass of Twitter users were discussing over time. The API ID Mapping Dataset is simply a sort of phone book for the Twitter APIs: it converts screen names to numeric IDs and reveals absolutely nothing about the corresponding user. Infochimps.org’s policy is to not host any personally identifying information of non-consenting individuals, and we apply this rule to any data that goes on the site from any source.

These are hard issues and it took a bold move to bring them into the open. It will take further sharing and discussion to establish best practices for these concerns so that Twitter and other internet services (Facebook, Amazon, etc.) can share their data to the benefit of the greater online community. Stay tuned while we agree upon appropriate licensing for open sharing of this social data.

Twitter Census: Publishing the First of Many Datasets

As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that the service currently allows. We agree: some of the sexiest uses of data require processing not just what is happening now, but the vast historical record. Twitter doesn’t provide the only use case for this, but until now its historical bulk data has been hard to find.

Today we are publishing a few items collected from our large scrape of Twitter’s API. The data was collected, cleaned, and packaged over twelve months and contains almost the entire history of Twitter: 35 million users, one billion relationships, and half a billion Tweets, reaching back to March 2006. The initial datasets are a part of our Twitter Census collection.

The first dataset, a Token Count, counts the number of times each token (hashtags, smileys and URLs) has been tweeted. The data is available for free by month and for pay by hour. Think about comparing this data to the stock market, new movies, new video games, or even trendingtopics.org. For example, use it to track the adoption of Google Wave by the rate of its mentions. On one payload’s page you will find a snippet with a sample taken during Kanye West’s outburst in September, and on another’s you can see that the “:)” emoticon has been used 135,000 times.
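To give a feel for what a token count involves, here is a minimal sketch. The regular expression is an assumption made for illustration (it recognizes URLs, hashtags and only a handful of simple smileys), not the tokenizer actually used to build the dataset, and the tweets are made up.

```python
import re
from collections import Counter

# Illustrative pattern only; alternatives are tried left to right at each position
TOKEN_RE = re.compile(
    r"https?://\S+"      # URLs
    r"|#\w+"             # hashtags
    r"|[:;]-?[()DP]"     # a few simple smileys such as :) ;-) :D
)

def count_tokens(tweets):
    """Count every URL, hashtag and smiley found across the given tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(TOKEN_RE.findall(text))
    return counts

# Made-up sample tweets
tweets = [
    "Trying out #googlewave :)",
    "Invites for #googlewave here http://example.com/wave :)",
]
counts = count_tokens(tweets)
print(counts.most_common())
```

Grouping the tweets by month or by hour before counting would give the same kind of time series the dataset provides.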

The second dataset solves a large problem developers have when they use Twitter’s Search API and the Twitter API, as each API gives back a different unique string for every user on Twitter. This dataset maps user IDs between the two APIs for 24.5 million users. This mapping should be a godsend to Twitter app developers, as it allows them to easily combine data from each API, letting API calls for friends lists mix easily with searches on the Twitter Search API.
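As a rough sketch of how such a mapping gets used, the lookup below translates Search API user IDs into main-API user IDs. The row layout, the field names (`from_user_id`, `api_user_id`) and all the IDs are hypothetical and chosen only for the example.

```python
# Hypothetical rows: (screen name, ID used by the Twitter API, ID used by the Search API)
MAPPING_ROWS = [
    ("alice", 1001, 52001),
    ("bob", 1002, 52002),
]

# Index the mapping by Search API ID for constant-time lookups
search_to_api = {search_id: api_id for _, api_id, search_id in MAPPING_ROWS}

def translate_search_results(results):
    """Attach the main-API user ID to each search hit (None if the user is unmapped)."""
    return [
        {**hit, "api_user_id": search_to_api.get(hit["from_user_id"])}
        for hit in results
    ]

# Made-up search hits
hits = [
    {"from_user_id": 52002, "text": "hello"},
    {"from_user_id": 99999, "text": "who am I?"},
]
translated = translate_search_results(hits)
```

With the translated IDs in hand, a search hit can be joined directly against friends lists fetched from the other API.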

These datasets are only views from the massive collection we have been growing over the last year. We will be releasing additional datasets regularly over the next few weeks so please check back for updates. If you’d like a custom slice or analysis done on this data, please get in touch at imw@infochimps.org.

With the release of this data, we hope to send a signal that this data is valuable and useful to real-time search engines, Twitter apps, and social media researchers. This should start a conversation about where value really lies in this type of data and the various ownership and privacy issues that arise, and establish that Infochimps.org is the place to go to find data. We invite interested parties to get in touch and begin uploading their data (try invite code “newsupplier”) today as part of the Infochimps marketplace.

APIs and Datasets, living in harmony

The most popular way to access data on the web right now is through an API.  APIs provide real-time data, an incredible advantage, and outsourced computation.  These are advantages for the end-user and the developer, while the API provider has to eat the cost of providing such a service.  It is worth it, though, for the provider of the API, as a myriad of services can be conjoined with their primary service.

There are some things an API can’t give you, though.  An API generally cannot give you historical data: Twitter’s API, for example, only lets you go back XX number of tweets.  This means that a service built late in the game may not carry the same value as a service that was built in the early days of an API, as the latter’s data goes back further.

Next, APIs only give you pieces.  The scale of questions you can ask is limited by the rate limit and the size of the pieces returned.  Services can’t ask for everything and they may be further limited by the bandwidth and load on the primary API.

The types of questions we’re talking about have to do with the deep structure of the data in question.  One of the reasons our near-complete scrape of Twitter’s friend graph was so popular is that this type of dataset is extremely valuable to network researchers.  The sort of research a graph like Twitter’s makes possible is phenomenal.  Without such a dataset, researchers are left like the Antarctic explorers of the past – slowly crawling new territory, making maps and filling in details only as they come along, piece by piece.

The value APIs provide to the service and the outside world is undeniable.  The problems that APIs leave open can be solved by those services providing complete dumps periodically.  These datasets of complete and historical data will not only let researchers get to work improving their science, but will also allow applications to seed their service with the latest dataset, then begin updating through the API.
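The seed-then-update pattern could be sketched like this. Everything here is an assumption made for illustration: records are taken to carry an increasing numeric `id`, and `fake_fetch_since` is a made-up stand-in for whatever "give me everything newer than X" call a real API offers.

```python
def seed_from_dump(dump_records):
    """Load a complete dump into a local store and note the newest record ID."""
    store = {rec["id"]: rec for rec in dump_records}
    last_seen = max(store, default=0)
    return store, last_seen

def apply_updates(store, last_seen, fetch_since):
    """Pull only records newer than last_seen through the API and merge them in."""
    for rec in fetch_since(last_seen):
        store[rec["id"]] = rec
        last_seen = max(last_seen, rec["id"])
    return last_seen

def fake_fetch_since(since_id):
    """Hypothetical stand-in for a live API call returning records newer than since_id."""
    live = [{"id": 3, "text": "c"}, {"id": 4, "text": "d"}]
    return [rec for rec in live if rec["id"] > since_id]

# Made-up dump contents
dump = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
store, last_seen = seed_from_dump(dump)
last_seen = apply_updates(store, last_seen, fake_fetch_since)
```

The application asks the API only for what the dump did not already contain, which is where the reduced load comes from.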

Should services share their data on a platform like Infochimps, they not only provide a great service to applications and researchers, but they also reduce their own costs.  The load on their API is lighter as fewer requests have to be made for data.  And, when researchers have the complete dataset sitting on their hard drive, the API’s provider will not be depended upon for compute time, as the researcher’s local access to the data will make his job much faster and easier.

The two solutions for sharing data are complementary.  Freebase does a great job at this, and we are hoping other services will soon follow suit.

It's Hot, Damn Hot. So Hot I saw a Chimp in Orange Robes Burst into Flames.

It’s been ridiculously hot ridiculously early this year in Austin. A friend passed along this link to a visualization of 100+ degree days over the last 10 years. The author couldn’t find data extending back farther than 2000, but luckily I knew where to look.

I pulled the NCDC weather for Austin from 1948-present (see infochimps.org link for details) and got my Tufte on.

This temperature cycle is hotter than but comparable to the 1950-1965 era. I’ve got no idea if it’s global warming or the peak of a cycle. The fundamental conclusion — that this year so far, 2000 and 2008 were damn hot — stands up well.

Congrats Retrosheet – another decade of rich Baseball data online

Congrats to Retrosheet, who now have full major-league baseball box scores from 1920-1930 online! (This is in addition to full box score coverage for 1953-2008, and broad coverage of box scores and play-by-play data from 1871-2008). As Nate Silver has said, “Baseball is the perfect dataset”, and we would not have this astonishingly rich and detailed dataset if not for the dedicated crowdsourced efforts of the Retrosheet team.

Infochimps metadata entries for these datasets:

The Asdrubal Cabrera Hall of Fame

Prompted by my friend’s skepticism that the ballplayer Milton Bradley is really so named, I’m exhuming this old post from elsewhere. — flip

During the 2007 baseball playoffs, announcer Tim McCarver perspicaciously observed that “Asdrubal Cabrera is the only player in the majors with that first name”. Thus inspired, I present The Asdrubal Cabrera Hall of Fame: Major League ballplayers in unique possession of their particular first name. (Some are nicknames, many are not — but these are their official names, as used in newspapers and the rolls of history. F’reals.)

You may be familiar with Honus Wagner, Eppa Rixey, Boog Powell or Yogi Berra. But have you heard recounted the storied diamond exploits of Firpo Mayberry, Zoilo Versalles, Pi Schwert or Bevo LeBourveau? OK, then how about Mysterious Walker, The Only Nolan, or Phenomenal Smith? Mul Holland, Sixto Lezcano, Welcome Gaston or Mox McQuery? There’s a bunch more after the jump, and a complete listing here, including links to each player’s baseball reference page.

For some dinnertime fun over the holidays, discuss the relative merits of naming your next child after Urban Shocker, Twink Twining, Pussy Tebeau, Bris Lord, Boob Fowler, Crazy Schmit, Creepy Crespi, Cuddles Marshall, Vinegar Bend Mizell, or Buttercup Dickerson. (Unfortunately, 12 other “Rusty”s keep fan favorite Rusty Kuntz off this list, and believe it or not two other “Stubby”s bar the way for Stubby Clapp. I apologize to anyone whose internet filter has or has not prevented reading this apology.)

Thanks to the Baseball Databank and Retrosheet, I had this dataset on hand, and thanks to a monastic life of nerdity I had the SQL chops to pull up this query between innings.  But I should be able to do this with anything, whether or not I know a SQL Query from a Queer-Eye Sequel, for silly stunts and for changing lives alike.
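The original query was written in SQL and is not reproduced here. As a rough equivalent, the same idea (keep the players whose first name appears exactly once) can be sketched in Python. The six-player list is a tiny hand-picked sample, so the result only illustrates the technique and says nothing about the full roster.

```python
from collections import Counter

# Tiny illustrative sample of (first name, last name) pairs, not the full player table
players = [
    ("Asdrubal", "Cabrera"),
    ("Yogi", "Berra"),
    ("Rusty", "Kuntz"),
    ("Rusty", "Staub"),
    ("Orlando", "Cabrera"),
    ("Orlando", "Hudson"),
]

# Count how many players share each first name
first_name_counts = Counter(first for first, _ in players)

# Keep only the players who are the sole owner of their first name
hall_of_fame = sorted(
    f"{first} {last}" for first, last in players if first_name_counts[first] == 1
)
print(hall_of_fame)
```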

Imagine instead I were a public health expert, interested in the effects of limiting medical residents to an 80-hour work week. Might lives be saved if I could effortlessly pull up historical data on rates of doctor-induced complications, board of medicine complaints, relative rates of med school and law school applications, and open-government data on medical regulations?

The long-term mission of infochimps.org is to democratize this: to put the world’s analytic data at our fingertips, supporting tools that let anyone manipulate, interrogate, visualize and explore that data.  Giving baseball geeks a chance to show up Tim McCarver isn’t much of a start, but here we are.

More awesome first names after the jump….

Massive Scrape of Twitter’s Friend Graph

UPDATE:

We’ve posted several Twitter datasets on Infochimps. Take a look and build something cool!

UPDATE:

We’ve taken the data down for the moment, at Twitter’s request. STAY CALM. They want to support research on the twitter graph, but feel that since this is users’ data there should be terms of use in place. We’ve taken the data down while those terms are formulated. I pass along from @ev: “Thank you for your patience and cooperation.”


The infochimps have gathered a massive scrape of the Twitter friend graph.  Right now it weighs in at

  • about 2.7M users: we have most of the “giant component”
  • 10M tweets
  • 58M edges

(These and other details will be updated as further drafts are released. See below for technical info).  This is still in rough, rough draft but this dataset is so amazingly rich we couldn’t help sharing it.  We have not done all the double-checking we’d like, and the field order will change in the next (12/30) rev.  We’ll also have a much larger dump of tweets off the public datamining feed.

The data is offline at the moment pending some TOS from twitter.com. If you’re interested in hearing when it’s released, follow the low-traffic @infochimps on twitter or look for a post here.

Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.
