Products & Features

Our 7 Most Popular Data Categories

As a marketplace for data, we are often asked what the most popular types of data are. Interestingly, the answer isn’t very intuitive. For example, who would have guessed that one of our most downloaded datasets is a crossword puzzle word list? With this in mind, we decided to investigate and come up with a more complete answer.

Using metrics from the Infochimps website, such as search queries, downloads, data requests, and page views, we came up with a list of the top 7 most searched-for data categories:

1. Social Networks
2. Economics & Finance
3. Demographics
4. Education
5. Sports
6. Geography
7. Music

Why is this list important?
Many different people may find this information useful: from data sellers who want to know what kinds of data people would buy, to trend watchers tracking what is popular, to anyone curious about what types of data are out there.

What can we learn from this list?
One thing we can take away from this list is a good indication that data is becoming more mainstream. The data people search for isn’t limited to the academically oriented information that traditional data users (such as researchers) would use. Categories like sports and music appeal to a much wider audience.

Furthermore, there is demand for these types of data, but not always a supply. Even as the usefulness of data becomes more widely recognized, the infrastructure to support its exchange is still forming. Infochimps is one of the platforms that aims to bridge that gap, through features like the dataset request page, and to provide a venue for data exchange. If you’ve got data in these categories, know that there are people out there who want it.

Announcing bulk redistribution of MySpace data

Today, we’re excited to announce the availability of MySpace data for bulk download on Infochimps. This is a major step forward for MySpace in our eyes, a move that signifies their seriousness about data and about the developers and academics who work with it.

This data is not sold by MySpace; it is given out for free through their API and then packaged by Infochimps for redistribution. By giving developers free access to publicly available real-time data (such as status updates, music, photos, and videos), MySpace reinforces its commitment to powering the real-time social Web and the development of open standards.

  • Every day, MySpace processes over 32 million activities and updates
  • MySpace opened up its real-time data with free-to-use APIs letting developers create robust products
  • MySpace offers more scale and richer content, such as music, photos, videos, and apps, than anyone else
  • Real-time data input and the ability to then share that in real-time will drive the socialization of content on the web
  • Data available for bulk download will help usher in the next generation of data-driven research and application development. Now, using a dataset like word count by hour, developers and content providers can better understand how things are talked about and when.

The benefits of having data available for bulk download, rather than just through an API, are numerous. Developers can start with a sample dataset and get their apps off the ground faster. Academics are much better served by a .csv than by an API, and developers can take advantage of the datasets these experts create as a result of their research. Opening one’s data to the big data community makes all this and much more possible.

APIs aren’t enough. New tools like Hadoop allow for the processing of huge datasets but require a local copy of the entire dataset. The advanced analytics that come from computing on top of a huge dataset (and at 25 GB/day the MySpace stream is massive) will power the next generation of applications.
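
To make the contrast concrete, here is a minimal sketch of the kind of “word count by hour” job mentioned above, run locally against a bulk dump rather than through an API. The file name and record layout are assumptions made up for the example; a production job would run the same logic as a Hadoop map/reduce over the full 25 GB/day stream.

```python
from collections import Counter

# Hypothetical bulk dump layout (one status update per line):
#   2010-03-01T14:05:22<TAB>what the user posted
counts = Counter()
with open("myspace_status_updates.tsv", encoding="utf-8") as dump:
    for line in dump:
        try:
            timestamp, text = line.rstrip("\n").split("\t", 1)
        except ValueError:
            continue  # skip malformed lines
        hour = timestamp[:13]  # e.g. "2010-03-01T14"
        for word in text.lower().split():
            counts[(hour, word)] += 1

# The ten most frequent (hour, word) pairs across the whole dump
for (hour, word), n in counts.most_common(10):
    print(hour, word, n)
```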

Developers looking for this data can come to Infochimps to find what they need. Let’s foster a division of labor between the people who are experts at mining this data for insight and the pros who can build applications on top of those discoveries. For example, Ryan Rosario of UCLA created a dataset of users’ moods by zip code, providing historical emotional context for researchers, psychologists, and perhaps a developer looking to take advantage of this MySpace feature.

We’ll premiere the “best of MySpace” datasets in the hopes of supporting a relationship between MySpace and data-driven research and development. Any API owners out there should get in touch to talk about how we can make your data computable for the big data community.

UPDATE: Here is a visualization of users with geolocations from our dataset, User locations by lat/long:
[Figure: map of MySpace user locations plotted by latitude and longitude]

    How to create datasets that the rest of the world needs

We recently created a dataset for the website that maps IP addresses to zip codes and census demographic information. The work involved is representative of the kind of community we want to build around Infochimps. The people who will find this dataset useful, such as website owners and internet advertisers, are not always the same people who can create such a dataset. This division of labor can only happen when experts at data gathering can share their data in a place where the people who want to use it can find it.

    Our social media expert Maegan recently interviewed Carl, a member of our data team, to talk about this dataset creation process. You can find the IP-Census data he’s talking about here: http://infochimps.org/collections/ip-address-to-us-census-data.

    M: Hi Carl, would you start by introducing yourself and telling us what you do for Infochimps?

C: I’m a member of the data team here at Infochimps. Basically, we’re the team in charge of gathering data that’s available on the web, cleaning it up, and making it more useful for the people out there who are looking for this sort of data.

    M: I can imagine how appealing that data is to a lot of people. Speaking of useful data, I heard that you recently came up with a collection of datasets that link IP addresses to Census information. Can you tell me more about it?

C: Well, we heard from a few people that that sort of thing might be interesting. There are a lot of people out there who want to know more about the people who come to their website. Using this dataset, they can get demographic details from the IP addresses of their visitors. That way they can better understand their audience and target the content on their website. The dataset we have links IP addresses to zip codes, and then zip codes to all sorts of demographic data from the Census.
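
To give a flavor of what Carl describes, here is a minimal sketch of looking up a visitor’s demographics from the finished dataset; the file name and column names are hypothetical stand-ins for the real layout.

```python
import csv
import ipaddress

# Hypothetical columns for the finished IP-to-census dataset:
#   ip_range_start,ip_range_end,zip_code,total_population,median_age,...
rows = []
with open("ip_to_census.csv", newline="") as f:
    for row in csv.DictReader(f):
        rows.append((int(row["ip_range_start"]), int(row["ip_range_end"]), row))

def demographics_for(visitor_ip):
    """Return the demographic row whose IP range covers this visitor, or None."""
    n = int(ipaddress.ip_address(visitor_ip))
    for start, end, row in rows:
        if start <= n <= end:
            return row
    return None

hit = demographics_for("203.0.113.42")
if hit:
    print(hit["zip_code"], hit["median_age"])
```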

    M: I saw that you have so many different types of information from the Census. Where did you go to find the data to mash together?

C: For the Census data, that’s a fairly well-known source. The US government has a Census website, Factfinder.census.gov, where you can go to download all sorts of information. As for the IP-to-geolocation data, there are a lot of datasets available. We were looking for one that had good coverage of IP addresses, was available for free, and had a license that allowed us to take the data, do what we wanted with it, and make it available on our site.

    M: Is this a new kind of dataset? Or is it available elsewhere?

    C: The IP to geolocation dataset is available from where we got it – at MaxMind. Linking that to the Census data is something that I don’t think we’ve seen elsewhere.

    M: How did the process work once you had the data?

C: The Census data is divided into a lot of different geographic segments (national, state, city, county, and so on), but the IP geolocation data only uses zip codes. We wanted just the Census data associated with zip codes, so I had to comb through the Census data, pull out just the lines associated with zip codes, and then use those to match up to the IP addresses in the geolocation data.

    M: Is it just how they’re organized?

C: Yeah, it’s more a matter of how it’s organized. The Census data is organized into a few different files. You have one file that lists all the different breakdowns of how the data is divided up – like I was saying, by state, city, zip code, or for the whole country. Each of those breakdowns is associated with a logical record number. Then, the actual Census data files have the logical record number at the beginning and all the numbers associated with the different fields in the rest of the line. I had to pick out the logical record numbers associated with zip codes in the first file and then pull those rows out of the Census data files to match them to the zip codes from the IP addresses.
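
A rough sketch of the join Carl describes is below. The file names, the "summary_level" value, and the column names are placeholders for illustration; the real Census Summary File layouts differ in their details.

```python
import csv

# Hypothetical layouts, for illustration only:
#   census_geo.csv:   summary_level,logical_record_number,zip_code
#   census_data.csv:  logical_record_number,total_population,median_age,...

# 1. From the geographic header file, keep only the zip-code-level breakdowns
#    and remember their logical record numbers.
logrec_to_zip = {}
with open("census_geo.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["summary_level"] == "zip":
            logrec_to_zip[row["logical_record_number"]] = row["zip_code"]

# 2. Stream the large data file line by line, keeping only the rows whose
#    logical record number corresponds to a zip code.
zip_to_demographics = {}
with open("census_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        zip_code = logrec_to_zip.get(row["logical_record_number"])
        if zip_code is not None:
            zip_to_demographics[zip_code] = row

# zip_to_demographics can now be joined against the zip codes in the
# IP-to-geolocation data.
print(len(zip_to_demographics), "zip codes matched")
```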

    M: I would imagine that Census data would involve big files – did this make them difficult to manage?

C: Yeah, the Census data files are really large, so it took a lot of space to load everything into memory. Then I made a list of the data we needed from the Census files and searched through them line by line to match zip codes to demographic information.

    M: That sounds like a lot of work. Did you have to do anything else to process the data?

C: The other thing I did was figure out the column headings to make the data more useful. The way it’s presented by the US Census Bureau, each column of data has a heading that is just a code you have to look up somewhere else to figure out what it actually means. I went through and did a lot of manual editing to make the column headings more readable. Now, if you just look at it, you have a better idea of what’s actually going on, and it’s not just meaningless codes.

    M: How did you find data with licenses that actually let you mash them?

C: We were looking for datasets with licenses that let you freely download the data, mash it up with other datasets, and sell it on your own site or do anything commercial with it. Of course, most of these licenses have attribution requirements, so we made sure to list all our sources in the dataset. The final dataset we have available clearly states that the data originally came from the US Census Bureau and from the MaxMind website.

    M: In the end, what licenses did you put on the dataset that you made?

    C: The license that is on there now is a very open license that lets users use the data for whatever they need. It is the Open Database License.

    M: Are there any other difficulties you faced?

C: One thing we wanted to make sure of was that the IP address data we got was reliable and covered a lot of IP addresses. It needed to have broad coverage of IP addresses in general. As a quick test, we took the IP addresses from six months’ worth of page visits in our own website logs and ran them all through the IP address database. It matched over 90% of the IP addresses we had, which was a pretty good indication that the dataset was fairly complete and had very good coverage compared to others, which we heard might cover only 50%.
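
The coverage check Carl mentions boils down to a simple match-rate calculation; here is a sketch, assuming a plain text file of visitor IPs pulled from the web logs and an IP-range table with the same hypothetical columns used above.

```python
import bisect
import csv
import ipaddress

# Load the IP ranges (hypothetical layout: ip_range_start,ip_range_end,zip_code).
ranges = []
with open("ip_to_zip.csv", newline="") as f:
    for row in csv.DictReader(f):
        ranges.append((int(row["ip_range_start"]), int(row["ip_range_end"])))
ranges.sort()

def covered(ip):
    """True if some range in the geolocation database contains this IP."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(ranges, (n, float("inf"))) - 1
    return i >= 0 and ranges[i][0] <= n <= ranges[i][1]

total = matched = 0
with open("visitor_ips.txt") as f:  # one IP address per line, from six months of logs
    for line in f:
        ip = line.strip()
        if not ip:
            continue
        total += 1
        if covered(ip):
            matched += 1

print(f"coverage: {matched / total:.1%} of {total} visitor IPs")
```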

    M: Is the availability of the IP addresses a privacy concern?

    C: I don’t think it’s a privacy concern because it’s not matching it up to a specific address, but it’s matching it up to a zip code. Since zip codes have a very large number of people, it’s hard to determine if that IP address is coming from one specific person or even one specific household.

    M: Ok, thank you very much, Carl.

    Data.gov import

Infochimps is pleased to announce a recent import of all of the data from Data.gov!  Data.gov was one of the more exciting things to happen last year for the world community, and it has had a big impact in the US and internationally by setting a precedent for government data sharing.  We hope that including these datasets in our collection increases their visibility and makes them useful to the world at large.

The fact that users can edit this data makes it much more usable and interesting.  Unlike on Data.gov, users on Infochimps can upload datasets and even upload different versions of datasets to the site.  So when a dataset comes from the government in some messy, incomprehensible format, you can do what Infochimps user Ganglion did and upload a better version.  This type of Wikipedia-style curation of datasets is where Infochimps got its name: because data drudge work (column titles, formatting issues, etc.) is fit for a chimp, this type of work should only be done once.  And may the result live on Infochimps!

    Take a look at the Data.gov collection to get started.

    Twitter data, open questions to Developers, Academics, and Data Geeks

We are excited to announce the re-release of the Twitter datasets, and a discount on the Twitter API Map dataset.  Again, the datasets are:

• the Twitter API ID Mapping dataset, and
• the Conversation Metrics datasets, with token counts of hashtags, smileys, and URLs.

    This time the data is being released with Twitter’s approval.  We are talking with them about how we can increase access to more and more bulk data, and need your help in showing them how useful this data really is.

We want to make clear to people with privacy concerns that we absolutely hear and respect your points, and so does Twitter.  These datasets contain NO personally identifiable information, they do NOT contain whole tweets, and they meet the guidelines laid out in this EFF document on personally identifiable information.

    We encourage everybody to take advantage of this weekend’s discount and go build great things with this data.  Let’s show Twitter and the world what is possible when one has access to bulk data:

    • Data geeks and Visualization studs: what would you do if you could run jobs across our massive crawl (or the full Twitter graph)?
    • App devs: what data do you want those nerds to extract?  How would it improve the experience of Twitter or enable new things?
    • Businesses: how can this data improve your services?  How can this data make you money?
    • Academic researchers: what amazing things will you uncover by exploring the social network’s deep structure?

Reach out to us in the comments or send us ideas at info@infochimps.org.

    Twitter data update

Our launch of the Twitter data was a great success, and we thank Marshall Kirkpatrick at ReadWriteWeb and Jordan Golson at GigaOM for their coverage. The community reaction has been overwhelming and energizing. We accomplished our two main goals: cracking open some issues close to our hearts and kick-starting the conversation about sharing data online.

Twitter has advanced some reasonable concerns, however, and has asked us to take the datasets down. We have temporarily disabled downloads while we discuss licensing terms. The outcome of those discussions will, we hope, encourage more internet services to open up and share data in bulk. The two biggest issues this data release highlighted are third-party redistribution and user privacy.

    Redistribution rights. Twitter maintains a legendarily open API:

    “Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services.
    “We encourage and permit broad re-use of Content. The Twitter API exists to enable this.” [highlighting added by us]

    However, Twitter wants to more closely control who has access to data at massive scale and to prevent its malicious use. We understand this concern — innovation is always a double-edged sword. The applications and services that can use this data to make the world a better place far outnumber those with bad intentions, however, and good people need better access to this type of data. The best solution is to apply a reasonable license to the data. We are addressing this in our talks with Twitter, and we expect to have a resolution soon.

User privacy. What little criticism we heard from the community concerned the potential for a breach of user privacy. This is an issue with many types of internet data, and one we take seriously. We ensured that the datasets we released posed no such dangers. The Token Count data contained no personally identifying information, only what the entire mass of Twitter users was discussing over time. The API ID Mapping dataset is simply a sort of phone book for the Twitter APIs: it converts screen names to numeric IDs and reveals absolutely nothing about the corresponding user. Infochimps.org’s policy is not to host any personally identifying information of non-consenting individuals — we apply this rule to any data that goes on the site, from any source.

    These are hard issues and it took a bold move to bring them into the open. It will take further sharing and discussion to establish best practices for these concerns so that Twitter and other internet services (Facebook, Amazon, etc.) can share their data to the benefit of the greater online community. Stay tuned while we agree upon appropriate licensing for open sharing of this social data.

    Twitter Census: Publishing the First of Many Datasets

    As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that service currently allows. We agree — some of the sexiest uses of data require processing not just all that is now, but the vast historical record. Twitter doesn’t provide the only use case for this, but until now its historical bulk data has been hard to find.

    Today we are publishing a few items collected from our large scrape of Twitter’s API. The data was collected, cleaned, and packaged over twelve months and contains almost the entire history of Twitter: 35 million users, one billion relationships, and half a billion Tweets, reaching back to March 2006. The initial datasets are a part of our Twitter Census collection.

The first dataset, a Token Count, tallies the number of times tokens (hashtags, smileys, and URLs) have been tweeted. The data is available for free by month and for pay by hour. Think about comparing this data to the stock market, new movies, new video games, or even trendingtopics.org. For example, use it to track the adoption of Google Wave by the rate of its mentions. On one payload’s page you will find a snippet with a sample taken during Kanye West’s outburst in September, and on another’s you can see that the “:)” emoticon has been used 135,000 times.
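
As a small example of the Google Wave comparison suggested above, here is a sketch that pulls one token’s monthly counts out of the dataset; the file name and column names are assumptions about how the download is laid out.

```python
import csv

# Hypothetical layout for the monthly token-count file: month,token,count
mentions = {}
with open("twitter_token_counts_by_month.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["token"].lower() == "#googlewave":
            mentions[row["month"]] = int(row["count"])

# Print the mention counts in chronological order to eyeball the adoption curve.
for month in sorted(mentions):
    print(month, mentions[month])
```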

The second dataset solves a big problem developers face when they use both Twitter’s Search API and the main Twitter API: each API returns a different unique identifier for every user on Twitter. This dataset maps user IDs between the two APIs for 24.5 million users. The mapping should be a godsend to Twitter app developers, as it lets them easily combine data from each API, mixing calls for friends lists with searches on the Twitter Search API.
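
A minimal sketch of how an app might use the mapping, assuming it ships as a CSV with one row per user (the column names here are hypothetical):

```python
import csv

# Hypothetical columns: search_api_id,rest_api_id,screen_name
search_to_rest = {}
with open("twitter_api_id_map.csv", newline="") as f:
    for row in csv.DictReader(f):
        search_to_rest[row["search_api_id"]] = row["rest_api_id"]

def rest_id_for(search_api_id):
    """Translate a Search API user ID into the main API's user ID, if known."""
    return search_to_rest.get(search_api_id)

# Join a Search API result with data keyed by the main API's IDs.
print(rest_id_for("12345"))
```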

    These datasets are only views from the massive collection we have been growing over the last year. We will be releasing additional datasets regularly over the next few weeks so please check back for updates. If you’d like a custom slice or analysis done on this data, please get in touch at imw@infochimps.org.

With the release of this data, we hope to send a signal that this data is valuable and useful to real-time search engines, Twitter apps, and social media researchers. It should start a conversation about where the value really lies in this type of data, about the ownership and privacy issues that arise, and about Infochimps.org as the place to go to find data. We invite interested parties to get in touch and begin uploading their data (try invite code “newsupplier”) today as part of the Infochimps marketplace.

APIs and Datasets, living in harmony

The most popular way to access data on the web right now is through an API.  APIs provide real-time data, an incredible advantage, and outsourced computation.  These are advantages for the end user and the developer, while the API provider has to eat the cost of providing such a service.  It is worth it for the provider, though, as a myriad of services can be built around their primary service.

There are some things an API can’t give you, though.  An API generally cannot give you historical data; Twitter’s API, for example, only lets you go back XX tweets.  This means that a service built late in the game may not carry the same value as one built in the early days of an API, since the latter’s data goes back further.

Next, APIs only give you pieces.  The scale of questions you can ask is limited by the rate limit and the size of the pieces returned.  Services can’t ask for everything, and they may be further limited by bandwidth and the load on the primary API.

The types of questions we’re talking about have to do with the deep structure of the data in question.  One of the reasons our near-complete scrape of Twitter’s friend graph was so popular is that this type of dataset is extremely valuable to network researchers.  The sort of research a graph like Twitter’s makes possible is phenomenal.  Without such a dataset, researchers are left like the Antarctic explorers of the past – slowly crawling new territory, making maps and filling in details only as they come along, piece by piece.

The value APIs provide to the service and the outside world is undeniable.  The problems that APIs leave open can be solved by services providing complete dumps periodically.  These datasets of complete and historical data will not only let researchers get to work improving their science, but will also allow applications to seed their service with the latest dataset and then keep it current through the API.
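
In code, the seed-then-update pattern looks roughly like the sketch below: load the bulk dump once, note the newest record it contains, then ask the API only for what came after. The file name, endpoint, and field names are placeholders, not any real service’s API.

```python
import csv
import requests

# 1. Seed from the bulk dump (hypothetical layout: record_id,created_at,...).
records = {}
latest = ""
with open("service_dump.csv", newline="") as f:
    for row in csv.DictReader(f):
        records[row["record_id"]] = row
        # Assumes ISO-8601 timestamps, so string comparison orders correctly.
        latest = max(latest, row["created_at"])

# 2. Poll a placeholder API only for records newer than the dump.
resp = requests.get("https://api.example.com/records",
                    params={"since": latest}, timeout=30)
resp.raise_for_status()
for row in resp.json():
    records[row["record_id"]] = row

print(len(records), "records after seeding from the dump and one API update")
```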

Should services share their data on a platform like Infochimps, they not only provide a great service to applications and researchers, they also reduce their own costs.  The load on their API is lighter because fewer requests have to be made for data.  And when researchers have the complete dataset sitting on their hard drive, the API provider is no longer depended upon for compute time; local access to the data makes the researcher’s job much faster and easier.

The two approaches to sharing data are complementary.  Freebase does a great job at this; we hope other services will soon follow suit.

    It's Hot, Damn Hot. So Hot I saw a Chimp in Orange Robes Burst into Flames.

    It’s been ridiculously hot ridiculously early this year in Austin. A friend passed along this link to a visualization of 100+ degree days over the last 10 years. The author couldn’t find data extending back farther than 2000, but luckily I knew where to look.

I pulled the NCDC weather data for Austin from 1948 to the present (see the infochimps.org link for details) and got my Tufte on.
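
For the curious, the underlying count is simple; here is a sketch, assuming the NCDC daily data has been exported to a CSV with a date column and a daily-maximum temperature in °F (the column names are assumptions).

```python
import csv
from collections import Counter

# Hypothetical export of the NCDC daily data:
#   date (YYYY-MM-DD), tmax_f (daily maximum temperature, Fahrenheit)
days_over_100 = Counter()
with open("austin_ncdc_daily.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["tmax_f"] and float(row["tmax_f"]) >= 100.0:
            days_over_100[row["date"][:4]] += 1  # bucket by year

for year in sorted(days_over_100):
    print(year, "#" * days_over_100[year])  # crude text histogram
```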

    This temperature cycle is hotter than but comparable to the 1950-1965 era. I’ve got no idea if it’s global warming or the peak of a cycle. The fundamental conclusion — that this year so far, 2000 and 2008 were damn hot — stands up well.


    Congrats Retrosheet – another decade of rich Baseball data online

Congrats to Retrosheet, who now have full major-league baseball box scores from 1920-1930 online! (This is in addition to full box-score coverage for 1953-2008 and broad coverage of box scores and play-by-play data from 1871-2008.) As Nate Silver has said, “Baseball is the perfect dataset”, and we would not have this astonishingly rich and detailed dataset if not for the dedicated crowdsourced efforts of the Retrosheet team.

    Infochimps metadata entries for these datasets: