Monthly Archives: March 2010

SxSW 2010 — Lecture Notes

Announcing bulk redistribution of MySpace data

Today, we’re excited to announce the availability of MySpace data for bulk download on Infochimps. In our eyes, this is a major step forward for MySpace – a move that signals how seriously it takes its data and the developers and academics who work with that data.

This data is not sold by MySpace; it is given out for free through their API and then packaged by Infochimps for redistribution. By giving developers free access to publicly available real-time data (such as status updates, music, photos, and videos), MySpace reinforces its commitment to powering the real-time social Web and the development of open standards.

  • Every day, MySpace processes over 32 million activities and updates
  • MySpace opened up its real-time data with free-to-use APIs letting developers create robust products
  • MySpace offers more scale and richer content – music, photos, videos, apps – than anyone else
  • Real-time data input and the ability to then share that in real-time will drive the socialization of content on the web
  • Data available for bulk download will help usher the next generation of data-driven research and application development. Now, using a dataset like word count by hour, developers and content providers can better understand how things are talked about and when.
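
As a rough sketch of what working with a “word count by hour” dump could look like, here is a small Python example. The column names and sample rows are invented – the real bulk file’s schema may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of a "word count by hour" dump; the real
# bulk file's schema and values may differ.
sample = """hour,word,count
2010-03-01T18,sxsw,420
2010-03-01T19,sxsw,610
2010-03-01T19,music,380
2010-03-01T20,sxsw,515
"""

# Total mentions per word, and the peak hour for each word.
totals = defaultdict(int)
peak = {}  # word -> (count, hour)
for row in csv.DictReader(io.StringIO(sample)):
    n = int(row["count"])
    totals[row["word"]] += n
    if n > peak.get(row["word"], (0, ""))[0]:
        peak[row["word"]] = (n, row["hour"])

print(totals["sxsw"], peak["sxsw"][1])
```

With a real dump, the same loop answers “how much is this talked about, and when?” directly from the file, no API calls required.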

The benefits of having data available for bulk download instead of just an API are numerous. Developers can start with a sample dataset and get their apps running faster. Academics are much better served by a .csv than an API, and developers can take advantage of the datasets these experts create as a result of their research. Opening one’s data to the big data community makes all this and much more possible.

APIs aren’t enough. New tools like Hadoop allow for the processing of huge datasets but require having a local copy of the entire dataset. The advanced analytics that come from computing on top of a huge dataset (and at 25GB/day the MySpace stream is massive) will power the next generation of applications.
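
To make the Hadoop point concrete, here is a toy word count in the Hadoop Streaming style. On a cluster the two functions would run as the mapper and reducer of a streaming job over the full dump; here they are chained in-process, and the sample status updates are made up.

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word in every status update.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key; sorting here
    # stands in for the shuffle phase.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

updates = ["heading to sxsw", "sxsw is packed", "new music at sxsw"]
counts = dict(reducer(mapper(updates)))
print(counts["sxsw"])  # 3
```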

The developers looking for this data can come to Infochimps to find the data they need. Let’s foster a division of labor between the people who are experts in mining this data for insight and the pros who can develop applications on top of those discoveries. For example, Ryan Rosario of UCLA created a dataset of users’ moods by zip code – a historical emotional context for researchers, psychologists, and perhaps a developer looking to take advantage of this MySpace feature.

We’ll premiere the “best of MySpace” datasets in the hopes of supporting a relationship between MySpace and data-driven research and development. Any API owners out there should get in touch to talk about how we can make your data computable for the big data community.

UPDATE: Here is a visualization of users with geolocations from our dataset, User locations by lat/long:

[Figure: user locations plotted by latitude/longitude]
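
Before a map like that can be drawn, the points are usually bucketed into a density grid. A minimal sketch, with invented coordinates standing in for the real dataset:

```python
import math
from collections import Counter

# A few made-up (lat, long) rows standing in for the
# "User locations by lat/long" dataset; the real file is far larger.
users = [(30.27, -97.74), (30.30, -97.70), (40.71, -74.01), (51.51, -0.13)]

# Bucket points into 1-degree cells - the usual first step before
# rendering a density map. math.floor (not int()) keeps negative
# longitudes in the correct cell.
grid = Counter((math.floor(lat), math.floor(lon)) for lat, lon in users)
print(grid[(30, -98)])  # 2 (the two Austin-area points)
```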

Data Cluster Meetup

Austin, TX may be the live music capital of the world, but next weekend Rackspace – together with Infochimps, WolframAlpha, Factual, and knowmore – is putting together an event that will prove it’s not just about the music.

Data geeks from all over the nation will come together to discuss the latest developments in the world of data during birds-of-a-feather sessions, talks, and pure and simple mingling (not to mention munching on free food) at the Data Cluster Meetup (Sunday, March 14, 6pm at Opal Divine’s Freehouse).

Not excited yet? Read on…

Non-relational Database Smackdown
Stu Hood of the Cassandra project will lead a discussion debating the merits of various non-relational databases. Any CouchDB or MongoDB users out there? RSVP and get in touch to be involved in the panel.

There will be five birds-of-a-feather sessions going on concurrently. Each discussion topic was chosen so that you’ll be able to find one that interests you most:

1. Operations (managing data) – Stu Hood of Rackspace and the Apache Cassandra project will lead a discussion on non-relational databases
2. Analytics (exploring data) – [No moderators locked in, interested? Email]
3. Web Applications (humanizing data) – [No moderators locked in, interested? Email]
4. Visualization (seeing data) – [No moderators locked in, interested? Email]
5. Data Commons (freeing data) – Infochimps’ own Flip Kromer, together with Factual’s Gil Elbaz, will lead a discussion on building a cross-domain data commons.

The best part of this event is the people. You’ll have time to talk, eat, and network with some of the greatest minds in the data world and exchange cutting-edge ideas.

If you’re a really smart data geek, you can’t miss this chance to immerse yourself in the world you love. RSVP now, then check out our Facebook event page for more information on who’s coming and the latest updates.

None of this would be possible without our sponsors: Rackspace, Infochimps, WolframAlpha, Factual, and knowmore. To all of them, thank you!

How to create datasets that the rest of the world needs

We recently created a dataset for the website that maps IP addresses to zip codes and Census demographic information. The work involved is representative of the type of community we want to build around Infochimps in the future. The people who will find this dataset useful – website owners, internet advertisers – are not always the same people who can create such a dataset. This division of labor can only happen when experts at data gathering can share their data in a place where the people who want to use it can find it.

Our social media expert Maegan recently interviewed Carl, a member of our data team, about this dataset creation process. You can find the IP-Census data he’s talking about on Infochimps.

M: Hi Carl, would you start by introducing yourself and telling us what you do for Infochimps?

C: I’m a member of the data team here at Infochimps. Basically, we’re the team in charge of gathering data that’s available on the web, cleaning it up, and making it more useful for other people out there who are looking for this sort of data.

M: I can imagine how appealing that data is to a lot of people. Speaking of useful data, I heard that you recently came up with a collection of datasets that link IP addresses to Census information. Can you tell me more about it?

C: Well, we heard from a few people that that sort of thing might be interesting. There are a lot of people out there who want to know more about the people who come to their website. Using this dataset, they can get demographic details from the IP addresses of their visitors. That way they can improve their understanding of their audience and better target the content on their website. The dataset that we have links IP addresses to zip codes, and then zip codes to all sorts of demographic data from the Census.
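
A toy illustration of the two-step join Carl describes – IP block to zip code, then zip code to Census demographics. All ranges and figures below are invented, not values from the actual MaxMind or Census tables:

```python
import ipaddress

# Invented IP ranges and Census numbers, just to show the join shape.
ip_blocks = [
    (ipaddress.ip_network("67.10.0.0/16"), "78701"),
    (ipaddress.ip_network("98.32.0.0/16"), "10001"),
]
zip_demographics = {
    "78701": {"population": 3855, "median_age": 37.2},
    "10001": {"population": 21102, "median_age": 36.0},
}

def demographics_for(ip_string):
    """Resolve a visitor IP to a zip code, then to demographics."""
    ip = ipaddress.ip_address(ip_string)
    for network, zip_code in ip_blocks:
        if ip in network:
            return zip_code, zip_demographics[zip_code]
    return None, None

zip_code, stats = demographics_for("67.10.4.22")
print(zip_code, stats["population"])
```

A site owner would feed visitor IPs from their access logs through `demographics_for` to profile their audience by zip code.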

M: I saw that you have so many different types of information from the Census. Where did you go to find the data to mash together?

C: For the Census data, that’s a fairly well-known source. The US government has a Census website where you can go to download all sorts of information. As far as the IP-to-geolocation data, there are a lot of datasets available. We were looking for one that had good coverage of IP addresses, was available for free, and had a license that allowed us to take the data, do what we wanted with it, and make it available on our site.

M: Is this a new kind of dataset? Or is it available elsewhere?

C: The IP-to-geolocation dataset is available from where we got it – at MaxMind. Linking that to the Census data is something that I don’t think we’ve seen elsewhere.

M: How did the process work once you had the data?

C: The Census data is divided into a lot of different geographic segments – national, state, city, county, and all those sorts of things – but the IP geolocation data only uses zip codes. We wanted just the data from the Census that’s associated with zip codes, so I had to comb through the Census data, pull out just the lines associated with zip codes, and then use those to match up to the IP addresses in the geolocation data.

M: Is it just how they’re organized?

C: Yeah, it’s more about how it’s organized. The Census data is organized into a few different files. You have one file that lists all the different breakdowns of how the data is divided up – like I was saying, by state, city, zip code, or the whole country. Each of those breakdowns is associated with a logical record number. Then, the actual Census data files have the logical record number at the beginning and all the numbers associated with the different fields in the rest of the file. I had to pick out all the logical record numbers associated with zip codes in the first file and then pull those rows out of the Census data to match them to the zip codes from the IP addresses.
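
Here is a minimal sketch of that LOGRECNO join, using made-up rows. The "ZIP" summary-level flag and all values are placeholders, not the real Census codes (the P001001 column is modeled on the Census total-population field, but treat it as illustrative):

```python
# geo_rows mimics the geographic header file (summary level +
# logical record number + zip code); data_rows mimics a data file
# keyed by the same LOGRECNO. All values are invented.
geo_rows = [
    {"sumlev": "STATE", "logrecno": "0000001", "zcta": ""},
    {"sumlev": "ZIP", "logrecno": "0000002", "zcta": "78701"},
    {"sumlev": "ZIP", "logrecno": "0000003", "zcta": "10001"},
]
data_rows = [
    {"logrecno": "0000001", "P001001": "20851820"},
    {"logrecno": "0000002", "P001001": "3855"},
    {"logrecno": "0000003", "P001001": "21102"},
]

# Step 1: keep only zip-code records and remember their LOGRECNOs.
zip_by_logrecno = {
    row["logrecno"]: row["zcta"]
    for row in geo_rows
    if row["sumlev"] == "ZIP"
}

# Step 2: pull the matching data rows and key them by zip code.
population_by_zip = {
    zip_by_logrecno[row["logrecno"]]: int(row["P001001"])
    for row in data_rows
    if row["logrecno"] in zip_by_logrecno
}
print(population_by_zip["78701"])
```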

M: I would imagine that Census data would involve big files – did this make them difficult to manage?

C: Yeah, the Census data files are really large, so it took a lot of space to load everything into memory. Instead, I made a list of what data we needed from the Census files and searched through them line by line to match zip codes to demographic information.
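
The line-by-line pass he describes might look something like this sketch, where only the wanted rows are kept and the full file never has to sit in memory (the file layout here is invented):

```python
import io

# LOGRECNOs we decided we need (the zip-code records).
wanted = {"0000002", "0000003"}

# Stand-in for a multi-gigabyte Census data file.
big_file = io.StringIO(
    "0000001,20851820\n"
    "0000002,3855\n"
    "0000003,21102\n"
)

kept = []
for line in big_file:  # one line in memory at a time
    logrecno, value = line.rstrip("\n").split(",")
    if logrecno in wanted:
        kept.append((logrecno, int(value)))

print(len(kept))
```

With a real file, `io.StringIO` would be replaced by `open(path)` and the same loop streams through it without loading everything at once.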

M: That sounds like a lot of work. Did you have to do anything else to process the data?

C: The other thing I did was figure out the column headings to make the data more useful. The way it’s presented by the US Census Bureau, each column of data has a heading that is just a code you have to look up somewhere else to figure out what it actually means. I went through and did a lot of manual editing to make the column headings more readable. Now, if you just look at the file, you have a better idea of what’s actually going on, and it’s not just meaningless codes.
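
That header cleanup can be sketched as a simple lookup from column codes to readable names; the code-to-name pairs here are illustrative, not the full Census data dictionary:

```python
# Hand-built mapping from opaque Census column codes to readable
# names (illustrative entries only).
code_to_name = {
    "P001001": "total_population",
    "P013001": "median_age",
}

raw_header = ["LOGRECNO", "P001001", "P013001"]
# Rename known codes; leave anything unrecognized as-is.
readable_header = [code_to_name.get(col, col) for col in raw_header]
print(readable_header)
```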

M: How did you find data with licenses that actually let you mash them together?

C: We were looking for datasets with licenses that let you freely download the data, mash it up with other datasets, and sell it on your own site or do anything commercial with it. Of course, most of these licenses have attribution requirements, so we made sure to list all our sources in the dataset. The final dataset that we have available clearly says that this data originally came from the US Census Bureau and the MaxMind website.

M: In the end, what license did you put on the dataset that you made?

C: The license on there now is a very open license that lets users use the data for whatever they need: the Open Database License.

M: Were there any other difficulties you faced?

C: One issue we wanted to clear up was whether the IP address data we got was reliable and would cover a lot of IP addresses – it needed broad coverage of general IP addresses. We did a quick test using the logs from our own website: we took the IP addresses from six months’ worth of page visits and ran them all through the IP address database. It turned out to match over 90% of the IP addresses we had, which was a pretty good indication that our IP address dataset was fairly complete and had very good coverage compared to others, which we heard might have only 50% coverage.
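
A back-of-the-envelope version of that coverage check: what fraction of the distinct IPs in the access logs resolve in the geolocation table? The IPs and table below are invented, and the real test used six months of logs:

```python
# Toy geolocation table (just the set of resolvable IPs) and a
# toy access log with repeat visits.
geo_table = {"67.10.4.22", "98.32.1.5", "67.10.9.9"}
log_ips = ["67.10.4.22", "67.10.4.22", "98.32.1.5", "5.5.5.5"]

# Count each visitor IP once, then measure the match rate.
distinct = set(log_ips)
matched = sum(1 for ip in distinct if ip in geo_table)
coverage = matched / len(distinct)
print(f"{coverage:.0%}")
```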

M: Is the availability of the IP addresses a privacy concern?

C: I don’t think it’s a privacy concern, because it’s not matching an IP address to a specific street address – it’s matching it to a zip code. Since zip codes cover a very large number of people, it’s hard to determine whether an IP address is coming from one specific person or even one specific household.

M: Ok, thank you very much, Carl.