Announcing bulk redistribution of MySpace data

Today, we’re excited to announce the availability of MySpace data for bulk download on Infochimps. This is a major step forward for MySpace in our eyes – a move that signals how seriously they take data and the developers and academics who work with it.

This data is not sold by MySpace; it is given out for free through their API and then packaged by Infochimps for redistribution. By giving developers free access to publicly available real-time data (such as status updates, music, photos, and videos), MySpace reinforces its commitment to powering the real-time social Web and the development of open standards.

  • Every day, MySpace processes over 32 million activities and updates
  • MySpace opened up its real-time data with free-to-use APIs, letting developers create robust products
  • MySpace offers more scale and richer content (music, photos, videos, apps) than anyone else
  • Real-time data input, and the ability to then share that data in real time, will drive the socialization of content on the web
  • Data available for bulk download will help usher in the next generation of data-driven research and application development. Now, using a dataset like word count by hour, developers and content providers can better understand how things are talked about and when.
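To make the word-count-by-hour idea concrete, here is a minimal sketch of how such a dataset could be derived from a local copy of status updates. The record layout (an ISO timestamp paired with the update text), the function name, and the sample rows are all assumptions made for illustration; they are not the actual format of the MySpace data or of the Infochimps datasets.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime


def word_counts_by_hour(records):
    """Tally words per hour.

    `records` is an iterable of (iso_timestamp, text) pairs. This layout is
    hypothetical; adapt the parsing to whatever the real dataset provides.
    """
    counts = defaultdict(Counter)
    for timestamp, text in records:
        # Truncate the timestamp to the hour so updates share a bucket.
        hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
        # Very simple tokenizer: lowercase runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts[hour].update(words)
    return counts


# Made-up sample rows, for illustration only.
sample = [
    ("2010-03-01T09:15:00", "Listening to new music"),
    ("2010-03-01T09:40:00", "New photos are up"),
    ("2010-03-01T10:05:00", "Music all day"),
]

result = word_counts_by_hour(sample)
for hour in sorted(result):
    print(hour, result[hour].most_common(3))
```

With a bulk download, the same loop can run over the whole file in one pass, which is the kind of job that is awkward to do through a rate-limited API.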

    The benefits of having data available for bulk download instead of just an API are numerous. Developers can start with a sample dataset and get their apps started faster. Academics are much better served by a .csv than an API, and developers can take advantage of the datasets these experts create as a result of their research. Opening one’s data to the big data community makes all this and much more possible.

    APIs aren’t enough. New tools like Hadoop allow for the processing of huge datasets but necessitate having a local copy of the entire dataset. The advanced analytics that come from computing on top of a huge dataset (and at 25GB/day the MySpace stream is massive) will power the next generation of applications.

    Developers looking for this data can come to Infochimps to find what they need. Let’s foster a division of labor between the people who are experts in mining this data for insight and the pros who can develop applications on top of those discoveries. For example, Ryan Rosario of UCLA created a dataset of users’ moods by zip code: a historical emotional context for researchers, psychologists, and possibly a developer looking to take advantage of this MySpace feature.

    We’ll premiere the “best of MySpace” datasets in the hope of supporting a relationship between MySpace and data-driven research and development. Any API owners out there should get in touch to talk about how we can make your data computable for the big data community.

    UPDATE: Here is a visualization of users with geolocations, drawn from our dataset “User locations by lat/long”:
    [Figure: MySpace user locations by lat/long]