Twitter Census: Publishing the First of Many Datasets

As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that the service currently allows. We agree: some of the sexiest uses of data require processing not just what is happening now, but the vast historical record. Twitter isn’t the only use case for this, but until now its historical bulk data has been hard to find.

Today we are publishing the first few datasets drawn from our large scrape of Twitter’s API. The data was collected, cleaned, and packaged over twelve months and contains almost the entire history of Twitter: 35 million users, one billion relationships, and half a billion Tweets, reaching back to March 2006. These initial datasets are part of our Twitter Census collection.

The first dataset, a Token Count, counts the number of times each token (hashtags, smileys, and URLs) has been tweeted. Monthly counts are available for free, and hourly counts are available for purchase. Think about comparing this data to the stock market, new movies, new video games, or even trendingtopics.org. For example, use it to track the adoption of Google Wave through the rate of its mentions. On one payload’s page you will find a snippet with a sample taken during Kanye West’s outburst in September, and on another’s you can see that the “:)” emoticon has been used 135,000 times.
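To make the idea of a token count concrete, here is a minimal sketch of how hashtags, smileys, and URLs could be tallied per month from raw tweets. It is not the pipeline we used to build the dataset: the regular expressions, the function names, and the sample tweets are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Illustrative patterns only (assumptions); the dataset's real
# tokenization rules are not described in this post.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
SMILEY_RE = re.compile(r"[:;]-?[)(DP]")


def extract_tokens(text):
    """Return the URLs, hashtags, and smileys found in one tweet's text."""
    urls = URL_RE.findall(text)
    # Strip URLs first so pieces of a link are not miscounted as
    # hashtags or smileys.
    rest = URL_RE.sub(" ", text)
    return urls + HASHTAG_RE.findall(rest) + SMILEY_RE.findall(rest)


def count_tokens_by_month(tweets):
    """Count tokens per month.

    tweets: iterable of (iso_timestamp, text) pairs.
    Returns a Counter keyed by (YYYY-MM, token).
    """
    counts = Counter()
    for timestamp, text in tweets:
        month = timestamp[:7]  # "2009-09-14T02:10:00" -> "2009-09"
        for token in extract_tokens(text):
            counts[(month, token)] += 1
    return counts


# Made-up sample tweets, not taken from the dataset.
sample = [
    ("2009-09-14T02:10:00", "Did that just happen? #kanye :)"),
    ("2009-09-14T02:11:00", "Watching the replay http://example.com/clip #kanye"),
    ("2009-10-01T09:00:00", "Got my #googlewave invite :)"),
]
counts = count_tokens_by_month(sample)
```

Swapping `timestamp[:7]` for `timestamp[:13]` would bucket by hour instead of by month, which is roughly the difference between the free and paid versions of the data.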

The second dataset solves a large problem developers face when they use both Twitter’s Search API and the main Twitter API: each API returns a different unique ID for the same user. This dataset maps user IDs between the two APIs for 24.5 million users. This mapping should be a godsend to Twitter app developers, as it lets them combine data from both APIs, so that friends lists fetched from the Twitter API can be mixed easily with results from the Twitter Search API.
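As a rough illustration of how such a mapping might be used, the sketch below loads ID pairs into a dictionary and uses them to pick out search results written by a user's friends. The tab-separated column order, the function names, and every ID shown are hypothetical; check the dataset's own documentation for its actual layout.

```python
import csv
import io


def load_id_map(lines):
    """Build {search_api_id: twitter_api_id} from tab-separated rows.

    Assumed (hypothetical) column order:
    screen_name, twitter_api_id, search_api_id.
    """
    id_map = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        screen_name, twitter_id, search_id = row
        id_map[int(search_id)] = int(twitter_id)
    return id_map


def hits_from_friends(search_hits, friend_ids, id_map):
    """Keep search hits whose author is in a friends list.

    search_hits: iterable of (search_api_id, text) pairs.
    friend_ids: set of Twitter API user IDs.
    Hits from users missing in the mapping are dropped, since the
    mapping does not cover every account.
    """
    kept = []
    for search_id, text in search_hits:
        twitter_id = id_map.get(search_id)
        if twitter_id is not None and twitter_id in friend_ids:
            kept.append((twitter_id, text))
    return kept


# Made-up rows and IDs for illustration only.
mapping_file = io.StringIO(
    "alice\t1001\t500001\n"
    "bob\t1002\t500002\n"
)
id_map = load_id_map(mapping_file)

search_hits = [
    (500001, "trying out the new dataset"),
    (500002, "lunch time"),
    (599999, "author not in the mapping"),
]
friends = {1001}
result = hits_from_friends(search_hits, friends, id_map)
```

A plain dictionary is enough for a sample like this; at 24.5 million rows you would more likely load the mapping into a database or a key-value store and look IDs up there.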

These datasets are only slices of the massive collection we have been growing over the last year. We will be releasing additional datasets regularly over the next few weeks, so please check back for updates. If you’d like a custom slice or analysis done on this data, please get in touch at imw@infochimps.org.

With the release of this data, we hope to send a signal that it is valuable and useful to real-time search engines, Twitter apps, and social media researchers. This should start a conversation about where the value really lies in this type of data and about the ownership and privacy issues that arise. We also hope it shows that Infochimps.org is the place to go to find data. We invite interested parties to get in touch and begin uploading their data (try invite code “newsupplier”) today as part of the Infochimps marketplace.

Comments

  1. mrflip November 12, 2009 at 9:39 pm

    @James, Twitter is legendarily open, and their Terms of Service are very clear:

    “Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services.”

    “We encourage and permit broad re-use of Content. The Twitter API exists to enable this.”

    (highlighting mine)

  2. James Simmons November 12, 2009 at 7:39 pm

    Aren’t you worried about getting sued?

  3. Tony Zito November 11, 2009 at 11:25 pm

    Wow, this is HUGE. Charging for access to the historical record was a potential revenue stream I identified back in February: http://citrusfortress.com/wp/2009/02/how-twitter-could-start-making-money-now-without-fucking-up-a-very-very-good-thing/ … Nice work, guys, and best of luck!

    (Moderator, please delete my previous post; this one has the correct link. Thanks!)

  4. Joseph Kelly November 11, 2009 at 9:13 pm

    The data has been collected from the API for a year. The entire collection is over half a terabyte.

  5. Jeremy Dunck November 11, 2009 at 9:03 pm

    Curious: how did you get access to this data? Or have you been (only) sampling data via the API for a long while now?