Massive Scrape of Twitter’s Friend Graph

UPDATE:

We’ve posted several Twitter datasets on Infochimps. Take a look and build something cool!

UPDATE:

We’ve taken the data down for the moment, at Twitter’s request. STAY CALM. They want to support research on the twitter graph, but feel that since this is users’ data there should be terms of use in place. We’ve taken the data down while those terms are formulated. I pass along from @ev: “Thank you for your patience and cooperation.”


The infochimps have gathered a massive scrape of the Twitter friend graph.  Right now it weighs in at

  • about 2.7M users: we have most of the “giant component”
  • 10M tweets
  • 58M edges

(These and other details will be updated as further drafts are released. See below for technical info.)  This is still a rough, rough draft, but this dataset is so amazingly rich we couldn’t help sharing it.  We have not done all the double-checking we’d like, and the field order will change in the next (12/30) rev.  We’ll also have a much larger dump of tweets off the public datamining feed.

The data is offline at the moment pending some TOS from twitter.com. If you’re interested in hearing when it’s released, follow the low-traffic @infochimps on twitter or look for a post here.

Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.

THE FILES ARE HUGE.  They will, in principle, work with anything that can import a spreadsheet-style TSV file. But if you try to load a 56-million row dataset into Excel it will burst into flames. So will most tools; even opening an explorer/finder window on the ripd/_xxxx directories will fill you with regret.

If you have access to a cluster with Hadoop and Pig it’s highly, highly recommended. Otherwise, the files will load straight into MySQL using LOAD DATA INFILE (and presumably other DBs as well).  Industrial-strength products such as Mathematica and Cytoscape will struggle, but can handle good-sized subsets.  And don’t worry: besides featuring it on infochimps.org when we launch, once this dataset is mature we’ll move the raw data onto Amazon Public Datasets and archive.org. (At which point Amazon can serve as your Hadoop cluster, if you’ve got the lucre.)
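If even MySQL is more machinery than you want, a plain streaming pass works too. Here’s a minimal Python sketch that walks the “A follows B” file one line at a time and tallies in/out degree without holding the whole thing in memory; the filename and two-column layout are assumptions based on the field list below, so adjust them to match your copy of the dump.

    # Minimal sketch: stream the "A follows B" edge file without loading it
    # all into memory. Filename and column order are assumptions; check them
    # against the field list in this post.
    from collections import defaultdict

    def degree_counts(path="a_follows_b.tsv"):
        follows = defaultdict(int)    # out-degree: how many accounts A follows
        followers = defaultdict(int)  # in-degree: how many accounts follow B
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 2:
                    continue          # skip blank or malformed rows
                user_a, user_b = fields[0], fields[1]
                follows[user_a] += 1
                followers[user_b] += 1
        return follows, followers

    if __name__ == "__main__":
        follows, followers = degree_counts()
        print("users observed following someone:", len(follows))

This is a single pass over the file, and memory grows only with the number of distinct users (about 2.7M), not with the number of edges.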

Description of objects and fields:

All the files are tab-separated values (TSV) files; a short parsing sketch in Python follows the field list below.

  • Users:
    • Partial Users: 8.1 million sightings of 2.7M unique users. When you request a user’s followers / friends list, or see them in the public timeline, you get a partial listing of each user. This table lists each unique state observed: if @infochimps was seen on the 10th, the 15th, and the 16th with 80, 80, and 82 followers respectively (everything else the same), you’ll get the twitter_user_partial records of the 10th and the 16th.
      Fields: [:id],  :id,  :screen_name, :followers_count, :protected, :name, :url, :location, :description, :profile_image_url
    • Users: 2.2M full user records
      Fields: [:id],  :id,  :screen_name, :created_at, :statuses_count, :followers_count, :friends_count, :favourites_count, :protected
    • User Profile: descriptive info for each user with a full record
      Fields: [:id],  :id,  :name, :url, :location, :description, :time_zone, :utc_offset
    • User Styles: image and color settings for each user with a full record
      Fields: [:id],  :id,  :profile_background_color, :profile_text_color, :profile_link_color, :profile_sidebar_border_color, :profile_sidebar_fill_color, :profile_background_image_url, :profile_image_url, :profile_background_tile
  • Relationships:
    • A follows B: 58M; User A follows User B.
      Fields: [:user_a_id, :user_b_id], :user_a_id, :user_b_id
    • A atsigns B: 3.0M; User A @atsigned User B anywhere in the tweet. Not threaded, and currently only carries User B’s screen_name, not ID.
      Fields: [:user_a_id, :user_b_name, :status_id], :user_a_id, :user_b_name, :status_id
    • A replies B: 2.5M; User A replied explicitly to User B (Tweet carries the in_reply_to_status_id).
      Fields: [:user_a_id, :user_b_id, :status_id], :user_a_id, :user_b_id, :status_id, :in_reply_to_status_id
    • A retweets B: (coming soon)
  • Tweets & Contents:
    • Tweets: 10.1M unique tweets.  A bunch more coming from the public datamining feed.
      Fields: [:id],  :id,  :created_at, :twitter_user_id, :text, :favorited, :truncated, :tweet_len, :in_reply_to_user_id, :in_reply_to_status_id, :fromsource, :fromsource_url
    • Hashtags: 220k #hashtags collected from tweets
      Fields: [:user_a_id, :hashtag], :user_a_id, :hashtag, :status_id
    • Tweet URLs: 2.1M urls collected from tweets
      Fields: [:user_a_id, :tweet_url], :user_a_id, :tweet_url, :status_id
    • Expanded TinyURL’ishes: 1.0M redirect targets for tinyurl.com, is.gd, bit.ly, etc.
      Fields: short_url, dest_url
  • Metrics:
    • Pagerank: A measure of ‘prestige’ determined by network flow; this is the algorithm Google uses to weight web pages.
      Fields: :user_id, :pagerank, :ids_that_user_follows
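For the smaller tables a line-by-line parse is all you need. The sketch below (Python) maps the Partial Users columns onto field names; the column order is copied from the list above, and since the field order will change in the next rev, treat it as an assumption to re-check against the files you actually download.

    # Minimal sketch: read the partial-users TSV into dicts keyed by field
    # name. PARTIAL_USER_FIELDS mirrors the column list in this post (which
    # is subject to change); the filename is an assumption.
    PARTIAL_USER_FIELDS = [
        "id", "screen_name", "followers_count", "protected",
        "name", "url", "location", "description", "profile_image_url",
    ]

    def read_partial_users(path="twitter_user_partial.tsv"):
        """Yield each partial-user sighting as a dict of field name -> string."""
        with open(path) as f:
            for line in f:
                row = line.rstrip("\n").split("\t")
                if len(row) != len(PARTIAL_USER_FIELDS):
                    continue  # skip rows that don't match the expected layout
                yield dict(zip(PARTIAL_USER_FIELDS, row))

    # Example: how many distinct screen_names were sighted?
    # names = {u["screen_name"] for u in read_partial_users()}
    # print(len(names))

The same pattern works for any of the tables; only the field list changes.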

Random Notes:

  1. Make sure your language doesn’t interpret a zero-padded twitter_user.id (or status_id) like ‘000000000072’ as octal, which would give you 58 instead of 72 (see the sketch after these notes).
  2. There may be inconsistent user data for all-numeric screen_names: see
    http://code.google.com/p/twitter-api/issues/detail?id=162
    That is, the data in this scrape may commingle information on the user having screen_name ‘415’ with that of the user having id #415. Not much we can do about it, but we plan to scrub that data later.
  3. Watch out for some ill-formed screen_names: see http://code.google.com/p/twitter-api/issues/detail?id=209
  4. For the parsed data: act as if we’ve double-checked none of this. If you have questions please ask, though: help@infochimps.org
  5. The scraped files (ripd-_xxxxxxxx.tar.bz2) are exactly as they came off the server.
  6. Pagerank is non-normalized: divide by N and take the log (also shown in the sketch after these notes).
  7. The files are huge, and the ripd-_xxxx directories will make your filesystem cry. We recommend Hadoop.
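To make notes 1 and 6 concrete, here’s a small Python sketch; the user count N is the approximate figure quoted in this post, and the rest is illustrative.

    import math

    # Note 1: parse IDs with an explicit base so zero-padding can never be
    # read as octal (in parsers where "000000000072" would come out as 58).
    status_id = int("000000000072", 10)   # -> 72, never 58

    # Note 6: the pagerank column is non-normalized. Divide by the number of
    # users N and take the log to get a more comparable score.
    N = 2.7e6  # approximate user count from this post

    def normalized_pagerank(raw_pagerank, n_users=N):
        return math.log(raw_pagerank / n_users)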

Comments

  1. Baskaran Sankaran June 26, 2009 at 5:27 pm

    I am one of the users eagerly waiting for its release. Can you tell when/ if it will be published again?

  2. Tom June 20, 2009 at 4:09 am

    Just gloating about how right I was. Twitter is currently telling people that they’re so overloaded with support questions that instead of marking their question as ‘enhancement’ or something to do later, they’re just saying “yeah, we don’t do that anymore, contrary to our announcement stating the opposite and no retraction of that announcement. hehe.”

    Twitter is not a supportive environment: http://statsheet.com/blog/thanks-to-twitter-im-creating-my-own-twitter

  3. Gianni May 26, 2009 at 1:49 am

    Hi everybody…! Any news or idea if the data will be released again or not…?

  4. Yutaka May 22, 2009 at 1:47 am

    I wonder when could it be possible to get this interesting dataset. Any news from Twitter?

  5. Andrew April 6, 2009 at 9:58 pm

    Any update on whether this dataset can be released?

  6. Paul March 17, 2009 at 9:45 am

    mrflip, do you have any idea of when it will be released?

  7. Paul March 10, 2009 at 3:21 am

    Where is it? Pleaseeeeee

  8. Rod February 18, 2009 at 5:45 pm

    Will this ever be available again? This dataset would help me out a lot with my research paper… but I need it soon :(

  9. Cassio February 9, 2009 at 10:20 am

    Philip, do you have any idea of when this dataset is going to be released again? I’m *really* anxious.

  10. mrflip January 25, 2009 at 5:54 pm

    I’m not interested in dancing around their reasonable interest in managing their users’ data.

    And I’ve found the tech support highly responsive and better than most.

  11. Tom January 25, 2009 at 3:15 pm

    Oh, hey, you posted the same thing here as in my (misdirected) e-mail. I guess I have to return in kind (cropped to remove e-mail specific parts, but not edited)

    In my experience, Twitter loves to pass the buck. For example, there’s an error in their javascript that actually breaks hyperlinks inside of tweets because it adds spaces randomly inside a tweet’s text. @al3x refuses to acknowledge it’s a twitter problem, and refuses to assist in determining just how it’s a Firefox problem (as he claims it to be). Or, “It’s not my fault, but I won’t show you how I know it’s not my fault, because it’s actually my fault and I don’t want to reveal that”.

    How about this: you could use their new oauth (or old http auth) to allow users to download their _own_ data from this scrape. I have been unable to find any of my early tweets, and twitter’s paging limits me to something like 600 tweets. I fear I may have to do a massive scrape like you did, then piece together all the tweets by specific user ids in packages, and let people download theirs after proving who they are.

    Just because you were able to increment a number and do a show command on tweets, doesn’t mean that was the intended purpose. In fact, I wouldn’t be surprised if they write some software to catch people doing just that and start feeding them either bunk data or no data at all.

    With regards to the Waldman brouhaha, I also sided against her for a variety of reasons. She had a lot to gain by making Twitter look bad by being Pownce’s community manager. This is like a woman from Apple claiming to have been sexually assaulted on the Microsoft campus, conveniently out of sight of a security camera, by someone she didn’t get a good glimpse of. It’s very hard to believe. However, bloggers everywhere rallied against Twitter, just like she’d hoped.

    Finally, I think the dev team have forgotten about your experiment. You might want to ping them and see what the status is.

  12. mrflip January 24, 2009 at 9:52 pm

    I think, and I hope, that my confidence is very well-placed.

    The fact is that this bulk data existed and exists quite apart from my scrape… Their API limits are set to just under a half-million requests per day (so, up to 5 million partial-user records), not to mention whatever HTML scraping people get away with. Every piece of this data is available free to the public, from their API and from a dozen other independent APIs.

    The question at hand is: do they want random people collecting and distributing sub-rosa copies of the data, outside of their view and with no formal agreement, at great cost in servers and monitoring? /We/ asked and got permission, having offered to coordinate any necessary restrictions, and immediately took it down when they reconsidered. The next person might not do any of that.

    Our offer is to host, share and index the data; to put it in view of the smartest researchers in the world (I’ve been contacted by a number of prominent researchers eager to get at this); to ensure that users understand the associated terms of service; to place it in an environment where others will link it with the rest of the world’s free data; and to make it easily available to people building tools that need SOME now and will gladly pay for ALL later.

    And to do all this respecting the value of what they’ve shared and recognizing their interest in seeing others use it responsibly.

    You pointed to the Waldman brouhaha to show something something about their user-facing Terms of Service. I actually side with how Twitter handled it; regardless, they’ve grown and thrived by courageously adopting very open policies. Their dev team are engineers and understand the value of opening this data. The hunger for this data is enormous, and the larger community is one that values openness and is not shy about applying pressure in that direction.

  13. Tom January 24, 2009 at 2:24 pm

    Twitter has no intention of ever giving you permission to put this back up. Heck, they don’t even have a proper terms of service themselves (just search [twitter waldman]).

  14. Breyten January 19, 2009 at 10:39 am

    Any news on when the files will be back up?

  15. Pingback: New text resources available « HLP/Jaeger lab blog

  16. Pingback: The Asdrubal Carrera Hall of Fame « blog.infochimps.org - Organizing Huge Information Sources

  17. Daniel Becker January 6, 2009 at 2:58 am

    Maybe this would be a good use for Amazon’s new Public Data Sets feature. http://aws.amazon.com/publicdatasets/

  18. Michael Wilde January 5, 2009 at 11:52 pm

    I can’t wait to eat this data with Splunk. Tens of millions of records are nothing for a Splunk server to eat. Check out what I did with just a 5-minute ping on search.twitter.com looking for Boxee viewing.

    http://blogs.splunk.com/thewilde/2008/12/29/its-time-for-a-boxee-ing-match-with-splunk-2/

    You guys should download Splunk and have it eat that data (if I don’t get to it first)

  19. Nicole Simon January 5, 2009 at 11:34 pm

    Please do ping people when the data is back online. ;)

    http://twitter.com/nicolesimon

  20. Anon January 5, 2009 at 7:14 pm

  21. Pingback: Reading digest for 12/29/08 to 01/05/09 « Andy Oakley

  22. mrflip January 4, 2009 at 8:44 am

    I’m waiting for re-approval of distribution, which I hope to get by end of week.

    Once that is set up, if you’d like to help set up bittorrent distribution please get in touch: flip at infochimps.org

  23. Pascal GANAYE January 2, 2009 at 6:27 pm

    Can you provide a bittorrent link for this download?
    This would greatly reduce the load on your server and allow people to download it.
    (http://blog.infochimps.org/2008/12/29/massive-scrape-of-twitters-friend-graph/)

  24. Pingback: Beautiful, Beautiful Data | How We Know Us

  25. Ryan December 29, 2008 at 10:34 pm

    Thanks for this post! Much easier to read :-)

    I didn’t think of loading this into MySQL. Will give it a try! :)