Massive Scrape of Twitter’s Friend Graph

UPDATE:

We’ve posted several Twitter datasets on Infochimps. Take a look and build something cool!

UPDATE:

We’ve taken the data down for the moment, at Twitter’s request. STAY CALM. They want to support research on the twitter graph, but feel that since this is users’ data there should be terms of use in place. We’ve taken the data down while those terms are formulated. I pass along from @ev: “Thank you for your patience and cooperation.”


The infochimps have gathered a massive scrape of the Twitter friend graph.  Right now it weighs in at

  • about 2.7M users: we have most of the “giant component”
  • 10M tweets
  • 58M edges

(These and other details will be updated as further drafts are released. See below for technical info.)  This is still a rough, rough draft, but the dataset is so amazingly rich we couldn’t help sharing it.  We have not done all the double-checking we’d like, and the field order will change in the next (12/30) rev.  We’ll also have a much larger dump of tweets off the public datamining feed.

The data is offline at the moment pending some TOS from twitter.com. If you’re interested in hearing when it’s released, follow the low-traffic @infochimps on twitter or look for a post here.

Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.

THE FILES ARE HUGE.  They will, in principle, work with anything that can import a spreadsheet-style TSV file. But if you try to load a 58-million-row dataset into Excel it will burst into flames. So will most tools; even opening an explorer/finder window on the ripd/_xxxx directories will fill you with regret.
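
If you just want to poke around, stream the files rather than loading them. Here’s a minimal Python sketch; the filename is hypothetical, so point it at any of the dump files:

  # Peek at a huge TSV by streaming it, never holding the whole file in memory.
  # 'a_follows_b.tsv' is a hypothetical filename; use any of the dump files.
  import itertools

  path = "a_follows_b.tsv"

  with open(path) as f:
      # Sanity-check the first five rows.
      for line in itertools.islice(f, 5):
          print(line.rstrip("\n").split("\t"))

  with open(path) as f:
      # Count rows in one streaming pass; memory use stays flat.
      n_rows = sum(1 for _ in f)
  print(n_rows, "rows")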

If you have access to a cluster with Hadoop and Pig, it’s highly, highly recommended. Otherwise, the files will load straight into MySQL using LOAD DATA INFILE (and presumably other DBs as well).  Industrial-strength products such as Mathematica and Cytoscape will struggle, but can handle good-sized subsets.  And don’t worry: besides featuring it on infochimps.org when we launch, once this dataset is mature we’ll move the raw data onto Amazon Public Datasets and archive.org. (At which point Amazon can serve as your Hadoop cluster, if you’ve got the lucre.)
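
For the MySQL route, the load looks something like this. A minimal sketch assuming the MySQLdb driver and a hypothetical table for the A-follows-B file; adjust names, credentials, and paths to your setup:

  # Bulk-load the follows edges into MySQL with LOAD DATA INFILE.
  # Table name, credentials, and filename are all assumptions; adjust freely.
  import MySQLdb  # assumes the MySQL-python (MySQLdb) driver is installed

  conn = MySQLdb.connect(host="localhost", user="you", passwd="secret",
                         db="twitter_graph", local_infile=1)
  cur = conn.cursor()
  cur.execute("""
      CREATE TABLE IF NOT EXISTS a_follows_b (
        user_a_id BIGINT NOT NULL,
        user_b_id BIGINT NOT NULL
      )""")
  cur.execute("""
      LOAD DATA LOCAL INFILE 'a_follows_b.tsv'
      INTO TABLE a_follows_b
      FIELDS TERMINATED BY '\\t'
      LINES TERMINATED BY '\\n'""")
  conn.commit()
  conn.close()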

Description of objects and fields:

All the files are tab-separated (TSV) files; a small parsing sketch follows the field listing below.

  • Users:
    • Partial Users: 8.1 million sightings of 2.7M unique users. When you request a user’s followers/friends list, or see a user in the public timeline, you get a partial listing for each user. This table lists each unique state observed: if @infochimps was seen on the 10th, the 15th, and the 16th with 80, 80, and 82 followers respectively (everything else the same), you’ll get the twitter_user_partial records from the 10th and the 16th.
      Fields: [:id],  :id,  :screen_name, :followers_count, :protected, :name, :url, :location, :description, :profile_image_url
    • Users: 2.2M full user records
      Fields: [:id],  :id,  :screen_name, :created_at, :statuses_count, :followers_count, :friends_count, :favourites_count, :protected
    • User Profile: descriptive info for each user with a full record
      Fields: [:id],  :id,  :name, :url, :location, :description, :time_zone, :utc_offset
    • User Styles: image and color settings for each user with a full record
      Fields: [:id],  :id,  :profile_background_color, :profile_text_color, :profile_link_color, :profile_sidebar_border_color, :profile_sidebar_fill_color, :profile_background_image_url, :profile_image_url, :profile_background_tile
  • Relationships:
    • A follows B: 58M; User A follows User B.
      Fields: [:user_a_id, :user_b_id], :user_a_id, :user_b_id
    • A atsigns B: 3.0M; User A @atsigned User B anywhere in the tweet. Not threaded, and currently only carries User B’s screen_name, not ID.
      Fields: [:user_a_id, :user_b_name, :status_id], :user_a_id, :user_b_name, :status_id
    • A replies B: 2.5M; User A replied explicitly to User B (Tweet carries the in_reply_to_status_id).
      Fields: [:user_a_id, :user_b_id,   :status_id], :user_a_id, :user_b_id,   :status_id, :in_reply_to_status_id
    • A retweets B: (coming soon)
  • Tweets & Contents:
    • Tweets: 10.1M unique tweets.  A bunch more coming from the public datamining feed.
      Fields: [:id],  :id,  :created_at, :twitter_user_id, :text, :favorited, :truncated, :tweet_len, :in_reply_to_user_id, :in_reply_to_status_id, :fromsource, :fromsource_url
    • Hashtags: 220k #hashtags collected from tweets
      Fields: [:user_a_id, :hashtag], :user_a_id, :hashtag, :status_id
    • Tweet URLs: 2.1M urls collected from tweets
      Fields: [:user_a_id, :tweet_url], :user_a_id, :tweet_url, :status_id
    • Expanded TinyURL’ishes: 1.0M redirect targets for tinyurl.com, is.gd, bit.ly, etc.
      Fields: short_url, dest_url
  • Metrics:
    • Pagerank: A measure of ‘prestige’ determined by network flow; this is the algorithm Google uses to rank web pages.
      Fields: :user_id, :pagerank, :ids_that_user_follows
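
To make the field lists concrete, here’s the parsing sketch promised above. It tallies follower counts straight off the A-follows-B file; the filename is hypothetical, and the ids are kept as strings, which sidesteps the zero-padding pitfall in note 1 below:

  # Tally in-degree (follower counts) from the follows edge file.
  # 'a_follows_b.tsv' is hypothetical; fields: user_a_id <TAB> user_b_id.
  from collections import defaultdict

  followers = defaultdict(int)  # user_b_id -> number of followers seen

  with open("a_follows_b.tsv") as f:
      for line in f:
          user_a_id, user_b_id = line.rstrip("\n").split("\t")
          followers[user_b_id] += 1  # ids stay strings: no octal surprises

  # The ten most-followed user ids in the edge file.
  top = sorted(followers.items(), key=lambda kv: kv[1], reverse=True)[:10]
  for user_id, n in top:
      print(user_id, n)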

Random Notes:

  1. Make sure your language doesn’t interpret a zero-padded twitter_user.id (or status_id) like ‘000000000072’ as octal (which would make it decimal 58).
  2. There may be inconsistent user data for all-numeric screen_names: see
    http://code.google.com/p/twitter-api/issues/detail?id=162
    That is, the data in this scrape may commingle information on the user having screen_name ‘415’ with that of the user having id #415. Not much we can do about it, but we plan to scrub that data later.
  3. Watch out for some ill-formed screen_names: see http://code.google.com/p/twitter-api/issues/detail?id=209
  4. For the parsed data: act as if we’ve double-checked none of this. If you have questions please ask, though: help@infochimps.org
  5. The scraped files (ripd-_xxxxxxxx.tar.bz2) are exactly as they came off the server.
  6. Pagerank is non-normalized: divide by N and take the log (a sketch follows these notes).
  7. The files are huge, and the ripd-_xxxx directories will make your filesystem cry. We recommend Hadoop.
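
And the normalization sketch promised in note 6. The filename and N are assumptions; N should be the user count in your copy of the dump (we plug in this draft’s rough 2.7M):

  # Note 6 in practice: divide raw pagerank by N, then take the log.
  # 'pagerank.tsv' and N are assumptions; fields as in the Pagerank listing.
  import math

  N = 2700000  # roughly the 2.7M users in this draft

  with open("pagerank.tsv") as f:
      for line in f:
          user_id, raw_pagerank = line.split("\t")[:2]
          print(user_id, math.log10(float(raw_pagerank) / N))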