Massive Scrape of Twitter’s Friend Graph
- December 29, 2008
UPDATE:
We’ve posted several Twitter datasets on Infochimps. Take a look and build something cool!
UPDATE:
We’ve taken the data down for the moment, at Twitter’s request. STAY CALM. They want to support research on the twitter graph, but feel that since this is users’ data there should be terms of use in place. We’ve taken the data down while those terms are formulated. I pass along from @ev: “Thank you for your patience and cooperation.”
The infochimps have gathered a massive scrape of the Twitter friend graph. Right now it weighs in at
- about 2.7M users: we have most of the “giant component”
- 10M tweets
- 58M edges
(These and other details will be updated as further drafts are released. See below for technical info). This is still in rough, rough draft but this dataset is so amazingly rich we couldn’t help sharing it. We have not done all the double-checking we’d like, and the field order will change in the next (12/30) rev. We’ll also have a much larger dump of tweets off the public datamining feed.
The data is offline at the moment pending some TOS from twitter.com. If you’re interested in hearing when it’s released, follow the low-traffic @infochimps on twitter or look for a post here.
Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.
THE FILES ARE HUGE. They will, in principle, work with anything that can import a spreadsheet-style TSV file. But if you try to load a 56-million row dataset into Excel it will burst into flames. So will most tools; even opening an explorer/finder window on the ripd/_xxxx directories will fill you with regret.
If you have access to a cluster with Hadoop and Pig it’s highly, highly recommended. Otherwise, the files will load straight into MySQL using LOAD DATA INFILE (and presumably other DBs as well). Industrial-strength products such as Mathematica and Cytoscape will struggle, but can handle good-sized subsets. And don’t worry: besides featuring it on infochimps.org when we launch, once this dataset is mature we’ll move the raw data onto Amazon Public Datasets and archive.org. (At which point Amazon can serve as your Hadoop cluster, if you’ve got the lucre.)
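If you have neither a cluster nor a database handy, streaming the file one row at a time is one way to avoid loading it whole. The following is a minimal Python sketch, not part of the infochimps tooling: the helper name `count_column_values` is hypothetical, and which column holds which field is an assumption you should check against the files themselves, since the field order is due to change.

```python
import csv
from collections import Counter


def count_column_values(path, column, limit=None):
    """Stream a TSV file and count how often each value appears in one column.

    Only one row is held in memory at a time, so the file size does not matter;
    the Counter grows with the number of distinct values in the column.
    `limit` optionally stops after that many rows, handy for a quick look.
    """
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters in tweets and names as plain text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row_number, row in enumerate(reader):
            if limit is not None and row_number >= limit:
                break
            if column < len(row):  # skip short or malformed rows
                counts[row[column]] += 1
    return counts
```

For example, `count_column_values("a_follows_b.tsv", 1).most_common(10)` would list the ten most frequent values in the second column of a follows file; the file name here is made up for illustration.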
Description of objects and fields:
All the files are Tab Separated (TSV) files.
- Users:
  - Partial Users: 8.1 million sightings of 2.7M unique users. When you ask for a user’s following / friends list, or in the public timeline tweets, you get a partial listing of each user. This table lists each unique state observed: if @infochimps was seen on the 10th, the 15th, and the 16th, with respectively 80, 80 and 82 followers (everything else the same), you’ll get the twitter_user_partial records of the 10th and the 16th.
    Fields: [:id], :id, :screen_name, :followers_count, :protected, :name, :url, :location, :description, :profile_image_url
  - Users: 2.2M full user records
    Fields: [:id], :id, :screen_name, :created_at, :statuses_count, :followers_count, :friends_count, :favourites_count, :protected
  - User Profile: descriptive info for each user with a full record
    Fields: [:id], :id, :name, :url, :location, :description, :time_zone, :utc_offset
  - User Styles: image and color settings for each user with a full record
    Fields: [:id], :id, :profile_background_color, :profile_text_color, :profile_link_color, :profile_sidebar_border_color, :profile_sidebar_fill_color, :profile_background_image_url, :profile_image_url, :profile_background_tile
- Relationships:
  - A follows B: 58M; User A follows User B.
    Fields: [:user_a_id, :user_b_id], :user_a_id, :user_b_id
  - A atsigns B: 3.0M; User A @atsigned User B anywhere in the tweet. Not threaded, and currently only carries User B’s screen_name, not ID.
    Fields: [:user_a_id, :user_b_name, :status_id], :user_a_id, :user_b_name, :status_id
  - A replies B: 2.5M; User A replied explicitly to User B (Tweet carries the in_reply_to_status_id).
    Fields: [:user_a_id, :user_b_id, :status_id], :user_a_id, :user_b_id, :status_id, :in_reply_to_status_id
  - A retweets B: (coming soon)
- Tweets & Contents:
  - Tweets: 10.1M unique tweets. A bunch more coming from the public datamining feed.
    Fields: [:id], :id, :created_at, :twitter_user_id, :text, :favorited, :truncated, :tweet_len, :in_reply_to_user_id, :in_reply_to_status_id, :fromsource, :fromsource_url
  - Hashtags: 220k #hashtags collected from tweets
    Fields: [:user_a_id, :hashtag], :user_a_id, :hashtag, :status_id
  - Tweet URLs: 2.1M urls collected from tweets
    Fields: [:user_a_id, :tweet_url], :user_a_id, :tweet_url, :status_id
  - Expanded TinyURL’ishes: 1.0M redirect targets for tinyurl.com, is.gd, bit.ly, etc.
    Fields: short_url, dest_url
- Metrics:
- Pagerank: A measure of ‘prestige’ determined by network flow; this is the algorithm Google uses to weight web pages.
Fields: :user_id, :pagerank, :ids_that_user_follows
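The "unique state observed" rule for Partial Users can be read as: sort one user's sightings by date and keep a sighting only when something changed since the previous one. Here is a minimal Python sketch of that reading. It is an interpretation of the description above, not the code that produced the dataset; the function name `unique_states` and the tuple layout are hypothetical, and whether a state that reappears later is recorded again is an assumption.

```python
def unique_states(sightings):
    """Keep the first sighting of each run of identical states.

    `sightings` is an iterable of (seen_on, state) pairs for ONE user,
    where `state` is any comparable value, such as a tuple of field values.
    """
    kept = []
    previous_state = object()  # sentinel that never equals a real state
    for seen_on, state in sorted(sightings, key=lambda pair: pair[0]):
        if state != previous_state:
            kept.append((seen_on, state))
            previous_state = state
    return kept


# The example from the text: seen on the 10th, 15th and 16th
# with 80, 80 and 82 followers, everything else the same.
sightings = [
    (10, ("infochimps", 80)),
    (15, ("infochimps", 80)),
    (16, ("infochimps", 82)),
]
print(unique_states(sightings))
# keeps the sightings from the 10th and the 16th
```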
Random Notes:
- Make sure your language doesn’t interpret a zero-padded twitter_user.id (or status_id) like '000000000072' as octal 58.
- There may be inconsistent user data for all-numeric screen_names: see http://code.google.com/p/twitter-api/issues/detail?id=162
  That is, the data in this scrape may commingle information on the user having screen_name '415' with that of the user having id #415. Not much we can do about it, but we plan to scrub that data later.
- Watch out for some ill-formed screen_names: see http://code.google.com/p/twitter-api/issues/detail?id=209
- For the parsed data: act as if we’ve double-checked none of this. If you have questions please ask, though: help@infochimps.org
- The scraped files (ripd-_xxxxxxxx.tar.bz2) are exactly as they came off the server.
- Pagerank is non-normalized — divide by N and take the log.
- The files are huge, and the ripd-_xxxx directories will make your filesystem cry. We recommend hadoop.
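Two of the notes above lend themselves to a short illustration: reading zero-padded ids as decimal, and normalizing the pagerank column. This is a minimal Python sketch under stated assumptions. Both function names are hypothetical; N is assumed to mean the number of users in the graph; and "take the log" is assumed to mean the natural log, since the note does not name a base.

```python
import math


def parse_padded_id(text):
    """Parse a zero-padded id such as '000000000072' explicitly as base 10."""
    return int(text, 10)


def log_normalized_pagerank(raw_pagerank, n_users):
    """Divide a raw pagerank by N, then take the natural log.

    ASSUMPTIONS: N is the number of users in the graph, and the
    log base is e; neither is spelled out in the notes.
    """
    return math.log(raw_pagerank / n_users)


print(parse_padded_id("000000000072"))  # 72, the intended decimal value
print(int("000000000072", 8))           # 58, the octal misreading to avoid
```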

Thanks for this post! Much easier to read :-)
I didn’t think of loading this into MySQL. Will give it a try! :)
Can you provide a bittorrent link for this download?
This would minimize greatly the load on your server and allow people to download.
(http://blog.infochimps.org/2008/12/29/massive-scrape-of-twitters-friend-graph/)
I’m waiting for re-approval of distribution, which I hope to get by end of week.
Once that is set up, if you’d like to help set up bittorrent distribution please get in touch: flip at infochimps.org
Interesting analysis coming out of this data here:
http://blog.cli.gs/news/analysis-of-linking-patterns-on-twitter-cligs-scores-well
Please do ping people when the data is back online. ;)
http://twitter.com/nicolesimon
I can’t wait to eat this data with Splunk. Tens of millions of records are nothing for a Splunk server to eat. Check out what I did with just a 5-minute ping on search.twitter.com looking for Boxee viewing.
http://blogs.splunk.com/thewilde/2008/12/29/its-time-for-a-boxee-ing-match-with-splunk-2/
You guys should download Splunk and have it eat that data (if I don’t get to it first)
Maybe this would be a good use for Amazon’s new Public Data Sets feature. http://aws.amazon.com/publicdatasets/
Any news on when the files will be back up?
Twitter has no intention of ever giving you permission to put this back up. Heck, they don’t even have a proper terms of service themselves (just search [twitter waldman]).
I think, and I hope, that my confidence is very well-placed.
The fact is that this bulk data existed and exists quite apart from my scrape… Their API limits are set to just under half a million requests per day (so, up to 5 million partial-user records), not to mention whatever HTML scraping people get away with. Every piece of this data is available free to the public from their API, and from any of a dozen other independent APIs.
The question at hand is: do they want random people collecting and distributing sub-rosa copies of the data, outside of their view and with no formal agreement, at great cost in servers and monitoring? /We/ asked and got permission, having offered to coordinate any necessary restrictions, and immediately took it down when they reconsidered. The next person might not do any of that.
Our offer is to host, share and index the data; to put it in view of the smartest researchers in the world (I’ve been contacted by a number of prominent researchers eager to get at this); to ensure that users understand the associated terms of service; to place it in an environment where others will link it with the rest of the world’s free data; and to make it easily available to people building tools that need SOME now and will gladly pay for ALL later.
And to do all this respecting the value of what they’ve shared and recognizing their interest in seeing others use it responsibly.
You pointed to the Waldman brouhaha to show something something about their user-facing Terms of Service. I actually side with how Twitter handled it; regardless, they’ve grown and thrived by courageously adopting very open policies. Their dev team are engineers and understand the value of opening this data. The hunger for this data is enormous, and the larger community is one that values openness and is not shy about applying pressure in that direction.
Oh, hey, you posted the same thing here as in my (misdirected) e-mail. I guess I have to return in kind (cropped to remove e-mail specific parts, but not edited)
In my experience, Twitter loves to pass the buck. For example, there’s an error in their javascript that actually breaks hyperlinks inside of tweets because it adds spaces randomly inside a tweet’s text. @al3x refuses to acknowledge it’s a twitter problem, and refuses to assist in determining just how it’s a Firefox problem (as he claims it to be). Or, “It’s not my fault, but I won’t show you how I know it’s not my fault, because it’s actually my fault and I don’t want to reveal that”.
How about this: you could use their new oauth (or old http auth) to allow users to download their _own_ data from this scrape. I have been unable to find any of my early tweets, and twitter’s paging limits me to something like 600 tweets. I fear I may have to do a massive scrape like you did, then piece together all the tweets by specific user ids in packages, and let people download theirs after proving who they are.
Just because you were able to increment a number and do a show command on tweets, doesn’t mean that was the intended purpose. In fact, I wouldn’t be surprised if they write some software to catch people doing just that and start feeding them either bunk data or no data at all.
With regards to the Waldman brouhaha, I also sided against her for a variety of reasons. She had a lot to gain by making Twitter look bad by being Pownce’s community manager. This is like a woman from Apple claiming to have been sexually assaulted on the Microsoft campus, conveniently out of sight of a security camera, by someone she didn’t get a good glimpse of. It’s very hard to believe. However, bloggers everywhere rallied against Twitter, just like she’d hoped.
Finally, I think the dev team have forgotten about your experiment. You might want to ping them and see what the status is.
I’m not interested in dancing around their reasonable interest in managing their users’ data.
And I’ve found the tech support highly responsive and better than most.
Philip, do you have any idea of when this dataset is going to be released again? I’m *really* anxious.
Will this ever be available again? This dataset would help me out a lot with my research paper… but I need it soon :(
Where is it? Pleaseeeeee
mrflip, do you have any idea of when it will be released?
Any update on whether this dataset can be released?
I wonder when it might be possible to get this interesting dataset. Any news from Twitter?
Hi everybody…! Any news or idea if the data will be released again or not…?
Just gloating about how right I was. Twitter is currently telling people that they’re so overloaded with support questions that instead of marking their question as ‘enhancement’ or something to do later, they’re just saying “yeah, we don’t do that anymore, contrary to our announcement stating the opposite and no retraction of that announcement. hehe.”
Twitter is not a supportive environment: http://statsheet.com/blog/thanks-to-twitter-im-creating-my-own-twitter
I am one of the users eagerly waiting for its release. Can you tell when/ if it will be published again?
This data set is still very interesting! Please attempt to repost!