Parse.ly Joins the Bunch; Brings 30 Million+ News Headlines & Summaries from 2009-2011

Hello fellow data monkeys,

A few weeks ago, Infochimps and Parse.ly completed a collaboration to release nearly 30 million news headlines and summaries from 2009-2011 in a nicely-structured JSON dump. This is data that Parse.ly’s crawlers have collected over the last 2 years from over 500,000 web news sources. I am a cofounder of Parse.ly and was the lead engineer who worked on making the data dump happen.

We have been receiving some questions about this data, so I thought it’d be helpful to give some background via this guest blog post. It’s also great timing: the whole Parse.ly team has just returned from a trip to Austin that included a stop at the Infochimps world headquarters. Let’s not let this opportunity for big data collaboration slip away!

OK, so what’s

Parse.ly is a NYC-based technology startup that provides data insights to the web’s best publishers, including many that you likely read. Editors and writers from these publishers use our tools and analytics to make better decisions every day.

But didn’t start out that way. Before our shift to a publisher tools company, our small team put together a personalized news reader application as a demonstration of our backend technology during the Dreamit Ventures incubator program. This was called the Reader. This application didn’t go anywhere. Despite write-ups from the tech press and a few thousand users, we knew it was more a demonstration/experiment than a foundation we could build our business on. You can check out a demo video here for a little time capsule of what it did and how it worked: Reader Video

Even as our company’s priorities shifted toward our core business value, we kept the Reader running. We did this to ensure we still had meaningful data to apply our data-driven natural language processing technology (see my slides on doing NLP with Python) and because it was just a fun product to run and use. We quietly shut the Reader down a few months ago and are now firmly a software-as-a-service company focused on the needs of the web’s best online content publishers and media companies.

One of our last acts before shutting the Reader down, though, was to make the valuable data we had collected available to the world. Last month, this dataset was also field-tested at the HackNY Hackathon, where some teams built their projects atop our data dump. Seeing a widespread need for it, we thought: what better way to get this data out there than to partner with the web’s leading data marketplace, Infochimps?

What pain does the data dump solve?

When we were first starting up, we noticed that there was really no good way to get high-quality, structured news data at a scale that would be meaningful for statistical analysis, research, and prototyping. Your choices were to hit real-time and search-oriented sources, such as Daylife (Parse.ly is a Daylife partner), or source-specific APIs such as the Guardian Open Platform. Indeed, we wrote drivers to play with each of these: see Parse.ly engineer Didier Deshommes’ driver for Daylife and my driver for the Guardian. The problem with each of these is that it is difficult to download large amounts of data (e.g. >1GB, >1M articles) in any reasonable time frame without violating Terms of Service or rate limits.

So, what you are left with is the “do-it-yourself” option: a costly process of building your own RSS/Atom feed processor and crawling, storing, and structuring the content yourself.

Finally, even if you decide you are willing to take the risk to build your own RSS/Atom infrastructure (perhaps atop excellent modules like feedparser), you will quickly run into these tough questions:

  • What sources do I crawl to get a representative sample of content? The web is wide and deep. Picking a good set of seed sources is a difficult task, and often involves writing your own web crawler, an even hairier task. The Parse.ly data dump provides nearly 500,000 good news/blog web sources that you could use as an excellent seed set.
  • How do I handle unicode and text encoding issues? The web is a messy place. Any good crawling infrastructure will need to worry about the various text encodings in play. This is not glorious work, and it can lead to frustrating production issues. The data dump gives a large enough sample of articles that you can nip these problems in the bud.
  • How do I deduplicate all of this content? This problem comes up with news data all the time, and it can make products look very, very bad. The data dump includes “near-duplicates” but allows you to detect them easily using our “signature_hash” field. This will allow you to train up deduplication algorithms that are effective at scale.
  • Where do I find a non-academic corpus of text data? Though modules like NLTK provide excellent academic text corpora (such as the Gutenberg Project’s novel corpus or the Brown Corpus), NLTK and other tools like it lack corpus data that is representative of news stories “in the wild”. You could use our data dump to build up a truly significant stopword/common-term list, for example, or scores, or collocation lists.
  • Will my system or database scale? Simply loading this data into MongoDB, PostgreSQL, Solr, or any other data store will let you test characteristics of that system at a scale likely to mirror “web scale” conditions: How big will an index on a text field be with 30 million news headlines? How long will backups take? How much disk storage can be saved by compression? … and a host of other questions!
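To illustrate the deduplication point above, here is a hedged sketch of filtering near-duplicates with the dump’s “signature_hash” field. The exact JSON layout assumed here (one object per line, with “title” and “signature_hash” keys) is an illustration, not a specification of the dump’s schema.

```python
# Sketch: keep the first article seen for each signature_hash,
# treating matching hashes as near-duplicates.
import json

def dedupe_articles(json_lines):
    """Return articles with duplicate signature_hash values removed."""
    seen = set()
    unique = []
    for line in json_lines:
        article = json.loads(line)
        sig = article.get("signature_hash")
        if sig in seen:
            continue  # near-duplicate of an article we already kept
        seen.add(sig)
        unique.append(article)
    return unique
```

A production system would likely compare hashes across a sliding time window rather than holding every hash in memory, but the dump gives you real data to tune that trade-off against.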

Use Cases

By providing nearly 30M news headlines in a structured format with enough metadata to effectively parse, analyze, and deduplicate them, we have provided a significant slice of web content that can be used for building your own proprietary systems. By providing URLs and labels for nearly 500,000 web sources, we have provided an excellent starting point for building your own in-house web crawler. Either of these tasks would be painful to do on your own; trust us, we know from experience. For the low one-time cost of $350, you can turbocharge your next project that leverages big news/web data, and work out data-scale issues in your system ahead of time.
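As a small worked example of the text-analysis use case, here is a sketch that tallies the most common headline terms, a first step toward the stopword/common-term lists mentioned above. The tokenization is deliberately naive; a real pipeline would normalize unicode and use a proper tokenizer such as NLTK’s.

```python
# Sketch: build a common-term list from headline text.
from collections import Counter
import re

def common_terms(headlines, top_n=10):
    """Count lowercase word tokens across headlines; return the top_n."""
    counts = Counter()
    for headline in headlines:
        # Naive tokenizer: runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", headline.lower()))
    return counts.most_common(top_n)
```

Run over 30 million headlines, the head of this distribution is your stopword list and the tail is where the interesting collocations hide.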

Finally, for those who are more on the analyst/designer side of the spectrum, this data provides an opportunity to produce some beautiful visualizations, or confirm/disprove assumptions about the relative coverage of different news topics.

Why you should buy this dataset now

As a special introductory offer, the first 5 people to buy the dataset will be invited to a 1-hour interactive webinar with me, Parse.ly’s CTO, to discuss how to solve NLP, information retrieval, and big news data problems with this dataset at your side. This offer expires on December 1, 2011. Let us know if you have any questions, and we hope our data can help you succeed!

Good luck, and happy hacking!

Andrew Montalenti
Co-Founder & CTO
