Monthly Archives November 2010

Strata: The Business of Data conference from O’Reilly

Infochimps has been ingrained in the emerging data ecosystem since our founding in 2008. We have met and talked to a lot of people in everything from open data to viz to academia, but much of the talk around BigData has more to do with the technologies (Hadoop, Cassandra, etc.) than with the data and the money around it. We at Infochimps are keenly aware of this disparity, and are therefore very happy to share that the good folks at O’Reilly have put together a conference that brings all members of the data community together at long last.

Strata is a new conference from O’Reilly Media, focused on the business and practice of data, happening February 1-3 in Santa Clara, CA.

We are particularly excited about this event because of its theme of collecting and using data successfully. If you dig into the event website and register yourselves, you’ll see that the conference shakes out into “three days of training, breakout sessions and plenary discussions — along with an Executive Summit, a Sponsor Pavilion and other events showcasing the new data ecosystem”.

Registration opened last week, so get your passes while you can still register early: https://en.oreilly.com/strata2011/public/register Use this discount code to save 15% on the price: str11ifc

And, though speaker submissions have closed, you can still submit a Birds of a Feather session here: http://strataconf.com/strata2011/public/cfp/132

A big thank you to O’Reilly Media for putting a stake in the ground that this new ecosystem can form around. Many of our friends and familiar faces will be there, including Drew Conway, Amber Case, Bradford Cross, Mike Olson, and Jud Valeski. We are very excited about the conference and see this as a huge sign that the data ecosystem is really gelling.

See you there!

Republican and Democrat Words.

Here at Infochimps, we thought it would be interesting and timely to take a look at some of the things that are being said about the candidates. With Twitter, it’s now easier to get a taste of the real-world thoughts and discussions happening among the voting populace. Last week I had the privilege of working with renowned data scientist Kurt Bollacker to come up with some quick analysis of what’s being said about politicians on Twitter.

Gathering The Data

Here’s what we did.

The Federal Election Commission keeps records of all of the candidates for the House and Senate races. From there, we were able to get a list of 3,533 candidate names. With this list, we searched our Twitter repository for these candidate names. We treated them as full names, since searching on just a first or last name would, in most cases, turn up many unrelated tweets.

So, for each candidate name, we were able to accumulate a “wordbag” of terms that are commonly found in tweets that contain their name. To clean up the wordbag data, I first hid the candidate’s name from the wordbag, and then hid very common terms like “http”, “bit”, “com”, and “www”, since the relative frequency of these common terms was eclipsing the other words. For this study, we searched our entire Twitter corpus, which contains over 3 billion tweets dating back to Twitter’s inception in 2006.
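As a rough illustration of the cleanup described above, here is a minimal Python sketch. It is not the actual pipeline: the function name `build_wordbag`, the tokenizing regex, and the sample tweets are all assumptions made for this example, and the list of common terms contains only the four mentioned in the post.

```python
import re
from collections import Counter

# Assumption: only the four common terms named in the post are hidden here.
COMMON_TERMS = {"http", "bit", "com", "www"}

def build_wordbag(full_name, tweets):
    """Count terms in tweets that mention full_name (hypothetical helper).

    The candidate's own name and the common terms are left out of the bag.
    """
    name = full_name.lower()
    name_tokens = set(re.findall(r"[a-z0-9]+", name))
    bag = Counter()
    for tweet in tweets:
        text = tweet.lower()
        # Match on the full name only; a bare first or last name is ignored.
        if name not in text:
            continue
        for token in re.findall(r"[a-z0-9]+", text):
            if token in name_tokens or token in COMMON_TERMS:
                continue
            bag[token] += 1
    return bag

# Made-up sample tweets, for illustration only.
tweets = [
    "Barney Frank debates tonight #tcot http://bit.ly/abc",
    "Frank Sinatra is on the radio",
    "I saw Barney Frank on the news tonight",
]
bag = build_wordbag("Barney Frank", tweets)
print(bag.most_common(3))
```

The second sample tweet is skipped because it contains only the last name, which mirrors the full-name matching described earlier. A real run would also need to handle things like punctuation inside names and retweets, which this sketch ignores.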

I’ve placed the raw wordbag data up in the Infochimps data marketplace, so take a look if you’re so inclined.

Taking A Look At Specific Candidates

For the Massachusetts House race between incumbent Democrat Barney Frank and Republican challenger Sean Bielat, we were able to use a word visualization tool, Wordle, to illustrate the two wordbags. The key thing to remember here is that this is an analysis of Tweets that contain the politician’s name, not Tweets by the politician.

For Tweets that mention Barney Frank (D-MA), the most commonly found words are:
[Image: Wordle visualization of the Barney Frank wordbag]

And, for Tweets that mention Sean Bielat, the most commonly found words are:
[Image: Wordle visualization of the Sean Bielat wordbag]

One curious term in the Sean Bielat wordbag is “235l44r” — this turns out to be a reference to http://tinyurl.com/235l44r, which resolves to a blog post on Bielat’s blog.

For both candidates, “tcot” is a strong signal. This is an acronym for “Top Conservatives on Twitter,” who commonly apply the hashtag “#tcot” to their tweets. What is more interesting is that #tcot is a much more popular term to use in conjunction with Barney Frank, the Democrat, than with his Republican challenger.

Taking a look at another race, this time in California, between Democratic incumbent Barbara Boxer and Republican challenger Carly Fiorina, we, once again, generated wordbag visualizations using Wordle:

For Tweets that mention Barbara Boxer (D-CA), the most commonly used words are:
[Image: Wordle visualization of the Barbara Boxer wordbag]

And, for Tweets that mention Carly Fiorina, the most commonly used words are:
[Image: Wordle visualization of the Carly Fiorina wordbag]

As expected, Boxer is commonly associated with Obama, and Fiorina is commonly grouped with Meg Whitman, the Republican candidate for California Governor.

One term that stands out notably for both candidates is “hair.” This most likely stems from the errant comment Fiorina made about Boxer’s hair when she was off-camera. Way to stick to the important issues, Twitterverse.

Looking for party differences

We were curious to see if there are any noticeable differences between the words used when tweeting about politicians from either party. From our filtered subset of the Twitter corpus, we ran a wordbag analysis against a set of tweets segmented by political party. To simplify things for this case, we only looked at Democrats and Republicans. Once again, if you’re interested in the raw data, it’s available at the Infochimps data marketplace.

We wondered which terms were more distinctive in talk about each party. To find out, we compared the percentage frequency of a term’s occurrence in one party’s tweets to its percentage frequency in the other party’s tweets. From this analysis, we ended up with a measure of each term’s relative frequency within each party’s filtered Twitter corpus.
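One way to read this comparison is sketched below in Python. The exact formula is not spelled out in the post, so this is an assumption: each term's count is turned into a fraction of its party's total word count, and the Democratic fraction is then divided by the sum of both fractions. The function name `party_share` and all of the counts are made up for illustration.

```python
from collections import Counter

def party_share(term, dem_bag, rep_bag):
    """Return the Democratic share of a term, from 0.0 to 1.0 (hypothetical helper).

    Counts are normalised by each party's total so that a larger corpus
    for one party does not dominate. Returns None if the term is absent
    from both bags.
    """
    dem_total = sum(dem_bag.values())
    rep_total = sum(rep_bag.values())
    dem_pct = dem_bag.get(term, 0) / dem_total if dem_total else 0.0
    rep_pct = rep_bag.get(term, 0) / rep_total if rep_total else 0.0
    if dem_pct + rep_pct == 0:
        return None
    return dem_pct / (dem_pct + rep_pct)

# Made-up counts, not the real wordbag data.
dem_bag = Counter({"obamacare": 30, "racist": 10, "god": 10, "vote": 50})
rep_bag = Counter({"obamacare": 20, "racist": 20, "god": 60, "vote": 100})

for term in ("obamacare", "racist", "god"):
    share = party_share(term, dem_bag, rep_bag)
    print(f"{term}: {share:.0%} Democrat, {1 - share:.0%} Republican")
```

Under this reading, a share of 50% means a term is equally frequent in both parties' tweets once corpus size is accounted for, even if the raw counts differ.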

I picked out a few that I thought were interesting:
[Image: selected terms and their relative frequency by party]

We can see that “god”, “lie” and “republican” are used predominantly in Tweets that correspond to Republicans, and “obamacare”, “daily” and “democrat” are used in Tweets that correspond to Democrats. “Daily” most likely refers to The Daily Show, of course. Interestingly, “racist” is split exactly evenly between parties. And, for Tweets about politicians that refer to the Huffington Post, 52% of the time, they’re talking about Democrats.

There is a hypothesis going around that Democrats’ political messaging is much less focused than the Republicans’, and this data seems to support that theory. “Obamacare” is most likely a term used by the Republicans in a derogatory manner, and that is one of the most commonly tweeted terms. The Dems have failed to unify around such a term.

Furthermore, the aforementioned “#tcot” phenomenon indicates that conservatives seem much more organized than liberals on Twitter, at least in their hashtag usage. If there is such a hashtag on the liberal side, it has failed to gain any kind of momentum thus far. At first, I thought that “#tlot” was the liberal banner, but in fact, that has already been claimed by the “Top Libertarians On Twitter.”

Where are all of the Democrats on Twitter? Perhaps they’re all quietly watching the Daily Show.

What’s next?

Now, in looking at this analysis, a whole slew of other questions starts to arise. How does this wordbag change over time? What sentiment is associated with each of the tweets, and how does that change over time? To me, the fact that these kinds of questions are surfacing is a promising indication that we are on to something. After all, the first step towards answering a question is to figure out what question to ask in the first place.

Words are powerful, and analyzing Twitter provides us with some interesting insights into what people are saying about politicians today. If you’re interested in doing similar analysis on your own, take a look at the raw wordbag data or our Wordbag API. If you find something interesting, give us a shout — we’d love to hear from you.

Oh, and don’t forget to vote tomorrow!