- November 1, 2010
So here at Infochimps, we thought it would be interesting and timely to take a look at some of the things that are being said about the candidates. With Twitter, it’s now easier to get a taste of the real-world thoughts and discussions happening among the voting populace. Last week, I had the privilege of working with renowned data scientist Kurt Bollacker to come up with a quick analysis of what’s being said about politicians on Twitter.
Gathering The Data
Here’s what we did.
The Federal Election Commission keeps records of all of the candidates for the House and Senate races. From there, we were able to get a list of 3,533 candidate names. With this list, we searched our Twitter repository for these candidate names. We treated them as full names, since searching on just a first or last name would, in most cases, return many unrelated tweets.
So, for each candidate name, we were able to accumulate a “wordbag” of terms that are commonly found in tweets that contain their name. To clean up the wordbag data, I first hid the candidate’s name from the wordbag, and then hid very common terms like “http”, “bit”, “com”, and “www”, since the relative frequency of these common terms was eclipsing the other words. For this study, we searched our entire Twitter corpus, which contains over 3 billion tweets dating back to Twitter’s inception in 2006.
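To make the wordbag step concrete, here is a minimal sketch in Python of how such a wordbag could be built. It is not the code we ran against the full corpus. The function name `build_wordbag`, the simple alphanumeric tokenizer, and the exact noise list are assumptions for illustration; only the four noise terms themselves come from the description above.

```python
import re
from collections import Counter

# Noise terms named in the post; the real list may have been longer (assumption).
NOISE_TERMS = {"http", "bit", "com", "www"}

# Hypothetical tokenizer: lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def build_wordbag(candidate_name, tweets, noise_terms=NOISE_TERMS):
    """Count terms in tweets that mention the candidate's full name."""
    name_lower = candidate_name.lower()
    # Match the full name only, on word boundaries, so a bare first or
    # last name does not pull in unrelated tweets.
    name_re = re.compile(r"\b" + re.escape(name_lower) + r"\b")
    name_tokens = set(TOKEN_RE.findall(name_lower))

    bag = Counter()
    for tweet in tweets:
        text = tweet.lower()
        if not name_re.search(text):
            continue
        for token in TOKEN_RE.findall(text):
            # Hide the candidate's own name and the very common noise terms.
            if token in name_tokens or token in noise_terms:
                continue
            bag[token] += 1
    return bag
```

Calling `build_wordbag("Barney Frank", tweets).most_common(20)` on a list of tweet strings would give the kind of term list that can be pasted into a word visualization tool.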
I’ve placed the raw wordbag data up in the Infochimps data marketplace, so take a look if you’re so inclined.
Taking A Look At Specific Candidates
For the Massachusetts House race between incumbent Democrat Barney Frank and Republican challenger Sean Bielat, we were able to use a word visualization tool, Wordle, to illustrate the two wordbags. The key thing to remember here is that this is an analysis of Tweets that contain the politician’s name, not Tweets by the politician.
One curious term in the Sean Bielat wordbag is “235l44r” — this turns out to be a reference to http://tinyurl.com/235l44r, which resolves to a blog post on Bielat’s blog.
For both candidates, “tcot” is a strong signal. This is an acronym for “Top Conservatives on Twitter,” who commonly apply the hashtag “#tcot” to their tweets. What is more interesting is that #tcot is a much more popular term in tweets about Barney Frank, the Democrat, than in tweets about his Republican challenger.
Taking a look at another race, this time in California, between Democratic incumbent Barbara Boxer and Republican challenger Carly Fiorina, we, once again, generated wordbag visualizations using Wordle:
As expected, Boxer is commonly associated with Obama, and Fiorina is commonly grouped with Meg Whitman, the Republican candidate for California Governor.
One term that stands out notably for both candidates is “hair.” This most likely stems from the errant comment about Boxer’s hair that Fiorina made when she was off-camera. Way to stick to the important issues, Twitterverse.
Looking for party differences
We were curious to see if there are any noticeable differences between the words used when tweeting about politicians from either party. From our filtered subset of the Twitter corpus, we ran a wordbag analysis against a set of tweets segmented by political party. To simplify things for this case, we only looked at Democrats and Republicans. Once again, if you’re interested in the raw data, it’s available at the Infochimps data marketplace.
We wondered which terms were more distinctive in talking about each party. To find out, we compared each term’s percentage frequency in tweets about one party with its percentage frequency in tweets about the other party. From this analysis, we ended up with a measure of each term’s relative frequency within each party’s filtered Twitter corpus.
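One way to read that comparison is sketched below in Python. It is an illustration under stated assumptions, not our exact formula: the names `term_shares` and `party_split` are hypothetical, and the measure shown (a term's frequency share in one party's wordbag divided by the sum of its shares in both) is simply one normalization that produces a percentage split between the two parties.

```python
from collections import Counter


def term_shares(bag):
    """Convert raw term counts into each term's fraction of the whole bag."""
    total = sum(bag.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in bag.items()}


def party_split(dem_bag, rep_bag):
    """For each term, return the Democratic portion of its combined share.

    1.0 means the term appears only in tweets about Democrats, 0.0 only in
    tweets about Republicans, and 0.5 means an even split.
    """
    dem = term_shares(dem_bag)
    rep = term_shares(rep_bag)
    split = {}
    for term in set(dem) | set(rep):
        d = dem.get(term, 0.0)
        r = rep.get(term, 0.0)
        if d + r == 0:
            continue  # term has a zero count in both bags
        split[term] = d / (d + r)
    return split
```

Working with shares rather than raw counts keeps the comparison fair when one party's corpus has many more tweets than the other's.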
We can see that “god”, “lie” and “republican” are used predominantly in Tweets that correspond to Republicans, and “obamacare”, “daily” and “democrat” are used in Tweets that correspond to Democrats. “Daily” most likely refers to The Daily Show, of course. Interestingly, “racist” is split exactly evenly between parties. And, for Tweets about politicians that refer to the Huffington Post, 52% of the time, they’re talking about Democrats.
There is a hypothesis going around that Democrats’ political messaging is much less focused than the Republicans’, and this data seems to support that theory. “Obamacare” is most likely a term used by the Republicans in a derogatory manner, and that is one of the most commonly tweeted terms. The Dems have failed to unify around such a term.
Furthermore, the aforementioned “#tcot” phenomenon indicates that conservatives seem much more organized than liberals on Twitter, at least in their hashtag usage. If there is such a hashtag on the liberal side, it has failed to gain any momentum thus far. At first, I thought that “#tlot” was the liberal banner, but in fact, that has already been claimed by the “Top Libertarians On Twitter.”
Where are all of the Democrats on Twitter? Perhaps they’re all quietly watching the Daily Show.
Now, in looking at this analysis, a whole slew of other questions starts to arise. How does this wordbag change over time? What sentiment is associated with each of the tweets, and how does that change over time? To me, having these types of questions start to surface is a promising indication that we are on to something. After all, the first step towards answering a question is to figure out what question to ask in the first place.
Words are powerful, and analyzing Twitter provides us with some interesting insights into what people are saying about politicians today. If you’re interested in doing similar analysis on your own, take a look at the raw wordbag data or our Wordbag API. If you find something interesting, give us a shout — we’d love to hear from you.
Oh, and don’t forget to vote tomorrow!