Products & Features

New Data Sets: Colleges, Hospitals, The Marvel Universe and Social Data

Look what we found!  No, it’s not just a picture of a baby monkey, though we did think it was apropos for our new weekly feature highlighting some of the best new data sets to join our ever-growing data marketplace.  Today, we bring you a mix of geo-data and social data, including a social graph of the Marvel Universe, constructed by three researchers from the Balearic Islands, which surprisingly is not unlike a real-life social network!

US Colleges and Universities
This is a database of all US colleges and universities as of 2010. It lists 9350 colleges and universities, each with name, address, phone number and URL.

US Hospitals
This data set contains 49 fields on 4287 hospitals in the United States. While not all datapoints are available for every hospital, this robust data set contains info such as: location and contact information, heart attack mortality rate, gross patient revenue, number of staffed beds, approximate average patient length of stay and patient satisfaction along several metrics.

2000+ Flickr Images, 10,000+ YouTube Videos and 10,000+ Digg Users
These data sets, courtesy of Munmun De Choudhury, showcase large scrapes of social data that the post-doctoral fellow has used to perform image content analysis, examine the dynamics of threaded comments in rich media sharing, and study information diffusion and community evolution centered around topics.

Marvel Universe Social Graph
This fun Marvel Comics character collaboration graph treats the artificial world of the Marvel comic books as an example of a social collaboration network. The researchers compare the characteristics of this universe to real-world collaboration networks, such as the Hollywood network or the one created by scientists who work together in producing research papers, and find that the Marvel Universe is surprisingly closer to a real social graph than one might expect.

We’ve got over 13,700 more where that came from.  Visit our site today and search for the data you want.  Can’t find what you need? Let us know on UserVoice!

Measuring online influence: The case for Big Data

Measuring influence online is in its infancy. Unrefined ‘metrics’ dominate the space and much of what exists currently is of little value and has insignificant statistical meaning. Most measure only what is easy to measure – number of Twitter followers, number of times a word is mentioned in the last week, Facebook ‘likes’, bizarre – often undisclosed – methods of calculating someone’s online social ‘rank’. There are even gimmicky schemes to produce ‘measurements’ like Fast Company’s “Influence Project”*.

Why do we do things this way? Because it is easy and because dealing with big data is hard.

The most valuable measurements that come out of this space live in the analysis of big data. To be effective, one needs a global perspective and all the connections – not just the 100 million active Twitter users, but the 4 billion connections between them and, even more, the billions of additional connections implied by mentions, retweets and replies.

A simple search of Twitter will yield you a small sample of unfiltered tweets in a short time frame. A count of followers is as easy as visiting a user’s profile page.
Those metrics miss out on what is important. Which tweets were most significant? Which users made the most impact? Are someone’s followers actual people or just spam bots? Are they just part of an auto-follow-back ponzi-scheme initiated by some live-for-3-days-and-milk-AdWords-for-all-it’s-worth-webapp?

But what should we measure?

Last week, HP published a study on what makes a tweet influential and the problems around the measure of “Influence”. Several bloggers responded (1,2,3). In general it is agreed that “Influence” is still not fully defined and that retweets alone are not the definition. Retweets are one way to measure, links are another. “Engagement” as a whole is an intersecting issue. Clearly though, there is an interest in measuring all of these ‘things’, if only we could define them.

How should measurements be delivered?

Companies such as Klout give composite numbers that, while useful, fall short of being helpful for a wider range of use cases such as spam filtering, topic relevancy and understanding relationships in one’s network.  “Influence” is a broader topic that spans much more than just global rank or one’s own reach.  Transparency is an issue as well.  While a single number is easy to act on, it is also a set of magically combined factors that comes with no clue as to how they are combined and weighted.  People should know where their data came from and how it was produced if they are to trust it.

Solving for Influence

The data is out there and the tools exist to analyze it, but the world is still unsure of what to tell those tools to do.

Infochimps uses the full friend graph and historical tweets of Twitter back to 2006 to produce similar metrics to what HP discusses such as Sway (how much a user gets retweeted), trstrank (global ranking of influence, Google PageRank style) and Enthusiasm, which is a reverse measure of what HP refers to as “Passivity” – that is, how often someone retweets someone else.
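
As a rough illustration of how retweet-based measures like these might be computed, here is a minimal Python sketch. The formulas, the `UserCounts` type and its field names are assumptions made for this example; they are not the actual Infochimps or HP definitions.

```python
from dataclasses import dataclass


@dataclass
class UserCounts:
    """Hypothetical per-user tallies; the field names are illustrative only."""
    tweets_out: int     # tweets the user has posted
    retweets_in: int    # times other users retweeted this user
    retweets_out: int   # times this user retweeted someone else


def sway(u: UserCounts) -> float:
    """Assumed definition: retweets received per tweet posted."""
    return u.retweets_in / u.tweets_out if u.tweets_out else 0.0


def enthusiasm(u: UserCounts) -> float:
    """Assumed definition: share of the user's tweets that retweet others."""
    return u.retweets_out / u.tweets_out if u.tweets_out else 0.0


example = UserCounts(tweets_out=200, retweets_in=50, retweets_out=20)
print(sway(example), enthusiasm(example))
```

Normalizing by the number of tweets posted is only one possible choice; a raw count, or a count over a fixed time window, would answer a slightly different question.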

There exists a very basic problem in nomenclature.  What are the words we should be using and what are the human behaviors they represent?

More importantly – HP, Infochimps and others are still discovering just what should be analyzed. What is still needed is for people in other fields to fill in those gaps.  In sociology and marketing: what is the difference between fame, interestingness, influence and even, infamy?  What constitutes the ‘humanness’ of a user (are they real, are they just a bunch of employees tweeting for a celebrity)?  What are influence and engagement? What should we be looking for in relationships? In the social CRM space: how should you identify relationship networks?  What are the best ways to find a route to new leads?

What are the use cases in each of these fields for the data?  The big data world has the tools to quantify the data; it just needs to know which questions to answer.  Once we know what behaviors to look for, they can be translated into signatures that are identifiable in data.

Finally, we also need a methodology for combining the data into actionable metrics.

Measuring what is difficult instead of what is easy is a game changer. In fact, not just a game or a ballpark, but a whole sport. The knowledge is in the data.

* If you must visit it, here’s the address: influenceproject.fastcompany.com

Real geeks don’t use IE – Infochimps Browser Usage Analytics

Browser usage by the somewhat normal web

When scoping out a web project, one of the first things a designer or web programmer will want to know is “what browsers are we supporting?” The decision is usually led by a quick googling to find a page like the W3C’s, which quickly tells you:

2010    IE8     IE7    IE6    Firefox   Chrome   Safari   Opera
July    15.6%   7.6%   7.2%   46.4%     16.7%    3.4%     2.3%

About 30.4% of the browser world belongs to IE (much better than the way things were just a few years ago). Almost 15% of your users are using such an old version of IE that you may be tempted to code using IE6 or 7 as your lowest common denominator.
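
The IE totals can be checked directly from the July row of the table. A minimal sketch, with the shares typed in by hand from that row:

```python
# Browser shares from the July 2010 row of the table, in percent.
shares = {
    "IE8": 15.6, "IE7": 7.6, "IE6": 7.2,
    "Firefox": 46.4, "Chrome": 16.7, "Safari": 3.4, "Opera": 2.3,
}

# All IE versions together, and the two older versions on their own.
ie_total = round(sum(v for k, v in shares.items() if k.startswith("IE")), 1)
ie_legacy = round(shares["IE6"] + shares["IE7"], 1)
print(ie_total, ie_legacy)
```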

Browser usage by Infochimps users

Consider who is visiting your site though. Are your users more net savvy? Are they geeks? Here’s what our visitors use:

[Chart: api.infochimps.com usage, 2010-08-11]

About 10% of infochimps.org users use IE, almost a third of the norm.
Half of our IE users use IE8 (a much more capable version of IE), leaving a meager 5% in the IE6/7 realm, which is split half and half (2.5% total IE6 users – again, almost a third of the norm).
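
To see how “almost a third” falls out of the numbers, here is a short sketch that divides our shares by the general-web shares. The Infochimps figures are the rounded ones quoted in the text, so the ratios are approximate.

```python
# General-web shares from the July 2010 table, in percent.
general_ie = 15.6 + 7.6 + 7.2   # IE8 + IE7 + IE6
general_ie6 = 7.2

# Rounded Infochimps figures quoted in the text, in percent.
infochimps_ie = 10.0
infochimps_ie6 = 2.5

ie_ratio = infochimps_ie / general_ie
ie6_ratio = infochimps_ie6 / general_ie6
print(f"IE: {ie_ratio:.2f} of the norm, IE6: {ie6_ratio:.2f} of the norm")
```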

Conclusion: Real nerds don’t use Internet Explorer

As far as design philosophy goes, we strive to design our sites (infochimps.org, api.infochimps.com) in a progressive enhancement fashion so that all browsers can be supported well (enough) and accessibility is simple and works. IE6 isn’t number one on our list of things to deal with.

When you have limited resources (like a startup), consider who is actually using your site before spending those resources on any one group of users.

Refreshed Datasets!

By popular demand, we have refreshed our massive corpus of Twitter data. As part of the facelift, some of our API fields have been eliminated, and many more have been added. Trstrank, for instance, will include a new field called Trstquotient, or TQ, which can be used as a spam indicator. (For details on how that works, stay tuned for a forthcoming blog post.) The fields we chose to eliminate from Trstrank – followers, following, and statuses – can be readily accessed via Twitter’s API.

Our new datasets will provide the most accurate and up-to-date reflection of a Twitter user’s measure of influence (Trstrank), activity level (Influencer Metrics), and interactions between two given users (Conversations). The datasets that changed the most, Influencer Metrics and Conversations, have lots of new fields.  Influencer Metrics is now a more rigorous way to measure retweets and @ replies, both incoming and outgoing, and Conversations gives a full summary of the interactions between two users.


We’re versioning the new API calls, to prevent the unpleasantness that could accompany a rapid switcheroo, but our old calls will be phased out quickly. We welcome your feedback on this exciting update!

Cool things to be built with the Infochimps API

We started a page of ideas of cool things you can build using the Query API. There are a ton of valuable things that can be done using the current API calls and we’d love to see them made. Here are some of them:

  • Filter influencers or non-influencers from any feed of tweets (Influence and/or Trstrank)
  • Filter Twitter spam (Trstrank and/or influence)
  • Build a word cloud for a Twitter user in any app (Wordbag)
  • Target content/ads based on words a user tweets about the most (Wordbag)
  • Find the true influence of a Twitter user by combining their Trstrank, ratio of friends/followers, ratio of statuses to retweets in, etc (Trstrank and Influence)
  • Find social circles on Twitter, not by followers, but by who is actually talking to each other (Conversation)
  • Target content/ads based on IP address (IP→Census)
  • A/B test your website/web app based on demographic data (IP→Census)
  • Build a site that lists a person’s Twitter followers with columns for trstrank, influence metrics (display them as ratios) and wordbag. (Trstrank, Influence, Wordbag)
  • Integrate reputation metrics into your Twitter client to help users decide whom to follow (and whom not to) and to filter their tweet streams. (Trstrank, Influence, Wordbag)
  • Demographic web analytics. Build an app/plugin/etc to analyze web server logs (or log it and analyze remotely with JavaScript) that gives demographic information about a website’s users (IP→Census)
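
To make the first idea in the list concrete, here is a minimal sketch of filtering a feed of tweets by Trstrank. The tweet dictionaries, the `lookup_trstrank` callable and the scores are all made up for illustration; a real app would replace the lookup with a call to the Query API, whose request format is not shown here.

```python
from typing import Callable, Dict, Iterable, Iterator


def filter_by_trstrank(
    tweets: Iterable[dict],
    lookup_trstrank: Callable[[str], float],
    min_rank: float,
) -> Iterator[dict]:
    """Yield only tweets whose author has a Trstrank of at least min_rank."""
    cache: Dict[str, float] = {}  # avoid looking up the same author twice
    for tweet in tweets:
        author = tweet["screen_name"]
        if author not in cache:
            cache[author] = lookup_trstrank(author)
        if cache[author] >= min_rank:
            yield tweet


# Stand-in for a real API call; these scores are invented for the example.
fake_scores = {"alice": 6.2, "bob": 0.4, "carol": 3.1}
feed = [
    {"screen_name": "alice", "text": "new dataset released"},
    {"screen_name": "bob", "text": "win a free phone!!!"},
    {"screen_name": "carol", "text": "nice visualization"},
]
kept = list(
    filter_by_trstrank(feed, lambda name: fake_scores.get(name, 0.0), min_rank=1.0)
)
print([t["screen_name"] for t in kept])
```

Flipping the comparison gives the “non-influencers” variant, and the same shape works for a spam filter with a different threshold.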

If you’ve got your own idea feel free to post it here or just send it to us!

Infochimps API in Action

Back in May, when our API was still in its infancy, Sean McDonald, founder of Jute Networks, requested access to the Trstrank data to explore its potential application to network relationship management. He created a skillful report and raised some pointed questions that some of our other datasets can now answer. We thought it fitting to showcase his work, not only because it’s just plain nifty, but also because it illustrates the exciting synergy of our calls and their particularly appetizing value to market researchers.

If you’re attempting to promote something on Twitter, it’s likely that you would want to focus on promoting it amongst the Twitter luminaries. Enter Trstrank, our exciting little measure of Twitter luminescence. Getting your product promoted by someone with a high Trstrank could potentially be marketing gold. The likelihood, however, of someone with a very high Trstrank nurturing your product’s visibility with a steady stream of cooing retweets is slim to, well, none. So how to know where to focus your evangelizing efforts?

Sean wondered the same thing when he set about promoting his report. He created the following visualization of an arbitrarily selected sample of his Twitter friends, positioning himself in the center, companies in the inner circle, and contacts associated with those companies in the outer circle. Any contact or company with a Trstrank greater than five is designated by a blue dot; those with a Trstrank between two and five are designated by an orange dot. This gives a useful snapshot of who occupies a “strategic position” in his Twitter universe.
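
The dot-coloring rule is simple enough to sketch in a few lines of Python. The contact names and scores below are invented, and treating exactly two and exactly five as orange is an assumption, since the description does not say how the boundaries are handled.

```python
from typing import Optional


def dot_color(trstrank: float) -> Optional[str]:
    """Blue above five, orange from two to five, no dot otherwise."""
    if trstrank > 5:
        return "blue"
    if 2 <= trstrank <= 5:   # assumption: both boundaries count as orange
        return "orange"
    return None


# Made-up contacts and Trstrank values, for illustration only.
contacts = {"contact_a": 7.3, "contact_b": 3.8, "contact_c": 1.1, "contact_d": 5.0}
colors = {name: dot_color(rank) for name, rank in contacts.items()}
print(colors)
```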

[Figure: Sean’s network]

Sean hypothesized that the users least likely to engage and retweet his report were both the highest-ranked and the lowest-ranked. Eliminating those two tails would yield a swath of active users to target: the orange dots. Ten of Sean’s thirty sample contacts were orange dots. Of those ten users, Sean eliminated seven based on his personal knowledge of them (i.e., he didn’t know them very well or knew they didn’t care about data and data visualization). This left him with three contacts to enlist in his promotional efforts. Sean’s strategy is very savvy, but it requires some personal familiarity with contacts, a luxury not every promoter has.

[Figure: Sean’s orange dots]

Fortunately, two of our newer API calls can simulate Sean’s marketing method. Influencer Metrics will show you how likely a user is to retweet a post based on their tweeting history.  Coupling Influencer Metrics with Trstrank would enable a promoter to identify not only the users most likely to engage, but also the most influential of those users. Throw Wordbag into the mix and a promoter could also discover if users in the active, influential target population have a potential interest in their product.
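
One way the three calls might be combined is sketched below. The `Candidate` record, its field names, the `retweet_rate` threshold and the sample people are all hypothetical; the sketch assumes the results of the three calls have already been fetched and merged per user.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical merged record; none of these field names come from the API."""
    screen_name: str
    trstrank: float
    retweet_rate: float    # assumed: fraction of the user's tweets that are retweets
    top_words: frozenset   # assumed: words taken from the user's Wordbag


def shortlist(candidates, topic_words, min_retweet_rate=0.1):
    """Keep active retweeters who share a topic word, most influential first."""
    matches = [
        c for c in candidates
        if c.retweet_rate >= min_retweet_rate and c.top_words & topic_words
    ]
    return sorted(matches, key=lambda c: c.trstrank, reverse=True)


people = [
    Candidate("ann", 4.5, 0.30, frozenset({"data", "maps"})),
    Candidate("ben", 8.9, 0.01, frozenset({"data", "hadoop"})),
    Candidate("cy", 3.2, 0.25, frozenset({"recipes"})),
    Candidate("dee", 2.4, 0.15, frozenset({"visualization"})),
]
picked = shortlist(people, topic_words={"data", "visualization"})
print([c.screen_name for c in picked])
```

Here the highly ranked but passive “ben” drops out, which mirrors the intuition that the very top of the ranking is unlikely to retweet you.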

We would love reader feedback about our current API calls. How do you envision them working together? What other kind of calls would be of benefit to you? Let us know your ideas.

Introducing the Infochimps Query API

Infochimps is pleased to announce the release of our Query API in public beta today. As part of our ongoing effort to democratize access to structured data, the Infochimps Query API offers several calls that allow you to analyze a prodigious amount of Twitter data dating back to 2006. Our current operational calls include the following:

Trstrank

Trstrank uses an algorithm similar to Google PageRank to generate a numerical rank that indicates the amount of influence a particular user has. This is a much more robust way to determine a Twitter user’s influence than by their number of followers alone.
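
For readers who have not seen PageRank applied to a follow graph, here is a textbook power-iteration sketch in Python. It is generic PageRank on a toy graph, not the Trstrank algorithm itself; the damping factor, iteration count and graph are assumptions chosen for the example.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank over a {follower: [followed, ...]} mapping."""
    nodes = set(follows) | {v for targets in follows.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = follows.get(node, [])
            if targets:
                # A user passes an equal share of their rank to each account they follow.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A user who follows nobody spreads their rank evenly over everyone.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Toy follow graph: everyone follows "star"; "star" follows only "friend".
graph = {"a": ["star"], "b": ["star"], "friend": ["star"], "star": ["friend"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

Note that “friend” has a single follower yet ends up ranked well above “a” and “b”, because that one follower is the highest-ranked account. A raw follower count would miss this.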

Wordbag

Wordbag enables you to discover what a specific Twitter user finds interesting. After entering the handle of a specific Twitter user, Wordbag generates a list of words unique to that Twitter user.
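
To give a feel for what “words unique to a user” could mean in practice, here is a small sketch that ranks a user’s words by how much more often they use them than a background sample does. This is one common approach, offered as an assumption; it is not a description of how Wordbag is actually computed, and the tweets are invented.

```python
import re
from collections import Counter


def distinctive_words(user_tweets, background_tweets, top_n=3):
    """Rank words by how much more often the user says them than everyone else."""
    def tokenize(tweets):
        return [w for t in tweets for w in re.findall(r"[a-z']+", t.lower())]

    user = Counter(tokenize(user_tweets))
    background = Counter(tokenize(background_tweets))
    user_total = sum(user.values())
    background_total = sum(background.values())

    def lift(word):
        user_freq = user[word] / user_total
        # Add-one smoothing so words unseen in the background don't divide by zero.
        background_freq = (background[word] + 1) / (background_total + 1)
        return user_freq / background_freq

    # Highest lift first; ties are broken alphabetically.
    return sorted(user, key=lambda w: (-lift(w), w))[:top_n]


user_tweets = ["hadoop cluster is up", "love hadoop and pig", "the cluster is slow"]
background_tweets = [
    "the game is on",
    "love the weather",
    "traffic is slow and the bus is late",
]
print(distinctive_words(user_tweets, background_tweets, top_n=2))
```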

Influencer Metrics

Influencer Metrics measures the number of retweets, mentions, and @replies that a specific Twitter user has. Retweets and mentions can indicate the value the Twitter community gives to the tweets of a specific user. Coupling Trstrank with Influencer Metrics provides a particularly powerful way to gauge the influence of a Twitter user.
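
As a rough sketch of the kind of tallying involved, the Python below counts incoming retweets, replies and mentions from raw tweet text. It assumes the informal text conventions (“RT @user” for a retweet, a leading “@user” for a reply) and invented sample tweets; it is not how Influencer Metrics is actually produced.

```python
import re

MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"\bRT @(\w+)")
REPLY = re.compile(r"@(\w+)")   # used with .match(), so it only looks at the start


def incoming_counts(tweets, screen_name):
    """Tally retweets of, replies to, and other mentions of screen_name."""
    target = screen_name.lower()
    counts = {"retweets": 0, "replies": 0, "mentions": 0}
    for text in tweets:
        retweeted = {name.lower() for name in RETWEET.findall(text)}
        mentioned = {name.lower() for name in MENTION.findall(text)}
        reply_to = REPLY.match(text)
        if target in retweeted:
            counts["retweets"] += 1
        elif reply_to and reply_to.group(1).lower() == target:
            counts["replies"] += 1
        elif target in mentioned:
            counts["mentions"] += 1
    return counts


sample = [
    "RT @infochimps: new Twitter data is live",
    "@infochimps when is the next refresh?",
    "loving the API from @infochimps",
    "@infochimpsfan hello there",
    "RT @someone: thanks @infochimps",
]
print(incoming_counts(sample, "infochimps"))
```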

The potential applications of our API are limited only by the imagination. We hope market researchers, brazen self-promoters, statisticians, sociologists, cultural anthropologists, linguists, and all the curious Georges out there will find it as compelling as we do.

Looking to the future, our development team will be constantly polishing and updating the API. Follow @infochimps on Twitter for announcements. We received many requests on our private beta for more frequent refreshments of our data and fuller coverage.  Our next update will do just that. We have additional API calls percolating, including one that will allow you to discover close-knit interactions between Twitter users and see the level of interaction between them.

For features and pricing, including our totally free package, the Baboon, click here.

Fresh Twitter data on Infochimps, plus Announcing Trst.me

Today we’re announcing some cool new products we’ve derived from the Twitter API. These include:

Our own @thedatachef and @jessecrouch created this visualization of user background colors on Twitter, using the free dataset on Infochimps:
[Image: the color of Twitter]
Hack away at the free datasets and create visualizations of your own! Anything you create from these datasets we will be glad to feature here on our blog.

Please note: These datasets only contain either anonymized aggregate counts or simple user statistics as they come from the Twitter API. The pagerank dataset contains a derived reputation number, and none of the datasets contain full tweets.

Our 7 Most Popular Data Categories

As a marketplace for data, we are often asked which types of data are the most popular. Interestingly, the answer isn’t very intuitive. For example, who would have guessed that one of our most downloaded datasets is a crossword puzzle word list? With this in mind, we decided to investigate and come up with a more complete answer.

Using metrics from the Infochimps website, such as search queries, downloaded content, data requests, page views, and so forth, we came up with a list of the top 7 searched for data categories:

1. Social Networks
2. Economics & Finance
3. Demographics
4. Education
5. Sports
6. Geography
7. Music

Why is this list important?
Many different people may find this information useful, from data sellers who want to know what kinds of data people would buy, to trend watchers who are tracking what is popular, to anyone who is curious about what types of data are out there.

What can we learn from this list?
One thing that we can take away from this list is that there is a good indication of data becoming more mainstream. Data that is being searched for isn’t limited to academically-oriented information that traditional data users (such as researchers) would use. Categories like sports and music are of interest to a wider audience.

Furthermore, there is a demand for these types of data, but not always a supply. As the usefulness of data becomes more widely recognized, the infrastructure to facilitate the data process is still forming. Infochimps is one of the platforms that aims to bridge the gap, through functions like the dataset request page, and provides a venue for data exchange. If you’ve got data in these categories, know that there are people out there who want your data.

Announcing bulk redistribution of MySpace data

Today, we’re excited to announce the availability of MySpace data for bulk download on Infochimps. This is a major step forward for MySpace in our eyes – a move that signifies their seriousness about data and the developers and academics that work with that data.

This data is not sold by MySpace, but given out for free from their API and then packaged by Infochimps for redistribution. By giving developers free access to publicly available real-time data (such as status updates, music, photos, videos), MySpace reinforces its commitment to powering the real-time social Web and the development of open standards.

  • Every day, MySpace processes over 32 million activities and updates
  • MySpace opened up its real-time data with free-to-use APIs letting developers create robust products
  • MySpace offers more scale and richer content like music, photos, videos and apps than anyone else
  • Real-time data input and the ability to then share that in real-time will drive the socialization of content on the web
  • Data available for bulk download will help usher the next generation of data-driven research and application development. Now, using a dataset like word count by hour, developers and content providers can better understand how things are talked about and when.

    The benefits of having data available for bulk download instead of just an API are numerous. Developers can start with a sample dataset and get their apps started faster. Academics are much better served by a .csv than an API, and developers can take advantage of the datasets these experts create as a result of their research. Opening one’s data to the big data community makes all this and much more possible.

    APIs aren’t enough. New tools like Hadoop allow for the processing of huge datasets but necessitate having a local copy of the entire dataset. The advanced analytics that come from computing on top of a huge dataset (and at 25GB/day the MySpace stream is massive) will power the next generation of applications.
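
To show the shape of a job like “word count by hour”, here is a map/reduce-style sketch in plain Python. On a real cluster the same two steps would run as Hadoop jobs over the full stream; the record format, timestamps and status texts below are invented for the example.

```python
from collections import Counter
from datetime import datetime


def map_record(timestamp, text):
    """Map step: emit ((hour, word), 1) for every word in one status update."""
    hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
    for word in text.lower().split():
        yield (hour, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (hour, word) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


# Made-up records: (ISO timestamp, status text).
records = [
    ("2010-03-01T09:15:00", "new album out today"),
    ("2010-03-01T09:40:00", "listening to the new album"),
    ("2010-03-01T10:05:00", "album of the year"),
]
pairs = (pair for ts, text in records for pair in map_record(ts, text))
word_counts = reduce_counts(pairs)
print(word_counts[("2010-03-01 09:00", "album")])
```

Because each record is mapped independently and the reduce step only sums per key, the work splits cleanly across machines, which is why a local copy of the whole dataset matters.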

    The developers looking for this data can come to Infochimps to find the data they need. Let’s foster a division of labor between the people who are experts in mining this data for insight, and the pros who can develop the applications on top of those discoveries. For example, Ryan Rosario of UCLA created a dataset of users’ moods by zip code, a historical emotional context for researchers, psychologists, and possibly a developer looking to take advantage of this MySpace feature.

    We’ll premiere the “best of MySpace” datasets in the hopes of supporting a relationship between MySpace and data-driven research and development. And any API owners out there should get in touch to talk about how we can make your data computable for the big data community.

    UPDATE: Here is a visualization of Users with geolocations from our dataset, User locations by lat/long:
    [Figure: user locations by lat/long]