Products & Features

New Data: Dinner, Drinks & The Breakup

So, you think you’ve met the perfect chimp companion?  It takes more than just a shiny coat to attract the ladies and gentlemen of your species… you’ve got to know how to cook… make tasty drinks.  But what if all your best-laid plans fail? Then you’ve got to end it… but how?  We’ve got all the APIs and datasets to help you navigate the sticky world of monkey love.

Cook the Perfect Dinner
They say the fastest way to one’s heart is through their stomach.  You, my chimpy friend, can take advantage of this old adage with the Punchfork API, which will let you easily integrate recipes into your website or app by providing direct access to recipe data from all sorts of publishers – from bloggers to top recipe sites.  With their API, you can get top rated recipes to help you prepare that perfect simian supper to woo your potential mating partner.

Make Good Mixed Drinks
Have no clue what to do with triple sec, amaretto and bitters?  Some smart chimps scraped the data from a site with over 9000 different popular drink combinations and ran taste trials with their peers to determine which drinks tasted good together, and which ones were not compatible.  Now you can pretend to be Tom Cruise from that scene in Cocktail (trick bartending skills not included).

Relationship Breakup Mechanisms
Has your date turned out to be a dud despite your best efforts?  Well, we’ve got a list of ways to break up with your not-so-significant-anymore other that can serve as a great starting point for ending things.

We’ve got over 14,700 more where that came from.  Visit our site today and search for the data you want.  Can’t find what you need? Let us know on UserVoice!

New Data Sets: Alcohol, Free WiFi & How Long You’ve Got Left to Live

Our chimps have been busily scouring the data jungle and thanks to users like AggData and vanceinteriors, we got over 1000 delicious new datasets in just the last two weeks!  Today, we’ll highlight a few of our favorites and answer some of your most burning questions.

How Long Do I Have Left to Live?
How long have I got, Doc? In an interesting measure from the US Census, here’s a free dataset that gives the average number of years an individual in the US has left to live, given their age, sex and race.

Where Can I Get Free WiFi?
Ever find yourself in a new city wondering where you can get free WiFi?  This dataset contains over 63,000 locations throughout the entire US with the latitude/longitude and business name.

What Bar in Austin Sells the Most Mixed Booze?
Curious what the hottest bars in town are (based on mixed drinks sales)?  The Texas Alcoholic Beverage Commission has got your answer!  This free dataset contains the trade name, address and reported tax on mixed drink sales for bars throughout Texas.  

Be the first to answer this question in our comments and we’ll send you a package of sweet Chimpy stuff (stainless steel water bottle, stickers and Startup: The Hackering!): What bar in Austin had the highest reported mixed drink sales in May 2011?


New Data Sets: Colleges, Hospitals, The Marvel Universe and Social Data

Look what we found!  No, it’s not just a picture of a baby monkey, though we did think it was apropos for our new weekly feature highlighting some of the best new data sets to join our ever-growing data marketplace.  Today, we bring you a mix of geo-data and social data, including a social graph of the Marvel Universe, constructed by three researchers in the Balearic Islands, which surprisingly is not unlike a real-life social network!

US Colleges and Universities
This is a database of all US colleges and universities, as of 2010. There are 9350 total colleges and universities listed with name, address, phone number and URL.

US Hospitals
This data set contains 49 fields on 4287 hospitals in the United States. While not all datapoints are available for every hospital, this robust data set contains info such as: location and contact information, heart attack mortality rate, gross patient revenue, number of staffed beds, approximate average patient length of stay and patient satisfaction along several metrics.

2000+ Flickr Images, 10,000+ YouTube Videos and 10,000+ Digg Users
These data sets, courtesy of Munmun De Choudhury, showcase large scrapes of social data that has been used by the post-doctoral fellow to perform image content analysis, examine dynamics of threaded comments in rich media sharing, study information diffusion and community evolution centered around the topics.

Marvel Universe Social Graph
This fun Marvel Comics character collaboration graph showcases the artificial world that takes place in the universe of the Marvel comic books as an example of a social collaboration network. The researchers compare the characteristics of this universe to real-world collaboration networks, such as the Hollywood network or the one created by scientists who work together in producing research papers, and find that the Marvel Universe is surprisingly closer to a real social graph than one might expect.

We’ve got over 13,700 more where that came from.  Visit our site today and search for the data you want.  Can’t find what you need? Let us know on UserVoice!

Measuring online influence: The case for Big Data

Measuring influence online is in its infancy. Unrefined ‘metrics’ dominate the space and much of what exists currently is of little value and has insignificant statistical meaning. Most measure only what is easy to measure – number of Twitter followers, number of times a word is mentioned in the last week, Facebook ‘likes’, bizarre – often undisclosed – methods of calculating someone’s online social ‘rank’. There are even gimmicky schemes to produce ‘measurements’ like Fast Company’s “Influence Project”*.

Why do we do things this way? Because it is easy and because dealing with big data is hard.

The most valuable measurements that come out of this space live in the analysis of big data. To be effective, one needs a global perspective and all the connections – not just of the 100 million active Twitter users, but the 4 billion connections between them and even more, the billions of additional connections implied by mentions, retweets and replies.

A simple search of Twitter will yield you a small sample of unfiltered tweets in a short time frame. A count of followers is as easy as visiting a user’s profile page.
Those metrics miss out on what is important. Which tweets were most significant? Which users made the most impact? Are someone’s followers actual people or just spam bots? Are they just part of an auto-follow-back ponzi-scheme initiated by some live-for-3-days-and-milk-AdWords-for-all-it’s-worth-webapp?

But what should we measure?

Last week, HP published a study on what makes a tweet influential and the problems around the measure of “Influence”. Several bloggers responded (1,2,3). In general it is agreed that “Influence” is still not fully defined and that retweets alone are not the definition. Retweets are one way to measure, links are another. “Engagement” as a whole is an intersecting issue. Clearly though, there is an interest in measuring all of these ‘things’, if only we could define them.

How should measurements be delivered?

Companies such as Klout give composite numbers that, while useful, fall short of being helpful for a wider range of use cases such as spam filtering, topic relevancy and understanding relationships in one’s network.  “Influence” is a broader topic that spans much more than just global rank or one’s own reach.  Transparency is an issue as well.  While a single number helps at being actionable, it is also a set of magically combined factors that comes with no clue as to how they are combined and weighted.  People should know where their data came from and how it was produced if they are to trust it.

Solving for Influence

The data is out there, the tools exist to analyze it, but the world is still unsure of what to tell those tools to do.

Infochimps uses the full friend graph and historical tweets of Twitter back to 2006 to produce similar metrics to what HP discusses such as Sway (how much a user gets retweeted), trstrank (global ranking of influence, Google PageRank style) and Enthusiasm, which is a reverse measure of what HP refers to as “Passivity” – that is, how often someone retweets someone else.

There exists a very basic problem in nomenclature.  What are the words we should be using and what are the human behaviors they represent?

More importantly – HP, Infochimps and others are still discovering just what should be analyzed. What is still needed is for people in other fields to fill in those gaps.  In sociology and marketing: what is the difference between fame, interestingness, influence and even, infamy?  What constitutes the ‘humanness’ of a user (are they real, are they just a bunch of employees tweeting for a celebrity)?  What are influence and engagement? What should we be looking for in relationships? In the social CRM space: how should you identify relationship networks?  What are the best ways to find a route to new leads?

What are the use cases in each of these fields for the data?  The big data world has the tools to quantify the data; it just needs to know which questions to answer.  Once we know what behaviors to look for, they can be translated into signatures that are identifiable in data.

Finally, we also need the answer to what the methodology is for combining the data into actionable metrics.

Measuring what is difficult instead of what is easy is a game changer. In fact, not just a game or a ballpark, but a whole sport. The knowledge is in the data.

* If you must visit it, here’s the address: influenceproject.fastcompany.com

Real geeks don’t use IE – Infochimps Browser Usage Analytics

Browser usage by the somewhat normal web

When one is scoping out a web project, one of the first things a designer or web programmer will want to know is “what browsers are we supporting?” The decision is usually led by a quick googling to find a page like W3Schools’ browser statistics, which quickly tells you:

2010   IE8     IE7    IE6    Firefox   Chrome   Safari   Opera
July   15.6%   7.6%   7.2%   46.4%     16.7%    3.4%     2.3%

About 30.4% of the browser world belongs to IE (much better than the way things were just a few years ago). Almost 15% of your users are on IE6 or IE7, versions so old that you may be tempted to code with them as your lowest common denominator.

Browser usage by Infochimps users

Consider who is visiting your site though. Are your users more net savvy? Are they geeks? Here’s what our visitors use:

[Figure: api.infochimps.com browser usage, 2010-08-11]

About 10% of infochimps.org users use IE, roughly a third of the norm.
Half of our IE users use IE8 (a much more capable version of IE), leaving a meager 5% in the IE6/7 realm, which is split half and half (2.5% total IE6 users, again roughly a third of the norm).
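For anyone who wants to check the arithmetic, here is a quick sketch in Python. The general-web shares are copied from the July 2010 table; the Infochimps shares are the rounded figures quoted in the text (10% IE overall, half of it IE8, the rest split evenly between IE6 and IE7), so treat the ratios as approximate.

```python
# Browser shares for July 2010, copied from the table (percent).
general_web = {"IE8": 15.6, "IE7": 7.6, "IE6": 7.2}

# Rounded Infochimps figures as quoted in the text (percent).
infochimps = {"IE8": 5.0, "IE7": 2.5, "IE6": 2.5}


def total(shares, versions=None):
    """Sum the shares for the given versions (all versions by default)."""
    versions = versions or shares.keys()
    return round(sum(shares[v] for v in versions), 1)


ie_all_web = total(general_web)                   # every IE version
ie_old_web = total(general_web, ["IE6", "IE7"])   # legacy IE only
ie_all_chimps = total(infochimps)
ie_old_chimps = total(infochimps, ["IE6", "IE7"])

# How our audience compares with the general web.
ie_ratio = round(ie_all_chimps / ie_all_web, 2)
ie6_ratio = round(infochimps["IE6"] / general_web["IE6"], 2)

print(f"General web: IE {ie_all_web}%, IE6/7 {ie_old_web}%")
print(f"Infochimps:  IE {ie_all_chimps}%, IE6/7 {ie_old_chimps}%")
print(f"IE ratio {ie_ratio}, IE6 ratio {ie6_ratio}")
```

Both ratios land close to one third, which is where the "a third of the norm" comparison comes from.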

Conclusion: Real nerds don’t use Internet Explorer

As far as design philosophy goes, we strive to design our sites (infochimps.org, api.infochimps.com) in a progressive enhancement fashion so that all browsers can be supported well (enough) and accessibility is simple and works. IE6 isn’t number one on our list of things to deal with.

When you have limited resources (like a startup), consider who is actually using your site before spending those resources on browsers that few of your visitors use.

Refreshed Datasets!

By popular demand, we have refreshed our massive corpus of Twitter data. As part of the facelift, some of our API fields have been eliminated, and many more have been added. Trstrank, for instance, will include a new field called Trstquotient, or TQ, which can be used as a spam indicator. (For details on how that works, stay tuned for a forthcoming blog post.) The fields we chose to eliminate from Trstrank (followers, following, and statuses) can be readily accessed via Twitter’s API.

Our new datasets will provide the most accurate and up-to-date reflection of a Twitter user’s measure of influence (Trstrank), activity level (Influencer Metrics), and interactions between two given users (Conversations). The datasets that changed the most, Influencer Metrics and Conversations, have lots of new fields.  Influencer Metrics is now a more rigorous way to measure retweets and @ replies, both incoming and outgoing, and Conversations gives a full summary of the interactions between two users.


We’re versioning the new API calls, to prevent the unpleasantness that could accompany a rapid switcheroo, but our old calls will be phased out quickly. We welcome your feedback on this exciting update!

Cool things to be built with the Infochimps API

We started a page of ideas of cool things you can build using the Query API. There are a ton of valuable things that can be done using the current API calls and we’d love to see them made. Here are some of them:

  • Filter influencers or non-influencers from any feed of tweets (Influence and/or Trstrank)
  • Filter Twitter spam (Trstrank and/or influence)
  • Build a word cloud for a Twitter user in any app (Wordbag)
  • Target content/ads based on words a user tweets about the most (Wordbag)
  • Find the true influence of a Twitter user by combining their Trstrank, ratio of friends/followers, ratio of statuses to retweets in, etc (Trstrank and Influence)
  • Find social circles on Twitter, not by followers, but by who is actually talking to each other (Conversation)
  • Target content/ads based on IP address (IP→Census)
  • A/B test your website/web app based on demographic data (IP→Census)
  • Build a site that lists a person’s Twitter followers with columns for trstrank, influence metrics (display them as ratios) and wordbag. (Trstrank, Influence, Wordbag)
  • Integrate reputation metrics into your Twitter client to help users decide who and who not to follow and also filter their tweet streams. (Trstrank, Influence, Wordbag)
  • Demographic web analytics. Build an app/plugin/etc to analyze web server logs (or log it and analyze remotely with JavaScript) that gives demographic information about a website’s users (IP→Census)
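To make the first couple of ideas concrete, here is a minimal sketch in Python of filtering a feed of tweets by Trstrank. It does not call the Query API: the `TRSTRANK` dictionary is a hypothetical stand-in for scores you would have fetched already, and the handles, scores and the threshold of 2.0 are all invented for illustration.

```python
# Hypothetical scores, standing in for results fetched from the Trstrank call.
TRSTRANK = {"alice": 6.2, "bob": 0.4, "carol": 3.1}


def filter_feed(tweets, scores, min_rank):
    """Keep tweets whose author has a known score of at least min_rank.

    Authors with no score are treated as rank 0.0 and dropped.
    """
    return [t for t in tweets if scores.get(t["user"], 0.0) >= min_rank]


feed = [
    {"user": "alice", "text": "New dataset on hospital stays"},
    {"user": "bob", "text": "CLICK HERE for free followers"},
    {"user": "carol", "text": "Mapping free WiFi hotspots"},
    {"user": "dave", "text": "hello world"},  # no score available
]

kept = filter_feed(feed, TRSTRANK, min_rank=2.0)
print([t["user"] for t in kept])
```

Flipping the comparison gives the non-influencer filter, and treating unknown authors as rank zero is just one choice; a real client might prefer to look them up instead.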

If you’ve got your own idea, feel free to post it here or just send it to us!

Infochimps API in Action

Back in May when our API was still in its infancy, Sean McDonald, founder of Jute Networks, requested access to the Trstrank data to explore its potential application to network relationship management. He created a proficient report and raised some pointed questions that some of our other datasets can now answer. We thought it prudent to showcase his work, not only because it’s just plain nifty, but also because it illustrates the exciting synergy of our calls and their particularly appetizing value to market researchers.

If you’re attempting to promote something on Twitter, it’s likely that you would want to focus on promoting it amongst the Twitter luminaries. Enter Trstrank, our exciting little measure of Twitter luminescence. Getting your product promoted by someone with a high Trstrank could potentially be marketing gold. The likelihood, however, of someone with a very high Trstrank nurturing your product’s visibility with a steady stream of cooing retweets is slim to, well, none. So how to know where to focus your evangelizing efforts?

Sean wondered the same thing when he set about to promote his report. He created the following visualization of an arbitrarily selected sample of his Twitter friends positioning himself in the center, companies in the inner circle, and contacts associated with those companies in the outer circle. Any contact or company with a Trstrank greater than five is designated by a blue dot; those with a Trstrank between two and five are designated by an orange dot. This gives a useful snapshot of who occupies a “strategic position” in his Twitter universe.

[Figure: Sean’s network]

Sean hypothesized that the contacts least likely to engage with and retweet his report were both the highest-ranked and the lowest-ranked. Eliminating those two tails would yield a swath of active users to target: the orange dots. Ten of Sean’s thirty sample contacts were orange dots. Of those ten users, Sean eliminated seven based on personal knowledge he had of them (i.e. he didn’t know them very well or knew they didn’t care about data and data visualization). This left him with three contacts to enlist in his promotional efforts. Sean’s strategy is very savvy, but requires some amount of personal familiarity with contacts, a luxury not every promoter has.
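Sean’s dot-coloring rule is easy to express in code. The sketch below is in Python and uses invented contacts and scores; it also assumes that "between two and five" includes both endpoints, which the description does not spell out.

```python
def dot_color(trstrank):
    """Map a Trstrank score to the dot color used in the visualization."""
    if trstrank > 5:
        return "blue"
    if 2 <= trstrank <= 5:  # assumption: both endpoints are included
        return "orange"
    return None  # below two: no dot


# Invented sample contacts and scores, for illustration only.
contacts = {"ann": 7.3, "ben": 4.1, "cy": 2.0, "di": 0.8}

# The middle band: neither the top tail nor the bottom tail.
targets = sorted(name for name, rank in contacts.items()
                 if dot_color(rank) == "orange")
print(targets)
```

The personal-knowledge step that took Sean from ten contacts down to three is the part this rule cannot capture.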

[Figure: Sean’s orange dots]

Fortunately, two of our newer API calls can simulate Sean’s marketing method. Influencer Metrics will show you how likely a user is to retweet a post based on their tweeting history.  Coupling Influencer Metrics with Trstrank would enable a promoter to identify not only the users most likely to engage, but also the most influential of those users. Throw Wordbag into the mix and a promoter could also discover if users in the active, influential target population have a potential interest in their product.

We would love reader feedback about our current API calls. How do you envision them working together? What other kind of calls would be of benefit to you? Let us know your ideas.

Introducing the Infochimps Query API

Infochimps is pleased to announce the release of our Query API in public beta today. As part of our ongoing effort to democratize access to structured data, the Infochimps Query API offers several calls that allow you to analyze a prodigious amount of Twitter data dating back to 2006. Our current operational calls include the following:

Trstrank

Trstrank uses an algorithm similar to Google PageRank to generate a numerical rank that indicates the amount of influence a particular user has. This is a much more robust way to determine a Twitter user’s influence than by their number of followers alone.
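For readers who have not met PageRank before, here is a minimal Python sketch of the textbook power-iteration version applied to a tiny, made-up follow graph. It is not the Trstrank implementation; the damping factor of 0.85, the iteration count and the graph itself are assumptions chosen for illustration.

```python
def pagerank(follows, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank.

    follows maps each user to the list of users they follow, so rank
    flows from follower to followed.
    """
    users = set(follows) | {u for targets in follows.values() for u in targets}
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for user in users:
            targets = follows.get(user, [])
            if targets:
                share = damping * rank[user] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A user who follows nobody spreads rank evenly to everyone.
                share = damping * rank[user] / n
                for t in users:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Made-up graph: three fans follow "star", and "star" follows only "fan1".
follows = {
    "fan1": ["star"],
    "fan2": ["star"],
    "fan3": ["star"],
    "star": ["fan1"],
}
ranks = pagerank(follows)
top = max(ranks, key=ranks.get)
print(top, {u: round(r, 3) for u, r in sorted(ranks.items())})
```

Note that "fan1" has only one follower yet ends up ranked well above "fan2" and "fan3", because that one follower is the highly ranked "star". That is the sense in which a rank of this kind says more than a raw follower count.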

Wordbag

Wordbag enables you to discover what a specific Twitter user finds interesting. After entering the handle of a specific Twitter user, Wordbag generates a list of words unique to that Twitter user.
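As a rough illustration of the idea, the Python sketch below takes one naive reading of "unique": words a user tweets that nobody else in a small sample uses. The sample tweets are invented and the real Wordbag call may well work differently, so treat this as a toy.

```python
import re
from collections import Counter


def words(text):
    """Lower-case a tweet and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def unique_words(user, tweets_by_user):
    """Words the user tweets that no other user in the sample uses,
    most frequent first."""
    own = Counter(w for t in tweets_by_user[user] for w in words(t))
    others = {w
              for u, tweets in tweets_by_user.items() if u != user
              for t in tweets
              for w in words(t)}
    return [w for w, _ in own.most_common() if w not in others]


# Invented sample tweets.
sample = {
    "chimp": ["Bananas and data", "More data, more bananas"],
    "doc": ["Data on hospitals"],
}
chimp_bag = unique_words("chimp", sample)
print(chimp_bag)
```

A production version would at least drop stop words such as "and", and would compare against a far larger background corpus than one other user.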

Influencer Metrics

Influencer Metrics measures the number of retweets, mentions, and @replies that a specific Twitter user has. Retweets and mentions can indicate the value the Twitter community gives to the tweets of a specific user. Coupling Trstrank with Influencer Metrics provides a particularly powerful way to gauge the influence of a Twitter user.
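Counts of this kind can be sketched in a few lines of Python. The tweet records below are invented, and the field names (`author`, `retweet_of`, `mentions`) are hypothetical; they are not the fields the Influencer Metrics call returns.

```python
from collections import Counter

# Invented tweet records with hypothetical field names.
tweets = [
    {"author": "ann", "retweet_of": None, "mentions": ["ben"]},
    {"author": "ben", "retweet_of": "ann", "mentions": []},
    {"author": "cy", "retweet_of": "ann", "mentions": ["ann"]},
    {"author": "ann", "retweet_of": "cy", "mentions": []},
]


def influencer_counts(user, tweets):
    """Tally incoming retweets, outgoing retweets and incoming mentions."""
    counts = Counter()
    for t in tweets:
        if t["retweet_of"] == user:
            counts["retweets_in"] += 1    # someone retweeted this user
        if t["author"] == user and t["retweet_of"] is not None:
            counts["retweets_out"] += 1   # this user retweeted someone
        if user in t["mentions"]:
            counts["mentions_in"] += 1    # someone mentioned this user
    return counts


ann = influencer_counts("ann", tweets)
print(dict(ann))
```

Dividing incoming retweets by the number of original tweets a user has posted would give a rate rather than a raw count, which is usually the fairer basis for comparing users.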

The potential applications of our API are limited only by the imagination. We hope market researchers, brazen self-promoters, statisticians, sociologists, cultural anthropologists, linguists, and all the curious Georges out there will find it as compelling as we do.

Looking to the future, our development team will be constantly polishing and updating the API. Follow @infochimps on Twitter for announcements. We received many requests on our private beta for more frequent refreshes of our data and fuller coverage.  Our next update will deliver just that. We have additional API calls percolating, including one that will allow you to discover close-knit interactions between Twitter users and see the level of interaction between them.

For features and pricing, including our totally free package, the Baboon, click here.

Fresh Twitter data on Infochimps, plus Announcing Trst.me

Today we’re announcing some cool new products we’ve derived from the Twitter API. These include:

Our own @thedatachef and @jessecrouch created this visualization of user background colors on Twitter, using the free dataset on Infochimps:
[Figure: The color of Twitter]
Hack away at the free datasets and create visualizations of your own! Anything you create from these datasets we will be glad to feature here on our blog.

Please note: These datasets only contain either anonymized aggregate counts or simple user statistics as they come from the Twitter API. The pagerank dataset contains a derived reputation number, and none of the datasets contain full tweets.