Monthly Archives: April 2010

Fresh Twitter data on Infochimps, plus Announcing Trst.me

Today we’re announcing some cool new products we’ve derived from the Twitter API. These include:

Our own @thedatachef and @jessecrouch created this visualization of user background colors on Twitter, using the free dataset on Infochimps:
[Image: The Color of Twitter]
Hack away at the free datasets and create visualizations of your own! We will be glad to feature anything you create from these datasets here on our blog.

Please note: These datasets only contain either anonymized aggregate counts or simple user statistics as they come from the Twitter API. The pagerank dataset contains a derived reputation number, and none of the datasets contain full tweets.

To Open or Not to Open Data: A Private Organization’s Dilemma

Open data has thus far largely been associated with government data. Though government data is indeed valuable, the potential of the data that private organizations gather has been overlooked. These organizations usually don’t realize the potential that their data holds.

At the Data Cluster last month, our own Dhruv Bansal and Gil Elbaz of Factual led the Open Data Birds-of-a-feather session. Using insights from that discussion, and some of our own, we want to highlight some pros and cons of opening data to help organizations determine whether it is the right move:

Pros
1. Profit generation – Almost all data will have some value to someone else, whether an organization realizes it or not. Putting up data for sale would help these organizations realize how valuable their data is and may even provide another revenue stream from this latent resource. For example, a firm with data on parking meter locations and occupancy rates can sell it to a firm building an iPhone app to help you reliably find parking in our nation’s downtowns.
2. Crowd-sourced curation – Gil commented that a lot can be gained from crowd-sourced curation. First, the organization avoids the cost of curating the data itself. Second, the pool of brains working on the data can produce incredible products that were not immediately evident, especially when your data is mashed with others’. In this Factual table of Nationwide Restaurants, geo data is mashed with information and reviews of restaurants from sites such as Yelp, Yahoo!, Citysearch and Zagat to make an interactive search table.
3. Potential uses – There are many different uses for data, ranging from cool informational data visualizations to applications to mining for insights. By opening its data for others to use, the organization avoids the cost of setting up infrastructure and gathering manpower to translate the data into these products.
Some examples of what has already been done with open government data can be found in a previous blog post, “Open data applications.”
4. Exposure – Organizations can gain exposure from opening their data, especially now while it’s still relatively uncommon, positioning themselves on the cutting edge of the data sphere. Additionally, transparency is demanded more these days, and this is one of the ways to achieve that. Best Buy has an open API of their product catalog called Best Buy Remix. With this open API, they not only leave the development of apps to others, but they also gain exposure and generate business from apps that would, for example, allow users to search for products they want and get details on them (location, price, specs, etc.).

Cons
1. Historically difficult – The development of the market for alternative data is relatively new. Opening data used to be incredibly difficult, expensive and labor-intensive. Large amounts of data took a lot of time and were extremely hard, if not impossible, to process. However, things such as cloud computing and processing tools like Hadoop have helped address these problems, making the whole data process a lot easier.
2. Privacy concerns – These fall under two types. First, some companies might be concerned about certain data being accessed by their competitors. This problem can be avoided since companies can choose what data they open and keep more sensitive data secret. In the end, these organizations might find that crowd-sourcing their data results in interesting insights that further develop their product or service. Second, there are concerns about users’ personal data. Efforts need to be made to ensure that users understand how their data is being used, that its security is upheld, and that they know how to opt out if they choose to do so.
3. Data processing – Some organizations don’t have the capabilities to process the data for public consumption, but if they really do have valuable data, then a cost-benefit analysis might show that setting up the required infrastructure is worth it. If a company just doesn’t have the resources for this, as mentioned earlier, it can leave some of the data processing to the crowd.
4. Reservations about crowd-sourcing – Someone from Wolfram Alpha pointed out that companies may believe that expert curation is better than crowd-sourcing. What these companies fail to realize is that more and more people are fluent in data. Crowd-sourcing their many talents and ideas means that a lot more can be done with the data, including things that one expert alone may overlook.

Verdict? Open your data! The data market is growing and infrastructure is developing alongside. The traditional hindrances to opening data, such as the scarcity of people who can curate data, the difficulty of identifying buyers, and the impossibility of handling large amounts of data, are dissipating. Instead, a lot of potential lies in the data, from financial gains to the increase of brand recognition. With all this in mind, companies need to take a second look at their data and evaluate its worth.

Our 7 Most Popular Data Categories

As a marketplace for data, we are often asked what the most popular types of data are. Interestingly, the answer isn’t very intuitive. For example, who would have guessed that one of our most downloaded datasets is a crossword puzzle word list? With this in mind, we decided to investigate and come up with a more complete answer.

Using metrics from the Infochimps website, such as search queries, downloaded content, data requests, page views, and so forth, we came up with a list of the 7 most searched-for data categories:

1. Social Networks
2. Economics & Finance
3. Demographics
4. Education
5. Sports
6. Geography
7. Music
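
One way several signals like these could be folded into a single ranking is sketched below in Python. The signal names, the toy numbers, and the equal weighting of signals are all assumptions made for illustration; this is not the method actually used to produce the list above, and the counts are not Infochimps data.

```python
from collections import defaultdict

# Hypothetical per-category counts for each signal. These numbers are
# made up for illustration and are NOT real Infochimps metrics.
SIGNALS = {
    "search_queries": {"Sports": 120, "Music": 80, "Geography": 100},
    "downloads": {"Sports": 30, "Music": 45, "Geography": 15},
    "page_views": {"Sports": 900, "Music": 700, "Geography": 1000},
}


def rank_categories(signals):
    """Rank categories by the sum of their max-normalized signal values.

    Each signal is scaled so its busiest category scores 1.0, which keeps
    a high-volume signal (e.g. page views) from drowning out the others.
    Equal weighting of signals is an assumption of this sketch.
    """
    scores = defaultdict(float)
    for counts in signals.values():
        peak = max(counts.values(), default=0)
        if peak == 0:
            continue  # skip a signal that recorded no activity
        for category, count in counts.items():
            scores[category] += count / peak
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda c: (-scores[c], c))


ranking = rank_categories(SIGNALS)
print(ranking)
```

Normalizing each signal before summing is a design choice: without it, whichever metric has the largest raw counts would decide the ranking almost on its own.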

Why is this list important?
Many different people may find this information useful, from data sellers who want to know what kinds of data people would buy, to trend watchers who are tracking what is popular, to anyone who is curious about what types of data are out there.

What can we learn from this list?
One takeaway from this list is that data is becoming more mainstream. The data being searched for isn’t limited to the academically-oriented information that traditional data users (such as researchers) would use. Categories like sports and music are of interest to a wider audience.

Furthermore, there is a demand for these types of data, but not always a supply. While the usefulness of data is becoming more widely recognized, the infrastructure to facilitate the data process is still forming. Infochimps is one of the platforms that aims to bridge the gap, through functions like the dataset request page, and provides a venue for data exchange. If you’ve got data in these categories, know that there are people out there who want it.