Monthly Archives: September 2009

New site is live

Thanks to everyone new who's come by. We appreciate the coverage from www.gigaom.com and others, and we thought we'd take a moment to explain what we hope to accomplish with this launch.

With this launch, anyone can edit or add datasets to the site. Very soon uploading will work too, and we will be able to host and distribute openly licensed datasets for free. These are our steps toward building an open data commons.

Additionally, this new site offers a few datasets for sale. These datasets are not ours but are owned by others; we make a commission on each sale. An example is the TAKS dataset, which contains standardized-test scores for students across the state of Texas. It cost one particular researcher $1,400 to free from government coffers, and the format it came in was awful. On Infochimps you can find the same dataset, cleaned up, for a much lower price: $15.

We see this marketplace as an incentive for the world's data gatherers to put their data somewhere others can find it. By letting people charge for their data, we encourage datasets to come out of the woodwork that might otherwise remain behind closed doors.

We hope you enjoy playing around with the site. If you are excited to send data our way before we get upload working, please get in touch: upload@infochimps.org.

APIs and Datasets, living in harmony

The most popular way to access data on the web right now is through an API. APIs provide real-time data, an incredible advantage, along with outsourced computation. Both are wins for the end user and the developer, while the API provider has to eat the cost of running the service. It's worth it for the provider, though, because a myriad of other services can be conjoined with their primary service.

There are some things an API can't give you, though. An API generally cannot give you historical data; Twitter's API, for instance, only lets you go back XX tweets. This means a service built late in the game may not carry the same value as one built in the early days of an API, since the older service's data goes back further.

Next, APIs only give you pieces. The scale of questions you can ask is limited by the rate limit and by the size of the pieces returned. Services can't ask for everything, and they may be further limited by the bandwidth and load on the primary API.
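To make that concrete, here is a back-of-the-envelope sketch in Python. The request quota, page size, and graph size are all assumed numbers for illustration, not any particular service's real limits.

    # Hypothetical numbers: how long would a full crawl of a large
    # graph take through a rate-limited API?
    REQUESTS_PER_HOUR = 150     # assumed rate limit
    IDS_PER_REQUEST = 5000      # assumed page size per request
    TOTAL_EDGES = 10 ** 9       # assumed size of the full graph

    requests_needed = TOTAL_EDGES / IDS_PER_REQUEST
    hours = requests_needed / REQUESTS_PER_HOUR
    print("%d requests, %.1f days of crawling" % (requests_needed, hours / 24))
    # => 200000 requests, 55.6 days of crawling

Months of polite crawling, for data the service could hand over in a single dump.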

The types of questions we're talking about have to do with the deep structure of the data in question. One of the reasons our near-complete scrape of Twitter's friend graph was so popular is that this type of dataset is extremely valuable to network researchers. The sort of research a graph like Twitter's makes possible is phenomenal. Without such a dataset, researchers are left like the Antarctic explorers of the past: slowly crawling new territory, making maps and filling in details only as they come along, piece by piece.
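As a minimal sketch of what a local copy enables, the snippet below computes an out-degree distribution over a complete edge-list dump without a single API call. The file name and its follower/followee format are hypothetical.

    from collections import Counter

    # "twitter_edges.tsv" is a hypothetical dump of "follower followee" pairs,
    # one edge per line.
    out_degree = Counter()
    with open("twitter_edges.tsv") as f:
        for line in f:
            follower, followee = line.split()
            out_degree[follower] += 1

    # How many users follow exactly k accounts?
    histogram = Counter(out_degree.values())
    for k, count in sorted(histogram.items())[:10]:
        print("%d users follow %d accounts" % (count, k))

A whole-graph question like this takes one pass over a local file; asked through an API, it would take a full crawl.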

The value APIs provide to the service and to the outside world is undeniable. The problems APIs leave open can be solved by those services providing complete dumps periodically. Such datasets of complete, historical data will not only let researchers get to work improving their science, but will also allow applications to seed their service with the latest dataset, then begin updating through the API.
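Here is one hedged sketch of that seed-then-update pattern. The tab-separated dump format, the api object, and its fetch_since endpoint are all assumptions for illustration, not any real service's interface.

    import time

    def seed_and_follow(dump_path, api, poll_seconds=60):
        """Load a complete dump, then stay current through the API."""
        store = {}
        last_seen_id = 0
        # 1. Seed: read the full historical dump from local disk.
        with open(dump_path) as f:
            for line in f:
                record_id, payload = line.rstrip("\n").split("\t", 1)
                store[int(record_id)] = payload
                last_seen_id = max(last_seen_id, int(record_id))
        # 2. Update: only ask the API for records newer than the dump.
        while True:
            for record in api.fetch_since(last_seen_id):  # hypothetical endpoint
                store[record["id"]] = record["payload"]
                last_seen_id = max(last_seen_id, record["id"])
            time.sleep(poll_seconds)

The dump does the heavy lifting once; the API only carries the trickle of new records.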

Should services share their data on a platform like Infochimps, they not only provide a great service to applications and researchers, but also reduce their own costs. The load on their API is lighter, since fewer requests have to be made for data. And when researchers have the complete dataset sitting on their hard drives, the API provider is no longer depended upon for compute time; local access to the data makes the researcher's job much faster and easier.

The two solutions for sharing data are complementary. Freebase does a great job at this; we hope other services will soon follow suit.