APIs and Datasets, living in harmony

The most popular way to access data on the web right now is through an API.  APIs provide real-time data, an enormous advantage, and outsourced computation.  These are wins for both the end user and the developer, while the API provider eats the cost of running the service.  It is worth it for the provider, though, as a myriad of services can be built on top of their primary one.

There are some things an API can’t give you, though.  An API generally can’t give you historical data: Twitter’s API, for example, only lets you go back XX tweets.  This means that a service built late in the game may not carry the same value as one built in the early days of an API, since the latter’s data goes back further.

Next, APIs only give you pieces.  The scale of the questions you can ask is limited by the rate limit and by the size of each piece returned.  A service can’t ask for everything, and it may be further constrained by bandwidth and the load on the primary API.
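To make the constraint concrete, here is a minimal sketch of what fetching "everything" through a paginated, rate-limited API looks like. The page size, request quota, and `fetch_page` function are all hypothetical stand-ins (the API is simulated in memory), not any real service's terms:

```python
import time

# Hypothetical "API": 1,000 records served in pages of 100, with at
# most 5 requests allowed per rolling window.  All limits here are
# illustrative assumptions, not any real provider's numbers.
RECORDS = list(range(1000))
PAGE_SIZE = 100
REQUESTS_PER_WINDOW = 5
WINDOW_SECONDS = 0.2

def fetch_page(page):
    """Return one page of records, as a paginated API would."""
    start = page * PAGE_SIZE
    return RECORDS[start:start + PAGE_SIZE]

def fetch_all():
    """Collect every page, sleeping whenever the rate limit is hit."""
    out = []
    requests_in_window, window_start = 0, time.monotonic()
    page = 0
    while True:
        if requests_in_window >= REQUESTS_PER_WINDOW:
            # Quota spent: wait out the rest of the window.
            elapsed = time.monotonic() - window_start
            time.sleep(max(0.0, WINDOW_SECONDS - elapsed))
            requests_in_window, window_start = 0, time.monotonic()
        chunk = fetch_page(page)
        requests_in_window += 1
        if not chunk:
            break
        out.extend(chunk)
        page += 1
    return out
```

Even this toy version needs eleven requests and two forced pauses to retrieve a thousand records; scale the numbers up to a real service's dataset and the crawl takes weeks, which is exactly the limitation the paragraph above describes.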

The types of questions we’re talking about have to do with the deep structure of the data in question.  One of the reasons our near-complete scrape of Twitter’s friend graph was so popular is that this type of dataset is extremely valuable to network researchers.  The sort of research a graph like Twitter’s makes possible is phenomenal.  Without such a dataset, researchers are left like the Antarctic explorers of the past – slowly crawling new territory, making maps and filling in details only as they come along, piece by piece.

The value APIs provide to the service and to the outside world is undeniable.  The problems that APIs leave open can be solved by those services providing complete dumps periodically.  These complete, historical datasets will not only let researchers get to work improving their science, but will also allow applications to seed themselves with the latest dataset, then begin updating through the API.
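The seed-then-update pattern described above can be sketched in a few lines. The dump format, the `id` cursor field, and the `fetch_since` function are hypothetical, chosen only to illustrate the idea of bulk-loading once and then catching up incrementally:

```python
def load_dump(dump_rows):
    """Seed local storage from a periodic bulk dump.

    Returns the store plus the highest id seen, which becomes the
    cursor for incremental updates through the API."""
    store = {row["id"]: row for row in dump_rows}
    cursor = max(store) if store else 0
    return store, cursor

def catch_up(store, cursor, fetch_since):
    """Apply only the records newer than the dump via the API.

    `fetch_since(cursor)` stands in for an API call returning records
    created after the cursor -- a hypothetical endpoint."""
    for row in fetch_since(cursor):
        store[row["id"]] = row
        cursor = max(cursor, row["id"])
    return cursor
```

The point of the split is that the expensive bulk transfer happens once, offline, while the API only ever serves the thin stream of changes since the last dump.
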

Should services share their data on a platform like Infochimps, they not only provide a great service to applications and researchers, but also reduce their own costs.  The load on their API is lighter, as fewer requests have to be made for data.  And when researchers have the complete dataset sitting on their hard drives, the API provider is no longer depended upon for compute time: local access to the data makes the researcher’s job much faster and easier.

The two solutions for sharing data are complementary.  Freebase does a great job at this, and we hope other services will soon follow suit.
