The data landscape online, as we see it. Part 1

Nathan at FlowingData did a wonderful job last week culling 30 great resources from the world wide web for finding data. Yesterday another site, Factual, launched, making it great resource number 31. We are excited to see a growing number of companies spring up that in turn increase everyone’s access to data. Solving the problems with data online is no small task, and it is not one that any single player can take on alone. It’s a team effort, which we are proud to be a part of.

We thought we would take a minute today to talk about the problems as we see them, and how players within the online data market are choosing to tackle these problems.

The first problems are finding and sharing data. Most of these sources already solve this problem. Socrata and Factual let users upload data onto their sites, and each company’s datasets are easily searchable along with what’s on and Numbrary.

There are also other, more technical issues. Swivel, Socrata, Factual, and Many Eyes all allow users to play around with data live on the site. This creates two costly problems for the hosting company:

1. The data has to live in their platform and reconcile with the whole.
2. Many new datasets are on the order of gigabytes in size.

Datasets on Infochimps can be of any size, format, or shape, whereas datasets on these other sites must be in a standard csv/tsv/xls format and are limited to a few hundred megabytes. In reality, statisticians want data in .sas formats, and geographical data comes in .gis formats. As datasets grow larger, tools within a browser become insufficient for working with and understanding the data, and a person’s options for distributing that data are also limited.

Data, especially valuable data, is often proprietary. The owners of that data won’t release it unless there are clear licenses and terms of use. We are committed to hosting open data for free and maintaining our open data commons for everyone’s benefit. Where we differ from these other open data players is that we will also host licensed data, because open data unfortunately doesn’t include all of the data in the world. What we offer organizations is the ability to permit only users that have agreed to a license or paid for access to download their data. As the data marketplace grows, we believe more and more buyers will see the value of looking for data on Infochimps. Our aim is to give an incentive to the long tail of businesses whose data is gathering dust on hard drives when it could be useful to another person or organization.
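
To make that access rule concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how Infochimps is actually built: the names `Dataset`, `User`, and `can_download` are hypothetical, and we assume a dataset is either open or gated by a single license.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Dataset:
    # Hypothetical model: a dataset is either open or gated by one license.
    name: str
    is_open: bool = True
    license_id: Optional[str] = None


@dataclass
class User:
    # Hypothetical model: licenses the user has agreed to, datasets they paid for.
    name: str
    accepted_licenses: Set[str] = field(default_factory=set)
    purchases: Set[str] = field(default_factory=set)


def can_download(user: User, dataset: Dataset) -> bool:
    """Open data is free for everyone; licensed data requires that the
    user has agreed to the license or has paid for access."""
    if dataset.is_open:
        return True
    agreed = (
        dataset.license_id is not None
        and dataset.license_id in user.accepted_licenses
    )
    paid = dataset.name in user.purchases
    return agreed or paid


# Example usage with made-up datasets and users.
open_data = Dataset(name="city-populations")
licensed = Dataset(name="retail-prices", is_open=False, license_id="research-only")

visitor = User(name="visitor")
researcher = User(name="researcher", accepted_licenses={"research-only"})
buyer = User(name="buyer", purchases={"retail-prices"})

print(can_download(visitor, open_data))
print(can_download(visitor, licensed))
print(can_download(researcher, licensed))
print(can_download(buyer, licensed))
```

The point of the sketch is that open data stays open to everyone, while the owner of a licensed dataset decides which of the two conditions a downloader has to meet.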


  1. Bryan Connor October 26, 2009 at 3:12 pm

    I think there’s sort of a Data bubble growing right now. Explosive amounts of data are being published to the web every day and the new task isn’t sorting through individual datasets for valuable information but is now searching through datasets for valuable ones.

    Data is something that needs to be regulated and controlled so that the internet’s wealth of information doesn’t end up in hundreds of different formats, each requiring different tools to work with. Of course different types of data require different formats, but there needs to be a set of standard formats, making all information more accessible and regulated for easy consumption and use. At least I think so.

  2. Stephen McDaniel October 16, 2009 at 2:11 pm

    It is not true that SAS is proprietary; otherwise applications like WPS (World Programming System) would be incapable of reading SAS datasets. WPS is perfectly capable of reading SAS datasets and indeed uses the SAS language. The main impediment to reading SAS datasets has been technical (30 years of engineering “tweaks” by SAS to their file structure) rather than a “closed” data format.

    I really like your model and agree there is a place for both open and closed data. I have friends who maintain specialized databases for resale; often requiring 4, 6, or even 10 full-time employees to incorporate multiple data sources and agencies to maintain these databases that are resold. Obviously, they must sell these databases in order to cover their costs and hopefully attain profitability.

    My biggest concern with open data and information repositories is the actual quality of the data and the possibility of manipulation for various purposes. This has been well-documented on Wikipedia and even in various scientific journals. Hopefully, the open use of these sources will uncover major errors or omissions.

    Best regards,
    Stephen McDaniel
    Author- “SAS for Dummies”
    Principal and Co-Founder, Freakalytics™ LLC
    Rapid Analytics to Explore, Understand, Communicate & Act™

  3. Hadley Wickham October 14, 2009 at 3:12 pm

    Statisticians do NOT want data in proprietary data formats like those that SAS uses – we want it in open/free formats, just like everyone else!

  5. Mike Roberts October 14, 2009 at 2:34 pm

    Well said. The next frontier is making sense of the data (visual, tables, etc.). Love the site!