The Past, Present and Future of Data

Yesterday, our CEO, Nick Ducoff, presented at Data Content, an Infocommerce conference. In this presentation geared towards fellow data publishers, Nick takes us through a history of information and his thoughts on the future and where Infochimps fits into the puzzle. If you’d like to review a full transcript of his presentation, you can check it out after the jump. Enjoy!

Hi everyone, my name is Nick Ducoff and I’m the CEO and co-founder of Infochimps. I am excited to tell you about what we’re doing at Infochimps, but first a bit of a history lesson, starting with how I got here in front of you, and then how data has become the topic du jour.

I guess you can say I’ve been web-curious since the early 90s. I was a hacker, and an early adopter of BBSs and mIRC. It wasn’t until college, though, that I got my feet wet in e-commerce, helping run Collegeboxes, which was an internet-enabled shipping and storage business for college students. We sold the business in 2008 to Store to Door.

Most of what is now referred to as Web 1.0 was taking offline businesses and scaling customer bases through online sales. Businesses were no longer limited by how many phones they could answer or how many transactions they could process at the cash register. Businesses were instead limited by their fulfillment and shipping operations. However, high speed internet wasn’t yet pervasive and download speeds limited the sale of information online. The data business was still largely an offline business, limited by how many phones could be answered. I am sure many of you remember that.

Web 2.0, driven by software as a service (or SaaS) and social web sites, has allowed businesses to profit without even being limited by their fulfillment and shipping operations. Businesses can scale today as fast as they can find new customers, in many cases without any human interaction at all. As an example, in law school I started one of the first social/professional networks for law students, called JDspace, which enabled law students and employers to bypass the in-person, on-campus recruitment process. As another example, the co-founder of Collegeboxes now has a new venture-backed peer-to-peer storage startup based right here in Philly called Storably. Has anyone heard of it? If you’re local, check it out.

These advancements in online business models, as well as high speed internet, have enabled cloud businesses, including Infochimps. At Infochimps we’re connecting open and commercial data and making it all accessible in a unified platform through APIs. At the moment much of the data is also available to be downloaded locally, but I often make the analogy that the download data business is similar to the DVD by mail business.

Eventually all data, including content, will be transmitted via the ether. I hope some of you disagree, and I look forward to chatting about it over drinks tonight.

So how’d we get here? I started the talk with my own experience in the 90s, but let me back into it, and then talk about how I think it looks over the next 10 years.

5,000 years ago Babylon began systematically recording information. The Babylonian census is generally agreed to be the first of its kind. As with most things, the census was created for financial purposes, specifically to identify the society’s tax base. Now, on Twitter, there are even records of when people go to the bathroom. The financial incentive for that isn’t clear to me, but Charmin now has a direct marketing channel.

Around the 13th century BC, the Library at Thebes was the first known effort to collect and make many sources of information available in one place. This was Infochimps 1.0, though they were a bit ahead of their time.

The Library at Alexandria took storing information to a whole new level. The library was the first of its kind to aggregate sources from beyond its country’s borders. It is estimated that the library may have contained over half a million scrolls. Unfortunately they didn’t have fire insurance, and weren’t backing up their data in the cloud.

Not yet digital, but a vast improvement over scrolls, codices permitted random access to information. Bookmarks were invented shortly thereafter.

Gutenberg’s invention of the printing press in the mid 1400s enabled mass production and distribution of information. The printing press was an important step towards the democratization of knowledge. Our mission at Infochimps is to democratize access to structured information.

Moving along, infographics, including this famous one by Charles Minard 150 years ago depicting Napoleon’s march, helped humanize information. I will come back to this in a bit.

This past century brought about the changes that bring us all here today. Information moved from paper to disc and now to the ether.

So who cares?

Well, we do, or we wouldn’t be here. There is a lot of talk about information becoming a commodity. In aggregate, I think this is true. There is a ton of data being created and replicated, nearly 1.8 zettabytes this year alone.

And this is how it looks to most people.

At Infochimps, we’re trying to present it more like this. Gartner recommends Infochimps to anyone who has had to toil away for hours trying to find, format and sort data into useful formats. Can I see a show of hands if you’ve experienced that pain?

That’s the first pain we set out to solve. The next is business intelligence. This is data, specifically temperatures in various locations around the Polish-Russian border along with losses suffered by Napoleon’s army. This, you might recall, is Charles Minard’s famous infographic depicting this data. The lesson is clearly to not get involved in a land war in Asia.

Here is a gross representation of our ETL process. We extract data from tables on web pages, open APIs and commercial data sources. We store the data on AWS in a variety of data stores depending on the type and size of the data. We loosely connect the data and do lightweight transformation, including augmentation, completion, and normalization. We then make this data, currently over 15,000 data sets, available through our web site.
We have well over 200 sources of data, including Bundle which is aggregated credit card data from Citigroup, weather data from NOAA, venue checkin data from Foursquare, nearly 10 billion tweets from Twitter, online influence from Klout, social identity mapping from Qwerly, and retail location data from AggData. We believe the value of data is in the connections. For instance, how does weather affect sales? What are the most popular locations of a retailer based on venue checkins? How do you choose who to direct market to online, and how do you reach them?
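
To make the "value is in the connections" idea concrete, here is a minimal Python sketch of the kind of lightweight normalization and joining described above, using the weather-and-sales question as the example. Everything in it is an illustrative assumption: the records, the field names (`city`, `date`, `temp_f`, `units`), and the helper functions are hypothetical and are not Infochimps data or the Infochimps API.

```python
from collections import defaultdict

# Hypothetical sample records, invented for illustration only.
weather = [
    {"city": "Austin ", "date": "2011-06-01", "temp_f": 95},
    {"city": "austin", "date": "2011-06-02", "temp_f": 78},
]
sales = [
    {"city": "AUSTIN", "date": "2011-06-01", "units": 120},
    {"city": "Austin", "date": "2011-06-02", "units": 80},
]


def normalize_key(record):
    """Build a join key from a trimmed, lowercased city name and the date."""
    return (record["city"].strip().lower(), record["date"])


def join_on_city_and_date(left, right):
    """Inner-join two lists of dicts on the normalized (city, date) key."""
    # Index the right-hand rows by key so each left row is matched in one lookup.
    index = defaultdict(list)
    for row in right:
        index[normalize_key(row)].append(row)

    joined = []
    for row in left:
        city, date = normalize_key(row)
        for match in index.get((city, date), []):
            joined.append({
                "city": city,
                "date": date,
                "temp_f": row["temp_f"],
                "units": match["units"],
            })
    return joined


joined = join_on_city_and_date(weather, sales)
for row in joined:
    print(row)
```

Once the two sources share a normalized key, each joined row carries both the temperature and the units sold, which is the starting point for asking how one affects the other.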

These companies understand their data is valuable, but that the value of the data is greater when joined with other data. Some companies, including Liquid Robotics, have come to the realization that their data is more valuable than their product. Data is becoming the product. This is going to make it harder to go it alone.

So I am here to get you guys on board to go it together. How many of you have heard of Infochimps? How many of you have considered selling your data on Infochimps? I wanted to use a few of you as examples. Locationary is a local places database (and is a current Infochimps supplier), Chain Store Guide has restaurant and retail foodservice information, Yipit has daily deals and consumer purchasing pattern data, Zoominfo has business profile and company information, Vitals has data on doctors, LexisNexis of course is a case law database, and HG Data has supply chain data (and is a prospective Infochimps supplier). This data when aggregated can help answer questions, such as: How can I map daily deals and get additional information about those retailers? Who are the authorized resellers of a product and how can I get additional information about those businesses? Which doctors have been subject to malpractice or other litigation?

I hope you will join me in building the future of data. My email is nick@infochimps.com.
