- February 22, 2012
The Age of Big Data
Readers of this blog are no strangers to the problems that Gartner declares to be the hallmarks of our age of Big Data – volume, variety, and velocity. Nor would I consider Infochimps community members unaware that enormous value is contained in the world’s data, both internal and external to the organization.
What’s rarely admitted, however, is how difficult it can be to wrangle these data sets and operate the systems that process them. Running Hadoop and other distributed data architectures in the cloud is still a massive challenge, something typically managed by the data and operations elite. Demand for data science talent keeps growing, pushing salaries for these skilled individuals into ranges only the wealthiest enterprises can afford.
The Vision Behind Infochimps
When Infochimps was born, the co-founders set out with a mission that was deceptively simple – increase access to the world’s data. We understood that one of the first obstacles people faced was actually finding the data, as search engines don’t really work for tables and spreadsheets. So the Infochimps catalog was born, and from it the Infochimps Data Marketplace, as a way to incentivize content providers to make their data more open and available.
The Data Marketplace has been wonderfully successful. Hundreds of thousands of visitors have downloaded data from our catalog of over 15,000 data sets sourced from over 200 suppliers, including Bundle, Foursquare, and Twitter. Thousands of application developers from the likes of Sheckys, Summify, and Crimson Hexagon have leveraged our data to make their apps richer and more compelling.
But we’ve always known that it’s not enough. Raw data is just the fuel. Without an engine to turn it into something productive for the individual or organization, it’s doomed to fall short of its promise.
A Platform to Solve Our Own Problems
How do you get the world’s data to live in one place? This is no simple problem. Every day you’re dealing with the three challenges named above: volume, variety, and velocity. Some data sources update weekly, some by the minute, and others stream data to you at many gigabytes per hour. Data can arrive in a tabular format, as a JSON string, or as a giant blob of text. Then there is the sheer volume of sources and data you’re faced with warehousing.
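To make the variety problem concrete, here is a minimal Python sketch of how one might coerce those three shapes (a table, a JSON string, a blob of text) into one uniform record layout before warehousing. This is an illustration only, not how the Infochimps pipeline works: the `normalize` function, the `source`/`kind`/`fields` record layout, and the sample payloads are all made up for this example.

```python
import csv
import io
import json


def normalize(raw, source):
    """Turn one raw payload into a list of uniform record dicts.

    Hypothetical helper for illustration; the record layout is an assumption.
    """
    text = raw.strip()

    # 1. JSON string: a single object or an array of objects.
    try:
        parsed = json.loads(text)
    except ValueError:
        parsed = None
    if isinstance(parsed, dict):
        return [{"source": source, "kind": "json", "fields": parsed}]
    if isinstance(parsed, list) and parsed and all(isinstance(p, dict) for p in parsed):
        return [{"source": source, "kind": "json", "fields": p} for p in parsed]

    # 2. Tabular: a header row plus data rows, all with the same column count.
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) >= 2 and len(rows[0]) > 1 and all(len(r) == len(rows[0]) for r in rows):
        header, data = rows[0], rows[1:]
        return [
            {"source": source, "kind": "table", "fields": dict(zip(header, row))}
            for row in data
        ]

    # 3. Anything else is kept whole as a blob of text.
    return [{"source": source, "kind": "text", "fields": {"body": text}}]


if __name__ == "__main__":
    print(normalize("city,temp_f\nAustin,71\nDallas,68\n", "weather"))
    print(normalize('{"user": "chimp", "text": "hello"}', "tweets"))
    print(normalize("Saw lights over Austin, then nothing.", "ufo"))
```

The format sniffing here is deliberately naive. A real ingestion system would more likely rely on metadata declared per source than on guessing from the payload, but the sketch shows why every new source adds another branch to maintain.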
From the beginning, Infochimps has used Amazon Web Services (AWS), Hadoop, and a number of other Big Data technologies to source and aggregate the world’s data. Faced with the resource and personnel constraints of a typical startup, we began with a simple best-effort design approach, allowing our small team of data engineers to get away with moving massive cloud resources around with minimal effort. We developed Wukong to make it easy for our Ruby developers to run Hadoop jobs, and extended Chef into Ironfan (formerly known as Cluster Chef) to make the instantiation and management of our infrastructure so simple our engineers can “move cities with their minds.”
Google rocked the world when it released its MapReduce paper, inspiring what became Hadoop, and allowing the rest of the world to take advantage of the techniques it developed for its own data gathering efforts. In a similar vein, it is our hope that the release of our own internal technologies as a Platform product may help the world’s organizations to gather and manage the world’s data for their own purposes.
Context – the Next Level
A recent New York Times article featured some of the analytics done by Target, where marketers were able to figure out that a woman was pregnant based on her purchase patterns. This type of insight is remarkable and only marks the beginning of what’s to come as all our purchases, clicks, and check-ins are tracked and analyzed. Organizations will be able to take this only so far, however, if they restrict their imaginations to just their own data.
The next big leap for the world’s organizations will be how they use all of these new and developing information streams – from Google search traffic and tweets to 100 years of weather measurements, check-ins, and UFO sightings. In the financial world, researchers have demonstrated that Google search query data can predict inflation metrics weeks before the official numbers come out. Ecommerce websites have long used data like our IP-Geolocation to personalize web experiences and increase conversions.
The Infochimps Data Marketplace has helped us all appreciate the breadth of data the world has to offer. Now, we can help those organizations that want to use this data to find insight, increase revenues, and cut costs.
Interested? Want to know more?
The Infochimps Platform is made up of a suite of technologies we’ve developed internally, plus a number of open source software packages for which we’ve developed management tools and techniques. The Platform comes with the brains and experience of the brilliant Infochimps team so that you can maximize your return on a Big Data infrastructure investment.
For more information about the Platform, please use our contact form.
We are excited to hear from you!