Monthly Archives: February 2012
- February 22, 2012
The Age of Big Data
Readers of this blog are no strangers to the problems that Gartner declares to be the hallmarks of our age of Big Data – volume, variety, and velocity. Nor would I consider Infochimps community members to be in the dark about the enormous wealth contained in the world’s data, both internal and external to the organization.
What’s rarely admitted, however, is how difficult it can be to wrangle these data sets and operate the systems that process them. Running Hadoop and other distributed data architectures in the cloud is still a massive challenge, one typically managed by a small elite of data and operations engineers. Demand for data science talent keeps growing, pushing salaries for these skilled individuals into ranges only the wealthiest enterprises can afford.
The Vision Behind Infochimps
When Infochimps was born, the co-founders set out with a mission that was deceptively simple – increase access to the world’s data. We understood that one of the first things that made this hard for people was actually finding the data, as search engines don’t really work for tables and spreadsheets. The Infochimps catalog was born, and from that the Infochimps Data Marketplace as a way to incentivize content providers to make their data more open and available.
The Data Marketplace has been wonderfully successful. Hundreds of thousands of visitors have downloaded data from our catalog of over 15,000 data sets sourced from over 200 suppliers, including Bundle, Foursquare, and Twitter. Thousands of application developers, from the likes of Sheckys, Summify, and Crimson Hexagon, have leveraged our data to make their apps richer and more compelling.
But we’ve always known that it’s not enough. Raw data is just the fuel. Without an engine to make it into something productive for the individual or organization, it’s doomed to not live up to its promise.
A Platform to Solve Our Own Problems
How do you get the world’s data to live in one place? This is no simple problem. Every day you’re dealing with the three major challenges described above. Some data sources update weekly, some by the minute, and others stream data to you at many gigabytes per hour. Data can arrive in a tabular format, as a JSON string, or as a giant blob of text. Not to mention the sheer volume of sources and data you’re faced with warehousing.
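To make the variety problem concrete, here is a minimal Ruby sketch of the kind of normalization step this implies. The field names and dispatch rules are invented for illustration (not Infochimps internals); it coerces records arriving as JSON strings, CSV rows, or raw text into one common hash shape:

```ruby
require 'json'
require 'csv'

# Illustrative only: route each incoming record to a parser based on a
# crude guess at its shape, and emit a plain Ruby hash either way.
def normalize(record)
  case record
  when Hash
    record                          # already structured, pass through
  when /\A\s*[\{\[]/                # starts like a JSON object or array
    JSON.parse(record)
  when /,/                          # naive heuristic: treat as one CSV row
    CSV.parse_line(record).each_with_index.map { |v, i| ["col#{i}", v] }.to_h
  else
    { 'body' => record.to_s }       # opaque text blob, keep it whole
  end
end
```

Downstream code then only ever sees hashes, whatever mix of formats the upstream sources speak.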
From the beginning, Infochimps has used Amazon Web Services (AWS), Hadoop, and a number of other Big Data technologies to source and aggregate the world’s data. Faced with the resource and personnel constraints of a typical startup, we began with a simple best-effort design approach, allowing our small team of data engineers to get away with moving massive cloud resources around with minimal effort. We developed Wukong to make it easy for our Ruby developers to run Hadoop jobs, and extended Chef into Ironfan (formerly known as Cluster Chef) to make the instantiation and management of our infrastructure so simple our engineers can “move cities with their minds.”
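Wukong’s actual API is beyond the scope of this post, but the kind of job it wraps can be sketched as a pair of plain Ruby functions in the Hadoop Streaming style – runnable locally as `cat | map | sort | reduce` or shipped to a cluster unchanged. This is an illustrative word count, not Wukong code:

```ruby
# Mapper: turn one line of text into [word, 1] pairs.
def map_line(line)
  line.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Reducer: sum counts for pairs already grouped by key
# (on a cluster, Hadoop's shuffle/sort provides that grouping).
def reduce_pairs(pairs)
  totals = Hash.new(0)
  pairs.each { |word, n| totals[word] += n }
  totals
end
```

Keeping the map and reduce steps as ordinary functions is what lets a small team iterate locally before paying for cluster time.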
Google rocked the world when it released its MapReduce paper, inspiring what became Hadoop and allowing the rest of the world to take advantage of the tools Google had developed for its own data gathering efforts. In a similar vein, it is our hope that releasing our own internal technologies as a Platform product will help the world’s organizations gather and manage the world’s data for their own purposes.
Context – the Next Level
A recent New York Times article featured some of the analytics done at Target, where marketers had been able to figure out that a woman was pregnant based on her purchase patterns. This type of insight is remarkable, and it only marks the beginning of what’s to come as all our purchases, clicks, and check-ins are tracked and analyzed. Organizations will only be able to take this so far, however, if they restrict their imaginations to just their own data.
The next big leap for the world’s organizations will be how they use all of these new and developing information streams – from Google search traffic and tweets to 100 years of weather measurements, check-ins, and UFO sightings. In the financial world, researchers have demonstrated that Google search query data can predict inflation metrics weeks before the official numbers come out. E-commerce websites have long used data like our IP-Geolocation set to personalize web experiences and increase conversions.
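As a rough illustration of how an IP-geolocation data set gets used (the ranges and city names below are invented; real tables hold millions of rows), a lookup is just a search over address ranges sorted by their start address:

```ruby
require 'ipaddr'

# Hypothetical sample table: [range_start, range_end, location],
# with addresses stored as integers and rows sorted by range_start.
GEO_RANGES = [
  [IPAddr.new('10.0.0.0').to_i,    IPAddr.new('10.255.255.255').to_i,  'Austin'],
  [IPAddr.new('172.16.0.0').to_i,  IPAddr.new('172.31.255.255').to_i,  'Paris'],
  [IPAddr.new('192.168.0.0').to_i, IPAddr.new('192.168.255.255').to_i, 'Tokyo'],
].sort_by { |start, _, _| start }

def geolocate(ip)
  n = IPAddr.new(ip).to_i
  # Find the last range starting at or below n; a production system
  # would binary-search here instead of scanning.
  row = GEO_RANGES.select { |start, _, _| start <= n }.last
  row && n <= row[1] ? row[2] : nil
end
```

Swap the sample rows for a real IP-geolocation table and the same lookup drives the kind of personalization described above.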
The Infochimps Data Marketplace has helped us all appreciate the breadth of data the world has to offer. Now, we can help those organizations that want to use this data to find insight, increase revenues, and cut costs.
Interested? Want to know more?
The Infochimps Platform is made up of a suite of technologies we’ve developed internally, plus a number of open source components that we’ve built tools and techniques for managing. The Platform also comes with the brains and experience of the brilliant Infochimps team, so you can maximize your return on a Big Data infrastructure investment.
For more information about the Platform, please use our contact form here.
We are excited to hear from you!
- February 17, 2012
Thanks to the random number generator, we’ve selected a winner amongst the folks who entered. Congrats to #22 aka Nicolas Thiébaud. And we swear… it’s not because he promised us French pastries, though we are excited for the rising Hadoop community in his home country!
We’ll see you at Strata!
- February 14, 2012
We’re gearing up for Strata 2012, and the wonderful folks at O’Reilly have bestowed upon us one Conference Pass that we will be giving away to one lucky blog reader! This pass, worth $1,345, gives you access to two amazing, action-packed days of panels, discussions, and workshops at Strata 2012.
All you have to do is comment on this blog post! That’s it! We’ll select one winner on Friday, February 17th at 5pm CT and that person will receive a super secret code that will comp them for a Conference Pass. How cool is that?
Want to learn more about the conference? We’ve got the full rundown, plus a 20% off discount code for you after the jump!
- February 10, 2012
We’re excited to have our CTO, Flip Kromer, presenting a talk at the Strata Conference in Santa Clara later this month. The discussion centers on disambiguation. Now you might be wondering… what is disambiguation? Simply put, disambiguation is the process of resolving conflicts between data to remove ambiguity. We’ve discussed this topic a number of times on this blog, and Flip will be presenting on how this concept affects the way we ask questions and find answers about Big Data.
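As a toy illustration of the idea (the normalization rules and records below are invented, and real disambiguation is far more involved), conflicting mentions of the same entity can be grouped under a normalized key, with remaining conflicts settled by majority vote:

```ruby
# Strip punctuation and common corporate suffixes so different surface
# forms of one company name collapse to the same key. Rules are illustrative.
def normalize_name(name)
  name.downcase.gsub(/[.,]/, '').sub(/\b(inc|corp|co)\b/, '').strip
end

# Group records by normalized name, then resolve each conflicting field
# (here just :city) by taking the most common value in the group.
def disambiguate(records)
  records.group_by { |r| normalize_name(r[:name]) }.map do |key, group|
    cities = group.map { |r| r[:city] }
    winner = cities.group_by { |c| c }.max_by { |_, v| v.size }.first
    [key, winner]
  end.to_h
end
```

Three records naming “Eastman Kodak Co.”, “eastman kodak”, and “EASTMAN KODAK CO” would all land under one key, with the majority city winning out.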
For more details on the talk, check out the Strata schedule.
- February 1, 2012
Ben Fry of Fathom Information Design put together this elegant interactive visualization of publicly available Wikipedia data around the Fortune 500, America’s largest corporations. His intent was to show how 84,000 data points could be easily viewed and navigated in one interactive piece. We think he did an amazing job using the clean, simple display to tell rich stories of company histories and the rise and fall of our country’s top corporations.
One company that stands out in our minds is Eastman Kodak, which enjoyed growing revenues and steady profitability for decades. Then, in 1990, Logitech came out with the Dycam Model 1 black-and-white digicam, the world’s first completely digital consumer camera, and the following decades saw the further proliferation and now dominance of that technology. Kodak never quite kept up with the trend, which led to its decline and its declaration of bankruptcy just a few weeks ago.
Fry’s interactive visualization is chock full of other amazing stories and insights. We highly recommend checking it out. Want to make something awesome with Wikipedia data? We’ve got an API for that.