The Infochimps Blog

Big data insights, news, and tips straight from the Data Mine

Announcing the Infochimps Platform for Big Data


The Age of Big Data
Readers of this blog are no strangers to the problems that Gartner declares to be the hallmarks of our age of Big Data: volume, variety, and velocity. Nor would I consider Infochimps community members unaware of the fact that there is a tremendous amount of wealth contained in the world’s data, both internal and external to the organization.

What’s rarely admitted, however, is how difficult it can be to wrangle these data sets and operate the systems that process them. Running Hadoop and other distributed data architectures in the cloud is still a massive challenge, something typically managed by the data and operations elite. The demand for data science talent keeps growing, pushing salaries for these skilled individuals into ranges only the wealthiest enterprises can afford.

The Vision Behind Infochimps
When Infochimps was born, the co-founders set out with a mission that was deceptively simple – increase access to the world’s data. We understood that one of the first things that made this hard for people was actually finding the data, as search engines don’t really work for tables and spreadsheets. The Infochimps catalog was born, and from that the Infochimps Data Marketplace as a way to incentivize content providers to make their data more open and available.

The Data Marketplace has been wonderfully successful. Hundreds of thousands of visitors have downloaded data from our catalog of over 15,000 data sets sourced from over 200 suppliers, including Bundle, Foursquare, and Twitter. Thousands of application developers from the likes of Sheckys, Summify, and Crimson Hexagon have leveraged our data to make their apps richer and more compelling.

But we’ve always known that it’s not enough. Raw data is just the fuel. Without an engine to make it into something productive for the individual or organization, it’s doomed to not live up to its promise.

A Platform to Solve Our Own Problems
How do you get the world’s data to live in one place? This is no simple problem. Every day you’re dealing with the three major challenges named above. Some data sources update weekly, some by the minute, and others stream data to you at many gigabytes per hour. Data can come in a tabular format, a JSON string, or a giant blob of text. Not to mention the sheer volume of sources and data you’re faced with warehousing.

From the beginning, Infochimps has used Amazon Web Services (AWS), Hadoop, and a number of other Big Data technologies to source and aggregate the world’s data. Faced with the resource and personnel constraints of a typical startup, we began with a simple best-effort design approach, allowing our small team of data engineers to get away with moving massive cloud resources around with minimal effort. We developed Wukong to make it easy for our Ruby developers to run Hadoop jobs, and extended Chef into Ironfan (formerly known as Cluster Chef) to make the instantiation and management of our infrastructure so simple our engineers can “move cities with their minds.”

Google rocked the world when it released its MapReduce paper, inspiring what became Hadoop and allowing the rest of the world to take advantage of the tools it developed for its own data gathering efforts. In a similar vein, it is our hope that the release of our own internal technologies as a Platform product may help the world’s organizations to gather and manage the world’s data for their own purposes.

Context – the Next Level
A recent New York Times article featured some of the analytics done by Target, whose marketers had been able to figure out that a woman was pregnant based on her purchase patterns. This type of insight is remarkable and only marks the beginning of what’s to come as all our purchases, clicks, and check-ins are tracked and analyzed. Organizations will only be able to take this so far, however, if they restrict their imaginations to just their own data.

The next big leap for the world’s organizations will be how they use all of these new and developing information streams, from Google search traffic and tweets to 100 years of weather measurements, check-ins, and UFO sightings. In the financial world, researchers have demonstrated that Google search query data can predict inflation metrics weeks before the official numbers come out. Ecommerce websites have long used data like our IP-Geolocation to personalize web experiences and increase conversions.

The Infochimps Data Marketplace has helped us all appreciate the breadth of data the world has to offer. Now, we can help those organizations that want to use this data to find insight, increase revenues, and cut costs.

Interested? Want to know more?
The Infochimps Platform is made up of a suite of technologies we’ve developed internally, plus a number of open source tools that we’ve developed techniques for managing. The Platform comes with the brains and experience of the brilliant Infochimps team so that you can maximize your return on a Big Data infrastructure investment.

For more information about the Platform, please use our contact form here.

We are excited to hear from you!

Winner of the Strata 2012 Conference Pass

Thanks to the random number generator, we’ve selected a winner amongst the folks who entered.  Congrats to #22 aka Nicolas Thiébaud.  And we swear… it’s not because he promised us French pastries, though we are excited for the rising Hadoop community in his home country!

We’ll see you at Strata!

Win a Conference Pass for Strata 2012!

We’re gearing up for Strata 2012, and the wonderful folks at O’Reilly have bestowed upon us one Conference Pass that we will be giving away to one of our lucky blog readers! This pass, worth $1345, gives you access to two amazing, action-packed days of panels, discussions and workshops at Strata 2012.

All you have to do is comment on this blog post!  That’s it!  We’ll select one winner on Friday, February 17th at 5pm CT and that person will receive a super secret code that will comp them for a Conference Pass.  How cool is that?

Want to learn more about the conference?  We’ve got the full rundown, plus a 20% off discount code for you after the jump!


Infochimps at Strata Conference 2012

We’re excited to have our CTO, Flip Kromer, presenting a talk at Strata Conference in Santa Clara later this month. The discussion centers on disambiguation. Now you might be wondering… what is disambiguation? Simply put, disambiguation is the process of resolving conflicts to remove ambiguity. We’ve discussed this topic a number of times in this blog, and Flip will be presenting on how this concept affects the way we ask questions and find answers about Big Data.

For more details on the talk, check out the Strata schedule.

The Rise and Fall of the Fortune 500

Ben Fry of Fathom Information Design put together this elegant interactive visualization of publicly available Wikipedia data on the Fortune 500, America’s largest corporations. His intent was to show how 84,000 data points could be easily viewed and navigated in one interactive piece. We think he did an amazing job using the clean, simple display to tell rich stories of company histories and the rise and fall of our country’s top corporations.

One company that stands out in our minds is Eastman Kodak, which enjoyed growing revenues and steady profitability for decades. Then, in 1990, Logitech came out with the Dycam Model 1 black-and-white digicam, the world’s first completely digital consumer camera, and the following decades only saw the further proliferation and now dominance of this technology. Kodak never quite kept up with the trend, and this led to its decline and its declaration of bankruptcy just a few weeks ago.

Fry’s interactive visualization is chock full of other amazing stories and insights.  We highly recommend checking it out.  Want to make something awesome with Wikipedia data?  We’ve got an API for that.

Fixies and Hipsters are… Correlated?

Depending on who you are, the sight of a gorgeously simple yet eclectic fixed gear bicycle may make your mouth water or may fill you with ire.  Perhaps if you feel the former, you are the current owner of several pairs of skinny jeans, a pearl snap vintage shirt and ironic glasses.  In other words, you are a hipster.

According to the folks on Quora, fixed gear bicycles (or fixies) are considered to be a strong indicator of hipsterness. The folks at the Priceonomics blog, as part of their effort to build a comprehensive bicycle pricing guide, have measured what kinds of used bicycles people sell and the quantity sold in cities across the US. To find where the hipsters live, they mined their database of 1.3 million bicycle listings to determine where the various markets for used fixed gear bicycles existed, which were the strongest (most sales), and which therefore likely had the highest number of hipsters.

Surprisingly, places commonly thought to be high in hipster density, including San Francisco and Portland, do not top the list. Brooklyn (NYC), commonly thought of as a hipster mecca, doesn’t even make the top 25. (You can see the full list here.) We’re pleasantly surprised that our hometown of Austin, TX ranks below Boise in hipsterness (at least as indicated by used fixed gear bicycle sales).

Now, this is a bit of a silly parallel to draw and certainly does not take into account the bike-ability of a city, let alone the individual reasoning various folks have for riding fixed gear bicycles, but it’s nevertheless a fun analysis of a massive corpus of bicycle pricing data.

The Best Pie Chart Ever

Thanks, ilovecharts.

How long does it take for a cockroach to die?

Earlier this week, YouTube revealed that users are uploading one hour of video every second to the site. It’s quite the amazing milestone, speaking not only to YouTube’s massive success, but also to the mind-boggling rate at which we are producing data. Furthermore, it was revealed that the average YouTube visitor spends 15 minutes a day on the site, and that the site sees a total of 4 billion video views per day.

It can be overwhelming for most to understand the sheer size of these numbers, so to help put things into perspective, YouTube has created One Hour Per Second. You’ll see some interesting comparisons, such as the one above, which shows that 3 minutes and 36 seconds of uploads to YouTube adds up to 9 days of video, or the time it takes for a decapitated cockroach to die. Yikes.
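For the curious, the cockroach comparison is easy to check. Here is a quick sketch in Python, assuming only the one-hour-per-second upload rate quoted above; the function and constant names are ours, made up for illustration.

```python
from datetime import timedelta

# Upload rate quoted by YouTube: one hour of video arrives every second.
# (The names below are illustrative, not anything YouTube publishes.)
VIDEO_PER_WALL_SECOND = timedelta(hours=1)


def video_uploaded(wall_clock: timedelta) -> timedelta:
    """Return how much video is uploaded during the given wall-clock span."""
    return VIDEO_PER_WALL_SECOND * wall_clock.total_seconds()


# 3 minutes 36 seconds is 216 seconds, so 216 hours of video get uploaded.
uploaded = video_uploaded(timedelta(minutes=3, seconds=36))
print(uploaded)  # 9 days, 0:00:00
```

In other words, 216 seconds of uploading yields 216 hours of video, and 216 / 24 is exactly 9 days.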

Economic Outlook: Mostly Typical

Using major macroeconomic indicators, Russell Investments has created a dashboard to capture a snapshot of the state of our economy.  It’s updated on the 22nd of each month with data from Bloomberg.

You can click through the “Historical Details” links to read more about each indicator and see its changes over time. Check out the legend below for complete details on how to read the chart.

So, what does this dashboard tell us about the current state of our economy?  For starters, we are growing at a modest 1.8%.  As you can see from the chart, most indicators are well within “typical” range and even mortgage delinquencies and corporate debt are slowly coming down.  I’ll be sure to keep my eyes peeled for updates to this slick little dashboard.

Are You OPEN to the New Anti-Piracy Bill?

OPEN vs SOPA vs PIPA

Two days after the Internet Blackout that saw supporters of SOPA and PIPA changing their minds, a new bill was introduced into the House of Representatives by Rep. Darrell Issa of California. Issa was one of the most vocal critics of SOPA and PIPA, and his proposed bill, known as the OPEN Act (Online Protection & ENforcement of Digital Trade Act), offers a more laser-focused solution to online piracy by foreign rogue sites.

What is most impressive about this new bill is the approach taken by Issa and his clearly better understanding of the web compared with many SOPA supporters. Issa’s office has set up a website, Keep the Web Open, that offers numerous resources for understanding the bill, as well as a new tool called “Madison”, which allows users to read, share, edit and comment on the text of the bill. A politician who adds transparency to the conversation and listens to the nerds? That sounds pretty sweet to us.
