The Infochimps Blog

Big data insights, news, and tips straight from the Data Mine

Announcements

Announcing Support for OpenStack and the Rackspace Cloud

Infochimps is happy to announce that we now support the next-generation Rackspace Cloud, based on OpenStack. Through integration with the OpenStack API, the Infochimps Platform can now power big data applications based in the Rackspace Cloud, expanding the reach of the Infochimps Platform and making complex big data infrastructures quick and easy to run for a broader range of users.

Rackspace customers running the new OpenStack-based Rackspace Cloud Servers can quickly and easily spin up Hadoop clusters to power their big data applications in as little as 20 minutes with a single command using the Infochimps Platform. With the power of Ironfan, Infochimps’ open source provisioning tool, and Dashpot, Infochimps’ visualization and operations dashboard, customers can easily monitor and manage their Big Data operations on an ongoing basis, or leave it to Infochimps to manage it on the Rackspace Cloud for them.

Check out this demo of Infochimps Platform running in the Rackspace Cloud:

Why OpenStack and Rackspace?
From the beginning, the Infochimps Platform has been built on a foundation of open source tools for managing data, aimed at simplifying the experience of working with complex technologies such as Hadoop or Cassandra. Within the Infochimps Platform, Wukong, Ironfan and Swineherd are major open sourced components of the stack. OpenStack supports our open source tradition with its strong open source ecosystem. It is used by and contributed to by not only Rackspace, but also organizations such as NASA, Canonical, Red Hat, Dell, HP, and AT&T, so its architecture serves a multitude of needs, rather than bending to the whims of a single provider.

OpenStack also encourages standardization among Infrastructure as a Service providers, which ultimately benefits everyone in the market. Clients can make (and remake) decisions based on their businesses’ current day-to-day needs, without needing to employ a crystal ball to try to predict which provider will be best for them in the long term. By sharing open and standard interfaces, cloud providers can compete on current quality and value, instead of fighting to lock in customers based on promises.

The modular design of OpenStack is part of what makes standards possible without blocking innovation. There is a set of core APIs that every provider will support, and extensions for added capabilities that not every provider will want to allow. The contracts these APIs provide can be (and often are) fulfilled by different back-end providers, letting each provider make different architectural choices without requiring customers to completely retool to take advantage of them. All of this allows apples-to-apples comparison of provider architectures, without making orange sales impossible.
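
The split between a core contract and optional extensions can be pictured with a small sketch. The Python below is purely illustrative: `ComputeProvider`, `ProviderA`, `ProviderB`, `launch_cluster`, and the `snapshots` extension are hypothetical names chosen for this example, not part of the OpenStack API or the Infochimps Platform.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class ComputeProvider(ABC):
    """Hypothetical core contract that every provider implements."""

    @abstractmethod
    def create_server(self, name: str) -> str:
        """Start a server and return its identifier."""

    def extensions(self) -> set[str]:
        """Optional capabilities; by default a provider offers none."""
        return set()


class ProviderA(ComputeProvider):
    """A provider that supports only the core contract."""

    def create_server(self, name: str) -> str:
        return f"a-{name}"


class ProviderB(ComputeProvider):
    """A provider with a different back end and one extra capability."""

    def create_server(self, name: str) -> str:
        return f"b-{name}"

    def extensions(self) -> set[str]:
        return {"snapshots"}


def launch_cluster(provider: ComputeProvider, size: int) -> list[str]:
    """Client code depends only on the core contract, not on the provider."""
    return [provider.create_server(f"node{i}") for i in range(size)]
```

Because `launch_cluster` touches only the shared contract, swapping `ProviderA()` for `ProviderB()` needs no client changes, while a caller that wants snapshots can check `extensions()` first.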

What does OpenStack mean for Infochimps?
The work we’ve done to support this announcement has enabled us to provide a level of abstraction from the Amazon Web Services environment, and we can deploy our platform in a cloud agnostic way. Many of our customers have asked for implementations on their in-house cloud environments – our OpenStack support allows those implementations to be airlifted in using a common set of APIs that sit on top of whatever infrastructure already exists, instead of one-off installations that require more custom development and introduce brittleness.

Interested in learning more about Infochimps, Rackspace, and OpenStack? Contact us today for more information!

Announcing Dashpot, our Analytics & Operations Dashboard for the Infochimps Platform

Infochimps is happy to announce Dashpot, an easy-to-use analytics and operations dashboard that provides business metrics and visualization, cluster management capabilities, and system monitoring on top of the Infochimps Platform. Dashpot gives you real time visibility and control of your Big Data stack running with Infochimps, helping you go from input to insight faster, with our best-in-class Big Data infrastructure and tools.

Here are some of Dashpot’s key features:

  • Business Metrics – Dashpot’s in-stream visualization provides business users with the ability to capture and visualize business metrics on the fly as data is being ingested into their Infochimps Platform. By enabling data to be decorated in-stream through our Flume-based Data Delivery Service, Infochimps enables quick introspection on how a data or business process is performing. Organizations can view spikes or drops in key system or business metrics in near real-time, enabling quicker response to changing business conditions, saving time and helping ensure higher quality and more valuable information in the organization’s ultimate datastore. Infochimps business metrics are designed to provide an intermediate data visualization capability in conjunction with an organization’s existing investments in traditional business intelligence solutions.
  • Cluster Management – Built on the power of Ironfan, Dashpot offers simple Big Data system automation and management with a quick glance view into the servers and clusters currently running. Operations users can easily spin them up and down with a simple button click as their processing needs change, creating significant, easy-to-attain cost savings in machine usage.
  • Systems Monitoring – Dashpot provides integration with popular monitoring packages to provide users with at-a-glance views on Big Data system performance, availability, system integrity and more. Designed to easily integrate with any monitoring product, Infochimps has implemented the popular open source product, Zabbix as its initial reference monitoring solution, integrating Zabbix graphs on system performance and availability in the Infochimps Dashpot dashboard.
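
To make the idea of in-stream business metrics concrete, here is a minimal sketch of tallying a field while records pass through on their way to the datastore. It assumes records arrive as Python dictionaries with an `event` field; `count_in_stream` and the sample data are hypothetical and do not reflect the actual Dashpot or Data Delivery Service interfaces.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def count_in_stream(records: Iterable[Dict], counts: Counter,
                    key: str = "event") -> Iterator[Dict]:
    """Pass records through unchanged while tallying one field."""
    for record in records:
        counts[record.get(key, "unknown")] += 1
        yield record  # downstream delivery is unaffected


# Hypothetical sample data standing in for an incoming stream.
counts = Counter()
incoming = [{"event": "signup"}, {"event": "purchase"}, {"event": "signup"}]
delivered = list(count_in_stream(incoming, counts))
```

The tally is updated as a side effect of delivery, so a dashboard could read `counts` at any moment without waiting for the data to land in its final store.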

Implementing and operating Big Data architectures can be difficult, requiring significant investment of resources and time. By choosing to use the Infochimps Platform, enterprises needn’t worry about the time and hassle of building and maintaining their own infrastructure. When combined with our tools, such as Ironfan and DDS, Dashpot’s simple visualizations and management tools help organizations keep their Big Data system humming, with little operational overhead. Best of all, Dashpot’s in-stream visualizations help provide the insights businesses need to get the most value out of their Big Data infrastructure investment.

Interested in talking about how we can help simplify your Big Data stack?  Contact us today for more information!

Announcing the Infochimps Platform for Big Data


 

The Age of Big Data
Readers of this blog are no strangers to the problems that Gartner declares to be the hallmarks of our age of Big Data – volume, variety, and velocity. Nor are Infochimps community members unaware of the enormous wealth contained in the world’s data, both internal and external to the organization.

What’s rarely admitted, however, is how difficult it can be to wrangle these data sets and operate the systems to process them. Running Hadoop and other distributed data architectures in the cloud is still a massive challenge, something typically managed by the data and operations elite. The demand for data science talent keeps growing, pushing salaries for these skilled individuals into ranges only the wealthiest enterprises can afford.

The Vision Behind Infochimps
When Infochimps was born, the co-founders set out with a mission that was deceptively simple – increase access to the world’s data. We understood that one of the first things that made this hard for people was actually finding the data, as search engines don’t really work for tables and spreadsheets. The Infochimps catalog was born, and from that the Infochimps Data Marketplace as a way to incentivize content providers to make their data more open and available.

The Data Marketplace has been wonderfully successful. Hundreds of thousands of visitors have downloaded data from our catalog of over 15,000 data sets sourced from over 200 suppliers, including Bundle, Foursquare, and Twitter. Thousands of application developers from the likes of Sheckys, Summify, and Crimson Hexagon have leveraged our data to make their apps richer and more compelling.

But we’ve always known that it’s not enough. Raw data is just the fuel. Without an engine to make it into something productive for the individual or organization, it’s doomed to not live up to its promise.

A Platform to Solve Our Own Problems
How do you get the world’s data to live in one place? This is no simple problem. Every day you’re dealing with the three major challenges quoted above. Some data sources update weekly, some by the minute, and others stream data to you at many gigabytes per hour. Data can come in a tabular format, a JSON string, or a giant blob of text. Not to mention the sheer volume of sources and data you’re faced with warehousing.

From the beginning, Infochimps has used Amazon Web Services (AWS), Hadoop, and a number of other Big Data technologies to source and aggregate the world’s data. Faced with the resource and personnel constraints of a typical startup, we began with a simple best-effort design approach, allowing our small team of data engineers to get away with moving massive cloud resources around with minimal effort. We developed Wukong to make it easy for our Ruby developers to run Hadoop jobs, and extended Chef into Ironfan (formerly known as Cluster Chef) to make the instantiation and management of our infrastructure so simple our engineers can “move cities with their minds.”

Google rocked the world when it released its MapReduce paper, inspiring what became Hadoop, and allowing the rest of the world to take advantage of the tools it developed for its own data gathering efforts. In a similar vein, it is our hope that the release of our own internal technologies as a Platform product may help the world’s organizations to gather and manage the world’s data for their own purposes.
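
For readers new to the model, the programming pattern behind that paper can be shown with a toy, single-process word count. This is only a sketch of the map, shuffle, and reduce phases in plain Python; the function names are made up for illustration and are not the Hadoop or Wukong APIs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in order on an in-memory input."""
    return reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster the map and reduce steps run on many machines and the shuffle moves data between them; here everything happens in one process so the flow is easy to follow.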

Context – the Next Level
A recent New York Times article featured some of the analytics done by Target, where marketers were able to figure out that a woman was pregnant based on her purchase patterns. This type of insight is remarkable and only marks the beginning of what’s to come as all our purchases, clicks, and check-ins are tracked and analyzed. Organizations will be able to take this only so far, however, if they restrict their imaginations to just their own data.

The next big leap for the world’s organizations will be how they use all of these new and developing information streams – from Google search traffic and tweets to 100 years of weather measurements, check-ins, and UFO sightings. In the financial world, researchers have demonstrated that Google search query data can predict inflation metrics weeks before the official numbers come out. Ecommerce websites have long used data like our IP-Geolocation to personalize web experiences and increase conversions.

The Infochimps Data Marketplace has helped us all appreciate the breadth of data the world has to offer. Now, we can help those organizations that want to use this data to find insight, increase revenues, and cut costs.

Interested? Want to know more?
The Infochimps Platform is made up of a suite of technologies we’ve developed internally, plus a number of open source software packages that we’ve developed tools and techniques for managing. The Platform comes with the brains and experience of the brilliant Infochimps team to help you maximize your return on a Big Data infrastructure investment.

For more information about the Platform, please use our contact form here.

We are excited to hear from you!

Winner of the Strata 2012 Conference Pass

Thanks to the random number generator, we’ve selected a winner amongst the folks who entered.  Congrats to #22 aka Nicolas Thiébaud.  And we swear… it’s not because he promised us French pastries, though we are excited for the rising Hadoop community in his home country!

We’ll see you at Strata!

Infochimps at Strata Conference 2012

We’re excited to have our CTO, Flip Kromer, presenting a talk at Strata Conference in Santa Clara later this month.  The discussion centers on disambiguation.  Now you might be wondering… what is disambiguation?  Simply put, disambiguation is the process of resolving conflicts to remove ambiguity.  We’ve discussed this topic a number of times on this blog, and Flip will be presenting on how this concept affects the way we ask questions and find answers about Big Data.

For more details on the talk, check out the Strata schedule.

Same awesome data, Sweet new website

As an early Christmas present to ourselves, we’ve introduced a sweet new website meant to help our site visitors more easily navigate to the data products and solutions they need.  In the new design, we highlight our top data products: Social, Geo and Data Marketplace (where you can still access over 15,000 downloadable data sets and APIs), as well as the data expertise we can bring to the table.

Take a peek around the site and let us know what you think.  We’ve got more updates and changes in store over the next few months and we’d love your direct feedback as we iterate towards awesomeness.

Once a Chimp, Always a Chimp

Having had the privilege to be involved with Infochimps since its founding in the summer of 2009, and having led the company for the last year as CEO, it is with mixed feelings that I announce I will be reducing my day-to-day responsibilities with the company. In the interim, my co-founder Joe Kelly will be taking the reins. Having worked with Joe since the beginning, I know it will be a smooth handoff and the company will be in good hands as we expand. I will continue to be involved with Joe and the team as a Board Advisor.

In my time as CEO, we closed two rounds of financing, grew a tremendous user base, and built a best in class engineering team, including those that joined us through our acquisition of Keepstream. Our data catalog now boasts over 200 suppliers including Twitter and Foursquare, and with over 10,000 customers we’re well on our way toward our mission of democratizing access to data.

I’m excited to take what I’ve learned at Infochimps and all the friends I’ve made and apply it to something new and exciting. I look forward to what’s next, but am equally excited to continue to help the Infochimps team build the best data company in the world!

Transitioning to Lean at Infochimps

Two nights ago, my fellow chimps, Dhruv Bansal, Tim Gasper and I gave a presentation at the Austin Lean Startup Circle on the company’s recent transition to lean. We discussed our switch to a lean product strategy driven by must-have customer problems and the lean concepts and tools we have used to get there. It’s chock full of insights, struggles and great ideas for startups looking to adopt the Lean methodology.

For a version with full audio, check it out on Posterous.

Become a Chimp… We’re Hiring!

Do you love accessing cool data but hate scraping, cleaning and parsing it all day long? Apparently so do a lot of people! Come work for us and be a hero to developers everywhere who just want an easy place to access the data they want.  Check out our current open positions: Architect, Data Engineer, Data Scientist, Head of Marketing.

Here are just a few of the great things about working at Infochimps:

  • A world class team of friendly people eager to tackle hard problems
  • Ask around, we have one of the finest data science and scalable backend teams in the world
  • Convenient location in downtown Austin, a city ranked Kiplinger’s #1 city for the next decade and Forbes #1 best bargain city
  • Delish lunches brought in every day, free for employees
  • All the bananas you can eat
  • Competitive salary and options
  • Health insurance benefits, fully paid for employees

If you want to be part of our team, please send a resume and details about why you would be excited to work at Infochimps to jobs@infochimps.com.

Look forward to hearing from you. Please feel free to let us know if you have any questions!

Meet Jim the Monkey + Other Website Updates

Meet Jim the Monkey, the friendly greeter on our newly redesigned sign up page. Coincidentally, he shares a name with our new Director of User Experience, Jim England, who has been busily improving key areas of Infochimps.com.  As you may recall, Jim (the human), formerly of Keepstream, joined just a few months ago and has already made some huge headway in making our site more user friendly, easier to navigate and just a wee bit cuter with the addition of Jim (the monkey).

Whether you’re a new visitor or a long-time fan of Infochimps, we’d love to know what you think of the changes we have underway!

Leave us a comment, send us a tweet, or send us an email with your thoughts!

New Header



Our new header compresses the best elements of our old one into a sleeker, easier to navigate design.  The upper part in lighter grey now holds our search bar as well as our key navigational elements.  The lower part in darker grey helps users navigate to our most popular API offerings, as well as access their account.  Bonus – when you’re logged in, the dark grey bar becomes our account navigator with quick links to your profile, API dashboard (complete with usage charts) and account settings.

