A Sneak Peek: Big Data for Chimps

  • Amanda McGuckin Hager

You may know Flip Kromer, Infochimps co-founder, CTO, and leading data scientist. If you don’t, you soon will. O’Reilly is publishing his book “Big Data for Chimps: A Guide to Massive-Scale Data Processing in Practice,” available for pre-order now. “Big Data for Chimps” brings an educational spin to the big data space unlike anything you may have read before. While beginners stand to gain quite a bit from reading the book, it also appeals to those experienced in modern programming techniques. Flip’s approach builds the foundation throughout: using data at massive scale in very practical ways. That is, big data is about the data, and gaining value from it, not about the technologies.

“Big Data for Chimps” will help you:

  • Discover how to think at scale by understanding how data must flow through the cluster to effect transformations
  • Identify the Big Data Infrastructure tuning knobs that matter
  • Learn the Big Data rules-of-thumb
  • Apply Hadoop to interesting problems through detailed example programs
  • Gain advice and best practices for efficient software development

You will be captivated and engaged by Flip’s creative use of examples, analogies, and stories. When you read the book, you’ll experience: “Where is BBQ?,” “Pig Latin Translator,” “Patterns in UFO Sightings,” “Elephant and Chimpanzee Start a Business,” and more.

The chapter “Elephant and Chimpanzee Save Christmas” especially commands your attention. Watch for it in January, as we’ll be releasing some chapters leading up to publication.

Happy Holidays from all of us at Infochimps.  As our gift to you, here’s a little sneak peek at what’s in store, called “The Hadoop Haiku:”

data flutters by
elephants make sturdy piles
insight shuffles forth



Announcing Infochimps Enterprise Cloud


Big Data is confusing to most executives. It’s a nebulous concept of applying technologies from Yahoo!, Facebook, LinkedIn, and Twitter in such a way that the organization truly becomes data-driven and, equally important, does so quickly. Unfortunately, only a few companies are realizing its full potential.

That’s why Infochimps is announcing its Enterprise Cloud: a Big Data cloud service built specifically for Fortune 1000 enterprises that want to rapidly explore how big data technology can unlock revenue from their data. The Infochimps Enterprise Cloud addresses several challenges holding executives back from quickly gaining value from this disruptive technology.

Enterprises are only leveraging 15% of their data assets

Enterprises, on average, capture and analyze about 15% of their data assets. Typical data sources include transactional data (who bought what). A 360-degree view of the business, however, requires a 360-degree view of the customer, as well as of manufacturing, supply chain, finance, sales, marketing, engineering, and so on. Only by capturing 100% of the enterprise’s operational data and then supplementing it with external data (for example, we’re talking to one pharmaceutical company about using claims data from 100+ health plans covering more than 70 million people) will you achieve maximum value from your data analytics. With the Infochimps Enterprise Cloud, you can not only combine 100% of your private data in a private cloud, but also supplement it with another 100%+ of external data.

Time-To-Market constrained by infrastructure deployments

The deployment of, and value creation from, new disruptive big data technologies (Hadoop, NoSQL, in-stream processing) still takes considerable time, human resources, and money. Typical Enterprise Data Warehouse projects take 18–24 months to deploy. Simple changes to star-schema data models take a minimum of 6 months to reach internal development organizations. Hadoop projects, although less complicated than an EDW, take about 12 months to deploy. With the Infochimps Enterprise Cloud, you can deploy value in 30 days.

Big Data talent hard to find

When I read articles about the gap between supply and demand for big data talent, I think to myself, “this is not a situation where analysts collected a sample of 10 companies and generalized it to the entire market.” It’s a real problem. If you are some “antiquated” Fortune 1000 company (you know who you are) looking to hire crazy-smart engineers and data scientists away from Facebook…well, sorry…you don’t have the corporate culture or the exciting environment that this talent enjoys. McKinsey forecasts that the gap between demand and supply of this talent is only going to get worse (60% by 2018). With the Infochimps Enterprise Cloud, you can leverage your existing talent. This is done by providing a simple but powerful abstraction between your application development team and the complex big data infrastructure.

One Big Data technology does not fit all

There are literally hundreds of DBMS / data store solutions today, each with different advantages depending on data type and use case. Business users and application developers get lost in the nuances of data infrastructure and lose focus on the business needs. Don’t let a single data store vendor tell you it can address all your business needs. You need several. With the Infochimps Enterprise Cloud, we make you start with the business problem first, then draw from a very comprehensive data services layer that addresses it. Guess what? It’s not just Hadoop.

Infrastructure and data integration is the most challenging

Integrating existing data infrastructure with new big data infrastructure, and then adding external data sources on top, makes integration a completely new problem. This is not a matter of simply upgrading your ETL tools. With the Infochimps Enterprise Cloud, we help you understand the “new ETL” used by our web-scale friends.

Open source is cheap, but not easily commercialized

Silicon Valley alone has created over 250,000 open source projects. Disruption is clearly occurring within the open source community. However, enterprises are not in a position to deploy it properly, even with the many commercialization vendors. How does a company integrate several open-source solutions into one? With the Infochimps Enterprise Cloud, we support an end-to-end big data service consisting of many commercial open source projects, combined to offer real-time stream processing, ad-hoc analytics, and batch analytics as one integrated data service.

Data security + data volume both dictate deployment options

Only non-sensitive, publicly available data sets (e.g. Twitter data) run on elastic public cloud infrastructure. Compliance and governance issues still require that data-sensitive analytics occur “behind the firewall.” And if you are an established enterprise with large volumes of data, you are not going to “upload” them to the cloud for your analytics. With the Infochimps Enterprise Cloud, we provide public, virtual private, private, or hybrid big data cloud services that address the needs of big businesses with big problems.

Today, I’m pleased to announce the Infochimps Enterprise Cloud, our big data cloud running on a network of big data-focused data centers and being deployed by leading big data system integrators.

These are exciting times, indeed. Read the full press release here >>.



Infochimps Plugged In To Gnip




Infochimps is pleased to announce our partnership with Gnip, one of the world’s largest and most trusted providers of social data, as part of their Plugged In To Gnip program.

Today, Gnip announced their new program that “enables Plugged In To Gnip partners to transparently showcase their access to the most reliable, comprehensive and sustainable social data in the world, creating the best possible experience for their customers”.  Read the full press release here >>

Why is Infochimps Plugged In To Gnip?
Getting a handle on the immense volume of data produced by the social networks Gnip provides often requires a sophisticated data infrastructure for processing and controlling feeds. As a partner in providing solutions to customers who need to extract insight from this treasure trove, Infochimps can help by setting customers up with a best-in-class data platform for refining and working with Gnip’s feeds.

Gnip powers social analytics solutions for some of the world’s largest Business Intelligence and Social Media Monitoring firms. They are a certified Twitter partner and exclusive provider of commercial access to public data from Tumblr, WordPress, StockTwits and Disqus.

To learn more about Plugged In To Gnip, visit


Announcing Ironfan v4: Multicloud Capabilities + Community Support

Ironfan, the groundwork of the Infochimps Platform, is a systems provisioning, deployment, and updating tool built from a combination of proprietary and open-source technologies like Chef and Fog.

After several proofs of concept and forks hampered by the lack of underlying abstractions, we are happy to announce true multicloud capabilities for Ironfan. These bring the current version to a feature set largely on par with core Ironfan v3 (which was EC2-only). The current version is also ready for new providers, and VMware is working on catching its fork of Ironfan, Serengeti, up to the latest code. This version has been undergoing heavy development and testing, including increasing third-party contributions, and we have an increased internal focus on expanding and hardening both the cookbooks and the knife plugin.

Interested in our growing community? Please join our new mailing list, managed by Nick Marden from GetSatisfaction. We’d love to hear your feedback!


Next Gen Real-time Streaming with Storm-Kafka Integration

At Infochimps, we are committed to embracing cutting edge technology, while ensuring that the latest Big Data innovations are enterprise-ready. Today, we are proud to deliver on that promise once again by announcing the integration of Storm and Kafka into the Cloud::Streams component of the Infochimps Cloud.


Cloud::Streams provides solutions for challenges involving:

  • Large-scale data collection – clickstream web data, social media and online monitoring, financial market data, machine-to-machine data, sensors, business transactions, listening to or polling application APIs and databases, etc.
  • Real-time stream processing – real-time alerting, tagging and filtering, real-time applications, fast analytical processing like fraud detection or sentiment analysis, data cleansing and transformation, real-time queries, distribution to multiple clients, etc.
  • Analytics system ETL – providing normalized/de-normalized data using customer-defined business logic for various analytics data stores and file systems including Hadoop HDFS, HBase, Elasticsearch, Cassandra, MongoDB, PostgreSQL, MySQL, etc.
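As a rough illustration of those three stages, here is a toy collect-process-distribute flow in plain Python. Every name in it is ours for illustration; Cloud::Streams itself builds on Storm and Kafka.

```python
# Toy sketch of a collect -> process -> distribute flow in plain Python.
# Every name here is illustrative; Cloud::Streams itself builds on Storm and Kafka.

def collect(raw_events):
    """Collection: normalize heterogeneous source records into dicts."""
    for line in raw_events:
        user, action = line.split(",")
        yield {"user": user.strip(), "action": action.strip()}

def tag_and_filter(events):
    """Real-time processing: drop noise, tag interesting events."""
    for event in events:
        if event["action"] == "heartbeat":
            continue  # filter out noise
        event["suspicious"] = event["action"] == "refund"  # tag
        yield event

def etl_sink(events, store):
    """Analytics ETL: deliver the processed stream to a target store."""
    for event in events:
        store.append(event)

store = []
raw = ["alice, purchase", "bob, heartbeat", "carol, refund"]
etl_sink(tag_and_filter(collect(raw)), store)
# store now holds alice's purchase and carol's refund; bob's heartbeat was filtered
```

The generators never hold the whole stream in memory, which is the same property that lets a real stream processor run continuously over unbounded input.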

Storm and Kafka

Recently in my guest blog post on TechCrunch, I mentioned why you should care about Storm and Kafka.

“With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message gets processed in real-time, reliably. In tandem, Storm and Kafka can handle data velocities of tens of thousands of messages every second.”

Ultimately, Storm and Kafka form the best enterprise-grade real-time ETL and streaming analytics solution on the market today. Our goal is to put the same technology that Twitter uses to process over 400 million tweets per day — in your hands. Other companies that have adopted Storm in production include Groupon, Alibaba, The Weather Channel, FullContact, and many others.

Nathan Marz, Storm creator and senior Twitter engineer, comments on Storm’s rapid growth:

“Storm has gained an enormous amount of traction in the past year due to its simplicity, robustness, and high performance. Storm’s tight integration with the queuing and database technologies that companies already use have made it easy to adopt for their stream computing needs.”

Storm solves a broad set of use cases, including “processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more.”

Apache Kafka, which was developed by LinkedIn to power its activity streams, provides an additional reliability guarantee, robust message queueing, and distributed publish-subscribe capabilities.
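The reliability guarantee comes from Kafka's durable log: a message stays in the log until the consumer commits its offset, so a crash before the commit means redelivery rather than loss. A pure-Python toy of that idea (not Kafka's actual API):

```python
# Pure-Python toy of the durable-log idea (not Kafka's actual API): a message
# stays in the log until the consumer commits its offset, so a crash before
# the commit means redelivery, never loss.

class DurableLog:
    def __init__(self):
        self.log = []        # append-only message log
        self.committed = 0   # last committed consumer offset

    def publish(self, msg):
        self.log.append(msg)

    def consume(self):
        """Return every message at or beyond the committed offset."""
        return self.log[self.committed:]

    def commit(self, offset):
        self.committed = offset

q = DurableLog()
for m in ("a", "b", "c"):
    q.publish(m)

first_try = q.consume()   # consumer reads a, b, c ...
second_try = q.consume()  # ... crashes before committing, so it re-reads the same messages
assert first_try == second_try == ["a", "b", "c"]

q.commit(3)               # processing succeeded; commit the offset
assert q.consume() == []  # nothing is redelivered
```

This is at-least-once delivery: a consumer that crashes between processing and committing will see some messages twice, which is why downstream processing should be idempotent.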


Cloud::Streams is fault-tolerant and linearly scalable, and performs enterprise data collection, transport, and complex in-stream processing. In much the same way that Hadoop provides batch ETL and large-scale batch analytical processing, Cloud::Streams provides real-time ETL and large-scale real-time analytical processing — the perfect complement to Hadoop (or in some cases, what you needed instead of Hadoop).

Cloud::Streams adds important enterprise-class enhancements to Storm and Kafka, including:

  • Integration Connectors to your existing tech environment for collecting required data from a huge variety of data sources in a way that is robust yet as non-invasive as possible
  • Optimizations for highly scalable, reliable data import and distributed ETL (extract, transform, load), fulfilling data transport needs
  • Developer Toolkit for rapid development of decorators, which perform the real-time stream processing
  • Guaranteed delivery framework and data failover snapshots to send processed data to analytics systems, databases, file systems, and applications with extreme reliability
  • Rapid solution development and deployment, along with our expert Big Data methodology and best practices

Infochimps has extensive experience implementing Cloud::Streams, both for clients and for our internal data flows including large-scale clickstream web data flows, massive Twitter scrapes, the Foursquare firehose, customer purchase data, product pricing data, and more.

Obviously, data failover and optimizations are key to enterprise readiness. Above and beyond that, though, Cloud::Streams is a joy to work with because of its flexible Integration Connectors and the Developer Toolkit. No matter where your data is, you can access and ingest it with a variety of input methods. No matter what kind of work you need to perform (parse, transform, augment, split, fork, merge, analyze/process, …), you can quickly develop that processor unit, test it, and deploy it as a Cloud::Streams decorator.
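To make that concrete, here is a hypothetical pair of processor units and a chaining helper in Python. The function names are ours for illustration, not the Cloud::Streams Developer Toolkit's API; the point is that each unit does one job and can be tested on its own before deployment.

```python
# Hypothetical processor units in the spirit described above; the names are
# ours, not the Cloud::Streams Developer Toolkit's API. Each unit does one
# job and can be tested on its own before deployment.

def parse(record):
    """Parse: turn a raw 'key=value;key=value' record into a dict."""
    return dict(pair.split("=", 1) for pair in record.split(";"))

def augment(event):
    """Augment: derive a new field from existing ones."""
    event["is_mobile"] = event.get("agent", "").startswith("Mobile")
    return event

def pipeline(records, *units):
    """Run each record through the chain of processor units."""
    for record in records:
        for unit in units:
            record = unit(record)
        yield record

raw = ["user=alice;agent=Mobile Safari", "user=bob;agent=Firefox"]
events = list(pipeline(raw, parse, augment))
print(events[0]["is_mobile"], events[1]["is_mobile"])  # True False
```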

One of our most recent customers was able to build an entire production application flow for large-scale social media data analysis using the Infochimps Cloud development framework in just 30 days with only 3 developers. That is both unheard of from an enterprise timeline perspective, as well as an amazing case of business ROI. Big Data is too important to spend months and months developing. Your business needs results now, and the Infochimps Cloud leverages the talent you have today for fast project success.

How much is it worth to you to launch your own revenue-generating applications for your customers? Or for your internal stakeholders as part of a Big Data business intelligence initiative? How much value would launching 12 months sooner provide your organization? These are questions whose answers we’re trying to make obvious.

Steve Blackmon, Director of Data Sciences at W2O Group, explains why they are working with Infochimps and Cloud::Streams:

“Storm and Kafka are excellent platforms for scalable real-time data processing. We are very pleased that Infochimps has embraced Storm and Kafka for Cloud::Streams. This new offering gives us the opportunity to supplement our listening and analytics products with Infochimps’ data sources, to integrate capabilities seamlessly with our partners who also use Storm, and to retain Infochimps’ unique technical team to support and optimize our data pipelines.”

More Information

Check out the full press release here, including quotes from CEO Jim Kaskade and co-founder and CTO Flip Kromer.

You can access additional resources from the Cloud::Streams web page or our general resources directory.

Lastly, check out our previous product announcements! In February, we launched the Infochimps Platform. In April we launched Dashpot as well as our support of OpenStack. In August, we announced the Platform’s newest release.


The Data Era – Moving from 1.0 to 2.0

“This post is from Jim Kaskade, the newly appointed CEO of Infochimps. When we first met Jim, we were really impressed by him from multiple points of view. His first questions about us were about our culture, something we pride ourselves on cultivating, and we would only want to work with an executive who shared the same concern. Second, his understanding of the market and technological solutions matched, and in some areas exceeded, our own. Third, Jim brings true leadership and CEO experience to the table, having been an executive at and led a number of startups after a career at Teradata. We are truly excited to have Jim aboard and look forward to working together for many years!”

-Flip Kromer, Dhruv Bansal, and Joseph Kelly, co-founders of Infochimps

Do you think they truly understood just how fast the data infrastructure marketplace was going to change?

That is the question that comes to mind when I think about Donald Feinberg and Mark Beyer at Gartner who, last year, wrote about how the data warehouse market is undergoing a transformation. Did they, or anyone for that matter, understand the significant change underway in the data center? I describe it as Big Data 1.0 versus Big Data 2.0.

Big Data 1.0


I was recently talking to friends at one of our largest banks about their Big Data projects under way. In less than one year, their Hadoop cluster has already far exceeded their Teradata enterprise data warehouse in size.

Is that a surprise? Not really. When you think about it, a traditional large data warehouse is measured in terabytes, not petabytes (well, unless you are eBay).

With the current “Enterprise Data Warehouse” (EDW) framework (shown here) we will always see the high-value structured data in the well-hardened, highly available and secure EDW RDBMS (aka Teradata).

In fact, Gartner defines a large EDW starting at 20TB. This is why I’ve held back from making comments like, “Teradata should be renamed to Yottadata.” After all, it is my “alma mater” after having spent 10 years learning Big Data 1.0 there. I highly respect the Teradata technology and more importantly the people.

Big Data 2.0

So with over two zettabytes of information being generated in 2012 alone, we can expect more “Big Data” systems to be stood up, new breakthroughs in large dataset analytics, and many more data-centric applications being developed for businesses.


However, many of the “new systems” will be driven by “Big Data 2.0” technology. The enterprise data warehouse framework itself doesn’t change much, but many new players, mostly open source, have entered the scene.

Examples include:

  • Talend for ETL
  • Cloudera, Hortonworks, MapR for Hadoop
  • SymmetricDS for replication
  • HBase, Cassandra, Redis, Riak, Elasticsearch, etc. for NoSQL / NewSQL data stores
  • ’R’, Mahout, Weka, etc. for machine learning / analytics
  • Tableau, Jaspersoft, Pentaho, Datameer, Karmasphere, etc. for BI

There are so many new and disruptive technologies, each contributing to the evolution of the enterprise’s data infrastructure.

I haven’t yet mentioned one of the more controversial statements made in the adjacent graphic: Teradata is becoming a source alongside the new pool of unstructured data. Both the new and the old data are being aggregated into the “Big Data Warehouse.”

We may also be seeing much of what Hadoop does in ETL feeding back into the EDW. But I suspect that this will become less significant as compared to the new analytics architecture with Hadoop + NoSQL/NewSQL data stores at the core of the framework – especially as this new architecture becomes more hardened and enterprise class.

Infochimps’ Big Data Warehouse Framework


This leads us to why Infochimps is so well positioned to make a significant impact within the marketplace.

By leveraging four years of experience and technology development in cloud-based big data infrastructure, the company is now offering a suite of products that contribute to each part of the Big Data Warehouse Framework for enterprise customers.

DDS: With Infochimps’ Data Delivery Services (DDS), our customers’ application developers need not rely on sophisticated ETL tools. Instead, they can manipulate data streams of any volume or velocity through a simple, developer-friendly language referred to as Wukong. Wukong turns application developers into data scientists.

Ingress and egress can be handled directly by the application developer, uniquely bridging the gap between them and their data.

Wukong: Wukong is much more than a data-centric domain-specific language (DSL). With standardized connectors to analytics from ‘R’, Mahout, Weka, and others, not only is data manipulation made easy, but so is integrating sophisticated analytics with the most complicated data sources.
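Wukong itself is Ruby; as a language-neutral sketch of the same idea, a thin, data-centric layer sitting between messy records and a simple analytic, here is a streaming online mean in Python (all names are ours, not Wukong's API):

```python
# A streaming online mean in Python: a thin, data-centric layer between messy
# records and a simple analytic. All names are illustrative, not Wukong's API.

def records(lines):
    """Minimal 'connector': parse comma-separated lines into (user, amount)."""
    for line in lines:
        user, amount = line.split(",")
        yield user.strip(), float(amount)

def online_mean(values):
    """Tiny 'analytic': a running mean that never holds the stream in memory."""
    n, mean = 0, 0.0
    for v in values:
        n += 1
        mean += (v - mean) / n
    return mean

data = ["alice, 10", "bob, 30", "carol, 20"]
avg = online_mean(amount for _, amount in records(data))
print(avg)  # 20.0
```

Because the analytic consumes a generator, the same code works whether the records come from a three-line list or a firehose.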

Hadoop & NoSQL/NewSQL Data Stores: At the center of the framework is not only an elastic, cloud-based Hadoop stack but also a selection of NoSQL/NewSQL data stores. This uniquely positions Infochimps to address both decision-support workloads, which are complex and batch in nature, and OLTP or more real-time workloads. The complexity of standing up, configuring, scaling, and managing these data stores is all automated.

Dashpot: The application developer is typically left out by many of the business intelligence tools offered today, because most are extremely powerful tools built for specialized groups of business users and analysts. Infochimps has taken a slightly different approach, staying focused on the application developer. Dashpot is a reporting and analytics dashboard built for the developer, enabling quick iteration and insight into the data before production and before the deployment of more sophisticated BI tools.

Ironfan and Homebase: As the underpinning of the Infochimps solution, Ironfan and Homebase are the two components that abstract away hardware and software deployment, configuration, and management. Ironfan is used to deploy the entire system into production. Homebase is used by application developers to create their end-to-end data flows and applications locally on their laptops or desktops before deploying them to QA, staging, and/or production.

All in all, Infochimps has taken a very innovative approach to enabling application developers with Big Data 2.0 technologies in a way that is not only comprehensive but also fast, simple, extensible, and safe.

Our vision for Infochimps leverages the power of Big Data, Cloud Computing, Open Source, and Platform as a Service – all extremely disruptive technology forces. We’re excited to be helping our customers address their mission critical questions, with high impact answers. And I personally look forward to executing on our vision to provide the simplest yet most powerful cloud-based and completely managed big data service for our enterprise customers.


Information. Insight. Instantly: Check Out The Latest Version Of Our Big Data Platform!

A 1978 ad slogan from Scrubbing Bubbles stated, “We work hard so you don’t have to,” essentially promising customers that the product would take care of the dirty work and let them reap the benefit of a clean, finished result.

The same holds true today here at Infochimps. Our mission is to do the heavy lifting and seamlessly handle your big data implementations, removing the requirement for expensive integration or specialists and allowing you to focus on generating insights from data, not managing Big Data infrastructure. We provide the insights you need to make data-driven decisions, speed your application development, and, ultimately, improve your operational efficiency and time to market.

We are proud to announce today the latest version of our Big Data Platform, a managed, fully optimized and hosted service for deploying Big Data environments and apps in the cloud, on-demand.

Key new features include:

New Data Delivery Service

  • Based on the open source Apache Flume project
  • Integrates with your existing data environment and data sources with a combination of out-of-the-box and custom connectors
  • High scalability and optimization of distributed ETL (Extract, Transform, Load) processes, able to handle many terabytes of data per day
  • Both distributed real-time analysis and distributed Hadoop batch analysis

Real-time, Data Streaming Framework

  • You can use familiar programming languages such as Ruby to vastly simplify performing real-time analytics and stream processing, all within the Infochimps Data Delivery Service
  • Extends Infochimps’ Wukong open source project, which lets developers use Ruby micro-scripts to perform Hadoop batch processing
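The micro-script idea carries over to any language Hadoop Streaming can run over stdin/stdout. Here is the classic word count as two tiny functions plus a simulated shuffle, sketched in Python rather than Wukong's Ruby:

```python
# Hadoop Streaming runs any executable over stdin/stdout as a mapper or
# reducer. Here is the classic word count as two tiny functions plus a
# simulated shuffle, in Python rather than Wukong's Ruby.

from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map: emit (word, 1) for every word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce: sum the counts for each word (pairs arrive grouped by key)."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["the quick brown fox", "the lazy dog"]
shuffled = sorted(mapper(lines))  # stands in for Hadoop's shuffle phase
counts = dict(reducer(shuffled))
print(counts["the"])  # 2
```

On a real cluster, Hadoop performs the sort-and-group step across machines; the mapper and reducer scripts stay exactly this small.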

We’ve bundled together everything you need to install the platform, making it faster than ever to get a big data project off the ground — a configured solution can be deployed in just a few hours.

The Infochimps Platform is capable of executing on hundreds of data sources and many terabytes of data throughput, delivering scalability to any type and quantity of database, file system or analytic system.

Check out our other new features in today’s press release.

Also, read GigaOm’s take on our news here.


Homepage Redesign – Check It Out!



We gave the homepage a little facelift this week to better showcase all the awesome things we are working on now. After hearing feedback from many prospective and current customers, the revamp includes a clearer articulation of the benefits and process of working with the Infochimps Platform, features our Platform tour, and spotlights our free demo.

Take a gander and let us know what you think!

Hadoop in the Cloud – Infochimps and VMware


Infochimps is proud to be a part of a new effort launched today by VMware to enable big data applications running on Hadoop to be deployed more easily on top of virtual and cloud-based IT environments. The Serengeti project, released today under the Apache 2.0 license, is built upon a number of open source technologies including our own Ironfan tool and supports all major Hadoop distributions including Cloudera, Greenplum, Hortonworks, and MapR.

Ironfan is the foundation of the Infochimps Platform and the basis of our customers’ Big Data deployments. It makes provisioning and configuring Big Data infrastructure simple – you can easily spin up clusters when you need them and kill them when you don’t, so our customers can spend their time, money, and engineering focus on finding insights, not configuring and deploying machines. Ironfan is quickly becoming the number one deployment tool for Hadoop platforms in the cloud, and this endorsement by VMware and inclusion in Serengeti is further evidence of the popularity of the tool.

What does Serengeti mean for Infochimps users?
From the beginning, the Infochimps Platform has been built on a foundation of open source tools for managing data that simplify the experience of working with complex technologies such as Hadoop. Within the Infochimps Platform, Ironfan, as well as other tools like Wukong and Swineherd, are major open sourced components of the stack. And with our enterprise tools including Data Delivery Service and Dashpot, customers can deploy complete Big Data environments and be assured of highly reliable delivery of data to their Hadoop environments.

The Serengeti project supports our open source tradition with its strong open source foundation and support by all of the major Hadoop distributions. Within the Serengeti project, Ironfan enables users to quickly and easily configure and deploy Hadoop clusters on top of VMware vSphere® in minutes with a single command. Now, users running VMware’s virtual and cloud infrastructure can more easily take advantage of the power of Hadoop as well as other Big Data technologies like the Infochimps Data Delivery Service, Dashpot, and Infochimps big data expertise to manage, process, and analyze massive amounts of unstructured, semi-structured, or structured data at scale and in the cloud.

We’re excited to be included in Serengeti and look forward to working with VMware customers and partners as they further their use of Big Data technologies.

Interested in learning more about Infochimps, VMware, and Serengeti? Contact us today for more information!

Announcing Support for OpenStack and the Rackspace Cloud

Infochimps is happy to announce that we now support the next generation Rackspace Cloud, based on OpenStack. Through integration with the OpenStack API the Infochimps Platform can now power big data applications based in the Rackspace Cloud, expanding the reach of the Infochimps Platform and making the running of complex big data infrastructures quick and easy for a broader range of users.

Rackspace customers running the new OpenStack-based Rackspace Cloud Servers can quickly and easily spin up Hadoop clusters to power their big data applications in as little as 20 minutes with a single command using the Infochimps Platform. With the power of Ironfan, Infochimps’ open source provisioning tool, and Dashpot, Infochimps’ visualization and operations dashboard, customers can easily monitor and manage their Big Data operations on an ongoing basis, or leave it to Infochimps to manage it on the Rackspace Cloud for them.

Check out this demo of the Infochimps Platform running in the Rackspace Cloud:

Why OpenStack and Rackspace?
From the beginning, the Infochimps Platform has been built on a foundation of open source tools for managing data, aimed at simplifying the experience of working with complex technologies such as Hadoop or Cassandra. Within the Infochimps Platform, Wukong, Ironfan, and Swineherd are major open-sourced components of the stack. OpenStack supports our open source tradition with its strong open source ecosystem. It is used and contributed to not only by Rackspace but also by organizations such as NASA, Canonical, Red Hat, Dell, HP, and AT&T, so its architecture serves a multitude of needs rather than bending to the whims of a single provider.

OpenStack also encourages standardization among Infrastructure as a Service providers, which ultimately benefits everyone in the market. Clients can make (and remake) decisions based on their businesses’ current day-to-day needs, without needing a crystal ball to predict which provider will be best for them in the long term. By sharing open, standard interfaces, cloud providers can compete on current quality and value, instead of fighting to lock in customers based on promises.

The modular design of OpenStack is part of what makes standards possible without blocking innovation. There are a set of core APIs that every provider will support, and extensions for added capabilities that not every provider will want to allow. The contracts these APIs provide can be (and often are) fulfilled by different back-end providers, letting each provider make different architectural choices without requiring customers to completely retool to take advantage of them. All of this allows apples-to-apples comparison of provider architectures, without making orange sales impossible.
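The core-plus-extensions pattern can be sketched in a few lines. The class and method names below are ours for illustration, not OpenStack's actual API: every provider implements the core interface, while extensions are optional and discoverable.

```python
# Sketch of the core-plus-extensions pattern; the class and method names are
# ours for illustration, not OpenStack's actual API.

class CoreCompute:
    """Core API every provider must implement."""
    def launch_server(self, name):
        raise NotImplementedError

class ProviderA(CoreCompute):
    def launch_server(self, name):
        return f"A:{name}"

class ProviderB(CoreCompute):
    def launch_server(self, name):
        return f"B:{name}"

    def live_migrate(self, name, host):
        """An extension that not every provider offers."""
        return f"B:{name}->{host}"

def migrate_if_supported(provider, name, host):
    """Callers probe for extensions instead of assuming them."""
    if hasattr(provider, "live_migrate"):
        return provider.live_migrate(name, host)
    return None

print(ProviderA().launch_server("web1"))                # A:web1
print(migrate_if_supported(ProviderA(), "web1", "h2"))  # None
print(migrate_if_supported(ProviderB(), "web1", "h2"))  # B:web1->h2
```

Code written against the core interface runs unchanged on either back-end; only callers that probe for the extension gain the extra capability.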

What does OpenStack mean for Infochimps?
The work we’ve done to support this announcement has enabled us to provide a level of abstraction from the Amazon Web Services environment, and we can deploy our platform in a cloud agnostic way. Many of our customers have asked for implementations on their in-house cloud environments – our OpenStack support allows those implementations to be airlifted in using a common set of APIs that sit on top of whatever infrastructure already exists, instead of one-off installations that require more custom development and introduce brittleness.

Interested in learning more about Infochimps, Rackspace, and OpenStack? Contact us today for more information!