Announcements

[New Whitepaper] Real-Time Data Aggregation

Fast response times generate cost savings and greater revenue. Enterprise data architectures are incomplete unless they can ingest, analyze, and react to data in real time as it is generated. Once inaccessible or too complex for most organizations, scalable and affordable real-time solutions are now available to any enterprise.


Read Infochimps’ newest whitepaper to learn about Infochimps Cloud::Streams, a proprietary stream processing framework based on four years of experience with sourcing and analyzing both bulk and in-motion data sources. It offers a linearly scalable and fault-tolerant stream processing engine that leverages a number of well-proven web-scale solutions built by Twitter and LinkedIn engineers, with an emphasis on enterprise-class scalability, robustness, and ease of use.

In this whitepaper, you’ll learn:

  • Definitions & History – batch processing, stream processing
  • Comparison of Stream vs. Batch for Selected Use Cases – includes industry use case: aviation
  • Why Cloud::Streams is the leading stream processing framework




Infochimps Recognized in Inaugural Big Data 100 List

Infochimps is proud to be named to UBM Tech Channel’s CRN 2013 Big Data 100 list, developed by the CRN editorial team to include “vendors that have demonstrated an ability to innovate in bringing to market products and services that help businesses manage Big Data.” The list consists of 3 categories: business analytics, data management, and infrastructure and services.

Infochimps was named within the Big Data infrastructure and services category – identified as 1 out of 25 “IT vendors who can do it all, from data storage hardware and software, to management tools, to business analytics.” We are proud to be recognized alongside other innovative companies such as Amazon Web Services, Oracle, and Rackspace.

Thank you, CRN, for understanding the struggle with the increasing volume, speed, and variety of information being generated today, and for identifying Infochimps Enterprise Cloud as a solution to help companies address their Big Data needs.








Image Source: CRN

CIOs & Big Data: What IT Teams Want Their CIOs to Know

It’s no secret that enterprises today face an increasingly competitive and erratic global business environment, and that Big Data is more than just another IT project – it’s truly a finger on the pulse of the business. To say that in 2013 Big Data is “mission critical” is to put it mildly – organizations that ignore the insights that Big Data can deliver are flying blind. So, it is all the more disconcerting that 55% of Big Data projects don’t get completed, and many others fall short of their objectives.

In order to understand the reasons for this, Infochimps partnered with SSWUG.org, one of the largest enterprise technology-focused, community-driven sites and a source for answers to IT-related questions and professional growth for more than 570,000 members. Together we got survey responses from over 300 IT department staffers – 58% of whom have current Big Data projects underway – on what they most wanted their CIOs to know about the process of implementing Big Data projects.

Read the full report here. >>

Key findings are summarized in the following infographic:
[Infographic: CIOs & Big Data: What IT Teams Want Their CIOs to Know]

While the findings reveal many reasons for Big Data project failure, undoubtedly one of the biggest factors is lack of communication between top managers, who provide the overall project vision, and the data scientists and other IT staff charged with actually implementing it. Far too frequently, the opinions of IT staff are treated as an afterthought and considered only when projects veer off course.

Given the stakes, it’s imperative that CIOs have a 360-degree view of all that a Big Data project will involve – not just the various Big Data technologies that are so frequently at the forefront of Big Data discussions.

The insight we gleaned reveals much about both enterprise technology and enterprise culture. In order for companies to succeed with Big Data, executives will need to rethink long-held notions of how diverse departments should function together. In the past “breaking down silos” was a nice mantra. Now, it is imperative. Additionally, CIOs and other enterprise executives may find it necessary to educate their organizations on the advantages of new Big Data applications and processes that will give them better customer insights, make their jobs infinitely easier and give their departments the elasticity needed to meet virtually any business need in real-time.

We hope this report will serve not only as a source of insight, but also be a reminder to seek the invaluable perspective of IT staff as early as possible in the process of developing new, technology-intensive projects.

Read the press release here. >>

 

A Sneak Preview: Big Data for Chimps, The Book

  • Amanda McGuckin Hager

I’ve been reading Flip’s book, Big Data for Chimps: A Guide to Massive Scale Data Processing, available for pre-order now from O’Reilly. While I’m no data engineer, I am able to follow along. After reading a bit, it comes as no surprise that Flip helped to found Infochimps with the philosophy of making the world’s knowledge accessible to anyone. The content is unexpected and engaging. Take, for example, the story of Chimpanzee and Elephant Start a Business, from The Stream Chapter:

Chimpanzee and Elephant Start a Business

As you know, chimpanzees love nothing more than sitting at typewriters processing and generating text. Elephants have a prodigious ability to store and recall information, and will carry huge amounts of cargo with great determination. The chimpanzees and the elephants realized there was a real business opportunity from combining their strengths, and so they formed the Chimpanzee and Elephant Data Shipping Corporation. They were soon hired by a publishing firm to translate the works of Shakespeare into every language. In the system they set up, each chimpanzee sits at a typewriter doing exactly one thing well: read a set of passages, and type out the corresponding text in a new language. Each elephant has a pile of books, which she breaks up into “blocks” (a consecutive bundle of pages, tied up with string).

Read the full chapter (available here: The Stream Chapter) to understand how this example, pig latin, simple streamers, and running Hadoop jobs all relate to one another. You’ll also get two exercises and a Ruby helper section containing tips and tricks.
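For readers who like to see the idea in code, here is a minimal sketch of the setup the story describes, written in Python rather than the Ruby used in the book. It is an illustration under assumptions, not code from the chapter: the names `blocks`, `chimp`, and `pig_latin_word`, the block size, and the simplified pig latin rules are all hypothetical.

```python
import re
from typing import Iterator, List

VOWELS = "aeiou"


def pig_latin_word(word: str) -> str:
    """Translate one alphabetic word using simplified, assumed rules."""
    lower = word.lower()
    if lower[0] in VOWELS:
        # Words starting with a vowel just get "way" appended.
        return lower + "way"
    # Otherwise move the leading consonant cluster to the end and add "ay".
    for i, ch in enumerate(lower):
        if ch in VOWELS:
            return lower[i:] + lower[:i] + "ay"
    return lower + "ay"  # word with no vowels at all


def translate_line(line: str) -> str:
    """Translate every run of letters in a line, leaving punctuation alone."""
    return re.sub(r"[A-Za-z]+", lambda m: pig_latin_word(m.group(0)), line)


def blocks(lines: List[str], size: int) -> Iterator[List[str]]:
    """The elephant's job: bundle consecutive lines into fixed-size blocks."""
    for start in range(0, len(lines), size):
        yield lines[start:start + size]


def chimp(block: List[str]) -> List[str]:
    """The chimpanzee's job: read one block and type out its translation."""
    return [translate_line(line) for line in block]


def translate_book(lines: List[str], block_size: int = 2) -> List[str]:
    """Hand each block to a chimp and collect the translated lines in order."""
    translated: List[str] = []
    for block in blocks(lines, block_size):
        translated.extend(chimp(block))
    return translated
```

Because each block is translated without looking at any other block, the blocks could just as well be handed to separate workers, which is the property the story is building toward.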

Amanda McGuckin Hager is a high-tech marketing professional with over 17 years of experience focused on driving demand through strategic marketing programs and is the Director of Marketing at Infochimps. Follow Amanda on Twitter.







Infochimps CTO Named Top 100 Contributors to GitHub 2012

Flip Kromer, Infochimps Founder and CTO, also known as MrFlip, was named by GitHub as one of the Top 100 Contributors in 2012. Flip made over 2,300 contributions to the global, open source developer community.

And he’s in good company. Also on the list are: Linus Torvalds of Linux, Erik Michaels-Ober, and Dr. Nic Williams.

In addition to being a prolific code contributor and one of the nation’s leading data scientists, Flip is the author of Big Data for Chimps, A Guide to Massive Scale Data Processing, published by O’Reilly, and available for pre-order now.

About GitHub: GitHub, a Forbes Top Tech Company of 2012 and the largest code host in the world, was founded in 2008 and is leading enterprises to adopt open source technology. GitHub, known for social coding, was founded as a place for developers to code together, as teams and individuals.

About Infochimps: The Infochimps Platform for Big Data combines leading data technologies with managed cloud services and a strong partner network to empower customers with unprecedented speed, scale and flexibility in their Big Data initiatives. Infochimps is a privately held, venture-backed company with offices in Austin, TX and Silicon Valley. Follow @infochimps on Twitter.







A Sneak Peek: Big Data for Chimps

  • Amanda McGuckin Hager

You may know leading data scientist Flip Kromer, Infochimps co-founder and CTO. If you don’t, you soon will. O’Reilly is publishing his book, “Big Data for Chimps, A Guide to Massive Scale Data Processing in Practice,” available for pre-order now. “Big Data for Chimps” is poised to bring an educational spin to the big data space that is unlike anything you may have read before. While beginners stand to gain quite a bit from reading the book, it also appeals to those experienced in modern programming techniques. Flip’s approach to technology builds the foundation throughout, using data at massive scale in very practical ways. That is, big data is about the data, and gaining value from it, not about the technologies.

“Big Data for Chimps” will help you:

  • Discover how to think at scale by understanding how data must flow through the cluster to effect transformations
  • Identify the Big Data Infrastructure tuning knobs that matter
  • Learn the Big Data rules-of-thumb
  • Apply Hadoop to interesting problems through detailed example programs
  • Gain advice and best practices for efficient software development

You will be captivated and engaged through Flip’s creative use of examples, analogies and stories. When you read the book, you’ll experience: “Where is BBQ?,” “Pig Latin Translator,” “Patterns in UFO Sightings,” “Elephant and Chimpanzee Start a Business,” and more.

The chapter “Elephant and Chimpanzee Save Christmas” especially deserves your attention. Watch for it in January, as we’ll be releasing some chapters leading up to the publication.

Happy Holidays from all of us at Infochimps.  As our gift to you, here’s a little sneak peek at what’s in store, called “The Hadoop Haiku:”

data flutters by
elephants make sturdy piles
insight shuffles forth








 

Announcing Infochimps Enterprise Cloud


Big Data is confusing to most executives. It’s this nebulous concept of applying technologies from Yahoo!, Facebook, LinkedIn, and Twitter in such a way that the organization will truly become data-driven and, equally important, be able to do so quickly. Unfortunately, only a few companies are realizing its full potential.

That’s why Infochimps is announcing its Enterprise Cloud – a Big Data cloud service built specifically for Fortune 1000 enterprises that want to rapidly explore how big data technology can unlock revenue from their data. The Infochimps Enterprise Cloud addresses several challenges holding back executives from quickly gaining value from this disruptive technology.

Enterprises are only leveraging 15% of their data assets

Enterprises, on average, capture and analyze about 15% of their data assets. Typical data sources include transactional data (who bought what). However, a 360-degree view of the business requires a 360-degree view of the customer, as well as manufacturing, supply chain, finance, sales, marketing, engineering, etc. Only by capturing 100% of the enterprise’s operational data and then supplementing it with external data (e.g. we’re talking to one pharmaceutical company about using claims data from 100+ health plans covering more than 70 million people), will you achieve maximum value from your data analytics. With the Infochimps Enterprise Cloud, you can not only combine 100% of your private data in a private cloud, but you can also supplement that data with another 100%+ of external data.

Time-To-Market constrained by infrastructure deployments

The deployment of, and value creation from, new disruptive big data technologies (Hadoop, NoSQL, in-stream processing) still takes a considerable amount of time and of human and financial resources. Typical Enterprise Data Warehouse projects take 18-24 months to deploy. Simple changes to star-schema data models take 6 months minimum to be made available to internal development organizations. Hadoop projects, although less complicated than EDW, take about 12 months to deploy. With the Infochimps Enterprise Cloud, you can deploy value in 30 days.

Big Data talent hard to find

When I read articles about the gap between supply and demand for big data talent, I think to myself, “this is not a situation where analysts are collecting a sample of 10 companies and then generalizing it to the entire market.” It’s a real problem. If you are some “antiquated” Fortune 1000 company (you know who you are) looking to hire crazy smart engineers and data scientists from Facebook…well, sorry…you don’t have the corporate culture or the exciting environment that this talent enjoys. McKinsey forecasts that the gap between the demand for and supply of this talent is only going to get worse (60% gap by 2018). With the Infochimps Enterprise Cloud, you can leverage your existing talent. We do this by providing a simple but powerful abstraction between your application development team and the complex big data infrastructure.

One Big Data technology does not fit all

There are literally hundreds of DBMS / data store solutions today, each offering different advantages depending on data type and use case. As a result, business users and application developers get lost in the nuances associated with data infrastructure, and lose focus on the business needs. Don’t listen to a single data store vendor tell you that they can address all your business needs. You need several. With the Infochimps Enterprise Cloud, we force you to start with the business problem first, then we draw from a very comprehensive data services layer which addresses the needs of the business problem. Guess what? It’s not just Hadoop.

Infrastructure and data integration is the most challenging

Integrating existing data infrastructure with new big data infrastructure, and then adding external data sources on top, makes integration a completely new problem. This is not a matter of simply upgrading your ETL tools. With the Infochimps Enterprise Cloud, we help you understand the “new ETL” used by our web-scale friends.

Open source is cheap, but not easily commercialized

Silicon Valley alone has created over 250,000 open source projects. Disruption is obviously occurring within the open source community. However, enterprises are not in a position to properly deploy these projects, even with the many commercialization vendors. How does a company integrate several open-source solutions into one? With the Infochimps Enterprise Cloud, we support an end-to-end big data service, which consists of many commercial open source projects combined to offer real-time stream processing, ad-hoc analytics, and batch analytics as one integrated data service.

Data security + data volume both dictate deployment options

Only non-sensitive, publicly available data sets (e.g. Twitter) are using elastic public cloud infrastructure. Compliance/governance issues still require that data-sensitive analytics occur “behind the firewall”. Also, if you are an established enterprise with large volumes of data, you are not going to “upload” to the cloud for your analytics. With the Infochimps Enterprise Cloud, we provide public, virtual private, private, or hybrid big data cloud services that address the needs of big businesses with big problems.

Today, I’m pleased to announce the Infochimps Enterprise Cloud, our big data cloud running on a network of big data-focused data centers and being deployed by leading big data system integrators.

These are exciting times, indeed. Read the full press release here >>.








 

Infochimps Plugged In To Gnip


Infochimps is pleased to announce our partnership with Gnip, one of the world’s largest and most trusted providers of social data, as part of their Plugged In To Gnip program.

Today, Gnip announced their new program that “enables Plugged In To Gnip partners to transparently showcase their access to the most reliable, comprehensive and sustainable social data in the world, creating the best possible experience for their customers”.  Read the full press release here >>

Why is Infochimps Plugged In To Gnip?
Getting a handle on the immense volume of social network data that Gnip provides often requires a sophisticated data infrastructure for the processing and control of feeds. As a partner in providing solutions to customers needing to extract insight from this treasure trove of data, Infochimps can help by setting up customers with a best-in-class data platform for refining and working with Gnip’s feeds.

Gnip powers social analytics solutions for some of the world’s largest Business Intelligence and Social Media Monitoring firms. They are a certified Twitter partner and exclusive provider of commercial access to public data from Tumblr, WordPress, StockTwits and Disqus.

To learn more about Plugged In To Gnip, visit gnip.com/plugged_in.







Announcing Ironfan v4: Multicloud Capabilities + Community Support

Ironfan, the groundwork of the Infochimps Platform, is a systems provisioning, deployment, and updating tool that is built from a combination of proprietary technologies and open-source technologies like Chef and Fog.

After several proofs of concept and forks, hampered by the lack of underlying abstractions, we are happy to announce true multicloud capabilities for Ironfan. With these capabilities, the current version has a feature set largely similar to core Ironfan v3 (i.e. EC2 only). The current version is also ready for new providers, and VMware is working on bringing Serengeti, their fork of Ironfan, up to date with the latest code. This latest version has been undergoing heavy development and testing, including increasing third-party contributions, and we have an increased internal focus on expanding and hardening both the cookbooks and the knife plugin.

Interested in our growing community? Please join our new ironfan-users@infochimps.com mailing list, managed by Nick Marden from GetSatisfaction. We’d love to hear your feedback!


Next Gen Real-time Streaming with Storm-Kafka Integration

At Infochimps, we are committed to embracing cutting-edge technology while ensuring that the latest Big Data innovations are enterprise-ready. Today, we are proud to deliver on that promise once again by announcing the integration of Storm and Kafka into the Cloud::Streams component of the Infochimps Cloud.


Cloud::Streams provides solutions for challenges involving:

  • Large-scale data collection - clickstream web data, social media and online monitoring, financial market data, machine-to-machine data, sensors, business transactions, listening to or polling application APIs and databases, etc.
  • Real-time stream processing - real-time alerting, tagging and filtering, real-time applications, fast analytical processing like fraud detection or sentiment analysis, data cleansing and transformation, real-time queries, distribution to multiple clients, etc.
  • Analytics system ETL - providing normalized/de-normalized data using customer-defined business logic for various analytics data stores and file systems including Hadoop HDFS, HBase, Elasticsearch, Cassandra, MongoDB, PostgreSQL, MySQL, etc.

Storm and Kafka

Recently in my guest blog post on TechCrunch, I mentioned why you should care about Storm and Kafka.

“With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message gets processed in real-time, reliably. In tandem, Storm and Kafka can handle data velocities of tens of thousands of messages every second.”

Ultimately, Storm and Kafka form the best enterprise-grade real-time ETL and streaming analytics solution on the market today. Our goal is to put the same technology that Twitter uses to process over 400 million tweets per day in your hands. Other companies that have adopted Storm in production include Groupon, Alibaba, The Weather Channel, and FullContact.

Nathan Marz, Storm creator and senior Twitter engineer, comments on Storm’s rapid growth:

“Storm has gained an enormous amount of traction in the past year due to its simplicity, robustness, and high performance. Storm’s tight integration with the queuing and database technologies that companies already use have made it easy to adopt for their stream computing needs.”

Storm solves a broad set of use cases, including “processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more.”

Apache Kafka, which was developed by LinkedIn to power its activity streams, provides an additional reliability guarantee, robust message queueing, and distributed publish-subscribe capabilities.

Cloud::Streams

Cloud::Streams is fault-tolerant and linearly scalable, and performs enterprise data collection, transport, and complex in-stream processing. In much the same way that Hadoop provides batch ETL and large-scale batch analytical processing, Cloud::Streams provides real-time ETL and large-scale real-time analytical processing, making it the perfect complement to Hadoop (or, in some cases, what you need instead of Hadoop).

Cloud::Streams adds important enterprise-class enhancements to Storm and Kafka, including:

  • Integration Connectors to your existing tech environment for collecting required data from a huge variety of data sources in a way that is robust yet as non-invasive as possible
  • Optimizations for highly scalable, reliable data import and distributed ETL (extract, transform, load), fulfilling data transport needs
  • Developer Toolkit for rapid development of decorators, which perform the real-time stream processing
  • Guaranteed delivery framework and data failover snapshots to send processed data to analytics systems, databases, file systems, and applications with extreme reliability
  • Rapid solution development and deployment, along with our expert Big Data methodology and best practices

Infochimps has extensive experience implementing Cloud::Streams, both for clients and for our internal data flows including large-scale clickstream web data flows, massive Twitter scrapes, the Foursquare firehose, customer purchase data, product pricing data, and more.

Obviously, data failover and optimizations are key to enterprise readiness. Above and beyond that though, Cloud::Streams is a joy to work with because of its flexible Integration Connectors and the Developer Toolkit. No matter where your data is, you can access and ingest it with a variety of input methods. No matter what kind of work you need to perform (parse, transform, augment, split, fork, merge, analyze/process, …) you can quickly develop that processor unit, test it, and deploy it as a Cloud::Streams decorator.
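To make the idea of small, composable processing units concrete, here is a minimal sketch in Python. It does not use the Cloud::Streams Developer Toolkit or its API; the stage names (`parse`, `augment`, `keep_long`), the `run_flow` helper, and the JSON record format are assumptions chosen purely for illustration.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]


def parse(lines: Iterable[str]) -> Iterator[Record]:
    """Parse stage: turn raw JSON lines into records, skipping bad input."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop lines that are not valid JSON
        if isinstance(record, dict):
            yield record


def augment(records: Iterable[Record]) -> Iterator[Record]:
    """Augment stage: add a derived field without mutating the input record."""
    for record in records:
        text = str(record.get("text", ""))
        yield {**record, "word_count": len(text.split())}


def keep_long(records: Iterable[Record], min_words: int = 3) -> Iterator[Record]:
    """Filter stage: pass along only records with enough words."""
    for record in records:
        if record["word_count"] >= min_words:
            yield record


def run_flow(source: Iterable, *stages: Callable[[Iterable], Iterator]) -> Iterator:
    """Chain stages so each record streams through them one at a time."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


if __name__ == "__main__":
    raw = ['{"text": "storm and kafka together"}', "not json", '{"text": "hi"}']
    for record in run_flow(raw, parse, augment, keep_long):
        print(record)
```

Each stage here is a plain generator, so it can be tested on a short list of sample records before being wired into a longer flow, which is the develop, test, and deploy loop the paragraph above describes.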

One of our most recent customers was able to build an entire production application flow for large-scale social media data analysis using the Infochimps Cloud development framework in just 30 days with only 3 developers. That is both unheard of from an enterprise timeline perspective and an amazing case of business ROI. Big Data is too important to spend months and months developing. Your business needs results now, and the Infochimps Cloud leverages the talent you have today for fast project success.

How much is it worth to you to launch your own revenue generating applications for your customers? Or for your internal stakeholders as part of a Big Data business intelligence initiative? How much value would launching 12 months sooner provide your organization? These are questions whose answers we’re trying to make obvious.

Steve Blackmon, Director of Data Sciences at W2O Group, explains why they are working with Infochimps and Cloud::Streams:

“Storm and Kafka are excellent platforms for scalable real-time data processing. We are very pleased that Infochimps has embraced Storm and Kafka for Cloud::Streams. This new offering gives us the opportunity to supplement our listening and analytics products with Infochimps’ data sources, to integrate capabilities seamlessly with our partners who also use Storm, and to retain Infochimps’ unique technical team to support and optimize our data pipelines.”

More Information

Check out the full press release here, including quotes from CEO Jim Kaskade and co-founder and CTO Flip Kromer.

You can access additional resources from the Cloud::Streams web page or our general resources directory.

Lastly, check out our previous product announcements! In February, we launched the Infochimps Platform. In April we launched Dashpot as well as our support of OpenStack. In August, we announced the Platform’s newest release.
