A Chimpy Movember

This November, the chimps participated in Movember, the moustache-growing charity event held each November that raises awareness and funds for men’s health. There was an objective, rules, and winners – the makings of a friendly competition that builds company culture for a worthy cause.

The Objective: To raise money for men’s health through growing facial hair, asking friends and family for donations, and by joining together with other chimps in camaraderie and some good-hearted revelry.

The Rules:
1. You do not have to begin the month with a clean shaven face.
2. You must maintain a moustache continuously from the 15th to the end of the month.
3. You must end the month with a moustache, but not necessarily the same moustache you started with.
4. A moustache is not a beard. For example: There is no joining of the moustache to the sideburns.
5. A moustache is not a goatee. There is no joining of the handlebars to the chin.
6. Other facial hair is permitted.
7. Category winners are determined on November 30 by consensus vote of the Mo Sistas.
8. Chimpiest Mo is determined on whatever criteria the Mo Sistas agree to on November 30.
9. Each Mo shall conduct themselves as true country gentlemen; each Mo Sista shall conduct themselves as true city ladies.

Category Winners:
- The Chimpiest Mo – Travis Dempsey
- The Lamest Mo (for the follically challenged) – Joe Kelly
- The Most Styled Mo – Flip Kromer
(Moustache Memorabilia was awarded to the winners.)


(From Left to Right: Mo Sistas, Winning Mos, Miami Vice Mos)

Go Infochimps! We are proud to support Movember, raising awareness and funds for men’s health.

Just because it’s December doesn’t mean you can’t support men’s health all year round. See the official Movember merchandise page for everything from posters to shoes, and like Movember USA on Facebook.

IE. Invites: Hadoop Innovation Summit

Start preparing for your trip to San Diego for the largest gathering of Fortune 500 business executives leading Hadoop initiatives.

Hadoop Innovation Summit:  February 20-21, 2013, San Diego
Unlocking the Value of Big Data

The Hadoop Innovation Summit brings together business leaders and innovators for an event acclaimed for its interactive format, combining keynote presentations, interactive breakout sessions and open discussion. There will also be plenty of hands-on demonstrations, allowing you to get the most from industry leaders and vendors alike.

Hadoop Innovation will help your business understand how to unlock the value of Big Data with the realization that no data is too big. With a vast amount of data now available, modern businesses are faced with the challenge of making use of that data and unlocking its true value.

Register by December 23, 2012 and save up to $700 on standard pass prices.

Register Online Today >>








[Infographic] Taming Big Data from Wikibon

Opening with a Big Data market forecast and ending with a shout-out for all industries to embrace Big Data as the definitive source of competitive advantage, the following infographic from Wikibon personifies Big Data as a beast (data volumes are growing exponentially) that can be tamed (thanks to new approaches for processing, storing and analyzing). It includes real-world Big Data use cases, which I appreciated. I was most amazed by how “decoding the human genome used to take ten years, but can now be done in 7 days.”

The quote from Kevin Weil, the Director of Product for Revenue at Twitter, brings the benefit of valuable Big Data insights home: “It’s no longer hard to find the answer to a given question; the hard part is finding the right question and as questions evolve, we gain better insight into our ecosystem and our business.”

Scroll down, geek out on the infographic, and if you want more, check out an oldie but goodie article:  6 Illuminating Big Data Infographics

[Infographic: Taming Big Data, from Wikibon]

Did you notice the chimp within the Big Data forecast?

Thank you Wikibon for posting this!








Successful Planning: Business Analytics Innovation Summit

Our partners at *IE. would like to introduce you to the exclusive Business Analytics Innovation Summit, January 30 – 31 in Las Vegas.

This Summit brings the leaders and innovators from the industry together for an event acclaimed for its insight into business intelligence and analytics.

Effective business analytics is central to business success. In the modern business environment, technological developments and the advances of globalization have created unparalleled opportunities for businesses to expand their markets.

Register Now to Start Successful Planning Through Advanced Analytics








Breaking Hadoop out to the Larger Market

There are a lot of people out there with a Terabyte problem but not a Petabyte problem, yet they are forced to try to make use of a stack developed to address Facebook’s, Yahoo’s and JP Morgan’s Petabyte problems. Hadoop out of the box is oriented toward achieving 100% utilization of fixed-size clusters by analytics teams of 12, 50, or 100+ people. In contrast, the bulk of even forward-thinking enterprises are at the level of having just handed two PhD statisticians a copy of the elephant book, a mis-provisioned cluster, and a slap on the back with a directive to “go find us som’a that insight!”

There are a few observations we’ve made about these other customers and their differentiated needs that I wanted to share, and point to how we seek to address these with our own product.

Our first major observation is that while Hadoop might headline the bill, streaming data delivery is the opening act that moves the most merchandise.  Most of our customers on initial contact mention Hadoop by name — yet universally the first-delivered and most necessary component has been streaming data delivery into a scalable database and/or Hadoop.

In fact, we’ve had clients who excitedly purchased and set up a Hadoop cluster and had plenty of data they’d like to analyze, but had no data in their Hadoop cluster. It may seem obvious once pointed out that you need a way to feed data into your cluster. Enter modern open source tools such as Flume and Storm. Indeed, Flume was originally created to feed hungry Hadoop clusters with streaming log data.

What people are now realizing though is just how powerful streaming data delivery tools like these are — that you can realize a surprising amount of analytical power (and even visibility of data as well) while the data is still in flight. These realizations have driven the accelerated adoption of many of these open source streaming technologies, like Esper, Flume, and Storm. I’ve been using Hadoop since ’08, and the adoption demand of Storm outpaces even Hadoop’s ascent.

Another important feature set we evangelize and see validated is what an underlying cloud infrastructure enables for the enterprise.  Cloud-enabled elasticity makes exploratory analytics transformatively more powerful, as companies can scale their infrastructure up and down as needed.

In contrast to the Petabyte-companies, who focus on 100% cluster utilization, the target metric for a development cluster fit for the Terabyte-company is high downtime: the ability to go from 10 to 100 machines, back down to 10, then rest at 0 machines over the course of a job. Hadoop out of the box doesn’t meet this target; making it do so was one of the most interesting engineering challenges we’ve solved.

So where else does the cloud fit in the Hadoop use case? Being able to safely grow, shrink, and stop/restart Hadoop isn’t just a slider UX control, it’s a fundamental change in developer mindset and capabilities. For example, when we were a 6-person team with an AWS bill that rivaled our payroll, we would run parse stages of jobs on high CPU instances, then slam it shut mid-workflow and bring the cluster up on high memory instances for the graph-heavy stages. As our platform matured, we moved to giving each developer their own cluster; too often Chimp A needed 30 machines for 2 hours, while Chimp B needed 6 machines all day. Most companies would have to compromise with a 30-machine cluster running all day – we’ve been able to reject that approach.
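As a rough illustration of the trade-off above, the comparison can be expressed in machine-hours. This sketch assumes that "all day" means 24 hours and that cost scales linearly with machine-hours; both are simplifying assumptions, not figures from our billing.

```python
# Back-of-the-envelope comparison of per-developer clusters vs. one shared cluster.
# Assumptions (not from the post): "all day" = 24 hours, cost ~ machine-hours.
HOURS_PER_DAY = 24


def machine_hours(machines, hours):
    """Machine-hours consumed by a cluster of a given size and lifetime."""
    return machines * hours


chimp_a = machine_hours(30, 2)              # burst job: 30 machines for 2 hours
chimp_b = machine_hours(6, HOURS_PER_DAY)   # steady job: 6 machines all day
per_developer_total = chimp_a + chimp_b

# The compromise: one cluster sized for the largest job, left running all day.
shared_cluster = machine_hours(30, HOURS_PER_DAY)

print(f"per-developer clusters: {per_developer_total} machine-hours")
print(f"shared 30-machine cluster: {shared_cluster} machine-hours")
```

Under these assumptions the two right-sized clusters use 204 machine-hours against 720 for the shared cluster, which is why being able to start and stop clusters freely matters so much at this scale.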

Tuning a Hadoop job to your cluster is fiendishly difficult and time consuming, while tuning your cluster to the job is comparatively straightforward. Data Scientists at the Terabyte-company shouldn’t be pinned down by the difficulties of working with technologies that weren’t designed for them. By enabling Hadoop in an elastic context (public or private cloud, internal or outsourced), Infochimps and others working on these challenges are a big part of breaking it out to the larger market.








Live Webcast: Top Strategies for Successful Big Data Projects

Title: Top Strategies for Successful Big Data Projects
Date: Thursday, November 29, 2012
Time: 10a Pacific/12p Central/1p Eastern


44% of Big Data projects don’t get fully deployed, and very few achieve intended business objectives. Common barriers to success include securing executive buy-in, deciding whether to build or leverage a third-party solution, determining scope and establishing realistic goals. Here are some tips to ensure your Big Data projects not only get off the ground and completed, but also quickly and positively impact your business’ bottom line.

Register for this live webinar and listen to Big Data expert and Infochimps Product Manager Tim Gasper share insights and explain how to effectively execute your Big Data project and avoid the most common pitfalls. In addition, you will learn:

  • Common roadblocks to Big Data project momentum
  • Elements of a clear project process and plan
  • Popular big data project objectives
  • Top 3 priorities for Big Data solutions

Join the webcast here. Looking forward to seeing you Thursday, November 29, 2012 @ 10a PT, 12p CT, 1p ET!







Announcing Ironfan v4: Multicloud Capabilities + Community Support

Ironfan, the groundwork of the Infochimps Platform, is a systems provisioning, deployment, and updating tool that is built from a combination of proprietary technologies and open-source technologies like Chef and Fog.

After several proof-of-concepts and forks, hampered by the lack of underlying abstractions, we are happy to announce true multicloud capabilities for Ironfan. These capabilities update the current version to a largely similar feature-set to core Ironfan v3 (i.e. EC2 only). The current version is also ready for new providers, and VMware is working on catching up their fork of Ironfan, Serengeti, to use the latest code. This latest version has been undergoing heavy development and testing, including increasing third-party contributions, and we have an increased internal focus on expanding and hardening both the cookbooks and the knife plugin.

Interested in our growing community? Please join our new ironfan-users@infochimps.com mailing list, managed by Nick Marden from GetSatisfaction. We’d love to hear your feedback!


Next Gen Real-time Streaming with Storm-Kafka Integration

At Infochimps, we are committed to embracing cutting edge technology, while ensuring that the latest Big Data innovations are enterprise-ready. Today, we are proud to deliver on that promise once again by announcing the integration of Storm and Kafka into the Cloud::Streams component of the Infochimps Cloud.


Cloud::Streams provides solutions for challenges involving:

  • Large-scale data collection - clickstream web data, social media and online monitoring, financial market data, machine-to-machine data, sensors, business transactions, listening to or polling application APIs and databases, etc.
  • Real-time stream processing - real-time alerting, tagging and filtering, real-time applications, fast analytical processing like fraud detection or sentiment analysis, data cleansing and transformation, real-time queries, distribution to multiple clients, etc.
  • Analytics system ETL - providing normalized/de-normalized data using customer-defined business logic for various analytics data stores and file systems including Hadoop HDFS, HBase, Elasticsearch, Cassandra, MongoDB, PostgreSQL, MySQL, etc.

Storm and Kafka

Recently in my guest blog post on TechCrunch, I mentioned why you should care about Storm and Kafka.

“With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message gets processed in real-time, reliably. In tandem, Storm and Kafka can handle data velocities of tens of thousands of messages every second.”

Ultimately, Storm and Kafka form the best enterprise-grade real-time ETL and streaming analytics solution on the market today. Our goal is to put the same technology that Twitter uses to process over 400 million tweets per day — in your hands. Other companies that have adopted Storm in production include Groupon, Alibaba, The Weather Channel, FullContact, and many others.

Nathan Marz, Storm creator and senior Twitter engineer, comments on Storm’s rapid growth:

“Storm has gained an enormous amount of traction in the past year due to its simplicity, robustness, and high performance. Storm’s tight integration with the queuing and database technologies that companies already use have made it easy to adopt for their stream computing needs.”

Storm solves a broad set of use cases, including “processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more.”

Apache Kafka, which was developed by LinkedIn to power its activity streams, provides an additional reliability guarantee, robust message queueing, and distributed publish-subscribe capabilities.

Cloud::Streams

Cloud::Streams is fault-tolerant and linearly scalable, and performs enterprise data collection, transport, and complex in-stream processing. In much the same way that Hadoop provides batch ETL and large-scale batch analytical processing, Cloud::Streams provides real-time ETL and large-scale real-time analytical processing — the perfect complement to Hadoop (or in some cases, what you needed instead of Hadoop).

Cloud::Streams adds important enterprise-class enhancements to Storm and Kafka, including:

  • Integration Connectors to your existing tech environment for collecting required data from a huge variety of data sources in a way that is robust yet as non-invasive as possible
  • Optimizations for highly scalable, reliable data import and distributed ETL (extract, transform, load), fulfilling data transport needs
  • Developer Toolkit for rapid development of decorators, which perform the real-time stream processing
  • Guaranteed delivery framework and data failover snapshots to send processed data to analytics systems, databases, file systems, and applications with extreme reliability
  • Rapid solution development and deployment, along with our expert Big Data methodology and best practices

Infochimps has extensive experience implementing Cloud::Streams, both for clients and for our internal data flows including large-scale clickstream web data flows, massive Twitter scrapes, the Foursquare firehose, customer purchase data, product pricing data, and more.

Obviously, data failover and optimizations are key to enterprise readiness. Above and beyond that though, Cloud::Streams is a joy to work with because of its flexible Integration Connectors and the Developer Toolkit. No matter where your data is, you can access and ingest it with a variety of input methods. No matter what kind of work you need to perform (parse, transform, augment, split, fork, merge, analyze/process, …) you can quickly develop that processor unit, test it, and deploy it as a Cloud::Streams decorator.
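To give a feel for what chaining small processor units looks like, here is a minimal sketch using plain Python generators. It is not the Cloud::Streams Developer Toolkit API; the function names (`parse`, `augment`, `keep_long`) and the record shape are hypothetical, chosen only to illustrate the parse / augment / filter pattern described above.

```python
import json


def parse(lines):
    """Parse each raw line as JSON, skipping lines that fail to parse."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def augment(records):
    """Add a derived field (text length) to each record."""
    for record in records:
        yield {**record, "length": len(record.get("text", ""))}


def keep_long(records, min_length=5):
    """Pass through only records whose text is at least min_length characters."""
    for record in records:
        if record["length"] >= min_length:
            yield record


# Hypothetical input: two valid JSON events and one malformed line.
raw = ['{"text": "big data"}', 'not json', '{"text": "hi"}']
results = list(keep_long(augment(parse(raw))))
print(results)
```

Each stage consumes and yields one record at a time, so stages can be tested in isolation and recombined, which is the property that makes small processor units quick to develop.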

One of our most recent customers was able to build an entire production application flow for large-scale social media data analysis using the Infochimps Cloud development framework in just 30 days with only 3 developers. That is both unheard of from an enterprise timeline perspective, as well as an amazing case of business ROI. Big Data is too important to spend months and months developing. Your business needs results now, and the Infochimps Cloud leverages the talent you have today for fast project success.

How much is it worth to you to launch your own revenue generating applications for your customers? Or for your internal stakeholders as part of a Big Data business intelligence initiative? How much value would launching 12 months sooner provide your organization? These are questions whose answers we’re trying to make obvious.

Steve Blackmon, Director of Data Sciences at W2O Group, explains why they are working with Infochimps and Cloud::Streams:

“Storm and Kafka are excellent platforms for scalable real-time data processing. We are very pleased that Infochimps has embraced Storm and Kafka for Cloud::Streams. This new offering gives us the opportunity to supplement our listening and analytics products with Infochimps’ data sources, to integrate capabilities seamlessly with our partners who also use Storm, and to retain Infochimps’ unique technical team to support and optimize our data pipelines.”

More Information

Check out the full press release here, including quotes from CEO Jim Kaskade and co-founder and CTO Flip Kromer.

You can access additional resources from the Cloud::Streams web page or our general resources directory.

Lastly, check out our previous product announcements! In February, we launched the Infochimps Platform. In April we launched Dashpot as well as our support of OpenStack. In August, we announced the Platform’s newest release.




The 3 Waypoints of a Data Exploration

Part of our goal is to unlock the big data stack for exploratory analytics.

How do you know when you’ve found the right questions? That you’ve gone deep enough to trust the answers? Here’s one sign.

The 3 Waypoints of a Data Exploration:

  • What you knew: is it validated by the data?
  • What you suspect: how do your hypotheses agree with reality?
  • What you would never have suspected: something unpredictable in advance?

In Practice:
A while back, a friend asked me about signals in the Twitter stream for things like “Spanglish”, where multiple languages are mixed in the same message. I did a simple exploration of tweets from around the world (simplifying at first to non-English languages) to see how easy such messages are to find.

I took 100 million tweets and looked for only those “non-keyboard” characters — é (e with acute accent) or 猿 (Kanji character meaning ‘ape’) or even ☃ (snowman).

Using all the cases where there were two non-keyboard characters in the same message, I assembled the following graph.
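The pair-counting step can be sketched as follows. This is a minimal reconstruction, not the code used for the original exploration: it assumes "non-keyboard" means any non-ASCII, non-whitespace character, and the helper names and sample messages are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def non_keyboard_chars(message):
    """Distinct non-ASCII, non-whitespace characters in a message (an assumed definition)."""
    return {ch for ch in message if ord(ch) > 127 and not ch.isspace()}


def cooccurrence_counts(messages):
    """Count how often each unordered pair of characters appears in the same message."""
    counts = Counter()
    for message in messages:
        # Sort so that each unordered pair always maps to the same tuple key.
        chars = sorted(non_keyboard_chars(message))
        for pair in combinations(chars, 2):
            counts[pair] += 1
    return counts


# Hypothetical sample messages standing in for the tweet corpus.
sample = ["情報 猿", "猿 ☃", "café", "情報 and 猿 again"]
edges = cooccurrence_counts(sample)

for (a, b), weight in edges.most_common():
    print(a, b, weight)
```

Each `(character, character)` key with its count is a weighted edge; a list of such edges is the kind of input a graph layout tool can consume.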

Imagine tying a little rubber band between every pair of characters, as strong as the number of times they were seen hanging out together; also, give every character the desire for a bit of personal space so they don’t just pile on top of each other. It’s a super-simple model that tools like Cytoscape or Gephi will do out-of-the-box.

That gave this picture (I left out the edges for clarity and hand-arranged the clusters at the bottom):

[Figure: The 3 Waypoints of a Data Exploration]
This “map” of the world — the composition of each island, and the arrangement of the large central archipelago — popped out of this super-simplistic model. It had no information about human languages other than “sometimes, when a person says 情報 they also say 猿.” Any time the data is this dense and connected, I’ve found it speaks for itself.

Now let’s look at the 3 Waypoints.

What We Knew: What I really mean by “knew”  is “if this isn’t the case, I’m going to suspect my methods much more strongly than the results”:

  • Most messages are in a single language, but there are some crossovers. After the fact, I colored each character by its “script” type from the Unicode standard (e.g. Hangul is in cyan). As you can see, most of the clouds have a single color.
  • Languages with large alphabets have tighter-bound clouds, because there are more “pairs” to find (e.g. the Hiragana character cloud is denser than the Arabic cloud).
  • Languages with smaller representation don’t show up as strongly (e.g. there are not as many Malayalam tweeters as Russian (Cyrillic) tweeters).
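For readers who want to try the coloring step, here is a rough sketch. Python's standard library does not expose the Unicode Script property directly, so this uses the first word of each character's Unicode name as a stand-in; that proxy is an assumption of this example, not the method used for the graph above, and it is only approximate (Han characters, for instance, come back as "CJK").

```python
import unicodedata


def rough_script(ch):
    """First word of the Unicode character name, used as an approximate script label."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Some code points (e.g. control characters) have no name.
        return "UNKNOWN"


for ch in "é猿ツ한":
    print(ch, rough_script(ch))
```

Grouping characters by this label and assigning one color per group gives a quick visual check of whether each cluster is dominated by a single script.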

What We Suspected:

First, about the clusters themselves:

  • Characters from Latin scripts (the accented versions of the characters English speakers are familiar with) do indeed cluster together, and group within that cluster. Many languages use ö, but only subsets of them use Å or ß. You can see rough groups for Scandinavian, Romance and Eastern-European scripts.
  • Japanese and Chinese are mashed together, because both use characters from the Han script.

Second, about the binds between languages. Clusters will arrange themselves in the large based on how many co-usages were found. A separated character dragged out in the open is especially interesting — somehow no single language “owns” that character.

Things we suspected about the connections:

  • Nearby countries will show more “mashups”.  Indeed, Greek and Cyrillic are tightly bound to each other, and loosely bound to European scripts; Korean has strong ties to European and Japanese/Chinese scripts. This initial assumption was partially incorrect though — Thai appears to have stronger ties to European than to Japanese/Chinese scripts.
  • Punctuation, Math and Music are universal. Look closely and you’ll see the fringe of brownish characters pulled out into “international waters”.

What We Never Suspected in Advance: There were two standouts that slapped me in the face when taking a closer look.

The first is the island in the lower right, off the coast of Europe. It’s a bizarre menagerie of Amharic, International Phonetic Alphabet and other scripts. What’s going on? These are characters that taken together look like upside-down English text: “¡pnolɔ ǝɥʇ uı ɐʇɐp ƃıq”. (Try it out yourself: http://www.revfad.com/flip.html) My friend Steve Watt’s reaction was, “so you’re saying that within the complexity of the designed-for-robots Unicode standard, people found some novel, human, way to communicate? Enterprises and Three Letter Agencies dedicate tons of resources for such findings.”

As soon as you’ve found a new question within your answers you’ve reached Waypoint 3 — a good sign for confidence in your results.

However, my favorite is the one single blue (Katakana) character that every language binds to (see close-up below). Why is Unicode code point U+30C4, the Katakana “Tsu” character, so fascinating?

[Figure: close-up of the Katakana “Tsu” character]

Because ツ looks like a smiley face.
The common bond across all of humanity is a smile.




Predictive Analytics Summit in New York


Our partners at *IE. would like to introduce you to the exclusive Predictive Analytics Summit for Banking & Financial Services, at the Conrad New York on December 6 & 7, 2012.

This summit will bring together the leaders and innovators from the banking industry for two days of unparalleled networking with like-minded professionals.

This event will combine keynote presentations with open discussion and interactive workshops in an event acclaimed for its innovative insight. It is a unique opportunity to share challenges and best practices with leaders in a collaborative environment.

Register Today.




