Monthly Archives May 2012

The Reality of Bytes

Big Data is more than just the latest buzzword; it’s serious business, and it’s growing at a nearly unimaginable rate. Today, Cisco Systems reports that global Internet Protocol traffic will reach an annual run rate of 1.3 zettabytes in 2016, up from 368.8 exabytes in 2011.

What are zettabytes and exabytes, you ask? Well, an exabyte is the equivalent of one billion gigabytes; a zettabyte is one trillion gigabytes. This increase in traffic is largely driven by the proliferation of mobile devices, including tablets and smartphones.

So, how do you make sense of the unfathomably large quantities of data we are creating today? Get a grip on the basics of bytes by checking out the Cisco Visual Networking Index, as well as this great chart they put together below.

IP Traffic Term and Equivalent, followed by “How much is that?”

1 Petabyte = 1,000 Terabytes or 250,000 DVDs
  • 100 Petabytes: The amount of data produced in a single minute by the new particle collider at CERN
  • 400 Terabytes: A digital library of all books ever written in any language

1 Exabyte = 1,000 Petabytes or 250 million DVDs
  • 5 Exabytes: A transcript of all words ever spoken
  • 100 Exabytes: A video recording of all the meetings that took place last year across the world
  • 150 Exabytes: The amount of data that has traversed the Internet since its creation
  • 175 Exabytes: The amount of data that will cross the Internet in 2010 alone

1 Zettabyte = 1,000 Exabytes or 250 billion DVDs
  • 66 Zettabytes: The amount of visual information conveyed from the eyes to the brain of the entire human race in a single year

1 Yottabyte = 1,000 Zettabytes or 250 trillion DVDs
  • 20 Yottabytes: A holographic snapshot of the earth’s surface
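
If you’d rather check the arithmetic than take the chart’s word for it, here is a minimal Python sketch. It assumes decimal (SI) units, where each step up the ladder is a factor of 1,000, and the function names are our own, made up for illustration. It converts between units, works out the DVD size the chart implies, and estimates the yearly growth rate implied by the two Cisco traffic figures quoted above.

```python
# Decimal (SI) byte units: each unit is 1,000 times the one before it.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def unit_in_bytes(unit):
    """Return the number of bytes in one `unit` (decimal prefixes)."""
    return 1000 ** UNITS.index(unit)


def convert(amount, from_unit, to_unit):
    """Convert `amount` of `from_unit` into `to_unit`."""
    return amount * unit_in_bytes(from_unit) / unit_in_bytes(to_unit)


def annual_growth_rate(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    # How many gigabytes are in an exabyte and in a zettabyte?
    print("GB per exabyte:  ", convert(1, "exabyte", "gigabyte"))
    print("GB per zettabyte:", convert(1, "zettabyte", "gigabyte"))

    # The chart equates 1 petabyte with 250,000 DVDs, which implies this DVD size.
    print("GB per DVD:", convert(1, "petabyte", "gigabyte") / 250_000)

    # Traffic figures quoted in the post: 368.8 EB (2011) and 1.3 ZB (2016).
    traffic_2016_eb = convert(1.3, "zettabyte", "exabyte")
    rate = annual_growth_rate(368.8, traffic_2016_eb, years=5)
    print(f"Implied annual growth: {rate:.1%}")
```

Run as a script, this shows that the chart is assuming a 4 GB DVD and that the jump from 368.8 exabytes to 1.3 zettabytes works out to a bit under 29 percent growth per year, compounded over five years.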

Exploring Big Data as an Agency Product

↳ The Future of Big Data in the Agency

Join us for our free webinar on Real-Time Analytics for Agencies

Big Data is changing the game for companies of all shapes and sizes, including agencies looking to take their social media technologies and customer insight practices to the next level. But managing the massive velocity, volume, and variety of social media and other data sets at scale can be a huge challenge. Infochimps has built the largest open marketplace of data sets in the world. Now, we’ve opened up our platform and work with some of the world’s top digital, advertising, and PR agencies, which use the Infochimps platform to broaden and scale their proprietary data offerings through:

  • Sentiment and Influencer Analysis
  • Client Customer Insights
  • Real-Time Social Media Analytics
  • Infographic and Report Generation
  • Meme tracking

We’re having a webcast on Thursday, May 31 @ 11:00 CST, titled Real-Time Analytics: The Future of Big Data in the Agency. Infochimps’ co-Founder, Dhruv Bansal, one of the world’s leading data scientists, will present a quick demonstration of how agencies can build their own Big Data infrastructure and distribute costs across multiple clients while growing their product offerings with Big Data – in a fraction of the time you’d expect and for a fraction of the cost of Big Data talent, enterprise consultants and/or custom enterprise solutions. We’d love for you to attend and participate.

Learn More

Texas Has Chest Congestion

Here’s a great example of how one company takes Big Data and makes it fun. Help is a drug company that strives to simplify the pharmaceutical choices for customers. Their website now features a map highlighting sales data from Target and Walgreens called “What’s wrong U.S.?”. A bar chart for each state shows how many people are buying products for particular ailments versus the national average; you can also click on the state and get region-by-region details. For example, Central Texas, home to our home, Austin, TX, has a higher than average number of blisters. Maybe it’s all the running and biking we do!

(via Flowing Data)

S3Chimp: Information Science in Action

I’m Selene, Infochimps’ new Analyst. Prior to my new position, I was an Infochimps intern. I recently graduated from the School of Information at the University of Texas with a Master of Science in Information Studies. As part of my MSIS degree plan, I completed a semester-long project titled Developing and Integrating a Lightweight Metadata System into a Data Ingestion Workflow here at Infochimps, Inc.

The main ingredients of the project were Ruby on Rails, MongoDB, and everyone’s favorite, Amazon Web Services. The result is an alpha stage of the tentatively named S3Chimp. It is an addition to Dashpot, our Analytics & Operations Dashboard for the Infochimps Platform. Dashpot boasts an easy-to-use analytics and operations dashboard that provides business metrics and visualization, cluster management capabilities, and system monitoring on top of the Infochimps Platform. Integrating a lightweight metadata system into the workflow makes it possible for Dashpot to also track and organize distributed massive-scale data assets. What was once time-consuming (according to us as well as various people in the industry) can now be a dynamic part of an organization’s internal analytics.

Before I could begin making S3Chimp, organizing the Infochimps Amazon S3 buckets was key. Perhaps a company that boasts about its command of data should have a beautifully organized set of buckets? Perhaps… But let’s pretend that is not the case. And let us imagine that a young and excited Information Studies graduate student decides to tackle the S3 clutter. The essential steps in such a scenario include designing a well-thought-out schema guideline tailored to the company’s needs and data types, and incessantly enforcing those guidelines.

Next on the list was learning Ruby on Rails, over several weeks. It was a baptism by fire. I learned the very basics of Ruby on Rails and how to love the MVC trinity. Ruby on Rails is a smart and fun web app framework and it was an enjoyable experience, relative to PHP. Relative to a Saturday afternoon at Barton Springs? Not so much.

With a snazzy script written in the enchanted Infochimps Data Mine, I was able to take the most exciting leap, which was taking metadata from the now beautifully organized S3 buckets and injecting it into MongoDB, a NoSQL database. The result is the S3Chimp genesis. S3Chimp is a system that tells you what data, and how much of it, is in AWS, all from your analytics dashboard. Future plans for this product include making a tool to capture provenance metadata, and other goodies.
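
To give a flavor of the “what data and how much of it” idea, here is a minimal Python sketch. It is not the actual S3Chimp script (that project was built with Ruby on Rails and MongoDB), and the field names (`key`, `size`, `prefix`, `object_count`, `total_bytes`) and the sample keys are hypothetical. It assumes you already have a listing of objects, each with a key and a size in bytes, and rolls them up into one summary document per top-level folder. In a real pipeline the listing would come from S3 and the summary documents would be written to a MongoDB collection.

```python
from collections import defaultdict


def summarize_by_prefix(objects, depth=1):
    """Roll up object listings into one summary document per key prefix.

    `objects` is an iterable of dicts with a "key" (path-like string) and a
    "size" in bytes. `depth` is how many leading folder names form the prefix.
    Objects that sit outside any folder are grouped under "(root)".
    """
    totals = defaultdict(lambda: {"object_count": 0, "total_bytes": 0})
    for obj in objects:
        folders = obj["key"].split("/")[:-1]  # drop the file name itself
        prefix = "/".join(folders[:depth]) or "(root)"
        totals[prefix]["object_count"] += 1
        totals[prefix]["total_bytes"] += obj["size"]
    # One flat document per prefix, ready to store in a document database.
    return [{"prefix": p, **stats} for p, stats in sorted(totals.items())]


if __name__ == "__main__":
    # Hypothetical listing, made up for illustration only.
    listing = [
        {"key": "geo/census/2010.tsv", "size": 100},
        {"key": "geo/places.json", "size": 50},
        {"key": "social/tweets.json", "size": 25},
        {"key": "README", "size": 5},
    ]
    for doc in summarize_by_prefix(listing):
        print(doc)
```

Keeping each summary as a small, flat document is a design choice that suits a document store: a dashboard can read one document per prefix without re-scanning every object.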

You can find me at the upcoming MongoDB NYC conference if you’d like to ask me about our awesome new Ironfan Platform, Dashpot, or my capstone project.

I’d like to thank my Field Supervisor, Flip Kromer, as well as my Faculty Adviser, Dr. Melanie Feinberg.

Keep an eye out for my next blog post, where I will be chronicling my personal Ruby on Rails adventure that is near and dear to my librarian heart. Travis Dempsey and I will make an in-house database of our office library’s catalog. The Bukfin Repository’s catalog is currently housed in LibraryThing.

How We Do It

Infochimps uses many cutting-edge tools (Chef, Amazon Web Services, Hadoop, HBase, ElasticSearch, Flume, MongoDB, Phantom.js, etc., ad nauseam), and we’ve written a number of custom tools to help corral these sometimes wild horses into a working team. Ironfan, our Chef specialization for big data in the cloud, coordinates the installation and configuration of the many necessary components. Wukong is our Ruby library for Hadoop, combining the flexibility of JRuby with the raw power of MapReduce. Wonderdog is our Hadoop interface to ElasticSearch, allowing us to deliver large amounts of data quickly into a stable and searchable NoSQL data store. Swineherd, the workflow engine for Hadoop jobs, helps tie all of this together into a coherent framework for running multi-stage data ingestions.

To crib a DevOps aphorism, however, it’s not the technology that makes Infochimps work: it’s the culture. Specifically, it’s about a culture that keeps the challenges from all that novel technology manageable.


Why Geeks Win

We found this great little chart on Chart Porn today and thought it was an excellent representation of the foundations of our company.  Yay, geeks!

Why Real-Time Analytics? [Free White Paper]

When you think Big Data, the first words that come to mind are often Hadoop and NoSQL, but what do these technologies actually mean for your business? Different Big Data technologies have different use cases where they work best. Real-time Big Data challenges often call for a very different class of tools.

In this free white paper, we’ll explore:

  • How to create a flexible architecture that allows you to use the best Big Data tools and technologies for the job at hand
  • Where Hadoop analysis and NoSQL databases work and where they can fall short
  • How Hadoop differs from real-time analytics and stream processing approaches
  • Visual representations of how real-time analytics works and real world use cases
  • How to leverage the Infochimps Platform to perform real-time analytics

The Era of Big Data and What It Means For You

When it comes to predicting the future, your best resource (short of a soothsayer) is historical data. As data collection, storage, and processing have become more sophisticated, the volume of data has exploded. A recent article in the McKinsey Quarterly states that in the US, across most business sectors, companies with more than 1,000 employees store, on average, over 235 terabytes of data – more data than is contained in the entirety of the US Library of Congress.

What does this mean?  It means that companies are sitting on a goldmine of insights for competitive advantage.  The McKinsey Quarterly article mentions this example:

The top marketing executive at a sizable US retailer recently found herself perplexed by the sales reports she was getting. A major competitor was steadily gaining market share across a range of profitable segments. Despite a counterpunch that combined online promotions with merchandizing improvements, her company kept losing ground.

When the executive convened a group of senior leaders to dig into the competitor’s practices, they found that the challenge ran deeper than they had imagined. The competitor had made massive investments in its ability to collect, integrate, and analyze data from each store and every sales unit and had used this ability to run myriad real-world experiments. At the same time, it had linked this information to suppliers’ databases, making it possible to adjust prices in real time, to reorder hot-selling items automatically, and to shift items from store to store easily. By constantly testing, bundling, synthesizing, and making information instantly available across the organization—from the store floor to the CFO’s office—the rival company had become a different, far nimbler type of business.

The amount of data we produce is staggering and the underlying possibilities are incredible, but that doesn’t necessarily mean companies have the ability to extract true value from their data.

Looking to understand how Big Data can revolutionize how your organization does business?  Sign up for a free Big Data consultation with some of our leading data scientists to get started today!

Milking Big Data in Pursuit of… More Milk

A recent article from The Atlantic explores how Big Data has revolutionized the dairy industry. In the past sixty years, through innovations in dairy science, milk production from an individual dairy cow has gone up from an average of 5,000 pounds of milk in a lifetime to 21,000 pounds of milk. This astonishing increase has largely been fueled by data-driven predictions that allow dairy breeders to optimize their herds.

Dairy breeding is perfect for quantitative analysis. Pedigree records have been assiduously kept; relatively easy artificial insemination has helped centralized genetic information in a small number of key bulls since the 1960s; there are a relatively small and easily measurable number of traits — milk production, fat in the milk, protein in the milk, longevity, udder quality — that breeders want to optimize; each cow works for three or four years, which means that farmers invest thousands of dollars into each animal, so it’s worth it to get the best semen money can buy. The economics push breeders to use the genetics.