Big Data News

Hadoop in the Cloud – Infochimps and VMware

Infochimps is proud to be a part of a new effort launched today by VMware to enable big data applications running on Hadoop to be deployed more easily on top of virtual and cloud-based IT environments. The Serengeti project, released today under the Apache 2.0 license, is built upon a number of open source technologies, including our own Ironfan tool, and supports all major Hadoop distributions, including Cloudera, Greenplum, Hortonworks, and MapR.

Ironfan is the foundation of the Infochimps Platform and the basis of our customers’ Big Data deployments. It makes provisioning and configuring Big Data infrastructure simple – you can easily spin up clusters when you need them and kill them when you don’t – so our customers can spend their time, money, and engineering focus on finding insights, not configuring and deploying machines. Ironfan is quickly becoming the number one deployment tool for Hadoop platforms in the cloud, and its endorsement by VMware and inclusion in Serengeti are further evidence of the tool’s popularity.
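
Ironfan’s actual cluster definitions are written as a Ruby/Chef DSL, but the elastic pattern it automates – spin nodes up when you need them, tear them down when you don’t – is easy to sketch. The snippet below is purely illustrative and is not Ironfan’s or Serengeti’s API: it uses Python with boto3 against raw EC2, and the AMI ID, instance type, and tag values are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Spin up a small batch of worker nodes. The AMI ID, instance type, and tag
# values are placeholders for illustration only.
response = ec2.run_instances(
    ImageId="ami-00000000",
    InstanceType="m3.large",
    MinCount=4,
    MaxCount=4,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "cluster", "Value": "hadoop-demo"}],
    }],
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]

# ... run Hadoop jobs against the cluster ...

# Kill the cluster when the work is done, so idle nodes stop costing money.
ec2.terminate_instances(InstanceIds=instance_ids)
```

Ironfan layers cluster-wide configuration on top of this kind of raw provisioning, which is what turns it into the single-command workflow described below.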

What does Serengeti mean for Infochimps users?
From the beginning, the Infochimps Platform has been built on a foundation of open source tools for managing data that simplify the experience of working with complex technologies such as Hadoop. Ironfan, along with other tools like Wukong and Swineherd, is one of the major open source components of that stack. And with our enterprise tools, including the Data Delivery Service and Dashpot, customers can deploy complete Big Data environments and count on highly reliable delivery of data into their Hadoop clusters.

The Serengeti project continues our open source tradition, with a strong open source foundation and support from all of the major Hadoop distributions. Within Serengeti, Ironfan enables users to quickly and easily configure and deploy Hadoop clusters on top of VMware vSphere® in minutes, with a single command. Users running VMware’s virtual and cloud infrastructure can now more easily take advantage of the power of Hadoop, along with other Big Data technologies like the Infochimps Data Delivery Service and Dashpot and Infochimps’ Big Data expertise, to manage, process, and analyze massive amounts of unstructured, semi-structured, or structured data at scale and in the cloud.

We’re excited to be included in Serengeti and look forward to working with VMware customers and partners as they further their use of Big Data technologies.

Interested in learning more about Infochimps, VMware, and Serengeti? Contact us today for more information!

Why the American Community Survey is Important

The American Community Survey is an ongoing statistical survey that samples a small percentage of the population every year. It’s one of the most popular APIs in our Data Marketplace, and its data underpins the Digital Elements IP Intelligence Demographics API.

Learn more about the importance and usefulness of this annual supplement to the US Census.

(via Flowing Data)

The Reality of Bytes

Big Data is more than just the latest buzzword; it’s serious business, and it’s growing at an almost unimaginable rate. Today, Cisco Systems reports that global Internet Protocol traffic will reach an annual run rate of 1.3 zettabytes in 2016, up from 368.8 exabytes in 2011.

What are zettabytes and exabytes, you ask? Well, an exabyte is the equivalent of one billion gigabytes; a zettabyte is one trillion gigabytes. This increase in traffic is largely driven by the proliferation of mobile devices, including tablets and smartphones.
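
To put those units in perspective, here is a quick back-of-the-envelope check in Python, using the decimal (SI) definitions that Cisco’s figures are based on:

```python
# Decimal (SI) byte units, as used in Cisco's Visual Networking Index figures.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

print(f"1 exabyte   = {EB // GB:,} gigabytes")   # 1,000,000,000 (one billion)
print(f"1 zettabyte = {ZB // GB:,} gigabytes")   # 1,000,000,000,000 (one trillion)

# Cisco's projection: global IP traffic grows from 368.8 EB (2011) to 1.3 ZB (2016).
traffic_2011 = 368.8 * EB
traffic_2016 = 1.3 * ZB
print(f"Growth from 2011 to 2016: {traffic_2016 / traffic_2011:.1f}x")  # roughly 3.5x
```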

So, how do you make sense of the unfathomably large quantities of data we are creating today? Get a grip on the basics of bytes by checking out the Cisco Visual Networking Index, as well as this great chart they put together below.

IP Traffic Term (Equivalent): How much is that?

1 Petabyte (1,000 Terabytes, or 250,000 DVDs):
  100 Petabytes: the amount of data produced in a single minute by the new particle collider at CERN
  400 Terabytes: a digital library of all books ever written in any language

1 Exabyte (1,000 Petabytes, or 250 million DVDs):
  5 Exabytes: a transcript of all words ever spoken
  100 Exabytes: a video recording of all the meetings that took place last year across the world
  150 Exabytes: the amount of data that has traversed the Internet since its creation
  175 Exabytes: the amount of data that will cross the Internet in 2010 alone

1 Zettabyte (1,000 Exabytes, or 250 billion DVDs):
  66 Zettabytes: the amount of visual information conveyed from the eyes to the brain of the entire human race in a single year

1 Yottabyte (1,000 Zettabytes, or 250 trillion DVDs):
  20 Yottabytes: a holographic snapshot of the earth’s surface

The Era of Big Data and What It Means For You

When it comes to predicting the future, your best resource (short of a soothsayer) is historical data. As data collection, storage, and processing have become more sophisticated, the volume of data has exploded. A recent article in the McKinsey Quarterly states that in the US, across most business sectors, companies with more than 1,000 employees store, on average, over 235 terabytes of data – more data than is contained in the entire US Library of Congress.

What does this mean?  It means that companies are sitting on a goldmine of insights for competitive advantage.  The McKinsey Quarterly article mentions this example:

The top marketing executive at a sizable US retailer recently found herself perplexed by the sales reports she was getting. A major competitor was steadily gaining market share across a range of profitable segments. Despite a counterpunch that combined online promotions with merchandizing improvements, her company kept losing ground.

When the executive convened a group of senior leaders to dig into the competitor’s practices, they found that the challenge ran deeper than they had imagined. The competitor had made massive investments in its ability to collect, integrate, and analyze data from each store and every sales unit and had used this ability to run myriad real-world experiments. At the same time, it had linked this information to suppliers’ databases, making it possible to adjust prices in real time, to reorder hot-selling items automatically, and to shift items from store to store easily. By constantly testing, bundling, synthesizing, and making information instantly available across the organization—from the store floor to the CFO’s office—the rival company had become a different, far nimbler type of business.

The amount of data we produce is staggering and the underlying possibilities are incredible, but that doesn’t necessarily mean companies have the ability to extract true value from their data.

Looking to understand how Big Data can revolutionize how your organization does business?  Sign up for a free Big Data consultation with some of our leading data scientists to get started today!

Milking Big Data in Pursuit of… More Milk

A recent article from The Atlantic explores how Big Data has revolutionized the dairy industry. In the past sixty years, through innovations in dairy science, milk production from an individual dairy cow has risen from a lifetime average of 5,000 pounds of milk to 21,000 pounds. This astonishing increase has largely been fueled by data-driven predictions that allow dairy breeders to optimize their herds.

Dairy breeding is perfect for quantitative analysis. Pedigree records have been assiduously kept; relatively easy artificial insemination has helped centralize genetic information in a small number of key bulls since the 1960s; there are a relatively small and easily measurable number of traits — milk production, fat in the milk, protein in the milk, longevity, udder quality — that breeders want to optimize; each cow works for three or four years, which means that farmers invest thousands of dollars into each animal, so it’s worth it to get the best semen money can buy. The economics push breeders to use the genetics.