Pop Data

[Infographic] Taming Big Data from Wikibon

The following infographic from Wikibon opens with a Big Data market forecast and ends with a shout-out for all industries to embrace Big Data as the definitive source of competitive advantage. It personifies Big Data as a beast (data volumes are growing exponentially) that can be tamed (thanks to new approaches for processing, storing and analyzing data). It includes real-world Big Data use cases, which I appreciated. I was most amazed by how “decoding the human genome used to take ten years, but can now be done in 7 days.”

The quote from Kevin Weil, the Director of Product for Revenue at Twitter, brings the benefit of valuable Big Data insights home: “It’s no longer hard to find the answer to a given question; the hard part is finding the right question and as questions evolve, we gain better insight into our ecosystem and our business.”

Scroll down, geek out on the infographic, and if you want more, check out an oldie but goodie article:  6 Illuminating Big Data Infographics

[Infographic: Taming Big Data, from Wikibon]

Did you notice the chimp within the Big Data forecast?

Thank you, Wikibon, for posting this!


The 3 Waypoints of a Data Exploration

Part of our goal is to unlock the big data stack for exploratory analytics.

How do you know when you’ve found the right questions? That you’ve gone deep enough to trust the answers? Here’s one sign.

The 3 Waypoints of a Data Exploration:

  • What you knew: is it validated by the data?
  • What you suspect: how do your hypotheses agree with reality?
  • What you would have never suspected: did something turn up that was unpredictable in advance?

In Practice:
A while back, a friend asked me about signals in the Twitter stream for things like “Spanglish”: multiple languages mixed in the same message. I did a simple exploration of tweets from around the world (simplifying at first to non-English languages) to see how easy such messages are to find.

I took 100 million tweets and looked for only those “non-keyboard” characters — é (e with acute accent) or 猿 (Kanji character meaning ‘ape’) or even ☃ (snowman).

Using all the cases where there were two non-keyboard characters in the same message, I assembled the following graph.
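The pair-counting step can be sketched in a few lines of Python. This is only an illustration, not the code used for the graph: it assumes that “non-keyboard” means any non-ASCII, non-whitespace character, and the function names and sample messages are made up for the example.

```python
from collections import Counter
from itertools import combinations


def non_keyboard_chars(message):
    """Return the distinct non-ASCII, non-whitespace characters in a message.

    Assumption: "non-keyboard" is approximated as "outside the ASCII range".
    """
    return {ch for ch in message if ord(ch) > 127 and not ch.isspace()}


def cooccurrence_counts(messages):
    """Count how often each unordered pair of characters shares a message."""
    pair_counts = Counter()
    for message in messages:
        chars = sorted(non_keyboard_chars(message))
        # Each unordered pair of distinct characters in a message counts once.
        for a, b in combinations(chars, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical sample messages standing in for the tweet stream.
sample = ["情報 と 猿", "猿 が 情報", "café ☃", "plain ascii only"]
edges = cooccurrence_counts(sample)
```

Each key in `edges` is a pair of characters and each value is the number of messages in which they appeared together, which is the edge weight a graph tool would need.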

Imagine tying a little rubber band between every pair of characters, as strong as the number of times they were seen hanging out together; also, give every character the desire for a bit of personal space so they don’t just pile on top of each other. It’s a super-simple model that tools like Cytoscape or Gephi will do out-of-the-box.

That gave this picture (I left out the edges for clarity and hand-arranged the clusters at the bottom):

[Image: 3 Waypoints character map]
This “map” of the world — the composition of each island, and the arrangement of the large central archipelago — popped out of this super-simplistic model. It had no information about human languages other than “sometimes, when a person says 情報 they also say 猿.” Any time the data is this dense and connected, I’ve found it speaks for itself.

Now let’s look at the 3 Waypoints.

What We Knew: What I really mean by “knew”  is “if this isn’t the case, I’m going to suspect my methods much more strongly than the results”:

  • Most messages are in a single language, but there are some crossovers. After the fact, I colored each character by its “script” type from the Unicode standard (e.g. Hangul is in cyan). As you can see, most of the clouds have a single color.
  • Languages with large alphabets have tighter-bound clouds, because there are more “pairs” to find (e.g. the Hiragana character cloud is denser than the Arabic cloud).
  • Languages with smaller representation don’t show up as strongly (e.g. there are not as many Malayalam tweeters as Russian (Cyrillic) tweeters).
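For anyone who wants to try the coloring step, here is a rough sketch. Python’s standard `unicodedata` module does not expose the Unicode Script property, so this approximates it from the first word of each character’s official name. The `SCRIPT_PREFIXES` set is an assumption and far from complete; it is not the method used for the picture above.

```python
import unicodedata

# Assumed, incomplete set of script names that show up as the first word
# of a Unicode character name (for example "KATAKANA LETTER TU").
SCRIPT_PREFIXES = {
    "LATIN", "GREEK", "CYRILLIC", "ARABIC", "THAI", "HANGUL",
    "HIRAGANA", "KATAKANA", "CJK", "MALAYALAM", "ETHIOPIC",
}


def rough_script(ch):
    """Guess a character's script from its Unicode name; 'OTHER' if unknown."""
    name = unicodedata.name(ch, "")
    first_word = name.split(" ", 1)[0] if name else ""
    return first_word if first_word in SCRIPT_PREFIXES else "OTHER"


# Group a few of the characters mentioned in this post by their rough script.
by_script = {}
for ch in "é猿☃ツ":
    by_script.setdefault(rough_script(ch), []).append(ch)
```

Symbols such as the snowman fall into `OTHER`, which is roughly the bucket that punctuation, math and music characters would land in as well.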

What We Suspected:

First, about the clusters themselves:

  • Characters from Latin scripts (the accented versions of the characters English speakers are familiar with) do indeed cluster together, and group within that cluster. Many languages use ö, but only subsets of them use Å or ß. You can see rough groups for Scandinavian, Romance and Eastern-European scripts.
  • Japanese and Chinese are mashed together, because both use characters from the Han script.

Second, about the binds between languages. Clusters will arrange themselves in the large based on how many co-usages were found. A separated character dragged out in the open is especially interesting — somehow no single language “owns” that character.

Things we suspected about the connections:

  • Nearby countries will show more “mashups”.  Indeed, Greek and Cyrillic are tightly bound to each other, and loosely bound to European scripts; Korean has strong ties to European and Japanese/Chinese scripts. This initial assumption was partially incorrect though — Thai appears to have stronger ties to European than to Japanese/Chinese scripts.
  • Punctuation, Math and Music are universal. Look closely and you’ll see the fringe of brownish characters pulled out into “international waters”.

What We Never Suspected in Advance: There were two standouts that slapped me in the face when taking a closer look.

The first is the island in the lower right, off the coast of Europe. It’s a bizarre menagerie of Amharic, International Phonetic Alphabet and other scripts. What’s going on? These are characters that taken together look like upside-down English text: “¡pnolɔ ǝɥʇ uı ɐʇɐp ƃıq“. (Try it out yourself: http://www.revfad.com/flip.html) My friend Steve Watt’s reaction was, “so you’re saying that within the complexity of the designed-for-robots Unicode standard, people found some novel, human, way to communicate? Enterprises and Three Letter Agencies dedicate tons of resources for such findings”.
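The trick behind the upside-down text is just a character substitution plus a reversal. Here is a minimal sketch; the `FLIP` table is a partial mapping put together for this example (it only covers the letters needed below), not the table the linked site uses.

```python
# Partial, illustrative mapping from ASCII characters to look-alike
# "upside-down" characters borrowed from other Unicode blocks.
FLIP = {
    "a": "ɐ", "b": "q", "c": "ɔ", "d": "p", "e": "ǝ", "g": "ƃ",
    "h": "ɥ", "i": "ı", "n": "u", "t": "ʇ", "u": "n", "!": "¡",
}


def flip(text):
    """Reverse the text and swap each character for its flipped look-alike.

    Characters without an entry in FLIP (such as "l" and "o") are kept as-is.
    """
    return "".join(FLIP.get(ch, ch) for ch in reversed(text.lower()))


flipped = flip("big data in the cloud!")
```

Every character in `flipped` that is not plain ASCII comes from some other script or symbol block, which is why these messages form their own island in the graph.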

As soon as you’ve found a new question within your answers you’ve reached Waypoint 3 — a good sign for confidence in your results.

However, my favorite is the one single blue (Katakana) character that every language binds to (see close-up below). Why is Unicode code point U+30C4, the Katakana “Tsu” character, so fascinating?

[Image: 3 Waypoints Smiley close-up]

Because ツ looks like a smiley face.
The common bond across all of humanity is a smile.


Is “Big Data” the Wrong Term?

It’s likely that, like me, you have heard again and again about “big data”, its 3 V’s, and the Hadoop brand. Yes, volume, velocity, and variety of data are making it difficult to use traditional data solutions like BI cubes, relational databases, and bespoke data pipelines. The world needs new superheroes like Hadoop, NoSQL, NewSQL, DevOps, etc. to solve our woes.


However, these new technologies and approaches have done much more than just solve the problems around petabytes of data and thousands of events per second. They are the right way to do data. That’s why I’m not convinced the term “big data” was a good choice for us to land on as an industry. It’s really “smart data” or “scalable data.” And despite my distaste for adding a version number to buzz phrases, even “Data 2.0” would be more apt.

If you are a CTO/CIO, system architect, manager, consultant, developer, sys admin, or simply an interested professional – my goal is to prompt some initial points on why big data constitutes a good approach to data management and analytics, regardless of the speed and quantity of data.

Scalable Data: Multi-Node Architecture and Infrastructure-as-Code

Distributed, horizontally scalable multi-node systems are always the right way to do infrastructure, no matter the size of your data or the size of your IT team. This wasn’t always the case, but now multi-node systems are as easy to manage as single-node solutions. It’s so easy now because monitoring, logging, management software, and more are all baked right in; systems come to life in a coordinated fashion that hides all the complexity and scales as needed. You can test your infrastructure in the same way you test programming code. While manually testing a multi-node system may be difficult, testing a piece of code is straightforward.

One of the worst things that can happen to an IT team is having to manage major architecture changes. Using open source, multi-node technologies with an infrastructure-as-code foundation lets organizations grow organically and swap tools and software in and out as needed. Simply modify your infrastructure definitions, test your code, and deploy. Additionally, this kind of framework works perfectly with the DevOps approach to system management. Code repositories are collaborative and iterative – giving individual developers empowerment to directly manage infrastructure, while having the safeguards and tests in place to ensure reliability and quality.
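To make “test your infrastructure like code” concrete, here is a toy sketch. Everything in it is hypothetical: `NodeSpec`, `validate_cluster`, and the rules it checks are invented for illustration and do not correspond to any particular infrastructure tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical declarative description of one node in a cluster."""
    name: str
    role: str          # "coordinator" or "worker" in this sketch
    monitoring: bool
    logging: bool


def validate_cluster(nodes, min_workers=2):
    """Return a list of human-readable problems; empty means the spec passes."""
    errors = []
    names = [node.name for node in nodes]
    if len(names) != len(set(names)):
        errors.append("node names must be unique")
    if sum(node.role == "coordinator" for node in nodes) != 1:
        errors.append("exactly one coordinator is required")
    if sum(node.role == "worker" for node in nodes) < min_workers:
        errors.append(f"at least {min_workers} workers are required")
    for node in nodes:
        if not (node.monitoring and node.logging):
            errors.append(f"{node.name}: monitoring and logging must be enabled")
    return errors


cluster = [
    NodeSpec("coord-1", "coordinator", True, True),
    NodeSpec("worker-1", "worker", True, True),
    NodeSpec("worker-2", "worker", True, True),
]
# Same cluster, but one worker was defined without logging.
broken = cluster[:2] + [NodeSpec("worker-2", "worker", True, False)]
```

Because the definition is plain data, a check like `validate_cluster(broken)` can run in the same test suite as application code, before anything is deployed.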

Smart Data: Machine Learning and Data Science

You don’t have to have petabytes of data to begin implementing smart algorithms. To run your business more efficiently, you need to be predictive. You must forecast business and market trends before they happen so you can anticipate how to steer your organization. The companies that win will be the ones analyzing and understanding as much data as possible – building data science as a key competency. Big data tools are making it easier to work with data by providing tools like Mahout for machine learning, Hive for business intelligence queries, or R for statistical analysis, which can interface with Hadoop. Because of big data architecture, you can keep data fresh, use a larger swath of data, and use the newest, most powerful tools to perform the analysis and processing.

Agnostic Data: The Right Database for Each Job

New data pipelining frameworks enable real-time stream processing with multi-node scalability and the ability to fork or merge flows. What that means is, you can easily support multiple databases for multiple problems: columnar stores as primary data stores, relational databases for reporting, search databases for data exploration, graph databases for relationship data, document stores for unstructured data, etc. Because of data splitting/merging capabilities, and your DevOps infrastructure ensuring your databases have integrated monitoring and logging, the added burden of having more than one database is minimal. You just have to learn how to interface with the data through easy-to-use APIs and client libraries.
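The idea of forking one flow into several stores can be sketched in plain Python. This is a toy model under loud assumptions: the “databases” are just lists, and `fan_out`, the routing rules, and the sample events are all made up for the example rather than taken from any real pipelining framework.

```python
def fan_out(records, routes):
    """Send each record to every sink whose predicate accepts it.

    routes is a list of (predicate, sink) pairs; a sink here is any object
    with an append method, standing in for a database client.
    """
    for record in records:
        for accepts, sink in routes:
            if accepts(record):
                sink.append(record)


# Stand-ins for three different databases.
primary_store, search_index, graph_store = [], [], []

routes = [
    (lambda r: True, primary_store),            # everything lands in the primary store
    (lambda r: "text" in r, search_index),      # text goes to the search database
    (lambda r: "follows" in r, graph_store),    # relationships go to the graph database
]

events = [{"id": 1, "text": "hello"}, {"id": 2, "follows": 1}, {"id": 3}]
fan_out(events, routes)
```

Adding another database in this model means adding one more route, which is the sense in which the extra burden stays small.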

Holistic Data: Hadoop is Not The End All, Be All

Finally, let’s tackle Hadoop specifically. Hadoop is oriented around large-scale batch processing of data. But so much of big data also involves databases, data integration/collection, real-time stream processing, and data exploration. Hadoop is not a one-trick pony, but it’s also not the answer to every data problem known to man.

Frameworks like Flume, Storm, and S4 are making it easier to perform stream processing such as collecting hundreds of tweets per second, thousands of ad impressions per second, or processing data in near real-time as data flows to its destination (whether a database, Hadoop filesystem, etc.). New database technologies are providing more powerful ways of querying data and building applications. R, Hive, Mahout, and more are providing better data scientist tools. Tableau, Pentaho, GoodData, and others are pushing the envelope with data visualization and big data dashboarding.

Big data software and frameworks are the right foundation for data + data integration and collection + data science + statistical analysis + infrastructure management and administration + IT scaling + data-centric applications + data exploration and visualization, often regardless of data size.

Your organization benefits from adopting these best practices early and working with vendors that understand your company’s problem isn’t just “oh no, I have too much data”. It’s all about return on investment. The big data approach lowers overhead, enables faster and more efficient IT infrastructure management, generates better insights, and puts them to work in your organization.


Eating Towns and Drinking Towns

[Image: Trulia restaurant density heatmap]

In another well-done data analysis from Trulia, the real estate technology company uses US Census data to map out the country’s bars and restaurants. Perhaps unsurprisingly, San Francisco reigns supreme in the restaurant contest, with one restaurant for every 243 households in the city. Trulia compares this data to the median price per square foot for for-sale houses, and in that chart it quickly becomes clear that, in general, higher income provides for a greater ability to patronize (and support) a bustling restaurant culture.

Top Metros for Eating Out
#    U.S. Metro             Restaurants per 10,000 households    Median price per sq ft of for-sale homes
1    San Francisco, CA      39.3                                 $459
2    Fairfield County, CT   27.6                                 $222
3    Long Island, NY        26.5                                 $217
4    New York, NY-NJ        25.3                                 $275
5    Seattle, WA            24.9                                 $150
6    San Jose, CA           24.8                                 $319
7    Orange County, CA      24.8                                 $260
8    Providence, RI-MA      24.3                                 $146
9    Boston, MA             24.2                                 $219
10   Portland, OR-WA        24.0                                 $129

Note: among the 100 largest metros.

Can you guess which city in the US has the greatest number of bars per capita? I’ll give you a hint: you can get drive-thru margaritas and the city is nicknamed “The Big Easy”. Yup, good ol’ New Orleans ranks #1 with one bar for every 1,173 households. Interestingly, its median price per square foot for for-sale houses is significantly lower than San Francisco’s, even though San Francisco ranks only #8 for bars. It looks like sustaining a thriving bar scene does not have the same income requirements as restaurants.

Top Metros for Drinking
#    U.S. Metro             Bars per 10,000 households    Median price per sq ft of for-sale homes
1    New Orleans, LA        8.6                           $99
2    Milwaukee, WI          8.5                           $109
3    Omaha, NE-IA           8.3                           $79
4    Pittsburgh, PA         7.9                           $91
5    Toledo, OH             7.2                           $71
6    Syracuse, NY           7.0                           $86
7    Buffalo, NY            6.8                           $91
8    San Francisco, CA      6.0                           $459
9    Las Vegas, NV          6.0                           $69
10   Honolulu, HI           5.9                           $390

Note: among the 100 largest metros.
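If you want to move between the “per 10,000 households” rates in the tables and the “one for every N households” phrasing in the text, the conversion is a single division. A small sketch follows; the function name is made up, and because it works from the rounded table values, its results will be close to, but not identical to, the per-household figures quoted above.

```python
def households_per_establishment(rate_per_10k):
    """Convert a rate per 10,000 households into households per establishment."""
    return 10_000 / rate_per_10k


# Rates taken from the two tables above.
sf_restaurants = households_per_establishment(39.3)   # San Francisco restaurants
nola_bars = households_per_establishment(8.6)         # New Orleans bars
```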

[Image: Trulia bar density heatmap]

I’d love to see these maps overlaid for a compare and contrast of the various metro areas featured in this analysis. Interestingly, it looks like the middle of the country has a considerably higher density of bars (relative to the rest of the country) than it does restaurants.

The Value of an Olympic Medal

[Infographic: Meddling With The Gold]

Olympic medals may be a lot of facade (a gold medal has only 1.34% gold content?), but they can come with big cash prizes. The US Olympic Committee will dole out upwards of $25,000 for a gold medalist. Countries such as Italy and Russia pay $182,000 and $135,000, respectively, to their top performers. Surprisingly, the UK, this year’s host, does not provide any monetary compensation to its athletes for bringing home the gold.

Curious: Time Delays and Rover Landings

[Image: Curious: Time Delays and Rover Landings]

Anyone else stay up all night watching Curiosity land on Mars?

Thanks, I Love Charts!

Was An Olympic Record Set Today?

[Image: Was An Olympic Record Set Today?]

Put together with Google Docs, GitHub, and the New York Times Olympic API, this microsite from the Guardian US answers the question, “Was an Olympic record set today?” It’s going to be mighty sad for about four years after August 12th. ;)

Thanks, Flowing Data, for posting this!

Animated Map of the US

[Image: Animated Map of the US]

We found this little gem on Chart Porn.  It provides a neat visualization of the changing landscape of the United States since its inception.  We agree with the folks at Chart Porn that adding timeline control would make this map really awesome (and a useful study aid for American history).

Olympic Body Doubles and the Global Fat Check

[Images: BBC Olympic body type match and global fat index interactives]

In the past month, the BBC has released two interactive features around bodies that we found particularly interesting.  In one, you can find out which Olympic athlete has the same height and weight as you and in another, you can see where you rank on the global obesity index.  They are great examples of using publicly available data to create engaging and educational experiences.  Check ‘em out and let us know what you think about this kind of data presentation.

You Don’t Have to Be Sixteen…

Actually, you do have to turn sixteen within the year of competition to be eligible to compete in the Olympics. However, the general perception that all Olympians are teenagers largely comes from the popular sport of women’s gymnastics, which in the past three Olympic Games has had no competitors over the age of 26. The mid-twenties are a prime age group for Olympic competition; in fact, most medalists (and gold medalists) are in their twenties.

Check out this cool interactive chart from the Washington Post and see which Olympic sports you may be most competitive in at your age.

[Chart: average age of Olympians by sport, from the Washington Post]

And here are some notable Olympians who hit their peak later in life:

Dara Torres, at 41, became the oldest swimmer ever to earn a place on the US Olympic team (2008 Summer Olympics). She’s a mom of one who, at 40, beat her own American record for the 50-meter freestyle (she originally set the record when she was 15).

John Dane III, owner of Trinity Yachts, the country’s largest mega-yacht builder, earned a spot on the 2008 US Olympic sailing team at age 58. He had been trying out for the Olympics since he was 18 and achieved his dream after 40 years!