- October 28, 2012
Part of our goal is to unlock the big data stack for exploratory analytics.
How do you know when you’ve found the right questions? That you’ve gone deep enough to trust the answers? Here’s one sign.
The 3 Waypoints of a Data Exploration:
- What you knew: is it validated by the data?
- What you suspected: how well do your hypotheses agree with reality?
- What you would never have suspected: did something turn up that you could not have predicted in advance?
A while back, a friend asked me about signals in the Twitter stream for things like “Spanglish”: multiple languages mixed in the same message. I did a simple exploration of tweets from around the world (simplifying at first to non-English languages) to see how easy such messages are to find.
I took 100 million tweets and looked only at the “non-keyboard” characters, such as é (e with acute accent), 猿 (a Kanji character meaning ‘ape’), or even ☃ (snowman).
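A minimal sketch of that filtering step. It assumes “non-keyboard” simply means “outside the ASCII range”, which is a guess at the definition used here; the function name is made up for illustration.

```python
def non_keyboard_chars(text):
    """Return the distinct non-ASCII, non-whitespace characters in text,
    in order of first appearance.

    Assumption: "non-keyboard" is approximated as "code point above 127".
    """
    seen = []
    for ch in text:
        if ord(ch) > 127 and not ch.isspace() and ch not in seen:
            seen.append(ch)
    return seen
```

Running it over each tweet yields the handful of characters that message contributes to the analysis; plain-English tweets contribute nothing.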
Using all the cases where there were two non-keyboard characters in the same message, I assembled the following graph.
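Counting those pairs could look roughly like the sketch below. The function name and the ASCII cutoff are assumptions, and each pair is counted at most once per message, which may differ from the counting actually used.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(messages):
    """Count how many messages contain each pair of non-ASCII characters.

    Pairs are stored as (a, b) with a < b, so (x, y) and (y, x)
    land in the same bucket.
    """
    counts = Counter()
    for msg in messages:
        chars = sorted({ch for ch in msg if ord(ch) > 127 and not ch.isspace()})
        for a, b in combinations(chars, 2):
            counts[(a, b)] += 1
    return counts
```

Messages with fewer than two such characters produce no pairs, so they drop out of the graph on their own.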
Imagine tying a little rubber band between every pair of characters, as strong as the number of times they were seen hanging out together; also, give every character the desire for a bit of personal space so they don’t just pile on top of each other. It’s a super-simple model that tools like Cytoscape or Gephi will do out-of-the-box.
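To make the rubber-band picture concrete, here is a toy version in plain Python. It is not the algorithm Cytoscape or Gephi run; the function name, the force formulas, and every constant are assumptions chosen only to illustrate the idea.

```python
import random

def spring_layout(weights, iterations=500, step=0.1, repulsion=0.1, seed=0):
    """Toy force-directed layout (illustrative, not a tool's real algorithm).

    weights maps an (a, b) pair to its co-occurrence count.
    Returns {node: (x, y)}.
    """
    if not weights:
        return {}
    rng = random.Random(seed)
    nodes = sorted({n for pair in weights for n in pair})
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    max_w = max(weights.values())
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Personal space: every pair pushes apart, more weakly with distance.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                push = repulsion / (dx * dx + dy * dy + 1e-9)
                force[a][0] += push * dx
                force[a][1] += push * dy
                force[b][0] -= push * dx
                force[b][1] -= push * dy
        # Rubber bands: co-occurring characters pull together,
        # in proportion to how often they were seen together.
        for (a, b), w in weights.items():
            pull = w / max_w
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            force[a][0] -= pull * dx
            force[a][1] -= pull * dy
            force[b][0] += pull * dx
            force[b][1] += pull * dy
        for n in nodes:
            fx, fy = force[n]
            # Cap each move so one large force cannot fling a node away.
            size = (fx * fx + fy * fy) ** 0.5
            scale = step if size * step <= 0.1 else 0.1 / size
            pos[n][0] += scale * fx
            pos[n][1] += scale * fy
    return {n: tuple(p) for n, p in pos.items()}
```

Feeding it pair counts such as `{("a", "b"): 10, ("b", "c"): 1}` should leave `a` and `b` sitting closer together than `b` and `c`. At 100-million-tweet scale the all-pairs repulsion loop would be far too slow, which is one reason to reach for a dedicated tool.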
That gave this picture (I left out the edges for clarity and hand-arranged the clusters at the bottom):
This “map” of the world — the composition of each island, and the arrangement of the large central archipelago — popped out of this super-simplistic model. It had no information about human languages other than “sometimes, when a person says 情報 they also say 猿.” Any time the data is this dense and connected, I’ve found it speaks for itself.
Now let’s look at the 3 Waypoints.
What We Knew: What I really mean by “knew” is “if this isn’t the case, I’m going to suspect my methods much more strongly than the results”:
- Most messages are in a single language, but there are some crossovers. After the fact, I colored each character by its “script” type from the Unicode standard (for example, Hangul is in cyan). As you can see, most of the clouds have a single color.
- Languages with large alphabets have tighter-bound clouds, because there are more “pairs” to find (e.g. the Hiragana character cloud is denser than the Arabic cloud).
- Languages with smaller representation don’t show up as strongly (e.g. there are not as many Malayalam tweeters as Russian (Cyrillic) tweeters).
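The script coloring and the single-language check can be approximated as follows. Python's standard `unicodedata` module does not expose the Unicode Script property, so this sketch uses the first word of each character's official name as a rough stand-in; that shortcut and both function names are assumptions, not the method used for the picture.

```python
import unicodedata

def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name.

    This is a crude proxy: letters give labels like LATIN or HANGUL,
    but symbols give their own name (the snowman comes back as SNOWMAN).
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def scripts_in(text):
    """Return the set of rough script labels for the non-ASCII characters."""
    return {rough_script(ch) for ch in text if ord(ch) > 127 and not ch.isspace()}
```

A message whose `scripts_in` result holds more than one label is a candidate crossover. A real analysis would want the actual Script property, available from the Unicode data files or a third-party library.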
What We Suspected:
First, about the clusters themselves:
- Characters from Latin scripts (the accented versions of the characters English speakers are familiar with) do indeed cluster together, and group within that cluster. Many languages use ö, but only subsets of them use Å or ß. You can see rough groups for Scandinavian, Romance and Eastern-European scripts.
- Japanese and Chinese are mashed together, because both use characters from the Han script.
Second, about the binds between languages. Clusters will arrange themselves in the large based on how many co-usages were found. A separated character dragged out in the open is especially interesting — somehow no single language “owns” that character.
Things we suspected about the connections:
- Nearby countries will show more “mashups”. Indeed, Greek and Cyrillic are tightly bound to each other, and loosely bound to European scripts; Korean has strong ties to European and Japanese/Chinese scripts. This initial assumption was partially incorrect though — Thai appears to have stronger ties to European than to Japanese/Chinese scripts.
- Punctuation, Math and Music are universal. Look closely and you’ll see the fringe of brownish characters pulled out into “international waters”.
What We Never Suspected in Advance: There were two standouts that slapped me in the face when taking a closer look.
The first is the island in the lower right, off the coast of Europe. It’s a bizarre menagerie of Amharic, International Phonetic Alphabet and other scripts. What’s going on? These are characters that, taken together, look like upside-down English text: “¡pnolɔ ǝɥʇ uı ɐʇɐp ƃıq”. (Try it out yourself: http://www.revfad.com/flip.html) My friend Steve Watt’s reaction was, “so you’re saying that within the complexity of the designed-for-robots Unicode standard, people found some novel, human way to communicate? Enterprises and Three Letter Agencies dedicate tons of resources for such findings”.
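The trick behind upside-down text is just a lookup table plus a reversal. The sketch below uses a small hand-picked mapping that covers only the letters in the example phrase; it is an assumption for illustration, not the table any particular flipping site uses.

```python
# Hypothetical, partial mapping from ASCII to look-alike "flipped" characters.
FLIP = {
    "a": "\u0250",  # LATIN SMALL LETTER TURNED A
    "b": "q",
    "c": "\u0254",  # LATIN SMALL LETTER OPEN O
    "d": "p",
    "e": "\u01dd",  # LATIN SMALL LETTER TURNED E
    "g": "\u0183",  # LATIN SMALL LETTER B WITH TOPBAR
    "h": "\u0265",  # LATIN SMALL LETTER TURNED H
    "i": "\u0131",  # LATIN SMALL LETTER DOTLESS I
    "n": "u",
    "t": "\u0287",  # LATIN SMALL LETTER TURNED T
    "u": "n",
    "!": "\u00a1",  # INVERTED EXCLAMATION MARK
}

def flip(text):
    """Reverse the text and swap each character for its flipped look-alike.

    Characters missing from the table (such as "l", "o" and spaces)
    are passed through unchanged.
    """
    return "".join(FLIP.get(ch, ch) for ch in reversed(text.lower()))
```

Calling `flip("big data in the cloud!")` reproduces the shape of the phrase quoted above, and it shows why such oddly unrelated code points keep turning up in the same messages.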
As soon as you’ve found a new question within your answers you’ve reached Waypoint 3 — a good sign for confidence in your results.
However, my favorite is the one single blue (Katakana) character that every language binds to (see close-up below). Why is Unicode code point U+30C4, the Katakana “Tsu” character, so fascinating?
Because ツ looks like a smiley face.
The common bond across all of humanity is a smile.