- December 1, 2011
Here’s a problem that’s harder than it seems: Where is Paris? Any simple response proves more ambiguous and brittle than you would expect. But across an ocean of data lies a new way to discover answers, one that accommodates complexity because it is sourced in complexity.
A reasonable answer is “the political boundary of the city of Paris, France”. If you were walking through France, and were prone to making silly graphs, you might draw this plot of “How much am I in Paris?”
You’d be in not-Paris, hit the city limits, and immediately be in-Paris. No ambiguity.
No ambiguity, unless you instead meant any of Paris’ other defined statistical areas: its aire urbaine (metropolitan area), pôle urbain (urban area) or couronne périurbaine (commuter belt). “Paris” is a perfectly reasonable reply to the question “where are you from?”, even if you live in Versailles, miles outside Paris’ formal boundary.
Or keep getting your Jules Verne on straight through Paris, France and across the globe. You’ll eventually wind up in Paris, Texas, population: 26,000. In other directions, you’ll find Parises in Ontario, Kentucky, Denmark, and 20 others scattered across the globe.
Each of the Parises above is decidedly objective: an unambiguous boundary defined by a central authority. However, insisting on a single ‘correct’ answer leads to a brittle notion of truth, one that poorly accommodates context.
With the information age fully upon us, there’s another way to answer the question “Where is Paris?”, and it has remarkable implications for how we master these massive information streams.
To start, take 100 million Twitter users and note their time zone, the text they entered for “location”, and (where available) the geolocation of their tweets. Here’s what the “fraction of people claiming to be from Paris” would look like as you traveled across the globe:
What you get is a noisy collection of people:
- from Paris, Texas;
- from Paris, France;
- from Paris, but living in New York;
- living in New York, but wishing they lived in Paris;
- … and so on.
What’s worse, we don’t even have an answer to the question “Where is Paris?”: just a fuzzy, skewed answer to the question “When Twitter users say they are from Paris, where do they spend their time?” Formerly, we would look at this as noise — as a problem brought on by having too much data.
The solution to a “too much data” problem is actually more data.
Connecting the Twitter data to the population density of every spot on the globe will roughly correct for the skewed distribution of Twitter users, giving you a proxy for “When people say they are from Paris, where do they spend their time?”
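One way to read that correction, as a rough sketch rather than a description of any real pipeline: take the claim rate among Twitter users in each place, scale it up by the number of people who live there, and normalize. The function name, the grid-cell ids, and every number below are made up for illustration.

```python
def reweight_by_population(claimants, users, population):
    """Estimate where Paris-claimers are, corrected for uneven Twitter use.

    claimants:  cell id -> Twitter users in that cell who claim "Paris"
    users:      cell id -> all Twitter users observed in that cell
    population: cell id -> residents of that cell
    Returns cell id -> share of all estimated Paris-claimers in that cell.
    """
    estimated = {}
    for cell, count in claimants.items():
        if users.get(cell, 0) > 0:
            # Claim rate among Twitter users, scaled up to the whole population.
            estimated[cell] = count / users[cell] * population.get(cell, 0)
    total = sum(estimated.values())
    if total == 0:
        return {}
    return {cell: value / total for cell, value in estimated.items()}


# Hypothetical counts, purely illustrative.
claimants = {"paris_fr": 4_000, "paris_tx": 30, "new_york": 500}
users = {"paris_fr": 100_000, "paris_tx": 300, "new_york": 500_000}
population = {"paris_fr": 2_200_000, "paris_tx": 26_000, "new_york": 8_200_000}

shares = reweight_by_population(claimants, users, population)
for cell, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{cell}: {share:.3f}")
```

The result is still only a proxy: it assumes Twitter users in a place talk about Paris the way their neighbors do, which is exactly the kind of skew the next data source has to help with.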
Next, take every geotagged photo labelled “Paris” on Flickr.com. This data is even more skewed, containing only photogenic places visited by advanced users.
But here’s the key: where it’s “wrong”, it’s wrong in a different way.
Let’s keep going down this path. Pull in data from Facebook, Foursquare, LinkedIn and the rest. Besides just using users’ locations, pull in all geolocated content mentioning Paris: “Where are people when they talk about Paris?”.
Enough social media. Take the press release newswire, and tie every mention of “Paris” back to the company’s address. Use the defined place names mentioned at the top, weighted by population density. Digest a petabyte of newspaper articles and books to identify other defined place names that co-occur with the term “Paris”.
Each of these sources reflects a different layer of human activity tied to the concept of Paris-ness. Where the layers overlap, they reinforce one another; where they do not, it shows which aspect is missing.
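To make the overlap idea concrete, here is a minimal sketch. It assumes each source has already been boiled down to a score between 0 and 1 per place; the source names, places, scores, and the `threshold` parameter are all hypothetical.

```python
def combine_layers(layers, threshold=0.1):
    """Merge per-source 'Paris-ness' maps into one summary per cell.

    layers: dict mapping a source name to {cell id: score in [0, 1]}
    Returns {cell id: (mean score over all sources, sources at or above threshold)}.
    A source with no entry for a cell counts as a score of 0 there.
    """
    cells = set()
    for scores in layers.values():
        cells.update(scores)
    summary = {}
    for cell in cells:
        values = {name: scores.get(cell, 0.0) for name, scores in layers.items()}
        mean = sum(values.values()) / len(layers)
        supporters = sorted(name for name, v in values.items() if v >= threshold)
        summary[cell] = (mean, supporters)
    return summary


# Made-up scores: each source is "wrong" in its own way.
layers = {
    "twitter": {"paris_fr": 0.9, "new_york": 0.3, "paris_tx": 0.2},
    "flickr": {"paris_fr": 0.95, "las_vegas": 0.4},
    "newswire": {"paris_fr": 0.8, "paris_tx": 0.1},
}

summary = combine_layers(layers)
for cell, (mean, supporters) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{cell}: mean={mean:.2f} supported by {supporters}")
```

A place that only one source vouches for keeps a low mean and a short supporter list, so a quirk of one source does not dominate, while the list itself records which layer of activity is present and which is absent.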
At some point, the signal is strong enough that we have to stop regarding it as messy and skewed, and start focusing on the fact that it’s organic. It’s not Paris as a brittle on/off switch; it’s Paris as our human condition defines it. Because it’s fuzzy and complicated, it adapts to context. If I know I’m talking about business, or social interaction, or have just seen a nearby place name, I can use that as a hint for how much confidence to assign to each source of truth.
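Context as a hint could be as simple as a different set of weights per context. This is only a sketch under that assumption: the contexts, weights, places, and scores are invented, and a real system would presumably learn them rather than hard-code them.

```python
# Hypothetical weights: how much to trust each source in a given context.
CONTEXT_WEIGHTS = {
    "business": {"newswire": 0.7, "twitter": 0.2, "flickr": 0.1},
    "social": {"newswire": 0.1, "twitter": 0.6, "flickr": 0.3},
}


def paris_ness(cell, layers, context):
    """Weighted average of each source's score for a cell, given a context.

    layers: dict mapping a source name to {cell id: score in [0, 1]}
    Sources or cells that are missing contribute a score of 0.
    """
    weights = CONTEXT_WEIGHTS[context]
    total = sum(weights.values())
    weighted = sum(
        weight * layers.get(source, {}).get(cell, 0.0)
        for source, weight in weights.items()
    )
    return weighted / total


# Made-up scores for two places near Paris, France.
layers = {
    "newswire": {"la_defense": 0.9, "montmartre": 0.2},
    "twitter": {"la_defense": 0.3, "montmartre": 0.7},
    "flickr": {"la_defense": 0.1, "montmartre": 0.9},
}

for context in CONTEXT_WEIGHTS:
    for cell in ("la_defense", "montmartre"):
        print(f"{context:8s} {cell:11s} {paris_ness(cell, layers, context):.2f}")
```

With these invented numbers, the same two places swap rank depending on whether the conversation is about business or social life, which is the behavior a single on/off boundary cannot give you.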
So: truth is overrated, Paris is beautiful wherever it is, and the solution to too much data is more data.