Social Media Schema Mapping: Increasing the Power of Data

Infochimps recently developed a unified system for six different social media schemas from Gnip and Moreover. Gnip normalizes data from Facebook, Twitter, and YouTube into Activity Streams. Moreover feeds of forums, blogs, and news reports are normalized as XML in the Atom Syndication Format. In this case study, I’ll illustrate that big data is not only a matter of terabytes of information; it also comes in a variety of structures and formats.

Research and case studies on integrating data and databases consistently run into problems with schema matching. Schema matching is the process of mapping fields that share the same properties to one another. Even though the process can be automated, optimal results require thoughtful human arbitration. For example, consider the three raw feed snippets below, and how we merged them and reconciled their similarities and differences.

Raw Feeds:

moreover

<id>http://c.moreover.com/blog-1000</id>
<title>The Data Era-Moving from 1.0 to 2.0</title>
<author><name>Infochimps Blog</name><url>http://blog.infochimps.com</url></author>
<link rel="alternate" href="http://c.moreover.com/blog-1000"/>
<summary>…I describe it as Big Data 1.0 versus Big Data 2.0.</summary>
<modified>2012-08-28T20:23:00Z</modified>
<issued>2012-08-28T20:23:00Z</issued>

twitter

{"id"=>"tag:search.twitter.com,2005:220000000",
"objectType"=>"activity",
"verb"=>"post",
"postedTime"=>"2012-08-16T22:12:24.000Z",
"provider"=>{"objectType"=>"service","displayName"=>"Twitter",
"link"=>"http://www.twitter.com"},
"link"=>"http://twitter.com/infochimps/statuses/2200000000000000000",
"body"=>"The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
"object"=>{"objectType"=>"note",
"id"=>"object:search.twitter.com,2005:220000000",
"summary"=>"The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
"link"=>"http://twitter.com/infochimps/statuses/220000000"

facebook

<id>50000_30000000</id>
<created>2012-07-27T21:29:13+00:00</created>
<published>2012-07-27T21:29:13+00:00</published>
<updated>2012-07-27T21:29:43+00:00</updated>
<title>Infochimps posted a bookmark to Facebook</title>
<category term="BookmarkPosted" label="Bookmark Posted"/>
<link rel="alternate" type="html" href="http://www.facebook.com/50000/posts/30000000"/>
<service:provider>
<name>Facebook</name>
<uri>www.facebook.com</uri>
<icon/>
</service:provider>
<activity:object>
<activity:object-type>http://activitystrea.ms/schema/1.0/bookmark</activity:object-type>
<id>50000_30000000</id>
<title>Welcome Jim Kaskade, Infochimps’ new CEO</title>
<subtitle>infochim.ps</subtitle>
<content>Our vision for Infochimps leverages the power of Big Data….</content>
<summary>It’s official! Welcome Jim Kaskade, Infochimps’ new CEO…</summary>
<link rel="alternate" type="html" href="http://www.facebook.com/50000/posts/30000000"/>
</activity:object>

Looking at the snippets above, a computer would most likely match the title in Moreover and Facebook to the title in schema.org. This seems like the right thing to do, right? No, it’s wrong. The mapping chart below and the snippets above illustrate the heart of the mapping process: taking raw data and making sense of it.

This is the kind of craziness you might encounter:

  • In Moreover, the title holds the name of the blog entry: “The Data Era-Moving from 1.0 to 2.0”
  • In Facebook,
    • The top-level “title” is the name of the activity: “Infochimps posted a bookmark to Facebook”, “Infochimps posted a note to Facebook”, or “Infochimps posted a photo to Facebook”
    • If someone posted a link, the “title” one level down (in Activity:Object.title) is the name of the link, “Welcome Jim Kaskade, Infochimps’ new CEO”; the case is different for a photo and for a note.
  • Meanwhile in the Twitter-ville stream, the idea of a “title” does not even exist
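
One way to capture these per-source rules is a small lookup function. The sketch below is only an illustration: `unified_title` and `BOOKMARK_TYPE` are hypothetical names, and it assumes each raw record has already been parsed into a Ruby hash keyed the way the snippets above are written. Returning an empty string for Facebook photos and notes is a placeholder assumption, since those cases need their own rules.

```ruby
# Hypothetical helper (not from the original pipeline): choose the value that
# should become the unified "title" for a parsed record from each source.
BOOKMARK_TYPE = "http://activitystrea.ms/schema/1.0/bookmark"

def unified_title(source, record)
  case source
  when :moreover
    # The Atom <title> really is the name of the blog entry.
    record["title"].to_s
  when :facebook
    object = record["activity:object"] || {}
    # The top-level title only names the activity, so look one level down.
    # Only the bookmark case is handled; photos and notes fall through to "".
    object["activity:object-type"] == BOOKMARK_TYPE ? object["title"].to_s : ""
  when :twitter
    # Tweets have no notion of a title.
    ""
  else
    raise ArgumentError, "unknown source: #{source.inspect}"
  end
end

facebook_record = {
  "title" => "Infochimps posted a bookmark to Facebook",
  "activity:object" => {
    "activity:object-type" => BOOKMARK_TYPE,
    "title" => "Welcome Jim Kaskade, Infochimps’ new CEO"
  }
}

puts unified_title(:facebook, facebook_record)
```

The point of writing the rule down per source is that the decision stays visible and reviewable, which is where the human arbitration comes in.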

Mapping Chart
[Image: mapping chart]

Unified Schema:

moreover

"id"=>"http://c.moreover.com/blog-1000",
"name"=>"",
"description"=>"",
"date_published"=>"2012-08-28T20:23:00Z",
"title"=>"The Data Era-Moving from 1.0 to 2.0",
"link"=>"http://c.moreover.com/blog-1000",
"text"=>"…I describe it as Big Data 1.0 versus Big Data 2.0.",
"provider"=>"Infochimps Blog",
"author"=>{"name"=>"", "url"=>""},

twitter

"id"=>"tag:search.twitter.com,2005:22000000",
"name"=>"twitter_activity",
"description"=>"",
"date_published"=>"2012-08-28T22:12:24.000Z",
"title"=>"",
"link"=>"http://twitter.com/infochimps/statuses/22000000",
"text"=>"The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
"provider"=>{"name"=>"Twitter", "url"=>"http://www.twitter.com"},
"author"=>{"name"=>"Infochimps", "url"=>"http://www.twitter.com/infochimps"}

facebook

"id"=>"50000_30000000",
"name"=>"bookmarkposted",
"description"=>"Our vision for Infochimps leverages the power of Big Data…",
"date_published"=>"2012-07-27T21:29:13+00:00",
"title"=>"Welcome Jim Kaskade, Infochimps’ new CEO",
"link"=>"http://www.facebook.com/50000/posts/30000000",
"text"=>"It’s official! Welcome Jim Kaskade, Infochimps’ new CEO…",
"provider"=>{"name"=>"Facebook", "url"=>"http://www.facebook.com"},
"author"=>{"name"=>"Infochimps", "url"=>"https://www.facebook.com/infochimps"}

To create the unified schema, I followed the vocabulary and structure for CreativeWork from schema.org. The six feeds were molded around those properties, harking back to another project I worked on, the Infochimps Simple Schema (ICSS). ICSS was specifically developed to integrate different types of data such as Twitter, Foursquare, weather data, and Wikipedia. After matching the fields, I omitted redundant data that would hinder the formation of a streamlined schema.
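
To make the matching and pruning concrete, here is a minimal sketch of how the Twitter activity could be reshaped into the unified record shown above. `unify_twitter` is a hypothetical name, not part of ICSS or Gorillib, and the field pairings are read off the two Twitter snippets in this post. The author field is left out because the raw snippet above does not show where it comes from.

```ruby
# Sketch only: reshape a parsed Twitter activity (a Ruby hash like the raw
# snippet above) into the unified field names. Not the production mapping.
def unify_twitter(raw)
  provider = raw.fetch("provider", {})
  {
    "id"             => raw["id"].to_s,
    "name"           => "twitter_activity",
    "description"    => "",
    "date_published" => raw["postedTime"].to_s,
    "title"          => "",                 # tweets have no title
    "link"           => raw["link"].to_s,
    "text"           => raw["body"].to_s,
    "provider"       => {
      "name" => provider["displayName"].to_s,
      "url"  => provider["link"].to_s
    }
  }
end

raw_tweet = {
  "id"         => "tag:search.twitter.com,2005:220000000",
  "objectType" => "activity",
  "verb"       => "post",
  "postedTime" => "2012-08-16T22:12:24.000Z",
  "provider"   => { "objectType" => "service", "displayName" => "Twitter",
                    "link" => "http://www.twitter.com" },
  "link"       => "http://twitter.com/infochimps/statuses/2200000000000000000",
  "body"       => "The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm"
}

unified = unify_twitter(raw_tweet)
puts unified["text"]
```

Fields such as `verb` and `objectType` are simply not copied, which is the "omit redundant data" step in miniature.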

In addition to the semantic unification, there was the syntactic unification. We found JSON to be the best lingua franca for data exchange. Some of the data was XML-based, which usually implies complex processing. In our case it went relatively fast, not only because of our tools but also because of the Moreover and Gnip structures. Thanks to their tidy schemas, we were able to use a simpler library: in Ruby, we use Crack, and anything in the XML::Simple family would work. With gorillib/model, available through the Gorillib library, my life was easier: raw documents became active, intelligent code objects instead of passive bags of data.
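
As a rough illustration of the XML-to-JSON step, the sketch below turns a tidy Atom-style entry into a plain hash and then into JSON. It deliberately uses REXML and `json`, which ship with a standard Ruby install, as a stand-in for Crack so that it runs without extra gems. The `element_to_hash` helper and the `<entry>` wrapper are assumptions made for the example, and the helper ignores attributes and repeated elements, which is only reasonable for schemas as tidy as these.

```ruby
require "rexml/document"
require "json"

# Hypothetical helper: recursively convert an element into a Hash of its
# child elements, or into a String when it is a leaf. Attributes are ignored
# and a repeated child name overwrites the earlier one.
def element_to_hash(element)
  return element.text.to_s.strip unless element.has_elements?

  element.elements.each_with_object({}) do |child, hash|
    hash[child.name] = element_to_hash(child)
  end
end

# The <entry> root is assumed here; the snippet in this post shows only children.
xml = <<~XML
  <entry>
    <id>http://c.moreover.com/blog-1000</id>
    <title>The Data Era-Moving from 1.0 to 2.0</title>
    <author><name>Infochimps Blog</name><url>http://blog.infochimps.com</url></author>
  </entry>
XML

entry = element_to_hash(REXML::Document.new(xml).root)
puts JSON.pretty_generate(entry)
```

Once every feed is a plain hash like this, the semantic mapping no longer has to care whether a record started life as XML or JSON.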

This case study illustrates how easily the value of data can get lost when working with diverse data sources. Most importantly, it highlights the benefits of solving the inherent challenges, and the variety of tools and expertise necessary to do so. Merging six different schemas into one semantically consistent structure dramatically increases the power of data. When data is unified, effective data integration and processing become possible. A recent blog post by our CEO, Jim Kaskade, further highlights the advantages of unifying and integrating data: Big Data Means Leveraging All Customer Channels.
