Monthly Archives September 2012

Technical White Paper: Big Insights from Big Data

Curious about the leading technology behind the Infochimps™ Platform?

Download our free technical white paper and gain big insights from big data.

Infochimps Platform Technical Overview
The Infochimps Platform is an integrated solution set that makes it easy, fast and simple to perform big data analytics and create big data applications. It’s a collection of open source and proprietary software for big data processing, data collection and integration, data storage, data analysis and visualization, and infrastructure management. Coupled with our expert team and a revolutionary approach to tying it all together, we help you accelerate your big data projects.

This technical overview will explore in more detail these key areas:

Data Delivery Service™

  • Collect Data
  • Perform Stream Processing with Decorators

Data Management

  • Query Data and Build Applications

Cloud Hadoop

  • Perform Hadoop Processing

Download the white paper here to take a deep dive into the leading technology behind the Infochimps Platform.


5 Questions Not Every CEO Would Answer: Meet Jim Kaskade

As I’m sure you’ve heard, Jim Kaskade is the new Infochimps CEO. You’ve read about his vision for the company and his passion and experience in Big Data. But do you know him on a personal level? See the following interview questions and get a dose of the real Jim Kaskade.

1) What brought you to Infochimps?

The people.  My first question to Joseph Kelly was about the Infochimps culture and what made it so special. The value of a company is its people. Without an A team, even the best vision cannot successfully execute.  I loved all the little things. From the data mine behind closed doors, to the significance behind the Infochimps name itself – the infinite monkey theorem, everything added up to a winning culture.

2) What are your plans for Infochimps’ future?

I want Infochimps to leave a legacy. I want us to make a huge impact in the data infrastructure space and become a key player in the infrastructure transformation with our big data platform. Infochimps will make our customers’ lives easier by expanding infrastructure capabilities to Fortune 500 companies.

3) Tell me more about the recurring theme of “no more data scientists.” What are your thoughts on this controversial statement? Can you elaborate more on this concept?

The “no more data scientists” position is not meant to be literal, but instead is meant to challenge the status quo. What would you do if you didn’t have a data scientist, an 18-person IT department, or your smart statisticians? Those are the “what if” questions we’re trying to ask here at Infochimps. We’re not trying to replace data scientists; we’re trying to make their job easier. If we could make it easier for data scientists to achieve gold nuggets of brilliance and seamlessly put them into a process where we can accelerate development, doesn’t everyone win? What brings organizations together are integrated solutions and out-of-the-box thinking like we’re offering at Infochimps. That’s what “no data scientists” truly means: creating more data-centric people all the way from IT to the CEO.

4) How would you describe your leadership style?

Empowerment. I define a good leader as someone who teaches people how to maximize their strengths and empowers them to do the best they can. My job is to make everyone in the company successful, which translates into mentoring, challenging, and magnifying their strengths. Like any good CEO, I make it a personal goal to help set the strategy, help create the vision, hire people smarter than all of us, focus on removing the obstacles, and help us all execute.

What’s my mantra? Work hard, play hard. It seems cliché, but it’s the simple truth. If we’re not having fun, why are we doing it? I believe a company should work cohesively as a team to reach a common goal, overcome weaknesses, and help each other excel to reach the next level.

5) Personal level: What are your personal ethics and how are they reflected in your work?

I am a “glass half full” kind of person. I have 2 young boys and I teach them to be curious, to always ask questions, and know there’s nothing they can’t accomplish. Key values I bring from home to work are: you can’t fly through life solo, you need people in your life you can trust; make an effort to have a mentor. If you get caught up trying to solve every problem on your own, you’ll take longer, fail harder, and then be lonely.

There are 3 men I’ve learned to respect most in my career:

    1. Art Collmeyer, the Founder of iWatt, used to say, “You gotta die before you go to heaven. It’s hard work, but if it was easy, everyone would be doing it; so suck it up.”  
    2. Bob Adams, known for heading up Xerox Ventures and thought leader in disruptive technologies, is a man of common sense. Some startups get caught up, ignore the facts and lose common sense; they don’t respond fast enough to things that aren’t working. Bob would say, “Black is black, a spade is a spade. If it’s not working, acknowledge it, fix it, and don’t ignore it with the hope it goes away.”
    3. Jack Shemer, the Founder of Teradata and the most important person in my career, had an appetite for going towards the seemingly impossible. He taught me everything about how important people are and why he puts “people in front of everything.” Jack is someone who has the softest heart, the strongest push, and mastered how to make things happen. I hope I will amount to a fraction of his success.

Thank you for sharing more about yourself, Jim. We are happy to have you on board!

Predictive Analytics, Business Intelligence, Data Scientist: Pick Your Summit

Does your job include responsibilities such as predictive analytics, data mining, business intelligence, or big data? Are you going to be in the Windy City this November?

Upcoming IE. Group summits in Chicago you might be interested in:

Predictive Analytics Innovation Summit: November 15 – 16, Chicago, 2012
Driving Business Success Through Predictive Data Analytics

The Predictive Analytics Innovation Summit brings the leaders and innovators from the industry together for an event acclaimed for its interactive format, combining keynote presentations, interactive breakout sessions and open discussion.

Modern businesses now have access to more data on customers than ever before; the challenge remains to identify patterns in this data to drive success. Investment in predictive analytics gives organizations the opportunity to gain insight from such a valuable resource, offering a crucial advantage over competitors.

Register Online Today!

Business Intelligence Innovation Summit: November 15 – 16, Chicago, 2012
Driving Business Success Through Innovative BI

The Business Intelligence Innovation Summit brings the leaders and innovators from the industry together for a summit acclaimed for its insight into business intelligence and analytics.

Effective business intelligence is central to business success. In the modern business environment, technological developments and the advances of globalization have created unparalleled opportunities for businesses to expand their markets. But new opportunity has opened the door to new challenges.

Register Online Today!

Chief Data Scientist Summit: November 14, Chicago, 2012
Driving Business Success Through Data Analytics

As organizations focus more energy into their analytics departments the demand for data scientists and skilled statisticians grows; but with a limited talent pool, those who already have the skills are being required to take the lead. As the industry develops and the need for more data scientists and understanding in new techniques increases, it becomes essential for the analytics community to share expertise and best practices in order to foster innovation.

The Chief Data Scientist Summit in Chicago offers the perfect environment for discussion and thought-sharing amongst industry experts, combining insightful keynote presentations and deep-dive roundtable discussions. This is the must-attend event for analytics and data science executives to gain and share insight in a rapidly evolving field.

Register Online Today!


5 CloudCon Sessions: Join Jim Kaskade

Are you attending CloudCon?

If you are, you should consider joining our CEO, Jim Kaskade, at one of his 5 speaking opportunities! Hope to see you there.

See the following schedule:

Closing Keynote by Jim Kaskade (CEO, Infochimps): Infinite Monkey Theorem
Tuesday October 2, 2012 | 5:00pm – 5:30pm @ Grand Ballroom
Acting on his passion for data, which began during a 10-year tenure at Teradata, Jim provides an energetic, inspiring, and practical perspective on why Big Data is disruptive. It’s more than historic data analyzed on Hadoop. It’s also more than real-time streaming data stored and queried using NoSQL.

Power Panel 1: Big Data Warehouse vs. Enterprise Data Warehouse
Wednesday October 3, 2012 | 10:30am – 11:00am @ Grand Ballroom
Traditional enterprise data warehousing and Hadoop/Big Data are like apples and oranges – the well-known and trusted approach is being challenged by a zesty newcomer. Is there room for both? How will these two very different approaches co-exist?

Panel Members:

Power Panel 2: Hadoop as a Service Moderated by: Jim Kaskade
Wednesday October 3, 2012 | 11:00am – 11:30am @ Ballroom West
With the power of the Hadoop technology stack, Fortune 1000 companies are rushing to establish their own Big Data presence. Part of this requires moving the Hadoop applications to the data, given the complexities associated with data governance as well as data volume. This is opening up opportunities for Top-tier datacenter providers to begin supporting Big Data as a Service or Hadoop as a Service.

Panel Members:

Power Panel 3: Where’s the Big Data Talent? Moderated by: Jim Kaskade
Wednesday October 3, 2012 | 11:30am – 12:00pm @ Grand Ballroom

Panel Members:

OpenStack: Next Generation Cloud Computing Platform Given by: Jim Kaskade
Wednesday October 3, 2012 | 2:00pm – 3:00pm @ Filmore
OpenStack is no doubt the next generation open source platform. However, the implications of choosing OpenStack go well beyond a software choice. This presentation will discuss the opportunities and go-to-market approach available to CSPs selecting OpenStack. It will also present the OpenStack Cloud Service Provider Life Cycle, a framework that takes a service provider from market assessment through implementation and on to market penetration.

CloudCon Expo & Conference brings you the opportunity to learn best practices and strategies for Cloud Deployment. A perfect event designed for IT professionals and decision makers looking to implement Cloud Technology to achieve benefits like Reliability, Adaptability and Cost Reduction.


Intel + Infochimps: Intel Developer Forum 2012

Last week, Dhruv Bansal and I attended the Intel® Developer Forum (IDF) 2012 in San Francisco. IDF brings together Intel’s ecosystem of partners, vendors, and customers to showcase the newest Intel creations and bleeding-edge applications of Intel tech.

Some mind-blowing demonstrations included the next-gen intelligent home concept where appliances, electronics, and home infrastructure all communicate with one another, dynamic digital signage systems that cater to specific demographics with targeted digital advertising, and more.

Dhruv Bansal and Intel’s Clive D’Souza gave a great in-depth presentation entitled “Taming the Big Data Tsunami Using Intel Architecture” on the collision between Big Data software and top-of-the-line datacenter hardware. Dhruv walked through a demo of the Infochimps Platform software running on Intel's Romley architecture, then Clive led a detailed walk-through of the hardware aspects.

Infochimps and Intel shared a partner booth showcasing Infochimps software married with Intel server architecture. In one of the photos below, you can see Dhruv standing next to the Romley server chassis! Overall, the conference provided great conversations, an opportunity for software guys to get a healthy dose of hardware, and a chance to show off our Big Data Platform.

Interested in learning more about Intel+Infochimps Big Data solutions? Reach out, we’d love to chat!

Big Data as a Service at DataWeek

Every Fortune 500 company is attempting to understand the transformation occurring in the datacenter through the deployment of new technologies like NoSQL, NewSQL, and Hadoop. Learn how to source, implement, or monetize a data stream from executives in the Data-as-a-Service space.

Join us for our panel discussion at DataWeek 2012 in San Francisco:
Data-as-a-Service Panel
Tuesday, Sept 25 11:00am – 12:00pm | 111 Minna Gallery, Room 1, The SPUR Urban Center

Learn how these technologies are making enterprise data more accessible and making “data as a service” a reality. Come and hear how business analysts, application developers, and data scientists can advance their efforts by leveraging in-house or cloud-based data infrastructure as a service with the following panel:

Moderator: Roger Magoulas, Director of Market Research, O’Reilly Media
Speaker: Derrick Harris, Cloud Editor and Senior Writer, GigaOM
Speaker: Kerem Tomak, VP of Marketing Analytics, Macys.com
Speaker: Mike Olson, CEO, Cloudera
Speaker: Ron Bodkin, Founder & CEO, Think Big Analytics
Speaker: Dhruv Bansal, Co-Founder & CSO, Infochimps

DataWeek 2012 Conference & Festival September 22nd – 27th in San Francisco is shaping up to be the largest SF-based data conference & festival including over 100 workshops and talks, an AngelHack Big Data hackathon,  SF Beta :: DataWeek Edition, and events throughout the week.

Get 25% off your DataWeek pass using the discount code “dw2012partner” by registering here.


(Video) High Speed Retail Analytics: Courtesy of a New Approach to Big Data

Our Chief Science Officer, Dhruv Bansal, recently presented a webcast on high-speed retail analytics. Retailers who update competitive prices at the product level know the benefit of high-speed analytics: dramatically increased bottom lines. However, the benefit comes at a cost: consuming huge amounts of data and processing it in real time.

In this webcast, you’ll learn how retailers can leverage their own Big Data – enabling them to go from data sources to increasing profits, margins and market share – in a fraction of the time expected, and at a fraction of the cost.  Watch now and learn more about how the Infochimps Platform enables retailers to:

  • Make real-time, data driven decisions on all retail metrics
  • Ingest and process data sources from pricing, social, marketing, sales, support, etc.
  • Process, amend and augment data in real-time for high-speed analytics
  • Increase profits, margins and market share

Social Media Schema Mapping: Increasing the Power of Data

Infochimps recently developed a unified system for six different social media schemas from Gnip and Moreover. Gnip normalizes data from Facebook, Twitter, and YouTube into Activity Streams. Moreover feeds of forums, blogs, and news reports are normalized as XML in the Atom Syndication Format. In this case study, I’ll illustrate that big data is not only composed of terabytes of information, but also comes in a variety of structures and formats.

In research and case studies chronicling the integration of data and databases, problems with schema matching are consistently encountered. Schema matching is the process of mapping fields that share the same properties to one another. Even though the process can be automated, optimal results require thoughtful human arbitration. For example, take the integration of the following three raw feed snippets, and how we merged them and reconciled their similarities and differences.

Raw Feeds:

moreover

<id>http://c.moreover.com/blog-1000</id>
<title>The Data Era-Moving from 1.0 to 2.0</title>
<author><name>Infochimps Blog</name><url>http://blog.infochimps.com</url></author>
<link rel="alternate" href="http://c.moreover.com/blog-1000"/>
<summary>…I describe it as Big Data 1.0 versus Big Data 2.0.</summary>
<modified>2012-08-28T20:23:00Z</modified>
<issued>2012-08-28T20:23:00Z</issued>

twitter

{"id"=>"tag:search.twitter.com,2005:220000000",
"objectType"=>"activity",
"verb"=>"post",
"postedTime"=>"2012-08-16T22:12:24.000Z",
"provider"=>{"objectType"=>"service","displayName"=>"Twitter",
"link"=>"http://www.twitter.com"},
"link"=>"http://twitter.com/infochimps/statuses/2200000000000000000",
"body"=>"The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
"object"=>{"objectType"=>"note",
"id"=>"object:search.twitter.com,2005:220000000",
"summary"=>"The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
"link"=>"http://twitter.com/infochimps/statuses/220000000"}}

facebook

<id>50000_30000000</id>
<created>2012-07-27T21:29:13+00:00</created>
<published>2012-07-27T21:29:13+00:00</published>
<updated>2012-07-27T21:29:43+00:00</updated>
<title>Infochimps posted a bookmark to Facebook</title>
<category term="BookmarkPosted" label="Bookmark Posted"/>
<link rel="alternate" type="html" href="http://www.facebook.com/50000/posts/30000000"/>
<service:provider>
<name>Facebook</name>
<uri>www.facebook.com</uri>
<icon/>
</service:provider>
<activity:object>
<activity:object-type>http://activitystrea.ms/schema/1.0/bookmark</activity:object-type>
<id>50000_30000000</id>
<title>Welcome Jim Kaskade, Infochimps’ new CEO</title>
<subtitle>infochim.ps</subtitle>
<content>Our vision for Infochimps leverages the power of Big Data….</content>
<summary>It’s official! Welcome Jim Kaskade, Infochimps’ new CEO…</summary>
<link rel="alternate" type="html" href="http://www.facebook.com/50000/posts/30000000"/>
</activity:object>

Looking at the snippets above, a computer would most likely match the title in Moreover and Facebook to the title in schema.org. This seems like the right thing to do, right? No, it’s wrong. The Mapping chart below and the snippets above illustrate the heart of the mapping process: taking raw data and making sense of it.  

This is the kind of craziness you might encounter:

  • In Moreover, the title holds the name of the blog entry: “The Data Era-Moving from 1.0 to 2.0”
  • In Facebook,
    • The top-level “title” is the name of the activity: “Infochimps posted a bookmark to Facebook”, “Infochimps posted a note to Facebook”, or “Infochimps posted a photo to Facebook”
    • If someone posted a link, the “title” one level down (in Activity:Object.title) is the name of the link, “Welcome Jim Kaskade, Infochimps’ new CEO”; the case is different for a photo and for a note.
  • Meanwhile in the Twitter-ville stream, the idea of a “title” does not even exist
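
The per-source rules in the list above can be written down as a small lookup. This is only a sketch: the helper name `unified_title` is hypothetical, and it assumes each feed has already been parsed into a Ruby hash in which the nested Facebook element appears under the key `"activity:object"`. It covers the bookmark case only; photos and notes would need their own branches.

```ruby
# Hypothetical helper: choose the value that belongs in the unified "title"
# field, depending on which feed the record came from.
def unified_title(source, record)
  case source
  when :moreover
    # The top-level title is the name of the blog entry.
    record["title"].to_s
  when :facebook
    # The top-level title only names the activity ("... posted a bookmark"),
    # so use the title one level down. Assumed key name: "activity:object".
    record.dig("activity:object", "title").to_s
  when :twitter
    # Tweets have no notion of a title.
    ""
  else
    raise ArgumentError, "unknown source: #{source.inspect}"
  end
end

moreover = { "title" => "The Data Era-Moving from 1.0 to 2.0" }
facebook = {
  "title" => "Infochimps posted a bookmark to Facebook",
  "activity:object" => { "title" => "Welcome Jim Kaskade, Infochimps’ new CEO" }
}
twitter = { "body" => "The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm" }

puts unified_title(:moreover, moreover)
puts unified_title(:facebook, facebook)
puts unified_title(:twitter, twitter).empty?
```

Keeping the rule per source, rather than matching on the field name alone, is what avoids the naive title-to-title match described earlier.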

Mapping Chart
Unified Schema:

moreover

"id"=>"http://c.moreover.com/blog-1000",
"name"=>"",
"description"=>"",
"date_published"=>"2012-08-28T20:23:00Z",
"title"=>"The Data Era-Moving from 1.0 to 2.0",
"link"=>"http://c.moreover.com/blog-1000",
"text"=>"…I describe it as Big Data 1.0 versus Big Data 2.0.",
"provider"=>"Infochimps Blog",
"author"=>{"name"=>"", "url"=>""}

twitter

"id"=>"tag:search.twitter.com,2005:22000000",
"name"=>"twitter_activity",
"description"=>"",
"date_published"=>"2012-08-28T22:12:24.000Z",
"title"=>"",
"link"=>"http://twitter.com/infochimps/statuses/22000000",
"text"=>"The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm",
"provider"=>{"name"=>"Twitter", "url"=>"http://www.twitter.com"},
"author"=>{"name"=>"Infochimps", "url"=>"http://www.twitter.com/infochimps"}

facebook

"id"=>"50000_30000000",
"name"=>"bookmarkposted",
"description"=>"Our vision for Infochimps leverages the power of Big Data…",
"date_published"=>"2012-07-27T21:29:13+00:00",
"title"=>"Welcome Jim Kaskade, Infochimps’ new CEO",
"link"=>"http://www.facebook.com/50000/posts/30000000",
"text"=>"It’s official! Welcome Jim Kaskade, Infochimps’ new CEO…",
"provider"=>{"name"=>"Facebook", "url"=>"http://www.facebook.com"},
"author"=>{"name"=>"Infochimps", "url"=>"https://www.facebook.com/infochimps"}

To create the unified schema, I followed the vocabulary and structure for CreativeWork from schema.org.  The six feeds were molded around those properties, harking back to another project I worked on, the Infochimps Simple Schema (ICSS). ICSS was specifically developed to integrate different types of data such as Twitter, Foursquare, Weather data, and Wikipedia. After matching data, I omitted redundant data that would hinder the formation of a streamlined schema.
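
As an illustration of that mapping step, here is a minimal sketch that turns a parsed Twitter activity into the unified field set. The names `UNIFIED_FIELDS` and `unify_twitter` are hypothetical, not part of any Infochimps library, and the field choices simply follow the raw and unified Twitter snippets in this post. The raw snippet shows no author information, so the sketch leaves `author` blank rather than guessing where it comes from.

```ruby
# Field order follows the unified snippets shown earlier in this post.
UNIFIED_FIELDS = %w[
  id name description date_published title link text provider author
].freeze

# Hypothetical mapper from a Gnip-style Twitter activity hash to the
# unified schema. Redundant raw fields (objectType, verb, object) are dropped.
def unify_twitter(activity)
  provider = activity.fetch("provider", {})
  {
    "id"             => activity["id"],
    "name"           => "twitter_activity",
    "description"    => "",
    "date_published" => activity["postedTime"],
    "title"          => "", # tweets have no title
    "link"           => activity["link"],
    "text"           => activity["body"],
    "provider"       => { "name" => provider["displayName"], "url" => provider["link"] },
    "author"         => { "name" => "", "url" => "" } # not present in the raw snippet
  }
end

activity = {
  "id"         => "tag:search.twitter.com,2005:220000000",
  "objectType" => "activity",
  "verb"       => "post",
  "postedTime" => "2012-08-16T22:12:24.000Z",
  "provider"   => { "objectType" => "service", "displayName" => "Twitter",
                    "link" => "http://www.twitter.com" },
  "link"       => "http://twitter.com/infochimps/statuses/220000000",
  "body"       => "The Data Era – Moving from 1.0 to 2.0 http://bit.ly/SMGIMm"
}

unified = unify_twitter(activity)
puts unified.keys == UNIFIED_FIELDS
```

Each of the other feeds would get its own mapper that emits the same keys, so downstream code only ever sees one shape.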

Alongside the semantic unification came the syntactic unification. We found JSON to be the best lingua franca for data exchange. Some of the data was XML-based, which usually implies more complex processing. Here it was relatively fast, not only because of our tools but also because of the Moreover and Gnip structures. Thanks to their tidy schemas, we were able to use a simpler library: in Ruby, we use Crack, and anything in the XML::Simple family would work. With gorillib/model, available through the Gorillib library, my life was easier: it turns raw documents into active, intelligent code objects instead of passive bags of data.
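
To make the "code objects instead of bags of data" idea concrete without depending on any gem, here is a rough sketch using only Ruby's standard library. It does not use gorillib/model or Crack; the `Post` struct, its `titled?` method, and `post_from_json` are hypothetical stand-ins that show the general pattern of JSON in, typed object out.

```ruby
require "json"

# Hypothetical model for a subset of the unified fields. Unlike a bare hash,
# it rejects unknown fields and can carry behavior.
Post = Struct.new(:id, :title, :link, :text, :date_published, keyword_init: true) do
  # True when the source feed supplied a usable title (tweets do not).
  def titled?
    !title.to_s.empty?
  end

  # Serialize back to JSON, the exchange format.
  def to_json(*args)
    to_h.to_json(*args)
  end
end

# Build a Post from a JSON document, ignoring keys the model does not know.
def post_from_json(json)
  attrs = JSON.parse(json, symbolize_names: true)
  Post.new(**attrs.slice(*Post.members))
end

json = <<~DOC
  {"id": "http://c.moreover.com/blog-1000",
   "title": "The Data Era-Moving from 1.0 to 2.0",
   "link": "http://c.moreover.com/blog-1000",
   "text": "...",
   "date_published": "2012-08-28T20:23:00Z",
   "provider": "Infochimps Blog"}
DOC

post = post_from_json(json)
puts post.titled?
puts post_from_json(post.to_json) == post
```

A real model layer would add type coercion (for example, parsing `date_published` into a time object) and validation, which this sketch leaves out.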

This case study illustrates how easily data value can get lost when working with diverse data sources. Most importantly, it highlights the benefits of successfully solving the inherent challenges and the variety of tools and expertise necessary to do so. Merging six different schemas into one semantically consistent structure dramatically increases the power of data. When data is unified, effective data integration and processing are possible. A recent blog post by our CEO, Jim Kaskade, further highlights the advantages of unifying and integrating data: Big Data Means Leveraging All Customer Channels.

Big Data Love + Upcoming Events

Last Friday, we hosted our famous Big Data Love Event at Capital Factory’s snazzy new co-working space.

Aside from the great view from the 16th floor of the Omni Hotel, neatly organized office space, and a secret meeting room behind bookshelves (so cool), we had the opportunity to catch up with Austin’s most successful entrepreneurs and many friends. For those of you who made it out, thank you. Hope to see you at the next event!

This week, Infochimps is presenting at the Intel Developer Forum in San Francisco and moderating a panel at the Big Data Innovation Summit in Boston. You can also find us at the following events coming soon:

If you’re in Austin, join us for these upcoming community events:

  • Austin R User Group, Thurs, Sept. 27: We love our local Meetup created to support and share R experience and knowledge among the Austin community
  • ATX Startup Crawl, Thurs, Oct. 11: A chance to mingle in Austin’s hottest startups’ office space, chat with some of Austin’s most renowned entrepreneurs, and drink a free beverage – all at the same time

For other Austin community events:

  • Lean Startup Machine, Fri, Sept. 21: A 3-day workshop where attendees use Customer Development and Lean Startup principles to validate an idea for a new product or service
  • Girl Hacker Drink-up, Wed, Sept. 26: An informal group of female developers in Austin who meet once a month to discuss projects, share new insights, and do some coding
  • Austin CTO, Tues, Oct. 2: An opportunity for members of Austin CTO to discuss strategies and thought-leadership over dinner

Much gratitude to Joshua Baer and Capital Factory.

Forbes: The Next Big Data Acquisition and Getting Rid of Data Scientists

Today Gil Press, blogger at Forbes, What’s The Big Data?, and The Story of Information, published his thoughts on an interview with our new CEO Jim Kaskade, titled “Infochimps’ New CEO on the Next Big Data Acquisition and Getting Rid of Data Scientists.”

Some quotes:

  • “CIOs are ready to embrace open source big data software and that the established IT players, lacking open source experience, will have to buy their way into the market.”
  • “As an engineer with Teradata in the 1990s, he witnessed first-hand what I call the Small Big-Data Bang and as a result, can draw interesting parallels with today’s Big Big-Data Bang.”
  • “Get rid of the data scientists? ‘The politically correct way to say it,’ says Kaskade, ‘is that I will turn your business users and application developers into data scientists…’”

Read the article.

Interested in reading more about Jim’s vision of The Data Era? Jim’s first blog post with Infochimps, The Data Era – Moving from 1.0 to 2.0, provides an inside look into “why Infochimps is so well positioned to make a significant impact within the marketplace”.

See other media coverage:

Much gratitude to Gil Press and to Forbes.
