News & Interviews

Big Data Projects, Short Survey, Enter to Win


When doing Big Data projects, what do you want your execs to know? 
We value your opinion! In under 10 minutes, take this short survey for a chance to win over $500 in Amazon gift cards and one of 5 memberships to, one of the largest community-driven sites that focuses on providing relevant news and information on data management, collaboration software, development tools and cloud computing to help information technology (IT) professionals succeed in the field.

Take the Survey Here!



5 Questions Not Every CEO Would Answer: Meet Jim Kaskade

As I’m sure you’ve heard, Jim Kaskade is the new Infochimps CEO. You’ve read about his vision for the company and his passion and experience in Big Data. But do you know him on a personal level? See the following interview questions and get a dose of the real Jim Kaskade.

1) What brought you to Infochimps?

The people.  My first question to Joseph Kelly was about the Infochimps culture and what made it so special. The value of a company is its people. Without an A team, even the best vision cannot successfully execute.  I loved all the little things. From the data mine behind closed doors, to the significance behind the Infochimps name itself – the infinite monkey theorem, everything added up to a winning culture.

2) What are your plans for Infochimps’ future?

I want Infochimps to leave a legacy. I want us to make a huge impact in the data infrastructure space and become a key player in the infrastructure transformation with our big data platform. Infochimps will make our customers’ lives easier by expanding infrastructure capabilities to Fortune 500 companies.

3) Tell me more about the recurring theme of “no more data scientists.” What are your thoughts on this controversial statement? Can you elaborate more on this concept?

The “no more data scientists” position is not meant to be literal, but instead is meant to challenge the status quo. What would you do if you didn’t have a data scientist, an 18-person IT department, or your smart statisticians? Those are the “what if” questions we’re trying to ask here at Infochimps. We’re not trying to replace data scientists; we’re trying to make their job easier. If we could make it easier for data scientists to find gold nuggets of brilliance and seamlessly put them into a process where we can accelerate development, doesn’t everyone win? What brings organizations together are integrated solutions and out-of-the-box thinking like we’re offering at Infochimps.  That’s what “no data scientists” truly means: creating more data-centric people all the way from IT to the CEO.

4) How would you describe your leadership style?

Empowerment. I define a good leader as someone who teaches people how to maximize their strengths and empowers them to do the best they can.  My job is to make everyone in the company successful, which translates into mentoring, challenging, and magnifying their strengths. Like any good CEO, I make it a personal goal to help set the strategy, help create the vision, hire people smarter than all of us, focus on removing the obstacles, and help us all execute.

What’s my mantra? Work hard, play hard. It seems cliché, but it’s the simple truth.  If we’re not having fun, why are we doing it? I believe a company should work cohesively as a team to reach a common goal, overcome weaknesses, and help each other excel to reach the next level.

5) Personal level: What are your personal ethics and how do they show up in your work?

I am a “glass half full” kind of person. I have 2 young boys and I teach them to be curious, to always ask questions, and know there’s nothing they can’t accomplish. Key values I bring from home to work are: you can’t fly through life solo, you need people in your life you can trust; make an effort to have a mentor. If you get caught up trying to solve every problem on your own, you’ll take longer, fail harder, and then be lonely.

There are 3 men I’ve learned to respect most in my career:

    1. Art Collmeyer, the Founder of iWatt, used to say, “You gotta die before you go to heaven. It’s hard work, but if it was easy, everyone would be doing it; so suck it up.”  
    2. Bob Adams, known for heading up Xerox Ventures and thought leader in disruptive technologies, is a man of common sense. Some startups get caught up, ignore the facts and lose common sense; they don’t respond fast enough to things that aren’t working. Bob would say, “Black is black, a spade is a spade. If it’s not working, acknowledge it, fix it, and don’t ignore it with the hope it goes away.”
    3. Jack Shemer, the Founder of Teradata and the most important person in my career, had an appetite for going towards the seemingly impossible. He taught me everything about how important people are and why he puts “people in front of everything.” Jack is someone who has the softest heart, the strongest push, and mastered how to make things happen. I hope I will amount to a fraction of his success.

Thank you for sharing more about yourself, Jim. We are happy to have you on board!


Forbes: The Next Big Data Acquisition and Getting Rid of Data Scientists


Today Gil Press, blogger at Forbes, What’s The Big Data?, and The Story of Information, published his thoughts on an interview with our new CEO Jim Kaskade, titled “Infochimps’ New CEO on the Next Big Data Acquisition and Getting Rid of Data Scientists.”

Some quotes:

  • “CIOs are ready to embrace open source big data software and that the established IT players, lacking open source experience, will have to buy their way into the market.”
  • “As an engineer with Teradata in the 1990s, he witnessed first-hand what I call the Small Big-Data Bang and as a result, can draw interesting parallels with today’s Big Big-Data Bang.”
  • “Get rid of the data scientists? ‘The politically correct way to say it,’ says Kaskade, ‘is that I will turn your business users and application developers into data scientists…”

Read the article.

Interested in reading more about Jim’s vision of The Data Era? Jim’s first blog post with Infochimps, The Data Era – Moving from 1.0 to 2.0, provides an inside look into “why Infochimps is so well positioned to make a significant impact within the marketplace”.

See other media coverage:

Much gratitude to Gil Press and to Forbes.


The Big Data Playbook for Digital Agencies


Our CEO, Joseph Kelly, recently wrote a guest post for Mashable about how and why digital agencies should pursue Big Data opportunities.

For digital agencies, big data as a competitive advantage is still very nascent, somewhat terrifying, and not tangible at all. However, marketers are starting to hear that it’s the new secret sauce, and they’re scrambling to figure out how to use it. And for good reason. Given the current trajectory, there’s a large chance that big data will change the face of digital agencies in as little as five years.

If you’re a marketer, part of a digital agency, or just curious about how Big Data can and will shape the future of understanding customers, check out this article.


Big Data Predictions for 2012 – Part One

A couple of days ago, O’Reilly’s Edd Dumbill took a look at hot topics in data for the coming year.  His five big data predictions for 2012 were absolutely spot on and illustrate the most important next steps in our ability to find insight in data: visualization, interactivity and abstraction.  As CTO of a tech startup at the center of a lot of these conversations, I thought I’d share some of my thoughts on the matter as well.  A full post with my 2012 big data predictions will be out in the next couple of weeks.

Moving Towards Insight, Not Data

Nobody wants data. It’s costly to store and bothersome to gather, parse and make useful. What you do want is insight: the ability to see deeper and make better decisions.  However, even these tools of the future can only take your internal data so far. At some point, it becomes important to widen your observation field to include explanatory variables that give your data context.  In other words, at some point, it’s useful to bring in outside data to give your own internal data more shape and meaning.


Explanatory variables give your data context, which leads you to insight.

Data marketplaces such as Infochimps, Factual and Microsoft Azure will take off as CIOs discover that BI tools are for generating questions, not answers.  For example, pretend you are a camping equipment store based in Reno, NV with an e-commerce site.  After years of business, you understand your yearly sales cycle (peaks in the early spring and early fall), who your repeat customers are, who your local competitors are, etc.  However, you may uncover curiosities that cannot be answered by internal data: why do sales peak in early to mid-August and why do so many folks request holds during that time?  Why does your customer spread suddenly change from mostly folks in the Nevada area to folks from all over the country? Your business intelligence tools may fall short here.  Explanatory variables from outside sources, including weather, popularity of competitors (Foursquare check-ins, web traffic, etc) and major festivals in your area (oh HAI Burning Man!) may help you get to the bottom of why your store suddenly becomes flooded with neon furry leg-warmer wearing hippies around Labor Day.

In other words, your data creates questions, and outside data creates context and helps you answer those questions.  In short, the answer to the too-much-data problem is more data.
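To make the camping store example concrete, here is a minimal Python sketch of lining up internal sales with one outside explanatory variable. The week labels, order counts, event calendar, and the `annotate_with_context` helper are all invented for illustration; a real analysis would pull in far richer outside data.

```python
from statistics import median

# Hypothetical internal data: orders per week (invented numbers).
weekly_orders = {"W32": 140, "W33": 310, "W34": 520, "W36": 150}

# Hypothetical outside data: events near the store, keyed by week.
outside_events = {"W34": "Burning Man"}


def annotate_with_context(orders, events, spike_ratio=1.5):
    """Flag weeks well above the median and attach any outside event."""
    baseline = median(orders.values())
    rows = []
    for week in sorted(orders):
        count = orders[week]
        rows.append({
            "week": week,
            "orders": count,
            # The internal data can only tell us that something happened...
            "spike": count > spike_ratio * baseline,
            # ...the outside data suggests why.
            "context": events.get(week),
        })
    return rows


rows = annotate_with_context(weekly_orders, outside_events)
for row in rows:
    print(row)
```

The internal numbers raise the question (a spike in one week), and the joined outside column offers a candidate explanation for it.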

Real-time Insight

In his article, Edd talks about the idea of streaming data processing. I haven’t gotten to play with Esper yet, but Ilya Grigorik has — as he puts it, Esper lets you “store queries not data, process each event in real-time, and emit results when some query criteria is met.”  You can read his overview here.
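The "store queries, not data" idea can be sketched in a few lines of plain Python. This is not Esper's API (Esper has its own query language); the `StandingQuery` class, the window size, and the latency threshold are assumptions made up for this illustration.

```python
from collections import deque


class StandingQuery:
    """Hold a query and a small sliding window, never the full event history."""

    def __init__(self, window_size, threshold):
        self.window = deque(maxlen=window_size)  # old events fall off automatically
        self.threshold = threshold

    def on_event(self, latency_ms):
        """Process one event as it arrives; return a result only when the criterion is met."""
        self.window.append(latency_ms)
        if len(self.window) < self.window.maxlen:
            return None  # not enough events yet to evaluate the query
        average = sum(self.window) / len(self.window)
        return average if average > self.threshold else None


# Hypothetical stream of request latencies in milliseconds.
query = StandingQuery(window_size=3, threshold=200)
alerts = [query.on_event(ms) for ms in [100, 120, 150, 400, 500]]
print(alerts)
```

Each event is looked at once and then discarded as the window slides, so memory use depends on the query, not on how much data has flowed past.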

At a lower and simpler level, we here at Infochimps LOVE Flume. It does one thing, simply and well: gets data from over here to over there, perhaps doing things to it along the way. It was developed for reliable log handling, but there are so many wonderful ways to beautifully misuse Flume:

  • We replaced a set of massive batch jobs to parse scraped data with a simple Flume decorator that turns raw JSON into database-ready records. The new data flow is not only near-real-time, but also simpler, more reliable, and more maintainable.
  • If you’re shipping your weblogs around with Flume, ten lines of Ruby and a tiny little StatsD daemon are all you need to track in real-time the rate, response code and latency of every web request. Joining the Church of Graphs has never been easier.
  • Trying to scale a distributed message queue (RabbitMQ, Resque, etc)? If you care about throughput more than latency, look carefully at Flume — you’re probably better off.
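As a rough illustration of the first two bullets, the Python sketch below turns one raw JSON log line into a flat record and then into StatsD-style metric lines. It is not Flume's decorator API and not our production Ruby; the JSON field names, the `web` prefix, and both helper functions are hypothetical. The metric lines use the plain-text StatsD format of `name:value|type`, with `c` for counters and `ms` for timers.

```python
import json


def to_record(raw_line):
    """Flatten one raw JSON event into a database-ready record (hypothetical schema)."""
    event = json.loads(raw_line)
    return {
        "path": event["request"]["path"],
        "status": int(event["response"]["status"]),
        "latency_ms": float(event["response"]["latency_ms"]),
    }


def to_statsd(record, prefix="web"):
    """Build StatsD lines tracking request rate, response code, and latency."""
    return [
        f"{prefix}.requests:1|c",
        f"{prefix}.status.{record['status']}:1|c",
        f"{prefix}.latency:{record['latency_ms']:g}|ms",
    ]


raw = '{"request": {"path": "/datasets"}, "response": {"status": "200", "latency_ms": 42.5}}'
record = to_record(raw)
metrics = to_statsd(record)
print(record)
print(metrics)
```

In a live pipeline each metric line would typically be sent to the StatsD daemon as a UDP datagram; that step is left out here so the sketch runs without a daemon.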

By the way, if timeseries are your bread and butter, check out our friends at Eidosearch.

Data Science Workflows and Tools for Fast Iteration

Edd describes Hadoop’s batch-oriented processing as sufficient for many uses, but argues that batch processing isn’t adequate for online and real-time needs.  Spark/Mesos has great promise for the kind of interactive exploration Edd describes. (It’s as close as anyone’s gotten to a REPL for big data exploration.)

At Infochimps, our workflow revolves around writing short, readable scripts.  See our chapter in O’Reilly’s Hadoop: The Definitive Guide for a case study on how we developed a toolkit for Hadoop streaming in a Ruby environment.

Visualization is King

The most important machine learning algorithms all, at their heart, exploit a graph of one sort or another. Yet simply seeing, in any form, a graph of significant scale is still an oafish RAM-exhausting slog. Far and away the best tool we’ve used is Gephi, an open source interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.

So, how essential is visualization, especially graph visualization, to quality data science?  LinkedIn, home of one of the top data science teams in the world, used its consistently-brilliant/outrageously-unfair hiring advantage to bring aboard Mathieu Bastian, Gephi’s author. LinkedIn derives so much value in just being able to see and understand their data better that they’re funding the full-time development of Gephi, to the benefit of us all.

On Being Wrong In Paris: Finding Truth in Wrong Answers


Here’s a problem that’s harder than it seems: Where is Paris? Any simple response proves more ambiguous and brittle than you would expect. But across an ocean of data lies a new way to discover answers, one that accommodates complexity because it is sourced in complexity.

A reasonable answer is “the political boundary of the city of Paris, France”. If you were walking through France, and were prone to making silly graphs, you might draw this plot of “How much am I in Paris?”


Flip Kromer Interview at Strata NY 2011

Check out this interview with Flip Kromer, founder and CTO (yes, that’s a typo in the O’Reilly video) at Strata NY 2011.  He shares his thoughts on how to do data science on a shoestring, the tools of the trade and the future of DaaS (data as a service).

Infochimps Acquires Keepstream

We are excited to announce that Keepstream, a social media curation and analytics company, will be joining the Infochimps family!  Jim England, Huston Hoburg, and I, Tim Gasper, are excited to now be Chimps. Together we’ll continue to develop exciting new data products and rock the world of Big Data.

With this announcement, we have two pieces of product news. Firstly, the Keepstream hand-curation product will slowly close down. The website will be set to “read-only mode” on September 30th. New user registration will be turned off and existing users will no longer be able to create or edit their collections. However, all collections will still be hosted online and be accessible at for viewing. So don’t worry, those links you’ve shared will still work just fine. We will be exploring options for exporting collections or integrating with similar curation services. If you’d be interested in something like this, leave me a note at tim[at]keepstream[dot]com.

Secondly, Keepstream Reports, located at, will continue forward. It has the potential to be both a more automated way to archive social media as well as a way to create beautiful, actionable social media analytics reports. It’s currently in private beta.

We are really excited about what we can accomplish together, and you can look forward to many exciting developments soon to come!  Also, Jim and I will be attending TechCrunch Disrupt in San Francisco next week.  If you’ll be there too, we’d love to meet you.  Just look for us in our brand new Infochimps shirts or shoot us a tweet at @JimEngland or @TimGasper.

The Howler Project: Get Loud!

For howler monkeys, as their name suggests, loud vocal communication is an integral part of their social behavior. Howler monkeys are widely considered to be the loudest land animal, and according to the Guinness Book of World Records, their vocalizations can be heard clearly for 3 miles (4.8 km). Curious what they sound like? You can hear a clip of the howler monkey’s signature low growl here.

Wanna get loud?

We’re launching a beta testing program to pull in feedback from our most vocal fellow primates on our newest APIs.  We’re calling it the Howler Project and you can apply to become one of our first Howlers by clicking here.  We’ll email you periodically to let you know when we’re releasing new APIs and how to access them before we announce them to the general public.  Each API will have a short series of tests/questions we’d like you to answer, though you’re welcome to try to break it in any way you’d like.  The more feedback we get from you, the better we can make our products.  And the better our APIs, the easier and faster it can be to start building amazing things on top of our data.  Everyone wins and you can help us get there!

So what’s in it for you, our furry friends?

Well, in true chimp spirit, if you pick bugs off our backs, you get to enjoy tasty grubs.  Help us test just one API and we’ll send you some fun chimpy swag.  Become a regular tester and the incentives grow to include free account upgrades on, sweet toys from ThinkGeek, custom Startup: The Hackering cards with your info and more.

Apply today and get loud!

If Search Still Sucks, Here’s How to Fix It

In a controversial post that was countered by Matt Cutts of Google, Michael Arrington compared the modern search experience to what it was like before Google was created, claiming, “It’s a sh*t show of layer upon layer of SEO madness vying for my click”.

“Yes, search is very hard. But Silicon Valley is really good at doing hard things,” claims Arrington.

Here at Infochimps, we have never shied away from difficult problems. We are a company of 12 people handling data stores that companies 100 times our size never see. Soon, Infochimps will launch hundreds more data sets and data APIs. But what do data APIs have to do with Google not being effective?

In the post, Arrington explains that when he knows what he is looking for, such as a vacation, he goes straight to a vacation website such as TripAdvisor or Gogobot. Sites like TripAdvisor or Yelp start with data, though, acquired either through mass aggregation or through subscriptions to expensive services. Aggregation is not easy. It means hours of scraping, cleaning, parsing, and updating data, with scraping scripts that can break at any time. It also means potentially thousands of dollars in hosting costs to house and process that data. Building search is neither a quick nor a cheap exercise.

If you can launch an app and plug into a data API in minutes, suddenly creating targeted vertical search engines becomes easier and more affordable. You are in essence sharing a data set with others, which makes creating intelligent algorithms with that data a lot easier and more affordable.

There will always be a place for a major search engine like Google. Vertical search engines are content rich and can become “the search within the search” though. We are doing what we can to make those more accessible to everyone, so hopefully we can make Mr. Arrington and others happy in their quest to find things.

By the way, we solve difficult problems here in Austin too. ;-)