Jim Keynotes Strata Conference + Hadoop World


Everyone’s talking about Big Data, but who’s actually doing it right?

Infochimps will be going big at Strata + Hadoop World in New York, NY from Oct. 28-30, gathering alongside thousands of the best minds in data to learn, connect, share knowledge, and explore.

The show is easily the biggest of the year for us, and we are excited to announce that Infochimps CEO Jim Kaskade will be keynoting Wed., Oct. 30 at 9:50am EDT in the Grand Ballroom.




Jim’s not the only face from Infochimps going to the show; we’ll be exhibiting with a packed booth (Booth #38) full of eager chimps ready to talk about Big Data. Key exhibiting team members include our VP of Sales Burke Kaltenberger, Director of Marketing Amanda McGuckin Hager, Director of Product Tim Gasper, Director of Sales Strategy and Operations Ryan Miller, and Demand Gen Manager Caroline Lim. If you’re going to Strata + Hadoop World and would like to set up a meeting with a chimp, we’d love to chat.


Not registered? Register today, save 20% with discount code INCHP, and be sure to stop by Booth #38 to chat with us about Big Data.


Part 2: The Lucky Break Scoreboard

Last week, Infochimps CTO Flip Kromer opened up about the failures that led to the successful acquisition by CSC in his blog post, Part 1: The Truth – We Failed, We Made Mistakes. Flip continues his blog series with Part 2, his love letter – the real Infochimps story.


Seven years ago, having switched majors from Computer Science in college to Physics in grad school, and having failed twice to successfully execute a plan of research in Physics, I decided to switch to Education – my favorite part of grad school was teaching. A year before, my ever-patient advisor, physics professor Mike Marder, had started a wildly successful alternative program for public-school teaching certification. It replaced a full general education curriculum with frequent in-classroom experience and focused education classes – and it let me reuse the scientific coursework I already had way too much of.

A year later, I was near the end of the program and preparing my teaching portfolio, which led me to spend a lot of time thinking about what I wanted my students to learn, and why. For many of them, my course would be their last formal chance to acquire the skill of quantitatively understanding their universe. As I started to write (less bluntly), I had no interest in burdening them with three different forms of the quadratic equation, or pretending that as a practicing physicist I’d ever used the formula for the perimeter of a trapezoid.

What they should be learning was the ability to make use of a complex information stream, understand sophisticated information displays, and extract straightforward insight using tools such as … … ‽‽

I paused, struck, mid-sentence. Those tools do not exist. Not for a high school student, not for a domain expert in another field, and only after years of study, for me. That’s what I was supposed to be working on: democratizing the ability to see, explore and organize rich information streams.

So as a lapsed computer scientist and failed physicist, I decided to abandon education as well and start yet another new thing, one that was none of those and all of those together.

Challenge Accepted

I asked Mike Marder if I could come back to his research group and work on tools to visualize data; we could figure out along the way how to tie it into a research plan. I had some savings (thanks largely to my Grandmother, who was just your typical successful 1940’s woman entrepreneur), so I wouldn’t cost him any money. Mike reasoned that although I didn’t know how to solve my own problems, I was frequently useful in helping others solve theirs — and who knows, I seemed really fired up about this new idea whatever it was. So all in all it was an easy decision to hide me away in a shared office and let me get to work.

Building the visualization tool required demonstration data sets to prove the concept, and there are few better than the ocean of numbers around Major League Baseball.

In addition to the Retrosheet project — the history of every major-league baseball game back to the 1890s — MLB.com was publishing one of the most remarkable data sets I knew of: for the past seven years, every single game, every single at-bat, every single play, down to the actual trajectory of every single pitch. I first started playing with the Retrosheet data, and found some scattered errors — things like a game-time wind speed of 60mph.
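A sanity check like the one that caught that wind speed can be sketched in a few lines. This is a hypothetical illustration: the field names, game IDs, and the 45mph plausibility cutoff are invented, not part of the actual Retrosheet schema.

```python
# Hypothetical sketch: flagging implausible values in game-condition records.
# Field names, game IDs, and the cutoff are illustrative only.

def find_suspect_wind_speeds(games, max_plausible_mph=45):
    """Return games whose recorded wind speed exceeds a plausibility cutoff."""
    return [g for g in games if g.get("wind_speed_mph", 0) > max_plausible_mph]

games = [
    {"game_id": "NYA196904220", "wind_speed_mph": 12},
    {"game_id": "BOS197507110", "wind_speed_mph": 60},  # almost certainly a data-entry error
]

suspects = find_suspect_wind_speeds(games)
print([g["game_id"] for g in suspects])
```

The same pattern generalizes to any field with a known plausible range (temperature, attendance, pitch speed).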

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support.)

Weekend Project Gone Awry

Well, the NOAA has weather data. Lots of weather data. The hour-by-hour global weather going back 50 years and more, hundreds of atmospheric measurements for every country in the world, free for the asking. And the Keyhole (now Google Earth) community published map files giving the geolocation of every current and historical baseball stadium.

So if you’re following, we have:

  • A full characterization of every game event
  • … including the time of the game and the stadium it was played in,
  • … and so using the stadium map files, the event’s latitude and longitude
  • … and using that lat/long, all the nearby weather stations
  • … and using the game date and time, the atmospheric conditions governing that event
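The chain above can be sketched as a toy join. Everything here is invented for illustration (the stadium and station IDs, the coordinates, and the single-entry lookup tables); a real version would index thousands of stations and deal with time zones and missing readings.

```python
# Illustrative sketch of the lookup chain: game event -> stadium -> lat/long
# -> nearest weather station -> atmospheric conditions at game time.
# All identifiers and values below are invented for illustration.
import math

stadiums = {"FEN": (42.3467, -71.0972)}            # stadium id -> (lat, lon)
stations = {"KBOS": (42.3656, -71.0096)}           # station id -> (lat, lon)
weather = {("KBOS", "2008-07-04T19:00"): {"wind_mph": 9, "temp_f": 78}}

def nearest_station(lat, lon):
    """Pick the station closest to a point (crude flat-earth distance)."""
    def dist(sid):
        slat, slon = stations[sid]
        return math.hypot(slat - lat, (slon - lon) * math.cos(math.radians(lat)))
    return min(stations, key=dist)

def conditions_for(event):
    """Resolve a game event to the weather conditions governing it."""
    lat, lon = stadiums[event["stadium"]]
    station = nearest_station(lat, lon)
    return weather[(station, event["time"])]

event = {"stadium": "FEN", "time": "2008-07-04T19:00"}
print(conditions_for(event))
```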

I connected the data sets looking to correct and fill in the weather data, and found out I had accidentally wired up a wind tunnel. There’s no laboratory with the budget to have every major league pitcher throw thousands of pitches for later research purposes — none, except the data set I described.

What’s screwy (and here’s where every practicing data scientist groans and shakes their head) is that the hard part wasn’t performing the analysis. The hard parts were a) making that data useful, and b) connecting the data sets, making them use the same concepts and measurement scales.

But all that work — the mundane, generic work anybody would have to do — just sat there on my hard disk. If I created a useful program, or improved an existing public project, I knew right where to go: open-source collaboration hubs like SourceForge or GitHub. But no such thing existed for data. I had to spend weeks transforming the MLB game data into a form that you could load into a database. If we could avoid that repetition of labor, we would solve a problem shared by every practicing data scientist.

On Christmas Day 2007, I bought a book on how to build websites using the “Ruby on Rails” framework, and figured I’d knock something useful out in, y’know, a week or so. By sometime that Spring, I had something useful: a few interesting data sets and a website to generically host and describe any further data sets. The initial version of the site was read-only, because I didn’t know how to do join models or form inputs in Ruby on Rails, but I could add new data sets directly to the database. And just like that, Infochimps was born.

I cold-emailed blogger Andy Baio, who linked to “Infochimps, an insane collection of open datasets”. For a guy working alone in an ivory tower, the resulting response was overwhelming.

One of the individuals who emailed to encourage us was Jeff Hammerbacher, founder of the data team at Facebook. When we chatted on the phone, he told me about a new data analysis tool that Facebook was using, called Hadoop. I looked into it, but couldn’t see how I would ever need to use it. Still, it was really exciting that big names in data were taking interest.

On a trip to San Francisco a few weeks later, I went to a meetup at Freebase. @skud, their community manager, recognized that Infochimps was the perfect raw-data complement to Freebase. She asked me to come back the next month and give a meetup talk. Kurt Bollacker, head of their data team (and future teammate and profoundly valuable mentor), asked me to come back the next day and give an internal lunch lecture. I stayed up all night using Google Docs on my uncle’s powerpoint-less computer, and gave some hot mess of a presentation to their internal group. Kirrily didn’t uninvite me, so it wasn’t too bad.

It was clear that the lack of a collaboration hub was a problem many people were feeling.

So as a lapsed computer scientist, failed physicist, and no-show educator, I decided to abandon working on a visualization tool and make a collaboration hub instead. Yup.

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it.)


One of the new faces on Mike’s research team when I returned was Dhruv Bansal, who was working on a fascinating problem bridging Mike’s two interests: physics and education. They used a freedom-of-information request to acquire a fascinating data set: the anonymized test scores for every student, on every question, for the yearly exam taken by every schoolchild in Texas.

They used the physics equations for fluid flow to model the year-on-year change in student test scores, highlighting patterns that demanded immediate action within the education community.
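For reference, the general one-dimensional Fokker-Planck equation describes how a probability density p(x, t) (here, a distribution of student scores) evolves under a drift term and a diffusion term; the specific drift and diffusion functions used in their model aren't given in this post.

```latex
% General 1-D Fokker-Planck equation for a density p(x,t),
% with drift mu(x,t) and diffusion D(x,t):
\frac{\partial p(x,t)}{\partial t}
  = -\frac{\partial}{\partial x}\bigl[\mu(x,t)\,p(x,t)\bigr]
  + \frac{\partial^{2}}{\partial x^{2}}\bigl[D(x,t)\,p(x,t)\bigr]
```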

As you can guess, the costliest part of that project was not performing the analytics, or applying the Fokker-Planck equation for fluid flow, or working the paper through peer review. No, the costliest part of the project was the 3-month process of acquiring the data and cleaning it for use. For the random researcher who discovered and requested the data, Dhruv would spend a few hours burning the data to a DVD and physically mailing a copy. For reasons I still don’t understand, while researchers in Sociology, Psychology, and other “soft” sciences immediately latched on to the usefulness of Infochimps from the very start, Physicists and Computer Scientists almost never understood what we were doing or why it might be valuable. Dhruv and Mike’s split focus meant they got it immediately.

This is probably the most unlikely lucky break, and most crucial development, of this adventure: sitting a few offices away from where I worked was one of the most talented programmers I’ve ever worked with, possessed of a mountainous drive to change the world, the laconic cool to keep me level, and a furious anger at the exact same problem I was working to solve.

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it; sharing the same advisor as Dhruv.)

Twitter Dreams

At around this time Twitter was blowing up in popularity, though still a tool largely used by nerds to tell each other what they had for lunch. We couldn’t explain, any more than most, the appeal of Twitter as a social service.

But to two physicists with a background in the theory of random network graphs, Twitter as a data set was more than a social service: it was a scientific breakthrough. It implemented a revolutionary new measurement device, giving us an unprecedented ability to quantify relationships among people and conversations within communities. Just as the microscope changed biology, and the X-ray transformed medicine, we knew seeing into a new realm placed us on the cusp of a new understanding of the human condition. Making this data available for analysis and collaboration was the best way to provide value and draw attention to the Infochimps site. We emailed Alex Payne, engineering lead at Twitter, for permission to pull in that data and share it with others. He gave me a ready thumbs-up: better that scientists download the data from us, than that they pound it out of his servers.

We wrote a program to ‘crawl’ the user graph: download a user, list their followers, download those users, list their followers, repeat. That was the easy part. Sure, each hundred followers had hundreds of followers themselves, but we could make thousands of requests per hour, millions of requests per week.
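That crawl is a plain breadth-first traversal of the follower graph. A minimal sketch, with a toy in-memory graph standing in for the real (rate-limited, paginated) Twitter API:

```python
# Sketch of the crawl described above: download a user, list their followers,
# download those users, repeat. fetch_followers is a stand-in for the real
# API call; TOY_GRAPH and its usernames are invented for illustration.
from collections import deque

TOY_GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice"],
}

def fetch_followers(user):
    """Stand-in for the (rate-limited) followers API call."""
    return TOY_GRAPH.get(user, [])

def crawl(seed, max_users=1000):
    """Breadth-first crawl from a seed user; returns users seen and follow edges."""
    seen, queue = {seed}, deque([seed])
    edges = []
    while queue and len(seen) < max_users:
        user = queue.popleft()
        for follower in fetch_followers(user):
            edges.append((user, follower))
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return seen, edges

users, edges = crawl("alice")
print(sorted(users))  # -> ['alice', 'bob', 'carol']
```

The `max_users` cap matters in practice: each hundred followers has hundreds of followers of their own, so an uncapped crawl grows explosively.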

The hard part came over the next few weeks as we realized that none of our tools were remotely capable of managing, let alone analyzing, the scale of data we so easily pulled in. As quickly as we could learn MySQL, the data set outgrew it. Sure, Dhruv and I could request supercomputer time for research, but supercomputers weren’t actually a good match — they’d be more like a rocketship when what we needed was a fleet of dump trucks. We realized what we needed was Hadoop, the tool Jeff Hammerbacher mentioned to me a few months earlier.

But where could we set up Hadoop? The physics department’s computers were scattered all over and largely locked down. But I also had an account on the UT Math department’s computers. Their sysadmin, Patrick Goetz, was singularly passionate about enabling researchers with the tools they needed to make breakthroughs. He took the much more courageous (and time-consuming for him) route of allowing expert users to install new software across departmental machines.

What’s more, the Math department had just installed a 70-machine educational lab. During the day, it was filled with frustrated freshmen fighting Matlab and math majors making their integrals converge. From evening to 6am, however, they were just sitting there… running… inviting someone to put them to good use.

So that’s what we did: put them to good use. We set up Hadoop on each of the machines, modifying their configuration for the comparatively wussy undergrad-lab hardware, and set about using this samizdat supercluster on the Twitter user graph.

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it; sharing the same advisor as Dhruv; the explosion of social media data; the invention of Hadoop.)

Data Community

All through 2006-2009, people walking different paths — social media, bioinformatics, web log analysis, graphic design, physics, open government, computational linguistics — were arriving in this wide-open space, forming communities around open data and Big Data.

On Twitter, we were finally seeing what all the people in our favorite data set knew: a novel communication medium that enabled frictionless exchange of ideas and visible community. I’ll call out people like @medriscoll (CEO of Metamarkets), @peteskomoroch (Principal Data Scientist at LinkedIn), @mndoci (Product Manager of Amazon EC2), @hackingdata (Founder of Cloudera, now professor at Mt Sinai School of Medicine), @dpatil (everything), @neilkod and @datajunkie (Facebook data team), @wattsteve (Head of Big Data at Red Hat), among dozens more. It didn’t matter if someone was a random academic, a bored database engineer, a consultant escaping one field into this new one, a big name building the core technology. When you saw a person you respected talking to a person with a good idea, you hit “follow”, and you learned. And when you heard that someone in the Big Data space wasn’t on Twitter, you harangued them until they joined. (Hi, Tom!)

Meanwhile, Aaron Swartz had started the Get.theinfo Google Group. This most minor of his contributions had a larger impact than most know, and was typical of why he’s so missed. He recognized a problem (no conversation space for open-data enthusiasts), built just enough infrastructure to solve it (a Google Group and a website), then galvanized the community to take over (gifting enthusiastic members with the white elephant of moderator permissions), and offered guidance to make it grow.

The relationships we built and communities we joined became critical catalysts for our growth.

Twitter Reality

We spent the next several months building out the site during the day and running analysis on the growing hundreds of gigabytes by night (does that seem quaintly small now?). Right before Christmas break, we did a set of runs producing data suitable for people in the community to find useful. Hours before hopping on the plane to visit my family, I finished compressing and uploading them, wrote up a minimal readme file, and posted a note to the Get.Theinfo mailing list. I knew the folks there wouldn’t mind the rough cut version, so I figured I’d mention it quietly there, but wait to do a proper release after break — after all, there was no internet where I’d be staying.

Well, two predictable things happened: 1) a huge response, far more than expected, flowing up the chain to large tech blogs and twitter-ers, and 2) a polite but forceful email from Ev Williams (Twitter’s CEO) asking us to take the data files down while they figured out a data terms-of-service. We reluctantly removed the data.

Sure, the experience was a partial success. It brought great publicity, and of course you probably caught the foreshadowing of how important Hadoop was about to become for us. But we failed at the important goal: sharing the immensely valuable data we had invested months in releasing.

Minister of Simplicity

Now to introduce Joe Kelly into the story. Our research center decided to hire someone to build our new website, and one of the respondents to our Craigslist ad was Joe, a former UT business school student who had been working with his roommate to get their general contracting firm off the ground. He didn’t really know how to design websites, but he absolutely loved reading about the science our center was doing, so he applied.

His interview was amazing. He had the design sense of a paper bag compared to the other candidates, but every one of us left the room saying, “wow, that guy was awesome, the kind of person you just want to work with on a project”. Only Dhruv was smart enough to take the face-slappingly obvious next step — replying 1-to-1 to a later email from Joe to say, “well, hey, we also have this other project going on; we don’t really need your help on the website, but there’s a lot of work to do”. Within days, Joe had set up a bank account and PO box, organized the papers to make us an official partnership, and generally turned this ramshackle project into an infant company. It was an easy decision for Dhruv and me to make him a co-founder.

An easy decision until a few days later, when I read some cautionary article about how the #1 mistake companies make is choosing co-founders hastily. Well, hell. We just made this guy we randomly met a couple weeks ago a co-founder, handing him a huge chunk of the company. I didn’t know if we just made a huge mistake or not.

So the next day, we were hanging out at the Posse East bar (our “office” for the first several months of the company), and Joe introduced us to the idea of an Elevator Pitch. “If we’re going to be at the South by Southwest (SXSW) Conference, we need to be able to explain Infochimps”. I replied with some kind of rambling high-concept noodle. Dhruv rang in with his version — more scientific, more charm and cool, but no more useful than mine.

Joe replied, “No. Infochimps is this: ‘A website to find or share any data set in the world’”.

I rocked back in my chair and knew Dhruv and I had made one of the best decisions of our lives. His version said everything essential, and nothing more. In one week, he understood what we were doing better than we did after a year. Joe’s role emerged as our “Minister of Simplicity”. He removed all complications, handled all necessary details, smoothed all lines of communication, making it possible for our team to Just Hack. Everything essential, and nothing more.

Capital Factory

With the decision to move forward as a company, not an academic project, we applied to the starting class of Capital Factory (Austin’s startup accelerator). It was an amazing experience, and we went hard at it: we hit all the meetings, spent hours working on our pitch, tried to make contact with every mentor, and made an epic application video. (One of Dhruv’s housemates was a professional filmmaker. Friends in high places.)

We got great feedback and obvious interest from the mentors, and were chosen as finalists. We were confident that we had the right combination of team and big idea to merit acceptance.

They rejected us.

After the acquisition, Bryan Menell — one of the Capital Factory founders — posted a graciously bold blog post explaining what happened. As we later heard from several mentors, they each individually loved our company. Once in the same room though, they found that none of them loved the same company. This mentor loved Infochimps, a company that would monetize social media data. This other one loved Infochimps, a set of brilliant scientists who could help businesses understand their data. Some of them just knew we worked our asses off and were incredibly passionate about whatever the hell it is we were doing but couldn’t explain. A few of the mentors loved Infochimps because we were building something so cool and potentially huge that surely some business value would later emerge. Whichever idea a mentor did like, they generally didn’t like the others.

I can’t overstate how difficult it was to explain what we were doing back then. After two years, we can now crisply state what we had in mind: “A platform connecting every public and commercially available database in the world. We will capture value by bringing existing commercial data to new markets, and creating new data sets from their connections.” It’s easy(er) now, partly because of the time we spent to crystallize an explanation of the idea. Even more so, people now have had years of direct experience and background buzz preparing them to hear the idea. For example, the concept that “sports data” or “twitter data” might have commercial value was barely defensible then, but is increasingly obvious now.

Above all that though, the Capital Factory mentors were right: we were all those ideas, and all of those ideas were (as we’d find out) mostly terrible. And working on the combination of all of them was a beyond-terrible idea. On that point, Capital Factory was right to reject us.

We worked hard, had the perfect opportunity, and failed.

For good reasons and bad, we failed to get in. Or, well, we mostly failed to get in. Some of the mentors liked what they heard enough to stay in touch — meeting for beers and advice, making introductions, and being generous with their time and contacts in many other ways. The Austin startup scene was about to explode, led by Joshua Baer, Jason Cohen, Damon Clinkscales, Alex Jones and others. The energy that the Capital Factory mentors and these other leaders put into mentoring startups like ours ricocheted and multiplied within the community, in the kind of “liquid network” that Steven Johnson writes about. Although the companies within the first CapFac class benefited the most, it was like every startup in Austin was admitted.

The Truth

On the one hand, we had a bunch of fans in blog land, some website code, and a good team. But we had no idea how to make money, and only a finite runway. Our most notable validation as a project was a failed effort to share data, and our most notable validation as a business was an honorable mention ribbon.

Are you seeing it?

We were experiencing success after success after success.

Every time we failed, a smaller opportunity opened: one that was sharper; one that was more real; one that brought us closer to the right leverage point for changing the world.

These opportunities were smaller, but the energy behind them was the same. We were following what inspired people — to use data sets from Infochimps, to post a data set, to join our pied-piper team, to tweet about us, to make an intro, to have coffee and teach us something. All our ideas were useless crap, except in one essential way: to gather and inspire the people who would help us uncover a few ideas that were good, and execute on them.

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it; sharing the same advisor as Dhruv; the explosion of social media data; the invention of Hadoop; the completely random intersection with Joe; starting Infochimps just as the Austin startup scene exploded.)

The 3rd part of this blog series will highlight the journey from “project that inspired people” to “business that solved a real problem” — powered by individuals who made sizable investments of time, energy, money and kindness to produce repeated successes from repeated failures, and by the early customers of Infochimps who believed in us.

As we go, that “lucky break scoreboard” will get more and more improbable, enough to make that word “lucky” ludicrously inapplicable.

Philip (Flip) Kromer is co-founder and CTO of Infochimps where he built scalable architecture that allows app programmers and statisticians to quickly and confidently manipulate data streams at arbitrary scale. He holds a B.S. in Physics and Computer Science from Cornell University and attended graduate school in Physics at the University of Texas at Austin. He authored the O’Reilly book on data science in practice, and has spoken at South by Southwest, Hadoop World, Strata, and CloudCon. Email Flip at flip@infochimps.com or follow him on Twitter at @mrflip.


Reinvent Your Business for Big Data

Infomart Reinvents its Business for Big Data with Infochimps 

Over the past ten years, the media business has been turned on its head. The general shift from print to digital (and increasingly free) sources has challenged the traditional revenue model. To make matters even more complicated, the advent of social media has added a multitude of layers of interaction to digital content, making the task of determining how target audiences are responding to brand initiatives incredibly complex.

Learn how Infomart, Canada’s leading media consultancy for 25 years, reinvented its business by transforming a legacy app with Infochimps Cloud for Big Data.




Other resources you may be interested in:


Infochimps SXSW Panels: Voting Closes Tomorrow

Calling all supporters, calling all supporters, it’s that time of year again.

SXSW Panel Voting! Voting ends tomorrow, Friday, September 6, 2013 (11:59pm CST). Please read the panel submissions below and vote for your Chimps.

Growing an Open-Source Project: Code to Community 

  • Speaker: Infochimps CTO Flip Kromer
  • Description: How do you grow an open source project from “It’s public and has a LICENSE file” to “Caught fire; people we’ve never met commit more code than we do”?
  • We’ll explore:
    • How do you promote awareness and word-of-mouth, and foster the early community?
    • How do you navigate and balance the twin goals of production stability and community-driven features?
    • How do you ensure code quality without discouraging involvement?
    • Of the values gained from open source – free velocity, hiring, credibility, reputation, and so forth – how much tangible value are you deriving and when does that return start exceeding investment?




Managing Effective Documentation Effectively

  • Speaker: Infochimps Customer Support Engineer Rachel McCuistion
  • Description: Maintaining accurate, up-to-date, and effective documentation requires time, devoted content producers, and expertise. A company’s documentation should accelerate, not hinder, the company’s focus and productivity. We’ll discuss the importance of creating effective documentation, how to maintain a healthy lifecycle for internal and external documentation, and the common pitfalls that can lead to less effective documentation; answer the most common and difficult questions; and finally introduce an effective workflow for maintaining accurate and helpful documentation, including the best tools proven to increase efficiency and minimize downtime.


Inbound Marketing for the Lean Startup

  • Speakers:
  • Description: Lean methodology has provided a great framework for validating your business assumptions and model — by leveraging an Inbound Marketing model with Lean, you can benchmark against your hypotheses while also growing your business in real, measurable ways. Proven lean startup veterans will teach you how to set up an inbound marketing engine and use it to test, validate, and grow your business using Lean tools and approaches. With over a decade of experience, we will share best practices, lessons learned, and pitfalls to look out for. This workshop will be four hours long.


Thank you for all your support and we hope to talk Big Data with you at SXSW.

Image source: nibletz.com


Part 1: The Truth – We Failed, We Made Mistakes


As I’m sure most of you have heard, Infochimps was recently acquired by CSC, giving us the resources and mandate to build the Big Data platform of the future. This is a perfect landing for the company and our vision, and we couldn’t be more excited.

The great acquisition stories I’m familiar with have a few commonalities: the companies share a mission and vision; the acquired team works to focus their product and integrate it with the parent company’s offering; and the parent company gives them the resources to succeed without changing what enabled the acquiring team to excel.

A perfect example of this was when Apple bought Siri. At the time, Siri was a cute little iPhone app, built on amazing technology and with a highly-respected engineering team behind it. Married to Apple’s powerhouse strengths and global network, the result has transformed the way people interact with machines and is a centerpiece advantage of Apple’s products. Our goal is nothing less than a similar story within CSC.

CSC is a global corporation that provides information technology (IT) services and professional services. They employ 95,000 people globally, who create a $16B revenue stream serving governments and large enterprise. Our challenge, and we embrace it, is to provide a significant positive return even against that massive background.

We think that we can do so (as do many analysts) because the acquisition marries the signal strengths of Infochimps and CSC:

We live in the future:

  • Proven Big Data expertise and perspective on the technical landscape
  • An indelible culture and a crazy-awesome team
  • Solid open-source citizenship, as contributors to the projects we build on and stewards of well-adopted projects we’ve written

CSC lives at enterprise-scale:

  • 50+ years of expertise in big enterprise and security
  • Passion for building customer solutions and support
  • The resources a $16B revenue stream provides

CSC’s strengths address our biggest weaknesses, letting us focus on what we do best. There are no changes to the team, the culture, our Austin location, our open-source contributions, our development approach, our irreverence, our hiring standards, or our mission to make the world smarter. We’ll continue to operate independently, continue buying lunch for the office every day, and continue open-sourcing the majority of code we write.

So this is a huge win for our team, our customers, our investors, and CSC. I could finish the post right here, and all anyone would remember is that we persevered and reached this milestone through tenacious hard work and great ideas.

Well here’s the truth: The actual history of our company is one of failure after failure, costly mistakes, and multiple near-death experiences. The only reason we’ve “succeeded” is through a preposterous series of lucky breaks and kind acts. Trying to list all the people behind that hard work and those lucky breaks would be foolish. There are too many, and I’ll just offend some by omission. But if you’re reading this, you’re probably one of them; so thank you.

Now for the real story, the story you probably haven’t heard. It shows just how many people, making sizable investments of time, energy, money, and kindness, it takes to turn failures into successes, and how a small favor can change the world. It’s a thank-you note to those who have helped us and a love letter to other startups figuring it out as they go. It’s a reminder that this is just another chapter of Infochimps’ book, and we’re nowhere near the resolution.

Thanks and love from the co-founders and whole Infochimps team,
Flip Kromer
Infochimps Co-Founder and CTO

*Update* Flip continued this blog series with Part 2: The Lucky Break Scoreboard, where he explains “with every failure, a smaller opportunity opened: one that was sharper; one that was more real; one that brought us closer to the right leverage point for changing the world.” Read Part 2 >>

Philip (Flip) Kromer is co-founder and CTO of Infochimps where he built scalable architecture that allows app programmers and statisticians to quickly and confidently manipulate data streams at arbitrary scale. He holds a B.S. in Physics and Computer Science from Cornell University and attended graduate school in Physics at the University of Texas at Austin. He authored the O’Reilly book on data science in practice, and has spoken at South by Southwest, Hadoop World, Strata, and CloudCon. Email Flip at flip@infochimps.com or follow him on Twitter at @mrflip.


2 Tracks, 1 Conference: IE Summit in Chicago

Why should you be interested in these IE summits? Let me break it down to you visually:

[Image: IE Executive Summary infographic]

Be a part of the conversation about real solutions to the real problems you face every day:

Predictive Analytics Innovation Summit: November 14 – 15, Chicago, 2013
Driving Business Success Through Predictive Data Analytics

The Predictive Analytics Innovation Summit brings the leaders and innovators from the industry together for an event acclaimed for its interactive format, combining keynote presentations, interactive breakout sessions and open discussion.

Modern businesses now have access to more data on customers than ever before; the challenge remains to identify the patterns in this data that drive success. Investment in predictive analytics gives organizations the opportunity to gain insight from such a valuable resource, offering a crucial advantage over competitors.

Register >>



Business Intelligence Innovation Summit: November 14 – 15, Chicago, 2013
Driving Business Success Through Innovative BI

The Business Intelligence Innovation Summit brings the leaders and innovators from the industry together for a summit acclaimed for its insight into business intelligence and analytics.

Effective business intelligence is central to business success. In the modern business environment technological developments and the advances of globalization have created unparalleled opportunities for businesses to expand their markets. But new opportunity has opened the door to new challenges.

Register >>


Inbound + Nate Silver = Inspired Chimps

At Infochimps, we take professional development seriously. Last week, Infochimps gave the marketing team the opportunity to attend HubSpot’s Inbound 2013 conference in the beautiful city of Boston.

The 4-day inbound marketing conference delivered product demos, educational sessions, and networking opportunities well beyond my expectations. The heavy keynote lineup had everyone excited going in, but nothing could have prepared me for the inspiration each speaker delivered.


Seth Godin inspired me to become someone everyone remembers when I’m out of the room; Arianna Huffington inspired me to renew myself; Scott Harrison inspired me to give back; and of course Nate Silver, the statistician who has made a big name for himself in the Big Data space, inspired me to be more creative in business.

If I had to focus on one keynote speaker for this blog post, it would undoubtedly be Nate Silver. He was the perfect keynote for us Big Data marketing nerds. Famous for accurately predicting the last two presidential elections through data analysis, Nate Silver explained the gap between the promise and the reality of Big Data and proposed four suggestions for using data to make better business decisions.

The following image was my favorite slide, representing Big Data’s challenge. Credit to Christopher Penn for capturing a better image than my own.

[Image: Nate Silver’s slide on Big Data’s challenge, captured by Christopher Penn]

Too awed to jot down all of his inspiring quotes, I did the best I could. Then I came across the article “9 Inspirational Quotes from Nate Silver at HubSpot’s INBOUND 2013”, which pinpoints some spot-on quotes. My favorite: “if you don’t know where you are in the present, it’s hard to take quality steps toward the future.”

Nate concluded his keynote with this final slide, his last suggestion on the road to wisdom:

[Image: Nate Silver’s final slide]

Thank you Infochimps for valuing my professional growth, thank you HubSpot for a successful Inbound conference, and thank you to all the marketers who strive to inspire each and every day.


5 Reasons to Not Care About Predictive Analytics

Technology: complex and alienating, or promising and fascinating?

I’ve seen plenty of people roll their eyes and give all sorts of reasons they don’t pay much attention to predictive analytics, the increasingly common technology that makes predictions about what each of us will do—from buying, thriving, and donating, to stealing and crashing your car. Here are 5 reasons to go ahead and ignore this prognostic power… or not—you may choose to pay close attention after all.

1. Predictive computers don’t affect me. Not true. You are predicted every day by companies, government, law-enforcement, hospitals, and universities. Their computers say, “I knew you were going to do that!” These institutions seize upon newfound power, predicting whether you’re going to click, buy, lie, or even die. Their technology foresees who will drop out of school, cancel a subscription or get divorced, in some cases before they are even aware of it themselves. Although largely unseen, predictive proaction is omnipresent, determining whom to call, mail, investigate, incarcerate, set up on a date, or medicate.

2. Corporations invade privacy with data and prediction. This is sometimes true. Predicting human behavior is a new “super power” that combats financial risk, fortifies healthcare, conquers spam, toughens crime-fighting, boosts sales, and wins votes. Organizations gain this power by predicting potent yet—in some cases—sensitive insights about individuals. Companies ascertain untold, private truths—Target figures out that some customers are pregnant and Hewlett-Packard deduces who’s about to quit his or her job. We must each make our own judgment about judges and parole boards who rely every day on crime-predicting computers to decide who stays in prison and who goes free.

3. Prediction is impossible. Not so fast. Nobody knows the future, but putting odds on it to lift the fog just a bit off our hazy view of tomorrow—that’s paydirt. Organizations win big by predicting better than guessing, and they are continually cranking up the precision of predictive technology. Per-person prediction is the key to driving improved decisions, guiding millions of per-person actions. For healthcare, this saves lives. For law enforcement, it fights crime. For business, it decreases risk, lowers cost, improves customer service, and decreases junkmail and spam. It was a contributing factor to the reelection of the U.S. president. Predictive analytics is one of this century’s most important emerging applied sciences.

4. Science is boring—I drive a car but I don’t care how it works. Think again. Cars are simple: little explosions push them. But a computer that learns to predict? That’s a conceptual revolution. There’s an inevitable parallel to be drawn between how a computer learns and how a person learns that only gets more interesting as you examine the details of the machine learning process. It gets even more exciting when you see the heights this technology can reach, such as that achieved by IBM’s Watson computer, which defeated the all-time human champions on the TV quiz show Jeopardy! by “predicting” the answer to each question.

5. I hate math. That’s OK. You don’t need formulas to see how this fascinating science works. Predictive analytics learns by example. The process is not so mysterious: If people who go to the dentist most often pay their bills on time, this factoid is noted and built upon to help predict bill payments. At its core, this technology is intuitive, powerful and awe-inspiring—learn all about it!
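To make the dentist example concrete, here is a toy sketch of learning by example: tally historical records, then score a new person by the payment rate observed among similar past people. This is entirely illustrative (the data and field names are made up); real predictive models weigh many such signals at once.

```ruby
# Toy "learn by example" scorer: estimate the chance someone pays on
# time from the payment rate of past people with the same habit.
def payment_rate(records, frequent_dental_visits)
  similar = records.select { |r| r[:frequent_dental_visits] == frequent_dental_visits }
  return 0.5 if similar.empty?  # no evidence yet: fall back to a coin flip
  similar.count { |r| r[:paid_on_time] }.fdiv(similar.size)
end

# Hypothetical historical examples the model "learns" from.
history = [
  { frequent_dental_visits: true,  paid_on_time: true  },
  { frequent_dental_visits: true,  paid_on_time: true  },
  { frequent_dental_visits: true,  paid_on_time: false },
  { frequent_dental_visits: false, paid_on_time: false },
]
```

With this history, a frequent dental visitor scores 2/3 while an infrequent one scores 0: the factoid about dentist visits has been noted and built upon, exactly as described above, just at cartoon scale.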

Eric Siegel, Ph.D., is the founder of Predictive Analytics World (www.pawcon.com)—coming in 2013 and 2014 to Toronto, San Francisco, Chicago, Washington D.C., Boston, Berlin, and London—and the author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (February 2013, published by Wiley). For more information about predictive analytics, see the Predictive Analytics Guide (www.pawcon.com/guide).


Infochimps, a CSC Company = Big Data Made Better

Post by Jim Kaskade, CEO

What’s a $15B powerhouse in information technology (IT) and professional services doing with an open source-based Big Data startup?


It starts with “Generation-OS”. We’re not talking about Gen-Y or Gen-Z. We’re talking Generation ‘Open Source’.

Massive disruption is occurring in information technology as businesses are building upon and around recent advances in analytics, cloud computing and storage, and an omni-channel experience across all connected devices. However, traditional paradigms in software development are not supporting the accelerating rate of change in mobile, web, and social experiences. This is where open source is fueling the most disruptive period in information technology since the move from the mainframe to client-server: Generation Open Source.

Infochimps = Open Standards based Big Data

Infochimps delivers Big Data systems with unprecedented speed, scale, and flexibility to enterprise companies. (And when we say “enterprise companies,” we mean the Global 2000, a market in which CSC has proven its success.) By joining forces with CSC, together we will deliver one of the most powerful analytic platforms to the enterprise in an unprecedented amount of time.

At the core of Infochimps’ DNA is our unique, open source-based Big Data and cloud expertise. Infochimps was founded by experts in data science, cloud computing, and open source, who have built three critical analytic services required by virtually all next-generation enterprise applications: real-time data processing and analytics, batch analytics, and ad hoc analytics – all for actionable insights, and all powered by open standards.

CSC = IT Delivery and Professional Services

When CSC begins to insert the Infochimps DNA into its global staff of 90,000 employees, focused on bringing Big Data to a broad enterprise customer base, powerful things are bound to happen. Infochimps Inc., with offices in both Austin, TX and Silicon Valley, becomes a wholly-owned subsidiary, reporting into CSC’s Big Data and Analytics business unit led by Sashi Reddi, VP and GM.

The Infochimps’ Big Data team and culture will remain intact, as CSC leverages our bold, nimble approach as a force multiplier in driving new client experiences and thought leadership. Infochimps will remain under its existing leadership, with a focus on continuous and collaborative innovation across CSC offerings.

I regularly coach F2K executives on the important topic of “splicing Big Data DNA” into their organizations. We now have the opportunity to practice what we’ve been preaching, by splicing the Infochimps DNA into the CSC organization, acting as a change agent, and ultimately accelerating CSC’s development of its data services platform.

Infochimps + CSC = Big Data Made Better

I often laugh when we’re knocking on the doors of Fortune 100 CEOs.

“There’s a ‘monkey company’ at the door.”

The Big Data industry seems to be built on animal-based brands like the Hadoop Elephant. So I keep running with the animal theme, by asking C-levels the following question when they inquire about how to create their own Big Data expertise internally:

“If you want to create a creature that can breathe underwater and fly, would it be more feasible to insert the genes for gills into a seagull, or splice the genes for wings into a herring?”

In other words, do you insert Big Data DNA into the business-savvy side of the house with simplified Big Data tools, or insert business DNA into your Big Data-savvy IT organization? In the case of CSC and Infochimps, I doubt that Mike Lawrie, CSC CEO, wants to be associated with either a seagull or a herring, but I do know he and his senior team are executing on a key strategy to become the thought leader in next-generation technology, starting with Big Data and cloud.

Regardless of your preference for animals (chimpanzees, elephants, birds, or fish), the CSC and Infochimps combination speaks very well to CSC’s strategy for future growth with Big Data, cloud, and open source. At Infochimps, we look forward to leveraging CSC’s enterprise client base, industrialized sales and marketing, solutions development and production resources to scale our value proposition in the marketplace.

“Infochimps, a CSC company, is at the door.”

Jim Kaskade


Infochimps, a CSC Company





Announcing Infochimps Cloud 3.2

AllCloudServices 1024x515 Announcing Infochimps Cloud 3.2

Moving petabytes or even hundreds of terabytes of data to the public cloud can be costly and time consuming work. Since its conception, the goal of the Infochimps Cloud has been to provide the elasticity, scalability, and resiliency of cloud-based big data infrastructure, but in any environment you choose. That may mean the public cloud such as Amazon Web Services, but that may also mean a virtual private cloud, an outsourced data center such as Switch SuperNAP, or your own internal corporate data center.

With the latest release of the Infochimps Cloud, we’re excited to fully realize that vision: easily moving your data analytics solution to your data, not just moving your data to your analytics solution.

Full Private Cloud Support
Infochimps provides not only analytics cloud services but also virtualization integration. With this newest release, the Infochimps Cloud fully integrates with VMware® vSphere®. This integration empowers customers to deploy the full Infochimps stack internally, leveraging their own data center and their own hardware, and either their own VMware software or an integrated Infochimps + VMware solution.

private cloud deploy options 1024x505 Announcing Infochimps Cloud 3.2

This virtualization integration framework, powered by Ironfan and Chef, enables the Infochimps Cloud to deploy to any data center where hardware and virtualization are available. For example, Infochimps partner Switch has a Tier 4 facility in Las Vegas with a 100% data center uptime guarantee, where Infochimps can quickly and seamlessly deploy big data solutions that have unrivaled reliability and high availability.
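For a sense of what that framework looks like in practice, here is a sketch of the shape of an Ironfan-style cluster definition. This is illustrative only: the cluster, facet, and role names are made up, and the exact DSL methods may differ between Ironfan versions.

```ruby
# Illustrative Ironfan-style cluster definition (all names hypothetical).
# Ironfan describes the cluster declaratively; Chef then converges each
# node to its listed roles, whatever the underlying hardware or
# virtualization layer happens to be.
Ironfan.cluster 'analytics_demo' do
  cloud(:ec2) do
    flavor 'm1.large'
  end

  facet :master do
    instances 1
    role      :hadoop_namenode
  end

  facet :worker do
    instances 4
    role      :hadoop_datanode
  end
end
```

Because the target environment is isolated behind the `cloud` declaration, pointing the same definition at a different provider is what makes redeployment to a facility like SuperNAP or an internal data center tractable.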

Ultimate in Cloud Mobility
One of the amazing differentiators of the Infochimps Cloud is the concept of cloud mobility. Start in one environment, such as Amazon Web Services, to quickly build your application and provide a development and testing platform for your team. At any time, you can quickly migrate both your cloud services infrastructure and your big data application logic to a different environment, such as SuperNAP or your internal data center, for your final production application.

cloud hybrid and migration 1024x201 Announcing Infochimps Cloud 3.2

This is enabled by both the Ironfan homebase and application Deploy Pack frameworks, which provide folder structures to encapsulate your infrastructure and application code, and seamlessly allow them to plug into different hardware and different cloud services nodes respectively.

While this capability makes a lot of sense for applications with sensitive data or security concerns, it is also extremely useful when customers want to get started as quickly as possible. Infochimps can turn over a completely configured Amazon Web Services environment in just a few days, so developers and analysts can begin cranking away while a data center environment is prepped for the eventual second stage of infrastructure deployment.

Improved Developer and Data Scientist Tools
Major improvements have also been made to the user experience of working with the Infochimps Cloud platform.

Wukong 3.0 is the latest DSL and command line toolkit for rapid big data application development:

  • Updated wukong-hadoop for writing Hadoop Streaming jobs with simple micro-scripts
  • All new wukong-storm for taking your Wukong flows (stitching together data sources, “processors,” and data destinations) and deploying them as Storm topologies
  • All new wukong-deploy for quickly generating Deploy Packs that encapsulate your application logic, which can be tested locally and then deployed to your Infochimps Cloud solution
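For readers new to Hadoop Streaming, the contract that wukong-hadoop’s micro-scripts build on is simple: a mapper reads raw lines on stdin and writes tab-separated key/value pairs to stdout. A minimal word-count mapper sketch in plain Ruby (this is the underlying streaming contract, not the Wukong DSL itself):

```ruby
# Minimal Hadoop Streaming word-count mapper sketch: for each input
# line, emit one "word<TAB>1" pair per word. Hadoop sorts the pairs
# by key before the reducer sums the counts for each word.
def map_lines(lines)
  lines.flat_map do |line|
    line.downcase.scan(/[a-z']+/).map { |word| "#{word}\t1" }
  end
end

# As a streaming mapper this would be driven by:
#   map_lines($stdin.each_line).each { |pair| puts pair }
puts map_lines(["Big Data, big wins"])
```

Roughly speaking, the Wukong DSL wraps this stdin/stdout plumbing so authors write only the per-record logic, which is what makes the micro-scripts above so short.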

The Infochimps Cloud API has been enhanced for more cross-platform functionality:

  • Unified monitoring metrics are available for understanding what is happening within the platform
  • It’s even simpler to store configuration values and settings, which can be utilized by any of your applications across the various Infochimps cloud services

To learn more about the Infochimps Cloud and the latest enhancements, request a demo today.