Company Culture

Nothing so Practical as a Good Theory

The most common error I have encountered among new data science practitioners is forgetting that the goal is not simply knowledge, but actionable insight. This isn’t limited to data scientists. Many analysts get carried away with the wrong metrics, tracking what is easy to measure rather than what is correct to measure. New data scientists get carried away with the latest statistical method or machine learning algorithm, because that’s much more fun than acknowledging that key data are missing.

To create actionable insight, we must start from the action, a choice. Data science is useless if it is not used to make decisions. When starting a project, I first ask how we will measure our progress towards our goals. As my colleague Morgan said last week, this often boils down to revenue, cost, and risk. An economist might bundle that up as time-discounted risk-adjusted future profits. My second task is identifying what decisions we will make in the process of accomplishing these goals.

The choices we make might be between different types of actions or might be between different intensities of an action: which advertising campaign, how much to spend, etc. These choices usually benefit from information. Some choices, such as selecting “red” or “black” at the roulette table, do not benefit from information. The outcome of most choices is partially dependent on information. Knowledge gives us power, but there is some randomness too. We might have hundreds of observations of every American’s response to our spokesperson’s call to action, but the predictive model we generate from that data might not help us after the spokesperson’s embarrassing incident at the golf course. The business case for data science is the estimation of how much information we can gain from our data and how much that information will improve the time-discounted, risk-adjusted benefit of our decisions.

The third task is picking what metrics to use. A management consultant might call this developing key performance indicators. A statistician might call this variable selection. A machine learning practitioner might call this feature engineering. We transform, combine, filter, and aggregate our data in clever and complex ways. Most critical is picking a good dependent variable, or explained variable. This is the metric you are predicting. This will be the distillation of all our knowledge to a single number.

To pick a good dependent variable, a data scientist must consider the quality of the data available and what predictions they might support, but more importantly, the data scientist must consider the decision improved by our prediction. When choosing whether to eat outside for lunch, we prefer to know the temperature at noon rather than the average temperature for the day. More important would be the chance of rain. The exact temperature to the fraction of a degree is unnecessary. Best of all would be a direct estimate of lunchtime happiness for outside versus inside on a scale of, “Yes, go outside” or “No, stay inside.” Unfortunately, we often cannot pick the most directly representative variable, because it is too difficult to measure. Lunchtime surveys would be expensive to conduct and self-reported happiness might be unreliable. A good dependent variable balances predictive power with decision relevance.
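As a rough sketch of the lunch example, the decision can be framed as comparing expected happiness under a predicted chance of rain. The happiness scores and the function names below are made-up assumptions for illustration, not measurements.

```python
# Hypothetical happiness scores for each (choice, weather) pair; illustrative only.
HAPPINESS = {
    ("outside", "dry"): 10,
    ("outside", "rain"): 0,
    ("inside", "dry"): 6,
    ("inside", "rain"): 6,
}
CHOICES = ("outside", "inside")


def expected_happiness(choice, p_rain):
    """Average happiness of a choice, weighted by the chance of rain."""
    return (
        p_rain * HAPPINESS[(choice, "rain")]
        + (1 - p_rain) * HAPPINESS[(choice, "dry")]
    )


def best_choice(p_rain):
    """Pick the lunch spot with the higher expected happiness."""
    return max(CHOICES, key=lambda c: expected_happiness(c, p_rain))


def value_of_forecast(p_rain):
    """Gain from knowing the weather before choosing (a perfect forecast)."""
    without = expected_happiness(best_choice(p_rain), p_rain)
    with_forecast = (
        p_rain * max(HAPPINESS[(c, "rain")] for c in CHOICES)
        + (1 - p_rain) * max(HAPPINESS[(c, "dry")] for c in CHOICES)
    )
    return with_forecast - without


print(best_choice(0.1), best_choice(0.8), value_of_forecast(0.5))
```

The point of the sketch is that the prediction only matters through the choice it changes: a forecast is worth the most when the chance of rain is near the point where the two options are equally attractive.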

After we have built a great predictive model, the last step is figuring out how to operationalize the knowledge we gained. This is where the data science stops and the traditional engineering, or big data engineering, starts. No matter how great our product recommendations are, they are useless if we do not share those recommendations with the customer in a timely manner. In large enterprises, operationalizing insights often requires complex coordination across teams and business units, as hard a problem as the data science. Keeping this operation in mind from the start of the project will ensure the data science has business value.

Michael Selik is a data scientist at Infochimps. Over his career, he has worked for major enterprises and venture-backed startups, delivering sophisticated analysis and technology project management services from hyperlocal demographics inference to market share forecasting. With Infochimps, Michael helps organizations deploy fast, scalable data services. He received an MS in Economics, a BS in Computer Science, and a BS in International Affairs from the Georgia Institute of Technology; he likes bicycles and semi-colons.


Data Science and the Personal Optimization Problem

“What gets measured gets done” is a common refrain. And, to a large extent, that is how the business world works. As Data Scientists, we have an outsized influence on what gets measured (and by extension, what gets done) in a business. This is especially true with the advent of predictive analytics. We have a lot of responsibility, and we need to use it wisely.

Data Scientists need to be proactive to ensure that what we model and predict and measure provides quantifiable value for our organization. But how can we do this, realistically? After all, the numbers are the numbers; we are just drawing conclusions from them. Right? The truth is that you can have two Data Scientists develop models with the same tools against the same data, and one analysis can be significantly more valuable to the people paying the bills. It is our own personal optimization problem.

A salesperson usually has a number of accounts that bring in revenue. A typical consultant has one or more projects that they can bill hours to. However, if you are in R&D or on staff in a support role, how can you ensure that your data science is valuable to your organization?

As a Data Scientist, the best barometer for the business value of your work is how well it:

  1. Generates Revenue
  2. Reduces Cost
  3. Eliminates Risk

That sounds great, but how does a Data Scientist know that what they are working on is valuable? This can be especially hard to figure out if you are working in a supporting role or are in a shared service environment, such as a centralized data science team in a large organization. My colleagues and I have had long discussions on this subject, and it seems that there is little consensus on how to do this effectively.

However, I have one sure-fire way to make sure that your data science is as valuable to your organization as you are.

Personal Optimization for Data Scientists

For every project that you work on, imagine that your part is going to be used as an entry on your resume in a section marked “Major Accomplishments” (there are lots of resume guides available that talk about how to do this).  Now, think about a hiring manager who is looking at your resume; not some bozo or corporate drone who is just there to fill bodies. Imagine a shark, someone who knows the industry inside and out and wants only to hire the best; someone who knows the data and the math and can sniff out a phony a mile away.

The hiring manager is going to grill you for detailed answers about your major accomplishments.  They want to know what you know and how you learned it.  They want to know what went well and what didn’t.  They want to know if you can do the same (or better) work for them.  They want to make sure that you know the theory and the application, and can deliver on the goods in a timely manner.  This is the definition of the bottom line.

Can you comfortably sit down in front of this person and talk about your major accomplishments?  Is your data science adding to your list of accomplishments?

Making Data Science Count

Data science has some really fantastic tools such as machine learning, data mining, statistics, and predictive modeling.  They are only going to get better in the future. However, we have to remember that these are just tools at our disposal.  Having skilled craftsmen using the best tools is key, but the most important thing we can do is to make sure that we are building the right things.

One of the things I like best about the Infochimps Cloud is that it takes care of all the infrastructure and architecture work in building a Big Data solution, and lets me focus on really figuring out how to make a valuable solution.  I don’t have to worry about building a Hadoop cluster for batch analytics, or stitching together Storm and Elasticsearch and Kibana to deliver real-time visualizations.  I also don’t have to worry about scaling things up if and when my data volume goes through the roof.

When I build with Infochimps, I know that my effort is being harnessed to build out major accomplishments; not to build sandboxes or dither with infrastructure issues. If you would like to learn more about Infochimps and the value of real-time data science, come by and see us at Strata in New York on October 28-30.  See you there!

Morgan Goeller is a Data Scientist at Infochimps, a CSC company. He is a longtime numbers guy with a B.S. in Mathematics and background in Hadoop, ETL, and Data Warehousing. Morgan lives in Austin, Texas with his wife, sons, and many cats and dogs.


Part 2: The Lucky Break Scoreboard

Last week, Infochimps CTO Flip Kromer introduced his truth on the failures that led to the successful acquisition by CSC in his blog post, Part 1: The Truth – We Failed, We Made Mistakes.  Flip continues his blog series with Part 2, his love letter – the real Infochimps story.


7 years ago, having switched majors from Computer Science in college to Physics in grad school, and failing twice to successfully execute a plan of research in Physics, I decided to switch to Education – my favorite part of grad school was teaching. A year before, my ever-patient advisor, physics professor Mike Marder, had started a wildly successful alternative program for a public-school teaching certification. It replaced a full general education curriculum with frequent in-classroom experience and focused education classes  — and it let me reuse the scientific coursework I already had way too much of.

A year later, I was near the end of the program and preparing my teaching portfolio, which led me to spend a lot of time thinking about what I wanted my students to learn, and why. For many of them, my course would be their last formal chance to acquire the skill of quantitatively understanding their universe. As I started to write (less bluntly), I had no interest in burdening them with three different forms of the quadratic equation, or pretending that as a practicing physicist I’d ever used the formula for the perimeter of a trapezoid.

What they should be learning was the ability to make use of a complex information stream, understand sophisticated information displays, and extract straightforward insight using tools such as … … ‽‽

I paused, struck, mid-sentence. Those tools do not exist. Not for a high school student, not for a domain expert in another field, and only after years of study, for me. That’s what I was supposed to be working on: democratizing the ability to see, explore and organize rich information streams.

So as a lapsed computer scientist and failed physicist, I decided to abandon education as well and start yet a different new thing, one that was none of those and all of those together.

Challenge Accepted

I asked Mike Marder if I could come back to his research group and work on tools to visualize data; we could figure out along the way how to tie it into a research plan. I had some savings (thanks largely to my Grandmother, who was just your typical successful 1940’s woman entrepreneur), so I wouldn’t cost him any money. Mike reasoned that although I didn’t know how to solve my own problems, I was frequently useful in helping others solve theirs — and who knows, I seemed really fired up about this new idea whatever it was. So all in all it was an easy decision to hide me away in a shared office and let me get to work.

Building the visualization tool required demonstration data sets to prove the concept, and there are few better than the ocean of numbers around Major League Baseball.

In addition to the retrosheet project — the history of every major-league baseball game back to the 1890s — MLB was publishing one of the most remarkable data sets I knew of. For the past seven years, it gave every single game, every single at-bat, every single play, down to the actual trajectory of every single pitch. I first started playing with the retrosheet data, and found some scattered errors — things like a game-time wind speed of 60mph.

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support.)

Weekend Project Gone Awry

Well, the NOAA has weather data. Lots of weather data. The hour-by-hour global weather going back 50 years and more, hundreds of atmospheric measurements for every country in the world, free for the asking. And the Keyhole (now Google Earth) community published map files giving the geolocation of every current and historical baseball stadium.

So if you’re following, we have:

  • A full characterization of every game event
  • … including the time of the game and the stadium it was played in,
  • … and so using the stadium map files, the event’s latitude and longitude
  • … and using that lat/long, all the nearby weather stations
  • … and using the game date and time, the atmospheric conditions governing that event
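The chain of lookups in the list above can be sketched roughly as follows. Every stadium, station, and reading here is an invented toy record, and the function names are assumptions for illustration; the real retrosheet, stadium map, and NOAA files have their own formats.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

# Toy records; every name and value here is made up for illustration.
STADIUMS = {"Example Park": (30.28, -97.73)}  # stadium -> (lat, lon)
STATIONS = {
    "STATION_A": (30.30, -97.70),
    "STATION_B": (32.90, -97.04),
}
# (station, hour) -> wind speed in mph
WEATHER = {
    ("STATION_A", datetime(2007, 6, 1, 19)): 8.0,
    ("STATION_B", datetime(2007, 6, 1, 19)): 15.0,
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_station(point):
    """Name of the weather station closest to a (lat, lon) point."""
    return min(STATIONS, key=lambda name: haversine_km(point, STATIONS[name]))


def wind_for_event(event):
    """Follow the chain: event -> stadium -> lat/long -> station -> weather."""
    location = STADIUMS[event["stadium"]]
    station = nearest_station(location)
    # Weather is keyed by the hour, so truncate the event time to match.
    hour = event["time"].replace(minute=0, second=0, microsecond=0)
    return WEATHER.get((station, hour))


pitch = {"stadium": "Example Park", "time": datetime(2007, 6, 1, 19, 42)}
print(wind_for_event(pitch))
```

Even in this toy form, most of the code is about making the data sets agree on keys and units (stadium names, coordinates, hourly timestamps) rather than about any analysis.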

I connected the data sets looking to correct and fill in the weather data, and found I had accidentally wired up a wind tunnel. There’s no laboratory with the budget to have every major league pitcher throw thousands of pitches for later research purposes — none, except the data set I described.

What’s screwy (and here’s where every practicing data scientist groans and shakes their head) is that the hard part wasn’t performing the analysis. The hard parts were a) making that data useful, and b) connecting the data sets, making them use the same concepts and measurement scales.

But all that work — the mundane, generic work anybody would have to do — just sat there on my hard disk. If I created a useful program, or improved an existing public project, I knew right where to go: open-source collaboration hubs like sourceforge or github. But no such thing existed for data. I had to spend weeks transforming the MLB game data into a form that you could load into a database. If we could avoid that repetition of labor, we would solve the problem of every practicing data scientist.

On Christmas Day 2007, I bought a book on how to build websites using the “Ruby on Rails” framework, and figured I’d knock something useful out in, y’know, a week or so. By sometime that Spring, I had something useful: a few interesting data sets and a website to generically host and describe any further data sets. The initial version of the site was read-only, because I didn’t know how to do join models or form inputs in Ruby on Rails, but I could add new data sets directly to the database. And just like that, Infochimps was born.

I cold-emailed blogger Andy Baio, who linked to “Infochimps, an insane collection of open datasets”. For a guy working alone in an ivory tower, the resulting response was overwhelming.

One of the individuals who emailed to encourage us was Jeff Hammerbacher, founder of the data team at Facebook. Chatting on the phone with him, he told me about a new data analysis tool that Facebook was using, called Hadoop. I looked into it, but couldn’t see how I would ever need to use it. Still, it was really exciting that big names in data were taking interest.

On a trip to San Francisco a few weeks later, I went to a meetup at Freebase. @skud, their community manager, recognized that Infochimps was the perfect raw-data complement to Freebase. She asked me to come back the next month and give a meetup talk. Kurt Bollacker, head of their data team (and future teammate and profoundly valuable mentor), asked me to come back the next day and give an internal lunch lecture. I stayed up all night using google docs on my uncle’s powerpoint-less computer, and gave some hot mess of a presentation to their internal group. Kirrily didn’t uninvite me, so it wasn’t too bad.

It was clear that the lack of a collaboration hub was a problem many people were feeling.

So as a lapsed computer scientist, failed physicist, and no-show educator, I decided to abandon working on a visualization tool and make a collaboration hub instead. Yup.

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it.)


One of the new faces on Mike’s research team when I returned was Dhruv Bansal, who was working on a fascinating problem bridging Mike’s two interests: physics and education. They used a freedom-of-information request to acquire a fascinating data set: the anonymized test scores for every student, on every question, for the yearly exam taken by every schoolchild in Texas.

They used the physics equations for fluid flow to model the year-on-year change in student test scores, highlighting patterns that demanded immediate action within the education community.

As you can guess again, the costliest part of that project was not performing the analytics; or applying the Fokker-Planck equation for fluid-flow; or working the paper through peer review. No, the costliest part of the project was the 3-month process of acquiring the data and cleaning it for use. For the random researcher who discovered and requested the data, Dhruv would spend a few hours burning the data to a DVD and physically mail a copy. For reasons I still don’t understand, while researchers in Sociology, Psychology, other “soft” sciences immediately latched on to the usefulness of Infochimps from the very start, Physicists and Computer Scientists almost never understood what we were doing or why it might be valuable. Dhruv and Mike’s split focus meant they got it immediately.

This is probably the most unlikely lucky break, and most crucial development, of this adventure: sitting a few offices away from where I worked was one of the most talented programmers I’ve ever worked with, possessed with a mountainous drive to change the world, the laconic cool to keep me level, and a furious anger at the same exact problem I was working to solve.

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it; sharing the same advisor as Dhruv.)

Twitter Dreams

At around this time Twitter was blowing up in popularity, though still a tool largely used by nerds to tell each other about what they had for lunch. We couldn’t explain, any more than most, the appeal of Twitter as a social service.

But to two physicists with a background in the theory of random network graphs, Twitter as a data set was more than a social service; it was a scientific breakthrough. It implemented a revolutionary new measurement device, giving us an unprecedented ability to quantify relationships among people and conversations within communities. Just as the microscope changed biology, and the X-ray transformed medicine, we knew that seeing into a new realm placed us on the cusp of a new understanding of the human condition. Making this data available for analysis and collaboration was the best way to provide value and draw attention to the Infochimps site. We emailed Alex Payne, engineering lead at Twitter, for permission to pull in that data and share it with others. He gave me a ready thumbs-up: better that scientists download the data from us, than that they pound it out of his servers.

We wrote a program to ‘crawl’ the user graph: download a user, list their followers, download those users, list their followers, repeat. That was the easy part. Sure, each hundred followers had hundreds of followers themselves, but we could make thousands of requests per hour, millions of requests per week.
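A minimal sketch of that crawl loop, written as a breadth-first traversal. This is not the original program: `fetch_followers`, the `FOLLOWERS` table, and the `max_users` cap are hypothetical stand-ins, and a real crawler would make rate-limited API requests instead of reading a dictionary.

```python
from collections import deque

# Hypothetical stand-in for the service: user -> list of followers.
FOLLOWERS = {
    "alice": ["bob", "carol"],
    "bob": ["carol", "dave"],
    "carol": ["alice"],
    "dave": [],
}


def fetch_followers(user):
    """Placeholder for one API request; reads the toy graph above."""
    return FOLLOWERS.get(user, [])


def crawl(seed, max_users=1000):
    """Breadth-first crawl of the follower graph starting from one user."""
    seen = {seed}          # users already queued, so nobody is fetched twice
    queue = deque([seed])  # users whose followers still need listing
    edges = []             # (follower, followed) pairs
    while queue:
        user = queue.popleft()
        for follower in fetch_followers(user):
            edges.append((follower, user))
            if follower not in seen and len(seen) < max_users:
                seen.add(follower)
                queue.append(follower)
    return seen, edges


users, edges = crawl("alice")
print(len(users), len(edges))
```

The `seen` set is what keeps the loop from cycling forever on mutual follows, and the cap is there because the frontier grows by a factor of hundreds at each step.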

The hard part came over the next few weeks as we realized that none of our tools were remotely capable of managing, let alone analyzing, the scale of data we so easily pulled in. As quickly as we could learn MySQL, the data set outgrew it. Sure, Dhruv and I could request supercomputer time for research, but supercomputers weren’t actually a good match — they’d be more like a rocketship when what we needed was a fleet of dump trucks. We realized what we needed was Hadoop, the tool Jeff Hammerbacher mentioned to me a few months earlier.

But where could we set up Hadoop? The physics department’s computers were scattered all over and largely locked down. But I also had an account on the UT Math department’s computers. Their sysadmin, Patrick Goetz, was singularly passionate about enabling researchers with the tools they needed to make breakthroughs. He took the much more courageous (and time-consuming for him) route of allowing expert users to install new software across departmental machines.

What’s more, the Math department had just installed a 70-machine educational lab. During the day, it was filled with frustrated freshmen fighting Matlab and math majors making their integrals converge. From evening to 6am, however, they were just sitting there… running… inviting someone to put them to good use.

So that’s what we did; put them to good use. We set up Hadoop on each of the machines, modifying their configuration for the comparatively wussy undergrad-lab hardware, and set about using this samizdat supercluster on the Twitter user graph.

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it; sharing the same advisor as Dhruv; the explosion of social media data; the invention of Hadoop.)

Data Community

All through 2006-2009, people walking different paths — social media, bioinformatics, web log analysis, graphic design, physics, open government, computational linguistics — were arriving in this wide-open space, forming communities around open data and Big Data.

On Twitter, we were finally seeing what all the people in our favorite data set knew: a novel communication medium that enabled frictionless exchange of ideas and visible community. I’ll call out people like @medriscoll (CEO of Metamarkets), @peteskomoroch (Principal Data Scientist at LinkedIn), @mndoci (Product Manager of Amazon EC2), @hackingdata (Founder of Cloudera, now professor at Mt Sinai School of Medicine), @dpatil (everything), @neilkod and @datajunkie (Facebook data team), @wattsteve (Head of Big Data at Red Hat), among dozens more. It didn’t matter if someone was a random academic, a bored database engineer, a consultant escaping one field into this new one, or a big name building the core technology. When you saw a person you respected talking to a person with a good idea, you hit “follow”, and you learned. And when you heard that someone in the Big Data space wasn’t on Twitter, you harangued them until they joined. (Hi, Tom!)

Meanwhile, Aaron Swartz had started the Get.theinfo Google Group. This most minor of his contributions had a larger impact than most know, and was typical of why he’s so missed. He recognized a problem (no conversation space for open-data enthusiasts), built just enough infrastructure to solve it (a google group and a website), then galvanized the community to take over (gifting enthusiastic members with the white elephant of moderator permissions), and offered guidance to make it grow.

The relationships we built and communities we joined became critical catalysts for our growth.

Twitter Reality

We spent the next several months building out the site during the day and running analysis on the growing hundreds of gigabytes by night (does that seem quaintly small now?). Right before Christmas break, we did a set of runs producing data suitable for people in the community to find useful. Hours before hopping on the plane to visit my family, I finished compressing and uploading them, wrote up a minimal readme file, and posted a note to the Get.Theinfo mailing list. I knew the folks there wouldn’t mind the rough cut version, so I figured I’d mention it quietly there, but wait to do a proper release after break — after all, there was no internet where I’d be staying.

Well, two predictable things happened: 1) a huge response, far more than expected, flowing up the chain to large tech blogs and twitter-ers, and 2) a polite but forceful email from Ev Williams (Twitter’s CEO) asking us to take the data files down while they figured out a data terms-of-service. We reluctantly removed the data.

Sure, the experience was a partial success. It brought great publicity, and of course you probably caught the foreshadowing of how important Hadoop was about to become for us. But we failed at the important goal, sharing this immensely valuable data we invested months to release.

Minister of Simplicity

Now to introduce Joe Kelly into the story. Our research center decided to hire someone to build our new website, and one of the respondents to our Craigslist ad was Joe, a former UT business school student who had been working with his roommate to get their general contracting firm off the ground. He didn’t really know how to design websites, but he absolutely loved reading about the science our center was doing, so he applied.

His interview was amazing. He had the design sense of a paper bag compared to the other candidates, but every one of us left the room saying, “wow, that guy was awesome, the kind of person you just want to work with on a project”. Only Dhruv was smart enough to take the face-slappingly obvious next step — replying 1-to-1 to a later email from Joe to say, “well, hey, we also have this other project going on; we don’t really need your help on the website, but there’s a lot of work to do”. Within days, Joe had set up a bank account and PO box, organized the papers to make us an official partnership, and generally turned this ramshackle project into an infant company. It was an easy decision for Dhruv and me to make him a co-founder.

An easy decision until a few days later, when I read some cautionary article about how the #1 mistake companies make is choosing co-founders hastily. Well, hell. We just made this guy we randomly met a couple weeks ago a co-founder, handing him a huge chunk of the company. I didn’t know if we just made a huge mistake or not.

So the next day, we were hanging out at the Posse East bar (our “office” for the first several months of the company), and Joe introduced us to the idea of an Elevator Pitch. “If we’re going to be at the South by Southwest (SXSW) Conference, we need to be able to explain Infochimps”. I replied with some kind of rambling high-concept noodle. Dhruv rang in with his version — more scientific, more charm and cool, but no more useful than mine.

Joe replied, “No. What Infochimps is, is this: ‘A website to find or share any data set in the world’”.

I rocked back in my chair and knew Dhruv and I made one of the best decisions of our lives. His version said everything essential, and nothing more. In one week, he understood what we were doing better than we did after a year. Joe’s role emerged as our “Minister of Simplicity”. He removed all complications, handled all necessary details, smoothed all lines of communications, making it possible for our team to Just Hack. Everything essential, and nothing more.

Capital Factory

With the decision to move forward as a company, not an academic project, we applied to the starting class of Capital Factory (Austin’s startup accelerator). It was an amazing experience, and we went hard at it: we hit all the meetings, spent hours working on our pitch, tried to make contact with every mentor, and made an epic application video. (One of Dhruv’s housemates was a professional filmmaker. Friends in high places.)

We got great feedback and obvious interest from the mentors, and were chosen as finalists. We were confident that we had the right combination of team and big idea to merit acceptance.

They rejected us.

After the acquisition, Bryan Menell — one of the Capital Factory founders — posted a graciously bold blog post explaining what happened. As we later heard from several mentors, they each individually loved our company. Once in the same room though, they found that none of them loved the same company. This mentor loved Infochimps, a company that would monetize social media data. This other one loved Infochimps, a set of brilliant scientists who could help businesses understand their data. Some of them just knew we worked our asses off and were incredibly passionate about whatever the hell it is we were doing but couldn’t explain. A few of the mentors loved Infochimps because we were building something so cool and potentially huge that surely some business value would later emerge. Whichever idea a mentor did like, they generally didn’t like the others.

I can’t overstate how difficult it was to explain what we were doing back then. After two years, we can now crisply state what we had in mind: “A platform connecting every public and commercially available database in the world. We will capture value by bringing existing commercial data to new markets, and creating new data sets from their connections.” It’s easy(er) now, partly because of the time we spent to crystallize an explanation of the idea. Even more so, people now have had years of direct experience and background buzz preparing them to hear the idea. For example, the concept that “sports data” or “twitter data” might have commercial value was barely defensible then, but is increasingly obvious now.

Above all that though, the Capital Factory mentors were right: we were all those ideas, and all of those ideas were (as we’d find out) mostly terrible. And working on the combination of all of them was a beyond-terrible idea. On that point, Capital Factory was right to reject us.

We worked hard, had the perfect opportunity, and failed.

For good reasons and bad, we failed to get in. Or, well, we mostly failed to get in. Some of the mentors liked what they heard enough to stay in touch — meeting for beers and advice, making introductions, and being generous with their time and contacts in many other ways. The Austin startup scene was about to explode, led by Joshua Baer, Jason Cohen, Damon Clinkscales, Alex Jones and others. The energy that the Capital Factory mentors and these other leaders put into mentoring startups like ours ricocheted and multiplied within the community, in the kind of “liquid network” that Steven Johnson writes about. Although the companies within the first CapFac class benefited the most, it was like every startup in Austin was admitted.

The Truth

On the one hand, we had a bunch of fans in blog land, some website code, and a good team. But we had no idea how to make money and a finite runway. Our most notable validation as a project was a failed effort to share data, and our most notable validation as a business was an honorable mention ribbon.

Are you seeing it?

We were experiencing success after success after success.

Every time we failed, a smaller opportunity opened: one that was sharper; one that was more real; one that brought us closer to the right leverage point for changing the world.

These opportunities were smaller, but the energy behind them was the same. We were following what inspired people — to use data sets from Infochimps, to post a data set, to join our pied-piper team, to tweet about us, to make an intro, to have coffee and teach us something. All our ideas were useless crap, except in one essential way: to gather and inspire the people who would help us uncover a few ideas that were good, and execute on them.

(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it; sharing the same advisor as Dhruv; the explosion of social media data; the invention of Hadoop; the completely random intersection with Joe; starting Infochimps just as the Austin startup scene exploded.)

The 3rd part of this blog series will highlight the journey from “project that inspired people” to “business that solved a real problem” — powered by individuals who made sizable investments of time, energy, money and kindness to produce repeated successes from repeated failures, and by the early customers of Infochimps who believed in us.

As we go, that “lucky break scoreboard” will get more and more improbable, enough to make that word “lucky” ludicrously inapplicable.

Philip (Flip) Kromer is co-founder and CTO of Infochimps where he built scalable architecture that allows app programmers and statisticians to quickly and confidently manipulate data streams at arbitrary scale. He holds a B.S. in Physics and Computer Science from Cornell University and attended graduate school in Physics at the University of Texas at Austin. He authored the O’Reilly book on data science in practice, and has spoken at South by Southwest, Hadoop World, Strata, and CloudCon. Email Flip at or follow him on Twitter at @mrflip.


Part 1: The Truth – We Failed, We Made Mistakes


As I’m sure most of you have heard, Infochimps was recently acquired by CSC, giving us the resources and mandate to build the Big Data platform of the future. This is a perfect landing for the company and our vision, and we couldn’t be more excited.

The great acquisition stories I’m familiar with have a few commonalities: the companies share a mission and vision; the acquired team works to focus their product and integrate it with the parent company’s offering; and the parent company gives them the resources to succeed without changing what enabled the acquired team to excel.

A perfect example of this was when Apple bought Siri. At that time, Siri was a cute little iPhone app, built on amazing technology and with a highly respected engineering team behind it. Married to Apple’s powerhouse strengths and global network, it has transformed the way people interact with machines and is a centerpiece advantage of Apple’s product. Our goal is nothing less than a similar story within CSC.

CSC is a global corporation that provides information technology (IT) services and professional services. They employ 95,000 people globally, who create a $16B revenue stream serving governments and large enterprises. Our challenge, and we embrace it, is to provide a significant positive return even against that massive background.

We think that we can do so (as do many analysts) because the acquisition marries the signal strengths of Infochimps and CSC:

We live in the future:

  • Proven Big Data expertise and perspective on the technical landscape
  • An indelible culture and a crazy-awesome team
  • Solid open-source citizenship, as contributors to the projects we build on and stewards of well-adopted projects we’ve written

CSC lives at enterprise-scale:

  • 50+ years of expertise in big enterprise and security
  • Passion for building customer solutions and support
  • The resources a $16B revenue stream provides

CSC’s strengths address our biggest weaknesses, letting us focus on what we do best. There are no changes to the team, the culture, our Austin location, our open-source contributions, our development approach, our irreverence, our hiring standards, or our mission to make the world smarter. We’ll continue to operate independently, continue buying lunch for the office every day, and continue open-sourcing the majority of code we write.

So this is a huge win for our team, our customers, our investors, and CSC. I could finish the post right here, and all anyone would remember is that we persevered and reached this milestone through tenacious hard work and great ideas.

Well here’s the truth: The actual history of our company is one of failure after failure, costly mistakes, and multiple near-death experiences. The only reason we’ve “succeeded” is through a preposterous series of lucky breaks and kind acts. Trying to list all the people behind that hard work and those lucky breaks would be foolish. There are too many, and I’ll just offend some by omission. But if you’re reading this, you’re probably one of them; so thank you.

Now for the real story; the story you probably haven’t heard. It’s a story that shows how many people, making sizable investments of time, energy, money, and kindness, it takes to make successes out of failures, and how a small favor can change the world. It’s a thank you note to those who have helped us and a love letter to other startups figuring it out as they go. It’s a reminder that this is just another chapter of Infochimps’ book, and we’re nowhere near the resolution.

Thanks and love from the co-founders and whole Infochimps team,
Flip Kromer
Infochimps Co-Founder and CTO

*Update* Flip continued this blog series with Part 2: The Lucky Break Scoreboard, where he explains “with every failure, a smaller opportunity opened: one that was sharper; one that was more real; one that brought us closer to the right leverage point for changing the world.” Read Part 2 >>



Inbound + Nate Silver = Inspired Chimps

At Infochimps, we take professional growth and development seriously. Last week, Infochimps gave the marketing team the opportunity to attend Hubspot’s Inbound 2013 conference in the beautiful city of Boston.

The 4-day inbound marketing conference delivered everything from product demos and educational sessions to networking opportunities, well beyond my expectations. I anticipated the high level of excitement for the keynote speakers given the heavy lineup, but nothing could have prepared me for the inspiration each of them delivered.


Seth Godin inspired me to become someone everyone remembers when I’m out of the room, Arianna Huffington inspired me to renew myself, Scott Harrison inspired me to give back, and of course Nate Silver, the statistician who has made a big name for himself in the Big Data space, inspired me to be more creative in business.

If I had to focus on one keynote speaker for this blog post, it would undoubtedly be Nate Silver. He was the perfect keynote for us Big Data marketing nerds. Famous for his predictions for the last two presidential elections through data analysis, Nate Silver explained the gap between the promise and the reality of Big Data and proposed 4 suggestions for using data to make better business decisions.

The following image was my favorite slide – representing Big Data’s challenge – with thanks to Christopher Penn for capturing a better image than my own.

[Image: Nate Silver’s slide on Big Data’s challenge]

I was too awed to jot down all his inspiring quotes, though I tried the best I could. Then I came across this article, “9 Inspirational Quotes from Nate Silver at HubSpot’s INBOUND 2013,” which pinpoints some spot-on quotes. My favorite is, “if you don’t know where you are in the present, it’s hard to take quality steps toward the future.”

Nate concluded his keynote with this final slide, his last suggestion on the road to wisdom:

[Image: Nate Silver’s final slide]

Thank you, Infochimps, for valuing my professional growth; thank you, Hubspot, for a successful Inbound conference; and thank you to all the marketers who strive to inspire each and every day.


The President + Infochimps + Austin


As Austin continues to thrive, making Top 10 Lists for everything from innovation to affordable housing, President Obama himself came to see what’s going on. “I’ve come to listen and learn and highlight some of the good work that’s being done,” Obama said during his visit to Austin. “Folks around here are doing something right.”

We are doing something right – these lists speak for themselves:

  • Best City for Small Business nationally by The Business Journals
  • #1 large city for young entrepreneurs according to Under30CEO
  • #1 among the 100 largest U.S. metros for recovery from the pre-recession peak to the present, measured by employment, unemployment, output, and house prices, according to the Brookings Institution
  • #3 fastest-growing tech job market according to
  • #3 on the “Best Cities for Good Jobs” list according to Forbes

In his recent visit to Austin, President Obama stopped by Capital Factory, an incubator for technology startups, where he learned about Austin’s technology community and was introduced to Infochimps. Want to move to Austin?

Come Work With Us >>



A Chimpy Movember

This November, the chimps participated in Movember, the moustache growing charity event held each November that raises awareness and funds for men’s health. There was an objective, rules, and winners – the makings of a friendly competition while building company culture for a worthy cause.

The Objective: To raise money for men’s health through growing facial hair, asking friends and family for donations, and by joining together with other chimps in camaraderie and some good-hearted revelry.

The Rules:
1. You do not have to begin the month with a clean shaven face.
2. You must maintain a moustache continuously from the 15th to the end of the month.
3. You must end the month with a moustache, but not necessarily the same moustache you started with.
4. A moustache is not a beard. For example: There is no joining of the moustache to the sideburns.
5. A moustache is not a goatee. There is no joining of the handlebars to the chin.
6. Other facial hair is permitted.
7. Category winners are determined on November 30 by consensus vote of the Mo Sistas.
8. Chimpiest Mo is determined on whatever criteria the Mo Sistas agree to on November 30.
9. Each Mo shall conduct themselves as true country gentlemen; each Mo Sista shall conduct themselves as true city ladies.

Category Winners:
– The Chimpiest Mo – Travis Dempsey
– The Lamest Mo (for the follically challenged) – Joe Kelly
– The Most Styled Mo – Flip Kromer
(Moustache Memorabilia was awarded to the winners.)

[Image: Infochimps Movember]

(From Left to Right: Mo Sistas, Winning Mos, Miami Vice Mos)

Go Infochimps! We are proud to support Movember, raising awareness and funds for men’s health.

Just because it’s December doesn’t mean you have to stop: you can support men’s health all year round. See the official Movember merchandise page for everything from posters to shoes, and like Movember USA on Facebook.

Infochimps Culture: New CEO, Bocce, Opa!

Yesterday we announced some exciting news. We welcomed Jim Kaskade as our new CEO.

We welcomed him to the team the Infochimps way: with Bocce.

What’s Bocce? Bocce is a ball sport popular around Europe that is traditionally played on ground courts between two teams. One side throws a smaller ball (the jack) from one end of the court into a zone at the far end. Each team then bowls its four balls, alternating turns and trying to land them closest to the jack.


That’s right, we’re cultured as well.

So where can you play Bocce in Austin? At Opa! Coffee and Wine Bar. Who knew?

After some competitive games of Bocce, the team relaxed with some good food and conversation.

See the whole team at Polvos! Jim (bottom left) is already in the chimpy spirit, showing off his Infochimps shirt.

Work hard, play hard: An Infochimps philosophy.

If you or anyone you know is interested in the Infochimps philosophy, we’re hiring! So if you know any Designers, Engineers, or Architects who are interested in working with a world-class team of friendly geniuses, send them our way.

