The most common error I have encountered among new data science practitioners is forgetting that the goal is not simply knowledge, but actionable insight. This isn’t limited to data scientists. Many analysts get carried away with the wrong metrics, tracking what is easy to measure rather than what is correct to measure. New data scientists get carried away with the latest statistical method or machine learning algorithm, because that’s much more fun than acknowledging that key data are missing.
To create actionable insight, we must start from the action, a choice. Data science is useless if it is not used to make decisions. When starting a project, I first ask how we will measure our progress towards our goals. As my colleague Morgan said last week, this often boils down to revenue, cost, and risk. An economist might bundle that up as time-discounted risk-adjusted future profits. My second task is identifying what decisions we will make in the process of accomplishing these goals.
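One way to make "time-discounted risk-adjusted future profits" concrete is the minimal sketch below. It makes simplifying assumptions that go beyond the text above: risk is modeled only as a probability that each year's cash flow actually arrives, and a single annual discount rate applies. The function name, parameters, and numbers are hypothetical and chosen only for illustration.

```python
def discounted_expected_profit(cash_flows, arrival_probs, discount_rate):
    """Sum probability-weighted yearly cash flows, each discounted to today.

    cash_flows[i] is the profit hoped for at the end of year i + 1.
    arrival_probs[i] is the assumed chance that cash flow materializes.
    discount_rate is an assumed annual rate, e.g. 0.10 for 10 percent.
    """
    if len(cash_flows) != len(arrival_probs):
        raise ValueError("cash_flows and arrival_probs must have equal length")
    total = 0.0
    for year, (cash, prob) in enumerate(zip(cash_flows, arrival_probs), start=1):
        # Weight by probability (risk), then divide by growth factor (time).
        total += prob * cash / (1 + discount_rate) ** year
    return total


# Hypothetical project: a certain 110 next year, a coin-flip 121 the year after.
value_today = discounted_expected_profit([110, 121], [1.0, 0.5], 0.10)
print(f"{value_today:.2f}")
```

With these made-up inputs, the first year contributes about 100 and the second about 50, so the project is worth roughly 150 today. A real risk adjustment would usually also account for how much the decision maker dislikes variance, which this sketch ignores.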
The choices we make might be between different types of actions or between different intensities of an action: which advertising campaign to run, how much to spend, and so on. Most of these choices benefit from information, though some, such as selecting “red” or “black” at the roulette table, do not. Even when information helps, the outcome depends on it only partially. Knowledge gives us power, but there is some randomness too. We might have hundreds of observations of every American’s response to our spokesperson’s call to action, but the predictive model we generate from that data might not help us after the spokesperson’s embarrassing incident at the golf course. The business case for data science is the estimation of how much information we can gain from our data and how much that information will improve the time-discounted, risk-adjusted benefit of our decisions.
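The business case described above can be roughed out with a standard decision-analysis quantity, the expected value of perfect information: the gap between the payoff we expect when choosing blind and the payoff we would expect if we knew the state of the world before choosing. The sketch below is an illustration under stated assumptions, not a method taken from this article. The campaign names, customer states, probabilities, and payoffs are all invented.

```python
def expected_value_of_perfect_information(prior, payoffs):
    """Upper bound on what better information could add to a decision.

    prior maps each state of the world to its assumed probability.
    payoffs maps each action to a dict of state -> payoff.
    """
    # Without information: commit to the single action with the best average.
    best_blind = max(
        sum(prior[state] * outcome[state] for state in prior)
        for outcome in payoffs.values()
    )
    # With perfect information: pick the best action separately in each state.
    best_informed = sum(
        prior[state] * max(outcome[state] for outcome in payoffs.values())
        for state in prior
    )
    return best_informed - best_blind


# Hypothetical choice between two advertising campaigns.
prior = {"prefers_a": 0.6, "prefers_b": 0.4}
payoffs = {
    "campaign_a": {"prefers_a": 100, "prefers_b": 20},
    "campaign_b": {"prefers_a": 30, "prefers_b": 80},
}
print(expected_value_of_perfect_information(prior, payoffs))
```

In this made-up case the blind choice is campaign A with an expected payoff of 68, the fully informed choice averages 92, and so no model, however good, could be worth more than about 24 to this decision. Real data yields imperfect information, so the achievable gain is smaller, and when one action is best in every state the value is zero.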
The third task is picking what metrics to use. A management consultant might call this developing key performance indicators. A statistician might call this variable selection. A machine learning practitioner might call this feature engineering. We transform, combine, filter, and aggregate our data in clever and complex ways. Most critical is picking a good dependent variable, or explained variable. This is the metric we are predicting, the distillation of all our knowledge into a single number.
To pick a good dependent variable, a data scientist must consider the quality of the data available and what predictions they might support, but more importantly, the data scientist must consider the decision our prediction is meant to improve. When choosing whether to eat outside for lunch, we prefer to know the temperature at noon rather than the average temperature for the day. More important would be the chance of rain. The exact temperature to the fraction of a degree is unnecessary. Best of all would be a direct estimate of lunchtime happiness outside versus inside, reduced to a simple answer: “Yes, go outside” or “No, stay inside.” Unfortunately, we often cannot pick the most directly representative variable, because it is too difficult to measure. Lunchtime surveys would be expensive to conduct and self-reported happiness might be unreliable. A good dependent variable balances predictive power with decision relevance.
After we have built a great predictive model, the last step is figuring out how to operationalize the knowledge we gained. This is where the data science stops and the traditional engineering, or big data engineering, starts. No matter how great our product recommendations are, they are useless if we do not share those recommendations with the customer in a timely manner. In large enterprises, operationalizing insights often requires complex coordination across teams and business units, a problem as hard as the data science itself. Keeping this operational step in mind from the start of the project helps ensure the data science has business value.
Michael Selik is a data scientist at Infochimps. Over his career, he has worked for major enterprises and venture-backed startups, delivering sophisticated analysis and technology project management services ranging from hyperlocal demographics inference to market share forecasting. With Infochimps, Michael helps organizations deploy fast, scalable data services. He received an MS in Economics, a BS in Computer Science, and a BS in International Affairs from the Georgia Institute of Technology; he likes bicycles and semi-colons.
Image Source: blog.cmbinfo.com