Data Science: State of the Industry

O’Reilly has released their 2013 Data Science Salary Survey, and it’s a treasure trove of interesting information about the work of data science.

One of the most informative things I found was a breakdown of the data tools that were used most often by data scientists.

This confirms a lot of hunches about the state of the industry:

  • SQL is the mack daddy of data science. It is used literally twice as much as Hadoop.

  • Excel and R are the analysis tools of choice. Since both of these tools can do multiple things (analysis and visualization), it makes sense that these would be more popular than single-use tools.

  • Scripting is widespread and diverse. Python, R, JavaScript, and Ruby are the glue of data science, with an especially strong showing for Python.

The big surprise to me was the relative unpopularity of SAS/SPSS. I think this effect may be exaggerated by the nature of the survey population (it was limited to people attending the Strata conference). However, a 4x disparity between R and the legacy vendors really highlights what I see as an accelerating trend toward open tools.

Another fascinating visualization was the breakdown of how different tools are used together by data scientists.

In geek speak, this is a graph that describes the positive and negative correlations between tool usage. Visually, this separates into the traditional I/T world (in blue) and the new Hadoop world (in orange). “Visualization” might be a way to describe the red cluster, although Weka really breaks the mold.
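To make the idea of correlated tool usage concrete, here is a minimal sketch of how such pairwise correlations could be computed. The `responses` data and the function names are made up for illustration; they are not taken from the O’Reilly survey, and the survey authors may well have used a different method. The sketch treats each tool as a 0/1 indicator per respondent and computes the Pearson correlation between every pair of indicators.

```python
from itertools import combinations
from math import sqrt

# Hypothetical responses: the set of tools each respondent reports using.
# These rows are invented for illustration, not real survey data.
responses = [
    {"SQL", "Excel", "R"},
    {"SQL", "Excel"},
    {"Hadoop", "Hive", "Python"},
    {"Hadoop", "Python"},
    {"SQL", "R", "Python"},
    {"Hive", "Hadoop"},
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a tool used by everyone or no one carries no signal
    return cov / sqrt(var_x * var_y)


def tool_correlations(responses):
    """Map each alphabetically ordered tool pair to its usage correlation."""
    tools = sorted(set().union(*responses))
    # One 0/1 column per tool: 1 if the respondent uses it, else 0.
    indicators = {t: [1 if t in r else 0 for r in responses] for t in tools}
    return {
        (a, b): pearson(indicators[a], indicators[b])
        for a, b in combinations(tools, 2)
    }


correlations = tool_correlations(responses)
for (a, b), value in sorted(correlations.items(), key=lambda kv: kv[1]):
    print(f"{a:>7} / {b:<7} {value:+.2f}")
```

Pairs with a positive value tend to be used by the same people, and pairs with a negative value tend to belong to different camps, which is what the clusters in the picture are showing.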

What this tells me is that there is a definite geography to the work of data science. If traditional I/T is North America and Hadoop is South America, Tableau would be the Panama Canal, the conduit between the two continents. This picture also makes it easy to see why SQL is so popular. Like Starbucks, SQL is everywhere: there’s at least one SQL-like tool in each of the clusters (Hive, MySQL, PostgreSQL, SQL, and SQL Server), with more on the way soon.

Looking at the big picture, this tells us three important things:

  1. Data science can come from anywhere. Innovation does not require the resources of the Fortune 500, nor the specialization of Silicon Valley. The work can leverage the strengths of either environment, and the best people can work anywhere.

  2. Virtually any company either already has or can inexpensively acquire the tools to do data science. If you can download RStudio and have a SQL database, you can start working like the pros.

  3. Data science isn’t thinking about real-time analytics yet. Storm, Spark, and other tools are still cutting edge. Watch out for this in the 2014 survey.
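As a rough illustration of point 2, here is a small sketch of the "SQL database plus a scripting tool" workflow. It swaps in Python and its built-in `sqlite3` module for RStudio, and the `orders` table, its columns, and the sample rows are all hypothetical, so treat it as one possible starting point and not a recommended setup.

```python
import sqlite3


def summarize_by_region(conn):
    """Return (region, order count, average amount) rows, sorted by region."""
    query = """
        SELECT region, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
        FROM orders
        GROUP BY region
        ORDER BY region
    """
    return conn.execute(query).fetchall()


# An in-memory database stands in for whatever SQL database you already have.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0)],  # made-up sample rows
)

summary = summarize_by_region(conn)
for region, n_orders, avg_amount in summary:
    print(f"{region}: {n_orders} orders, average {avg_amount:.2f}")
```

The database does the grouping and aggregation, and the script only handles the results, which is the same division of labor you would get with R or Excel on top of an existing SQL store.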

Thanks, O’Reilly, for the insight into data science and data scientists!

Dhruv Bansal is the chief science officer and co-founder of Infochimps, a CSC Big Data Business. He holds a B.A. in math and physics from Columbia University in New York and attended graduate school for physics at The University of Texas at Austin. For more information, email Dhruv at dhruv@infochimps.com or follow him on Twitter at @dhruvbansal.

Image source: strata.oreilly.com




