Clustering Baseball Data with Weka

This is a guest blog post from Peter Hauck, who works as a data analyst at Google. His experience includes employee compensation optimization and dynamic pricing of live event tickets. He is a graduate of Cornell University with a B.A. in Mathematics and Physics and an M.Eng. in Applied Physics.

Greetings sports fans and data nerds! Since 2004, Major League Baseball has published (x,y) “hit locations” for every at bat, and for years sabermetric and actuarial analysts have turned this and other data into predictions of where individual sluggers will hit in the future. In hopes of optimally positioning players in the field, professional teams and sports commentators pay handsomely for this kind of forecasting. The models I’ve seen use data binning together with statistical and probabilistic models to get these results.

In a twist, I used the GUI software Weka to apply k-means clustering to the 2006 (x,y) hit locations of Ichiro Suzuki, the single-season hits record holder. For readers not familiar, clustering is a computational method of splitting a dataset into neighborhoods of similar points. Unlike most clustering work, using Weka avoids real programming; once the data was loaded into Weka, the computation required about ten mouse clicks. I think of this as a semi-scientific, exploratory method that offers quick insight and often reliable conclusions.
[Figure: scatter plot of hit locations]

Above is a scatter plot of Ichiro’s 2006 (x,y) hit locations. The focus is on slap hits and line drives, the hits that make Ichiro such a formidable batter. Home runs, pop outs, and bunt ground outs have been dropped from the dataset.

[Figure: Weka cluster output, cropped]

The right number of clusters for a dataset is often a point of argument, but I chose six on intuition. The output above shows the (x,y) locations of the centroids of the six clusters I requested.

[Figure: hit locations with the six centroids]

The red dots mark the centroids generated by Weka. My favorite Weka feature is the data cursor. As shown below, clicking on a scatter plot point pulls up a window with its complete set of attributes.

[Figure: Weka scatter plot with the data cursor window open]

This allows users to recognize patterns within clusters and classify them using human insight. By clicking on the data points next to the centroid over the pitcher’s mound, I quickly surmised that they were mostly ground outs.

Although I haven’t tried it, I could imagine clustering interesting subsets, such as hit locations for at bats against left-handed closers. One drawback of my installation of Weka 3.6.0 was that it could not add cluster assignments back into the dataset; the documentation suggests that this can be done in the command-line version. Also, the Levenshtein (or edit) distance, used for comparing text strings, wasn’t included in the Explorer GUI at the time of this post. I also couldn’t find a means of computing the silhouette coefficient, a measure of how well a cluster analysis separates the data. However, none of this will keep me from using Weka for a quick and lazy look at numerical data in the future.
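For readers curious about the silhouette coefficient, it can be computed outside Weka once every point has a cluster label. The following is a minimal Python sketch, not anything Weka provides: the function name `silhouette_coefficient` and the toy points are made up for illustration, and plain Euclidean distance is assumed. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the point's score is (b - a) / max(a, b).

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (ranges from -1 to 1)."""
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Convention: a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two tight pairs of points, far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_coefficient(toy_points, [0, 0, 1, 1])
bad = silhouette_coefficient(toy_points, [0, 1, 0, 1])
print(good, bad)
```

Values near 1 suggest compact, well-separated clusters; values near 0 or below suggest that points sit between clusters or have been assigned to the wrong one.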

How I Made the Graphs

I assembled the 2006 Ichiro Suzuki hit location dataset using methods from Joseph Adler’s Baseball Hacks, specifically hacks #27, #28, and #29. While some of the spidering scripts can be downloaded directly from the O’Reilly website, I strongly suggest using the book because you will almost certainly need to edit the scripts. After using Adler’s spider methods to download the 2006 play-by-play files, I used a version of his parser and matchup scripts to create an SQL database of hitters. From there it was simple to select the set of Ichiro’s at bats and save them to a .csv file.

The baseball diamond scatter plots were created in R.  For inspiration, see hack #37.  In R, I used the subset() function to exclude home runs, pop-outs, etc.
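For readers who don't use R, the same kind of filtering can be sketched in Python. This is only an analogue of the subset() step, not the code used for this post: the column names `x`, `y`, and `event`, the event labels, and the sample rows are all hypothetical, since the real play-by-play files use their own codes.

```python
import csv
import io

# Hypothetical event labels; the real play-by-play codes may differ.
EXCLUDED_EVENTS = {"Home Run", "Pop Out", "Bunt Groundout"}


def keep_hit_locations(csv_file, excluded=EXCLUDED_EVENTS):
    """Return (x, y) pairs for rows whose event is not in `excluded`."""
    reader = csv.DictReader(csv_file)
    return [
        (float(row["x"]), float(row["y"]))
        for row in reader
        if row["event"] not in excluded
    ]


# Made-up rows standing in for the exported .csv file.
sample = io.StringIO(
    "x,y,event\n"
    "101.4,150.2,Single\n"
    "35.1,60.8,Home Run\n"
    "120.0,170.5,Groundout\n"
)
points = keep_hit_locations(sample)
print(points)
```

With a real file, `open("some_file.csv", newline="")` would take the place of the `io.StringIO` object.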

The Weka Explorer is fairly intuitive. In the Preprocess tab, I left the data unfiltered. In the Cluster tab, I chose SimpleKMeans clustering with six centroids and set it to ignore all data attributes other than x and y. The multicolored scatter plots were generated by the Visualize tab.
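For anyone wondering what those mouse clicks compute, here is a minimal Python sketch of plain k-means (Lloyd's algorithm) on (x,y) pairs. It is an illustration under my own assumptions, not Weka's SimpleKMeans: the function name `kmeans`, its parameters, the random choice of starting centroids, and the toy points are all hypothetical, and Weka's defaults (for example, how it initializes centroids or scales attributes) may differ.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's-algorithm k-means on a list of (x, y) tuples."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    (sum(p[0] for p in members) / len(members),
                     sum(p[1] for p in members) / len(members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        # Stop once neither the labels nor the centroids change.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return centroids, labels


# Toy data (hypothetical): two small groups of points, far apart.
toy_points = [(0, 0), (1, 2), (2, 1), (100, 50), (101, 52), (102, 51)]
centroids, labels = kmeans(toy_points, k=2)
# Pair each point with its cluster label, e.g. for writing back to a CSV.
labelled_rows = [(x, y, lab) for (x, y), lab in zip(toy_points, labels)]
print(sorted(centroids))
print(labelled_rows)
```

For the hit location data, the call would be `kmeans(points, k=6)` with `points` holding only the x and y columns. Because k-means can settle into different solutions depending on its starting centroids, it is worth trying a few values of `seed` on real data.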


  1. Vijay September 19, 2012 at 9:43 am

    Nice post. Have you looked into Mahout? I am trying to see how well Weka scales to extremely large data sets compared with the scalability that Mahout promises.
