- September 29, 2011
My little league coaches drilled it into us outfielders: “It’s harder to run backward than forward, so stand where you think the batter will probably hit, and then take a few large steps back.” (Full disclosure: my little league career wasn’t so illustrious.)
A reader of my last blog post about Ichiro’s hit locations seems to disagree with my coaches:
“[Based on Ichiro's hit locations], this says to me that the traditional baseball positions are relatively optimal in terms of covering the field…”
To rehash, the June post, Clustering Baseball Data with Weka, gave an example of applying k-means clustering to Suzuki Ichiro’s 2006 x-y hit locations.
Left: The unprocessed hit locations—with outliers like home runs removed.
Right: The same set divided into six clusters using the k-means algorithm and the Euclidean distance. Centroids are red.
I was motivated by Flip Kromer’s advice to explore baseball with user-friendly modeling software like Weka. The intended theme was that data familiar from our daily lives, like baseball hit locations, could be classified or qualitatively described using machine learning. For the uninitiated, clustering can be thought of as an algorithm that breaks one large set into subsets of points that are similar to each other. Each of these subsets has a centroid, which can loosely be thought of as the center of the cluster. Hypothetically, a baseball coach could look at the above graph and immediately focus on each colored region as a zone to be defended (i.e., where to place their outfielders).
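For readers who want to see the mechanics, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The clusters in the figure came from Weka, so this is not the code behind them; the function name `kmeans`, the seeded initialization, and the toy points are assumptions made up for illustration.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm with Euclidean distance on 2-D points.

    Illustrative only: the post's clusters were produced by Weka.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    (sum(x for x, _ in members) / len(members),
                     sum(y for _, y in members) / len(members))
                )
            else:
                new_centroids.append(centroids[j])  # empty cluster stays put
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, labels


# Toy data (made up): two well-separated blobs of "hit locations" in feet.
hits = [(0, 0), (2, 1), (1, 2), (100, 100), (101, 102), (102, 100)]
centroids, labels = kmeans(hits, k=2)
print(sorted(centroids))
```

Note that the only notion of "closeness" in this sketch is the straight-line distance passed to `min`, which is exactly the assumption the rest of this post questions.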
Now let me return to addressing the comment. To paraphrase our commenter, he claims that the k-means centroid is the “relatively optimal” place to stand if a fielder wants to reach all points readily.
That would be true if the outfielder viewed batted balls the same way the clustering algorithm saw them. But as anyone who’s played a sport involving a moving ball will tell you, the effort needed to reach a given point on the field depends not just on its distance but also on its direction from the fielder’s starting position. Think about whether it is easier for an outfielder to catch a ball about to land 15 feet in front of him, or 15 feet behind him.
Consider a professional outfielder who runs straight ahead at his maximum speed. He can run approximately the same way toward a ball hit to his left or right, but significantly more slowly toward a ball hit over his head.
Consider the idealized outfielder who runs backward at 20 ft/s and forward at 27 ft/s. The ratio of forward to backward speed is approximately 1.35. The black line represents the farthest the idealized outfielder can reach in a given length of time. If the dot is the point (0,0), the left side is half of an ellipse with a = 20 ft/s and b = 27 ft/s, and the right side is half of a circle with radius 20 ft/s; multiplying by the running time gives the reach in feet. These speeds are based on the backward 50-yard dash record, the most reasonable data I found on backward running speed.
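The egg shape can be written as a simple membership test. The sketch below is one reading of the description above, not the code used for the figure: it assumes offsets are measured from the fielder, that the ellipse's 27 ft/s axis points forward, and that the 20 ft/s axis (shared with the circle) is sideways. The name `in_reach` and its parameters are hypothetical.

```python
import math

FORWARD_SPEED = 27.0   # ft/s, from the post
BACKWARD_SPEED = 20.0  # ft/s, from the post; assumed to apply sideways too


def in_reach(sideways, forward, seconds):
    """True if the idealized fielder can get to the given offset in time.

    Assumed convention: `forward` > 0 is in front of the fielder,
    `forward` < 0 is behind him, and `sideways` is left or right, in feet.
    """
    if seconds <= 0:
        return sideways == 0 and forward == 0
    if forward >= 0:
        # Front half: ellipse with semi-axes 20*t sideways and 27*t forward.
        a = BACKWARD_SPEED * seconds
        b = FORWARD_SPEED * seconds
        return (sideways / a) ** 2 + (forward / b) ** 2 <= 1.0
    # Back half: circle of radius 20*t.
    return math.hypot(sideways, forward) <= BACKWARD_SPEED * seconds


print(in_reach(0, 15, 0.6))   # True: 27 * 0.6 = 16.2 ft of forward reach
print(in_reach(0, -15, 0.6))  # False: only 20 * 0.6 = 12 ft of backward reach
```

The two calls are the "15 feet in front or 15 feet behind" question from earlier: in the same 0.6 seconds, this model fielder gets to one ball and not the other.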
So perhaps an outfielder’s ability to catch balls depends on the reach of this egg-shaped perimeter. (If someone wrote a clustering program with the egg-distance, we might have a better solution to begin with.)
Consider the set of 106 left field points, plotted in light green in the above graph, as being assigned just to the left fielder. The figure below illustrates how many points fall into the idealized fielder’s reach if he runs for various lengths of time.
I ran a simulation to try to understand how many points the idealized fielder could reach if he started somewhere other than the centroid. Given a length of running-time, it searched for the position allowing him to reach the greatest number of points. The results are illustrated below.
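A search like this can be sketched as a brute-force scan over candidate starting positions. This is an assumption about the approach, not the original simulation: the grid search, the 1-foot step, the function names, and the toy landing spots are all made up, and the fielder is assumed to face the negative y direction (toward home plate).

```python
import math

FORWARD_SPEED = 27.0   # ft/s, from the post
BACKWARD_SPEED = 20.0  # ft/s, from the post; assumed to apply sideways too


def time_to_reach(start, ball):
    """Seconds for the idealized fielder at `start` to get to `ball`.

    Assumed convention: smaller y is closer to home plate, so a ball with
    smaller y than the fielder is in front of him.
    """
    dx = ball[0] - start[0]
    dy = ball[1] - start[1]
    if dy <= 0:  # ball in front: elliptical reach
        return math.hypot(dx / BACKWARD_SPEED, dy / FORWARD_SPEED)
    return math.hypot(dx, dy) / BACKWARD_SPEED  # ball behind: circular reach


def count_reached(start, balls, seconds):
    """Number of landing spots reachable from `start` within `seconds`."""
    return sum(time_to_reach(start, b) <= seconds for b in balls)


def best_start(balls, seconds, step=1.0):
    """Scan a grid over the cluster's bounding box for the best start."""
    xs = [b[0] for b in balls]
    ys = [b[1] for b in balls]
    best, best_count = None, -1
    x = min(xs)
    while x <= max(xs):
        y = min(ys)
        while y <= max(ys):
            n = count_reached((x, y), balls, seconds)
            if n > best_count:
                best, best_count = (x, y), n
            y += step
        x += step
    return best, best_count


# Toy landing spots in feet (made up, not Ichiro's data).
balls = [(0, 0), (0, 10), (0, 20), (5, 15), (-5, 15), (0, 40)]
spot, n = best_start(balls, seconds=0.5)
print(spot, n)
```

With the real left-field points in place of `balls`, the same call could be repeated for each running time of interest, though the grid step would limit how precisely the best spot is located.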
From a perturbed position, the fielder could reach 54 different points in 0.5 sec. From another perturbed position, he could reach 95 points in 1.0 sec. He reached all 106 points in 1.5 sec.
It seems as though the number of points reached by the idealized fielder in 0.5 sec or 1.0 sec could not be improved greatly by moving him off the centroid. But by moving significantly back from the centroid, the fielder could reach the whole cluster in 0.6 sec less than if he were standing on the centroid. For those hard-to-reach points at the edge of the cluster, those 0.6 extra seconds could come in very handy to a real outfielder trying to make the big catch.
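Why stepping back helps with the hardest points can be seen in a one-dimensional simplification: the time to cover everything is set by whichever extreme takes longer to reach, so the best spot balances the two. The sketch below assumes balls land only along a single line running away from home plate; the 94-foot span and the function names are made up, and the numbers are not meant to reproduce the 0.6 sec figure above.

```python
FORWARD_SPEED = 27.0   # ft/s, from the post
BACKWARD_SPEED = 20.0  # ft/s, from the post


def cover_time(position, shallowest, deepest):
    """Seconds to reach the harder of the two extremes from `position`.

    All arguments are depths in feet along one line, with
    shallowest <= position <= deepest. The fielder runs forward to the
    shallow end and backward to the deep end.
    """
    return max((position - shallowest) / FORWARD_SPEED,
               (deepest - position) / BACKWARD_SPEED)


def balanced_position(shallowest, deepest):
    """Depth at which the forward and backward run times are equal."""
    span = deepest - shallowest
    return shallowest + span * FORWARD_SPEED / (FORWARD_SPEED + BACKWARD_SPEED)


middle = cover_time(47.0, 0.0, 94.0)        # 2.35 s, limited by the backward run
spot = balanced_position(0.0, 94.0)         # 54.0 ft, i.e. 7 ft behind the middle
stepped_back = cover_time(spot, 0.0, 94.0)  # 2.0 s, both runs take equally long
print(middle, spot, stepped_back)
```

In this toy case, standing 7 feet behind the midpoint saves 0.35 seconds on the worst-case ball, at the cost of a slightly longer run to the shallowest one.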
Rigorously optimizing the outfielder’s starting position is complex and probably beyond the scope of the purely spatial information given by the MLB records. We don’t know the time elapsed between contact with the bat and the ball landing in the field, or the angle at which it approached the ground. That the optimal positions for a 0.5 second reach (the orange dot) and the 1.5 second reach (the green dot) are far apart suggests that the characteristics of the outfielder are extremely important when choosing his or her starting position. This becomes even more important with real fielders, who run more slowly than our idealized one and need time to judge the direction of the ball.
I can only hope that because of this article, a new generation of little league outfielders will be told over and over: “Optimize your fielding reach by stepping back from the centroid!”