The MLBAM Pitch by Pitch Files
MLBAM (MLB Advanced Media) is the technology provider for Major League Baseball, and it’s probably the biggest media company you’ve never heard of. Based in New York City, MLBAM develops and maintains live streaming platforms, designs products for several types of devices, builds digital marketing solutions, supports ticketing strategies and, most importantly, produces data files that record every event that took place in every game of a season.
Some of these files contain the (x,y) locations of the batted balls put in play, and others hold general information about the game itself. In this post I’ll show you how to extract data from the hit and game files and how to plot it in R. You can download the MLBAM files I used for this post from here.
Scraping the Data from MLBAM
Just for fun, I decided to get the data using Python and lxml. The code is pretty straightforward; assuming the script is stored in the same place as the hit and game files, it:
- Stores the paths to the game and hit files in lists.
- Loops through the files and scrapes the desired data from them via XPath.
- Saves the scraped data into a csv file.
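The three steps above can be sketched roughly as follows. This is a minimal illustration, not the original script: it uses the standard library's xml.etree.ElementTree (whose limited XPath support is enough here) instead of lxml so that it runs without extra dependencies, and it parses an inline sample instead of looping over files on disk. The element name `hip`, its attributes, and the sample values are assumptions about the hit file layout.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for one hit file; the real layout may differ.
SAMPLE_HIT_XML = """<hitchart>
  <hip des="Groundout" x="109.44" y="157.63" type="O" team="A" inning="1"/>
  <hip des="Home Run" x="60.24" y="38.15" type="H" team="H" inning="2"/>
  <hip des="Error" x="0.00" y="0.00" type="E" team="A" inning="3"/>
</hitchart>"""

# Attributes to keep from each batted ball (assumed names).
FIELDS = ["des", "x", "y", "team", "inning"]


def scrape_hits(xml_text):
    """Return one dict per <hip> element, keeping only the attributes in FIELDS."""
    root = ET.fromstring(xml_text)
    rows = []
    for hip in root.findall(".//hip"):  # XPath-style lookup of every batted ball
        rows.append({name: hip.get(name, "") for name in FIELDS})
    return rows


def write_csv(rows, handle):
    """Write the scraped rows to an open text handle as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = scrape_hits(SAMPLE_HIT_XML)
buffer = io.StringIO()  # stands in for open("hits.csv", "w", newline="")
write_csv(rows, buffer)
```

With real files, the same `scrape_hits` function would be called once per file path in the list, and the rows would be accumulated before writing.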
Once the data has been scraped from the MLBAM files, we are ready to do some quick analysis on it. Since the goal of this research is to plot the balls put in play by both teams, it’s probably a good idea to see what the batted balls in play look like in a Cartesian plane:
So at first glance we are able to see a couple of issues:
- There are some batted balls that hold an Error description and a (0,0) position in the plane.
- Balls that travelled a greater distance have a lower y value. In other words, home runs have a y value near zero, while bunted balls have a y value far from the x axis. This is a problem because, when the values get plotted, they give the impression that the baseball field is turned upside down (just like in the plot above).
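One way to handle both issues is sketched below in Python, with hypothetical sample values. It assumes that the unusable records are exactly those described as "Error" and located at (0,0), and that the field is turned right side up by subtracting each y from a maximum value; the post's actual fix lives in its R code and may differ in detail.

```python
Y_MAX = 250  # the value the post assigns to y_max


def clean_hits(hits, y_max=Y_MAX):
    """Drop error records at (0, 0) and flip y so that deep hits plot at the top."""
    cleaned = []
    for hit in hits:
        x, y = float(hit["x"]), float(hit["y"])
        # Assumption: an unusable record is described as "Error" and sits at the origin.
        if hit["des"] == "Error" and x == 0 and y == 0:
            continue
        # Reflect y around the horizontal axis so larger y means farther from home plate.
        cleaned.append({"des": hit["des"], "x": x, "y": y_max - y})
    return cleaned


# Hypothetical records in the raw coordinate system (home runs near y = 0).
sample = [
    {"des": "Home Run", "x": "60.24", "y": "38.15"},
    {"des": "Bunt Groundout", "x": "125.50", "y": "195.00"},
    {"des": "Error", "x": "0.00", "y": "0.00"},
]
cleaned = clean_hits(sample)
```

After the flip, the home run ends up with the larger y value and the bunt with the smaller one, which matches how a field is normally drawn.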
Creating the Plot
As I explained in the last section, the error and coordinate issues need to be solved before we can build any graphical representation of the batted balls in play. As you can see in the R code below, I fixed these problems in lines 24 and 27 respectively. Please note that I set y_max to 250 because this value fit the graph dimensions perfectly.
Furthermore, I drew the foul lines and the bases using the Pythagorean Theorem, knowing that the distance between consecutive bases is 90 ft and that the foul lines extend from home plate at a 45 degree angle. Please note that, just like the y_max variable, the f_len and b_len variables were also adapted to fit the graph.
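The geometry behind the foul lines and bases can be sketched like this. It is written in Python for illustration (the post's plot is drawn in R) and is not the original code: the function names and the home plate position are hypothetical, while b_len and f_len mirror the variables named above. Because the diamond is a square standing on one corner, each base is offset from home plate by b_len / sqrt(2) horizontally and vertically, which is the Pythagorean Theorem applied to a 45 degree right triangle.

```python
import math


def diamond_points(home_x, home_y, b_len):
    """Return the corners of the infield square, with home plate at the bottom."""
    d = b_len / math.sqrt(2)  # leg of a 45 degree right triangle with hypotenuse b_len
    return {
        "home": (home_x, home_y),
        "first": (home_x + d, home_y + d),
        "second": (home_x, home_y + 2 * d),
        "third": (home_x - d, home_y + d),
    }


def foul_line_ends(home_x, home_y, f_len):
    """Return the far end of each foul line, drawn at 45 degrees from home plate."""
    d = f_len / math.sqrt(2)
    return {
        "left": (home_x - d, home_y + d),
        "right": (home_x + d, home_y + d),
    }


# Hypothetical values: home plate at the origin, real-world base distance in feet.
bases = diamond_points(0.0, 0.0, 90.0)
fouls = foul_line_ends(0.0, 0.0, 300.0)
```

In a plot, b_len and f_len would be scaled to the chart's units rather than feet, which is why the post tunes them to fit the graph.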
World Series 2016
Here’s what the batted balls in play look like for the World Series that took place this year. Is there any pattern you can see in the balls batted by the Chicago Cubs at Progressive Field? Fly the W!