Baseball SeRies: Getting Innings Pitched from Retrosheet with R.

Innings Pitched (IP) is one of the most important statistics used in baseball to measure a pitcher’s performance. As its name suggests, IP represents the total number of innings a pitcher has pitched, whether in a single game, over a season, during his stay on a team or throughout his career. The stat can also be applied at the team, league or season level.

Because a half-inning consists of three outs, 1 out represents 1/3 of a half-inning. The IP metric can therefore be calculated for any pitcher by converting the total number of batters and baserunners that were put out while he was on the mound into innings. IP can be written mathematically as:

IP = Quotient( O / 3 ) + Remainder( O / 0.3 )

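Where Quotient is the number of complete innings pitched and Remainder stands for the spare outs, written as tenths: one leftover out adds 0.1 and two leftover outs add 0.2. The digit after the decimal point is therefore a count of outs, not a decimal fraction of an inning. Confusing? I’m sure it is, so let’s dive into some examples so that you can understand this better: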

  • On the night of June 29th 1990, Fernando «El Toro» Valenzuela (LAD) and Dave Stewart (OAK) each pitched a no-hitter. Each of them recorded 27 outs that day; in other words, both pitched 9 innings.
  • In that same year, the Cincinnati Reds managed to put out 4369 offensive players. They became World Champions after playing only 1456.1 defensive innings (1456 complete innings plus 1 out).
  • 112691 baserunners and batters were sent back to the dugout during the 1990 season. In total, 37563.2 innings (37563 complete innings plus 2 outs) were pitched that year.
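
The three examples above can be reproduced with a few lines of R. This is only a minimal sketch: outs_to_ip is a hypothetical helper name, not part of any package. It writes the spare outs as ( outs %% 3 ) / 10, which gives the same value as the Remainder( O / 0.3 ) term but keeps the modulo on whole numbers.

```r
# Hypothetical helper: convert a count of outs into baseball's IP notation,
# where the digit after the decimal point is the number of spare outs (0, 1 or 2).
outs_to_ip <- function( outs ) {
  outs %/% 3 + ( outs %% 3 ) / 10
}

# Out totals taken from the three examples above
outs_to_ip( c( 27, 4369, 112691 ) )
```

The call should give 9, 1456.1 and 37563.2, matching the figures quoted in the list.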

The Code

Throughout this series I’ll be mining several baseball data banks (e.g. Retrosheet, Lahman Database), but whatever the data source is, I will always make use of the data.table and dplyr packages for the sake of simplicity, comprehensibility and execution performance.

For this exercise, I’ll be using Retrosheet’s 1990 event and game files, which you can download from here. Assuming that the event and game files are located in the current R working directory as all1990.csv and games1990.csv, and that the data.table and dplyr packages are already installed, we are ready to look into the code:

library( package = 'dplyr' )
library( package = 'data.table' )
# Create vectors that will store column classes for the game and event files.
# A class of 'NULL' makes fread skip that column, so only the named columns are loaded.
l_e_cols <- c( 'character' # GAME_ID
             , rep( x = 'NULL', times = 13 )
             , 'character' # PIT_ID
             , rep( x = 'NULL', times = 25 )
             , 'numeric' # EVENT_OUTS_CT
             , rep( x = 'NULL', times = 58 )
             , 'character' # FLD_TEAM_ID
             , rep( x = 'NULL', times = 58 )
             )
l_g_cols <- c( rep( x = 'NULL', times = 8 )
             , 'character' # HOME_TEAM_ID
             , rep( x = 'NULL', times = 76 )
             , 'character' # HOME_TEAM_LEAGUE_ID
             , rep( x = 'NULL', times = 93 )
             )
# Create vectors that will store column names for the game and event files
l_e_names <- c( 'GAME_ID', 'PIT_ID', 'EVENT_OUTS', 'TEAM_ID' )
l_g_names <- c( 'TEAM_ID','LEAGUE_ID' )
# Load the 1990 season event file into the environment
d_e_1990 <- fread( input = 'all1990.csv'
                 , sep = ','
                 , header = TRUE
                 , colClasses = l_e_cols
                 , col.names = l_e_names
                 )
# Load the 1990 season game file into the environment
d_g_1990 <- fread( input = 'games1990.csv'
                 , sep = ','
                 , header = TRUE
                 , colClasses = l_g_cols
                 , col.names = l_g_names
                 )
# The game dataset ( d_g_1990 ) repeats each home team / league pair many times,
# so keep only the unique pairs to get one league per team
d_g_1990 <- distinct( .data = d_g_1990, TEAM_ID, LEAGUE_ID )
# Attach the league of the fielding team to every event
d_1990 <- inner_join( x = d_e_1990, y = d_g_1990, by = 'TEAM_ID' )
# In every block below, O is the total number of outs and IP is written in
# baseball notation. ( O %% 3 ) / 10 gives the same value as O %% 0.3 but keeps
# the modulo on whole numbers, avoiding floating-point representation error.
# Get IP by every pitcher for every game played in the 1990 season.
d_g_ip <- group_by( .data = d_1990, GAME_ID, PIT_ID ) %>%
  summarise( O = sum( x = EVENT_OUTS, na.rm = TRUE ) ) %>%
  mutate( IP = O %/% 3 + ( O %% 3 ) / 10 )
# Get IP by every pitcher in the 1990 season.
d_s_ip <- group_by( .data = d_1990, PIT_ID ) %>%
  summarise( O = sum( x = EVENT_OUTS, na.rm = TRUE ) ) %>%
  mutate( IP = O %/% 3 + ( O %% 3 ) / 10 )
# Get IP by every team in the 1990 season.
d_t_ip <- group_by( .data = d_1990, TEAM_ID, LEAGUE_ID ) %>%
  summarise( O = sum( x = EVENT_OUTS, na.rm = TRUE ) ) %>%
  mutate( IP = O %/% 3 + ( O %% 3 ) / 10 )
# Get IP by the AL and NL leagues in the 1990 season.
d_l_ip <- group_by( .data = d_1990, LEAGUE_ID ) %>%
  summarise( O = sum( x = EVENT_OUTS, na.rm = TRUE ) ) %>%
  mutate( IP = O %/% 3 + ( O %% 3 ) / 10 )
# Get the total of IP in the 1990 season
d_mlb_ip <- mutate( .data = d_1990, YEAR = 1990 ) %>%
  group_by( YEAR ) %>%
  summarise( O = sum( x = EVENT_OUTS, na.rm = TRUE ) ) %>%
  mutate( IP = O %/% 3 + ( O %% 3 ) / 10 )
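
If you want to check the aggregation logic before downloading any files, the same group, sum and convert steps can be run on a small hand-made table. The sketch below is only an illustration: the rows in d_toy are made up and are not Retrosheet records, and the names d_toy and d_toy_ip are hypothetical. It writes the spare outs as ( O %% 3 ) / 10, which gives the same value as O %% 0.3.

```r
library( package = 'dplyr' )

# Made-up events for illustration only; these are not Retrosheet records
d_toy <- data.frame( GAME_ID = c( 'G1', 'G1', 'G1', 'G1', 'G2', 'G2' )
                   , PIT_ID = c( 'p1', 'p1', 'p2', 'p2', 'p1', 'p1' )
                   , EVENT_OUTS = c( 1, 2, 1, 0, 2, 2 )
                   , stringsAsFactors = FALSE
                   )

# Same steps as the per-pitcher season total: group, sum the outs, convert to IP
d_toy_ip <- group_by( .data = d_toy, PIT_ID ) %>%
  summarise( O = sum( EVENT_OUTS, na.rm = TRUE ) ) %>%
  mutate( IP = O %/% 3 + ( O %% 3 ) / 10 )

d_toy_ip
```

Pitcher p1 records 7 outs across the two games, which should come out as 2.1 IP, while p2 records a single out, or 0.1 IP.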