This worksheet is to be completed in class as a group. Take a minute to have everyone introduce themselves and then choose who will play the role of recorder and reporter. The recorder is responsible for completing this worksheet and emailing the .Rmd file to your professor with the subject “Stat 21: Week 7 Worksheet”. Make sure everyone’s name is included at the top of this document! The reporter is responsible for taking note of your answers and any questions that come up to share them with Prof. Suzy or the Muse when they stop by to check in on your group. Other groups members are responsible for keeping the discussion on-task and providing possible solutions or sharing questions with the group.

For this assignment, I suggest that you choose the recorder to be whoever last ate cereal for breakfast and choose the reporter to be whoever hasn’t had cereal for breakfast in the longest time (or ever).


The data set called MBLStandings2016 contains many different variables for Major League Baseball teams from the 2016 regular season. We are going to explore a few different models for the response variable WinPct, the winning percentages for each of the 30 teams. Each group has been instructed to fit a particular type of model. Please complete the following four modeling steps for your assigned model.

library(Stat2Data)
data("MLBStandings2016")
head(MLBStandings2016)
##                   Team League Wins Losses WinPct BattingAverage Runs Hits  HR
## 1 Arizona Diamondbacks     NL   69     93  0.426          0.261  752 1479 190
## 2       Atlanta Braves     NL   68     93  0.422          0.255  649 1404 122
## 3    Baltimore Orioles     AL   89     73  0.549          0.256  744 1413 253
## 4       Boston Red Sox     AL   93     69  0.574          0.282  878 1598 208
## 5         Chicago Cubs     NL  103     58  0.640          0.256  808 1409 199
## 6    Chicago White Sox     AL   78     84  0.481          0.257  686 1428 168
##   Doubles Triples RBI  SB   OBP   SLG  ERA HitsAllowed Walks StrikeOuts Saves
## 1     285      56 709 137 0.320 0.432 5.09        1563   603       1318    31
## 2     295      27 615  75 0.321 0.384 4.51        1414   547       1227    39
## 3     265       6 710  19 0.317 0.443 4.22        1408   545       1248    54
## 4     343      25 836  83 0.348 0.461 4.00        1342   490       1362    43
## 5     293      30 767  66 0.343 0.429 3.15        1125   495       1441    38
## 6     277      33 656  77 0.317 0.410 4.10        1422   521       1270    43
##    WHIP
## 1 1.492
## 2 1.355
## 3 1.364
## 4 1.273
## 5 1.110
## 6 1.343

We are going to include the categorical predictor League in each model we consider in class. We need to verify that R understands this to be a categorical variable by checking that the type of programming variable is factor or fct.

MLBStandings2016$League %>% class 
## [1] "factor"

Since this column is already a factor type, we don’t have to do anything else and we can include this predictor variable in our model in the usual way.

If the output from the above code chunk was anything besides factor, then we would have to use the mutate function to add on a new column to our data set that is specifically a factor column. For example:

MLBStandings2016_edit <- MLBStandings2016 %>% mutate(League_fct = factor(League))
MLBStandings2016_edit %>% head 
##                   Team League Wins Losses WinPct BattingAverage Runs Hits  HR
## 1 Arizona Diamondbacks     NL   69     93  0.426          0.261  752 1479 190
## 2       Atlanta Braves     NL   68     93  0.422          0.255  649 1404 122
## 3    Baltimore Orioles     AL   89     73  0.549          0.256  744 1413 253
## 4       Boston Red Sox     AL   93     69  0.574          0.282  878 1598 208
## 5         Chicago Cubs     NL  103     58  0.640          0.256  808 1409 199
## 6    Chicago White Sox     AL   78     84  0.481          0.257  686 1428 168
##   Doubles Triples RBI  SB   OBP   SLG  ERA HitsAllowed Walks StrikeOuts Saves
## 1     285      56 709 137 0.320 0.432 5.09        1563   603       1318    31
## 2     295      27 615  75 0.321 0.384 4.51        1414   547       1227    39
## 3     265       6 710  19 0.317 0.443 4.22        1408   545       1248    54
## 4     343      25 836  83 0.348 0.461 4.00        1342   490       1362    43
## 5     293      30 767  66 0.343 0.429 3.15        1125   495       1441    38
## 6     277      33 656  77 0.317 0.410 4.10        1422   521       1270    43
##    WHIP League_fct
## 1 1.492         NL
## 2 1.355         NL
## 3 1.364         AL
## 4 1.273         AL
## 5 1.110         NL
## 6 1.343         AL

The R function I() is useful when fitting higher order polynomial terms. The model below includes only a fourth order polynomial terms (an no main effects):

mod_ex1 <- lm(WinPct ~ I(BattingAverage^4), MLBStandings2016)
mod_ex1 %>% summary
## 
## Call:
## lm(formula = WinPct ~ I(BattingAverage^4), data = MLBStandings2016)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.125866 -0.049218 -0.001581  0.045756  0.139646 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.36212    0.07582   4.776 5.12e-05 ***
## I(BattingAverage^4) 32.18552   17.48579   1.841   0.0763 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06361 on 28 degrees of freedom
## Multiple R-squared:  0.1079, Adjusted R-squared:  0.07608 
## F-statistic: 3.388 on 1 and 28 DF,  p-value: 0.07629

There are several ways to include interaction terms. I recommend using the : rather than * to indicate a term is the product of two others. The model below includes an interaction effect only (and no main effects).

mod_ex2 <- lm(WinPct ~ Runs:Hits, MLBStandings2016)
mod_ex2 %>% summary 
## 
## Call:
## lm(formula = WinPct ~ Runs:Hits, data = MLBStandings2016)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.134349 -0.039615  0.008667  0.041812  0.111690 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 2.468e-01  9.007e-02   2.740  0.01056 * 
## Runs:Hits   2.473e-07  8.731e-08   2.832  0.00847 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05938 on 28 degrees of freedom
## Multiple R-squared:  0.2227, Adjusted R-squared:  0.1949 
## F-statistic:  8.02 on 1 and 28 DF,  p-value: 0.008474

Step 1: Choose

Q1 What are the names of the two predictor variables we have chosen to use in our models?

[Write your answer to Q1 here.]

Q2 Write out the form for the regression model that you are fitting below. Make sure this model includes the error term!

[Write your answer to Q2 here.]

Step 2: Fit

Fit your model to predict win percentage below.

#regmod <- lm(?~?, data=?)
#regmod %>% summary

Q3: What is the estimated regression model for this data? (Make sure this model does NOT include an error term!)

[Write your answer to Q3 here.]

Step 3: Assess

Use the space below to assess your model. Summarize your assessment at the end of this section. In addition to assessing the regression model conditions, also look for influential data points and for evidence of multicollinearity. The code chunk below will help you to create a new data object that contains your model’s fitted values, residuals, standardized residuals.

#MLB_all <- MLBStandings2016 %>% mutate(fits = regmod$fitted.values,
#                                       resids = regmod$residuals,
#                                       sresids = rstandard(regmod))

The code outlines for various plots and assessments are provided below.

## Matrix of scatterplots 
library("GGally") ## Make sure the package "GGally" is installed before attempting to run this code! 
#ggpairs(MLBStandings2016[ ,c(?,?,?)], colour="gray20") ## replace the ?'s with the column numbers corresponding to your predictor variables
## Residual plot with points labeled

#ggplot(?, aes(x=?, y=?)) + 
#  geom_point() + 
#  labs(title="Residual plot",x="Fitted values", y="Standardized residuals") + 
#  geom_hline(yintercept=0)
## Normal quantile plot 

#ggplot(?, aes(sample=?)) + 
#  geom_qq() + 
#  geom_qq_line() + 
#  labs(title="Normal quantile plot", x="Theoretical quantiles", y="Standardized residual quantiles") 
## Variance inflation factors for each predictor 

library("car") ## Make sure the package "car" is installed before attempting to run this code! 
#regmod %>% vif

Q4 Summarize your model assessment here. Make sure you address all six regression conditions: linearity, zero mean, constant variance, independence, normality, and randomness.

[Write your answer to Q4 here.]

Step 4: Use

Q5 After hearing about the other groups’ models, which model do you think is most appropriate for this data? Write out the complete regression model for your group’s final choice and explain why it is favorable. Also, comment on whether or not this model should be used for estimation or prediction only or if it is also informative for inferential questions.

[Write your answer to Q5 here.]