This worksheet is to be completed in class as a group. Take a minute to have everyone introduce themselves and then choose who will play the role of recorder and reporter. The recorder is responsible for completing this worksheet and emailing the .Rmd file to your professor with the subject “Stat 21: Week 7 Worksheet”. Make sure everyone’s name is included at the top of this document! The reporter is responsible for taking note of your answers and any questions that come up to share them with Prof. Suzy or the Muse when they stop by to check in on your group. Other groups members are responsible for keeping the discussion on-task and providing possible solutions or sharing questions with the group.
For this assignment, I suggest that you choose the recorder to be whoever last ate cereal for breakfast and choose the reporter to be whoever hasn’t had cereal for breakfast in the longest time (or ever).
The data set called MBLStandings2016
contains many
different variables for Major League Baseball teams from the 2016
regular season. We are going to explore a few different models for the
response variable WinPct
, the winning percentages for each
of the 30 teams. Each group has been instructed to fit a particular type
of model. Please complete the following four modeling steps for your
assigned model.
library(Stat2Data)
data("MLBStandings2016")
head(MLBStandings2016)
## Team League Wins Losses WinPct BattingAverage Runs Hits HR
## 1 Arizona Diamondbacks NL 69 93 0.426 0.261 752 1479 190
## 2 Atlanta Braves NL 68 93 0.422 0.255 649 1404 122
## 3 Baltimore Orioles AL 89 73 0.549 0.256 744 1413 253
## 4 Boston Red Sox AL 93 69 0.574 0.282 878 1598 208
## 5 Chicago Cubs NL 103 58 0.640 0.256 808 1409 199
## 6 Chicago White Sox AL 78 84 0.481 0.257 686 1428 168
## Doubles Triples RBI SB OBP SLG ERA HitsAllowed Walks StrikeOuts Saves
## 1 285 56 709 137 0.320 0.432 5.09 1563 603 1318 31
## 2 295 27 615 75 0.321 0.384 4.51 1414 547 1227 39
## 3 265 6 710 19 0.317 0.443 4.22 1408 545 1248 54
## 4 343 25 836 83 0.348 0.461 4.00 1342 490 1362 43
## 5 293 30 767 66 0.343 0.429 3.15 1125 495 1441 38
## 6 277 33 656 77 0.317 0.410 4.10 1422 521 1270 43
## WHIP
## 1 1.492
## 2 1.355
## 3 1.364
## 4 1.273
## 5 1.110
## 6 1.343
We are going to include the categorical predictor League
in each model we consider in class. We need to verify that R understands
this to be a categorical variable by checking that the type of
programming variable is factor
or fct
.
MLBStandings2016$League %>% class
## [1] "factor"
Since this column is already a factor type, we don’t have to do anything else and we can include this predictor variable in our model in the usual way.
If the output from the above code chunk was anything besides
factor
, then we would have to use the mutate
function to add on a new column to our data set that is specifically a
factor
column. For example:
MLBStandings2016_edit <- MLBStandings2016 %>% mutate(League_fct = factor(League))
MLBStandings2016_edit %>% head
## Team League Wins Losses WinPct BattingAverage Runs Hits HR
## 1 Arizona Diamondbacks NL 69 93 0.426 0.261 752 1479 190
## 2 Atlanta Braves NL 68 93 0.422 0.255 649 1404 122
## 3 Baltimore Orioles AL 89 73 0.549 0.256 744 1413 253
## 4 Boston Red Sox AL 93 69 0.574 0.282 878 1598 208
## 5 Chicago Cubs NL 103 58 0.640 0.256 808 1409 199
## 6 Chicago White Sox AL 78 84 0.481 0.257 686 1428 168
## Doubles Triples RBI SB OBP SLG ERA HitsAllowed Walks StrikeOuts Saves
## 1 285 56 709 137 0.320 0.432 5.09 1563 603 1318 31
## 2 295 27 615 75 0.321 0.384 4.51 1414 547 1227 39
## 3 265 6 710 19 0.317 0.443 4.22 1408 545 1248 54
## 4 343 25 836 83 0.348 0.461 4.00 1342 490 1362 43
## 5 293 30 767 66 0.343 0.429 3.15 1125 495 1441 38
## 6 277 33 656 77 0.317 0.410 4.10 1422 521 1270 43
## WHIP League_fct
## 1 1.492 NL
## 2 1.355 NL
## 3 1.364 AL
## 4 1.273 AL
## 5 1.110 NL
## 6 1.343 AL
The R function I()
is useful when fitting higher
order polynomial terms. The model below includes only a fourth
order polynomial terms (an no main effects):
mod_ex1 <- lm(WinPct ~ I(BattingAverage^4), MLBStandings2016)
mod_ex1 %>% summary
##
## Call:
## lm(formula = WinPct ~ I(BattingAverage^4), data = MLBStandings2016)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.125866 -0.049218 -0.001581 0.045756 0.139646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36212 0.07582 4.776 5.12e-05 ***
## I(BattingAverage^4) 32.18552 17.48579 1.841 0.0763 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06361 on 28 degrees of freedom
## Multiple R-squared: 0.1079, Adjusted R-squared: 0.07608
## F-statistic: 3.388 on 1 and 28 DF, p-value: 0.07629
There are several ways to include interaction terms.
I recommend using the :
rather than *
to
indicate a term is the product of two others. The model below includes
an interaction effect only (and no main effects).
mod_ex2 <- lm(WinPct ~ Runs:Hits, MLBStandings2016)
mod_ex2 %>% summary
##
## Call:
## lm(formula = WinPct ~ Runs:Hits, data = MLBStandings2016)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.134349 -0.039615 0.008667 0.041812 0.111690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.468e-01 9.007e-02 2.740 0.01056 *
## Runs:Hits 2.473e-07 8.731e-08 2.832 0.00847 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05938 on 28 degrees of freedom
## Multiple R-squared: 0.2227, Adjusted R-squared: 0.1949
## F-statistic: 8.02 on 1 and 28 DF, p-value: 0.008474
Q1 What are the names of the two predictor variables we have chosen to use in our models?
[Write your answer to Q1 here.]
Q2 Write out the form for the regression model that you are fitting below. Make sure this model includes the error term!
[Write your answer to Q2 here.]
Fit your model to predict win percentage below.
#regmod <- lm(?~?, data=?)
#regmod %>% summary
Q3: What is the estimated regression model for this data? (Make sure this model does NOT include an error term!)
[Write your answer to Q3 here.]
Use the space below to assess your model. Summarize your assessment at the end of this section. In addition to assessing the regression model conditions, also look for influential data points and for evidence of multicollinearity. The code chunk below will help you to create a new data object that contains your model’s fitted values, residuals, standardized residuals.
#MLB_all <- MLBStandings2016 %>% mutate(fits = regmod$fitted.values,
# resids = regmod$residuals,
# sresids = rstandard(regmod))
The code outlines for various plots and assessments are provided below.
## Matrix of scatterplots
library("GGally") ## Make sure the package "GGally" is installed before attempting to run this code!
#ggpairs(MLBStandings2016[ ,c(?,?,?)], colour="gray20") ## replace the ?'s with the column numbers corresponding to your predictor variables
## Residual plot with points labeled
#ggplot(?, aes(x=?, y=?)) +
# geom_point() +
# labs(title="Residual plot",x="Fitted values", y="Standardized residuals") +
# geom_hline(yintercept=0)
## Normal quantile plot
#ggplot(?, aes(sample=?)) +
# geom_qq() +
# geom_qq_line() +
# labs(title="Normal quantile plot", x="Theoretical quantiles", y="Standardized residual quantiles")
## Variance inflation factors for each predictor
library("car") ## Make sure the package "car" is installed before attempting to run this code!
#regmod %>% vif
Q4 Summarize your model assessment here. Make sure you address all six regression conditions: linearity, zero mean, constant variance, independence, normality, and randomness.
[Write your answer to Q4 here.]
Q5 After hearing about the other groups’ models, which model do you think is most appropriate for this data? Write out the complete regression model for your group’s final choice and explain why it is favorable. Also, comment on whether or not this model should be used for estimation or prediction only or if it is also informative for inferential questions.
[Write your answer to Q5 here.]