This worksheet is to be completed in class as a group. Take a minute to have everyone introduce themselves and then choose who will play the roles of recorder and reporter. The recorder is responsible for completing this worksheet and emailing the .Rmd file to your professor with the subject “Stat 21: Week 4 Worksheet”. Make sure everyone’s name is included at the top of this document! The reporter is responsible for taking note of your answers and any questions that come up, and for sharing them with Prof. Suzy or the Muse when they stop by to check in on your group. Other group members are responsible for keeping the discussion on-task and for offering possible solutions or sharing questions with the group.

For this assignment, I suggest that you choose the recorder to be whoever drove a car the most recently and choose the reporter to be whoever hasn’t driven a car the longest (or ever).

You need to install the package Stat2Data before attempting to run any code chunks below. Then call this package into your working library with the following code:

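library(Stat2Data)   # contains the AccordPrice data
library(tidyverse)   # provides %>%, mutate(), and ggplot() used below
data("AccordPrice")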

In this worksheet, you are going to practice fitting a SLR model to the “Accord prices” data from Ch 1 of your textbook. This data set is called AccordPrice and is located in the Stat2Data R package. The names() function lets us view the different variable names in the data set and the head() function lets us view the first few rows.

Note: Below, we will make use of the “pipe” operator, which is made available when you load the R package tidyverse (it originates in the magrittr package). This operator looks like %>% and can be read as a machine that takes whatever is on the left-hand side and plugs it into whatever is on the right-hand side. If you end a line with the pipe operator, R will continue to read the code on the subsequent line until hitting a stopping point.
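As a small illustration of how the pipe reads, here is a minimal sketch using a made-up numeric vector (`x` is hypothetical and not part of the Accord data). It loads magrittr directly, but loading tidyverse makes the same operator available.

```r
library(magrittr)  # provides %>%; library(tidyverse) also makes it available

x <- c(4, 9, 16)   # made-up values, for illustration only

# Piped form: take x, then take square roots, then sum
piped <- x %>% sqrt %>% sum

# Nested form: the same computation written inside-out
nested <- sum(sqrt(x))

# Both forms give the same number
identical(piped, nested)
```

The piped form reads left to right in the order the steps happen, which is why it is used throughout this worksheet.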

AccordPrice %>% names 
## [1] "Age"     "Price"   "Mileage"
AccordPrice %>% head
##   Age Price Mileage
## 1   7  12.0    74.9
## 2   4  17.9    53.0
## 3   4  15.7    79.1
## 4   7  12.5    50.1
## 5   9   9.5    62.0
## 6   1  21.5     4.8

Instructions: Read the following outline and complete the code chunks below to fit a SLR model that predicts the price of the vehicles based on their mileage.

Uncomment the code below and replace the question marks to find the least-squares regression line.

regmodel <- lm(Price ~ Mileage, data = AccordPrice)
regmodel %>% summary
## 
## Call:
## lm(formula = Price ~ Mileage, data = AccordPrice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5984 -1.8169 -0.4148  1.4502  6.5655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.8096     0.9529   21.84  < 2e-16 ***
## Mileage      -0.1198     0.0141   -8.50 3.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.085 on 28 degrees of freedom
## Multiple R-squared:  0.7207, Adjusted R-squared:  0.7107 
## F-statistic: 72.25 on 1 and 28 DF,  p-value: 3.055e-09
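The estimates in a Coefficients table like the one above can also be extracted from a fitted model with coef(). The sketch below assumes a made-up toy data set (`toy`, `toy_model`, and the values in it are hypothetical, chosen only for illustration) and checks that a prediction computed by hand from the intercept and slope agrees with predict().

```r
# Made-up toy data, for illustration only (not the Accord data)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

toy_model <- lm(y ~ x, data = toy)

# Named vector holding the estimated intercept and slope
b <- coef(toy_model)
b

# Predicted response at x = 3, computed by hand from the estimates
by_hand <- unname(b["(Intercept)"] + b["x"] * 3)

# The same prediction using predict()
by_predict <- unname(predict(toy_model, newdata = data.frame(x = 3)))

all.equal(by_hand, by_predict)
```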

Question 1 What is the estimated regression model?

Answer: [Write your answer here]

The fitted values and many other useful statistics are now stored in the regmodel object. Uncomment and run the line below.

regmodel %>% names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Uncomment the code below and replace the question marks to find and assign the fitted values and the residuals from regmodel.

fits <- regmodel$fitted.values
fits
##         1         2         3         4         5         6         7         8 
## 11.835698 14.459580 11.332488 14.807034 13.381272 20.234515 10.098425 18.317524 
##         9        10        11        12        13        14        15        16 
## 20.234515 15.022696 15.238357 20.450177 13.129667 19.815174 17.562709 18.377430 
##        17        18        19        20        21        22        23        24 
## 12.614476 10.397955 13.081742  2.777915 12.997874 14.088163  4.107827 19.144227 
##        25        26        27        28        29        30 
## 18.581111 18.928565 16.196853 18.437336  6.516047  6.132649
resids <- regmodel$residuals 
resids
##          1          2          3          4          5          6          7 
##  0.1643021  3.4404204  4.3675123 -2.3070342 -3.8812720  1.2654845 -6.5984246 
##          8          9         10         11         12         13         14 
##  4.4824757  6.5654845 -1.4226957  4.1616428 -0.9501770 -4.1296669 -2.4151737 
##         15         16         17         18         19         20         21 
##  0.2372910 -0.8774303  0.8855244 -3.3979545 -1.4817422  5.1220854 -1.2978738 
##         22         23         24         25         26         27         28 
##  1.5118375  0.8921728  1.8557732 -2.9811106 -1.9285653 -0.1968528 -0.8373363 
##         29         30 
##  0.3839526 -0.6326491
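Each residual is the observed response minus the corresponding fitted value. Writing \(e_i\) for the \(i\)-th residual (notation assumed here), the first car in the data has an observed price of 12.0 and a fitted value of about 11.8357, so

\[
e_1 = y_1 - \hat{y}_1 = 12.0 - 11.8357 \approx 0.1643,
\]

which agrees with the first entry of resids.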

Question 2 What is the smallest observed residual? What is the average fitted value?

Answer: [Write your answers here]

Next, using R as a calculator, we can compute the standard error of regression, \(\hat{\sigma}\), by uncommenting the code below:

n <- length(resids)
SSE <- sum(resids^2)                      #Sum of squared errors
SE <- sqrt(SSE/(n-2))
SE
## [1] 3.08504

You can find this standard error in the output of the summary() function above.
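R's stats package also has an extractor function, sigma(), that returns this quantity from a fitted model. The sketch below assumes a made-up toy data set (all names and values are hypothetical) and checks that the by-hand calculation and sigma() agree.

```r
# Made-up toy data, for illustration only (not the Accord data)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))
toy_model <- lm(y ~ x, data = toy)

toy_resids <- residuals(toy_model)
n_toy <- length(toy_resids)

# By hand: square root of SSE divided by (n - 2)
se_by_hand <- sqrt(sum(toy_resids^2) / (n_toy - 2))

# Built-in extractor from the stats package
se_builtin <- sigma(toy_model)

all.equal(se_by_hand, se_builtin)
```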

Question 3 What is \(\hat{\sigma}\) called in the R output of the summary() function?

Answer: [Write your answer here]

We can also find the estimate of the error variance by looking at an “analysis of variance” (or ANOVA) table for our regression model. This table decomposes the variability of our response variable into a component due to the regression model (i.e., the predictor) and leftover unexplained variability due to the random error. Uncomment and run the line of code below and identify the estimate, \(\hat{\sigma}^2\).

regmodel %>% anova
## Analysis of Variance Table
## 
## Response: Price
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Mileage    1 687.66  687.66  72.253 3.055e-09 ***
## Residuals 28 266.49    9.52                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
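The entries of this table are tied together by simple arithmetic. As a sketch using the rounded values printed above: a mean square is a sum of squares divided by its degrees of freedom, the F value is the ratio of the two mean squares, and the share of the total sum of squares attributed to Mileage reproduces the Multiple R-squared from summary().

\[
\frac{266.49}{28} \approx 9.52, \qquad
\frac{687.66}{266.49 / 28} \approx 72.25, \qquad
\frac{687.66}{687.66 + 266.49} \approx 0.7207
\]

Taking the square root of the first quantity gives \(\sqrt{266.49/28} \approx 3.085\), the same value as SE computed earlier.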

Now we are going to create a residuals plot and a normal quantile plot to help us assess whether some necessary conditions for SLR models seem reasonable for this data. Both plots will be created using ggplot, so we first build the base and orient the plot and then add the necessary components on new layers. In order for this to work, we need to have a data object, or a tibble, that we can plug into the ggplot() function. The mutate() function will allow us to supplement the original data set with additional columns for new variables. Uncomment and complete the code below to add two new columns to the AccordPrice data containing the model residuals and fitted values.

accord_price_all_data <- AccordPrice %>% mutate(resids = resids, fits = fits)
accord_price_all_data %>% head  
##   Age Price Mileage     resids     fits
## 1   7  12.0    74.9  0.1643021 11.83570
## 2   4  17.9    53.0  3.4404204 14.45958
## 3   4  15.7    79.1  4.3675123 11.33249
## 4   7  12.5    50.1 -2.3070342 14.80703
## 5   9   9.5    62.0 -3.8812720 13.38127
## 6   1  21.5     4.8  1.2654845 20.23452
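The same columns can be built without intermediate vectors by calling the accessor functions resid() and fitted() inside mutate(). This is a minimal sketch on a made-up toy data set (all names and values are hypothetical), not a replacement for the chunk above; it also confirms that each observed value equals its fitted value plus its residual.

```r
library(dplyr)  # provides mutate() and %>%; library(tidyverse) also works

# Made-up toy data, for illustration only (not the Accord data)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))
toy_model <- lm(y ~ x, data = toy)

# Add residuals and fitted values as new columns
toy_all <- toy %>%
  mutate(resids = resid(toy_model),
         fits = fitted(toy_model))

toy_all %>% head

# observed = fitted + residual, row by row
all.equal(toy_all$y, unname(toy_all$fits + toy_all$resids))
```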

Uncomment and complete the R code chunk below to build a residuals vs fitted values plot and label it.

ggplot(accord_price_all_data, aes(x=fits, y=resids)) + 
  geom_point() + 
  labs(title="Residual plot", subtitle="Accord price predicted by mileage", x="Fitted values", y="Residuals") + 
  geom_hline(yintercept=0)

Question 4 What can you tell about the conditions necessary to fit and use a SLR model based on the residuals plot?

Answer: [Write your answer here.]

Uncomment and complete the R code chunk below to build a Normal quantile plot based on this SLR model. Note that in setting the aesthetics of the plot, rather than orient our plot with an x or y variable, we are referencing a sample from which quantiles will be computed.

ggplot(accord_price_all_data, aes(sample=resids)) + 
  geom_qq() +
  geom_qq_line() + 
  labs(title="Normal quantile plot", subtitle="Accord price predicted by mileage", x="Normal quantiles", y="Ordered residuals") 
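To see what a Normal quantile plot pairs up, the sketch below computes the two sets of coordinates by hand and compares them with base R's qqnorm(). It uses simulated values as a stand-in for residuals (the seed and the vector `r` are assumptions for illustration only). The plotting positions come from ppoints(), which qqnorm() uses; I believe geom_qq() defaults to the same positions, but treat that as an assumption.

```r
set.seed(21)     # arbitrary seed so the simulated values are reproducible
r <- rnorm(30)   # simulated stand-in for a vector of residuals

n_r <- length(r)

# Horizontal axis: standard normal quantiles at n_r plotting positions
theoretical <- qnorm(ppoints(n_r))

# Vertical axis: the sample values sorted from smallest to largest
ordered <- sort(r)

# Base R computes the same coordinates (returned in the original data order)
qq <- qqnorm(r, plot.it = FALSE)

all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), ordered)
```

If the points (theoretical, ordered) fall close to a straight line, the sample quantiles track the Normal quantiles.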

Question 5 What can you tell about the conditions necessary to fit and use a SLR model based on the Normal quantile plot?

Answer: [Write your answer here.]