class: center, middle, inverse, title-slide

.title[
# MLR Comparing Models and Identifying Interesting Data Points
]
.subtitle[
## Stat 21
]
.author[
### Suzanne Thornton
]
.institute[
### Swarthmore College
]

---

<style type="text/css">
pre{
  background: #FFBB33;
  max-width: 100%;
  overflow-x: scroll;
}
.scroll-output{
  height: 70%;
  overflow-y: scroll;
}
.scroll-small{
  height: 30%;
  overflow-y: scroll;
}
.red{color: #ce151e;}
.green{color: #26b421;}
.blue{color: #426EF0;}
</style>

# Decisions about predictors
## Nested F test

A nested F-test assesses whether a group of predictor *terms* all have zero coefficients in a multiple regression model. This is particularly useful when deciding whether or not to include a categorical predictor or higher-order polynomial terms.

Consider the `CrabShip` data, which has a categorical predictor `Noise`.

```r
data(CrabShip)
CrabShip %>% head
```

```
##   Mass Oxygen   Noise
## 1 22.7   89.2 ambient
## 2 34.6  141.1 ambient
## 3 36.0  140.1 ambient
## 4 40.1  204.9 ambient
## 5 47.5  129.1 ambient
## 6 49.6  154.6 ambient
```

---
# Decisions about predictors
## Nested F test

If we want to test whether or not `Noise` is a useful predictor of `Oxygen` once `Mass` is in the model, we can compare the models with and without this predictor.

Note that the null hypothesis in the `anova()` test below is that `\(\beta_j = 0\)` for every `\(j = 1, \ldots, k-1\)`, where `\(k\)` is the number of levels of the categorical predictor `Noise`.

```r
full <- lm(Oxygen ~ Mass + Noise, CrabShip)
red  <- lm(Oxygen ~ Mass, CrabShip)
anova(red, full)
```

```
## Analysis of Variance Table
## 
## Model 1: Oxygen ~ Mass
## Model 2: Oxygen ~ Mass + Noise
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     32 89954                                  
## 2     31 42516  1     47438 34.589 1.721e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Decisions about predictors
## Compare adjusted coefficient of determination

We can also decide whether or not to include certain predictor terms or variables by comparing the adjusted `\(R^2\)` values of the models under consideration. This comparison is an estimation problem, not an inferential one: there is no test statistic or p-value attached to a difference in adjusted `\(R^2\)`.

```r
summary(full)$adj.r.squared
```

```
## [1] 0.6038188
```

```r
summary(red)$adj.r.squared
```

```
## [1] 0.1879705
```

---
# Decisions about predictors
## Mallows' Cp

Another useful summary statistic for comparing models is Mallows' `\(C_p\)`. It compares a model with `\(m\)` predictors to a larger model with `\(k\)` predictors (`\(1 < m < k\)`). **Just like** in the nested F-test, the models being compared **must** be nested. A smaller value is preferable: `\(C_p\)` is based on the smaller model's residual sum of squares, scaled by the larger model's estimate of the error variance and penalized for the number of terms in the smaller model.

```r
library(olsrr)
ols_mallows_cp(red, full)
```

```
## [1] 35.58853
```

---
# Identifying unusual data points
## Added variable plots

Added variable plots help us understand the effect of a particular predictor variable once all of the other predictors have been accounted for in the model. For MLR models, they are also a useful way to identify potentially influential data points: points that only look unusual when we consider the values of the other predictor variable(s).
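
The next few slides carry out this idea step by step. The general recipe can be captured in a small helper function (a sketch of our own, not a function from any package):

```r
# Sketch of the added variable plot recipe (our own helper, not from a package).
# model_for_response:  lm() of the response on every predictor EXCEPT the one of interest
# model_for_predictor: lm() of the predictor of interest on those same other predictors
added_varb_plot <- function(model_for_response, model_for_predictor) {
  plot(resid(model_for_predictor), resid(model_for_response),
       xlab = "Residuals: predictor of interest ~ other predictors",
       ylab = "Residuals: response ~ other predictors")
}
```

---
# Identifying unusual data points
## Added variable plots

We will work through exactly these steps for the `HousesNY` data, with `Price` as the response, building the plot with `ggplot2` rather than base `plot()`.
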
```r
data(HousesNY)
HousesNY %>% head
```

```
##   Price Beds Baths  Size  Lot
## 1  57.6    3     2 0.960 1.30
## 2 120.0    6     2 2.786 0.23
## 3 150.0    4     2 1.704 0.27
## 4 143.0    3     2 1.200 0.80
## 5  92.5    3     1 1.329 0.42
## 6  50.0    2     1 0.974 0.34
```

---
# Identifying unusual data points
## Added variable plots

Suppose the predictor of interest is `Size`. The larger model includes `Beds`, `Lot`, and `Size`.

.scroll-small[

```r
mod0 <- lm(Price ~ Beds + Lot + Size, HousesNY)
mod0 %>% summary
```

```
## 
## Call:
## lm(formula = Price ~ Beds + Lot + Size, data = HousesNY)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.549 -28.878   0.228  29.280  68.743 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   37.899     24.397   1.553   0.1268  
## Beds           5.022      9.567   0.525   0.6020  
## Lot            5.934      6.783   0.875   0.3859  
## Size          32.142     12.696   2.532   0.0146 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.3 on 49 degrees of freedom
## Multiple R-squared:  0.2766, Adjusted R-squared:  0.2324 
## F-statistic: 6.247 on 3 and 49 DF,  p-value: 0.001121
```
]

---
# Identifying unusual data points
## Added variable plots

The smaller model includes only `Beds` and `Lot`.

```r
mod1 <- lm(Price ~ Beds + Lot, HousesNY)
mod1 %>% summary
```

```
## 
## Call:
## lm(formula = Price ~ Beds + Lot, data = HousesNY)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.850 -30.035  -5.439  33.385  69.208 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   32.652     25.590   1.276  0.20787   
## Beds          22.803      6.838   3.335  0.00162 **
## Lot            4.429      7.113   0.623  0.53634   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.21 on 50 degrees of freedom
## Multiple R-squared:  0.182,  Adjusted R-squared:  0.1493 
## F-statistic: 5.563 on 2 and 50 DF,  p-value: 0.006586
```

---
# Identifying unusual data points
## Added variable plots

Next we regress the predictor of interest, `Size`, on the predictors from the smaller model, `Beds` and `Lot`.

```r
mod2 <- lm(Size ~ Beds + Lot, HousesNY)
mod2 %>% summary
```

```
## 
## Call:
## lm(formula = Size ~ Beds + Lot, data = HousesNY)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64501 -0.29638  0.02711  0.26535  1.06363 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.16325    0.27079  -0.603    0.549    
## Beds         0.55318    0.07236   7.645 5.92e-10 ***
## Lot         -0.04683    0.07527  -0.622    0.537    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4044 on 50 degrees of freedom
## Multiple R-squared:  0.5601, Adjusted R-squared:  0.5425 
## F-statistic: 31.83 on 2 and 50 DF,  p-value: 1.211e-09
```

---
# Identifying unusual data points
## Added variable plots

Let's collect two sets of residuals in a new data object: the residuals from the smaller model (`mod1`) and the residuals from regressing the predictor of interest, `Size`, on the predictors of the smaller model (`mod2`).

```r
added_varb_plot_data <- tibble(res1 = mod1$residuals,
                               res2 = mod2$residuals)
added_varb_plot_data %>% head
```

```
## # A tibble: 6 × 2
##     res1    res2
##    <dbl>   <dbl>
## 1  -49.2 -0.475 
## 2  -50.5 -0.359 
## 3   24.9 -0.333 
## 4   38.4 -0.259 
## 5  -10.4 -0.148 
## 6  -29.8  0.0468
```

---
# Identifying unusual data points
## Added variable plots

Now we'll create an added variable plot by plotting these two sets of residuals against one another. We want to look out for any strange data points so that we can investigate them as potentially influential.
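
As an aside, the `car` package (an additional package, not otherwise used in these notes) can draw added variable plots for every predictor of a fitted model at once; a minimal sketch, assuming `car` is installed:

```r
library(car)   # assumes the car package is installed
avPlots(mod0)  # one added variable plot for each predictor in mod0
```

---
# Identifying unusual data points
## Added variable plots

Here we build the plot for `Size` ourselves with `ggplot2`.
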
```r
ggplot(added_varb_plot_data, aes(x = res2, y = res1)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  geom_abline(intercept = 0) +
  geom_text(label = rownames(added_varb_plot_data), nudge_y = -5) +
  labs(title = "Added variable plot for housing data",
       x = "Residuals from regressing Size on Beds and Lot",
       y = "Residuals from regressing Price on Beds and Lot")
```

---
# Identifying unusual data points
## Added variable plots

![](MLR-additional-topics_files/figure-html/addedvarbplot-1.png)<!-- -->

---
# Identifying unusual data points
## Added variable plots

Notice that the slope of the line of best fit in the added variable plot is exactly the coefficient of the predictor `Size` in the larger model containing all three predictors.

```r
resids_mod <- lm(res1 ~ res2, added_varb_plot_data)
resids_mod %>% summary()
```

```
## 
## Call:
## lm(formula = res1 ~ res2, data = added_varb_plot_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.549 -28.878   0.228  29.280  68.743 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 4.153e-15  4.887e+00   0.000   1.0000  
## res2        3.214e+01  1.244e+01   2.583   0.0127 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.58 on 51 degrees of freedom
## Multiple R-squared:  0.1157, Adjusted R-squared:  0.09834 
## F-statistic: 6.672 on 1 and 51 DF,  p-value: 0.01271
```

---
# Identifying unusual data points
## Leverage

For a model with a single predictor, the leverage of the `\(i^{th}\)` data point is

`$$\text{Leverage}_i = h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}.$$`

With multiple predictors, the leverages are the diagonal entries of the hat matrix; either way, R computes them with the function `hatvalues()`. For example, in `mod0` with the housing data:

```r
mod0 %>% hatvalues ## Note that the input is the model, not the data
```

```
##          1          2          3          4          5          6          7 
## 0.05746567 0.24215674 0.04928657 0.03208284 0.03363332 0.09784971 0.10144826 
##          8          9         10         11         12         13         14 
## 0.11165497 0.04227226 0.06046648 0.03086568 0.06384987 0.02505667 0.03040575 
##         15         16         17         18         19         20         21 
## 0.05021722 0.06730199 0.04452998 0.08379383 0.09010030 0.02873509 0.05678601 
##         22         23         24         25         26         27         28 
## 0.07083918 0.10057508 0.22117033 0.02912529 0.08259537 0.08243321 0.06281797 
##         29         30         31         32         33         34         35 
## 0.17384227 0.04289596 0.11560486 0.26386171 0.04064779 0.06285799 0.04447776 
##         36         37         38         39         40         41         42 
## 0.12277182 0.03894932 0.06969216 0.03600822 0.04238490 0.11157827 0.08224534 
##         43         44         45         46         47         48         49 
## 0.04819574 0.04266616 0.06344810 0.03759056 0.05666408 0.11541469 0.03121585 
##         50         51         52         53 
## 0.07859199 0.10368117 0.07984430 0.04535338 
```

---
# Identifying unusual data points
## Leverage

Leverage appears in the calculations of both the standardized and the studentized residuals.

* Standardized residuals: `\(\frac{y_i - \hat{y}_i}{\hat{\sigma}\sqrt{1 - h_i}}\)`

  We can calculate the standardized residuals using the `rstandard()` function in R.

* Studentized residuals: `\(\frac{y_i - \hat{y}_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_i}}\)`, where `\(\hat{\sigma}_{(i)}\)` is the estimate of `\(\sigma\)` from the model fit without the `\(i^{th}\)` data point.

  We can calculate the studentized residuals using the `rstudent()` function in R.

.footnote[Note: the input of these functions in R should be the *model*, not the data.]

---
# Identifying unusual data points
## Leverage

Other measures of influence include

* Cook's distance, which can be calculated using the function `cooks.distance()` in R (where, again, the input is the model, not the data).

```r
mod0 %>% cooks.distance
```

```
##            1            2            3            4            5            6 
## 1.413538e-02 1.213439e-01 1.314107e-02 1.418003e-02 2.201015e-04 2.230206e-02 
##            7            8            9           10           11           12 
## 9.345626e-04 2.584234e-03 3.219548e-02 8.582777e-04 2.651157e-02 1.021610e-02 
##           13           14           15           16           17           18 
## 6.434225e-03 6.377695e-03 2.423212e-02 2.046011e-04 2.793853e-02 6.072641e-03 
##           19           20           21           22           23           24 
## 1.721907e-02 1.109198e-03 3.230200e-02 2.552244e-02 5.439213e-02 1.479885e-01 
##           25           26           27           28           29           30 
## 3.379619e-04 9.693781e-07 1.314690e-03 9.005140e-03 1.877188e-02 1.647495e-03 
##           31           32           33           34           35           36 
## 5.181849e-02 1.399116e-02 1.615643e-02 5.060605e-03 1.065972e-02 1.480775e-02 
##           37           38           39           40           41           42 
## 6.554121e-03 1.216308e-03 3.063237e-02 1.036617e-03 3.217909e-05 8.755088e-02 
##           43           44           45           46           47           48 
## 3.036361e-03 2.036300e-02 1.630211e-02 3.840482e-03 1.035753e-02 2.962539e-02 
##           49           50           51           52           53 
## 6.984666e-03 7.812609e-02 3.058865e-02 5.320046e-04 5.160044e-03 
```
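
---
# Identifying unusual data points
## Leverage

A long vector of diagnostics is hard to scan by eye. One option (a sketch, assuming the `tidyverse` is loaded as in the earlier examples; the column names below are our own choices) is to gather the diagnostics for `mod0` into a single table and sort by Cook's distance:

```r
# Gather the influence diagnostics for mod0 into one table,
# then look first at the houses with the largest Cook's distance.
diagnostics <- tibble(house     = 1:nrow(HousesNY),
                      leverage  = hatvalues(mod0),
                      stand_res = rstandard(mod0),
                      stud_res  = rstudent(mod0),
                      cooks_d   = cooks.distance(mod0))

diagnostics %>%
  arrange(desc(cooks_d)) %>%
  head()
```

Reading off the Cook's distance output on the previous slide, houses 24 and 2 have the largest values (about 0.148 and 0.121), so those are the first data points we would investigate.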