Week 8 and 9

class: center, middle, inverse, title-slide

.title[
# Week 8 and 9
]
.subtitle[
## Selecting predictors and identifying unsual data points
]
.author[
### Suzanne Thornton
]
.institute[
### Swarthmore College
]

---

<style type="text/css">
pre{
  background: #FFBB33;
  max-width: 100%;
  overflow-x: scroll;
}

.scroll-output{
  height: 70%;
  overflow-y: scroll;
}

.scroll-small{
  height: 30%;
  overflow-y: scroll;
}
   
.red{color: #ce151e;}
.green{color: #26b421;}
.blue{color: #426EF0;}
    
    
## Why isn't this running????
</style>
# Decisions about predictors
## Nested F test

```r
data(CrabShip)
CrabShip %>% head
```

```
##   Mass Oxygen   Noise
## 1 22.7   89.2 ambient
## 2 34.6  141.1 ambient
## 3 36.0  140.1 ambient
## 4 40.1  204.9 ambient
## 5 47.5  129.1 ambient
## 6 49.6  154.6 ambient
```

---
# Decisions about predictors
## Nested F test

```r
full <- lm(Oxygen ~ Mass + Noise, CrabShip)
red <- lm(Oxygen ~ Mass, CrabShip)
anova(red, full)
```

```
## Analysis of Variance Table
## 
## Model 1: Oxygen ~ Mass
## Model 2: Oxygen ~ Mass + Noise
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     32 89954                                  
## 2     31 42516  1     47438 34.589 1.721e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Decisions about predictors
## Mallow's Cp

```r
library(olsrr)
# ols_mallows_cp(model, fullmodel)

ols_mallows_cp(red, full)
```

```
## [1] 35.58853
```
---
# Decisions about predictors
## AIC

```r
ols_aic(red, method="R")
```

```
## [1] 370.4313
```

```r
ols_aic(full, method="R")
```

```
## [1] 346.9512
```

.scroll-small[
## BIC

```r
ols_sbic(red, full)
```

```
## [1] 271.5531
```
]

---
# Decisions about predictors
## Compare adjusted coefficient of determination

.scroll-output[

```r
summary(full)$adj.r.squared
```

```
## [1] 0.6038188
```

```r
summary(red)$adj.r.squared
```

```
## [1] 0.1879705
```
]

---
# Identifying unsusal data points
## Added varb plots

.scroll-output[

```r
data(HousesNY)
HousesNY %>% head
```

```
##   Price Beds Baths  Size  Lot
## 1  57.6    3     2 0.960 1.30
## 2 120.0    6     2 2.786 0.23
## 3 150.0    4     2 1.704 0.27
## 4 143.0    3     2 1.200 0.80
## 5  92.5    3     1 1.329 0.42
## 6  50.0    2     1 0.974 0.34
```

Added variable plots are useful when we want to understand relative effects of a particular predictor variable, given that all other predictors have been accounted for in the model. In particular, for MLR models this is a useful technique to identify potentially influential data points that are unusual only if we consider the values of another/other predictor variable(s). 
]

---
# Identifying unsusal data points
## Added varb plots

.scroll-output[

```r
mod0 <- lm(Price ~ Beds + Lot + Size, HousesNY)
mod0 %>% summary 
```

```
## 
## Call:
## lm(formula = Price ~ Beds + Lot + Size, data = HousesNY)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.549 -28.878   0.228  29.280  68.743 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   37.899     24.397   1.553   0.1268  
## Beds           5.022      9.567   0.525   0.6020  
## Lot            5.934      6.783   0.875   0.3859  
## Size          32.142     12.696   2.532   0.0146 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.3 on 49 degrees of freedom
## Multiple R-squared:  0.2766,	Adjusted R-squared:  0.2324 
## F-statistic: 6.247 on 3 and 49 DF,  p-value: 0.001121
```

Suppose the predictor of interest is `Size`. The larger model includes `Beds`, `Lot`, and `Size`.
]

---
# Identifying unsusal data points
## Added varb plots

.scroll-output[

```r
mod1 <- lm(Price ~ Beds + Lot, HousesNY)
mod1 %>% summary 
```

```
## 
## Call:
## lm(formula = Price ~ Beds + Lot, data = HousesNY)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.850 -30.035  -5.439  33.385  69.208 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   32.652     25.590   1.276  0.20787   
## Beds          22.803      6.838   3.335  0.00162 **
## Lot            4.429      7.113   0.623  0.53634   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.21 on 50 degrees of freedom
## Multiple R-squared:  0.182,	Adjusted R-squared:  0.1493 
## F-statistic: 5.563 on 2 and 50 DF,  p-value: 0.006586
```

The smaller model includes only `Beds` and `Lot`.]

---
# Identifying unsusal data points
## Added varb plots

.scroll-output[
Added variable plots would consider the effect of the predictors from the smaller model `Beds` and `Lot` in relation to the predictor of interest, `Size`.

```r
mod2 <- lm(Size ~ Beds + Lot, HousesNY)
mod2 %>% summary 
```

```
## 
## Call:
## lm(formula = Size ~ Beds + Lot, data = HousesNY)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64501 -0.29638  0.02711  0.26535  1.06363 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.16325    0.27079  -0.603    0.549    
## Beds         0.55318    0.07236   7.645 5.92e-10 ***
## Lot         -0.04683    0.07527  -0.622    0.537    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4044 on 50 degrees of freedom
## Multiple R-squared:  0.5601,	Adjusted R-squared:  0.5425 
## F-statistic: 31.83 on 2 and 50 DF,  p-value: 1.211e-09
```
]

---
# Identifying unsusal data points
## Added varb plots

.scroll-output[
Let's collect the residuals from the smaller model and the residuals from the regression of the predictor of interest (`Size`) on the predictors of the smaller model in a new data object.

```r
added_varb_plot_data <- tibble(res1 = mod1$residuals, res2 = mod2$residuals)
added_varb_plot_data %>% head
```

```
## # A tibble: 6 × 2
##    res1    res2
##   <dbl>   <dbl>
## 1 -49.2 -0.475 
## 2 -50.5 -0.359 
## 3  24.9 -0.333 
## 4  38.4 -0.259 
## 5 -10.4 -0.148 
## 6 -29.8  0.0468
```
]

---
# Identifying unsusal data points
## Added varb plots

.scroll-output[
Now we'll create an added variable plot by plotting these two groups of residuals against one another. We want to look out for any strange data points to investigate them as potentially influential data points.

```r
ggplot(added_varb_plot_data, aes(x=res2, y=res1)) + 
  geom_point() + 
  geom_smooth(method=lm, se=FALSE) + 
  geom_abline(intercept=0) + 
  geom_text(label=rownames(added_varb_plot_data), nudge_y = -5) + 
  labs(title="Added variable plot for housing data", x="Residuals for regressing Beds on Size",y="Residuals for regressing Price on Beds")
```
]

---
# Identifying unsusal data points
## Added varb plots

![](Figs/unnamed-chunk-14-1.png)

---
# Identifying unsusal data points
## Added varb plots 
.scroll-output[
Notice that the slope of the line of best fit for the added variable plot is actually the coefficient of the predictor `Size` in the larger model containing all three predictors.

```r
resids_mod <- lm(res1 ~ res2, added_varb_plot_data)
resids_mod %>% summary()
```

```
## 
## Call:
## lm(formula = res1 ~ res2, data = added_varb_plot_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.549 -28.878   0.228  29.280  68.743 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 4.153e-15  4.887e+00   0.000   1.0000  
## res2        3.214e+01  1.244e+01   2.583   0.0127 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.58 on 51 degrees of freedom
## Multiple R-squared:  0.1157,	Adjusted R-squared:  0.09834 
## F-statistic: 6.672 on 1 and 51 DF,  p-value: 0.01271
```
]

---
# Identifying unsusal data points
## Leverage

`$$Leverage_i = h_i = \frac{1}{n} + \frac{x_i - \bar{x}}{\sum (x_i - \bar{x})^2}$$`

Standardized residuals: `\(\frac{y_i - \bar{y}}{\hat{\sigma}\sqrt{1 - h_i}}\)`

Studentized residuals: `\(\frac{y_i - \bar{y}}{\hat{\sigma}_{(i)}\sqrt{1 - h_i}}\)`, where `\(\hat{\sigma}_{(i)}\)` is the estimate from the model that doesn't include the `\(i^{th}\)` data point.

---
# Identifying unsusal data points
## Leverage

Other measures of influence include

* Cook's distance

* DFfits

* DFbetas