class: center, middle, inverse, title-slide .title[ # Week 8 and 9 ] .subtitle[ ## Selecting predictors and identifying unsual data points ] .author[ ### Suzanne Thornton ] .institute[ ### Swarthmore College ] --- <style type="text/css"> pre{ background: #FFBB33; max-width: 100%; overflow-x: scroll; } .scroll-output{ height: 70%; overflow-y: scroll; } .scroll-small{ height: 30%; overflow-y: scroll; } .red{color: #ce151e;} .green{color: #26b421;} .blue{color: #426EF0;} ## Why isn't this running???? </style> # Decisions about predictors ## Nested F test ```r data(CrabShip) CrabShip %>% head ``` ``` ## Mass Oxygen Noise ## 1 22.7 89.2 ambient ## 2 34.6 141.1 ambient ## 3 36.0 140.1 ambient ## 4 40.1 204.9 ambient ## 5 47.5 129.1 ambient ## 6 49.6 154.6 ambient ``` --- # Decisions about predictors ## Nested F test ```r full <- lm(Oxygen ~ Mass + Noise, CrabShip) red <- lm(Oxygen ~ Mass, CrabShip) anova(red, full) ``` ``` ## Analysis of Variance Table ## ## Model 1: Oxygen ~ Mass ## Model 2: Oxygen ~ Mass + Noise ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 32 89954 ## 2 31 42516 1 47438 34.589 1.721e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- # Decisions about predictors ## Mallow's Cp ```r library(olsrr) # ols_mallows_cp(model, fullmodel) ols_mallows_cp(red, full) ``` ``` ## [1] 35.58853 ``` --- # Decisions about predictors ## AIC ```r ols_aic(red, method="R") ``` ``` ## [1] 370.4313 ``` ```r ols_aic(full, method="R") ``` ``` ## [1] 346.9512 ``` .scroll-small[ ## BIC ```r ols_sbic(red, full) ``` ``` ## [1] 271.5531 ``` ] --- # Decisions about predictors ## Compare adjusted coefficient of determination .scroll-output[ ```r summary(full)$adj.r.squared ``` ``` ## [1] 0.6038188 ``` ```r summary(red)$adj.r.squared ``` ``` ## [1] 0.1879705 ``` ] --- # Identifying unsusal data points ## Added varb plots .scroll-output[ ```r data(HousesNY) HousesNY %>% head ``` ``` ## Price Beds Baths Size Lot ## 1 57.6 3 2 0.960 1.30 ## 2 120.0 6 2 2.786 0.23 ## 3 150.0 4 2 1.704 0.27 ## 4 143.0 3 2 1.200 0.80 ## 5 92.5 3 1 1.329 0.42 ## 6 50.0 2 1 0.974 0.34 ``` Added variable plots are useful when we want to understand relative effects of a particular predictor variable, given that all other predictors have been accounted for in the model. In particular, for MLR models this is a useful technique to identify potentially influential data points that are unusual only if we consider the values of another/other predictor variable(s). ] --- # Identifying unsusal data points ## Added varb plots .scroll-output[ ```r mod0 <- lm(Price ~ Beds + Lot + Size, HousesNY) mod0 %>% summary ``` ``` ## ## Call: ## lm(formula = Price ~ Beds + Lot + Size, data = HousesNY) ## ## Residuals: ## Min 1Q Median 3Q Max ## -64.549 -28.878 0.228 29.280 68.743 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 37.899 24.397 1.553 0.1268 ## Beds 5.022 9.567 0.525 0.6020 ## Lot 5.934 6.783 0.875 0.3859 ## Size 32.142 12.696 2.532 0.0146 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 36.3 on 49 degrees of freedom ## Multiple R-squared: 0.2766, Adjusted R-squared: 0.2324 ## F-statistic: 6.247 on 3 and 49 DF, p-value: 0.001121 ``` Suppose the predictor of interest is `Size`. The larger model includes `Beds`, `Lot`, and `Size`. ] --- # Identifying unsusal data points ## Added varb plots .scroll-output[ ```r mod1 <- lm(Price ~ Beds + Lot, HousesNY) mod1 %>% summary ``` ``` ## ## Call: ## lm(formula = Price ~ Beds + Lot, data = HousesNY) ## ## Residuals: ## Min 1Q Median 3Q Max ## -68.850 -30.035 -5.439 33.385 69.208 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 32.652 25.590 1.276 0.20787 ## Beds 22.803 6.838 3.335 0.00162 ** ## Lot 4.429 7.113 0.623 0.53634 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 38.21 on 50 degrees of freedom ## Multiple R-squared: 0.182, Adjusted R-squared: 0.1493 ## F-statistic: 5.563 on 2 and 50 DF, p-value: 0.006586 ``` The smaller model includes only `Beds` and `Lot`.] --- # Identifying unsusal data points ## Added varb plots .scroll-output[ Added variable plots would consider the effect of the predictors from the smaller model `Beds` and `Lot` in relation to the predictor of interest, `Size`. ```r mod2 <- lm(Size ~ Beds + Lot, HousesNY) mod2 %>% summary ``` ``` ## ## Call: ## lm(formula = Size ~ Beds + Lot, data = HousesNY) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.64501 -0.29638 0.02711 0.26535 1.06363 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.16325 0.27079 -0.603 0.549 ## Beds 0.55318 0.07236 7.645 5.92e-10 *** ## Lot -0.04683 0.07527 -0.622 0.537 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4044 on 50 degrees of freedom ## Multiple R-squared: 0.5601, Adjusted R-squared: 0.5425 ## F-statistic: 31.83 on 2 and 50 DF, p-value: 1.211e-09 ``` ] --- # Identifying unsusal data points ## Added varb plots .scroll-output[ Let's collect the residuals from the smaller model and the residuals from the regression of the predictor of interest (`Size`) on the predictors of the smaller model in a new data object. ```r added_varb_plot_data <- tibble(res1 = mod1$residuals, res2 = mod2$residuals) added_varb_plot_data %>% head ``` ``` ## # A tibble: 6 × 2 ## res1 res2 ## <dbl> <dbl> ## 1 -49.2 -0.475 ## 2 -50.5 -0.359 ## 3 24.9 -0.333 ## 4 38.4 -0.259 ## 5 -10.4 -0.148 ## 6 -29.8 0.0468 ``` ] --- # Identifying unsusal data points ## Added varb plots .scroll-output[ Now we'll create an added variable plot by plotting these two groups of residuals against one another. We want to look out for any strange data points to investigate them as potentially influential data points. ```r ggplot(added_varb_plot_data, aes(x=res2, y=res1)) + geom_point() + geom_smooth(method=lm, se=FALSE) + geom_abline(intercept=0) + geom_text(label=rownames(added_varb_plot_data), nudge_y = -5) + labs(title="Added variable plot for housing data", x="Residuals for regressing Beds on Size",y="Residuals for regressing Price on Beds") ``` ] --- # Identifying unsusal data points ## Added varb plots ![](Figs/unnamed-chunk-14-1.png)<!-- --> --- # Identifying unsusal data points ## Added varb plots .scroll-output[ Notice that the slope of the line of best fit for the added variable plot is actually the coefficient of the predictor `Size` in the larger model containing all three predictors. ```r resids_mod <- lm(res1 ~ res2, added_varb_plot_data) resids_mod %>% summary() ``` ``` ## ## Call: ## lm(formula = res1 ~ res2, data = added_varb_plot_data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -64.549 -28.878 0.228 29.280 68.743 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.153e-15 4.887e+00 0.000 1.0000 ## res2 3.214e+01 1.244e+01 2.583 0.0127 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 35.58 on 51 degrees of freedom ## Multiple R-squared: 0.1157, Adjusted R-squared: 0.09834 ## F-statistic: 6.672 on 1 and 51 DF, p-value: 0.01271 ``` ] --- # Identifying unsusal data points ## Leverage `$$Leverage_i = h_i = \frac{1}{n} + \frac{x_i - \bar{x}}{\sum (x_i - \bar{x})^2}$$` Standardized residuals: `\(\frac{y_i - \bar{y}}{\hat{\sigma}\sqrt{1 - h_i}}\)` Studentized residuals: `\(\frac{y_i - \bar{y}}{\hat{\sigma}_{(i)}\sqrt{1 - h_i}}\)`, where `\(\hat{\sigma}_{(i)}\)` is the estimate from the model that doesn't include the `\(i^{th}\)` data point. --- # Identifying unsusal data points ## Leverage Other measures of influence include * Cook's distance * DFfits * DFbetas