class: center, middle, inverse, title-slide .title[ # Week 4 - Part 1 ] .subtitle[ ## Inference with SLR Models ] .author[ ### Suzanne Thornton ] .institute[ ### Swarthmore College ] .date[ ### For class in Week 9 (updated: 2023-07-31) ] --- <style type="text/css"> pre { background: #FFBB33; max-width: 100%; overflow-x: scroll; } .scroll-output { height: 70%; overflow-y: scroll; } .scroll-small { height: 50%; overflow-y: scroll; } .red{color: #ce151e;} .green{color: #26b421;} .blue{color: #426EF0;} </style> ## The problem of .red[inference] ### T-test This test helps us determine if the predictor variable, *x*, is statistically significant by conducting a t-test on it's estimated regression coefficient. -- `$$H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0$$` `$$\text{Under }H_0: T.S. = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \sim t_{(n-2)}$$` -- To find `\(SE(\hat{\beta}_1)\)`, we will use `\(SS_{res}\)`: `$$SE(\hat{\beta}_1) = \sqrt{\frac{SS_{res}}{\sum_{i=1}^n(x_i - \bar{x})^2}}.$$` --- ## The problem of .red[inference] ### F-test This tests for the overall fit of the linear regression model. In general, the F-test for a linear regression model helps us determine if any of the predictor variables we included are statistically significant. `$$\text{Under }H_0: T.S. = \frac{\frac{SS_{reg}}{1}}{\frac{SS_{res}}{n-2}} \sim F_{(1,n-2)}$$` -- In **SLR** the F-test and the t-test for `\(\hat{\beta}_1\)` are testing the same thing since there is only one predictor variable in the regression model. In **MLR** (multiple linear regression) we will have more than one predictor variable and so this equality does not hold. `$$H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0$$` `$$Y \mid x = \beta_0 + \beta_1 x + \epsilon$$` --- ## Copy and paste this code to follow along with me ```r hc_employer_2013 <- read_table2( url("http://www.swarthmore.edu/NatSci/sthornt1/DataFiles/health_care_data.txt")) colnames(hc_employer_2013) <- c("Location", "TotalHealthSpending", "TotalUninsured", "TotalPop", "spending_capita", "prop_uninsured") SLR_hc <- lm(prop_uninsured ~ spending_capita, hc_employer_2013) hc_resid_data <- hc_employer_2013 %>% mutate(residuals = SLR_hc$residuals, fitted_vals = SLR_hc$fitted.values) ``` --- ## The problem of .red[inference] ### Where to find this information in R? .scroll-output[ ``` ## ## Call: ## lm(formula = prop_uninsured ~ spending_capita, data = hc_employer_2013) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.086642 -0.016766 0.002575 0.013199 0.073975 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.642e-01 2.744e-02 9.627 7.0e-13 *** ## spending_capita -1.759e-06 3.364e-07 -5.230 3.5e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.02946 on 49 degrees of freedom ## Multiple R-squared: 0.3582, Adjusted R-squared: 0.3451 ## F-statistic: 27.35 on 1 and 49 DF, p-value: 3.503e-06 ``` ``` ## Analysis of Variance Table ## ## Response: prop_uninsured ## Df Sum Sq Mean Sq F value Pr(>F) ## spending_capita 1 0.023730 0.0237300 27.35 3.503e-06 *** ## Residuals 49 0.042514 0.0008676 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ] --- ## The problem of .red[inference] ### Confidence intervals for model coefficients Often we are interested not just in a decision "is `\(x\)` significant or not?" but also in the question "what range of values are good guesses for the effect of `\(x\)` on the response? To answer the latter question we can use the theory behind the t-test from before to also find a confidence interval for `\(\hat{\beta}_1\)`: `$$\hat{\beta}_{1} \pm t_{\alpha/2,n-2} \times SE(\hat{\beta}_1)$$` --- ## The problem of .red[inference] ### Confidence and prediction intervals for the response Suppose we are interested in an interval range of possible values for the response variable. There are two types of intervals we may want to consider: prediction intervals for a new observation or confidence intervals for the regression equation. For confidence intervals of the regression equation, we are quantifying how confident we are in the possible values of the **average response**: `$$\hat{y} \pm t_{\alpha/2, n-2}SE(\hat{y})$$` where `\(SE(\hat{y}) = \sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})\sqrt{\frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}\)`. For prediction intervals for a new observation, we are quantifying how confident we are in the value of a particular **predicted response**: `$$\hat{y}_{new} \pm t_{\alpha/2, n-2}SE(\hat{y}_{new})$$` where `\(SE(\hat{y}_{new}) = \sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})\sqrt{1 + \frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}\)`. --- ## Confidence and prediction intervals The prediction interval is designed to cover a “moving target”, the random future value of `\(y\)`, while the confidence interval is designed to cover the “fixed target”, the average (expected) value of `\(Y\)`, `\(E(Y)\)`. There is more uncertainty associated with new unobserved data points than there is with the average behavior of the data. So prediction intervals will never be more narrow than confidence intervals! <a href="https://www.youtube.com/watch?feature=player_embedded&v=qVCQi0KPR0s">This</a> is a 10 minute video going through another example of the difference between confidence intervals and prediction intervals. For more on the distinction between prediction intervals and confidence intervals go <a href="https://stats.stackexchange.com/questions/16493/difference-between-confidence-intervals-and-prediction-intervals">here</a>. --- ## Confidence intervals and prediction intervals ### In R Now let's go back to the health care data set. ```r new_hc_data = data.frame(spending_capita = 45000) ``` .scroll-small[ ```r predict(SLR_hc, new_hc_data, interval="confidence", level = 0.95) ``` ``` ## fit lwr upr ## 1 0.1850201 0.1595354 0.2105049 ``` ```r predict(SLR_hc, new_hc_data, interval="predict", level = 0.95) ``` ``` ## fit lwr upr ## 1 0.1850201 0.1205737 0.2494665 ``` ] --- ## Confidence intervals and prediction intervals ```r CI_bounds <- as_tibble(predict(SLR_hc, hc_employer_2013, interval="confidence", level = 0.95)) PI_bounds <- as_tibble(predict(SLR_hc, hc_employer_2013, interval="predict", level = 0.95)) new_hc_data <- bind_cols(hc_employer_2013, CI_bounds, PI_bounds[,2:3]) %>% as.tibble(.name_repair="universal") ## note the prediction interval bounds at the one's with "1" appended to the name ggplot(new_hc_data, aes(x=spending_capita, y=prop_uninsured)) + geom_ribbon(aes(ymin=lwr...10, ymax=upr...11, fill='prediction'),alpha=0.3) + geom_smooth(method="lm", se=TRUE, aes(fill='confidence'), alpha=0.3) + geom_point() + labs(title="Scatterplot of the data and the regression line", subtitle="Confidence and prediction intervals", y='Proportion of people uninsured', x='Per capita spending') ``` --- ## Confidence intervals and prediction intervals <img src="Figs/pred_and_conf_ints_class14-1.png" height="500" style="display: block; margin: auto;" />