class: center, middle, inverse, title-slide

# Defining and Interpreting Interaction Terms
### STAT 021 with Prof Suzy
### Swarthmore College

---

<style type="text/css">
pre {
  background: #FFBB33;
  max-width: 100%;
  overflow-x: scroll;
}
.scroll-output {
  height: 70%;
  overflow-y: scroll;
}
.scroll-small {
  height: 50%;
  overflow-y: scroll;
}
.red{color: #ce151e;}
.green{color: #26b421;}
.blue{color: #426EF0;}
</style>

## Polynomial Regression Models

The `\(k\)`th-order polynomial model in one (predictor) variable is

`$$Y|x = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \epsilon$$`

We can also fit a polynomial regression model in two (or more) predictor variables, for example

`$$Y \mid x = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \epsilon$$`

Both of these models are still **linear** regression models because they are linear in the coefficients.

### General Principles for Regression Model Complexity

1. Keep the order of the model as low as possible
2. Be cautious about extrapolating beyond the range of the observed data
3. 
Hierarchical models: if the model contains a term of a given order, it should also contain all lower-order terms

---

## Polynomial Regression Models

### Shape of response surface

.scroll-output[
<img src="Figs/class17-1.png" width="1037" height="400" style="display: block; margin: auto;" /><img src="Figs/class17-2.png" width="1015" height="400" style="display: block; margin: auto;" /><img src="Figs/class17-3.png" width="1083" height="400" style="display: block; margin: auto;" />
]

---

## Polynomial Regression Models

### Interpreting the coefficients (and enumerating them)

Consider the second-order polynomial regression model of two predictors with an interaction term:

`$$Y \mid x = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \epsilon.$$`

When we solve for the least-squares estimates of the model coefficients, how do we interpret

- `\(\hat{\beta}_{11}\)`
- `\(\hat{\beta}_{12}\)`

---

## Interaction terms

It is possible for two predictor variables to jointly affect the response, and it's possible that this joint effect is not simply additive.<sup>[1]</sup> A consequence of interaction between variables is that it becomes more difficult to predict the effect of changing the value of any individual variable on its own.

**Q:** Why would we want to include interaction terms in a MLR?

--

**A:** It allows our model to fit differently over different subsets of the data.

***

--

**In practice:**

- Predictor variables that have large main effects tend to also have large interactions with other predictor variables.<sup>[3]</sup>
- When you include interaction terms in your model, it's a good idea to standardize your data before fitting the model. This helps with the interpretability of the model coefficients, since interacting terms may not be measured in the same units.<sup>[3]</sup>

---

## Interaction Terms Example

### Public School SAT data

Recall the data set of SAT scores for public schools in each state of the US.
.scroll-output[

```r
SAT_data <- read_table2("Data/sat_data.txt", col_names = FALSE,
                        cols(col_character(), col_double(), col_double(),
                             col_double(), col_double(), col_double(),
                             col_double(), col_double()))
colnames(SAT_data) <- c("State", "PerPupilSpending", "StuTeachRatio", "Salary",
                        "PropnStu", "SAT_verbal", "SAT_math", "SAT_tot")
SAT_data <- SAT_data %>%
  mutate(prop_taking_SAT = PropnStu/100) %>%
  select(-PropnStu)
head(SAT_data)
```

```
## # A tibble: 6 x 8
##   State      PerPupilSpending StuTeachRatio Salary SAT_verbal SAT_math SAT_tot
##   <chr>                 <dbl>         <dbl>  <dbl>      <dbl>    <dbl>   <dbl>
## 1 Alabama                4.40          17.2   31.1        491      538    1029
## 2 Alaska                 8.96          17.6   48.0        445      489     934
## 3 Arizona                4.78          19.3   32.2        448      496     944
## 4 Arkansas               4.46          17.1   28.9        482      523    1005
## 5 California             4.99          24     41.1        417      485     902
## 6 Colorado               5.44          18.4   34.6        462      518     980
## # … with 1 more variable: prop_taking_SAT <dbl>
```

```r
names(SAT_data)
```

```
## [1] "State"            "PerPupilSpending" "StuTeachRatio"    "Salary"
## [5] "SAT_verbal"       "SAT_math"         "SAT_tot"          "prop_taking_SAT"
```
]

---

### Public School SAT data

Recall the MLR model with four predictor variables that we built to predict SAT scores of public schools.

.scroll-output[

```
## 
## Call:
## lm(formula = SAT_tot ~ PerPupilSpending + StuTeachRatio + Salary + 
##     prop_taking_SAT, data = SAT_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.531 -20.855  -1.746  15.979  66.571 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1045.972     52.870  19.784  < 2e-16 ***
## PerPupilSpending    4.463     10.547   0.423    0.674    
## StuTeachRatio      -3.624      3.215  -1.127    0.266    
## Salary              1.638      2.387   0.686    0.496    
## prop_taking_SAT  -290.448     23.126 -12.559 2.61e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.809 
## F-statistic: 52.88 on 4 and 45 DF,  p-value: < 2.2e-16
```
]

---

### Public School SAT data

Recall the MLR model with four predictor variables that we built to predict SAT scores of public schools.

.scroll-output[
<img src="week12-part1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---

## Interaction Terms Example

### Public School SAT data

Since we want to investigate possible interactions, let's standardize all of the quantitative predictor variables in this data set to remove the units from each variable.

.scroll-small[

```r
SAT_data_standard <- SAT_data %>%
  mutate_at(vars("PerPupilSpending", "StuTeachRatio", "Salary", "prop_taking_SAT"),
            funs(scale))
MLR_SAT_standard <- lm(SAT_tot ~ PerPupilSpending + StuTeachRatio + Salary +
                         prop_taking_SAT, data = SAT_data_standard)
summary(MLR_SAT_standard)
```

```
## 
## Call:
## lm(formula = SAT_tot ~ PerPupilSpending + StuTeachRatio + Salary + 
##     prop_taking_SAT, data = SAT_data_standard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.531 -20.855  -1.746  15.979  66.571 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       965.920      4.625 208.858  < 2e-16 ***
## PerPupilSpending    6.082     14.373   0.423    0.674    
## StuTeachRatio      -8.214      7.287  -1.127    0.266    
## Salary              9.731     14.183   0.686    0.496    
## prop_taking_SAT   -77.731      6.189 -12.559 2.61e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.809 
## F-statistic: 52.88 on 4 and 45 DF,  p-value: < 2.2e-16
```
]

Now that all of the predictor variables are unitless, it's clear that the variable with the largest estimated effect on SAT scores is the proportion of students taking the exam.
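---

## Interaction Terms Example

### Public School SAT data

An aside on package versions: the `funs()` helper used in the standardization step has been deprecated since dplyr 0.8, so it may warn or fail on a current installation. A sketch of an equivalent step using `across()` (assuming dplyr version 1.0 or later and the same `SAT_data` tibble) is:

```r
library(dplyr)

# Standardize the four quantitative predictors with across(), the
# replacement for mutate_at() + funs(); as.numeric() drops the matrix
# attributes that scale() attaches to its result.
SAT_data_standard <- SAT_data %>%
  mutate(across(c(PerPupilSpending, StuTeachRatio, Salary, prop_taking_SAT),
                ~ as.numeric(scale(.x))))
```

Either route produces the same standardized columns, so the fitted coefficients from `lm()` are unchanged.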
---

## Interaction Terms Example

### Public School SAT data

Let's see if there are any interaction effects between the proportion of students taking the exam and the other variables.

.scroll-small[

```r
MLR_SAT_interaction <- lm(SAT_tot ~ PerPupilSpending + StuTeachRatio + Salary +
                            prop_taking_SAT + prop_taking_SAT*PerPupilSpending +
                            prop_taking_SAT*StuTeachRatio + prop_taking_SAT*Salary,
                          data = SAT_data_standard)
summary(MLR_SAT_interaction)
```

```
## 
## Call:
## lm(formula = SAT_tot ~ PerPupilSpending + StuTeachRatio + Salary + 
##     prop_taking_SAT + prop_taking_SAT * PerPupilSpending + prop_taking_SAT * 
##     StuTeachRatio + prop_taking_SAT * Salary, data = SAT_data_standard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.330 -14.056  -5.881  20.180  66.456 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       960.184      5.421 177.135  < 2e-16 ***
## PerPupilSpending                    5.546     14.716   0.377    0.708    
## StuTeachRatio                      -8.086      7.256  -1.114    0.271    
## Salary                              8.224     14.002   0.587    0.560    
## prop_taking_SAT                   -80.575      6.344 -12.701 5.65e-16 ***
## PerPupilSpending:prop_taking_SAT   -9.224     14.508  -0.636    0.528    
## StuTeachRatio:prop_taking_SAT     -12.199      7.343  -1.661    0.104    
## Salary:prop_taking_SAT             14.138     12.985   1.089    0.282    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.69 on 42 degrees of freedom
## Multiple R-squared:  0.8463, Adjusted R-squared:  0.8206 
## F-statistic: 33.03 on 7 and 42 DF,  p-value: 4.178e-15
```
]

---

## Interaction Terms Example

### Public School SAT data

Although the R output suggests there is no significant interaction between the proportion of students taking the exam and the other three variables, let's consider the mathematical form of the interaction model on the previous slide.
If `\(Y=\)` SAT score, `\(x_1=\)` per pupil spending, `\(x_2=\)` student teacher ratio, `\(x_3=\)` teacher salary, and `\(x_4=\)` proportion of eligible students who took the SAT, then the model with the three interaction terms from before looks like:

`$$Y \mid (x_1, x_2, x_3, x_4) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_{14} (x_1 x_4) + \beta_{24} (x_2 x_4) + \beta_{34} (x_3 x_4) + \epsilon$$`

--

which can be rearranged to look like

`$$Y \mid (x_1, x_2, x_3, x_4) = \beta_0 + (\beta_1 + \beta_{14} x_4)x_1 + (\beta_2 + \beta_{24} x_4)x_2 + (\beta_3 + \beta_{34} x_4)x_3 + \beta_4 x_4 +\epsilon.$$`

---

## Interaction Terms Example

### Public School SAT data

The interpretation of an interaction term is that the effect on SAT scores of changing `\(x_1\)`, for instance, also depends on the value of `\(x_4\)`. In our specific model, for example, the interaction between student teacher ratio and the proportion of students taking the exam means that the effect of a school's student teacher ratio on SAT scores depends on whether that proportion is low or high.

---

## Reading along in your textbook

Chapter 3 Section 1, Chapter 7 Sections 1, 2, and 4

***

## References

[1] https://stats.stackexchange.com/questions/113733/what-is-the-difference-between-collinearity-and-interaction

[2] https://datascienceplus.com/multicollinearity-in-r/

[3] Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill. Cambridge University Press. (2007)

[4] Linear Statistical Models by James Stapleton. Wiley Series in Probability and Statistics. (2009)

[5] Linear Regression Analysis by George Seber and Alan Lee. Wiley Series in Probability and Statistics. (2003)
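---

## Appendix: Conditional slopes in R

To make the rearranged form concrete, the conditional slope on student teacher ratio, `\(\hat{\beta}_2 + \hat{\beta}_{24} x_4\)`, can be computed directly from the fitted interaction model. This sketch assumes the `MLR_SAT_interaction` fit from earlier is still in the workspace; the coefficient names match the summary output shown above.

```r
# Extract the fitted coefficients from the interaction model
b <- coef(MLR_SAT_interaction)

# Estimated effect of student teacher ratio on SAT_tot as a
# function of the (standardized) proportion taking the SAT
slope_STR <- function(x4) {
  b["StuTeachRatio"] + b["StuTeachRatio:prop_taking_SAT"] * x4
}

# One SD below the mean, at the mean, and one SD above:
slope_STR(c(-1, 0, 1))
# approximately 4.11, -8.09, -20.29 given the estimates above
```

The estimated slope changes sign: where few students take the SAT, a higher student teacher ratio is associated with slightly higher scores, while where many take it the association is negative (though, as the summary shows, this interaction is not statistically significant).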