class: center, middle, inverse, title-slide

# MLR Example
### STAT 021 with Prof Suzy
### Swarthmore College

---

<style type="text/css">
pre {
  background: #FFBB33;
  max-width: 100%;
  overflow-x: scroll;
}
.scroll-output {
  height: 70%;
  overflow-y: scroll;
}
.scroll-small {
  height: 50%;
  overflow-y: scroll;
}
.red{color: #ce151e;}
.green{color: #26b421;}
.blue{color: #426EF0;}
</style>

## MLR with only numerical predictor variables
### Public school SAT data

.scroll-output[

```r
library(tidyverse)  # provides read_table2(), mutate(), select(), and %>%

# Read the raw file: one character column (state) followed by seven numeric columns
SAT_data <- read_table2("Data/sat_data.txt", col_names = FALSE,
                        col_types = cols(col_character(), col_double(), col_double(),
                                         col_double(), col_double(), col_double(),
                                         col_double(), col_double()))
colnames(SAT_data) <- c("State", "PerPupilSpending", "StuTeachRatio", "Salary",
                        "PropnStu", "SAT_verbal", "SAT_math", "SAT_tot")
# Rescale PropnStu by 1/100 and store it as prop_taking_SAT
SAT_data <- SAT_data %>%
  mutate(prop_taking_SAT = PropnStu/100) %>%
  select(-PropnStu)
head(SAT_data) ## Note: please don't print out entire data sets in your HW solutions!
```

```
## # A tibble: 6 x 8
##   State      PerPupilSpending StuTeachRatio Salary SAT_verbal SAT_math SAT_tot
##   <chr>                 <dbl>         <dbl>  <dbl>      <dbl>    <dbl>   <dbl>
## 1 Alabama                4.40          17.2   31.1        491      538    1029
## 2 Alaska                 8.96          17.6   48.0        445      489     934
## 3 Arizona                4.78          19.3   32.2        448      496     944
## 4 Arkansas               4.46          17.1   28.9        482      523    1005
## 5 California             4.99          24     41.1        417      485     902
## 6 Colorado               5.44          18.4   34.6        462      518     980
## # … with 1 more variable: prop_taking_SAT <dbl>
```

```r
names(SAT_data)
```

```
## [1] "State"            "PerPupilSpending" "StuTeachRatio"    "Salary"
## [5] "SAT_verbal"       "SAT_math"         "SAT_tot"          "prop_taking_SAT"
```

]

---

## Public school SAT data
### What are the factors affecting SAT scores?
<img src="Figs/class15_SLR-1.png" height="400" style="display: block; margin: auto;" />

---

## Public school SAT data
### First: Visualize 2D plots for every pair of variables

.scroll-output[

```r
# Scatterplot matrix of all numeric columns (State is a label, so drop it)
SAT_data %>% select(-State) %>% pairs(pch=16)
```

<img src="Figs/class15_scatter-1.png" style="display: block; margin: auto;" />

]

---

## Public school SAT data
### First: Visualize 2D plots for every pair of variables

.scroll-output[

```r
# Same plot, keeping only the total score as the response
SAT_data %>% select(-c(State, SAT_verbal, SAT_math)) %>% pairs(pch=16)
```

<img src="Figs/class15_scatter2-1.png" style="display: block; margin: auto;" />

]

---

## Public school SAT data
### Second: Fit a linear regression model and plot the residuals

.scroll-output[

```r
MLR_SAT <- lm(SAT_tot ~ PerPupilSpending + StuTeachRatio, data = SAT_data)
summary(MLR_SAT)
```

```
## 
## Call:
## lm(formula = SAT_tot ~ PerPupilSpending + StuTeachRatio, data = SAT_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -147.694  -51.816    6.258   37.756  127.742 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1136.336    107.803  10.541 5.69e-14 ***
## PerPupilSpending  -22.308      7.956  -2.804  0.00731 ** 
## StuTeachRatio      -2.295      4.784  -0.480  0.63370    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.48 on 47 degrees of freedom
## Multiple R-squared:  0.149,  Adjusted R-squared:  0.1128 
## F-statistic: 4.114 on 2 and 47 DF,  p-value: 0.02258
```

]

---

## Public school SAT data
### Second: Fit a linear regression model and plot the residuals

.scroll-output[

<img src="Figs/class15_3dplot2-1.png" height="400" style="display: block; margin: auto;" />

]

---

## Public school SAT data
### Second: Fit a linear regression model and plot the residuals

.scroll-output[

<img src="Figs/class15_resid-1.png" height="400" style="display: block; margin: auto;" />

]

---

## Public school SAT data
### Second: Fit a linear regression model and plot the residuals

```r
# Attach residuals and fitted values from the model to the original data
SAT_resid_data <- SAT_data %>%
  mutate(residuals = residuals(MLR_SAT),
         fitted_vals = fitted(MLR_SAT))

# Residuals vs. fitted values, with a reference line at zero
ggplot(SAT_resid_data) +
  geom_point(aes(x=fitted_vals, y=residuals)) +
  labs(title="Residual plot", subtitle="Public school SAT scores",
       x="Predicted SAT Scores", y="Residuals") +
  geom_hline(yintercept=0) +
  ylim(-200,200)
```
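
---

## Public school SAT data
### Aside: Checking the model summary by hand

The last three numbers in the `summary(MLR_SAT)` output are linked by standard formulas, so they can be recomputed from the printed R-squared and degrees of freedom. The sketch below is only a sanity check under that assumption: it copies the rounded values from the output (so results should agree only up to rounding), and the variable names are made up for this illustration.

```r
# Values copied from the printed summary(MLR_SAT) output (rounded as shown there)
r_squared <- 0.149   # Multiple R-squared
df_model  <- 2       # number of predictors in the model
df_resid  <- 47      # residual degrees of freedom
n         <- df_model + df_resid + 1  # implied number of observations

# Adjusted R-squared: penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

# Overall F-statistic for H0: all slope coefficients equal zero
f_stat <- (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Upper-tail p-value from the F distribution with (df_model, df_resid) df
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

# Compare these with the last lines of the summary output
round(c(adj_r_squared = adj_r_squared, f_stat = f_stat, p_value = p_value), 4)
```

Small differences from the printed summary are expected because the inputs here are already rounded.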