This worksheet is to be completed in class as a group. Take a minute to have everyone introduce themselves and then choose who will play the roles of recorder and reporter. The recorder is responsible for completing this worksheet and emailing the .Rmd file to your professor with the subject “Stat 21: Week 5 Worksheet”. Make sure everyone’s name is included at the top of this document! The reporter is responsible for taking note of your answers and any questions that come up, and for sharing them with Prof. Suzy or the Muse when they stop by to check in on your group. Other group members are responsible for keeping the discussion on-task and for offering possible solutions or sharing questions with the group.
For this assignment, I suggest that you choose the recorder to be whoever last ate cereal for breakfast and choose the reporter to be whoever hasn’t had cereal for breakfast in the longest time (or ever).
Create a data frame for PalmBeach.
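library(Stat2Data)  # assumption: PalmBeach is loaded from the Stat2Data package
library(tidyverse)  # assumption: supplies %>%, ggplot(), mutate(), and filter() used below
data("PalmBeach")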
names(PalmBeach)
## [1] "County" "Buchanan" "Bush"
Fit a model to predict Buchanan votes using Bush votes.
regall <- lm(Buchanan~Bush, data=PalmBeach)
regall %>% summary
##
## Call:
## lm(formula = Buchanan ~ Bush, data = PalmBeach)
##
## Residuals:
## Min 1Q Median 3Q Max
## -907.50 -46.10 -29.19 12.26 2610.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.529e+01 5.448e+01 0.831 0.409
## Bush 4.917e-03 7.644e-04 6.432 1.73e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 353.9 on 65 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
## F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
Plot the 2000 presidential election vote totals for all Florida counties.
ggplot(PalmBeach, aes(x=Bush, y=Buchanan)) +
geom_point() +
labs(title="Election totals in FL counties", x="Bush", y="Buchanan")
Create a residual plot for the butterfly ballot data.
#PalmBeach_all <- PalmBeach %>% mutate(resids = regall$residuals,
# fits = regall$fitted.values)
#ggplot(PalmBeach_all, aes(x=?, y=?)) +
# geom_point() +
# labs(title="Residual plot", subtitle="Buchanan votes predicted with Bush votes", x="Fitted values", y="Residuals") +
# geom_hline(yintercept=0)
Now label each data point with the row number to make it easier to identify particular points.
#ggplot(PalmBeach_all, aes(x=fits, y=resids)) +
# geom_point() +
# labs(title="Residual plot", subtitle="Buchanan votes predicted with Bush votes", x="Fitted values", y="Residuals") +
# geom_hline(yintercept=0) #+
# geom_text(label=rownames(PalmBeach_all))
Palm Beach County is observation 50, so we can find the standardized residual value for this data point as follows.
#sresids <- regall %>% rstandard
#sresids
#sresids[50]
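As a rough check on what rstandard computes, the sketch below rebuilds standardized residuals from the raw residuals, the leverages, and the residual standard error. It uses a small simulated data set rather than PalmBeach, so toy, fit, and the unusual row 40 are all hypothetical names and values chosen for illustration.

```r
# Hypothetical toy data (not the PalmBeach data); row 40 is made unusual on purpose
set.seed(21)
toy <- data.frame(x = runif(40, min = 0, max = 100))
toy$y <- 5 + 0.3 * toy$x + rnorm(40, sd = 4)
toy$y[40] <- toy$y[40] + 40

fit <- lm(y ~ x, data = toy)

e <- resid(fit)           # raw residuals
h <- hatvalues(fit)       # leverages h_ii
s <- summary(fit)$sigma   # residual standard error

# Standardized residual: e_i / (s * sqrt(1 - h_ii))
sres_by_hand <- e / (s * sqrt(1 - h))

# Should agree with the built-in function
all.equal(sres_by_hand, rstandard(fit))

# Standardized residual for the unusual observation
sres_by_hand[40]
```

The same indexing idea, sresids[50], picks out Palm Beach County in the worksheet's model.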
Use the geom_histogram function and the ggplot function to create a histogram of the standardized residuals.
#PalmBeach_all2 <- PalmBeach_all %>% mutate(sresids = sresids)
# Use this space to make a histogram of the standardized residuals
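For reference, here is a minimal sketch of the histogram pattern on simulated data. The data frame toy, the column sresids, and the bin count are assumptions made for this example, so the names need adapting before they apply to PalmBeach_all2.

```r
library(ggplot2)

# Hypothetical toy data (not the PalmBeach data); row 40 is made unusual on purpose
set.seed(21)
toy <- data.frame(x = runif(40, min = 0, max = 100))
toy$y <- 5 + 0.3 * toy$x + rnorm(40, sd = 4)
toy$y[40] <- toy$y[40] + 40

fit <- lm(y ~ x, data = toy)
toy$sresids <- rstandard(fit)   # add standardized residuals as a column

# bins = 15 is an arbitrary choice for illustration
hist_plot <- ggplot(toy, aes(x = sresids)) +
  geom_histogram(bins = 15) +
  labs(title = "Standardized residuals (toy data)",
       x = "Standardized residual", y = "Count")
hist_plot
```

An unusual observation shows up as an isolated bar far from the rest of the distribution.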
Let’s find the value of the studentized residual and compare it to the standardized residual for observation 50.
#studresids <- regall %>% rstudent
#studresids
#studresids[50]
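To see why the two kinds of residual can differ, the sketch below computes one studentized residual by hand: the model is refit without the observation, and the resulting residual standard error replaces the one from the full fit. It runs on simulated data, so toy, fit, and the index i = 40 are hypothetical.

```r
# Hypothetical toy data (not the PalmBeach data); row 40 is made unusual on purpose
set.seed(21)
toy <- data.frame(x = runif(40, min = 0, max = 100))
toy$y <- 5 + 0.3 * toy$x + rnorm(40, sd = 4)
toy$y[40] <- toy$y[40] + 40

fit <- lm(y ~ x, data = toy)
i <- 40

# Residual standard error from the fit that leaves out observation i
fit_minus_i <- lm(y ~ x, data = toy[-i, ])
s_minus_i <- summary(fit_minus_i)$sigma

# Studentized residual: e_i / (s_(i) * sqrt(1 - h_ii))
t_i <- resid(fit)[i] / (s_minus_i * sqrt(1 - hatvalues(fit)[i]))

# Should agree with the built-in function
all.equal(unname(t_i), unname(rstudent(fit)[i]))

# Side-by-side comparison for observation i
c(standardized = unname(rstandard(fit)[i]), studentized = unname(t_i))
```

Because an unusual point inflates the residual standard error of the full fit, leaving it out gives a smaller denominator and therefore a studentized residual that is larger in magnitude.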
And compare a histogram of the studentized residuals as well.
#PalmBeach_all3 <- PalmBeach_all2 %>% mutate(studresids = studresids)
# Use this space to make a histogram of the studentized residuals
Let’s assess the regression model if we exclude the Palm Beach data point using the filter function. See how the dimensions of the data sets differ.
#NoPalmBeach <- PalmBeach %>% filter(County!="PALM BEACH")
#PalmBeach %>% dim
#NoPalmBeach %>% dim
Fit the regression model to the data without Palm Beach.
#regnoPB <- lm(Buchanan~Bush,data=NoPalmBeach)
#regnoPB %>% summary
Let’s compare the regression lines with and without Palm Beach.
#regall %>% names
#regall$coefficients
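One way to make this comparison easier to read is to stack the two sets of coefficients in a single table. The sketch below does this on simulated data with base R subsetting; toy, fit_all, fit_drop, and the dropped row 40 are hypothetical stand-ins for PalmBeach, regall, regnoPB, and the Palm Beach row.

```r
# Hypothetical toy data (not the PalmBeach data); row 40 is made unusual on purpose
set.seed(21)
toy <- data.frame(x = runif(40, min = 0, max = 100))
toy$y <- 5 + 0.3 * toy$x + rnorm(40, sd = 4)
toy$y[40] <- toy$y[40] + 40

fit_all  <- lm(y ~ x, data = toy)          # all observations
fit_drop <- lm(y ~ x, data = toy[-40, ])   # unusual observation removed

# One row per model, one column per coefficient
coef_table <- rbind(all_data = coef(fit_all), without_row_40 = coef(fit_drop))
coef_table

# Change in intercept and slope caused by dropping the point
coef(fit_drop) - coef(fit_all)
```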
Which line in the plot below represents the fitted regression model without using the Palm Beach data point?
#ggplot(PalmBeach, aes(x=Bush, y=Buchanan)) +
# geom_point() +
# labs(title="Election totals in FL counties", subtitle="SLR with and without influential point", x="Bush", y="Buchanan") +
# geom_abline(intercept = regall$coefficients[1], slope = regall$coefficients[2], lwd=2, col="darkblue") +
# geom_abline(intercept = regnoPB$coefficients[1], slope = regnoPB$coefficients[2], lty =2, lwd=2, col="darkblue")
## Use this space to create a Normal quantile plot for the residuals
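As a starting point, the sketch below shows one common ggplot2 pattern for a Normal quantile plot, using stat_qq and stat_qq_line on simulated data. The names toy, fit, and resids are assumptions for this example and would need to be swapped for the worksheet's own objects.

```r
library(ggplot2)

# Hypothetical toy data (not the PalmBeach data); row 40 is made unusual on purpose
set.seed(21)
toy <- data.frame(x = runif(40, min = 0, max = 100))
toy$y <- 5 + 0.3 * toy$x + rnorm(40, sd = 4)
toy$y[40] <- toy$y[40] + 40

fit <- lm(y ~ x, data = toy)
toy$resids <- resid(fit)

# Points compare sample quantiles to Normal quantiles; the line is a visual reference
qq_plot <- ggplot(toy, aes(sample = resids)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal quantile plot of residuals (toy data)",
       x = "Theoretical quantiles", y = "Sample quantiles")
qq_plot
```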
Compute a confidence interval for the slope.
#regall %>% confint
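The interval can also be approximated by hand as estimate ± t* × standard error. The sketch below plugs in the slope estimate, standard error, and residual degrees of freedom from the regression summary printed earlier in this worksheet. The 95% level is an assumption (it is the default for confint), and because the summary values are rounded, the result may differ slightly from the confint output.

```r
# Values copied from the printed summary of regall (rounded as displayed there)
slope_est <- 4.917e-03   # estimate for Bush
slope_se  <- 7.644e-04   # standard error for Bush
resid_df  <- 65          # residual degrees of freedom

# Critical value for a 95% interval
t_crit <- qt(0.975, df = resid_df)

# estimate +/- t* x SE
slope_ci <- c(lower = slope_est - t_crit * slope_se,
              upper = slope_est + t_crit * slope_se)
slope_ci
```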
Use our model to predict the average number of Buchanan votes for an observation of 103000 votes for Bush.
#PalmBeach %>% names
#new_x <- data.frame(Bush=103000)
#fit <- regall %>% predict(newdata = new_x)
#fit
Find a 95% confidence interval for the average number of Buchanan votes for an observation of 103000 votes for Bush.
#regall %>% predict.lm(new_x, interval='confidence')
Find a 95% prediction interval for the number of Buchanan votes for an observation of 103000 votes for Bush.
#regall %>% predict.lm(new_x, interval='prediction')
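The two intervals share a center and differ only in their standard errors: the prediction interval adds the variability of a single new response on top of the uncertainty in the estimated mean. The sketch below works both out by hand on simulated data and checks them against predict. The names toy, fit, and x0 are hypothetical, and the formulas are the standard ones for simple linear regression.

```r
# Hypothetical toy data (not the PalmBeach data)
set.seed(21)
toy <- data.frame(x = runif(40, min = 0, max = 100))
toy$y <- 5 + 0.3 * toy$x + rnorm(40, sd = 4)

fit <- lm(y ~ x, data = toy)

x0    <- 50                      # predictor value of interest (arbitrary)
new_x <- data.frame(x = x0)

n      <- nrow(toy)
s      <- summary(fit)$sigma     # residual standard error
x_bar  <- mean(toy$x)
sxx    <- sum((toy$x - x_bar)^2)
y_hat  <- unname(predict(fit, newdata = new_x))
t_crit <- qt(0.975, df = n - 2)

# Standard errors for the mean response and for a single new response
se_mean <- s * sqrt(1 / n + (x0 - x_bar)^2 / sxx)
se_pred <- s * sqrt(1 + 1 / n + (x0 - x_bar)^2 / sxx)

ci_by_hand <- y_hat + c(-1, 1) * t_crit * se_mean
pi_by_hand <- y_hat + c(-1, 1) * t_crit * se_pred

# Built-in versions for comparison
ci_builtin <- predict(fit, newdata = new_x, interval = "confidence")
pi_builtin <- predict(fit, newdata = new_x, interval = "prediction")

all.equal(ci_by_hand, unname(ci_builtin[1, c("lwr", "upr")]))
all.equal(pi_by_hand, unname(pi_builtin[1, c("lwr", "upr")]))
```

The extra 1 under the square root is why a prediction interval is always wider than the confidence interval at the same predictor value.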
Edit the code below to create a plot with the CI for the mean response and regression line by changing the se parameter to TRUE.
#ggplot(data=PalmBeach, aes(x=Bush, y=Buchanan)) +
# geom_point() +
# geom_smooth(method=lm, se=FALSE)
Prediction intervals are a bit more involved. We need to create a new data frame that contains the original data and the upper and lower bounds of PIs for an unobserved response for a range of different predictor values. I’ll show you some code for this next class.
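In the meantime, here is one possible way to build such a data frame, which may differ from the code shown in class. It is a minimal sketch on simulated data: toy, toy_pi, and the dashed-line styling are assumptions for this example. The columns fit, lwr, and upr are the names predict returns when an interval is requested.

```r
library(ggplot2)

# Hypothetical toy data (not the PalmBeach data)
set.seed(21)
toy <- data.frame(x = runif(40, min = 0, max = 100))
toy$y <- 5 + 0.3 * toy$x + rnorm(40, sd = 4)

fit <- lm(y ~ x, data = toy)

# Prediction interval bounds at each observed predictor value
pi_bounds <- predict(fit, newdata = toy, interval = "prediction")

# Original data plus the fit, lwr, and upr columns
toy_pi <- cbind(toy, pi_bounds)
head(toy_pi)

# Shaded band: CI for the mean response; dashed lines: prediction interval bounds
band_plot <- ggplot(toy_pi, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  geom_line(aes(y = lwr), linetype = "dashed") +
  geom_line(aes(y = upr), linetype = "dashed")
band_plot
```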