This worksheet is to be completed in class as a group. Take a minute for everyone to introduce themselves, then choose a recorder and a reporter. The recorder is responsible for completing this worksheet and emailing the .Rmd file to your professor with the subject “Stat 21: Week 5 Worksheet”. Make sure everyone’s name is included at the top of this document! The reporter is responsible for taking note of your answers and any questions that come up, and for sharing them with Prof. Suzy or the Muse when they stop by to check in on your group. Other group members are responsible for keeping the discussion on task and for offering possible solutions or questions to the group.

For this assignment, I suggest that you choose the recorder to be whoever last ate cereal for breakfast and choose the reporter to be whoever hasn’t had cereal for breakfast in the longest time (or ever).

Step 1: Choose

Create a data frame for PalmBeach.

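library(Stat2Data)  # assumption: PalmBeach ships with the Stat2Data package
library(tidyverse)  # provides %>%, mutate, filter, and ggplot used below
data("PalmBeach")
names(PalmBeach)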
## [1] "County"   "Buchanan" "Bush"

Step 2: Fit

Fit a model to predict Buchanan votes using Bush votes.

regall <- lm(Buchanan~Bush, data=PalmBeach)
regall %>% summary
## 
## Call:
## lm(formula = Buchanan ~ Bush, data = PalmBeach)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -907.50  -46.10  -29.19   12.26 2610.19 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.529e+01  5.448e+01   0.831    0.409    
## Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 353.9 on 65 degrees of freedom
## Multiple R-squared:  0.3889, Adjusted R-squared:  0.3795 
## F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08

Plot the 2000 presidential election vote totals for all Florida counties.

ggplot(PalmBeach, aes(x=Bush, y=Buchanan)) + 
  geom_point() + 
  labs(title="Election totals in FL counties", x="Bush", y="Buchanan") 

Step 3: Assess

Create a residual plot for the butterfly ballot data.

Residuals plot

#PalmBeach_all <- PalmBeach %>% mutate(resids = regall$residuals,
#                                        fits = regall$fitted.values)
#ggplot(PalmBeach_all, aes(x=?, y=?)) + 
#  geom_point() + 
#  labs(title="Residual plot", subtitle="Buchanan votes predicted with Bush votes", x="Fitted values", y="Residuals") + 
#  geom_hline(yintercept=0)

Now label each data point with the row number to make it easier to identify particular points.

#ggplot(PalmBeach_all, aes(x=fits, y=resids)) + 
#    geom_point() + 
#    labs(title="Residual plot", subtitle="Buchanan votes predicted with Bush votes", x="Fitted values", y="Residuals") + 
#    geom_hline(yintercept=0) +
#    geom_text(label=rownames(PalmBeach_all)) 

Standardized and studentized residuals

Palm Beach County is observation 50, so we can find the standardized residual value for this data point as follows.

#sresids <- regall %>% rstandard 
#sresids
#sresids[50]

Use the geom_histogram function and the ggplot function to create a histogram of the standardized residuals.

#PalmBeach_all2 <- PalmBeach_all %>% mutate(sresids = sresids)
# Use this space to make a histogram of the standardized residuals 

Let’s find the value of the studentized residual and compare it to the standardized residual for observation 50.

#studresids <- regall %>% rstudent
#studresids
#studresids[50]
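
To see where these two numbers come from, here is a minimal sketch that rebuilds both kinds of residuals by hand. It uses R's built-in cars data as a stand-in (not the PalmBeach data) so that it runs without the course data set, and the object names (fit, std_by_hand, stud_by_hand) are my own. The only difference between the two is which estimate of the error standard deviation goes in the denominator.

```r
# Stand-in data: R's built-in `cars` data set, NOT the worksheet's PalmBeach data
fit <- lm(dist ~ speed, data = cars)

e <- resid(fit)         # raw residuals
h <- hatvalues(fit)     # leverage of each observation
s <- sigma(fit)         # residual standard error from the full fit
n <- nobs(fit)          # number of observations
p <- length(coef(fit))  # number of estimated coefficients (2 in SLR)

# Standardized residual: divide by s, adjusted for leverage
std_by_hand <- e / (s * sqrt(1 - h))

# Studentized residual: same form, but the error SD is re-estimated
# with observation i left out of the fit
s_loo <- sqrt(((n - p) * s^2 - e^2 / (1 - h)) / (n - p - 1))
stud_by_hand <- e / (s_loo * sqrt(1 - h))

# Compare with the built-in helpers used in this worksheet
all.equal(std_by_hand, rstandard(fit))
all.equal(stud_by_hand, rstudent(fit))
```

A point with a very large residual inflates s, which shrinks its own standardized residual. Leaving that point out when estimating the error SD removes this effect, so the studentized residual of an unusual point is typically the larger of the two in absolute value.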

Make a histogram of the studentized residuals as well and compare it to the histogram of the standardized residuals.

#PalmBeach_all3 <- PalmBeach_all2 %>% mutate(studresids = studresids)
# Use this space to make a histogram of the studentized residuals 

Let’s see how the regression model changes when we exclude the Palm Beach data point using the filter function. First, check how the dimensions of the two data sets differ.

#NoPalmBeach <- PalmBeach %>% filter(County!="PALM BEACH")
#PalmBeach %>% dim 
#NoPalmBeach %>% dim

Fit the regression model to the data without Palm Beach.

#regnoPB <- lm(Buchanan~Bush,data=NoPalmBeach)
#regnoPB %>% summary 

Let’s compare the regression lines with and without Palm Beach.

#regall %>% names
#regall$coefficients

Which line in the plot below represents the fitted regression model without using the Palm Beach data point?

#ggplot(PalmBeach, aes(x=Bush, y=Buchanan)) + 
#  geom_point() + 
#  labs(title="Election totals in FL counties", subtitle="SLR with and without influential point", x="Bush", y="Buchanan") + 
#  geom_abline(intercept = regall$coefficients[1], slope = regall$coefficients[2], lwd=2, col="darkblue") +
#  geom_abline(intercept = regnoPB$coefficients[1], slope = regnoPB$coefficients[2], lty =2, lwd=2, col="darkblue")

Normal quantile plot

## Use this space to create a Normal quantile plot for the residuals 

Step 4: Use

Confidence interval for the slope of the predictor

Compute a confidence interval for the slope. The confint function uses a 95% confidence level by default.

#regall %>% confint
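
As a rough sketch of what confint is doing for the slope, the interval is the estimate plus or minus a t critical value times the standard error. The example below again uses the built-in cars data as a stand-in rather than PalmBeach, and the names est, se, and t_star are my own.

```r
# Stand-in data: R's built-in `cars` data set, NOT the worksheet's PalmBeach data
fit <- lm(dist ~ speed, data = cars)

coef_table <- coef(summary(fit))           # same table printed by summary()
est <- coef_table["speed", "Estimate"]     # estimated slope
se  <- coef_table["speed", "Std. Error"]   # its standard error

# Critical value for a 95% interval with n - 2 residual degrees of freedom
t_star <- qt(0.975, df = df.residual(fit))

ci_by_hand <- c(est - t_star * se, est + t_star * se)
ci_by_hand

# Should match the slope row of confint()
confint(fit)["speed", ]
```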

Prediction

Use our model to predict the average number of Buchanan votes for an observation of 103000 votes for Bush.

#PalmBeach %>% names
#new_x <- data.frame(Bush=103000)
#fit <- regall %>% predict(newdata = new_x)
#fit

Confidence interval for the mean response

Find a 95% confidence interval for the average number of Buchanan votes for an observation of 103000 votes for Bush.

#regall %>% predict.lm(new_x, interval='confidence')

Prediction interval for an unobserved value of the response

Find a 95% prediction interval for the number of Buchanan votes for an observation of 103000 votes for Bush.

#regall %>% predict.lm(new_x, interval='prediction')
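
The two intervals share the same center and differ only in their standard errors. Here is a minimal sketch of both calculations by hand, using the built-in cars data as a stand-in (not PalmBeach) and a hypothetical new predictor value x0 = 15 chosen only for illustration.

```r
# Stand-in data: R's built-in `cars` data set, NOT the worksheet's PalmBeach data
fit <- lm(dist ~ speed, data = cars)

x0 <- 15                          # hypothetical new predictor value
new_x <- data.frame(speed = x0)

x <- cars$speed
n <- length(x)
s <- sigma(fit)                   # residual standard error
t_star <- qt(0.975, df = n - 2)   # critical value for 95% intervals
y_hat <- unname(predict(fit, newdata = new_x))

# Shared piece: how far x0 is from the mean of x, relative to the spread of x
dist_term <- (x0 - mean(x))^2 / sum((x - mean(x))^2)

se_mean <- s * sqrt(1 / n + dist_term)      # SE for the mean response
se_pred <- s * sqrt(1 + 1 / n + dist_term)  # SE for a single new response

ci_by_hand <- c(y_hat, y_hat - t_star * se_mean, y_hat + t_star * se_mean)
pi_by_hand <- c(y_hat, y_hat - t_star * se_pred, y_hat + t_star * se_pred)

# Compare with predict(): columns are fit, lwr, upr
predict(fit, new_x, interval = "confidence")
predict(fit, new_x, interval = "prediction")
```

The extra 1 under the square root accounts for the variability of an individual observation around the mean, which is why the prediction interval is always wider than the confidence interval at the same predictor value.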

Visualizing CIs and PIs for the response

Edit the code below to create a plot with the CI for the mean response and regression line by changing the se parameter to TRUE.

#ggplot(data=PalmBeach, aes(x=Bush, y=Buchanan)) + 
#  geom_point() + 
#  geom_smooth(method=lm, se=FALSE)   

Prediction intervals are a bit more involved. We need to create a new data frame that contains the original data and the upper and lower bounds of PIs for an unobserved response for a range of different predictor values. I’ll show you some code for this next class.
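
In the meantime, here is one possible sketch of the idea, not necessarily the code we will use in class. It uses the built-in cars data as a stand-in for PalmBeach, the name cars_pi is my own, and the plotting step only runs if ggplot2 is installed.

```r
# Stand-in data: R's built-in `cars` data set, NOT the worksheet's PalmBeach data
fit <- lm(dist ~ speed, data = cars)

# PI bounds at each observed predictor value; the result has columns fit, lwr, upr
pi_bounds <- predict(fit, newdata = cars, interval = "prediction")

# New data frame: original data plus the fitted values and PI bounds
cars_pi <- cbind(cars, pi_bounds)
head(cars_pi)

# Plot: the shaded band is the CI for the mean response,
# the dashed lines are the PI bounds
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cars_pi, aes(x = speed, y = dist)) +
    geom_point() +
    geom_smooth(method = lm, formula = y ~ x, se = TRUE) +
    geom_line(aes(y = lwr), linetype = "dashed") +
    geom_line(aes(y = upr), linetype = "dashed")
  print(p)
}
```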