Unit 1 topics

Week 1 - Visualizing and summarizing data

For categorical variables

  • Contingency tables and independence

  • Simpson’s Paradox and lurking variables

Example

Below are the batting averages for Derek Jeter and David Justice for the 1995 and 1996 seasons, as well as their combined batting averages across the two seasons. Derek Jeter has a lower batting average in both 1995 and 1996, but a higher combined average. What’s going on here?

                 1995         1996         Combined
Derek Jeter      \(0.250\)    \(0.314\)    \(0.310\)
David Justice    \(0.253\)    \(0.321\)    \(0.270\)

Simpson’s paradox: an imbalance in the sizes of the subgroups makes it possible for the combined average to be weighted toward a different pattern than the trend within each subgroup. The lurking variable that we couldn’t see if we only reported the proportions, as in the table above, is the number of at-bats each player had in each year.

                 1995                  1996                  Combined
Derek Jeter      \(0.250 = 12/48\)     \(0.314 = 183/582\)   \(0.310 = 195/630\)
David Justice    \(0.253 = 104/411\)   \(0.321 = 45/140\)    \(0.270 = 149/551\)
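
A quick sketch in Python (using the hits and at-bats from the table above) makes the mechanism concrete: Justice wins each season, but Jeter's combined average is weighted toward his strong 1996 season because that is when he had most of his at-bats.

```python
# Batting data from the table above: (hits, at-bats) per season.
jeter = {"1995": (12, 48), "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

def avg(hits, at_bats):
    return hits / at_bats

# Justice has the higher average in each individual season...
for year in ("1995", "1996"):
    assert avg(*jeter[year]) < avg(*justice[year])

# ...but pooling the seasons flips the comparison: Jeter's combined
# average is dominated by his 582 at-bats in his strong 1996 season.
jeter_combined = avg(12 + 183, 48 + 582)        # 195/630, about 0.310
justice_combined = avg(104 + 45, 411 + 140)     # 149/551, about 0.270
assert jeter_combined > justice_combined
```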

Week 2 - Describing distributions of numeric variables with variance

Source: https://www.skymark.com/resources/tools/normal_test_plot.asp

Example

Corey has 4929 songs in his computer’s music library. The song lengths have a mean of 242.4 seconds and a standard deviation of 114.51 seconds; a Normal probability plot of the song lengths accompanies the problem.

Q: Is this distribution Normal? If not, how does it differ from a Normal model?

For additional practice, see the plots on pg 145 of your textbook and the examples at the bottom of this webpage.
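
Since Corey's actual library isn't available, here is a sketch with simulated right-skewed "song lengths" (lognormal parameters chosen as rough guesses to match the mean of about 242 s and SD of about 115 s). In a Normal probability plot, right skew shows up as points curving above the fitted line in the upper tail: the longest songs are longer than a Normal model would predict.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative lognormal lengths; the parameters are assumptions chosen
# to roughly reproduce the summary statistics in the example.
lengths = rng.lognormal(mean=5.4, sigma=0.45, size=4929)

# probplot returns ordered data vs. theoretical Normal quantiles plus a
# least-squares fit line; pass plot=plt.gca() to actually draw the plot.
(osm, osr), (slope, intercept, r) = stats.probplot(lengths, dist="norm")

# A clearly positive sample skewness confirms the right skew we would
# see as upward curvature in the plot's upper tail.
print(f"sample skewness: {stats.skew(lengths):.2f}")
```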

Week 3 - Describing relationships between two numeric variables

Example

A CEO complains that the winners of his “rookie junior executive of the year” award often turn out to have less impressive performance the following year. He wonders whether the award actually encourages them to slack off.

Q: What is a better explanation for why the winners of the “rookie junior executive of the year” award often turn out to have less impressive performance the following year?

Example

Q: An online investment blogger advises investing in mutual funds that have performed badly the past year because “regression to the mean tells us that they will do well next year.” Is she correct?
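
The CEO and blogger questions above both hinge on regression to the mean, and a small simulation (illustrative, with made-up skill and luck components) shows why: extreme performers in year 1 are partly lucky, and luck does not carry over, so the top group falls back toward its (still above-average) skill the next year.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
skill = rng.normal(0, 1, n)           # stable ability, same both years
year1 = skill + rng.normal(0, 1, n)   # observed performance = skill + luck
year2 = skill + rng.normal(0, 1, n)   # fresh, independent luck in year 2

# "Award winners": the top 1% of observed year-1 performance.
winners = year1 > np.quantile(year1, 0.99)

# Winners' year-2 average is lower than their year-1 average (regression
# to the mean), yet still well above 0 -- they really are skilled, just
# not as extreme as their lucky year-1 numbers suggested.
print(year1[winners].mean(), year2[winners].mean())
```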

Watch out for

  • Extrapolation

  • Outliers, leverage, and influence

    • Outliers don’t seem to follow the trend of the majority of the data points.

    • Leverage points are outliers with respect to the \(x\)-values only: their \(x\)-value is far from the rest of the data.

    • Influential points are data points that can drastically change the slope or intercept of the regression equation.

  • Model assumptions met

    • Linearity between the predictor and response

    • No outliers

    • No thickening or thinning in the residuals plot
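
A small illustration (made-up data) of the leverage/influence bullets above: adding a single point far out on the \(x\)-axis that doesn't follow the trend can drag the fitted slope from about 2 down to nearly 0.

```python
import numpy as np

# Five points that closely follow y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])

slope_before = np.polyfit(x, y, 1)[0]   # close to 2

# Add one high-leverage point (extreme x) that breaks the trend.
x2 = np.append(x, 20.0)
y2 = np.append(y, 5.0)                  # the trend would predict about 40
slope_after = np.polyfit(x2, y2, 1)[0]  # slope collapses toward 0

print(slope_before, slope_after)
```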

Questions about SLR modeling

1. Transformations

In SLR, transforming a variable can help produce a linear relationship and can help stabilize the variance of the residuals, both key assumptions for using an SLR model.

Refer to the solutions for our in-class worksheet on fitting an SLR model to predict gator weight based on the length of alligators in aerial photographs.
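
The worksheet solutions aren't reproduced here, but the idea can be sketched with simulated gator data (all numbers below are illustrative assumptions): weight grows roughly as a power of length, so the raw scatter is curved, while taking logs of both variables straightens it out and raises the correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
length = rng.uniform(10, 150, 200)   # inches, illustrative range
# Power-law weight with multiplicative noise (parameters are made up).
weight = 0.0002 * length**3 * np.exp(rng.normal(0, 0.2, 200))

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# The raw relationship is monotone but curved; logging both variables
# turns the power law into a straight line, so r moves closer to 1.
r_raw = corr(length, weight)
r_log = corr(np.log(length), np.log(weight))
print(r_raw, r_log)
```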

2. When to use \(r\) vs \(R^2\)?

  • For SLR models, \(R^2\) is simply \(r^2\); the only extra information in \(r\) is its sign, which indicates the direction of the linear trend in addition to its strength.

  • \(R^2\) has a convenient interpretation as the proportion of variability in the response, \(Y\), that is explained (or “accounted for”) by the predictor, \(X\).

  • Both \(r\) and \(R^2\) are numerical summaries of strength of a relationship between two numeric variables; however, these summaries are only appropriate if the trend between the variables is linear.

  • In the gator model above, the regression model for the original data has a correlation coefficient of 0.9144007 and an \(R^2\) value of 0.8361287. The regression model for the transformed data has a correlation coefficient of 0.9798302 and an \(R^2\) value of 0.9600673.
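
As a check, the gator-model numbers above are consistent with \(R^2 = r^2\): \(0.9144007^2 \approx 0.8361287\) and \(0.9798302^2 \approx 0.9600673\). The same identity can be verified on any small data set (the numbers below are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

r = np.corrcoef(x, y)[0, 1]

# R^2 from the fitted line: 1 - SSE/SST.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r_squared = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# For simple linear regression, R^2 equals r squared.
print(abs(r**2 - r_squared))
```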

3. Residual plots

In an SLR model, we can use residual plots to help us assess both the assumption of a linear association between our predictor and response and the assumption that the variance of our noise, \(\varepsilon\), is a single, constant number (which we can estimate with the standard error, \(SE = \sqrt{(\sum_{i=1}^{n} e_i^2)/(n-2)}\)).

For SLR models, a residual plot may have the predictor, \(x\), on the horizontal axis or it may have the fitted values, \(\hat{y}\), on the horizontal axis. The vertical axis of a residual plot is always the residuals, \(e\). In either case, we are hoping to see an amorphous blob of points in this scatter plot without any discernible patterns. You can find a few examples of “good” and “bad” residual plots on this website.
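
The "thickening" pattern described earlier can also be detected numerically. Here is a sketch with simulated data whose noise grows with \(x\): after fitting the line, the residual spread on the right half of the fitted values is clearly larger than on the left half, which is exactly the fan shape you would see in the residual plot.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 200)
y = 3 * x + 1 + rng.normal(0, 0.5 * x)   # noise SD grows with x

slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
resid = y - fitted

# Compare residual spread on the left vs. right half of the fitted values.
# Roughly equal spreads would support the constant-variance assumption;
# here the right half is much wider, signaling thickening.
left_sd = resid[fitted < np.median(fitted)].std()
right_sd = resid[fitted >= np.median(fitted)].std()
print(left_sd, right_sd)
```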