Exploring and understanding data
Exploring and understanding relationships between variables
Data
Observational units
Variables - categorical or numeric - and their distributions
Analysis
Confirmatory or exploratory
Sample statistics
Contingency tables and independence
Simpson’s Paradox and lurking variables
Below are the batting averages for Derek Jeter and David Justice for the 1995 and 1996 seasons, as well as their combined batting averages across the two seasons. Derek Jeter has a lower batting average in both 1995 and 1996, but a higher combined average. What’s going on here?
| | 1995 | 1996 | Combined |
|---|---|---|---|
| Derek Jeter | \(0.250\) | \(0.314\) | \(0.310\) |
| David Justice | \(0.253\) | \(0.321\) | \(0.270\) |
Simpson’s paradox: an imbalance in group sizes can cause the overall (combined) average to be weighted differently than the trend seen in each subgroup. The lurking variable we couldn’t see from the proportions alone is the number of at-bats each player had in each year, shown in the table (and sketch) below.
| | 1995 | 1996 | Combined |
|---|---|---|---|
| Derek Jeter | \(0.250 = 12/48\) | \(0.314 = 183/582\) | \(0.310 = 195/630\) |
| David Justice | \(0.253 = 104/411\) | \(0.321 = 45/140\) | \(0.270 = 149/551\) |
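As a quick check of the arithmetic, here is a minimal Python sketch using only the hits and at-bats from the table above. It shows that each combined average is an at-bat-weighted average of the yearly averages, which is what lets the paradox happen.

```python
# Hits and at-bats from the table above: {year: (hits, at_bats)}
jeter   = {"1995": (12, 48),   "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

def combined_average(seasons):
    """Combined average = total hits / total at-bats, i.e. a weighted
    average of the yearly averages with weights proportional to at-bats."""
    hits = sum(h for h, ab in seasons.values())
    at_bats = sum(ab for h, ab in seasons.values())
    return hits / at_bats

for name, seasons in [("Jeter", jeter), ("Justice", justice)]:
    yearly = {yr: round(h / ab, 3) for yr, (h, ab) in seasons.items()}
    print(name, yearly, "combined:", round(combined_average(seasons), 3))

# Jeter is lower in each year (0.250 < 0.253 and 0.314 < 0.321) but higher
# overall (0.310 > 0.270) because most of his at-bats came in his better
# year (1996), while most of Justice's came in 1995.
```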
Shape and measures of center and spread
Mean vs median
Variance vs IQR
Normal model and the 68/95/99.7 rule
Transforming variables
Shifts and scales
Standardization
Quantiles and quartiles
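A minimal Python sketch of the items in the list above, using simulated data (not a dataset from class): center and spread, standardization, and an empirical check of the 68/95/99.7 rule.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=10_000)  # simulated Normal data

# Measures of center and spread
mean, median = x.mean(), np.median(x)
var, sd = x.var(ddof=1), x.std(ddof=1)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Standardization: shift by the mean (changes center only),
# then scale by the SD (changes spread as well)
z = (x - mean) / sd

# Empirical check of the 68/95/99.7 rule
for k in (1, 2, 3):
    frac = np.mean(np.abs(z) <= k)
    print(f"within {k} SD of the mean: {frac:.3f}")  # ~0.683, ~0.954, ~0.997
```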
Source: https://www.skymark.com/resources/tools/normal_test_plot.asp
Corey has 4929 songs in his computer’s music library. The song lengths have a mean of 242.4 seconds and a standard deviation of 114.51 seconds; a Normal probability plot of the song lengths accompanies this example.
Q: Is this distribution Normal? If not, how does it differ from a Normal model?
For additional practice, see the plots on pg 145 of your textbook and the examples at the bottom of this webpage.
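Corey’s actual data aren’t reproduced here, but the sketch below shows how a Normal probability plot is built and read, using simulated right-skewed “song lengths” (song-length data are typically right-skewed, with a few very long tracks); the distribution and its parameters are assumptions for illustration only.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
# Simulated, right-skewed "song lengths" (NOT Corey's actual library)
lengths = rng.lognormal(mean=5.4, sigma=0.45, size=4929)

fig, ax = plt.subplots()
stats.probplot(lengths, dist="norm", plot=ax)  # Normal probability plot
ax.set_title("Normal probability plot of simulated song lengths")
plt.show()

# For right-skewed data the points bend upward away from the reference
# line at the high end: the longest songs are much longer than a Normal
# model with the same mean and SD would predict.
```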
Simple linear regression model
Predicted/fitted values and residuals
Standard error of the residuals
Correlation and the coefficient of determination
Variable roles
Variable transformations - transforming a numeric variable can change its shape and spread
Regression to the mean
A CEO complains that the winners of his “rookie junior executive of the year” award often turn out to have less impressive performance the following year. He wonders whether the award actually encourages them to slack off.
Q: What is a better explanation for why the winners of the “rookie junior executive of the year” award often turn out to have less impressive performance the following year?
Q: An online investment blogger advises investing in mutual funds that have performed badly the past year because “regression to the mean tells us that they will do well next year.” Is she correct?
Extrapolation
Outliers, leverage, and influence
Outliers don’t seem to follow the trend of the majority of the data points.
Leverage points are outliers with respect to the \(x\)-values only (far from the mean of \(x\)).
Influential points are those data points that can drastically change the slope or intercept of the regression equation
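A short sketch on simulated data (all values hypothetical) of how a single high-leverage point can be influential, dragging the least-squares slope and intercept away from the trend of the rest of the data.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=30)
y = 2 + 0.5 * x + rng.normal(scale=1, size=30)   # true slope is about 0.5

def slr_fit(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    return round(b0, 2), round(b1, 2)

print("without the extra point:", slr_fit(x, y))

# Add one high-leverage point: extreme in x AND far from the trend in y
x_out = np.append(x, 30.0)
y_out = np.append(y, 0.0)
print("with the influential point:", slr_fit(x_out, y_out))

# The slope and intercept change drastically, so the point is influential.
# A high-leverage point that sits ON the trend would barely move the fit.
```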
Model assumptions met
Linearity between the predictor and response
No outliers
No thickening or thinning in the residuals plot
In SLR, transforming a variable can help produce a linear relationship and can help stabilize the variance of the residuals, both key assumptions of an SLR model.
Refer to the solutions for our in-class worksheet on fitting a SLR model to predict gator weight based on the length of alligators in aerial photographs.
For SLR models, \(R^2 = r^2\); the only practical difference is that \(r\) indicates not only the strength of a linear trend but also its direction (through its sign).
\(R^2\) has a convenient interpretation as the proportion of variability in the response, \(Y\), that is explained (or “accounted for”) by the predictor, \(X\).
Both \(r\) and \(R^2\) are numerical summaries of strength of a relationship between two numeric variables; however, these summaries are only appropriate if the trend between the variables is linear.
In the gator model above, the regression model for the original data has a correlation coefficient of 0.9144007 and an \(R^2\) value of 0.8361287. The regression model for the transformed data has a correlation coefficient of 0.9798302 and an \(R^2\) value of 0.9600673.
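The alligator data aren’t reproduced here, but the sketch below illustrates the same two points on simulated data (all values assumed for illustration): a log transform of the response can make a curved, fanning relationship nearly linear, and \(R^2\) equals \(r^2\) on both scales.

```python
import numpy as np

rng = np.random.default_rng(11)
length = rng.uniform(1, 4, size=50)                                    # simulated "lengths"
weight = np.exp(1.0 + 1.2 * length + rng.normal(scale=0.2, size=50))   # curved, fanning response

def r_and_R2(x, y):
    """Correlation r and R^2 from a least-squares SLR of y on x."""
    r = np.corrcoef(x, y)[0, 1]
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    R2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    return round(r, 4), round(R2, 4)

print("original scale:  r, R^2 =", r_and_R2(length, weight))
print("log(weight):     r, R^2 =", r_and_R2(length, np.log(weight)))
# In both cases R^2 is r squared; the transformed fit has the higher
# values because the log makes the relationship (nearly) linear.
```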
In an SLR model, we can use residual plots to help us assess both the assumption of a linear association between our predictor and response and the assumption that the variance of our noise term, \(\varepsilon\), is a single, constant number (which we estimate with the standard error of the residuals, \(SE = \sqrt{(\sum_{i=1}^{n} e_i^2)/(n-2)}\)).
For SLR models, a residual plot may have the predictor, \(x\), on the horizontal axis or it may have the fitted values, \(\hat{y}\), on the horizontal axis. The vertical axis of a residual plot is always the residuals, \(e\). In either case, we are hoping to see an amorphous blob of points in this scatter plot without any discernible patterns. You can find a few examples of “good” and “bad” residual plots on this website.
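Tying these pieces together, here is a minimal sketch on simulated data (not a dataset from class) that fits an SLR by least squares, computes the fitted values and residuals, the standard error of the residuals \(\sqrt{(\sum e_i^2)/(n-2)}\), and draws a residual plot against the fitted values.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
n = 60
x = rng.uniform(0, 10, size=n)
y = 3 + 1.5 * x + rng.normal(scale=2, size=n)   # simulated linear data

# Least-squares fit
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x             # fitted values
e = y - y_hat                   # residuals

# Standard error of the residuals: sqrt(sum(e_i^2) / (n - 2))
se = np.sqrt(np.sum(e**2) / (n - 2))
print(f"intercept={b0:.2f}, slope={b1:.2f}, SE of residuals={se:.2f}")

# Residual plot: fitted values on the horizontal axis, residuals vertical.
# We hope to see a patternless blob centered at zero with roughly constant
# vertical spread (no thickening/thinning, no curvature).
plt.scatter(y_hat, e)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```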