Exploring and understanding data
Exploring and understanding relationships between variables
Data
Observational units
Variables - categorical or numeric - and their distributions
Analysis
Confirmatory or exploratory
Sample statistics
Contingency tables and independence
Simpson’s Paradox and lurking variables
Below are the batting averages for Derek Jeter and David Justice for the 1995 and 1996 seasons, as well as their combined batting averages across the two seasons. Derek Jeter has a lower batting average in both 1995 and 1996, but a higher combined average. What’s going on here?
| | 1995 | 1996 | Combined |
|---|---|---|---|
| Derek Jeter | \(0.250\) | \(0.314\) | \(0.310\) |
| David Justice | \(0.253\) | \(0.321\) | \(0.270\) |
Simpson’s paradox: an imbalance in group sizes can cause the overall (combined) average to be weighted differently than the trend seen in each subgroup. The lurking variable we couldn’t see from the proportions alone is the number of at-bats each player had in each year, shown in the table (and sketch) below.
| | 1995 | 1996 | Combined |
|---|---|---|---|
| Derek Jeter | \(0.250 = 12/48\) | \(0.314 = 183/582\) | \(0.310 = 195/630\) |
| David Justice | \(0.253 = 104/411\) | \(0.321 = 45/140\) | \(0.270 = 149/551\) |
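As a quick check of the arithmetic, here is a minimal Python sketch using only the hits and at-bats from the table above. It shows that each combined average is an at-bat-weighted average of the yearly averages, which is what lets the paradox happen.

```python
# Hits and at-bats from the table above: {year: (hits, at_bats)}
jeter   = {"1995": (12, 48),   "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

def combined_average(seasons):
    """Combined average = total hits / total at-bats, i.e. a weighted
    average of the yearly averages with weights proportional to at-bats."""
    hits = sum(h for h, ab in seasons.values())
    at_bats = sum(ab for h, ab in seasons.values())
    return hits / at_bats

for name, seasons in [("Jeter", jeter), ("Justice", justice)]:
    yearly = {yr: round(h / ab, 3) for yr, (h, ab) in seasons.items()}
    print(name, yearly, "combined:", round(combined_average(seasons), 3))

# Jeter is lower in each year (0.250 < 0.253 and 0.314 < 0.321) but higher
# overall (0.310 > 0.270) because most of his at-bats came in his better
# year (1996), while most of Justice's came in 1995.
```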
Shape and measures of center and spread
Mean vs median
Variance vs IQR
Normal model and the 68/95/99.7 rule
Transforming variables
Shifts and scales
Standardization
Quantiles and quartiles
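A minimal Python sketch of the items in the list above, using simulated data (not a dataset from class): center and spread, standardization, and an empirical check of the 68/95/99.7 rule.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=10_000)  # simulated Normal data

# Measures of center and spread
mean, median = x.mean(), np.median(x)
var, sd = x.var(ddof=1), x.std(ddof=1)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Standardization: shift by the mean (changes center only),
# then scale by the SD (changes spread as well)
z = (x - mean) / sd

# Empirical check of the 68/95/99.7 rule
for k in (1, 2, 3):
    frac = np.mean(np.abs(z) <= k)
    print(f"within {k} SD of the mean: {frac:.3f}")  # ~0.683, ~0.954, ~0.997
```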
Source: https://www.skymark.com/resources/tools/normal_test_plot.asp
Corey has 4929 songs in his computer’s music library. The song lengths have a mean of 242.4 seconds and a standard deviation of 114.51 seconds; a Normal probability plot of the song lengths accompanies this example.
Q: Is this distribution Normal? If not, how does it differ from a Normal model?
For additional practice, see the plots on pg 145 of your textbook and the examples at the bottom of this webpage.
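Corey’s actual data aren’t reproduced here, but the sketch below shows how a Normal probability plot is built and read, using simulated right-skewed “song lengths” (song-length data are typically right-skewed, with a few very long tracks); the distribution and its parameters are assumptions for illustration only.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
# Simulated, right-skewed "song lengths" (NOT Corey's actual library)
lengths = rng.lognormal(mean=5.4, sigma=0.45, size=4929)

fig, ax = plt.subplots()
stats.probplot(lengths, dist="norm", plot=ax)  # Normal probability plot
ax.set_title("Normal probability plot of simulated song lengths")
plt.show()

# For right-skewed data the points bend upward away from the reference
# line at the high end: the longest songs are much longer than a Normal
# model with the same mean and SD would predict.
```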
Simple linear regression model
Predicted/fitted values and residuals
Standard error of the residuals
Correlation and the coefficient of determination
Variable roles
Variable transformations - transforming a numeric variable can change its shape and spread
Regression to the mean
A CEO complains that the winners of his “rookie junior executive of the year” award often turn out to have less impressive performance the following year. He wonders whether the award actually encourages them to slack off.
Q: What is a better explanation for why the winners of the “rookie junior executive of the year” award often turn out to have less impressive performance the following year?
Q: An online investment blogger advises investing in mutual funds that have performed badly the past year because “regression to the mean tells us that they will do well next year.” Is she correct?
Extrapolation
Outliers, leverage, and influence
Outliers don’t seem to follow the trend of the majority of the data points.
Leverage points are outliers with respect to the \(x\)-values only (far from the mean of \(x\)).
Influential points are those data points that can drastically change the slope or intercept of the regression equation
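A short sketch on simulated data (all values hypothetical) of how a single high-leverage point can be influential, dragging the least-squares slope and intercept away from the trend of the rest of the data.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=30)
y = 2 + 0.5 * x + rng.normal(scale=1, size=30)   # true slope is about 0.5

def slr_fit(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    return round(b0, 2), round(b1, 2)

print("without the extra point:", slr_fit(x, y))

# Add one high-leverage point: extreme in x AND far from the trend in y
x_out = np.append(x, 30.0)
y_out = np.append(y, 0.0)
print("with the influential point:", slr_fit(x_out, y_out))

# The slope and intercept change drastically, so the point is influential.
# A high-leverage point that sits ON the trend would barely move the fit.
```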
Model assumptions met
Linearity between the predictor and response
No outliers
No thickening or thinning in the residuals plot
In SLR, transforming a variable can help produce a linear relationship and can help stabilize the variance of the residuals, both key assumptions of an SLR model.
Refer to the solutions for our in-class worksheet on fitting a SLR model to predict gator weight based on the length of alligators in aerial photographs.
For SLR models, \(R^2 = r^2\); the only practical difference is that \(r\) indicates not only the strength of a linear trend but also its direction (through its sign).
\(R^2\) has a convenient interpretation as the proportion of variability in the response, \(Y\), that is explained (or “accounted for”) by the predictor, \(X\).
Both \(r\) and \(R^2\) are numerical summaries of strength of a relationship between two numeric variables; however, these summaries are only appropriate if the trend between the variables is linear.
In the gator model above, the regression model for the original data has a correlation coefficient of 0.9144007 and an \(R^2\) value of 0.8361287. The regression model for the transformed data has a correlation coefficient of 0.9798302 and an \(R^2\) value of 0.9600673.
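The alligator data aren’t reproduced here, but the sketch below illustrates the same two points on simulated data (all values assumed for illustration): a log transform of the response can make a curved, fanning relationship nearly linear, and \(R^2\) equals \(r^2\) on both scales.

```python
import numpy as np

rng = np.random.default_rng(11)
length = rng.uniform(1, 4, size=50)                                    # simulated "lengths"
weight = np.exp(1.0 + 1.2 * length + rng.normal(scale=0.2, size=50))   # curved, fanning response

def r_and_R2(x, y):
    """Correlation r and R^2 from a least-squares SLR of y on x."""
    r = np.corrcoef(x, y)[0, 1]
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    R2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    return round(r, 4), round(R2, 4)

print("original scale:  r, R^2 =", r_and_R2(length, weight))
print("log(weight):     r, R^2 =", r_and_R2(length, np.log(weight)))
# In both cases R^2 is r squared; the transformed fit has the higher
# values because the log makes the relationship (nearly) linear.
```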
In an SLR model, we can use residual plots to help us assess both the assumption of a linear association between our predictor and response and the assumption that the variance of our noise term, \(\varepsilon\), is a single, constant number (which we estimate with the standard error of the residuals, \(SE = \sqrt{(\sum_{i=1}^{n} e_i^2)/(n-2)}\)).
For SLR models, a residual plot may have the predictor, \(x\), on the horizontal axis or it may have the fitted values, \(\hat{y}\), on the horizontal axis. The vertical axis of a residual plot is always the residuals, \(e\). In either case, we are hoping to see an amorphous blob of points in this scatter plot without any discernible patterns. You can find a few examples of “good” and “bad” residual plots on this website.
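Tying these pieces together, here is a minimal sketch on simulated data (not a dataset from class) that fits an SLR by least squares, computes the fitted values and residuals, the standard error of the residuals \(\sqrt{(\sum e_i^2)/(n-2)}\), and draws a residual plot against the fitted values.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
n = 60
x = rng.uniform(0, 10, size=n)
y = 3 + 1.5 * x + rng.normal(scale=2, size=n)   # simulated linear data

# Least-squares fit
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x             # fitted values
e = y - y_hat                   # residuals

# Standard error of the residuals: sqrt(sum(e_i^2) / (n - 2))
se = np.sqrt(np.sum(e**2) / (n - 2))
print(f"intercept={b0:.2f}, slope={b1:.2f}, SE of residuals={se:.2f}")

# Residual plot: fitted values on the horizontal axis, residuals vertical.
# We hope to see a patternless blob centered at zero with roughly constant
# vertical spread (no thickening/thinning, no curvature).
plt.scatter(y_hat, e)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```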