Exploratory data analysis - Numeric Variables

Recap

Last week we discussed the structure of statistical data and the kinds of questions we can answer from a statistical analysis.

Data consists of observational units (rows) and variables (columns). We covered a few ways to conduct an exploratory data analysis for numeric and/or categorical variables. Here you can find a couple of examples of what data might look like if we want to conduct a confirmatory analysis of the claim that Swarthmore grads have a higher acceptance rate for grad school compared to other schools.

For the next couple of weeks we are going to focus specifically on exploring numeric data.

The structure of numeric data

For a set of \(n\) numeric data points, \(x_1, x_2, \dots, x_{n-1}, x_n\), that correspond to the same variable but different observational units, the best way to understand the data is to create a histogram with a reasonable number of bins. The next best way is to consider numerical summaries of the data (i.e. sample statistics) that describe the center and the spread of the data values.
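To make the idea of "bins" concrete, here is a minimal sketch of how a histogram counts data values. The function name `histogram_counts`, the choice of equal-width bins, and the race times are all assumptions made for illustration; they are not part of the course materials.

```python
def histogram_counts(data, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    if lo == hi:
        raise ValueError("all values are equal, so equal-width bins are undefined")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical race times in seconds (made up for illustration).
times = [10.6, 10.8, 10.9, 10.9, 11.0, 11.0, 11.0, 11.1, 11.2, 11.4]

# Print a rough text histogram: one star per observation in each bin.
for i, count in enumerate(histogram_counts(times, 4)):
    print(f"bin {i}: " + "*" * count)
```

Trying a few different values for the number of bins is a quick way to see what "reasonable" means: too few bins hide the shape, too many leave most bins nearly empty.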

Examples

Example 1: The time it takes Olympic runners to complete a race in seconds.

Example 2: The length of Olympic hammer throws in meters.

1. Numerical measures of centrality

Suppose the numeric data is measuring centimeters. What are the units of measurement for the mean? For the median? For the mode?

When to use which?

For data that is

  • unimodal (exactly one peak)

  • symmetric about the mode (mirrored at the mode)

the mode, the mean, and the median are all the same value!

For data that is unimodal but not symmetric…

The median is a better representation of the center of skewed data.
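The two cases above can be checked on a small example. This is only a sketch using Python's built-in `statistics` module; both data sets are made up for illustration.

```python
import statistics

# Hypothetical measurements in centimeters (made up for illustration).
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # unimodal and symmetric about 3
skewed = [1, 2, 2, 3, 3, 3, 4, 6, 12]     # unimodal with a long right tail

for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    print(
        name,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
# symmetric: mean 3, median 3, mode 3
# skewed:    mean 4, median 3, mode 3
```

In the skewed data the few large values pull the mean upward, while the median stays with the bulk of the data.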

For data that is not unimodal but is symmetric…

It is useful to know how many modes occur and where they are located.

For data that is not unimodal and not symmetric…

Histograms are indispensable because numerical summaries can be misleading by themselves.

2. Numerical measures of spread

When to use which?

For data that is

  • unimodal

  • symmetric about the mode

we use the variance (or standard deviation) because these summarize how far the data values typically are from the mean.

For data that is unimodal but not symmetric, i.e. skewed, the inter-quartile range (or the range) gives a more realistic picture of the spread of the data than the standard deviation.
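A rough sketch of why the standard deviation can mislead for skewed data: adding a single value far out in the tail changes the standard deviation a lot, but barely moves the inter-quartile range. The data and the helper `iqr` are made up for illustration, and `statistics.quantiles` uses just one of several common conventions for computing quartiles.

```python
import statistics


def iqr(data):
    """Inter-quartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1


# Hypothetical data (made up for illustration).
base = [2, 3, 3, 4, 4, 4, 5, 5, 6]
with_tail = base + [40]  # the same data plus one value far in the right tail

for name, data in [("base", base), ("with tail", with_tail)]:
    print(name, "st. dev.:", round(statistics.stdev(data), 2), "IQR:", iqr(data))
```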

For data that is not unimodal but is symmetric… or for data that is not unimodal and not symmetric… there isn’t necessarily a preferable way to numerically summarize the spread. But in general, the more information (e.g. plots, ranges, etc) the better our description of the numeric data!

3. What is variance (or standard deviation)?

Variance of a sample (more commonly called “sample variance”): \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left((x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_{n-1} - \bar{x})^2 + (x_n - \bar{x})^2 \right)\)

Standard deviation of a sample (more commonly called “sample standard deviation”): \(s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\)
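The two formulas above translate almost line by line into code. This is a sketch, not the only way to compute them: the function names and the heights are assumptions made for illustration, and Python's `statistics.variance` and `statistics.stdev` use the same \(n-1\) divisor.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# Hypothetical heights in centimeters (made up for illustration).
heights_cm = [150, 160, 170, 180, 190]

print(sample_variance(heights_cm))  # 250.0
print(sample_std(heights_cm))
print(statistics.stdev(heights_cm))  # agrees with sample_std
```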

Suppose the numeric data is measuring centimeters. What are the units of measurement for the variance? For standard deviation?

Example

Image credits: Denes Erdos/AP (L) Charlie Crowhurst/Getty Images for IAAF (R)

The 2023 World Athletics Championships in Budapest saw many athletes perform at their personal best in preparation for the Olympics. Sha’Carri Richardson won first place for the USA in the women’s \(100m\) sprint with a time of \(10.65s\), which was \(0.252 s\) faster than the average time of her top \(10\) competitors. Camryn Rogers of Canada won the women’s hammer throw with a throw of \(77.22 m\), a throw \(3.383 m\) longer than the average distance of her top \(10\) competitors.

Statistically speaking, whose performance are we most impressed by?

A statistician would answer this question by comparing how many standard deviations away from the mean Sha’Carri’s run time was vs how many standard deviations away from the mean Camryn’s hammer throw distance was.

Standard deviation is called the "statistician’s ruler" because it allows us to compare the spread of data relative to its own mean. In other words, standard deviation is a measure that allows us to compare apples to oranges!

4. What is the Normal model?

Normal/Gaussian model

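The first two conditions can be assessed visually by looking at a histogram of the data. The last condition can be assessed mathematically using what is called the 68/95/99.7 Rule: for data that follows a Normal model, about 68% of the values lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three.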

The Normal model has two model parameters: \(\mu\), the mean, and \(\sigma\), the standard deviation. These two parameters correspond directly to a measure of centrality and a measure of spread. (This isn’t the case with other probability models and their model parameters.)

Example

Let’s check to see if the data plotted in the histogram and stem plots below could plausibly come from a Normal distribution.

## 
##   The decimal point is at the |
## 
##   -8 | 0
##   -6 | 
##   -4 | 
##   -2 | 0
##   -0 | 76118876531
##    0 | 22468

The mean of this data is -0.95 and the standard deviation is 1.93.

What proportion of the data lies within one/two/three standard deviations of the mean?
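One way to answer this kind of question is to count the data values directly. The sketch below uses a hypothetical helper, `proportion_within`, and made-up data (not the data in the stem plot). For data that follows a Normal model, the three proportions should come out roughly near 0.68, 0.95 and 0.997.

```python
import statistics


def proportion_within(data, k):
    """Proportion of values within k sample standard deviations of the mean."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    inside = [x for x in data if abs(x - x_bar) <= k * s]
    return len(inside) / len(data)


# Hypothetical data (made up for illustration).
values = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.1, 5.6, 6.3, 7.9]

for k in (1, 2, 3):
    print(f"within {k} st. dev.: {proportion_within(values, k):.2f}")
```

With a small data set the proportions can only be rough, so a mismatch of a few percentage points is not by itself evidence against a Normal model.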

5. Re-expressing Data: Standardization

Image credit: jasneko on shirt.woot.com

When comparing many variables that are all measured in different units, we might decide that we want to convert each variable into the same units so that the different data sets are easier to compare. To do this, we first must compute the mean and standard deviation of each data set, relative to its own units of measurement (e.g. average length in \(m\) or average run in \(s\)). Then, we standardize each data point relative to its mean and standard deviation by converting them into z-scores:

\[z_i = \frac{x_i - \bar{x}}{\text{st.dev.}(x_1, x_2, \dots, x_n)}.\]

Shift the data point according to the value of the mean of the data, then scale the data point according to the standard deviation of the data.

What are the units of measurement for standardized numeric data points?

If the data appears to follow a Normal/Gaussian model, then the standardized version of the data follows what is called a Standard Normal/Gaussian Model.

Does a shift up or down affect the spread of the data? Does a scale affect the spread?

Therefore, standardization ensures that \(\bar{z} = \frac{1}{n}(z_1+ \dots + z_n) = 0\) and \(\text{st.dev.}(z_1, z_2, \dots, z_n) = 1\).
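We can check this numerically. The sketch below standardizes a small set of made-up hammer throw distances with a hypothetical helper, `standardize`, and confirms that the z-scores have mean 0 and standard deviation 1, up to rounding error.

```python
import statistics


def standardize(xs):
    """Shift by the mean, then scale by the sample standard deviation."""
    x_bar = statistics.mean(xs)
    s = statistics.stdev(xs)
    return [(x - x_bar) / s for x in xs]


# Hypothetical hammer throw distances in meters (made up for illustration).
throws_m = [70.1, 72.4, 73.0, 74.8, 77.2]

z = standardize(throws_m)
print([round(z_i, 2) for z_i in z])
print("mean of z:", round(statistics.mean(z), 10))
print("st. dev. of z:", round(statistics.stdev(z), 10))
```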

Example

If we standardize the race time for all 2023 World Athletics Championship \(100 m\) runners, then we get a set of times that are not measured in seconds or minutes, but are unitless! Similarly, we can standardize the hammer throw distance for all competitors in this event and make the distance unitless.

The magnitude and order are still meaningful, but now it is much easier to answer questions like "Whose performance was more impressive, Sha’Carri or Camryn?" because both sets of data are measured on the same (unitless) scale. [Hint: You can statistically answer this question now provided I give you the standard deviation of the \(100m\) race, \(0.14s\), and the standard deviation of the hammer throw, \(2.16m\).]
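Following the hint, here is a sketch of the comparison using only the numbers quoted in these notes. It assumes that each quoted standard deviation can be paired with the quoted difference from the mean of the top \(10\) competitors; the variable names are made up for illustration.

```python
# Differences from the mean, as quoted in the notes.
sprint_diff_s = -0.252  # Sha'Carri's time minus the mean time (faster is lower)
hammer_diff_m = 3.383   # Camryn's distance minus the mean distance

# Standard deviations, as quoted in the hint.
sprint_sd_s = 0.14
hammer_sd_m = 2.16

# A z-score is the difference from the mean divided by the standard deviation.
sprint_z = sprint_diff_s / sprint_sd_s
hammer_z = hammer_diff_m / hammer_sd_m

print(f"sprint z-score: {sprint_z:.2f}")  # -1.80
print(f"hammer z-score: {hammer_z:.2f}")  # 1.57
```

A lower time is better in a race and a longer distance is better in the hammer throw, so it is the size of each z-score that matters here: under these assumptions, Sha’Carri’s run is about \(1.8\) standard deviations better than average and Camryn’s throw is about \(1.57\).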