The (probability) distribution of a random variable must supply information about all possible values the variable can take on and how frequently (or infrequently) the random variable will take on these values.
In short, a distribution specifies the possibilities and corresponding probabilities of a random variable. A density plot can reflect this information graphically. (You can think of a density plot/curve as a smoothed-out version of a histogram.)
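As a concrete sketch of that idea, the R code below draws a histogram of a made-up sample and overlays its estimated density curve. (The sample here is hypothetical, invented purely for illustration.)

```r
# A density curve as a smoothed-out version of a histogram.
# The sample below is hypothetical, just for illustration.
set.seed(1)
x <- rnorm(200, mean = 8, sd = 2)
hist(x, freq = FALSE, breaks = 15, main = "Histogram with density curve")
lines(density(x), lwd = 2)  # overlay the smoothed density estimate
```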
Last class we introduced the Normal (probability) model. A special feature of the Normal model is the 68/95/99.7 rule that describes how common (or rare) certain number values around the mean are. This rule applies to the density plot of the Normal model below.
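We can check the 68/95/99.7 rule numerically with R's `pnorm()` function. This is just a sketch; the helper name `within_k` is made up here for readability.

```r
# Probability that a Normal random variable falls within
# k standard deviations of its mean, for k = 1, 2, 3.
within_k <- function(k) pnorm(k) - pnorm(-k)
round(within_k(1), 4)  # about 0.68
round(within_k(2), 4)  # about 0.95
round(within_k(3), 4)  # about 0.997
```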
We also discussed how, for a sample of numeric data, it is helpful to have a sense of the center value(s) and the spread of the values. The same is true for random variables (RVs for short), i.e. special functions that exist in the math/theory world and are used as models for observable samples of numeric data.
When talking about a distribution of a sample of data (from the real world), we can refer to it as the sample distribution. When referring to a probability distribution (that exists in the mathematical modeling world), we just call this a distribution for a random variable.
You can get as creative as you want with defining distributions: there are infinitely many, and you can make up new ones too! (These distributions do have to follow some mathematical rules, though, as we will discuss later in Unit 2.)
However, there are about a dozen common probability distributions that are frequently used to model real-world phenomena. These common distributions (like the Normal distribution) are named for convenience.
The Uniform\((a,b)\) distribution spreads probability evenly over the range of numbers from \(a\) to \(b\), so that every number in that range (even non-terminating decimals) has an equal chance of occurring. The density plot is shown below.
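For instance, with hypothetical endpoints \(a=2\) and \(b=6\), R's `dunif()` and `punif()` functions show the flat density height \(1/(b-a)\) and the resulting straight-line probabilities:

```r
# Sketch of a Uniform(a, b) model; a = 2 and b = 6 are hypothetical choices.
a <- 2; b <- 6
dunif(3, min = a, max = b)   # density height is 1/(b - a) = 0.25 throughout [a, b]
punif(4, min = a, max = b)   # P(X <= 4) = (4 - a)/(b - a) = 0.5
curve(dunif(x, a, b), from = 0, to = 8, ylab = "density")  # the flat density plot
```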
The Chi-square distribution is a bit different in that it is a skewed right distribution that describes the probabilities associated with only positive possible numbers.
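A quick sketch of this shape in R (the choice of 3 degrees of freedom is hypothetical, just to draw one member of the family):

```r
# The chi-square density sits on the positive numbers with a long right tail.
curve(dchisq(x, df = 3), from = 0, to = 15, ylab = "density")
# One sign of right skew: the median falls below the mean (which equals df).
qchisq(0.5, df = 3)  # about 2.37, less than the mean of 3
```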
When we have two or more different samples of data (for the same variable), an exploratory analysis of this data will include describing any similarities or differences between/among the sample distributions.
A useful mathematical summary of a (sample or random variable) distribution is the five number summary which consists of the
Smallest value possible (minimum);
First quartile (where \(25\%\) of the data is smaller than this value);
Median (where \(50\%\) of the data is smaller than this value and the other \(50\%\) of the data is larger than this value);
Third quartile (where \(75\%\) of the data is smaller than this value);
Largest value possible (maximum).
For the sample of data pictured below, what can you tell from its histogram that you can’t tell from its five number summary?
set.seed(102)
my_samp <- round(rnorm(20, 8, 2), 3)
summary(my_samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.643 7.295 8.784 8.584 10.420 11.967
hist(my_samp, breaks=15)
If we have fewer than a handful of different samples of data (for the same variable), then we can compare their histograms side-by-side or overlapped.
But, if there are over a dozen or so samples of data, comparing histograms is pretty messy to look at so we’ll typically visualize the different sample distributions with side-by-side boxplots. Recall the following plot from last class:
Another notable descriptive feature of a distribution is its set of quantiles/percentiles, although this is just a somewhat fancier way of pairing up its possibilities (quantiles) with its probabilities (percentiles)! The five number summary of a distribution is actually the list of the five quantiles that correspond to the \(0\%\), \(25\%\), \(50\%\), \(75\%\), and \(100\%\) percentiles (often called quartiles since they break the data values into fourths).
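In R, the `quantile()` function maps percentiles to quantiles directly. As a sketch, re-creating the hypothetical sample from earlier recovers its five number summary:

```r
# quantile() turns a percentage (percentile) into the matching value (quantile).
set.seed(102)
my_samp <- round(rnorm(20, 8, 2), 3)
quantile(my_samp, probs = c(0, 0.25, 0.5, 0.75, 1))  # the five number summary
```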
In a blank cell, type \(=\) and then the following code, replacing everything in brackets (including the brackets < >) with the appropriate values for your particular problem. Then hit Return/Enter to see the result. Note that Excel always calculates lower quantiles. (See some additional documentation for this function.)
NORM.INV(<probability>, <mean>, <sd>)
Use the following code, erasing everything in brackets (including the brackets < >) and replacing it with the appropriate values for your particular problem. Then highlight the code and execute it to see the result. Note that R allows you to compute lower or upper quantiles.
qnorm(<probability>, mean=<?>, sd=<?>, lower.tail= <TRUE/FALSE>)
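As a concrete illustration (the mean of 8 and sd of 2 here are hypothetical values, not from any particular problem):

```r
# The value that 90% of a N(mean = 8, sd = 2) distribution falls below
# (a lower quantile) ...
qnorm(0.90, mean = 8, sd = 2, lower.tail = TRUE)   # about 10.56
# ... and the value that 90% of the distribution falls above (an upper quantile).
qnorm(0.90, mean = 8, sd = 2, lower.tail = FALSE)  # about 5.44
```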
In a blank cell, type \(=\) and then the following code, replacing everything in brackets (including the brackets < >) with the appropriate values for your particular problem. Then hit Return/Enter to see the result. Note that Excel always calculates lower tailed probabilities. (See some additional documentation for this function.) Don’t worry about the last argument; for our purposes it will always need to be set to TRUE.
NORM.DIST(<quantile>, <mean>, <sd>, TRUE)
Use the following code, erasing everything in brackets (including the brackets < >) and replacing it with the appropriate values for your particular problem. Then highlight the code and execute it to see the result. Note that R allows you to compute lower or upper probabilities.
pnorm(<quantile>, mean=<?>, sd=<?>, lower.tail= <TRUE/FALSE>)
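A concrete illustration with the same hypothetical \(N(\text{mean}=8, \text{sd}=2)\) model:

```r
# Lower and upper tail probabilities at the quantile 10.
pnorm(10, mean = 8, sd = 2, lower.tail = TRUE)   # P(X <= 10), about 0.84
pnorm(10, mean = 8, sd = 2, lower.tail = FALSE)  # P(X > 10), about 0.16
```

Note the two tails add to 1, so either call gives you the other by subtraction.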
Last week, we discussed a common way to re-express numerical data to get rid of any units of measurement. The transformation of a data point into its z-score is called the standardization transformation, and it involves both shifting the data values (subtracting their mean) and scaling the data values (dividing by their standard deviation, which shrinks the spread when the standard deviation is \(>1\) and stretches it when the standard deviation is \(<1\)).
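A quick sketch of standardization in R, applied to the hypothetical sample from earlier:

```r
# Standardize: subtract the mean, then divide by the standard deviation.
set.seed(102)
my_samp <- round(rnorm(20, 8, 2), 3)
z <- (my_samp - mean(my_samp)) / sd(my_samp)
mean(z)  # essentially 0 (up to rounding error)
sd(z)    # exactly 1
```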
We could shift or scale data by any amount we’d like and it would be helpful to know if and how these transformations affect the sample distribution, including descriptors of the distribution such as the five number summary or quantiles and percentiles.
mean(my_samp); sd(my_samp)
## [1] 8.58445
## [1] 2.310281
mean(my_samp+30); sd(my_samp + 30)
## [1] 38.58445
## [1] 2.310281
mean(15*my_samp); sd(15*my_samp)
## [1] 128.7668
## [1] 34.65421
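The outputs above follow a general pattern: shifting by \(b\) moves the mean by \(b\) but leaves the standard deviation alone, while scaling by \(a\) multiplies the mean by \(a\) and the standard deviation by \(|a|\). A quick sketch checking this on the same sample:

```r
# For constants a and b:
#   mean(a*x + b) = a*mean(x) + b   and   sd(a*x + b) = |a|*sd(x)
set.seed(102)
my_samp <- round(rnorm(20, 8, 2), 3)
a <- 15; b <- 30
all.equal(mean(a * my_samp + b), a * mean(my_samp) + b)  # TRUE
all.equal(sd(a * my_samp + b), abs(a) * sd(my_samp))     # TRUE
```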
The original data (in gray) comes from a \(N(\mu=-75,\ \sigma=3.3)\) distribution.
The data scaled by a factor of 2 is in green,
The data shifted by 3 units is in orange,
The data scaled and shifted is shown in blue.
We can often assess whether a sample of data seems to follow a Normal distribution by investigating the quantiles of the sample distribution. In a Normal probability plot, each sample quantile (vertical axis below) is plotted against its corresponding standardized value (its z-score; horizontal axis below). A straight line relating these values indicates a strong match with the Normal model.
qqnorm(my_samp)
qqline(my_samp)  # adds a reference line for judging straightness