The (probability) distribution of a random variable must supply information about all possible values the variable can take on and how frequently (or infrequently) the random variable will take on these values.
In short, a distribution specifies the possibilities and corresponding probabilities of a random variable. A density plot can reflect this information graphically. (You can think of a density plot/curve as a smoothed-out version of a histogram.)
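As a concrete sketch of that idea, the R code below draws a histogram of a made-up sample and overlays its estimated density curve. (The sample here is hypothetical, invented purely for illustration.)

```r
# A density curve as a smoothed-out version of a histogram.
# The sample below is hypothetical, just for illustration.
set.seed(1)
x <- rnorm(200, mean = 8, sd = 2)
hist(x, freq = FALSE, breaks = 15, main = "Histogram with density curve")
lines(density(x), lwd = 2)  # overlay the smoothed density estimate
```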
Last class we introduced the Normal (probability) model. A special feature of the Normal model is the 68/95/99.7 rule that describes how common (or rare) certain number values around the mean are. This rule applies to the density plot of the Normal model below.
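We can check the 68/95/99.7 rule numerically with R's `pnorm()` function. This is just a sketch; the helper name `within_k` is made up here for readability.

```r
# Probability that a Normal random variable falls within
# k standard deviations of its mean, for k = 1, 2, 3.
within_k <- function(k) pnorm(k) - pnorm(-k)
round(within_k(1), 4)  # about 0.68
round(within_k(2), 4)  # about 0.95
round(within_k(3), 4)  # about 0.997
```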
We also discussed how, for a sample of numeric data, it is helpful to have a sense of the center value(s) and the spread of the values. The same is true for random variables (RVs for short), i.e. special functions that exist in the math/theory world and are used as models for observable samples of numeric data.
When talking about a distribution of a sample of data (from the real world), we can refer to it as the sample distribution. When referring to a probability distribution (that exists in the mathematical modeling world), we just call this a distribution for a random variable.
You can get as creative as you want with defining distributions: there are infinitely many, and you can make up new ones too! (These distributions do have to follow some mathematical rules, though, as we will discuss later in Unit 2.)
However, there are about a dozen common probability distributions that are frequently used to model real-world phenomena. These common distributions (like the Normal distribution) are named for convenience.
The Uniform\((a,b)\) distribution spreads probability evenly over the range of numbers from \(a\) to \(b\), so that every number in that range (even non-terminating decimals) has an equal chance of occurring. The density plot is shown below.
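For instance, with hypothetical endpoints \(a=2\) and \(b=6\), R's `dunif()` and `punif()` functions show the flat density height \(1/(b-a)\) and the resulting straight-line probabilities:

```r
# Sketch of a Uniform(a, b) model; a = 2 and b = 6 are hypothetical choices.
a <- 2; b <- 6
dunif(3, min = a, max = b)   # density height is 1/(b - a) = 0.25 throughout [a, b]
punif(4, min = a, max = b)   # P(X <= 4) = (4 - a)/(b - a) = 0.5
curve(dunif(x, a, b), from = 0, to = 8, ylab = "density")  # the flat density plot
```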
The Chi-square distribution is a bit different in that it is a skewed right distribution that describes the probabilities associated with only positive possible numbers.
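A quick sketch of this shape in R (the choice of 3 degrees of freedom is hypothetical, just to draw one member of the family):

```r
# The chi-square density sits on the positive numbers with a long right tail.
curve(dchisq(x, df = 3), from = 0, to = 15, ylab = "density")
# One sign of right skew: the median falls below the mean (which equals df).
qchisq(0.5, df = 3)  # about 2.37, less than the mean of 3
```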
When we have two or more different samples of data (for the same variable), an exploratory analysis of this data will include describing any similarities or differences between/among the sample distributions.
A useful mathematical summary of a (sample or random variable) distribution is the five number summary which consists of the
Smallest value possible (minimum);
First quartile (where \(25\%\) of the data is smaller than this value);
Median (where \(50\%\) of the data is smaller than this value and the other \(50\%\) of the data is larger than this value);
Third quartile (where \(75\%\) of the data is smaller than this value);
Largest value possible (maximum).
For the sample of data pictured below, what can you tell from its histogram that you can’t tell from its five number summary?
set.seed(102)
my_samp <- round(rnorm(20, 8, 2), 3)
summary(my_samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.643 7.295 8.784 8.584 10.420 11.967
hist(my_samp, breaks=15)
If we have fewer than a handful of different samples of data (for the same variable), then we can compare their histograms side-by-side or overlapped.
But, if there are over a dozen or so samples of data, comparing histograms is pretty messy to look at so we’ll typically visualize the different sample distributions with side-by-side boxplots. Recall the following plot from last class:
Another notable descriptive feature of a distribution is its set of quantiles/percentiles, although this is just a somewhat fancier way of pairing up its possibilities (quantiles) with its probabilities (percentiles)! The five number summary of a distribution is actually the list of the five quantiles that correspond to the \(0\%\), \(25\%\), \(50\%\), \(75\%\), and \(100\%\) percentiles (often called quartiles since they break the data values into fourths).
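In R, the `quantile()` function maps percentiles to quantiles directly. As a sketch, re-creating the hypothetical sample from earlier recovers its five number summary:

```r
# quantile() turns a percentage (percentile) into the matching value (quantile).
set.seed(102)
my_samp <- round(rnorm(20, 8, 2), 3)
quantile(my_samp, probs = c(0, 0.25, 0.5, 0.75, 1))  # the five number summary
```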
In a blank cell, type \(=\) and then the following code, replacing everything in brackets (including the brackets < >) with the appropriate values for your particular problem. Then hit Return/Enter to see the result. Note that Excel always calculates lower quantiles. (See some additional documentation for this function.)
NORM.INV(<probability>, <mean>, <sd>)
Use the following code, erasing everything in brackets (including the brackets < >) and replacing it with the appropriate values for your particular problem. Then highlight the code and execute it to see the result. Note that R allows you to compute lower or upper quantiles.
qnorm(<probability>, mean=<?>, sd=<?>, lower.tail= <TRUE/FALSE>)
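As a concrete illustration (the mean of 8 and sd of 2 here are hypothetical values, not from any particular problem):

```r
# The value that 90% of a N(mean = 8, sd = 2) distribution falls below
# (a lower quantile) ...
qnorm(0.90, mean = 8, sd = 2, lower.tail = TRUE)   # about 10.56
# ... and the value that 90% of the distribution falls above (an upper quantile).
qnorm(0.90, mean = 8, sd = 2, lower.tail = FALSE)  # about 5.44
```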
In a blank cell, type \(=\) and then the following code, replacing everything in brackets (including the brackets < >) with the appropriate values for your particular problem. Then hit Return/Enter to see the result. Note that Excel always calculates lower tailed probabilities. (See some additional documentation for this function.) Don’t worry about the last argument; for our purposes it will always need to be set to TRUE.
NORM.DIST(<quantile>, <mean>, <sd>, TRUE)
Use the following code, erasing everything in brackets (including the brackets < >) and replacing it with the appropriate values for your particular problem. Then highlight the code and execute it to see the result. Note that R allows you to compute lower or upper probabilities.
pnorm(<quantile>, mean=<?>, sd=<?>, lower.tail= <TRUE/FALSE>)
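A concrete illustration with the same hypothetical \(N(\text{mean}=8, \text{sd}=2)\) model:

```r
# Lower and upper tail probabilities at the quantile 10.
pnorm(10, mean = 8, sd = 2, lower.tail = TRUE)   # P(X <= 10), about 0.84
pnorm(10, mean = 8, sd = 2, lower.tail = FALSE)  # P(X > 10), about 0.16
```

Note the two tails add to 1, so either call gives you the other by subtraction.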
Last week, we discussed a common way to re-express numerical data to get rid of any units of measurement. The transformation of a data point into its z-score is called the standardization transformation, and it involves both shifting the data values (subtracting their mean) and scaling the data values (dividing by their standard deviation, which shrinks the spread when the standard deviation is \(>1\) and stretches it when the standard deviation is \(<1\)).
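A quick sketch of standardization in R, applied to the hypothetical sample from earlier:

```r
# Standardize: subtract the mean, then divide by the standard deviation.
set.seed(102)
my_samp <- round(rnorm(20, 8, 2), 3)
z <- (my_samp - mean(my_samp)) / sd(my_samp)
mean(z)  # essentially 0 (up to rounding error)
sd(z)    # exactly 1
```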
We could shift or scale data by any amount we’d like and it would be helpful to know if and how these transformations affect the sample distribution, including descriptors of the distribution such as the five number summary or quantiles and percentiles.
mean(my_samp); sd(my_samp)
## [1] 8.58445
## [1] 2.310281
mean(my_samp+30); sd(my_samp + 30)
## [1] 38.58445
## [1] 2.310281
mean(15*my_samp); sd(15*my_samp)
## [1] 128.7668
## [1] 34.65421
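The outputs above follow a general pattern: shifting by \(b\) moves the mean by \(b\) but leaves the standard deviation alone, while scaling by \(a\) multiplies the mean by \(a\) and the standard deviation by \(|a|\). A quick sketch checking this on the same sample:

```r
# For constants a and b:
#   mean(a*x + b) = a*mean(x) + b   and   sd(a*x + b) = |a|*sd(x)
set.seed(102)
my_samp <- round(rnorm(20, 8, 2), 3)
a <- 15; b <- 30
all.equal(mean(a * my_samp + b), a * mean(my_samp) + b)  # TRUE
all.equal(sd(a * my_samp + b), abs(a) * sd(my_samp))     # TRUE
```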
The original data (in gray) comes from a \(N(\mu=-75,\ \sigma=3.3)\) distribution.
The data scaled by a factor of 2 is in green,
The data shifted by 3 units is in orange,
The data scaled and shifted is shown in blue.
We can often assess whether a sample of data seems to follow a Normal distribution by investigating the quantiles of the sample distribution. In a Normal probability plot, each sample quantile (vertical axis below) is plotted against its corresponding standardized value (its z-score; horizontal axis below). A straight line relating these values indicates a strong match with the Normal model.
qqnorm(my_samp)
qqline(my_samp)  # adds a reference line for judging straightness