Recall

Random phenomena are often modeled as random variables. Random variables are mathematical objects (functions, actually) that have an associated probability distribution function.

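A probability distribution must always specify two things: the possible values of the random variable and the probability associated with each of those values.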

1. Expected value

When referring to measures of centrality for a theoretical random variable, we typically are interested in the mean. Since we are dealing with the mathematical object, we use a new terminology to indicate that this measure of centrality is a theoretical value. Instead of calling this the mean/average of the random variable, we call this value the expected value of the random variable.

The theoretical mean of a random variable \(X\), AKA the expected value of \(X\), is:

the sum, over every possible value \(x\) in the set of possibilities \(S\), of that value times its corresponding probability.

\[E(X) = \sum_{x\in S} \left[ x \times Pr(x) \right]\]
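As a minimal sketch of this formula, consider a fair six-sided die. The die, the `distribution` dictionary, and the `expected_value` helper are illustrative assumptions, not part of the course materials. The dictionary stores each possibility alongside its probability, which is exactly the information a probability distribution must specify.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die.
# Keys are the possible values x in S; values are Pr(x).
distribution = {x: Fraction(1, 6) for x in range(1, 7)}


def expected_value(dist):
    """Sum of each possible value times its corresponding probability."""
    return sum(x * p for x, p in dist.items())


# The probabilities of a valid distribution add up to 1.
assert sum(distribution.values()) == 1

print(expected_value(distribution))  # 7/2
```

Exact fractions are used here only so the result is not blurred by floating-point rounding; ordinary decimals would work the same way.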

2. Variance and standard deviation

There is no new terminology when referring to measures of spread for a theoretical random variable. We still use variance and standard deviation to refer to these measures.

Similar to calculating the variance of a sample of data, the variance of a random variable is:

the expected value of the squared distance from each possibility to the expected value, i.e. the sum of these squared distances times their corresponding probabilities.

\[Var(X) = \sum_{x \in S} \left[(x-E(X))^2\times Pr(x)\right]\]

The standard deviation of a random variable \(X\) is still just the square root of the variance: \(st.dev(X) = \sqrt{Var(X)}\).
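The variance and standard deviation formulas can be sketched the same way. The example below assumes a hypothetical random variable, the number of heads in two fair coin flips; the function names are illustrative choices as well.

```python
import math

# Hypothetical example: number of heads in two fair coin flips.
# Keys are the possible values x in S; values are Pr(x).
distribution = {0: 0.25, 1: 0.5, 2: 0.25}


def expected_value(dist):
    """Sum of each possible value times its corresponding probability."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Sum of squared distances to the expected value, weighted by probability."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


def st_dev(dist):
    """Square root of the variance."""
    return math.sqrt(variance(dist))


print(expected_value(distribution))  # 1.0
print(variance(distribution))        # 0.5
print(st_dev(distribution))          # about 0.707
```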

3. Correlation and Covariance

Correlation and covariance measure the same thing: the strength of a linear trend between two random variables. The only difference between these two measurements is that correlation is easier to interpret because it is a standardized version of covariance, i.e. correlation is always between \(-1\) and \(1\). Covariance, on the other hand, could be any number from \(-\infty\) to \(+\infty\). Because of this, although we have practiced calculating the correlation for a sample of data with two numeric variables, we have not calculated the sample covariance, even though we technically could.

Conversely, when discussing theoretical random variables, we tend to describe their covariance rather than their correlation.

\[Cov(X,Y) = E\left[(X-E(X))\cdot(Y-E(Y)) \right] = \sum_{(x,y) \in S} \left[(x-E(X))(y-E(Y))\times Pr(x \text{ and } y) \right]\]
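The covariance formula needs the joint probabilities \(Pr(x \text{ and } y)\), not just the separate distributions of \(X\) and \(Y\). Here is a minimal sketch, assuming a made-up joint distribution for two random variables that each take the value 0 or 1; the table and the `covariance` helper are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint distribution: keys are (x, y) pairs,
# values are Pr(x and y). X and Y tend to match each other.
joint = {
    (0, 0): Fraction(2, 5),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(2, 5),
}


def covariance(joint_dist):
    """Sum of (x - E(X)) * (y - E(Y)) * Pr(x and y) over all pairs."""
    e_x = sum(x * p for (x, y), p in joint_dist.items())
    e_y = sum(y * p for (x, y), p in joint_dist.items())
    return sum((x - e_x) * (y - e_y) * p for (x, y), p in joint_dist.items())


print(covariance(joint))  # 3/20
```

The result is positive because the pairs where \(X\) and \(Y\) are both above or both below their expected values carry most of the probability.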

Example: Modeling airline tickets

Q) How could we interpret the covariance between the price of a ticket from Philadelphia to Toronto and the price of a ticket from Philadelphia to Chengdu?

Note, the covariance between any random variable and itself is actually just another way of expressing the variance: \(Cov(X,X) = Var(X)\).
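To see why, substitute \(Y = X\) into the definition of covariance:

\[Cov(X,X) = E\left[(X-E(X))\cdot(X-E(X)) \right] = E\left[(X-E(X))^2 \right] = \sum_{x \in S} \left[(x-E(X))^2\times Pr(x)\right] = Var(X)\]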

The relationship between correlation and covariance is this:

\[Cor(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}.\]
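As a rough illustration of this relationship, the sketch below divides a covariance by the square root of the product of the two variances. The joint distribution is a made-up example and every function name is an assumption chosen for readability.

```python
import math
from fractions import Fraction

# Hypothetical joint distribution: keys are (x, y) pairs,
# values are Pr(x and y).
joint = {
    (0, 0): Fraction(2, 5),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(2, 5),
}


def correlation(joint_dist):
    """Cov(X, Y) divided by sqrt(Var(X) * Var(Y))."""
    e_x = sum(x * p for (x, y), p in joint_dist.items())
    e_y = sum(y * p for (x, y), p in joint_dist.items())
    cov = sum((x - e_x) * (y - e_y) * p for (x, y), p in joint_dist.items())
    var_x = sum((x - e_x) ** 2 * p for (x, y), p in joint_dist.items())
    var_y = sum((y - e_y) ** 2 * p for (x, y), p in joint_dist.items())
    return cov / math.sqrt(var_x * var_y)


print(correlation(joint))  # about 0.6
```

Here the covariance is 3/20 and both variances are 1/4, so the correlation is (3/20) / (1/4) = 0.6, which falls between \(-1\) and \(1\) as expected.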

Note: In every formula given above, each term can be computed from the possibilities and their corresponding probabilities. This is why we say that a probability distribution must specify only these two things: the possible values of the random variable and the probabilities associated with these possible values.


There is a lot of new notation and vocabulary to become familiar with as we begin to understand how probability and random variables operate. I’ve created a random variables and probability reference sheet to help summarize some of these new concepts; however, I recommend that you create a reference sheet of your own to help you study the material for Quiz 2.