class: center, middle, inverse, title-slide

# Week 3: Inference for the difference between two population means
### STAT 021 with Suzanne Thornton
### Swarthmore College

---

<style type="text/css">
pre {
  background: #FFBB33;
  max-width: 100%;
  overflow-x: scroll;
}
.scroll-output {
  height: 75%;
  overflow-y: scroll;
}
.scroll-small {
  height: 50%;
  overflow-y: scroll;
}
.red{color: #ce151e;}
.green{color: #26b421;}
.blue{color: #426EF0;}
</style>

## Difference in means
### Two problem settings

.pull-left[**Case 1:** Un-paired (independent) samples

In this setting, we have two .blue[independent] samples from two populations (e.g. comparing the SAT scores of public versus private high schools). The two samples may or may not have the same sample size, and may or may not have similar sample variances.]

.pull-right[**Case 2:** Paired samples

In this setting, we have two samples, but they come from the same population or from two .blue[highly dependent population(s)] (e.g. we may be studying genetic traits across twins, or the test scores of the same people before and after attending a workshop on test-taking strategies). In this setting the two samples are always the same size, and the sample variances should be similar.]

---

## Case 1 - Un-paired (independent) samples
### Problem setting

**Population:** If we have two samples from different populations, we can consider the quantitative RVs as *two, independent* continuous RVs, `\(X_1\)` and `\(X_2\)`. The *population parameter* we are interested in is `\(\mu_1-\mu_2\)`: the difference between the two population means,
`$$\mu_1 = E[X_1] = \int_{S_1} x\, f_1(x)\, dx, \text{ and } \mu_2 = E[X_2] = \int_{S_2} x\, f_2(x)\, dx,$$`
where `\(S_1\)` and `\(S_2\)` are the sample spaces of all possible values `\(X_1\)` and `\(X_2\)` might take, and `\(f_1\)` and `\(f_2\)` are their respective density functions.

**Sample:** Two independent random samples, one from each population:
`\begin{align}
x_{1, obs} &= \{x_{1,1}, x_{1,2}, x_{1,3}, \dots, x_{1,n_1}\} \text{ and}\\
x_{2, obs} &= \{x_{2,1}, x_{2,2}, \dots, x_{2,n_2}\}
\end{align}`

**Sample size:** `\(n_1=\)` the number of observational units in the first sample; `\(n_2=\)` the number of observational units in the second sample

---

## Case 1 - Un-paired (independent) samples
### Problem setting

With our two samples, we now have an **estimate** for the parameter of interest `\(\mu_1-\mu_2\)`:
`$$\bar{X}_1 - \bar{X}_2 = \frac{1}{n_1}\sum_{i=1}^{n_1}x_{1,i} - \frac{1}{n_2}\sum_{j=1}^{n_2}x_{2,j}.$$`

We also know that for RVs `\(X\)` and `\(Y\)`, `\(Var(X-Y) = Var(X) + Var(Y) - 2Cov(X,Y)\)`, and that if `\(X\)` and `\(Y\)` are independent, then `\(Cov(X,Y)=0\)`. This means that if we consider `\(\bar{X}_1\)` and `\(\bar{X}_2\)` as independent RVs with their own sampling distributions, then
`$$Var(\bar{X}_1 - \bar{X}_2) = Var(\bar{X}_1) + Var(\bar{X}_2) - 0 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2},$$`
where `\(\sigma_1^2\)` is the unknown variance of `\(X_1\)` over the first population and `\(\sigma_2^2\)` is the same but for the second population.

---

## Case 1 - Un-paired (independent) samples
### Problem setting

Naturally, we want to know: .blue[how close is the sample estimate to the parameter of interest?] Before we can address this, we must make sure the following **assumptions** are reasonable for *both* samples:

--

1. The two populations must be independent of one another. There is no statistical test for this assumption; you must carefully think about any possible dependencies between the two populations before proceeding.

2. The values the quantitative variables `\(X_1\)` and `\(X_2\)` can take on follow a roughly symmetric and uni-modal distribution. (Since we aren't able to collect data on the entire population, we typically can only check this assumption with a histogram of each set of observed values; a quick R sketch follows this list.)

3. The data are simple random samples (SRS) from both populations of interest. (Are there any dependencies among the data? Do the samples comprise less than `\(10\%\)` of the populations? Were these samples collected without bias?)
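
Here is a minimal sketch of how one might check assumption 2 in base R. The vectors `x1_obs` and `x2_obs` are placeholders for whatever two samples you are working with; for concreteness, the values below are the battery-life data introduced a few slides later.

```r
#Hypothetical samples; substitute your own observed data
x1_obs <- c(160, 200, 205, 181, 190, 187, 172, 193, 182, 192)
x2_obs <- c(212, 207, 198, 199, 208, 209, 192, 207)

#One histogram per sample: look for a roughly symmetric, uni-modal shape
hist(x1_obs, main="Sample 1", xlab="Observed values")
hist(x2_obs, main="Sample 2", xlab="Observed values")
```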
---

## Case 1 - Un-paired (independent) samples
### Confidence interval

A `\((1-\alpha)\times100 \%\)` confidence interval for the true (unknown) difference `\(\mu_1-\mu_2\)` is
`$$(\bar{X}_{1} - \bar{X}_{2}) \pm \left(t^*_{\alpha/2} \times \sqrt{\frac{s_1^2}{n_1}+ \frac{s_2^2}{n_2} }\right),$$`
where `\(t^*_{\alpha/2}\)` is the **critical value** which (as before) is the `\(\left(1-\frac{\alpha}{2}\right)\)` **quantile** of a Student's T RV with `\(\nu\)` degrees of freedom.

**Q:** What is `\(\nu\)`?

--

`$$\nu = \frac{\left(\frac{s_1^2}{n_1}+ \frac{s_2^2}{n_2} \right)^2}{\frac{1}{n_1-1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{s_2^2}{n_2}\right)^2 }$$`

But don't worry, you won't need to calculate this by hand because R does it for you behind the scenes!

---

## Case 1 - Un-paired (independent) samples
### Confidence interval in R

.scroll-output[
Example: Suppose we are comparing the battery life (in minutes) of two different brands of AA batteries.

## In R

```r
#Define the observed data
x1_obs <- c(160, 200, 205, 181, 190, 187, 172, 193, 182, 192)
x2_obs <- c(212, 207, 198, 199, 208, 209, 192, 207)

#Then use the t.test() function
t.test(x1_obs, x2_obs, paired=FALSE, conf.level=0.95)$conf.int
```

```
## [1] -28.155691  -7.444309
## attr(,"conf.level")
## [1] 0.95
```
]
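
As a sanity check, here is a sketch of how the same interval can be built "by hand" from the formulas on the previous slide; it should reproduce the bounds above. This is purely illustrative, since `t.test()` does all of this behind the scenes.

```r
n1 <- length(x1_obs); n2 <- length(x2_obs)
est <- mean(x1_obs) - mean(x2_obs)            #point estimate: -17.8
se  <- sqrt(var(x1_obs)/n1 + var(x2_obs)/n2)  #standard error of the difference

#Welch degrees of freedom (the formula for nu); note se^4 = (s1^2/n1 + s2^2/n2)^2
nu <- se^4 / ((var(x1_obs)/n1)^2/(n1-1) + (var(x2_obs)/n2)^2/(n2-1))

#Critical value and confidence interval for alpha = 0.05
t_star <- qt(1 - 0.05/2, df=nu)
est + c(-1, 1) * t_star * se                  #approx (-28.16, -7.44)
```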
---

## Case 1 - Un-paired (independent) samples
### Confidence interval interpretation

Based on the output on the previous slide, we report that we are `\(95\%\)` confident that the true difference in average battery life between these two brands is between -28.16 and -7.44 minutes. Since this interval lies .blue[entirely below zero], these data do provide *statistical* evidence of a difference in the two brands' average battery life.

More specifically, if we were to repeat this experiment again and again, sampling `\(10\)` batteries from the first brand and `\(8\)` batteries from the second brand, then `\(95\%\)` of the time the resulting lower and upper bounds would contain the true difference in battery life between these two brands.

.blue[Based on this particular experiment, we are 95% confident that the true difference in average battery life between the two brands is between `\(-28.16\)` minutes and `\(-7.44\)` minutes, in favor of the second brand.]

.footnote[The part in blue is what I would expect as an answer on a homework or test question.]

---

## Case 1 - Un-paired (independent) samples
### Hypothesis test

First determine what significance level you want to use! To be consistent with the confidence interval we just calculated, we're going to set `\(\alpha=0.05\)`.

As it was with testing the difference in proportions, when comparing two means we are mostly interested in whether the means are the same or whether one is bigger than the other. In other words, we want to test the null against one of these three alternatives:
`\begin{align}
H_0: \mu_1 - \mu_2 = 0 \quad\text{and}\quad &H_A: \mu_1 - \mu_2 < 0 \\
\text{ or }&H_A: \mu_1 - \mu_2 > 0\\
\text{ or }&H_A: \mu_1 - \mu_2 \neq 0.
\end{align}`

To calculate the test statistic for any of the above hypotheses we use
`$$\text{Test Statistic} = \text{T-score} = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}},$$`
where the `\(0\)` in the numerator is the value of `\(\mu_1 - \mu_2\)` under the null hypothesis.

---

## Case 1 - Un-paired (independent) samples
### Hypothesis test

.scroll-output[
## In R

```r
t.test(x1_obs, x2_obs, alternative="two.sided", paired=FALSE)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  x1_obs and x2_obs
## t = -3.6861, df = 14.022, p-value = 0.002438
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -28.155691  -7.444309
## sample estimates:
## mean of x mean of y 
##     186.2     204.0
```
]
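
To connect this output back to the formula, here is a small sketch computing the same T-score by hand; it should reproduce `t = -3.6861` and `p-value = 0.002438` from the `t.test()` output (again, for illustration only).

```r
n1 <- length(x1_obs); n2 <- length(x2_obs)
se <- sqrt(var(x1_obs)/n1 + var(x2_obs)/n2)    #standard error, as before

#T-score: observed difference divided by its standard error
t_score <- (mean(x1_obs) - mean(x2_obs)) / se  #approx -3.6861

#Two-sided p-value, using the Welch df reported by t.test()
2 * pt(-abs(t_score), df=14.022)               #approx 0.002438
```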
---

## Case 2 - Paired samples
### Problem setting

**Population:** If we have two dependent samples, we can model the difference between them as a single continuous RV, `\(\Delta\)`. The *population parameter* we are interested in is `\(\mu\)`: the mean of the RV `\(\Delta\)`,
`$$\mu = E[\Delta] = \int_{S} \delta\, f(\delta)\, d\delta,$$`
where `\(S\)` is the sample space of all possible values the difference `\(\Delta\)` might take and `\(f\)` is its density function.

**Sample:** We have one random subset of observations from the population, but we have two different instances of observations:
`$$x_{obs_1} = \{x_{1,1}, x_{1,2}, x_{1,3}, \dots, x_{1,n}\} \quad \text{ and }\quad x_{obs_2} = \{x_{2,1}, x_{2,2}, \dots, x_{2,n}\}.$$`

The data we are really interested in are the differences between these observations:
`$$d_{obs} = \{d_1, d_2, d_3, \dots, d_n \} = \{(x_{1,1} - x_{2,1}), (x_{1,2}-x_{2,2}), (x_{1,3}-x_{2,3}), \dots, (x_{1,n}-x_{2,n})\}.$$`

The sample size is therefore just the number of **dependent** pairs, `\(n\)`.

---

## Case 2 - Paired samples
### Problem setting

Now we are in the exact same setting we've handled before: all we need is a one-sample t-test or confidence interval!

### Assumptions

The assumptions are the same as they were for one-sample inference for a population mean; however, we now need to make sure the assumptions hold for the *differences* of the paired observations, rather than for the observations themselves.

--

1. The values the quantitative variable `\(\Delta\)` can take on follow a roughly symmetric and uni-modal distribution. For larger sample sizes (i.e. more matched pairs), we can relax this assumption.

2. The differenced data, `\(d_{obs}\)`, are independent of one another; in other words, the differenced data represent a simple random sample (SRS) from a larger population of differences. The only dependency in the data should be within each pair.

---

## Case 2 - Paired samples
### Problem setting

**In observational studies:** think about the larger population of the RV `\(\Delta\)` and verify that `\(n\)` is less than `\(10\%\)` of the size of the entire population of differences. In this setting, we refer to paired data as *matched* data.

**In experiments:** think about possible sources of bias in the assignment of treatments. Randomization, or the random assignment of individuals to a treatment or a control group, is the best way to avoid bias. In this setting, we say the data is *blocked* according to whatever distinguishes one pair from the next.

--

- Paired data will always have the same number of observations in each group. However, just because two groups have the same number of observations .red[does not] mean that the data is paired.

- Paired data is inherently *dependent*: each set of observations represents different values of the same observational unit (e.g. before/after, twins). Un-paired data must be from groups that are *independent* of one another.

---

## Case 2 - Paired samples
### In R

Since everything about paired, two-sample tests and CIs is exactly the same as it was for a single-sample test and CI, we will instead go over the data processing steps here. The only processing step needed is to compute the differences between the paired observations.

.scroll-small[

```r
#Define the observed data and then calculate the difference between the pairs
x1_obs <- c(160, 200, 205, 181, 190, 187, 172, 193, 182, 192)
x2_obs <- c(155, 195, 200, 173, 192, 184, 171, 196, 179, 188)
diff_obs <- x1_obs - x2_obs

#Then use the t.test() function on the differenced data
t.test(diff_obs, conf.level=0.95)
```

```
## 
##  One Sample t-test
## 
## data:  diff_obs
## t = 2.7121, df = 9, p-value = 0.02391
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.4811485 5.3188515
## sample estimates:
## mean of x 
##       2.9
```
]
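
One last note: `t.test()` can also do the differencing for you. Calling it with `paired=TRUE` on the two original vectors produces exactly the same test and interval as the one-sample call on `diff_obs` above.

```r
#Equivalent to t.test(diff_obs, conf.level=0.95)
t.test(x1_obs, x2_obs, paired=TRUE, conf.level=0.95)
```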