1. Comparing Groups: Difference between proportions

Under rather general conditions, the sampling distribution of the difference between two independent proportions is a Normal distribution with expectation, \[E(\hat{p}_1 - \hat{p}_2) = p_1 - p_2\] and variance \[Var(\hat{p}_1 - \hat{p}_2) = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}.\] The estimate for the sampling standard deviation of \(\hat{p}_1 - \hat{p}_2\) is the standard error: \[SE(\hat{p}_1 - \hat{p}_2) = \sqrt{ \frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}.\]

Assumptions and conditions

  • Independence assumption

    • Each sample consists of independent observations

    • The two samples are independent from one another (unpaired)

  • Randomization condition

  • Success/failure condition - both samples must contain enough successes and failures (a common rule of thumb is at least \(10\) of each)

A two-proportion z-interval

\[(\hat{p}_1 - \hat{p}_2) \pm \left[z^*_{a} \times SE(\hat{p}_1 - \hat{p}_2)\right],\] where, as before, \(z_a^*\) is the upper \(\left(\frac{1-a}{2}\right)\) quantile of a \(N(0,1)\) distribution for confidence level \(a\) (by symmetry, the lower \(\frac{1-a}{2}\) quantile has the same magnitude).
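If you want to check this calculation with software, here is a minimal sketch in Python (scipy is assumed to be available; the counts \(x_1, x_2\) and sample sizes \(n_1, n_2\) are whatever your data give you):

```python
from scipy import stats

def two_prop_z_interval(x1, n1, x2, n2, conf=0.95):
    """Two-proportion z-interval using the unpooled standard error above."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = (p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2) ** 0.5
    z_star = stats.norm.ppf(1 - (1 - conf) / 2)   # upper (1-a)/2 quantile of N(0,1)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se
```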

A two-proportion z-test

\[H_0: p_1 - p_2 = 0\]

\[T.S. = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE\left(\hat{p}_1 - \hat{p}_2 \right)} \stackrel{H_0}{\sim} N(0,1)\]

Example: Hypothesis test for a difference in proportions

Are people who work for non-profits generally more satisfied with their jobs compared to those who work at for-profit companies?

Separate random samples were collected by a polling agency to investigate the difference. Data collected from \(422\) employees at non-profit organizations revealed that \(377\) of them were “highly satisfied”. From the for-profit companies, \(431\) out of \(518\) employees reported the same level of satisfaction.

Step 1) Identify and define the population parameter and choose your significance level.

Step 2) State the null and alternative hypotheses.

Step 3) Assess the required assumptions and conditions.

Step 4) Calculate the test statistic and plot it.

Step 5) Shade the area under the curve that corresponds to your p-value and then calculate it.
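A minimal sketch of the test-statistic and p-value calculation for this example in Python, using the counts above, the unpooled standard error from Section 1, and a one-sided alternative \(p_1 - p_2 > 0\) (group 1 being the non-profit employees); scipy is assumed to be available:

```python
from scipy import stats

x1, n1 = 377, 422   # non-profit: "highly satisfied" count, sample size
x2, n2 = 431, 518   # for-profit

p1_hat, p2_hat = x1 / n1, x2 / n2
se = (p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2) ** 0.5
z = (p1_hat - p2_hat - 0) / se        # test statistic under H0: p1 - p2 = 0
p_value = 1 - stats.norm.cdf(z)       # one-sided: Ha: p1 - p2 > 0

print(round(z, 2), round(p_value, 4))  # roughly z = 2.75, p = 0.003
```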


2. Comparing Groups: Difference between two means

Under rather general conditions, the sampling distribution of the difference between two independent means is a Normal distribution with expectation, \[E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2\] and variance \[Var(\bar{x}_1 - \bar{x}_2) = \frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}.\] The estimate for the sampling standard deviation of \(\bar{x}_1 - \bar{x}_2\) is the standard error: \[SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}},\] where \(s_1^2 = \frac{1}{n_1-1}\sum_{i=1}^{n_1}(x_{1,i} - \bar{x}_1)^2\) and \(s_2^2 = \frac{1}{n_2-1}\sum_{i=1}^{n_2}(x_{2,i} - \bar{x}_2)^2\).

Assumptions and conditions

  • Independence assumption

    • Each sample consists of independent observations

    • The two samples are independent from one another (unpaired)

  • Randomization condition

  • Nearly Normal Condition - each sample comes from a nearly Normal population, or both sample sizes are large enough for the CLT to approximate the sampling distributions of \(\bar{x}_1\) and \(\bar{x}_2\).

Two-sample independent t-interval

\[(\bar{x}_1 - \bar{x}_2) \pm \left[t^*_a \times SE(\bar{x}_1 - \bar{x}_2) \right],\] where \(t^*_a\) is again the upper \(\left(\frac{1-a}{2}\right)\) quantile, now of a Student’s t-distribution. The appropriate degrees of freedom (from the Welch-Satterthwaite approximation, sketched below) are tedious to calculate by hand, so we typically let computers do this part.
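For reference, here is a small sketch of the Welch-Satterthwaite calculation that software performs, taking the two sample variances and sample sizes as inputs:

```python
def welch_df(s1_sq, n1, s2_sq, n2):
    """Welch-Satterthwaite approximate degrees of freedom (s1_sq, s2_sq are sample variances)."""
    v1, v2 = s1_sq / n1, s2_sq / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
```

With the Walmart/Target summary statistics from the example below (\(s_1 = 21, n_1 = 85, s_2 = 19, n_2 = 80\)), this gives roughly \(162.75\), matching the degrees of freedom quoted there.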

Two-sample independent t-test

\[H_0: \mu_1 - \mu_2 = \Delta_0\]

\[T.S. = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{SE(\bar{x}_1 - \bar{x}_2)} \stackrel{H_0}{\sim} \text{Student's t dist with some (complicated) degrees of freedom}\]

Pooling

In the special case where we have data from a randomized experiment, we generally make the assumption that each treatment group has the same population variance. (Random assignment is what makes this assumption justifiable.)

In this special case, we use a pooled standard error for \((\bar{x}_1 - \bar{x}_2)\): \[SE_{pool}(\bar{x}_1 - \bar{x}_2) = s_{pool}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\] where \(s_{pool} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{(n_1-1) + (n_2-1)}}.\)
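A small sketch of the pooled calculation, assuming you have the two sample standard deviations and sample sizes:

```python
def pooled_se(s1, n1, s2, n2):
    """Pooled standard error of (xbar_1 - xbar_2), assuming equal population variances."""
    s_pool = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return s_pool * (1 / n1 + 1 / n2) ** 0.5
```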

Example: Confidence interval for a difference in means

On average, how much more money do consumers spend at Target compared to Walmart?

Suppose researchers collected a systematic sample from \(85\) Walmart customers and \(80\) Target customers by asking them for their purchase amount as they left the stores. The data they collected is summarized in the table below. Suppose a computer already calculated the degrees of freedom to be \(162.75\).

             Walmart     Target
\(\bar{x}\)  \(\$45\)    \(\$53\)
\(s\)        \(\$21\)    \(\$19\)

Step 1) Identify and define the population parameter and choose your confidence level.

Step 2) Calculate the sample estimate for the population parameter.

Step 3) Assess the required assumptions and conditions.

Step 4) Find the critical value corresponding to your confidence level.

Step 5) Calculate the standard error of your sample estimate.

Step 6) Calculate the lower and upper bounds of your confidence interval.
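A minimal sketch of the critical value, standard error, and interval bounds for this example in Python, using the summary statistics above, the quoted degrees of freedom, and (as an assumption) a \(95\%\) confidence level; scipy is assumed to be available:

```python
from scipy import stats

xbar_w, s_w, n_w = 45, 21, 85   # Walmart: mean, sd, sample size
xbar_t, s_t, n_t = 53, 19, 80   # Target

diff = xbar_t - xbar_w                              # Target minus Walmart
se = (s_w**2 / n_w + s_t**2 / n_t) ** 0.5           # unpooled standard error
t_star = stats.t.ppf(1 - (1 - 0.95) / 2, 162.75)    # critical value, df from the computer
lower, upper = diff - t_star * se, diff + t_star * se
print(lower, upper)                                 # roughly ($1.85, $14.15)
```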


3. Comparing Groups: Difference between two paired means

Now we will discuss how to conduct inference for the difference of paired sample means. That is, we will consider confidence intervals and hypothesis tests for \((\mu_1 - \mu_2)\) when there is a strong relationship between the two populations sampled.

Spoiler alert: This special case actually just simplifies to a one-sample t-test for a mean!

For data with paired numeric variables, we will create a new variable representing the pairwise differences and then conduct inference on that single differenced variable. The sample size \(n\) in these problems is the number of pairs (equivalently, the number of differences).

\[SE(\bar{d}) = \sqrt{\frac{s_d^2}{n}},\] where \(\bar{d}\) and \(s_d\) are the sample mean and standard deviation of the \(n\) differences.

Assumptions and conditions

  • Paired assumption

    • The two samples are strongly dependent in a paired manner. Specifically, each data point from one sample has a corresponding paired data point in the other sample. This is typically the case when our data represent before-and-after measurements on the same observational unit. Other examples include:

      • the IQ scores of twins,

      • reaction times with dominant and non-dominant hands,

      • happiness levels of married couples.

  • Independence assumption

    • Each observed difference (between a pair of data points) is independent of the other differences

  • Randomization condition

  • Nearly Normal Condition - the differences are nearly Normal, or the number of pairs is large enough for the CLT to approximate the sampling distribution of \(\bar{d}\).

Paired t-interval

\[\bar{d} \pm \left[t^*_{a, n-1} \times SE(\bar{d}) \right],\] where \(t^*_{a, n-1}\) is the upper \(\left(\frac{1-a}{2}\right)\) quantile of a Student’s t-distribution with \((n-1)\) degrees of freedom.

Paired t-test

Let \(\mu_d = \mu_1 - \mu_2\) be the unknown true difference in the means of two dependent populations.

\[H_0: \mu_d = \Delta_0\]

\[T.S. = \frac{\bar{d} - \Delta_0}{SE(\bar{d})} \stackrel{H_0}{\sim} \text{Student's t dist with $(n-1)$ degrees of freedom}\]
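A minimal sketch in Python, using hypothetical before/after measurements and \(\Delta_0 = 0\); it also shows the equivalent by-hand calculation on the differences and a paired t-interval (scipy assumed, \(95\%\) confidence level chosen as an example):

```python
from scipy import stats

before = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5]   # hypothetical paired measurements
after  = [11.2, 9.1, 10.8, 10.5, 12.1, 9.0]

# Paired t-test (Delta_0 = 0): scipy does the differencing for us
t_stat, p_value = stats.ttest_rel(before, after)

# Equivalent one-sample calculation on the differences d_i = before_i - after_i
d = [b - a for b, a in zip(before, after)]
n = len(d)
d_bar = sum(d) / n
s_d = (sum((di - d_bar) ** 2 for di in d) / (n - 1)) ** 0.5
t_by_hand = d_bar / (s_d / n ** 0.5)          # same value as t_stat

# Paired t-interval for mu_d
lower, upper = stats.t.interval(0.95, n - 1, loc=d_bar, scale=s_d / n ** 0.5)
```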

4. In summary

Data: \(\{x_1, x_2, \dots, x_{n_1}\}\) and \(\{y_1, y_2, \dots, y_{n_2}\}\)

Random variables: \(X\) and \(Y\)

Where to start?

If you feel like you’re starting to get lost with all the different statistical methods we’ve explored in this unit, don’t fret! There are plenty of flow-chart-like guides to using different tests. For example, here is a flow-chart created by a stats grad student. You could also make your own! The starting point for using any of these kinds of guides is to first assess the data that you have (or want to have). This means you need to be able to answer questions like:

  • What constitutes an observational unit?

  • What are the variables and what are their types?

  • Do you expect there to be any relationships among the variables given your knowledge of the subject?

  • Is your sample random? If not, is your sample conceivably representative of any larger population?