class: center, middle, inverse, title-slide

# Statistical Inference for the Difference in Population Proportions
### STAT 021 with Prof Suzy
### Swarthmore College

---
<style type="text/css">
pre {
  background: #FFBB33;
  max-width: 100%;
  overflow-x: scroll;
}
.scroll-output {
  height: 70%;
  overflow-y: scroll;
}
.scroll-small {
  height: 30%;
  overflow-y: scroll;
}
.red{color: #ce151e;}
.green{color: #26b421;}
.blue{color: #426EF0;}
</style>

## Difference Between Two Proportions
### Problem Set-up

Collect responses to a yes/no survey question from two distinct groups of people. The question of interest is, "Is there a statistically significant difference in the probability of a 'yes' between the two groups?"

**Population:** We now have two populations, and we must draw an SRS from each. We must make sure these two populations are independent of one another!

We treat the answers to the yes/no survey question as *two independent* Bernoulli RVs, `\(X_1\)` and `\(X_2\)`.

The *population parameter* we are interested in is `\(p_1-p_2\)`: the difference between the probabilities that a randomly chosen individual from each population answers "yes" to the survey question,

`$$p_1 = Pr(X_1 = \text{yes}) \quad \text{and} \quad p_2 = Pr(X_2 = \text{yes}).$$`

---
## Difference Between Two Proportions
### Problem Set-up

**Sample:** We have two independent random samples, one from each of the populations.

.pull-left[
**Sample size for sample 1:** `\(n_1=\)` the number of observations in the first sample.

`\begin{align} x_{1,obs} &= \{x_{1,1}, x_{1,2}, x_{1,3}, \dots, x_{1,n_1}\} \\ &= \{no, yes, no, \dots, no\} \\ &= \{0, 1, 0, \dots, 0\} \end{align}`
]

.pull-right[
**Sample size for sample 2:** `\(n_2=\)` the number of observations in the second sample.
`\begin{align} x_{2,obs} &= \{x_{2,1}, x_{2,2}, x_{2,3}, \dots, x_{2,n_2}\} \\ &= \{yes, yes, no, \dots, no\} \\ &= \{1, 1, 0, \dots, 0\} \end{align}`
]

---
## Difference Between Two Proportions
### Example

One month before the election, a poll of `\(630\)` randomly selected voters showed `\(54\%\)` planning to vote for a certain candidate. A week later it became known that this candidate was involved in illicit sexual activity, and a new poll showed that only `\(51\%\)` of `\(1010\)` voters still supported this candidate.

**Q:** Do these results indicate a "significant" decrease in voter support?

--

We will answer this question in two ways: with a CI and with a hypothesis test.

---
## Difference Between Two Proportions
### Problem Set-up Continued

With our two samples, we now have an **estimate** for the parameter `\(p_1-p_2\)`:

`$$\hat{p}_1 - \hat{p}_2 = \frac{1}{n_1}\sum_{i=1}^{n_1}x_{1,i} - \frac{1}{n_2}\sum_{j=1}^{n_2}x_{2,j}.$$`

Before we can answer the question of interest, we must make sure the following **assumptions** are reasonable for *both* samples:

1. Each sample is large enough for the CLT to apply. (Roughly speaking, we need at least 10 successes and 10 failures in *each* sample; in other words, `\((n_1\hat{p}_1 \geq 10) \text{ and }(n_1(1-\hat{p}_1) \geq 10)\)` and `\((n_2\hat{p}_2 \geq 10) \text{ and }(n_2(1-\hat{p}_2) \geq 10)\)`.)

2. The data are simple random samples (SRS) from both populations of interest. (Are there any dependencies among the data? Does each sample comprise less than `\(10\%\)` of its population? Were these samples collected without bias?)

3. The two populations from which we obtained our two samples are independent of one another. There is no known relationship between the responses of one population and the responses of the other population.
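Assumption 1 can be checked directly for the polling example. The following is a minimal sketch, assuming the poll figures from the example slide; the variable names (`p_hat`, `successes`, `failures`, `clt_ok`) are illustrative and not part of the original slides.

```r
# Poll figures taken from the example (before vs after the scandal)
sample_sizes <- c(before = 630, after = 1010)
p_hat        <- c(before = 0.54, after = 0.51)

# Estimated number of "yes" (success) and "no" (failure) answers per sample
successes <- sample_sizes * p_hat
failures  <- sample_sizes * (1 - p_hat)

# Assumption 1: at least 10 successes and 10 failures in *each* sample
clt_ok <- all(successes >= 10) && all(failures >= 10)
clt_ok
```

Assumptions 2 and 3 concern how the data were collected, so they cannot be verified by a calculation like this one.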
---
## Confidence Intervals

A `\((1-\alpha)\times100 \%\)` confidence interval for the true (unknown) difference `\(p_1-p_2\)` is

`$$(\hat{p}_{1} - \hat{p}_{2}) \pm \left(z^*_{\alpha/2} \times \sqrt{\frac{\hat{p}_{1}(1-\hat{p}_{1})}{n_1} + \frac{\hat{p}_{2}(1-\hat{p}_{2})}{n_2}}\right),$$`

where `\(z^*_{\alpha/2}\)` is the **critical value** which (as before) is the `\((1-\alpha/2)\)` **quantile** of a Standard Normal RV, i.e. the value with probability `\(\alpha/2\)` above it.

**Q1:** What's the margin of error?

--

The CLT (under the appropriate conditions) tells us that the sampling distribution of `\(\hat{p}_1 - \hat{p}_2\)` is approximately Normal.

**Q2:** How can you prove this statement?

---
## Confidence Intervals
### In R

```r
# Define the observed number of successes as a vector of two numbers, one for each sample.
# Do the same for the sample sizes.
# It doesn't matter which group you put first, just be consistent!
obs_successes <- c(630*0.54, 1010*0.51)
sample_sizes <- c(630, 1010)
# Use the same prop.test function as before
prop.test(obs_successes, n = sample_sizes, conf.level = 0.95)
```

```
## 
## 	2-sample test for equality of proportions with continuity correction
## 
## data:  obs_successes out of sample_sizes
## X-squared = 1.2817, df = 1, p-value = 0.2576
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.02093856  0.08093856
## sample estimates:
## prop 1 prop 2 
##   0.54   0.51 
```

```r
# To get R to show only the CI, extract it by appending $conf.int
prop.test(obs_successes, n = sample_sizes, conf.level = 0.95)$conf.int
```

```
## [1] -0.02093856  0.08093856
## attr(,"conf.level")
## [1] 0.95
```

---
## Confidence Intervals
### Interpretation

We are `\(95\%\)` confident that the true difference in voter support (before minus after the scandal) is between -0.02 and 0.08. Since this interval contains zero, these data do **not** provide evidence of a *statistically* significant decrease in voter support for this candidate, despite the scandal.
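As a cross-check, the interval can also be computed straight from the confidence interval formula. This is a sketch under the assumption that the poll figures from the example are used; the names `z_star`, `se`, `margin`, and `ci_by_hand` are illustrative. Note that `prop.test` applies a continuity correction by default, so its default interval is slightly wider than the formula's; passing `correct = FALSE` turns the correction off.

```r
n     <- c(630, 1010)    # sample sizes (before, after)
p_hat <- c(0.54, 0.51)   # sample proportions (before, after)
alpha <- 0.05

# Critical value: the (1 - alpha/2) quantile of the Standard Normal
z_star <- qnorm(1 - alpha / 2)

# Standard error of p_hat_1 - p_hat_2 (unpooled, as in the CI formula)
se <- sqrt(sum(p_hat * (1 - p_hat) / n))

# Margin of error and the resulting interval
margin     <- z_star * se
ci_by_hand <- (p_hat[1] - p_hat[2]) + c(-1, 1) * margin
ci_by_hand

# The same interval from prop.test with the continuity correction turned off
ci_prop_test <- prop.test(n * p_hat, n = n, conf.level = 1 - alpha,
                          correct = FALSE)$conf.int
all.equal(ci_by_hand, as.numeric(ci_prop_test))
```

Here `margin` is the quantity asked about in Q1: the critical value times the standard error.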

Now, let's compare this answer to that of a hypothesis test for this example.

---
## Hypothesis Tests

First determine what significance level you want to use! To be consistent with the confidence interval we just calculated, we're going to set `\(\alpha=0.05\)`.

Typically, when comparing two proportions, we want to know whether the proportions are the same or whether one is bigger than the other. In other words, we want to test the null versus one of these three alternatives:

`\begin{align} H_0: p_1 - p_2 = 0 \quad\text{and}\quad &H_A: p_1 - p_2 < 0 \\ \text{ or }&H_A: p_1 - p_2 > 0\\ \text{ or }&H_A: p_1 - p_2 \neq 0. \end{align}`

To calculate the test statistic for any of the above hypotheses we use

`$$\text{Test Statistic} = \text{Z-score} = \frac{(\hat{p}_{1} - \hat{p}_2) - 0}{\sqrt{\frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_1} + \frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_2}}},$$`

where `\(\hat{p}_{pool} = \frac{\sum_{i=1}^{n_1}x_{1,i} + \sum_{j=1}^{n_2}x_{2,j}}{n_1 + n_2}\)` is the proportion of "yes" answers in the two samples combined.

--

**Q:** Why are we using `\(\hat{p}_{pool}\)` now?
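The test statistic above can be computed by hand for the polling example. This is a minimal sketch, assuming the one-sided alternative `\(H_A: p_1 - p_2 > 0\)` (a decrease in support); the names `p_pool`, `se_pool`, `z`, and `p_value` are illustrative. Because `prop.test` applies a continuity correction by default, its default `X-squared` and p-value differ slightly from the uncorrected values computed here; with `correct = FALSE` its statistic equals `z^2`.

```r
n     <- c(630, 1010)    # sample sizes (before, after)
p_hat <- c(0.54, 0.51)   # sample proportions (before, after)
x     <- n * p_hat       # observed number of successes in each sample

# Pooled proportion: all successes over all observations
p_pool <- sum(x) / sum(n)

# Standard error under H_0 (both samples share the pooled proportion)
se_pool <- sqrt(p_pool * (1 - p_pool) / n[1] + p_pool * (1 - p_pool) / n[2])

# Z-score with null value 0
z <- ((p_hat[1] - p_hat[2]) - 0) / se_pool

# One-sided p-value for H_A: p_1 - p_2 > 0 (no continuity correction)
p_value <- pnorm(z, lower.tail = FALSE)
c(z = z, p_value = p_value)

# Cross-check: the uncorrected chi-squared statistic from prop.test equals z^2
chisq_uncorrected <- unname(prop.test(x, n = n, correct = FALSE)$statistic)
all.equal(z^2, chisq_uncorrected)
```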
---
## Hypothesis Tests
### In R

```r
# Define the observed successes and the sample sizes
obs_successes <- c(630*0.54, 1010*0.51)
sample_sizes <- c(630, 1010)
# Set the direction of the alternative hypothesis and the confidence level.
# "greater" tests H_A: p_1 - p_2 > 0, i.e. support was higher before the scandal.
prop.test(obs_successes, n = sample_sizes, alternative = "greater", conf.level = 0.95)
```

```
## 
## 	2-sample test for equality of proportions with continuity correction
## 
## data:  obs_successes out of sample_sizes
## X-squared = 1.2817, df = 1, p-value = 0.1288
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.01295617  1.00000000
## sample estimates:
## prop 1 prop 2 
##   0.54   0.51 
```

```r
# Note that we don't have to specify p_null here: since R knows we are comparing
# two groups, it uses the null value p_1 - p_2 = 0.
```

---
## Hypothesis Tests
### Interpretation

[draw picture of p-value]

With a p-value of 0.13, which is larger than `\(\alpha = 0.05\)`, we fail to reject the null hypothesis. These data do not provide statistically significant evidence that voter support decreased (that is, that `\(p_1 - p_2 > 0\)`).