Week 2: Inference for a proportion

class: center, middle, inverse, title-slide

.title[
# Week 2: Inference for a proportion
]
.author[
### STAT 021 with Suzanne Thornton
]
.institute[
### Swarthmore College
]

---

.scroll-output {
  height: 70%;
  overflow-y: scroll;
}

.scroll-small {
  height: 30%;
  overflow-y: scroll;
}
   
.red{color: #ce151e;}
.green{color: #26b421;}
.blue{color: #426EF0;}
</style>

## Population Proportion 
### Problem Set-up

Collect a survey with a yes/no question.

**Population:** Adults over the age of 18 who live in NYC

If we treat the yes/no survey question as a Bernoulli RV, `$X$`, then the *population parameter* we are interested in is `$p$`: the probability that any randomly chosen individual from the population answers "yes" to the question.

`$$p = Pr(X= yes)$$`

**Sample:** A random subset of the population

`\begin{align}
x_{obs} &=& \{x_1, x_2, x_3, \dots, x_n\} \\
 &=& \{yes, yes, no, \dots, yes\} \\
 &=& \{1, 1, 0, \dots, 1\}
\end{align}`

**Sample size:** `$n=$` the number of people who responded to the survey.

---
## Population Proportion 
### Problem Set-up

After we have surveyed our entire sample of `$n$` individuals, we now have an **estimate** for the population parameter: 
`$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n}x_i = \frac{\text{the total number of yes's}}{n}.$$`

Although it's easy enough for us to compute `$\hat{p}$`, the question remains:

.blue[How close is this sample estimate,] `$\hat{p}$`.blue[, to the (unknown) population parameter value,] `$p$`.blue[?]

--
To answer this question we can use CIs or a hypothesis test. Before doing so, we must make sure the following **assumptions** are reasonable!

1. The sample size is large enough for the CLT to work. (Roughly speaking we'll need at least 10 failures and 10 successes in our observations, in other words `$(n\hat{p} \geq 10) \text{ and }(n(1-\hat{p}) \geq 10)$`

2. The data is a simple random sample (SRS) from the population of interest. (Are there any dependencies among the data? Does the sample comprise less than `$10\%$` of the population? Was the sample collected without bias?)

---
## Confidence Intervals

A `$(1-\alpha)\times100 \%$` confidence interval for parameter `$p$` can is: 
`$$\hat{p}_{obs} \pm \left(z^*_{\alpha/2} \times \sqrt{\frac{\hat{p}_{obs}(1-\hat{p}_{obs})}{n}}\right).$$`

In the above, `$z^*_{\alpha/2}$` is the **critical value**. Since the CLT (under the appropriate conditions) tells us that the sampling distribution of `$\hat{p}$` is Normal, I'm using `$z$` here to clarify that this critical value is actually the upper `$(1-\frac{\alpha}{2})\%$` **quantile** of a Standard Normal RV.

The significance level is `$\alpha = 1 - \text{confidence level},$` where, typically, the confidence level is set to `$90\%$`, `$95\%$`, or `$99\%$`.

--
**Q1:** What's the margin of error?

---
## Confidence Intervals

A `$(1-\alpha)\times100 \%$` confidence interval for parameter `$p$` can is: 
`$$\hat{p}_{obs} \pm \left(z^*_{\alpha/2} \times \sqrt{\frac{\hat{p}_{obs}(1-\hat{p}_{obs})}{n}}\right).$$`

The significance level is `$\alpha = 1 - \text{confidence level}$` where, typically, the confidence level is set to `$90\%$`, `$95\%$`, or `$99\%$`.

**Q2:** What's stopping us from getting a `$100\%$` confidence interval?

---
## Confidence Intervals

### Interpretation

If we were to collect more size `$n$` simple random samples (SRS) from the same population, we would get slightly different confidence intervals. But, in the long run, `$(1-\alpha)\times 100\%$` of those confidence intervals will contain the true population proportion, `$p$`.

### In R
.scroll-small[

```r
#Define the observed successes 
obs_success <- 7
#Then use the prop.test() function 
prop.test(obs_success, n=10, conf.level=0.90)
```

```
## 
## 	1-sample proportions test with continuity correction
## 
## data:  obs_success out of 10, null probability 0.5
## X-squared = 0.9, df = 1, p-value = 0.3428
## alternative hypothesis: true p is not equal to 0.5
## 90 percent confidence interval:
##  0.3956524 0.9035510
## sample estimates:
##   p 
## 0.7
```

```r
#To get R to only show the CI we can extract it with the $conf.int appended at the end
prop.test(obs_success, n=10, conf.level=0.90)$conf.int
```

```
## [1] 0.3956524 0.9035510
## attr(,"conf.level")
## [1] 0.9
```
]

---
## Hypothesis Tests

We first must identify the null and alternative hypotheses:
`\begin{align}
H_0: p = p_0 \quad\text{and}\quad &H_A: p > p_0 \\
      \text{ or }&H_A: p < p_0\\
      \text{ or }&H_A: p \neq p_0.
\end{align}`

Then we calculate the test statistic
`$$\text{Test Statistic} = \text{Z-score} = \frac{\hat{p} - p_{0}}{\sqrt{\frac{p_0(1-p_0)}{n}}}.$$`

Compare the equation above to the equation for a confidence interval.

--
**Q:** Why are we using `$p_0$` in the denominator?

---
## Hypothesis Tests 
### Interpretation

.red[It is really helpful to draw picture!]

Once you have the picture, drawing your conclusion is straightforward.

--
Compute a p-value for the observed test statistic under the null hypothesis and compare it to the significance level `$\alpha$`.

- If the data supports/is consistent with the null hypothesis, we *fail to reject the null* and conclude that the null hypothesis seems reasonable according to the data.

- If it does not support the null hypothesis, we say that we *reject the null* hypothesis because the data provides evidence in favor of the alternative.

For the conclusion of a hypothesis test to be reliable, one must specify `$\alpha$` before collecting the data (and must not change it after seeing the data calculations)!

---
## Hypothesis Tests

### In R

.scroll-output[

```r
#Define the observed successes and the value of p in the null hypothesis
obs_success <- 7
p0 <- 0.5 
#Determine the direction of the alternative hypothesis, the confidence level, and the sample size n
prop.test(obs_success, n=10, p=p0, alternative="greater", conf.level=0.90)
```

```
## 
## 	1-sample proportions test with continuity correction
## 
## data:  obs_success out of 10, null probability p0
## X-squared = 0.9, df = 1, p-value = 0.1714
## alternative hypothesis: true p is greater than 0.5
## 90 percent confidence interval:
##  0.4484488 1.0000000
## sample estimates:
##   p 
## 0.7
```

```r
#For more information on the function (like the possible directions for the alternative hypothesis), run the following line of code:
help(prop.test)
```
]