Data: \(x_1, x_2, \dots, x_n\) corresponds to a column of our data set
Random variable: \(X\) corresponds to the probability model for the variable we observe in our data set
Population mean, \(\mu\)
In this case \(X\) can follow just about any distribution (skewed, bimodal, discrete, etc.). We only care that \(E(X) = \mu\) and \(Var(X) = \sigma^2\).
As long as the CLT applies, the sampling distribution for the sample mean is approximately: \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\).
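To see this claim in action, here is a minimal simulation sketch. The choice of an Exponential(1) population (right-skewed, with \(\mu = 1\) and \(\sigma^2 = 1\)), the sample size \(n = 40\), and the helper name `simulate_sample_means` are all assumptions made for illustration; they do not come from the examples in these notes.

```python
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size `n` from Exponential(rate=1) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]

# Exponential(1) is right-skewed with mu = 1 and sigma^2 = 1.
n, reps = 40, 5000
means = simulate_sample_means(n, reps)

# Theory says the sample means should center on mu = 1 ...
print("mean of sample means:", round(statistics.fmean(means), 3))
# ... with variance close to sigma^2 / n = 1 / 40 = 0.025.
print("variance of sample means:", round(statistics.variance(means), 4))
```

Even though the population is far from Normal, the simulated sample means should land close to the center and spread that \(N\left(\mu, \frac{\sigma^2}{n}\right)\) predicts.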
Population proportion, \(p\)
In this case \(X \sim Bernoulli(p)\) so \(E(X) = p\) and \(Var(X) = p(1-p)\).
As long as we have enough successes and failures to use the Normal approximation to the Binomial distribution, the sampling distribution for the sample proportion is approximately: \(\hat{p} \sim N\left(p, \frac{p(1-p)}{n} \right)\).
The National Perinatal Statistics Unit of the Sydney Children’s Hospital reports that the mean birth weight of all babies born in birth centers in Australia in a recent year was \(3564\) grams—about \(7.86\) pounds. A Missouri hospital reports that the average weight of \(112\) babies born there last year was \(7.68\) pounds, with a standard deviation of \(1.31\) pounds. We want to see if U.S. babies weigh the same, on average, as Australian babies.
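As a rough sketch of where this example leads, suppose we use a two-sided alternative and treat the \(112\) Missouri babies as a random sample of U.S. babies. Both are assumptions, since neither is stated above. The one-sample \(t\) statistic would then be:

\[
H_0: \mu = 7.86 \quad \text{vs.} \quad H_A: \mu \neq 7.86, \qquad t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{7.68 - 7.86}{1.31/\sqrt{112}} \approx \frac{-0.18}{0.1238} \approx -1.45
\]

This statistic has \(n - 1 = 111\) degrees of freedom. If we also assume a significance level of \(\alpha = 0.05\), the two-sided critical value is roughly \(1.98\), so \(|t| \approx 1.45\) would not be extreme enough to reject \(H_0\).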
Q) Why do we bother with hypothesis tests?
Suppose we want to know if the probability of success is really \(0.5\), i.e. \(H_0: p = 0.5\). After collecting our data of observed successes and failures, why not just stop when we calculate \(\hat{p}=0.81\)? Since \(0.81 > 0.5\), why can’t we just conclude that no, the probability of success is probably not really \(0.5\)?
To answer this question, think back on some of the different examples we’ve solved. Does this approach work in the baby weight example above? Would it work in our experiment of randomly selecting \(20\) jelly beans and counting the number of green ones? Would you be satisfied with this reasoning in the drinking water contamination example?
Data: \(\{x_1, x_2, \dots, x_{n_1}\}\) and \(\{y_1, y_2, \dots, y_{n_2}\}\)
Random variables: \(X\) and \(Y\)
Difference in population means \(\mu_1 - \mu_2\)
Difference in population proportions, \(p_1 - p_2\)
Are people who work for non-profits generally more satisfied with their jobs compared to those who work at for-profit companies?
Separate random samples were collected by a polling agency to investigate the difference. Data collected from \(422\) employees at non-profit organizations revealed that \(377\) of them were “highly satisfied”. From the for-profit companies, \(431\) out of \(518\) employees reported the same level of satisfaction.
Step 1) Identify and define the population parameter and choose your significance level.
Step 2) State the null and alternative hypotheses.
Step 3) Assess the required assumptions and conditions.
Step 4) Calculate the test statistic and plot it.
Step 5) Shade the area under the curve that corresponds to your p-value and then calculate it.
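The calculation parts of these steps (Steps 4 and 5) can be sketched in Python for the job satisfaction example. This is one possible setup, not the only one: it assumes non-profit employees are group 1, a one-sided alternative \(H_A: p_1 - p_2 > 0\), and a pooled standard error under \(H_0: p_1 - p_2 = 0\). The function name `two_proportion_z_test` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """One-sided pooled z-test of H0: p1 - p2 = 0 against HA: p1 - p2 > 0."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Under H0 both groups share one p, so pool the successes to estimate it.
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Upper-tail area under the standard Normal curve.
    p_value = 1 - NormalDist().cdf(z)
    return p1_hat, p2_hat, z, p_value

# Group 1: non-profit (377 of 422); group 2: for-profit (431 of 518).
p1_hat, p2_hat, z, p_value = two_proportion_z_test(377, 422, 431, 518)
print(f"p1_hat = {p1_hat:.3f}, p2_hat = {p2_hat:.3f}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

Compare the printed p-value to the significance level you chose in Step 1 to reach a conclusion. If you set up a two-sided alternative instead, the p-value would be twice the upper-tail area.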
On average, how much more money do consumers spend at Target compared to Walmart?
Suppose researchers collected a systematic sample from \(85\) Walmart customers and \(80\) Target customers by asking them for their purchase amount as they left the stores. The data they collected is summarized in the table below. Suppose a computer already calculated the degrees of freedom to be \(162.75\).
Statistic | Walmart | Target |
---|---|---|
\(\bar{x}\) | \(\$45\) | \(\$53\) |
\(s\) | \(\$21\) | \(\$19\) |
Step 1) Identify and define the population parameter and choose your confidence level.
Step 2) Calculate the sample estimate for the population parameter.
Step 3) Assess the required assumptions and conditions.
Step 4) Find the critical value corresponding to your confidence level.
Step 5) Calculate the standard error of your sample estimate.
Step 6) Calculate the lower and upper bounds of your confidence interval.
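Steps 2, 5, and 6 for the Target vs. Walmart example can be sketched in Python as follows. Several pieces here are assumptions: the \(95\%\) confidence level (the example does not state one), the critical value `t_star = 1.975` (an approximate table value for a \(t\) distribution with about \(163\) degrees of freedom), and the helper name `welch_interval`. The function also recomputes the degrees of freedom with the Welch-Satterthwaite formula, which should agree with the \(162.75\) given above.

```python
from math import sqrt

def welch_interval(mean1, s1, n1, mean2, s2, n2, t_star):
    """Confidence interval for mu1 - mu2 using the unpooled (Welch) standard error."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    estimate = mean1 - mean2
    se = sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    margin = t_star * se
    return estimate, se, df, (estimate - margin, estimate + margin)

# Group 1: Target (mean 53, s 19, n 80); group 2: Walmart (mean 45, s 21, n 85).
# t_star = 1.975 is an assumed table value for 95% confidence, not computed here.
estimate, se, df, (lower, upper) = welch_interval(53, 19, 80, 45, 21, 85, t_star=1.975)
print(f"estimate = {estimate}, SE = {se:.3f}, df = {df:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The critical value is passed in as an argument so that you can swap in the value for whatever confidence level you chose in Step 1.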
If you feel like you’re starting to get lost with all the different statistical methods we’ve explored in this unit, don’t fret! There are plenty of flow-chart-like guides to using different tests. For example, here is a flow-chart created by a stats grad student. You could also make your own! The starting point for using any of these kinds of guides is to first assess the data that you have (or want to have). This means you need to be able to answer questions like:
What constitutes an observational unit?
What are the variables and what are their types?
Do you expect there to be any relationships among the variables given your knowledge of the subject?
Is your sample random? If not, is your sample conceivably representative of any larger population?
Next class, we will finish these examples and cover a paired t-test for a difference in means. This is a special two-sample procedure that we use when our two samples are actually strongly connected in some way.
On Monday next week, you will have some time to work through an in-class worksheet with your classmates. In this worksheet, you will practice determining which method to use and setting up the solutions. You can then practice finding the solutions on your own as you study for Quiz 3.
Wednesday’s class next week will be devoted entirely to student questions. I’ll provide a quick recap of the material we’ve covered thus far in Unit 3 but otherwise, class time is for you to ask me questions.
Please check the calendar for updates. Group Homework #5 will be released this week. It will be graded for completion and is designed to help you practice the different software tools we’ve used throughout the semester. The idea is that in completing GHW #5, you will refresh your memory of the different methods we’ve learned about. This is especially helpful if you’re doing Option B of the final project.