I will use blue text for hypothetical questions. These questions are meant to make you think, and the color is a reminder to me to give you some time to consider possible responses. These questions likely don’t have one correct answer, but if you have a thought you’d like to share with the class, please do so!
Questions where I want an actual answer from you will be in purple. I will solicit responses from you in various ways. Sometimes your response will be privately recorded on a piece of paper to hand in or written on a worksheet that you complete with your classmates. Often, I will wait for a verbal response if these questions appear in the middle of a lecture.
A major objective of statistical analysis is to create reliable, reproducible, and relevant descriptions of carefully collected data. There are many ways to explore and describe different types of data and more yet to be invented.
What makes an exploratory statistical analysis “objective”?
Although exploratory statistical analyses are incredibly powerful in science and life, the best statistical analyses are ones that tell a story. Data alone can’t convince the way stories do.[1] This is worth keeping in mind both when conducting your own exploratory analyses and especially when considering the results of an exploratory analysis by someone else.
“Maybe stories are just data with a soul.”
Recall:
Statistical stories are oriented around organized, observable forms of information… data! A data set consists of rows corresponding to the individual things being collected as “data points”, i.e. the observational units. The other feature of a data set is the columns which correspond to the stuff that’s being measured or counted, i.e. the variables.
Today, we’re going to cover a few ways to conduct an exploratory data analysis when the data consist of one or two variables: numeric, categorical, or one of each.
What are we looking for in a histogram?
What kind of story can you tell about this data from the histograms above?
What’s the difference in the information contained in this boxplot and the information contained in a histogram?
##
##   The decimal point is at the |
##
##   -6 | 4
##   -4 | 52
##   -2 | 94
##   -0 | 8655533322008877766444333332111
##    0 | 0112233444467990001122336678
##    2 | 11578
##    4 | 6
What’s the difference between the information contained in this stem-and-leaf plot and a histogram?
Is it honest to visualize transformed data?
Poll: How many of you are in your first/second/third/fourth (or more) years at Swarthmore?
## cur_sec
##  First Second  Third
##     31      2      1
## cur_sec
##      First     Second      Third
## 0.91176471 0.05882353 0.02941176
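The two tables above are a frequency table (counts) and a relative frequency table (proportions) for the same poll. As a minimal sketch of how the second follows from the first, here is the arithmetic in Python. The counts are copied from the output above; the variable names are my own, and the output above appears to come from R, so this is an illustration rather than the code that produced it.

```python
# Class-year counts copied from the poll output above.
counts = {"First": 31, "Second": 2, "Third": 1}

# Total number of responses across all categories.
total = sum(counts.values())

# Relative frequency of each category = its count divided by the total.
proportions = {year: n / total for year, n in counts.items()}

for year, p in proportions.items():
    print(f"{year:>6}: {counts[year]:>2} of {total} = {p:.8f}")
```

The proportions always sum to 1, which makes them easier to compare across classes of different sizes than the raw counts.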
By yourself: Take a moment and think about the colloquial meaning of the word “relationship”. For example, how would you define this word in a dictionary?
With a classmate: What does the word “relationship” entail in a statistical or data science context? Jot down your answers to hand in at the end of class.
How could we incorporate a third categorical variable for, say, lead or supporting role?
A contingency table is a way to display data that consist of two categorical variables. Here is an (in)famous example: the people aboard the ship Titanic, cross-classified by whether they survived the wreck and by passenger class or crew status.[4] With two rows and four columns, this one is a 2x4 table.
Survived | First Class Passenger | Second Class Passenger | Third Class Passenger | Crew Member |
---|---|---|---|---|
Yes | 203 | 118 | 178 | 212 |
No | 122 | 167 | 528 | 673 |
Some questions we could answer from this contingency table include:
What is the probability that a passenger was in X class and did not survive?
What is the probability that a passenger survived given they were in X class?
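The two questions above use different denominators: the first is a joint probability (out of everyone in the table), the second a conditional probability (out of one column only). A minimal Python sketch of both calculations follows. The counts are copied from the table above; the function names are hypothetical, chosen only for this illustration, and crew members are treated as one more value of "X" alongside the three passenger classes.

```python
# Counts copied from the Titanic contingency table above.
survived = {"First": 203, "Second": 118, "Third": 178, "Crew": 212}
died = {"First": 122, "Second": 167, "Third": 528, "Crew": 673}

# Grand total of everyone in the table.
total = sum(survived.values()) + sum(died.values())


def p_class_and_died(group):
    """Joint proportion: in `group` AND did not survive, out of everyone."""
    return died[group] / total


def p_survived_given_class(group):
    """Conditional proportion: survived, out of those in `group` only."""
    return survived[group] / (survived[group] + died[group])


for group in survived:
    print(
        f"{group:>6}: P(class and died) = {p_class_and_died(group):.3f}, "
        f"P(survived | class) = {p_survived_given_class(group):.3f}"
    )
```

Comparing the conditional proportions across the four columns is one way to look for evidence of a relationship between survival and class.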
What could evidence of a relationship look like in each of the above examples?
You can explore for yourself some of the vast, creative world of data visualization techniques at https://www.data-to-viz.com/.
Read your textbook! By next class, you should have read all of chapters 1-3 in your textbook.
Practice asking and answering questions in class and while you read your textbook or do your homework. By next class, you should have attempted Reading Comp Homeworks 1 and 2.
Attend office hours as often as possible. This time is for you, so use it! My office hours can be found in our course calendar and syllabus.
Communicate with your professor. Do you not yet have your textbook? Access to Pearson MyLab? Can you not attend any of my office hours? Let me know ASAP!
Allow yourself room to grow. We learn and live at our own paces. You’re not perfect and you don’t have to be.
1. For more on this point, read, for example, “Data Alone Won’t Get You a Standing Ovation” and “‘Stories are data with Soul’ – lessons from black feminist epistemology”.
2. Source: https://www.vulture.com/2012/08/meryl-streep-matrix.html
3. Source: https://www.r-bloggers.com/2019/12/boss-of-all-plots-box-plots/
4. Source: https://www.numerade.com/questions/the-contingency-table-below-shows-the-survival-data-for-the-passengers-of-the-titanic-beginarraycccc/