Quick aside on these notes

I will use blue text for hypothetical questions. These questions are meant to make you think, and the color is a reminder to me to give you some time to consider possible responses. These questions likely don’t have one correct answer, but if you have a thought you’d like to share with the class, please do so!

Questions where I want an actual answer from you will be in purple. I will solicit responses from you in various ways. Sometimes your response will be privately recorded on a piece of paper to hand in or written on a worksheet that you complete with your classmates. Often, I will wait for a verbal response if these questions appear in the middle of a lecture.

Introduction to exploratory data analysis

A major objective of statistical analysis is to create reliable, reproducible, and relevant descriptions of carefully collected data. There are many ways to explore and describe different types of data, and more are yet to be invented.

What makes an exploratory statistical analysis “objective”?

Although exploratory statistical analyses are incredibly powerful in science and life, the best statistical analyses are ones that tell a story. Data alone can’t convince the way stories do.[1] This is worth keeping in mind both when conducting your own exploratory analyses and especially when considering the results of an exploratory analysis by someone else.

“Maybe stories are just data with a soul.”

– Brené Brown

1. How to explore numeric (or quantitative) data?

Recall:

Statistical stories are oriented around organized, observable forms of information… data! A data set consists of rows, which correspond to the individual things being collected as “data points”, i.e. the observational units, and columns, which correspond to the stuff that’s being measured or counted, i.e. the variables.

Today, we’re going to cover a few ways to conduct an exploratory data analysis when the data consist of one or two variables: numeric, categorical, or one of each.

Histogram

What are we looking for in a histogram?

What kind of story can you tell about these data from the histograms above?

Boxplot

What’s the difference between the information contained in this boxplot and the information contained in a histogram?

Stem-and-leaf plot

## 
##   The decimal point is at the |
## 
##   -6 | 4
##   -4 | 52
##   -2 | 94
##   -0 | 8655533322008877766444333332111
##    0 | 0112233444467990001122336678
##    2 | 11578
##    4 | 6

What’s the difference between the information contained in this stem-and-leaf plot and a histogram?

Other considerations for numeric data

Outliers

Transforming or re-expressing data

Is it honest to visualize transformed data?

2. How to explore categorical data?

Poll: How many of you are in your first/second/third/fourth (or more) years at Swarthmore?

Tables of Counts or Proportions

## cur_sec
##  First Second  Third 
##     31      2      1
## cur_sec
##      First     Second      Third 
## 0.91176471 0.05882353 0.02941176
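The two tables above report the same poll twice: first as raw counts, then as proportions (each count divided by the total number of responses). Here is a minimal Python sketch of that calculation. The `responses` list is a hypothetical reconstruction from the counts shown above, and the variable names are illustrative rather than taken from the original analysis, which appears to have been run in R.

```python
from collections import Counter

# Hypothetical reconstruction of the poll responses from the counts above
# (31 first-years, 2 second-years, 1 third-year).
responses = ["First"] * 31 + ["Second"] * 2 + ["Third"] * 1

# Table of counts: how many responses fall in each category.
counts = Counter(responses)

# Table of proportions: each count divided by the total number of responses.
total = sum(counts.values())
proportions = {year: n / total for year, n in counts.items()}

for year in counts:
    print(f"{year:>6}: count = {counts[year]:2d}, proportion = {proportions[year]:.8f}")
```

The proportions always sum to 1, which is a quick sanity check on any table of this kind.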

Bar Chart

Pie Chart

3. How to explore relationships among multiple variables?

By yourself: Take a moment and think about the colloquial meaning of the word “relationship”. For example, how would you define this word in a dictionary?

With a classmate: What does the word “relationship” entail in a statistical or data science context? Jot down your answers to hand in at the end of class.

Scatterplot for two numerical variables

(Figure not shown; see footnote 2 for the source.)

How could we incorporate a third categorical variable for, say, lead or supporting role?

Stacked Boxplot for one numeric and one categorical variable

(Figure not shown; see footnote 3 for the source.)

Contingency Table for Two (or more) Categorical variables

A contingency table is a way to summarize data that consist of two categorical variables. Here is an (in)famous example: the people aboard the ship Titanic, categorized by whether they survived the wreck and by passenger class or crew membership.[4]

Survived  First Class Passenger  Second Class Passenger  Third Class Passenger  Crew Member
Yes                         203                     118                    178          212
No                          122                     167                    528          673

Some questions we could answer from this contingency table include:

  • What is the probability that a passenger was in X class and did not survive?

  • What is the probability that a passenger survived given they were in X class?
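The two questions above differ only in the denominator: the first is a joint probability (cell count over the grand total) and the second is a conditional probability (cell count over the column total for that class). Below is a minimal Python sketch using the counts from the contingency table above, with first class standing in for “X”. The dictionary layout and function names are illustrative assumptions, not part of the original notes.

```python
# Counts copied from the Titanic contingency table above.
table = {
    "Yes": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "No":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

def joint_probability(survived, group):
    """P(survival status AND group): cell count over the grand total."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return table[survived][group] / grand_total

def conditional_probability(survived, group):
    """P(survival status GIVEN group): cell count over that group's column total."""
    column_total = sum(row[group] for row in table.values())
    return table[survived][group] / column_total

# P(first class and did not survive) = 122 / 2201
print(round(joint_probability("No", "First"), 4))        # 0.0554

# P(survived | first class) = 203 / 325
print(round(conditional_probability("Yes", "First"), 4))  # 0.6246
```

Swapping `"First"` for `"Second"`, `"Third"`, or `"Crew"` answers the same questions for the other groups.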

What could evidence of a relationship look like in each of the above examples?

You can explore for yourself some of the vast, creative world of data visualization techniques at https://www.data-to-viz.com/.

4. Reminders about this class


  1. For more on this point, read, for example, “Data Alone Won’t Get You a Standing Ovation” and “Stories are data with Soul” – lessons from black feminist epistemology.

  2. Source: https://www.vulture.com/2012/08/meryl-streep-matrix.html

  3. Source: https://www.r-bloggers.com/2019/12/boss-of-all-plots-box-plots/

  4. Source: https://www.numerade.com/questions/the-contingency-table-below-shows-the-survival-data-for-the-passengers-of-the-titanic-beginarraycccc/