Simple Linear Regression (SLR)

class: center, middle, inverse, title-slide

# Simple Linear Regression (SLR)
### STAT 021 with Prof Suzy
### Swarthmore College

---

.scroll-output {
  height: 70%;
  overflow-y: scroll;
}

.scroll-small {
  height: 30%;
  overflow-y: scroll;
}
   
.red{color: #ce151e;}
.green{color: #26b421;}
.blue{color: #426EF0;}
</style>

## A simple statistical model

`$$Y\mid x = f(x) + \epsilon$$`

- `$f$` is a smooth function.
  
  - In linear regression, we consider functions with linear coefficients. These coefficients are our model parameters. (I.e. `$f$` is just the equation for a line.)

- `$x$` is a fixed/known covariate.

- `$\epsilon$` is some random measurement error. Note that, against convention, even though this is a Greek letter, `$\epsilon$` represents a **random variable**!

---
## Simple linear regression

Statistical convention represents a regression line with a intercept, `$\beta_0$`, and a slope `$\beta_1$` so that we have the following **simple linear regression model**:
`$$Y\mid x = \beta_0 + \beta_1 x + \epsilon.$$`

- `$Y$` is the response (output) variable. We assume that there is some random error associated with our observations of `$Y$`.
- `$x$` is a predictor (explanatory, covariate, input) variable. We assume there is **no** random error associated with `$x$`, i.e. that all values of `$x$` are fixed, so it's not a random variable.  
- The behavior of `$Y$` is modeled conditional upon the predictor `$x$`.
- `$\beta_{0}$`, `$\beta_{1}$` are the regression model coefficients (the intercept and slope, respectively).

***

Compare this to the the typical algebraic notation for the equation of a line: 
`$$Y = ax + b.$$`

For more information on interpreting negative intercept values <a href="https://statisticsbyjim.com/regression/interpret-constant-y-intercept-regression/">go here</a>.

---
## Simple linear regression

`$$Y \mid x = \beta_0 + \beta_1 x + \epsilon.$$`

It's called a linear model because `$f$` is linear with respect to the coefficients `$\beta_{i}$`, for `$i=1,2$`.

**Question:** Which of the following are linear models?

1. `$Y = \beta_{0} + \beta_{1}x^2 + \epsilon$`

1. `$Y = \sqrt{\beta_{0} + \beta_{1}x} + \epsilon$`

---
## Simple linear regression

`$$Y \mid x = \beta_0 + \beta_1 x + \epsilon.$$`

It's called a linear model because `$f$` is linear with respect to the coefficients `$\beta_{i}$`, for `$i=1,2$`.

**Question:** Which of the following are linear models?

1. `$Y = \beta_{0} + \beta_{1}x^2 + \epsilon$` (this is!)

1. `$Y = \sqrt{\beta_{0} + \beta_{1}x} + \epsilon$` (not this!)

---
## Simple linear regression

`$$Y \mid x = \beta_0 + \beta_1 x + \epsilon.$$`

For now, we are only going to consider the case where `$x$` and `$Y$` both represent **quantitative, continuous** random variables.

We will be generalizing this SLR (simple linear regression) model to cases where

- X is a discrete and quantitative variable; 
  
  - X is a categorical variable (ANOVA);

- We have more than just one predictor variable (MLR);

- Y is a binary variable (logistic regression) - time permitting.

---
## Simple linear regression

In SLR, the data we observe are pairs `$(x_{1},y_{1}), \dots, (x_{n},y_{n})$`, of continuous, quantitative variables.

The model `$Y \mid x = \beta_0 + \beta_1 x + \epsilon$` means that we are assuming 
`$$y_{i} = \beta_0 + \beta_1 x_{i} + \epsilon_{i},$$`
for each data point we observe where `$\epsilon_{i}$` represents an (unobserved) measurement error associated with our response variable.

---
## Simple linear regression

`$$Y \mid x = \beta_0 + \beta_1 x + \epsilon$$`

**Assumptions**

- For estimation: The measurement error has mean `$E[\epsilon]=0$` and (unknown) variance `$Var[\epsilon]=\sigma^2$` and all measurement errors are independent of each other.

- For inference: If we wish to conduct statistical inference, we must also assume that the measurement error, `$\epsilon$`, follows a standard normal distribution.

--
**Question:** What do theses assumptions imply about `$Y$`?

--
**Another question:** What if there was no random error in our observations of `$Y$`? How do we find the line of best fit in this case?