2.3 Linear Regression

Bio Data-Science

🎯 Learning objectives

You will be able to

  • implement a linear regression in Python
  • interpret the coefficients and p-values of a regression model
  • describe the accuracy of a model using $R^2$

Motivation

  • Linear Regression is supervised and parametric
  • Interpretation:
    • Is there a relationship between advertising budget and sales?
    • How strong is the relationship between advertising budget and sales?
    • Which media contribute to sales?
    • How accurately can we estimate the effect of each medium on sales?
    • How accurately can we predict future sales?
    • Is the relationship linear?
    • Is there synergy among the advertising media?

🧠 Simple Linear Regression

$$Y \approx \beta_0 + \beta_1 \cdot X_1$$

$$\text{sales} \approx \beta_0 + \beta_1 \cdot \text{TV}$$

  • coefficients or parameters:

    • $\beta_0$ ... intercept
    • $\beta_1$ ... slope
  • We estimate the parameters based on training data

$$Y \approx \hat{\beta}_0 + \hat{\beta}_1 \cdot X_1$$


🧠 Estimating the Coefficients

  • model: $\hat{y}_j = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_{1,j}$
  • prediction error: $e_j = y_j - \hat{y}_j$
  • residual sum of squares (RSS) over all observations
    $RSS = e^2_1 + e^2_2 + \dots + e^2_n$
  • best model: $\min(RSS)$

✍️ What is the relationship between $MSE$ and $RSS$?

  • Mean Squared Error
    $MSE = \frac{1}{n}\sum_{j=1}^n (y_j - \hat{y}_j)^2$
  • Residual Sum of Squares
    $RSS = \sum_{j=1}^n (y_j - \hat{y}_j)^2$
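
Comparing the two definitions suggests that the only difference is the factor $\frac{1}{n}$, so $RSS = n \cdot MSE$. A minimal sketch to check this numerically; the observations and predictions are made-up values, not taken from the lecture data:

```python
# Hypothetical observations and model predictions (illustrative values only).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 10.0]

# Residuals e_j = y_j - y_hat_j
residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]

n = len(y_true)
rss = sum(e ** 2 for e in residuals)  # residual sum of squares
mse = rss / n                         # mean squared error

print(f"RSS = {rss}, MSE = {mse}, n * MSE = {n * mse}")
```

Since $n$ is fixed for a given training set, minimizing $RSS$ and minimizing $MSE$ lead to the same parameters.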

🤓 There is an analytical solution

  • to find the parameters that have the minimal $RSS$ given the training data ($X, Y$)

$$\hat{\beta}_1 = \frac{\sum_{j=1}^n (x_{1,j}-\bar{x})(y_j-\bar{y})}{\sum_{j=1}^n (x_{1,j}-\bar{x})^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

  • $\bar{x} = \frac{1}{n}\sum_{j=1}^n x_{1,j}$
  • $\bar{y} = \frac{1}{n}\sum_{j=1}^n y_j$
You can find the proof in any textbook or on Wikipedia. Later, we will discuss an alternative algorithm for finding the solution.
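
The closed-form solution can be sketched in a few lines of plain Python. The function name `fit_simple_linear_regression` and the five data points are assumptions made for illustration; they are not part of the lecture material.

```python
def fit_simple_linear_regression(x, y):
    """Return (beta_0_hat, beta_1_hat) that minimize the RSS for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    beta_1 = s_xy / s_xx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


# Hypothetical training data (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

beta_0, beta_1 = fit_simple_linear_regression(x, y)
print(f"intercept = {beta_0:.2f}, slope = {beta_1:.2f}")
```

For this toy data the slope comes out at about 1.97 and the intercept at about 1.09.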

Same Data, different Solutions

  • red: true relationship of the simulated data
  • grey: random sample drawn from the true relationship
  • blue: estimated regression lines based on different samples (training sets)
  • How sure can we be about the values of $\beta_0$ and $\beta_1$?
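
The effect can be reproduced with a small simulation: draw several training sets from the same true relationship and fit each one. This is only a sketch; the true coefficients (2 and 3), the noise level, the sample size and the helper name `fit` are all assumed for illustration.

```python
import random


def fit(x, y):
    """Least-squares estimates (beta_0_hat, beta_1_hat) for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    beta_1 = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y)) / s_xx
    return y_bar - beta_1 * x_bar, beta_1


random.seed(0)  # fixed seed so the run is repeatable

# Assumed "true" relationship of the simulated data: Y = 2 + 3 * X + noise
TRUE_BETA_0, TRUE_BETA_1 = 2.0, 3.0

estimates = []
for _ in range(5):  # five different training sets
    x = [random.uniform(-2, 2) for _ in range(100)]
    y = [TRUE_BETA_0 + TRUE_BETA_1 * xj + random.gauss(0, 2) for xj in x]
    estimates.append(fit(x, y))

for b0, b1 in estimates:
    print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

Each training set yields slightly different estimates that scatter around the true values, which is what the blue lines in the figure show.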

Standard Error $SE(\hat{\beta})$, $\sigma(\hat{\beta})$

  • is a measure of the average amount by which an estimate (e.g., $\hat{\beta}_0$) differs from the actual value of $\beta_0$; it depends on the variance $\sigma^2$ in the data and the number of observations $n$ in the sample
  • where $\sigma$ is the standard deviation of each of the realizations $y_j$ of $Y$
  • 🧠 there is approximately a 95% chance that the interval

    $\hat{\beta}_i \pm 2 \cdot SE(\hat{\beta}_i)$

    will contain the true value of $\beta_i$ (given a large enough sample size $n$ and normally distributed errors)
  • This is used as a confidence interval.
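
A minimal sketch of how the standard errors can be computed, assuming the usual textbook formulas for simple linear regression, $SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_j (x_j-\bar{x})^2}$ and $SE(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_j (x_j-\bar{x})^2}\right]$, with $\sigma$ estimated from the residuals. The data are made-up values.

```python
import math

# Hypothetical training data (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
n = len(x)

# Least-squares fit
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xj - x_bar) ** 2 for xj in x)
beta_1 = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y)) / s_xx
beta_0 = y_bar - beta_1 * x_bar

# Estimate sigma from the residuals (n - 2 degrees of freedom)
rss = sum((yj - (beta_0 + beta_1 * xj)) ** 2 for xj, yj in zip(x, y))
sigma_hat = math.sqrt(rss / (n - 2))

# Standard errors of slope and intercept
se_beta_1 = sigma_hat / math.sqrt(s_xx)
se_beta_0 = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / s_xx)

# Rule-of-thumb interval: estimate +/- 2 standard errors
ci_beta_1 = (beta_1 - 2 * se_beta_1, beta_1 + 2 * se_beta_1)

print(f"SE(beta_0) = {se_beta_0:.4f}, SE(beta_1) = {se_beta_1:.4f}")
print(f"approx. interval for beta_1: [{ci_beta_1[0]:.3f}, {ci_beta_1[1]:.3f}]")
```

With only five observations the factor 2 is too optimistic; it is used here only to mirror the rule of thumb, which assumes a large sample.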

Normal Distribution

Example TV advertising data

$$\text{sales} \approx \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{TV}$$

$$\text{sales} \approx 7.0325 + 0.0475 \cdot \text{TV}$$

  • $95\%$ confidence interval for $\beta_0$: $[6.130, 7.935]$

  • $95\%$ confidence interval for $\beta_1$: $[0.042, 0.053]$

  • without any advertising, sales will, on average, fall between $6{,}130$ and $7{,}935$ units.

  • for each increase of 1,000 dollars in TV advertising, there will be an average increase in sales of between $42$ and $53$ units.
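
The intervals can be roughly reproduced with the "estimate $\pm$ 2 standard errors" rule. A small sketch using the estimates and standard errors reported for the TV example ($SE(\hat{\beta}_0) = 0.4578$, $SE(\hat{\beta}_1) = 0.0027$); the variable names are chosen for illustration:

```python
# Estimates and standard errors reported for the TV advertising example.
beta_0_hat, se_beta_0 = 7.0325, 0.4578
beta_1_hat, se_beta_1 = 0.0475, 0.0027

# Rule of thumb: estimate +/- 2 * standard error
ci_beta_0 = (beta_0_hat - 2 * se_beta_0, beta_0_hat + 2 * se_beta_0)
ci_beta_1 = (beta_1_hat - 2 * se_beta_1, beta_1_hat + 2 * se_beta_1)

print(f"beta_0: [{ci_beta_0[0]:.3f}, {ci_beta_0[1]:.3f}]")
print(f"beta_1: [{ci_beta_1[0]:.3f}, {ci_beta_1[1]:.3f}]")
```

This gives roughly $[6.117, 7.948]$ and $[0.042, 0.053]$. The intercept interval is slightly wider than the one quoted above, presumably because the quoted interval uses an exact quantile of the t-distribution that is a little smaller than 2.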


Hypothesis Tests on Parameters

Is there really an influence of advertising on sales?

  • $H_0$: There is no relationship between $X$ and $Y$
    $H_0: \beta_1 = 0$
  • $H_a$: There is some relationship between $X$ and $Y$
    $H_a: \beta_1 \ne 0$

🤓 Is $\hat{\beta}_1$ far enough from $0$ that we can reject $H_0$?

  • From the data we know / can calculate:
    • $\hat{\beta}_1 = 0.0475$
    • $SE(\hat{\beta}_1) = 0.0027$
    • So, we are pretty sure that the real $\beta_1$ is in $[0.042, 0.053]$
  • how sure are we?
    • based on the $SE$, more than $95\%$
    • We can use a t-test to calculate how likely it would be to observe our data if the real $\beta_1$ were $0$:
    • $t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = 17.6$

https://www.geogebra.org/m/b85v7zww, $df = n-p-1$

🤓

  • the probability of observing $t > 17.6$ or $t < -17.6$, given that the real $\beta_1 = 0$, is very low
  • we call this probability the $p$-value
  • a small $p$-value indicates that it is unlikely to observe such a substantial association between the predictor and response due to chance
  • We reject the null hypothesis $H_0: \beta_1 = 0$ (that is, we declare a relationship to exist between $X$ and $Y$) if the $p$-value is small enough (usually below $5\%$)
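
A rough sketch of the calculation for the TV example. As a simplifying assumption, the two-sided $p$-value is approximated with the standard normal distribution, which is close to the t-distribution when the sample is large; an exact value would use the t-distribution with $n-p-1$ degrees of freedom (for example via `scipy.stats.t.sf`).

```python
import math

# Values reported for the TV advertising example.
beta_1_hat = 0.0475
se_beta_1 = 0.0027

# t-statistic for H0: beta_1 = 0
t_stat = (beta_1_hat - 0) / se_beta_1

# Two-sided p-value, P(|Z| > |t|), using the normal approximation (assumption).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.1f}")
print(f"approximate p-value = {p_value:.2e}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```

The approximate $p$-value is many orders of magnitude below $5\%$, so $H_0$ is rejected.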

🧠 Regression Tables

$$\text{sales} \approx \beta_0 + \beta_1 \cdot \text{TV}$$

After fitting the model, the model's parameters can be found in a regression table.

  • $\hat{\beta}_0 = 7.0325$, $SE(\hat{\beta}_0) = 0.4578$
  • $\hat{\beta}_1 = 0.0475$, $SE(\hat{\beta}_1) = 0.0027$
  • as $p < 0.05$, we say: TV ads have a significant positive relationship with sales

✍️ Real Regression Tables

  • what is the predicted variable?
  • how large is the intercept?
  • is there a significant influence of femur length on the predicted variable?
OLS (ordinary least squares) = regular linear regression

Assessing the Accuracy of the Model

Does the model fit the data?


🤓 Residual Standard Error

$$Y = \beta_0 + \beta_1 \cdot X + \epsilon$$

  • There will always be an error $\epsilon$, even if we know $\beta_0$ and $\beta_1$ perfectly
  • Residual Standard Error ($RSE$) is an estimate of the standard deviation of $\epsilon$
  • average amount that the response will deviate from the true regression line

$$RSE = \sqrt{\frac{1}{n-2} RSS} = \sqrt{\frac{1}{n-2}(e^2_1 + e^2_2 + \dots + e^2_n)}$$

$$= \sqrt{\frac{1}{n-2}\sum_{j=1}^n (y_j - \hat{y}_j)^2}$$

  • $RSE = 0$ ... perfect fit
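
A minimal sketch of the $RSE$ calculation for a fitted line. The data and the helper name `residual_standard_error` are assumptions for illustration; the coefficients are the least-squares estimates for this toy data.

```python
import math


def residual_standard_error(x, y, beta_0, beta_1):
    """RSE = sqrt(RSS / (n - 2)) for a simple linear regression."""
    rss = sum((yj - (beta_0 + beta_1 * xj)) ** 2 for xj, yj in zip(x, y))
    return math.sqrt(rss / (len(x) - 2))


# Hypothetical training data and its least-squares fit (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
beta_0, beta_1 = 1.09, 1.97

rse = residual_standard_error(x, y, beta_0, beta_1)
print(f"RSE = {rse:.4f}")
```

The result is expressed in the units of the response $Y$, so its size has to be judged relative to the typical values of $Y$.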

$$RSE_{TV} = 3.260$$

Even if the model were correct and the true values of the unknown coefficients $\beta_0$ and $\beta_1$ were known exactly, any prediction of sales on the basis of TV advertising would still be off by about $3{,}260$ units on average.


🧠 $R^2$ Statistic

  • $RSE$ provides an absolute measure of the fit (e.g., $3{,}260$ units)
  • $R^2$ provides a proportion between $0$ and $1$

$$R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

  • where $TSS = \sum_{j=1}^n (y_j-\bar{y})^2$ is the total sum of squares
  • where $\bar{y}$ is the sample mean
  • $R^2$ measures the proportion of variability in $Y$ that can be explained using $X$ and the linear model (instead of just using the sample mean)
  • $R^2 \approx 0$: regression did not explain much of the variability in the response
  • in the TV example, $R^2 = 0.61$: just under two-thirds of the variability in sales is explained by a linear regression on TV
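
A short sketch of the $R^2$ calculation from $RSS$ and $TSS$. The data and the helper name `r_squared` are assumptions for illustration; the predictions come from the least-squares line for this toy data.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS."""
    y_bar = sum(y_true) / len(y_true)
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    tss = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - rss / tss


# Hypothetical training data and its least-squares fit (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
beta_0, beta_1 = 1.09, 1.97

y_hat = [beta_0 + beta_1 * xj for xj in x]
r2 = r_squared(y, y_hat)
print(f"R^2 = {r2:.4f}")
```

The toy data lie almost exactly on a line, so $R^2$ is close to $1$ here; real data such as the TV example usually give lower values.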

🧠 Interpretation of $R^2$

https://www.datasciencecentral.com/wp-content/uploads/2021/10/2742052271.jpg

🧠 Be careful!

  • $R^2$ is not a good measure of prediction accuracy!
  • two models can have an identical $RSS$ or $MSE$ and still a very different $R^2$
  • this matters especially when $R^2$ is calculated on the test set
https://www.enmanreg.org/r2/r2_vs_cvrmse/
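
The point can be illustrated with two made-up data sets whose predictions have exactly the same residuals, and therefore the same $RSS$ and $MSE$, but whose responses are spread very differently. All values and names below are assumptions for illustration.

```python
def rss(y_true, y_pred):
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    y_bar = sum(y_true) / len(y_true)
    tss = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - rss(y_true, y_pred) / tss


# Same residuals in both cases
residuals = [0.5, -0.5, 0.5, -0.5]

# Case 1: responses cover a narrow range
pred_narrow = [10.0, 11.0, 12.0, 13.0]
y_narrow = [p + e for p, e in zip(pred_narrow, residuals)]

# Case 2: responses cover a wide range
pred_wide = [10.0, 20.0, 30.0, 40.0]
y_wide = [p + e for p, e in zip(pred_wide, residuals)]

rss_narrow, rss_wide = rss(y_narrow, pred_narrow), rss(y_wide, pred_wide)
r2_narrow, r2_wide = r_squared(y_narrow, pred_narrow), r_squared(y_wide, pred_wide)

print(f"narrow: RSS = {rss_narrow}, R^2 = {r2_narrow:.3f}")
print(f"wide:   RSS = {rss_wide}, R^2 = {r2_wide:.3f}")
```

Both cases are off by the same amount per prediction, yet $R^2$ is $0.75$ in the first case and above $0.99$ in the second, because $TSS$ differs.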

✍️ Case Study

https://ocean.si.edu/ocean-life/seabirds/penguins
  • Is there any relationship between two variables (for any species and penguin) where you do not expect a linear correlation?
  • If so (or not), is the model a good predictor?

3 Linear Regression

⌛ 35 minutes
