2.4 Multiple Regression

Bio Data-Science

2.4.1 Multiple Linear Regression

🎯 Learning objectives

You will be able to

  • implement and interpret Linear Regression models with multiple independent variables
  • calculate the VIF to detect and remove collinear predictors
  • name different approaches for variable selection

Need for Multiple Regression

How do the total sales relate to the three different advertising budgets?

$$\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$$

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_p X_p$$
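A minimal sketch of how such a model could be fitted by least squares. The budgets, the "true" coefficients, and the noise level are made-up assumptions for illustration; they are not taken from the real advertising data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic advertising budgets (made-up values, not the real data set)
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)

# Assumed "true" coefficients: intercept, TV, radio, newspaper
true_beta = np.array([3.0, 0.05, 0.2, 0.0])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), tv, radio, newspaper])
sales = X @ true_beta + rng.normal(0, 1.0, n)

# Least-squares estimates of the coefficients
beta_hat, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Fitted values: y_hat = beta_hat_0 + beta_hat_1 * X_1 + ... + beta_hat_p * X_p
sales_hat = X @ beta_hat

for name, b in zip(["intercept", "TV", "radio", "newspaper"], beta_hat):
    print(f"{name:>10}: {b: .3f}")
```

Each estimated slope describes the expected change in sales for one extra unit of that budget while the other budgets are held fixed.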


🧠 Interpreting Multiple Regression

  • in the multiple regression, only newspaper ads show no significant association with sales
Take care when comparing different models

  • If we only consider newspaper ads (simple regression), they show a significant correlation with sales
  • the simple regression ignores the other two media when estimating its regression coefficient
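The effect can be reproduced with a small simulation. In this sketch the newspaper budget has no direct effect on sales but is correlated with the radio budget; all numbers are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

radio = rng.normal(25, 10, n)
newspaper = 0.8 * radio + rng.normal(0, 6, n)     # newspaper budget tracks radio budget
sales = 5.0 + 0.2 * radio + rng.normal(0, 1, n)   # newspaper has no direct effect

def ols(X, y):
    """Least-squares coefficients (intercept first) for predictors X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

b_simple = ols(newspaper, sales)                            # sales ~ newspaper
b_multi = ols(np.column_stack([radio, newspaper]), sales)   # sales ~ radio + newspaper

print("newspaper slope, simple regression:  ", round(b_simple[1], 3))
print("newspaper slope, multiple regression:", round(b_multi[2], 3))
```

In the simple regression, newspaper acts as a stand-in for radio and receives a clearly positive slope; once radio is included, the newspaper slope shrinks towards zero.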
Visualization


2.4.2 Variable Selection

  • which predictors should we include in the model?
    • $\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$
    • $\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \epsilon$
    • $\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_3 \times \text{newspaper} + \epsilon$
    • $\text{sales} = \beta_0 + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$
    • $\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \epsilon$
    • $\text{sales} = \beta_0 + \beta_2 \times \text{radio} + \epsilon$
    • $\text{sales} = \beta_0 + \beta_3 \times \text{newspaper} + \epsilon$
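The candidate models can also be enumerated programmatically. This short sketch lists every subset of the three predictors; it additionally includes the intercept-only model (written as `sales ~ 1`), which the list above leaves out.

```python
from itertools import combinations

predictors = ["TV", "radio", "newspaper"]

# All subsets of the predictors, from the empty set up to the full set
subsets = [
    combo
    for k in range(len(predictors) + 1)
    for combo in combinations(predictors, k)
]

for combo in subsets:
    print("sales ~ " + (" + ".join(combo) if combo else "1"))

print(len(subsets), "candidate models")
```

With three predictors this gives eight candidate models; the count doubles with every additional predictor.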

🧠 Model Selection

  • there are $2^p$ models that contain subsets of $p$ variables
  • there are different criteria for model quality ($MSE$, $R^2$, ...)
  • Approaches:
    • Forward selection: Start with the null model (only intercept $\beta_0$) and add variables
    • Backward selection: Start with all variables and remove variables
  • More in the next session
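A rough sketch of forward selection on synthetic data. The stopping rule (stop when the in-sample MSE improves by less than 5 %) and the helper names `fit_mse` and `forward_selection` are assumptions made for this illustration; in practice the criterion is usually evaluated on held-out data or with an adjusted measure.

```python
import numpy as np

def fit_mse(X, y):
    """In-sample MSE of a least-squares fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return float(np.mean((y - X1 @ beta) ** 2))

def forward_selection(X, y, names, min_rel_gain=0.05):
    """Greedily add the predictor that lowers the MSE the most."""
    selected, remaining = [], list(range(X.shape[1]))
    best_mse = float(np.var(y))  # null model: intercept only
    while remaining:
        scores = {j: fit_mse(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if best_mse - scores[j_best] < min_rel_gain * best_mse:
            break  # improvement too small, stop adding variables
        selected.append(j_best)
        remaining.remove(j_best)
        best_mse = scores[j_best]
    return [names[j] for j in selected]

# Synthetic data: newspaper (third column) has no effect on sales
rng = np.random.default_rng(2)
n = 500
X = np.column_stack([
    rng.uniform(0, 300, n),   # TV
    rng.uniform(0, 50, n),    # radio
    rng.uniform(0, 100, n),   # newspaper
])
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.0, n)

chosen = forward_selection(X, y, ["TV", "radio", "newspaper"])
print(chosen)
```

Backward selection works the same way in reverse: start from the full model and repeatedly drop the variable whose removal hurts the criterion the least.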

🤓 Collinearity

  • two or more predictor variables are closely related to one another
  • makes it difficult to separate out the individual effects of the predictors
  • reduces the accuracy of the estimates of the regression coefficients
  • You can spot collinear predictors in scatter plot matrices
https://medium.com/analytics-vidhya/new-aspects-to-consider-while-moving-from-simple-linear-regression-to-multiple-linear-regression-dad06b3449ff
🤓 Variance inflation factor (VIF)
  • Even better: the VIF measures how much predictor $j$ contributes to the collinearity problem

    $$VIF_j = \frac{1}{1 - R_j^2}$$

    where $R_j^2$ is the $R^2$ obtained by regressing predictor $j$ on all other predictors

  • The smallest possible value for VIF is 1,
    which indicates the complete absence
    of collinearity.

  • $VIF > 5$ or $10$ indicates a problematic amount of collinearity.

  • What to do:

    • Check the VIF for all possible predictors
    • drop the predictor with the highest $VIF$ if it exceeds the threshold, then recompute
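The VIF formula can be computed directly from its definition. The sketch below uses a hypothetical helper `vif` and synthetic predictors in which `x2` is almost a copy of `x1`; both are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on all other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(4)
n = 300
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.3, n)   # nearly collinear with x1
x3 = rng.normal(0, 1, n)          # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
for name, v in zip(["x1", "x2", "x3"], vifs):
    print(f"VIF({name}) = {v:.2f}")
```

The two related predictors receive a high VIF, while the independent one stays close to 1.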

Case Study

To predict the flipper length not only from the bill length but also from the bill depth, we need to define a model with multiple predictors.

https://www.scoopnest.com/user/AFP/1035147372572102656-do-you-know-your-gentoo-from-your-adelie-penguins-infographic-on-10-of-the-world39s-species-after

✍️ Case Study

4.1 Multiple Linear Regression

⌛ 25 minutes


2.4.3 Qualitative Predictors and further Extensions

🎯 Learning objectives

You will be able to

  • integrate qualitative predictors in regression models
  • model non-linear relationships with linear regression
  • name common problems when working with linear regression

🧠 Qualitative Predictors

  • so far the models only included quantitative (interval scaled) predictors

    $\text{body height} = \beta_0 + \beta_1 \cdot \text{femur length} + \epsilon$

  • how do we deal with qualitative data?

    $\text{body height} = \beta_0 + \beta_1 \cdot \text{femur length} + \beta_2 \cdot \text{sex} + \epsilon$

🧠 Predictors with only Two Levels
  • $x_{1,j}$: femur length of person $j$

  • $x_{2,j} = \begin{cases} 1 & \text{if the $j$th person is female} \\ 0 & \text{if the $j$th person is male} \end{cases}$

$$y_j = \beta_0 + \beta_1 x_{1,j} + \beta_2 x_{2,j} + \epsilon = \begin{cases} \beta_0 + \beta_1 x_{1,j} + \beta_2 + \epsilon & \text{if the $j$th person is female} \\ \beta_0 + \beta_1 x_{1,j} + \epsilon & \text{if the $j$th person is male} \end{cases}$$

🧠 Interpretation of $\beta_2$

$$y_j = \beta_0 + \beta_1 x_{1,j} + \beta_2 x_{2,j} + \epsilon = \begin{cases} \beta_0 + \beta_1 x_{1,j} + \beta_2 + \epsilon & \text{if the $j$th person is female} \\ \beta_0 + \beta_1 x_{1,j} + \epsilon & \text{if the $j$th person is male} \end{cases}$$

  • if $\beta_2$ is negative:
    • a woman is predicted to be shorter than a man with the same femur length
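A small sketch of this interpretation on synthetic data. The coefficients used to generate the heights (including the offset of -5 for women) are invented for the example and carry no anatomical meaning.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

femur = rng.normal(45, 4, n)          # x_1: femur length
is_female = rng.integers(0, 2, n)     # x_2: 1 = female, 0 = male
# Assumed data-generating model with a negative offset for women
height = 60 + 2.4 * femur - 5.0 * is_female + rng.normal(0, 3, n)

X = np.column_stack([np.ones(n), femur, is_female])
b0, b1, b2 = np.linalg.lstsq(X, height, rcond=None)[0]

# Predictions for a man and a woman with the same femur length
femur_length = 45.0
pred_male = b0 + b1 * femur_length
pred_female = b0 + b1 * femur_length + b2

print(f"beta_2 = {b2:.2f}")
print(f"difference female - male = {pred_female - pred_male:.2f}")
```

The two groups share the slope $\beta_1$; $\beta_2$ only shifts the regression line for women up or down.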

🧠 Qualitative Predictors with more than two Levels

  • if the data contains the following ethnicities: Asian, Caucasian, African American
  • we would need two additional variables:

$$x_{3,j} = \begin{cases} 1 & \text{if the $j$th person is Asian} \\ 0 & \text{if the $j$th person is not Asian} \end{cases}$$

$$x_{4,j} = \begin{cases} 1 & \text{if the $j$th person is Caucasian} \\ 0 & \text{if the $j$th person is not Caucasian} \end{cases}$$

  • in the model's baseline, the person is African American, so $x_3 = x_4 = 0$

Example: Input Features that are not numerical

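| id  | body height | femur length | sex    | ethnicity        |
|-----|-------------|--------------|--------|------------------|
| 1   | 185         | 49           | male   | caucasian        |
| 2   | 176         | 43           | female | asian            |
| 3   | 179         | 33           | female | african american |
| ... | ...         | ...          | ...    | ...              |

  • most models can only work with numerical variables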
Dummy Encoding
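| id  | body height | femur length | is_female | is_asian | is_caucasian |
|-----|-------------|--------------|-----------|----------|--------------|
| 1   | 185         | 49           | 0         | 0        | 1            |
| 2   | 176         | 43           | 1         | 1        | 0            |
| 3   | 179         | 33           | 1         | 0        | 0            |
| ... | ...         | ...          | ...       | ...      | ...          |

  • $n-1$ new variables for an original variable with $n$ categories
  • there is a baseline (e.g., male and African American), which is encoded as all zeros
  • common in linear regression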
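A plain-Python sketch of dummy encoding for the three example rows. The helper `dummy_encode` is a hypothetical name introduced here; libraries such as pandas offer comparable functionality.

```python
rows = [
    {"id": 1, "sex": "male", "ethnicity": "caucasian"},
    {"id": 2, "sex": "female", "ethnicity": "asian"},
    {"id": 3, "sex": "female", "ethnicity": "african american"},
]

def dummy_encode(value, levels, baseline):
    """Return n-1 indicator columns; the baseline level maps to all zeros."""
    return {
        "is_" + level.replace(" ", "_"): int(value == level)
        for level in levels
        if level != baseline
    }

encoded = []
for row in rows:
    new_row = {"id": row["id"]}
    new_row.update(dummy_encode(row["sex"], ["female", "male"], baseline="male"))
    new_row.update(
        dummy_encode(
            row["ethnicity"],
            ["asian", "caucasian", "african american"],
            baseline="african american",
        )
    )
    encoded.append(new_row)

for new_row in encoded:
    print(new_row)
```

Person 3 belongs to the ethnicity baseline, so both ethnicity indicators are zero.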
One-Hot Encoding
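| id  | body height | femur length | is_female | is_male | is_asian | is_caucasian | is_african_american |
|-----|-------------|--------------|-----------|---------|----------|--------------|---------------------|
| 1   | 185         | 49           | 0         | 1       | 0        | 1            | 0                   |
| 2   | 176         | 43           | 1         | 0       | 1        | 0            | 0                   |
| 3   | 179         | 33           | 1         | 0       | 0        | 0            | 1                   |
| ... | ...         | ...          | ...       | ...     | ...      | ...          | ...                 |

  • $n$ new variables for an original variable with $n$ categories
  • common in other machine learning models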
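One reason dummy encoding is preferred for linear regression with an intercept: the $n$ one-hot columns always sum to the intercept column, so the design matrix becomes perfectly collinear. The following sketch checks this with the matrix rank on a small made-up sample.

```python
import numpy as np

ethnicity = np.array([
    "caucasian", "asian", "african american",
    "asian", "caucasian", "african american",
])
levels = np.array(["asian", "caucasian", "african american"])

# One indicator column per level (one-hot encoding, n = 3 columns)
one_hot = (ethnicity[:, None] == levels[None, :]).astype(float)
intercept = np.ones((len(ethnicity), 1))

X_one_hot = np.hstack([intercept, one_hot])        # intercept + 3 indicators
X_dummy = np.hstack([intercept, one_hot[:, :2]])   # intercept + 2 indicators, baseline dropped

rank_one_hot = np.linalg.matrix_rank(X_one_hot)
rank_dummy = np.linalg.matrix_rank(X_dummy)

print(f"one-hot: {X_one_hot.shape[1]} columns, rank {rank_one_hot}")
print(f"dummy:   {X_dummy.shape[1]} columns, rank {rank_dummy}")
```

With one-hot encoding plus intercept the rank is lower than the number of columns, so the least-squares coefficients are not unique; dropping the baseline column removes the redundancy.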
🧠 Example of a Regression Table with Dummy Encoding

  • the only predictor in $X$ is ethnicity (three classes)
  • the baseline is African American; the intercept (the predicted value of $Y$ for the baseline) is $531$
  • for both other ethnicities, the model predicts a lower value
  • however, the differences to the baseline are not significant ($p > 0.05$)

✍️ Case Study

4.2 Applications of Linear Regression


🤓 2.4.4 Extensions of the Linear Model

  • Linear Regression is a simple, yet powerful tool
    • it's the go-to tool in research to evaluate correlation in data
    • it's commonly used as a baseline for more sophisticated models
  • There are two common extensions that make it even more powerful

🤓 Removing the Additive Assumption

In reality, is it better to spend the whole marketing budget on TV or to split it?

  • this model is purely additive

    $\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$

  • we assume the effects on the sales to be independent
  • in reality, we expect synergies (e.g., when people see the ad twice)
🤓 Removing the Additive Assumption by Adding Interaction Terms
  • We can model this using an interaction term ($\text{TV} \times \text{radio}$)

    $\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \beta_4 \times \text{TV} \times \text{radio} + \epsilon$

  • Interpretation: if $\beta_4 > 0$, there is an additional positive effect when we increase both
    (TV and radio ads)
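An interaction term is just one more column in the design matrix: the product of the two predictors. The sketch below fits such a model on synthetic data (newspaper is left out for brevity); the generating coefficients and the helper `tv_effect` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
# Assumed data-generating model with a positive synergy between TV and radio
sales = 6.0 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(0, 1.0, n)

# Columns: intercept, TV, radio, TV x radio
X = np.column_stack([np.ones(n), tv, radio, tv * radio])
beta_hat, *_ = np.linalg.lstsq(X, sales, rcond=None)

def tv_effect(radio_budget):
    """Estimated change in sales per extra unit of TV at a given radio budget."""
    return beta_hat[1] + beta_hat[3] * radio_budget

print(f"interaction coefficient: {beta_hat[3]:.5f}")
print(f"TV effect at radio = 0:  {tv_effect(0):.4f}")
print(f"TV effect at radio = 40: {tv_effect(40):.4f}")
```

With an interaction term, the effect of TV is no longer a single number: it grows with the radio budget whenever the interaction coefficient is positive.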
🤓 Interactions with qualitative Predictors

Prediction of bank account balance from income

  • Left Model:

    $\text{balance} = \beta_0 + \beta_1 \times \text{income} + \beta_2 \times \text{is student}$

  • Right Model:

    $\text{balance} = \beta_0 + \beta_1 \times \text{income} + \beta_2 \times \text{is student} + \beta_3 \times \text{is student} \times \text{income}$
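In the second model, students and non-students get different intercepts and different slopes. A sketch on synthetic data; the income range, the generating coefficients, and the noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400

income = rng.uniform(10, 150, n)      # hypothetical income values
is_student = rng.integers(0, 2, n)    # 1 = student, 0 = non-student
# Assumed data-generating model: students start higher but have a flatter slope
balance = (
    200 + 6.0 * income + 380 * is_student
    - 2.0 * is_student * income + rng.normal(0, 50, n)
)

# Columns: intercept, income, is_student, is_student x income
X = np.column_stack([np.ones(n), income, is_student, is_student * income])
b0, b1, b2, b3 = np.linalg.lstsq(X, balance, rcond=None)[0]

slope_non_student = b1        # slope of the line for non-students
slope_student = b1 + b3       # slope of the line for students

print(f"non-students: intercept {b0:.1f}, slope {slope_non_student:.2f}")
print(f"students:     intercept {b0 + b2:.1f}, slope {slope_student:.2f}")
```

Without the interaction term ($\beta_3 = 0$), the two lines would be forced to be parallel.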


🤓 Modeling non-linear Relationships with Linear Models

A purely linear model is not a good predictor for this relationship

🤓 Linear Regression Models can capture non-linear Relationships
  • polynomial regression:

    $\text{mpg} = \beta_0 + \beta_1 \times \text{horsepower} + \beta_2 \times \text{horsepower}^2 + \epsilon$

  • Drawbacks
    • we have to choose the degree of the polynomial (parametric model)
    • it is harder to interpret
    • there are more flexible models
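Polynomial regression is still a linear model, because it is linear in the coefficients: the squared predictor simply enters as an additional column. A sketch on synthetic data; the generating curve and noise level are assumptions, not estimates from a real car data set.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300

horsepower = rng.uniform(50, 230, n)
# Assumed curved relationship between horsepower and mpg
mpg = 57.0 - 0.47 * horsepower + 0.0012 * horsepower**2 + rng.normal(0, 4.0, n)

def fit(X, y):
    """Least-squares fit; returns the coefficients and the in-sample MSE."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, float(np.mean((y - X @ beta) ** 2))

ones = np.ones(n)
X_lin = np.column_stack([ones, horsepower])                    # degree 1
X_quad = np.column_stack([ones, horsepower, horsepower**2])    # degree 2

beta_lin, mse_lin = fit(X_lin, mpg)
beta_quad, mse_quad = fit(X_quad, mpg)

print(f"MSE linear:    {mse_lin:.2f}")
print(f"MSE quadratic: {mse_quad:.2f}")
```

On the training data a higher degree never increases the MSE, so the degree should be chosen with held-out data rather than with the in-sample error.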

✍️ Case Study: Data Science Project with Linear Regression


Imagine that, while you are researching penguins in the Antarctic, your scale freezes and breaks. However, you still want to get an estimate of the penguins' weight.

Luckily, your colleagues left you a data set (train) with all the other variables you can measure.

Create a multiple linear regression model to predict the penguins' weight. Compare the Mean Squared Error of your model on the test set with the results of your colleagues.

https://www.scoopnest.com/user/AFP/1035147372572102656-do-you-know-your-gentoo-from-your-adelie-penguins-infographic-on-10-of-the-world39s-species-after

4.3 Case Study

⌛ 45 minutes
