2.5 Model Development


2.5.1 Model Development

  • Data Preparation: the penguins data set 🐧
  • Feature Engineering: new binary variables (e.g., isFemale)
  • Model Development: linear regression comparing different predictors
  • Model Evaluation / Prediction: on the test-set

Data Preparation

  • Load the data
  • Clean the data
  • Be clear about which question you want to answer
  • Make plots to get an understanding of the data
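A minimal sketch of these steps in Python, assuming the Palmer Penguins data set that ships with seaborn (the name penguins_cleaned is reused in later snippets):

import seaborn as sns

# load the Palmer Penguins data set bundled with seaborn
penguins = sns.load_dataset("penguins")

# clean: drop rows with missing values
penguins_cleaned = penguins.dropna().reset_index(drop=True)

# plot all variables against each other to get a first impression
sns.pairplot(penguins_cleaned, hue="species")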

Feature Engineering

  • Create binary variables from categories
  • Normalize and scale variables if necessary
  • Create new features
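A sketch of the binary encoding described above (the column names mirror the predictor list in section 2.5.2; penguins_cleaned comes from the data-preparation step):

import pandas as pd

# binary (0/1) variables derived from the categorical columns
features = pd.DataFrame({
    "isAdelie":      (penguins_cleaned["species"] == "Adelie").astype(int),
    "isGentoo":      (penguins_cleaned["species"] == "Gentoo").astype(int),
    "fromTorgersen": (penguins_cleaned["island"] == "Torgersen").astype(int),
    "fromBiscoe":    (penguins_cleaned["island"] == "Biscoe").astype(int),
    "isFemale":      (penguins_cleaned["sex"] == "Female").astype(int),
})

# combine with the numeric predictors and the target
penguins_encoded = pd.concat(
    [penguins_cleaned[["bill_length_mm", "bill_depth_mm",
                       "flipper_length_mm", "body_mass_g"]], features],
    axis=1,
)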


Model Development

  • Split the data into training and test sets, or decide on another resampling approach
  • Decide which models could work
  • Decide how you want to measure whether a model is good
  • Train models and compare their accuracy on the validation-set
from sklearn.model_selection import train_test_split

train, test = train_test_split(penguins_cleaned, test_size=0.2, random_state=11)

Model Evaluation / Prediction

  • Evaluate the models on an unseen test-set
  • Evaluate the strengths and weaknesses of the best models
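As a sketch, fitting one candidate model and evaluating it on the held-out test-set could look like this (the predictor choice is hypothetical; train and test come from the split above):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# hypothetical predictor choice, for illustration only
predictors = ["flipper_length_mm", "bill_depth_mm"]

model = LinearRegression()
model.fit(train[predictors], train["body_mass_g"])

# evaluate on the unseen test-set
y_pred = model.predict(test[predictors])
print("test-set MSE:", mean_squared_error(test["body_mass_g"], y_pred))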

2.5.2 Resampling Methods

🎯 Learning objectives

You will be able to

  • decide which questions to evaluate based on training-, validation- and test-sets
  • perform LOO and k-fold cross-validation to create models that do not over-fit the training data

🧠 Model & Feature Selection

  • we want to find the best model for the data
  • models can differ in
    • the predictors / features that go into the model
    • the model type (linear regression, decision tree, etc.)
    • the model's capacity to over-fit
    • hyperparameters (parameters that tweak the model)
  • first, we focus on the predictors / features
Finding the best predictors / features
  • We have 6 possible predictor variables in the data set to predict body_mass_g:
    • species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, sex
  • Encoding the categorical variables as binary variables, we have $p = 8$ possible predictors:
    • isAdelie, isGentoo, fromTorgersen, fromBiscoe, bill_length_mm, bill_depth_mm, flipper_length_mm, isFemale
  • from that we can build $2^p = 2^8 = 256$ possible linear models:
    • $\text{body mass g} = \beta_0$
    • $\text{body mass g} = \beta_0 + \beta_1 \times \text{isAdelie}$
    • $\text{body mass g} = \beta_0 + \beta_2 \times \text{isGentoo}$
    • ...
What we found
  • It is hard to predict in advance which model will be best
  • We compared models $M_i$ in different ways
    • Fit on the training set ($R^2$, $RSS$)
    • Prediction accuracy
      • on the training set
      • and on the test-set ($MSE$)
    • Evaluation only on the training set can lead to over-fitting
Over-fitting the Training Data

  • What we really want to know: What will be the best model on new unseen data?
    • We need to select models on a set that is not the training data
    • After selecting a model, we still want to test how well it performs on unseen data
  • validation-set
    • a set on which we compare trained models to find the best features and ML algorithms for the task
🧠 We must divide the data into three parts

https://www.sxcco.com/?category_id=4128701
  • Training-Set: for fitting the models
  • Validation-Set: for selecting the models (e.g., which parameters to include)
  • Test-Set: a hold-out set of data, to prove that we selected a good model for any data

How do we split the data?

  • Often, we do not have much data, but ...
  • if we choose a small training set
    • we only use a small proportion of the data to learn
    • powerful models on sparse data tend to over-fit
  • if we choose a small validation / test-set
    • the set can be very special by chance
    • leading to misleading results
https://www.sxcco.com/?category_id=4128701

🧠 The Validation-Set Approach

  • split the data once
  • 70% / 15% / 15% are common proportions
    data = [3, 7, 13, 15, 22, 25, 50, 91]
    
    training = [3, 7, 13, 50]
    validation = [15, 91]
    test = [22, 25]
    
  • note that the split is random - the training set is not simply the first four values!
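A sketch of such a 70% / 15% / 15% split using two calls to train_test_split (the random_state is an arbitrary choice):

from sklearn.model_selection import train_test_split

# first cut off 15% of all data as the test-set ...
train_val, test = train_test_split(penguins_cleaned, test_size=0.15, random_state=11)

# ... then cut 15% of the total (0.15 / 0.85 of the rest) as the validation-set
train, validation = train_test_split(train_val, test_size=0.15 / 0.85, random_state=11)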
Problems with the Validation-Set Approach

  • test-set $MSE$ vs. the polynomial complexity of the model
  • Left: test-set $MSE$ for one split; Right: test-set $MSE$ for ten different splits
  • Results depend on how we split the data

🧠 Leave-One-Out Cross-Validation (LOO)

  • put only one observation in the validation-set (or test-set)
  • repeat this for all observations
data = [3, 7, 13, 15, 22, 25, 50, 91]

test = [22, 25]

training_1 = [7, 13, 15, 50, 91]
validation_1 = [3]

training_2 = [3, 13, 15, 50, 91]
validation_2 = [7]

...
🧠 Averaging the Cross-Validation results
data = [3, 7, 13, 15, 22, 25, 50, 91]
  • With $n$ observations in the original data set
  • We get $n$ models $M_l$ with different prediction errors on the validation-sets
  • We can still calculate the $MSE$ the same way:
  • $MSE^{CV} = \frac{1}{n}\sum_{l=1}^{n} MSE_l$
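A minimal LOO sketch with scikit-learn (the predictor choice is hypothetical; train comes from the earlier split):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = train[["flipper_length_mm", "bill_depth_mm"]]  # hypothetical predictors
y = train["body_mass_g"]

# one fit per observation; each score is the negative MSE of one left-out point
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")

mse_cv = -scores.mean()  # MSE^{CV}: average over all n validation errors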

🧠 k-fold Cross-Validation

  • instead of $n$ folds of length $1$, the data is split into $k$ folds of equal length
data = [3, 7, 13, 15, 22, 25, 50, 91]

test = [22, 25]

training_fold_1 = [13, 15, 50, 91]
validation_fold_1 = [3, 7]

training_fold_2 = [3, 7, 50, 91]
validation_fold_2 = [13, 15]

...

https://medium.com/coders-mojo/quick-recap-most-important-projects-data-science-machine-learning-programming-tricks-and-c7d99a7a2391
Averaging the Cross-Validation results
  • With $k$ folds
  • we get $k$ different $MSE_l$ values
  • $MSE^{kCV} = \frac{1}{k}\sum_{l=1}^{k} MSE_l$
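The same averaging written out with an explicit KFold loop, as a sketch (X and y as in the LOO snippet):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

fold_mses = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=11).split(X):
    model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[val_idx])
    fold_mses.append(mean_squared_error(y.iloc[val_idx], pred))

mse_kcv = np.mean(fold_mses)  # MSE^{kCV}: average over the k folds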

What to use?

  • Leave-One-Out (LOO) is a special case of k-fold with $k = n$
  • LOO is computationally more expensive
  • LOO has a smaller bias (we use more data for training)
  • LOO has a higher variance (the trained models are all very similar to each other, while the single left-out validation observations differ a lot)
  • 🧠 using k-fold with $k = 5$ or $k = 10$ has been shown empirically to yield good results
🧠 A practical approach to Cross-Validation

https://scikit-learn.org/stable/modules/cross_validation.html
  • Training-Set: used to train the models; 5-fold CV (with the Validation-Set)
  • Validation-Set: used to select the model, the predictors $\vec{X}$ and/or parameters; 5-fold CV (with the Training-Set)
  • Test-Set: used to prove the model's performance; 15% cut out at the beginning
  • note that the data should be shuffled before the split
  • there are some cases where this will not work (e.g., temporal data)
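Putting the practical approach together, as a sketch (penguins_encoded from the feature-engineering step):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# cut out 15% as the final test-set (train_test_split shuffles by default)
train_val, test = train_test_split(penguins_encoded, test_size=0.15, random_state=11)

X = train_val.drop(columns="body_mass_g")
y = train_val["body_mass_g"]

# 5-fold CV on the remaining 85% to compare models / feature sets
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("mean CV MSE:", -scores.mean())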

2.5.3 Feature Selection

  • given we have $2^p = 2^8 = 256$ possible linear models:
    • $\text{body mass g} = \beta_0$
    • $\text{body mass g} = \beta_0 + \beta_1 \cdot \text{isAdelie}$
    • $\text{body mass g} = \beta_0 + \beta_2 \cdot \text{isGentoo}$
    • ...
  • how can we reliably find the best one?
  • we use 5-fold cross-validation to find the best features (predictors) and parameters
Option 1: Best Subset Selection
  • brute force: train all possible models on the training set
  • compare the models' performance on the validation-set
  • select the best model (e.g., based on the validation $MSE$)
  • With 5-fold cross-validation you will train $2^8 \cdot 5 = 1280$ different models
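A brute-force sketch of best subset selection with 5-fold cross-validation (X and y as in the previous snippet, i.e. all eight encoded predictors; the intercept-only null model is skipped for brevity):

from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

candidates = ["isAdelie", "isGentoo", "fromTorgersen", "fromBiscoe",
              "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "isFemale"]

best_subset, best_mse = None, float("inf")
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        # 5-fold CV MSE for the linear model using exactly these predictors
        scores = cross_val_score(LinearRegression(), X[list(subset)], y,
                                 cv=5, scoring="neg_mean_squared_error")
        if -scores.mean() < best_mse:
            best_subset, best_mse = subset, -scores.mean()

print(best_subset, best_mse)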
🤓 Option 2: Forward Selection
  • start with the null model ($\text{body mass g} = \beta_0$)
  • calculate the best model with only one predictor $X_a$
  • calculate the best model containing $X_a$ and one further predictor
  • ...

https://medium.com/codex/what-are-three-approaches-for-variable-selection-and-when-to-use-which-54de12f32464
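A greedy sketch of the forward selection just described (candidates, X and y as in the best-subset snippet):

selected, remaining = [], list(candidates)
while remaining:
    # add the one predictor that yields the lowest CV MSE at this step
    mse, best = min(
        (-cross_val_score(LinearRegression(), X[selected + [c]], y,
                          cv=5, scoring="neg_mean_squared_error").mean(), c)
        for c in remaining
    )
    selected.append(best)
    remaining.remove(best)
    print(f"{len(selected)} predictors, CV MSE {mse:.0f}: {selected}")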
🤓 Option 3: Backward Selection
  • the inverse of forward selection
  • start with all predictors
  • drop the predictor whose removal causes the smallest decrease in model accuracy
🤓 More Options:

🤓 Case Study

Try to find the best possible linear model for predicting the penguin weight using

  • best subset selection and
  • cross-validation
https://www.scoopnest.com/user/AFP/1035147372572102656-do-you-know-your-gentoo-from-your-adelie-penguins-infographic-on-10-of-the-world39s-species-after

5.1 Cross-Validation and Model Selection - Resampling

⌛ 40 minutes


🎯 Learning Summary

Now, you can

  • decide which questions to evaluate based on training-, validation- and test-sets
  • perform LOO and k-fold cross-validation to create models that do not over-fit the training data