2.2 Statistical Learning

Bio Data-Science

2.2.1 Basics of Statistical Learning

🎯 Learning objectives

You will be able to

  • categorize problems as supervised or unsupervised learning
  • differentiate problems of prediction and interpretation
  • explain the purpose of a training and test set

🧠 Definitions

  • Data Science or Statistical Learning refers to a vast set of tools for understanding data
  • supervised statistical learning involves building a statistical model for predicting,
    or estimating, an output based on one or more inputs
https://www.finbridge.de/data-science
Supervised Example: Wage Data (Inputs and Output)

  • How does the output $Y$ (wage) relate to the inputs $\vec{X}$ (age, year, education)?
[Hastie et al.]

A Brief History of Statistical Learning

  • 1800s Legendre and Gauss: Linear Regression
  • 1970s Generalized linear models (fitting fully non-linear
    relationships was still computationally infeasible)
  • 1980s Classification and
    regression trees
  • 2000s Machine Learning
https://zakharus.medium.com/data-science-timeline-305ef75dceb6
1800s Legendre and Gauss: Linear Regression
  • describing the relation between two variables with the best-fitting curve
1870-1970s Classical Statistics
  • formalized statistical tests
  • e.g., the t-test, introduced in William Sealy Gosset's 1908 paper in Biometrika, written while he was working for the Guinness Brewery
2000s Machine Learning
  • complex data points (sentences, images, DNA sequences)
    instead of simple numeric variables
  • models with "uncountable" parameters, fitted to the data with sheer computing power
https://www.medrxiv.org/content/10.1101/2020.07.17.20155150v1.full
2010s "Artificial Intelligence"

  • deep neural networks (efficient matrix multiplication on GPUs)
  • Generative Adversarial Networks that learn to produce realistic data (e.g., images)
  • Reinforcement Learning
https://openai.com/dall-e-2/

Structure of Supervised Learning Problems

  • a model ($f$) describes how one or more predictors ($\vec{X}$)
    relate to a predicted variable $Y$: $f(\vec{X}) = Y$
| $\vec{X}$ | $Y$ | $f$ |
|---|---|---|
| Speed | Heart rate | Linear regression |
| Expression values of different genes | Diagnosis of a disease | kNN classifier |
| Input text | Pixels in a picture | Deep learning |

🧠 Nomenclature

  • $\vec{X}$ input variables (predictors, independent variables, features)
    • e.g., age, years on the job, education
  • $Y$ output variable (response or predicted/dependent variable)
    • e.g., income
We use capital letters when we refer to random variables. [Hastie et al.]
| id | $Y$: Income | $X_1$: Years of Education | $X_2$: Seniority | $X_3$: Age |
|---|---|---|---|---|
| 1 | 45000 | 10 | 7 | 34 |
| 2 | 50000 | 20 | 5 | 63 |
| ... | ... | ... | ... | ... |
  • We assume that there is some relationship between $Y$ and $p$ different predictors, $\vec{X} = (X_1, X_2, \ldots, X_p)$:

$Y = f(\vec{X}) + \epsilon$

  • $f$ is a fixed but unknown function of $X_1, \ldots, X_p$
  • $\epsilon$ is a random error term, which is independent of $\vec{X}$ and has mean zero
  • $f$ represents the systematic information that $\vec{X}$ provides about $Y$ (see the simulation sketch below)
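To make this concrete, here is a minimal simulation sketch of $Y = f(\vec{X}) + \epsilon$; the true function $f(x) = 2 + 3x$ and the noise level are made-up assumptions for illustration, not part of the slides.

```python
# A minimal simulation of Y = f(X) + epsilon (assumed toy setup).
import numpy as np

rng = np.random.default_rng(42)

n = 100
x = rng.uniform(0, 10, size=n)        # one predictor X
f_x = 2 + 3 * x                       # systematic part f(X) (assumed)
epsilon = rng.normal(0, 2, size=n)    # random error: mean zero, independent of X
y = f_x + epsilon                     # observed response Y

print(y[:5])
```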

Applications

$\text{income} = \beta_0 + \beta_1 \cdot \text{education} + \beta_2 \cdot \text{seniority} + \beta_3 \cdot \text{age}$

  • Why estimate $f$?
  • we build models to
    • predict the future
      (e.g., what salary should we offer our next employee?)
    • understand the world (interpretation)
      (e.g., should they do a master's degree?)
Prediction
  • We want to make a prediction about
    • the future - stock prices
    • an unknown property - does the patient have cancer?
$\hat{Y} = \hat{f}(\vec{X})$

    • a hat symbol ($\hat{\phantom{x}}$) indicates an estimate or prediction
    • $\hat{f}$ is often a black box; in machine learning, the model is fitted to the data purely to make predictions
    • we will typically make a prediction error $\epsilon$
    • we are interested in the accuracy of $\hat{Y}$

Prediction Example

  • Can we predict income based on
    education and years of seniority on the job?
| id | $Y$: Income | $X_1$: Years of Education | $X_2$: Seniority |
|---|---|---|---|
| 1 | 45000 | 10 | 7 |
| 2 | 50000 | 20 | 5 |
| ... | ... | ... | ... |
  • the model $\hat{f}$ is the blue surface
  • we can plug in education ($X_1$) and seniority ($X_2$)
  • to get a prediction of the income ($\hat{Y}$)
  • we do not care what $\hat{f}$ looks like as long as we make small errors $\epsilon$ for each observation (red dot)
[Hastie et al.]
Interpretation
  • how is $Y$ affected as $X_1, \ldots, X_p$ change?
  • $\hat{f}$ must not be a black box, because we need to know its exact form
  • we are interested in the parameters $\beta_0, \beta_1, \ldots, \beta_p$ of the model
  • How strong is the relationship between the response and each predictor?
    • How much more will a person earn for each year of education?
  • Which predictors are associated with the response?
    • Does seniority have any impact at all?
  • Can the relationship between YY and each predictor be adequately summarized
    using a linear equation, or is the relationship more complicated?
    • Will the income increase in a stable way?
    • Is there a safe level of alcohol consumption?
✍️ Interpretation or prediction?
  • 🟥: prediction
  • 🟨: interpretation
  • What's the gas price next summer?
  • Do students in the front row get better grades?
  • Given an x-ray picture, does the patient have lung cancer?
  • What has the larger influence on weight: sugar or fat intake?

🧠 How do we estimate $f$?

  • training data ($n$ observations of our sample, rows in a table) to train, or teach, our method how to estimate $f$
    (e.g., income based on education and seniority)
  • $x_{i,j}$
    • the value of the $j$th of $p$ predictors
    • for observation $i$ of $n$ observations
    • where $j = 1, 2, \ldots, p$ and $i = 1, 2, \ldots, n$
  • $\vec{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,p})^T$
    • we use the small $\vec{x}_i$ to indicate the vector of all predictors of observation $i$ (i.e., a row)
  • $X_j = (x_{1,j}, x_{2,j}, \ldots, x_{n,j})$
    • we use the capital $X_j$ to indicate the vector of the $n$ values of the $j$th predictor (i.e., a column)
  • $y_i$ response variable for the $i$th observation
  • then our training data consist of
    $\{(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_n, y_n)\}$
  • our goal is to apply a statistical learning method to the training data
    in order to estimate the unknown function $f$,
  • so that $y_i \approx \hat{f}(\vec{x}_i)$ for any observation $i$
  • there are parametric and non-parametric methods (compare the blue surfaces in the prediction example)

2.2.2 Parametric and Non-parametric Models

🎯 Learning objectives

You will be able to

  • differentiate parametric and non-parametric models
  • calculate accuracy measures of regression models
  • interpret in-sample (training-set) and out-of-sample (test-set) accuracy in terms of model flexibility, bias, variance, and over-fitting

🧠 Parametric Methods

  • Step 1: assumption about the functional form or shape of $f$

    • e.g., a simple linear form
      $f(\vec{X}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p$
  • Step 2: training data to fit the model

    • we need to estimate the parameters $\beta_0, \beta_1, \ldots, \beta_p$
    • find values of these parameters such that
      $Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p$

($X_i$: variable $i$)

Parametric model of income explained by years of education and years on the job (seniority)

$\text{income} \approx \beta_0 + \beta_1 \cdot \text{education} + \beta_2 \cdot \text{seniority}$


Matrix Representation for Linear Models

  • $x_{i,j}$ … value of predictor $j$ of observation $i$
  • $X_0 = x_{:,0}$ … all entries are $1$, so $x_{i,0} \cdot \beta_0 = \beta_0$ is the bias term
    (i.e., the intercept of the linear regression model)

$\text{income} \approx \beta_0 + \beta_1 \cdot \text{education} + \beta_2 \cdot \text{seniority}$

$y = \begin{bmatrix} 40000 \\ 50000 \\ 60000 \\ 40000 \\ 90000 \end{bmatrix} \qquad X = \begin{bmatrix} 1 & 10 & 1 \\ 1 & 13 & 4 \\ 1 & 16 & 5 \\ 1 & 20 & 15 \\ 1 & 13 & 20 \end{bmatrix}$


$y = X \cdot \beta = \begin{bmatrix} 1 & 10 & 1 \\ 1 & 13 & 4 \\ 1 & 16 & 5 \\ 1 & 20 & 15 \\ 1 & 13 & 20 \end{bmatrix} \cdot \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} 40000 \\ 50000 \\ 60000 \\ 40000 \\ 90000 \end{bmatrix}$

  • Observation 1
    • $y_1 = \beta_0 + \beta_1 \cdot x_{1,1} + \beta_2 \cdot x_{1,2}$
    • $40000 = \beta_0 + \beta_1 \cdot 10 + \beta_2 \cdot 1$
  • Observation 2
    • $y_2 = \beta_0 + \beta_1 \cdot x_{2,1} + \beta_2 \cdot x_{2,2}$
    • $50000 = \beta_0 + \beta_1 \cdot 13 + \beta_2 \cdot 4$
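A short NumPy sketch of this matrix multiplication may help; the coefficient values are the ones assumed on a later slide (20 k€, 1.5 k€/a, 1 k€/a), not estimates from data.

```python
# Computing y = X @ beta for the design matrix from the slide.
import numpy as np

X = np.array([
    [1, 10,  1],   # first column of ones -> intercept (bias term)
    [1, 13,  4],
    [1, 16,  5],
    [1, 20, 15],
    [1, 13, 20],
])
beta = np.array([20_000, 1_500, 1_000])  # beta_0, beta_1, beta_2 (assumed)

y_hat = X @ beta   # one value per observation (row)
print(y_hat)       # first row: 20000 + 1500*10 + 1000*1 = 36000
```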
Parametric Methods
  • $\text{income} \approx \beta_0 + \beta_1 \cdot \text{education} + \beta_2 \cdot \text{seniority}$
  • the form is fixed, but we can tweak three parameters to improve the fit ($\beta_0, \beta_1, \beta_2$)
  • $\beta_0$ is the intercept

$\text{income} \approx \beta_0 + \beta_1 \cdot \text{education} + \beta_2 \cdot \text{seniority}$

$\text{income} \approx 20\,\text{k€} + 1.5\,\frac{\text{k€}}{\text{a}} \cdot \text{education} + 1\,\frac{\text{k€}}{\text{a}} \cdot \text{seniority}$

  • $\beta_0$: workers without education or job experience get a base salary of 20,000 €

  • $\beta_1$: each year of education results in a 1,500 € higher salary

  • $\beta_2$: each year of working experience results in a 1,000 € higher salary

  • linear regression is a parametric and relatively inflexible approach

    • it can only generate linear functions
    • only two slope parameters (and an intercept) to interpret (see the fitting sketch below)
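A minimal sketch of fitting this parametric model with scikit-learn, using the five observations from the matrix example above as training data:

```python
# Estimating beta_0, beta_1, beta_2 by least squares (a sketch).
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "education": [10, 13, 16, 20, 13],
    "seniority": [1, 4, 5, 15, 20],
    "income":    [40_000, 50_000, 60_000, 40_000, 90_000],
})

model = LinearRegression()  # the intercept plays the role of beta_0
model.fit(df[["education", "seniority"]], df["income"])

print(model.intercept_)     # estimate of beta_0
print(model.coef_)          # estimates of beta_1 and beta_2
```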

✍️ Task

2.1 Storing $Y$ and $X$ in a DataFrame
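A possible starting point (a sketch; the actual task data may differ), using the income table from above:

```python
# Storing the response Y and the predictors X in a pandas DataFrame.
import pandas as pd

data = pd.DataFrame({
    "income":    [45_000, 50_000],  # Y: response
    "education": [10, 20],          # X1: years of education
    "seniority": [7, 5],            # X2: seniority
    "age":       [34, 63],          # X3: age
})

X = data[["education", "seniority", "age"]]  # predictor matrix
y = data["income"]                           # response vector
```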

Non-parametric Methods
  • no explicit assumption about the functional form of $f$
  • instead, an algorithm estimates $f$ so that it is
    • close to the data points
    • without being too rough or wiggly
  • e.g., a hand-drawn line is a non-parametric model (a code sketch of one such method follows)
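One concrete non-parametric method is k-nearest-neighbours regression; this is a sketch on made-up data, not necessarily the method used in the course:

```python
# kNN regression: predict by averaging the k closest training points,
# without assuming any functional form for f.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.2, size=50)  # assumed wiggly data

knn = KNeighborsRegressor(n_neighbors=5)  # larger k -> smoother fit
knn.fit(x, y)
print(knn.predict([[2.5]]))  # local average of the 5 nearest neighbours
```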
🧠 Over-fitting
  • more common with flexible, non-parametric methods
  • perfect prediction for any point in the training data
  • probably not a good prediction for points that are not in the training data
🧠 Sample Split into Training and Test Set
  • Training data: the data we use for building the model (e.g., finding/fitting the right parameters of the model)
  • Test data: hold-out sample that we can use to test how well the model performs on unseen data
    • more important for prediction tasks

Trade-Off Between Prediction Accuracy and Model Interpretability

  • parametric models are usually easier to interpret
Which to use for prediction and interpretation?
  • Non-parametric models tend to be more powerful
    and therefore better for predictions
  • Parametric models are less flexible but have
    clear parameters ($\beta$, sometimes called $\theta$) that are
    easier to understand for interpretation

✍️ Task

2.2 Creating a Training and Test set
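A possible starting point for the task (a sketch with made-up data; the course data will differ):

```python
# Splitting the sample into a training set and a hold-out test set.
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"education": [10, 13, 16, 20, 13],
                  "seniority": [1, 4, 5, 15, 20]})
y = pd.Series([40_000, 50_000, 60_000, 40_000, 90_000], name="income")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% training, 20% test
```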


Assessing Model Accuracy

  • There is no free lunch in statistics:
    no one method dominates all others over all
    possible data sets.
  • How to decide for any given set of data which method produces the best results?
https://xkcd.com/2048/
🧠 Measuring the Quality of Fit

$Y = f(\vec{X}) + \epsilon$

  • the error of each prediction $e_i$ tells us how well the predictions match the observed data

$e_i = y_i - \hat{f}(\vec{x}_i)$

  • the mean squared error ($\text{MSE}$) is a common accuracy measure:

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{f}(\vec{x}_i)\right)^2$
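In code, the computation is a one-liner; this sketch assumes arrays of observed values and predictions of equal length (the values are hypothetical):

```python
# MSE = (1/n) * sum of squared prediction errors.
import numpy as np

y     = np.array([40_000, 50_000, 60_000])   # observed y_i
y_hat = np.array([42_000, 49_000, 55_000])   # predictions f_hat(x_i) (assumed)

e = y - y_hat              # prediction errors e_i
mse = np.mean(e ** 2)      # mean squared error
print(mse)                 # 10000000.0
```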

🧠 Training vs Test Set
  • Many methods minimize the $\text{MSE}$ during training (training $\text{MSE}$, in-sample $\text{MSE}$)
  • in general, we do not really care how well the method works on the training data,
  • but how our method works on previously unseen test data (out-of-sample $\text{MSE}$)
Fitting Data with a Low-Flexibility Model
  • $f_{\text{low}}(x) = \hat{y}(x) = \beta_0$

    • (i.e., the mean of all red dots of the training data)
    • only one parameter $\beta_0$
  • black: prediction

  • red: training data (medium fit)

  • blue: test data (good fit)

Fitting Data with a Medium-Flexibility Model
  • $f_{\text{med}}(x) = \hat{y}(x) = \beta_0 + \beta_1 \cdot x$

    • two parameters $\beta_0$ and $\beta_1$
  • black: prediction

  • red: training data (good fit)

  • blue: test data (good fit)

Fitting Data with a High-Flexibility Model
  • $f_{\text{high}}(x) = \hat{y}(x) = \beta_0 + \beta_1 \cdot x + \beta_2 \cdot x^2$
    • three parameters $\beta_0$, $\beta_1$, $\beta_2$
  • black: prediction
  • red: training data (very good fit)
  • blue: test data (poor fit)

✍️ Task

2.3 Training Models on a Test Set
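A sketch in the spirit of the three preceding slides (constant, linear, and quadratic fits), using made-up data where the true relationship is linear:

```python
# Compare training vs test MSE for models of increasing flexibility.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 10, size=n)
    return x, 5 + 2 * x + rng.normal(0, 3, size=n)  # assumed true f

x_train, y_train = make_data(20)
x_test,  y_test  = make_data(20)

for degree in [0, 1, 2]:  # low / medium / high flexibility
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    mse_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    mse_test  = np.mean((y_test  - np.polyval(coeffs, x_test))  ** 2)
    print(degree, round(mse_train, 1), round(mse_test, 1))
```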

The Bias-Variance Trade-Off and Over-Fitting

  • The higher the flexibility,
    • the lower the $\text{MSE}$ on the training data,
    • but, beyond some point, the higher the $\text{MSE}$ on the test data

$\text{expected squared error} = \text{variance of the model} + \text{bias}^2 + \text{variance of the error terms}$

  • the U-shape in the test $\text{MSE}$ is the result of two competing properties of statistical learning methods
  • Variance refers to the amount by which $\hat{y}$ would change
    if we estimated it using a different training data set
  • Bias refers to the error that is introduced by approximating
    a real-life problem, which may be extremely complicated,
    by a much simpler model
  • see Hastie et al., p. 34, for the exact formula; a simulation sketch follows below
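A rough simulation sketch of these two quantities (assumed setup, not from the book): refit a low- and a high-flexibility model on many fresh training sets and inspect the bias and variance of their predictions at a single point.

```python
# Bias and variance of predictions at x0 across resampled training sets.
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(x)   # assumed true function
x0 = 5.0

preds = {1: [], 5: []}    # polynomial degree -> predictions at x0
for _ in range(200):      # 200 independent training sets
    x = rng.uniform(0, 10, size=15)
    y = f(x) + rng.normal(0, 0.3, size=15)
    for degree in preds:
        coeffs = np.polyfit(x, y, deg=degree)
        preds[degree].append(np.polyval(coeffs, x0))

for degree, p in preds.items():
    print(degree,
          round(np.mean(p) - f(x0), 3),  # bias at x0
          round(np.var(p), 3))           # variance at x0
```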
  • Black line: Error on the test set
  • more flexible (complex, often non-parametric) models have higher variance but lower bias
  • they can fit the data better (low bias), but are more influenced by the data (high variance)

✍️ Task

2.4 Preventing Non-Parametric Models from Over-Fitting
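One possible approach for this task (a sketch, not the official solution): limit the flexibility of a non-parametric model, here by increasing $k$ in kNN regression, and keep the setting with the lowest test-set MSE.

```python
# Tuning k in kNN regression against a hold-out test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, size=100)  # made-up data

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

for k in [1, 5, 15]:  # k=1 can over-fit; larger k gives a smoother model
    knn = KNeighborsRegressor(n_neighbors=k).fit(x_tr, y_tr)
    mse_test = np.mean((y_te - knn.predict(x_te)) ** 2)
    print(k, round(mse_test, 3))
```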
