2.7 Classification


2.7.1 Classification - Logistic Regression

🎯 Learning objectives

You will be able to

  • describe what odds are
  • interpret the results of a logistic regression model

Classification: the response variable is qualitative

  • Predicted variable Y: Will the debtor default?
  • Predictors:
    • X_1: What is their current account balance?
    • X_2: What is their income?
Examples of classifications
  • Binary classification: Will the debtor default?
  • Multi-class classification: Which species of penguin is it?
    • Any multi-class classification can be broken down into several binary classifications:
      • is it Gentoo?
      • is it Adelie?
      • is it Chinstrap?
https://www.scoopnest.com/user/AFP/1035147372572102656-do-you-know-your-gentoo-from-your-adelie-penguins-infographic-on-10-of-the-world39s-species-after
Examples of multi-class classifications

https://medium.com/@Suraj_Yadav/in-depth-knowledge-of-convolutional-neural-networks-b4bfff8145ab

Logistic Regression Model

🧠 Why Not Linear Regression?

  • Probability that the debtor will default: p(Y=1)
  • left: p(Y=1)=\beta_0+\beta_1 X
    • predicts negative probabilities for low balances
    • predicts probabilities greater than 1 for high balances

  • adaptation of the model's functional form
  • right: p(Y)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}
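
A minimal sketch of this difference, assuming the illustrative coefficients \beta_0 = -10.65 and \beta_1 = 0.0055 used in the default example later in this section:

```python
import numpy as np

# illustrative coefficients from the default example later in this section
b0, b1 = -10.65, 0.0055
balance = np.array([0, 1000, 2000, 3000])

# linear model: leaves the [0, 1] range (negative for low, > 1 for high balances)
p_linear = b0 + b1 * balance

# logistic model: always maps the linear predictor into (0, 1)
p_logistic = np.exp(b0 + b1 * balance) / (1 + np.exp(b0 + b1 * balance))

print(p_linear)    # [-10.65  -5.15   0.35   5.85]
print(p_logistic)  # [~0.00002  ~0.006  ~0.59  ~0.997]
```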

🤓 Solving Logistic Regression

  1. p(Y)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}

  2. \frac{p(Y)}{1-p(Y)}=e^{\beta_0+\beta_1 X}

🤓 Odds
  • are just another way of speaking about probabilities

\frac{p(Y)}{1-p(Y)}

  • The odds of Bayern Munich winning are 2 to 1
    • \frac{p(Y)}{1-p(Y)}=\frac{2}{1}
    • It is twice as likely that they win as that they tie or lose
    • equivalently, p(Y)=\frac{2}{3}\approx 0.67, so \frac{p(Y)}{1-p(Y)}=\frac{0.67}{1-0.67}\approx\frac{2}{1} (see the sketch below)
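
A tiny sketch of the conversion between probabilities and odds (plain Python, nothing beyond the definition above):

```python
def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)

def prob(o):
    """Probability corresponding to the odds o."""
    return o / (1 + o)

print(odds(2 / 3))   # ~2.0 -> "2 to 1"
print(prob(2))       # ~0.667 -> p(Y) = 2/3
```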

\frac{p(Y)}{1-p(Y)}=e^{\beta_0+\beta_1 X}

\log{\frac{p(Y)}{1-p(Y)}}=\beta_0+\beta_1 X

  • The left-hand side is called the log-odds or logit. We see that the logistic regression model has a logit that is linear in X.
  • the logarithm of the odds is a linear function of X whose coefficients can be estimated with
    • the maximum likelihood method (the standard approach; ordinary least squares is not suitable here)
    • numerical optimization, e.g. gradient descent
  • a minimal fitting sketch follows below
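
A minimal fitting sketch, assuming the data are available as two arrays (the toy values below are hypothetical); statsmodels estimates the coefficients by maximum likelihood and prints a regression table like the one interpreted next:

```python
import numpy as np
import statsmodels.api as sm

# hypothetical toy data: account balance and 0/1 default indicator
balance = np.array([300, 1200, 2500, 800, 1900, 2200, 400, 2800])
default = np.array([0, 0, 1, 0, 1, 0, 0, 1])

X = sm.add_constant(balance)        # adds the intercept column (beta_0)
model = sm.Logit(default, X).fit()  # maximum likelihood estimation

print(model.summary())              # coefficients, standard errors, p-values
print(model.predict(X))             # fitted probabilities p(Y = 1)
```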

Interpretation of the Regression Table

  • the logarithm of the odds of having a default (y=1)
    is explained with a linear model

\log{\frac{p(Y)}{1-p(Y)}}=\beta_0+\beta_1 X

  • balance (\beta_1>0) has a positive effect on the probability of default
  • this effect is significant (p\text{-value}<0.05)
  • the intercept is harder to interpret

🧠 Making a Prediction

  • We can just plug in the values of the predictors
    and get a probability value (between 0 and 1) as a prediction
  • Probability of a default with
    • a balance of x_1=1000:
      p(Y)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}=\frac{e^{-10.65+0.0055 \cdot 1000}}{1+e^{-10.65+0.0055 \cdot 1000}}=0.00576
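
A quick numerical check of this plug-in prediction (coefficients as given in the formula above):

```python
import numpy as np

b0, b1 = -10.65, 0.0055
balance = 1000

eta = b0 + b1 * balance                      # linear predictor (log-odds)
p_default = np.exp(eta) / (1 + np.exp(eta))
print(p_default)                             # ~0.0058 (0.00576 with unrounded coefficients)
```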

✍️ Task

Given the following regression model for the odds of having high blood pressure, from Lavie et al. (BMJ, 2000), who surveyed 2677 adults referred to a sleep clinic:

| Risk factor | Coefficient | p-value |
|---|---|---|
| Age (10 years) | 0.805 | 0.04 |
| Sex (male) | 0.161 | 0.03 |
| BMI (5 kg/m^2) | 0.332 | 0.04 |
| Apnoea Index (10 units) | 0.116 | 0.23 |
  • Does the model have an intercept?
  • Is the apnoea index significantly predictive of high blood pressure?
  • Is sex a predictor of high blood pressure?
data is changed from the original paper, https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1b-statistical-methods/multiple-logistic-regression
  • Does the model have an intercept?
    • no
  • Is the apnoea index predictive of high blood pressure?
    • no, the effect is not significant, as p>0.05
  • Is sex a predictor of high blood pressure?
    • yes, males have significantly higher odds of having high blood pressure (p<0.05)

✍️ Case Study

  • We will use deep neural networks to classify images of birds and trees

Image Classification with Deep Learning


2.7.2 Classification - Evaluating Classification Results

🎯 Learning objectives

You will be able to

  • read a confusion matrix and calculate classification errors based on its results
  • set classification thresholds to generate a ROC-curve of a classifier
  • compare models using ROC and AUC

No prediction is perfect

  • Predicted variable Y: Will the debtor default?
  • Predictors:
    • X_1: What is their current account balance?
| Y (default = True) | X_1 (balance) | \hat{p}(Y) |
|---|---|---|
| 1 | 4000 | 0.667 |
| 0 | 2000 | 0.587 |
| ... | ... | ... |

  • Probability of a default for the second person with
    • a balance of x_1=2000:
      p(Y)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}=\frac{e^{-10.65+0.0055 \cdot 2000}}{1+e^{-10.65+0.0055 \cdot 2000}}=0.587
    • We can decide to predict a default based on a threshold probability of 50\%
    • Still, the second person did not default

Confusion Matrix

  • when making predictions, we will make errors
  • with categorical data we cannot use the accuracy measures from regression
https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | True Negative (TN) | False Positive (FP) | N |
| Actual Positive (1) | False Negative (FN) | True Positive (TP) | P |
| Total | N^* | P^* | |

  • True Positive (TP):
    • You predicted positive and it is true.
  • True Negative (TN):
    • You predicted negative and it is true.
  • False Positive (FP) (Type 1 Error):
    • You predicted positive and it is false.
  • False Negative (FN) (Type 2 Error):
    • You predicted negative and it is false.
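
A minimal sketch of computing such a matrix with scikit-learn, using the small corona-test example that follows:

```python
from sklearn.metrics import confusion_matrix

# true labels and predicted labels from the corona example below (threshold 0.5)
y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# returned as [[TN, FP], [FN, TP]] (rows: actual class, columns: predicted class)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 2 1 1 2
```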
Example: Corona-Test
  • What error do we want to minimize when everyone who is sick should stay at home?
    • We want to find all positives (P)
    • False negatives (FN) are dangerous

🧠

| has Corona y | test result \hat{p}(y) | Classification for threshold 0.5 | Error type |
|---|---|---|---|
| 0 | 0.4 | 0 | TN |
| 1 | 0.9 | 1 | TP |
| 0 | 0.7 | 1 | FP |
| 1 | 0.7 | 1 | TP |
| 0 | 0.3 | 0 | TN |
| 1 | 0.4 | 0 | FN |
  • A threshold is a probability value we set to decide on the predicted classification based on the predicted probability
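
A minimal sketch of turning predicted probabilities into class predictions with a threshold (values from the table above):

```python
import numpy as np

y_true = np.array([0, 1, 0, 1, 0, 1])
p_hat = np.array([0.4, 0.9, 0.7, 0.7, 0.3, 0.4])

threshold = 0.5
y_pred = (p_hat >= threshold).astype(int)   # probability >= threshold -> predict positive
print(y_pred)                               # [0 1 1 1 0 0] -> TN, TP, FP, TP, TN, FN
```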

🧠

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | TN=2 | FP=1 | N=3 |
| Actual Positive (1) | FN=1 | TP=2 | P=3 |
| Total | N^*=3 | P^*=3 | |
  • There is one false negative: one person who has corona is not detected with this threshold

✍️ Task

  • Select a threshold so that no false negative remains
  • Fill the confusion matrix
    ⌛ 15 minutes
| has Corona y | test result \hat{p}(y) | Classification for threshold ... | Error type |
|---|---|---|---|
| 0 | 0.4 | | TN |
| 1 | 0.9 | | TP |
| 0 | 0.7 | | FP |
| 1 | 0.7 | | TP |
| 0 | 0.3 | | TN |
| 1 | 0.4 | | FN |
| has Corona y | test result \hat{p}(y) | Classification for threshold 0.4 | Error type |
|---|---|---|---|
| 0 | 0.4 | 1 | FP |
| 1 | 0.9 | 1 | TP |
| 0 | 0.7 | 1 | FP |
| 1 | 0.7 | 1 | TP |
| 0 | 0.3 | 0 | TN |
| 1 | 0.4 | 1 | TP |
| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | TN= | FP= | N=3 |
| Actual Positive (1) | FN= | TP= | P=3 |
| Total | N^*= | P^*= | |
| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | TN=1 | FP=2 | N=3 |
| Actual Positive (1) | FN=0 | TP=3 | P=3 |
| Total | N^*=1 | P^*=5 | |
  • There is no false negative left; we find all positives
  • However, we send one more healthy person home as a false positive

🧠 Thresholds

  • In most classification problems, we have to balance (at least) two different errors
    • False Positive (FP): (Type 1 Error 🤓)
    • False Negative (FN): (Type 2 Error 🤓)
  • By moving the threshold, we can calibrate a classifier (that predicts a class probability) so that it suits the use case
  • To compare classifiers, we can use a receiver operating characteristic

🧠 Receiver operating characteristic

  • to create the curve, we use the same model but vary the threshold
  • how many more FP do we get if we increase the TP rate by changing the threshold?
  • Sensitivity / Recall:
    TPR=\frac{TP}{TP+FN}=\frac{TP}{P}
  • FPR=\frac{FP}{FP+TN}=\frac{FP}{N}
  • the perfect classifier would achieve all TP without any FP (see the sketch below)
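
A minimal sketch of computing the points of a ROC curve with scikit-learn, reusing the illustrative corona probabilities from above:

```python
from sklearn.metrics import roc_curve

y_true = [0, 1, 0, 1, 0, 1]
p_hat = [0.4, 0.9, 0.7, 0.7, 0.3, 0.4]

# one (FPR, TPR) point for every threshold that changes the classification
fpr, tpr, thresholds = roc_curve(y_true, p_hat)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th}: FPR = {f:.2f}, TPR = {t:.2f}")
```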
Good Classifier
  • red: distribution of the predictions for negatives
  • green: distribution of the predictions for positives
  • only a few of the predictions land on the wrong side of the threshold
https://towardsdatascience.com/demystifying-roc-curves-df809474529a
Random Classifier
  • the classifier has no power to differentiate between the classes
  • no matter where we place the threshold, we will get the same rate of FP and TP (FPR = TPR)
https://towardsdatascience.com/demystifying-roc-curves-df809474529a
🧠 AUC (Area Under the ROC Curve)
  • We can measure the area under the ROC-curve to evaluate the skill of a model
  • A perfect classifier has an AUC of 1
  • a random classifier has an AUC of 0.5
https://towardsdatascience.com/understanding-the-roc-curve-and-auc-dd4f9a192ecb
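
A minimal sketch with scikit-learn, again with the illustrative corona probabilities:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 0, 1, 0, 1]
p_hat = [0.4, 0.9, 0.7, 0.7, 0.3, 0.4]

print(roc_auc_score(y_true, p_hat))   # ~0.78 here (1.0 = perfect, 0.5 = random guessing)
```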

Other Accuracy Measures for Classifiers

  • for a model and a given threshold, the confusion matrix gives a first impression of its performance
  • it is also helpful to have a single metric to describe a model's accuracy
https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg
Accuracy

What proportion of all identifications was correct?
How often do we hit the target?

\text{accuracy}=\frac{TP+TN}{TP+TN+FP+FN}

  • 🤓 You don't have to learn the formulas, but should be able to calculate the values if formulas and data are given
Precision

What proportion of positive identifications was actually correct?
Of everything we predict to be positive, how many are really positive?

\text{precision}=\frac{TP}{TP+FP}

Sensitivity / Recall

What proportion of actual positives was identified correctly?
How many of the positives do we find?

\text{recall}=\frac{TP}{TP+FN}

F1-Score
  • The F1 score is the harmonic mean of the precision and recall.

F_1=2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}}
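
A minimal sketch that wraps these formulas in plain functions; the example numbers are the threshold-0.4 confusion matrix from the corona exercise above:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(prec, rec):
    return 2 * prec * rec / (prec + rec)

# threshold-0.4 corona example: TN = 1, FP = 2, FN = 0, TP = 3
tp, tn, fp, fn = 3, 1, 2, 0
prec, rec = precision(tp, fp), recall(tp, fn)
print(accuracy(tp, tn, fp, fn), prec, rec, f1(prec, rec))   # ~0.667 0.6 1.0 0.75
```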

✍️ Task
  • Calculate accuracy, precision, recall and F1-Score for the following examples

⌛ 10 minutes


Corona Test I

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | TN=2 | FP=1 | N=3 |
| Actual Positive (1) | FN=1 | TP=2 | P=3 |
| Total | N^*=3 | P^*=3 | |

Corona Test II

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | TN=998 | FP=0 | N=998 |
| Actual Positive (1) | FN=1 | TP=1 | P=2 |
| Total | N^*=999 | P^*=1 | |

Corona Test I

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | TN=2 | FP=1 | N=3 |
| Actual Positive (1) | FN=1 | TP=2 | P=3 |
| Total | N^*=3 | P^*=3 | |
  • \text{accuracy}=\frac{TP+TN}{P+N}=\frac{2+2}{3+3}=\frac{2}{3}
  • \text{precision}=\frac{TP}{TP+FP}=\frac{2}{2+1}=\frac{2}{3}
  • \text{recall}=\frac{TP}{TP+FN}=\frac{2}{2+1}=\frac{2}{3}
  • F_1=2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}}=2 \cdot \frac{\frac{2}{3} \cdot \frac{2}{3}}{\frac{2}{3}+\frac{2}{3}}=\frac{2}{3}

Corona Test II

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | TN=998 | FP=0 | N=998 |
| Actual Positive (1) | FN=1 | TP=1 | P=2 |
| Total | N^*=999 | P^*=1 | |
  • \text{accuracy}=\frac{TP+TN}{P+N}=\frac{1+998}{2+998}=0.999
  • \text{precision}=\frac{TP}{TP+FP}=\frac{1}{1+0}=1
  • \text{recall}=\frac{TP}{TP+FN}=\frac{1}{1+1}=\frac{1}{2}
  • F_1=2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}}=2 \cdot \frac{1 \cdot \frac{1}{2}}{1+\frac{1}{2}}=\frac{2}{3}

🧠 Unbalanced Data Sets

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | TN | FP | N=998 |
| Actual Positive (1) | FN | TP | P=2 |
| Total | N^* | P^* | |

Given this data: is it hard to train a model with high accuracy (a high share of correct predictions)?

  • The simple model
    • \hat{P}(Y=0)=1
    • always predicts negative
| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | TN=998 | FP=0 | N=998 |
| Actual Positive (1) | FN=2 | TP=0 | P=2 |
| Total | N^*=1000 | P^*=0 | |
  • Datasets where one class is much more common than the other are called unbalanced
  • This can lead to models that simply ignore the minority class while still reaching high accuracy (as in this example)
🤓 Training on Unbalanced Data Sets
  • a common way to solve this problem is to create a balanced training data set
  • depending on the amount of data available:
    • under-sampling: omit some instances of the majority class during training
    • over-sampling: duplicate some instances of the minority class during training
  • a minimal resampling sketch follows below the link

https://medium.com/analytics-vidhya/undersampling-and-oversampling-an-old-and-a-new-approach-4f984a0e8392
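
A minimal over-sampling sketch with scikit-learn's `resample` utility (the arrays `X` and `y` are hypothetical; dedicated libraries such as imbalanced-learn provide the same idea as ready-made samplers):

```python
import numpy as np
from sklearn.utils import resample

# hypothetical unbalanced data: 998 negatives, 2 positives
X = np.random.rand(1000, 3)
y = np.array([0] * 998 + [1] * 2)

X_maj, X_min = X[y == 0], X[y == 1]

# over-sampling: draw the minority class with replacement until it matches the majority
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_balanced))   # [998 998]
```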