2.7 Classification

2.7.1 Classification - Logit-Regression

🎯 Learning objectives

You will be able to

  • describe what odds are
  • interpret the results of a logistic regression model
Classification: the response variable is qualitative

  • Predicted variable $Y$: Will the debtor default?
  • Predictors:
    • $X_1$: What is their current account balance?
    • $X_2$: What is their income?
Examples of classifications
  • Binary classification: Will the debtor default?
  • Multi-class classification: Which species of penguin is it?
    • Any multi-class classification can be broken down into several binary classifications:
      • is it Gentoo?
      • is it Adelie?
      • is it Chinstrap?
Examples of multi-class classifications

Logistic Regression Model

🧠 Why Not Linear Regression?

  • Probability that the debtor will default: $p(Y=1)$
  • left: $p(Y=1)=\beta_0+\beta_1 X$
    • low balances result in a negative probability
    • high balances result in a probability greater than 1
  • adaptation of the model's form
  • right: $p(Y)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}$
🤓 Solving Logistic Regression

  1. $p(Y)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}$

  2. $\frac{p(Y)}{1-p(Y)}=e^{\beta_0+\beta_1 X}$
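
The step from 1. to 2. can be spelled out as follows. This is a short derivation; the shorthand $z=\beta_0+\beta_1 X$ is introduced here only for readability and is not part of the original notation.

```latex
\begin{align*}
p(Y) &= \frac{e^{z}}{1+e^{z}}, \qquad z=\beta_0+\beta_1 X \\
1-p(Y) &= \frac{1+e^{z}-e^{z}}{1+e^{z}} = \frac{1}{1+e^{z}} \\
\frac{p(Y)}{1-p(Y)} &= \frac{e^{z}}{1+e^{z}} \cdot \frac{1+e^{z}}{1} = e^{z} = e^{\beta_0+\beta_1 X}
\end{align*}
```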

🤓 Odds
  • are just another way of speaking about probabilities
  • The odds of Bayern Munich winning are 2 to 1
    • $\frac{p(Y)}{1-p(Y)}=\frac{2}{1}$
    • They are twice as likely to win as to tie or lose
    • $\frac{p(Y)}{1-p(Y)}=\frac{0.667}{1-0.667}\approx 2$
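
The conversion between odds and probabilities can be sketched in a few lines of Python. The function names are illustrative and not part of the course material.

```python
def odds_from_probability(p: float) -> float:
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    return p / (1.0 - p)


def probability_from_odds(odds: float) -> float:
    """Convert odds (e.g. 2.0 for '2 to 1') back into a probability."""
    return odds / (1.0 + odds)


p_win = probability_from_odds(2.0)             # odds of 2 to 1
print(round(p_win, 3))                         # 0.667
print(round(odds_from_probability(p_win), 3))  # 2.0
```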
$$\frac{p(Y)}{1-p(Y)}=e^{\beta_0+\beta_1 X}$$

$$\log{\frac{p(Y)}{1-p(Y)}}=\beta_0+\beta_1 X$$

  • The left-hand side is called the log-odds or logit. We see that the logistic regression model has a logit that is linear in $X$.
  • the coefficients of this linear function of the log-odds can be estimated with
    • least squares approach
    • maximum likelihood method
    • gradient descent
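
As a rough illustration of how the coefficients could be estimated, the following sketch maximizes the log-likelihood by gradient ascent (equivalent to gradient descent on the negative log-likelihood). The toy data, the learning rate and the number of steps are assumptions made for this example; in practice a statistics library would do this fitting.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Estimate beta_0 and beta_1 by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log-likelihood: average of (y - p) and (y - p) * x
        grad0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)) / n
        grad1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys)) / n
        b0 += learning_rate * grad0
        b1 += learning_rate * grad1
    return b0, b1


# Hypothetical toy data: balance in thousands, 1 = default
balances = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
defaults = [0, 0, 0, 1, 0, 1, 1, 1]

beta_0, beta_1 = fit_logistic(balances, defaults)
print(f"beta_0 = {beta_0:.2f}, beta_1 = {beta_1:.2f}")
```

With this toy data the estimated slope comes out positive, so higher balances lead to a higher predicted probability of default.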
Interpretation of the Regression Table

  • the logarithm of the odds of having a default ($y=1$)
    is explained with a linear model

$$\log{\frac{p(Y)}{1-p(Y)}}=\beta_0+\beta_1 X$$

  • balance ($\beta_1>0$) has a positive effect on the probability of default
  • this effect is significant ($p\text{-value}<0.05$)
  • the intercept is harder to interpret
🧠 Making a Prediction

  • We can just plug in the values of the predictors
    and get a probability value (between $0$ and $1$) as a prediction
  • Probability of a default with
    • a balance of $x_1=1000$
      $p(Y)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}=\frac{e^{-10.65+0.0055 \cdot 1000}}{1+e^{-10.65+0.0055 \cdot 1000}}=0.00576$
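
The same calculation can be reproduced in Python. The coefficients are the ones used in the worked example; the function name is chosen for illustration only.

```python
import math

# Coefficients taken from the worked example above
BETA_0 = -10.65
BETA_1 = 0.0055


def predict_default_probability(balance: float) -> float:
    """Return p(Y) = e^z / (1 + e^z) with z = beta_0 + beta_1 * balance."""
    z = BETA_0 + BETA_1 * balance
    return math.exp(z) / (1.0 + math.exp(z))


print(round(predict_default_probability(1000), 4))  # 0.0058
```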
✍️ Task

Given the following regression model of the odds of having high blood pressure, from Lavie et al. (BMJ, 2000), who surveyed $2677$ adults referred to a sleep clinic:

| Risk factor | Coefficient | p-value |
|---|---|---|
| Age ($10$ years) | 0.805 | 0.04 |
| Sex (male) | 0.161 | 0.03 |
| BMI ($5\,kg/m^2$) | 0.332 | 0.04 |
| Apnoea index ($10$ units) | 0.116 | 0.23 |
  • Does the model have an intercept?
  • Is the apnoea index significantly predictive of high blood pressure?
  • Is sex a predictor of high blood pressure?
Note: the data is changed from the original paper. Source: https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1b-statistical-methods/multiple-logistic-regression
  • Does the model have an intercept?
    • no
  • Is the apnoea index predictive of high blood pressure?
    • no, it is not significant, as $p>0.05$
  • Is sex a predictor of high blood pressure?
    • yes, males have significantly higher odds of having high blood pressure ($p=0.03<0.05$)
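
Because the model is linear in the log-odds, exponentiating a coefficient gives the factor by which the odds change per step of the risk factor (the odds ratio). The following sketch applies this to the modified coefficients from the table; it is an illustration, not a calculation taken from the paper.

```python
import math

# Coefficients from the (modified) table above: change in log-odds per step
coefficients = {
    "Age (10 years)": 0.805,
    "Sex (male)": 0.161,
    "BMI (5 kg/m^2)": 0.332,
    "Apnoea index (10 units)": 0.116,
}

# exp(coefficient) is the factor by which the odds are multiplied
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

Under this reading, being male multiplies the odds of high blood pressure by roughly 1.17, holding the other risk factors fixed.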
✍️ Case Study

  • We will use deep neural networks to classify images of birds and trees

Image Classification with Deep Learning

2.7.2 Classification - Evaluating Classification Results

🎯 Learning objectives

You will be able to

  • read a confusion matrix and calculate classification errors based on its results
  • set classification thresholds to generate a ROC-curve of a classifier
  • compare models using ROC and AUC
No prediction is perfect

  • Predicted variable $Y$: Will the debtor default?
  • Predictors:
    • $X_1$: What is their current account balance?

| $Y$ (default = True) | $X_1$ balance | $\hat{p}(Y)$ |
|---|---|---|
| 1 | 4000 | 0.667 |
| 0 | 2000 | 0.587 |
| ... | ... | |
  • Probability of a default for the second person with
    • a balance of $x_1=2000$
      $p(Y)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}=\frac{e^{-10.65+0.0055 \cdot 2000}}{1+e^{-10.65+0.0055 \cdot 2000}}=0.587$
    • We can decide to predict a default based on a threshold probability of $50\%$
    • Still, the second person did not default
Confusion Matrix

  • making a prediction, we will make errors
  • with categorical data we cannot use the accuracy measures from regression

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | True Negative ($TN$) | False Positive ($FP$) | $N$ |
| Actual Positive (1) | False Negative ($FN$) | True Positive ($TP$) | $P$ |
| Total | $N^*$ | $P^*$ | |

  • True Positive ($TP$):
    • You predicted positive and it’s true.
  • True Negative ($TN$):
    • You predicted negative and it’s true.
  • False Positive ($FP$) (Type 1 Error):
    • You predicted positive and it’s false.
  • False Negative ($FN$) (Type 2 Error):
    • You predicted negative and it’s false.
Example: Corona-Test
  • Which error do we want to minimize when everyone who is sick should stay at home?
    • We want to find all positives ($P$)
    • False negatives ($FN$) are dangerous

| has Corona $y$ | test result $\hat{p}(y)$ | Classification for threshold 0.5 | Error-Type |
|---|---|---|---|
| 0 | 0.4 | 0 | $TN$ |
| 1 | 0.9 | 1 | $TP$ |
| 0 | 0.7 | 1 | $FP$ |
| 1 | 0.7 | 1 | $TP$ |
| 0 | 0.3 | 0 | $TN$ |
| 1 | 0.4 | 0 | $FN$ |

  • A threshold is a probability value we set to decide on the predicted classification based on the predicted probability

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | $TN=2$ | $FP=1$ | $N=3$ |
| Actual Positive (1) | $FN=1$ | $TP=2$ | $P=3$ |
| Total | $N^*=3$ | $P^*=3$ | |

  • With this threshold there is one false negative: one person who has corona is not detected
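
The counting behind the confusion matrix can be sketched as follows. The function name is illustrative, and the rule that a score equal to the threshold counts as positive is an assumption of this sketch.

```python
def confusion_matrix(actual, scores, threshold=0.5):
    """Count TN, FP, FN, TP; a score >= threshold is classified as positive."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for y, score in zip(actual, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["TP" if y == 1 else "FP"] += 1
        else:
            counts["FN" if y == 1 else "TN"] += 1
    return counts


# Data from the Corona-Test example above
has_corona = [0, 1, 0, 1, 0, 1]
test_result = [0.4, 0.9, 0.7, 0.7, 0.3, 0.4]

print(confusion_matrix(has_corona, test_result, threshold=0.5))
# {'TN': 2, 'FP': 1, 'FN': 1, 'TP': 2}
```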
✍️ Task

  • Select a threshold, so that no false negative remains
  • Fill the confusion matrix
    ⌛ 15 minutes

| has Corona $y$ | test result $\hat{y}$ | Classification for threshold ... | Error-Type |
|---|---|---|---|
| 0 | 0.4 | | $TN$ |
| 1 | 0.9 | | $TP$ |
| 0 | 0.7 | | $FP$ |
| 1 | 0.7 | | $TP$ |
| 0 | 0.3 | | $TN$ |
| 1 | 0.4 | | $FN$ |


| has Corona $y$ | test result $\hat{y}$ | Classification for threshold 0.4 | Error-Type |
|---|---|---|---|
| 0 | 0.4 | 1 | $FP$ |
| 1 | 0.9 | 1 | $TP$ |
| 0 | 0.7 | 1 | $FP$ |
| 1 | 0.7 | 1 | $TP$ |
| 0 | 0.3 | 0 | $TN$ |
| 1 | 0.4 | 1 | $TP$ |


| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | $TN=$ | $FP=$ | $N=3$ |
| Actual Positive (1) | $FN=$ | $TP=$ | $P=3$ |
| Total | $N^*=$ | $P^*=$ | |


| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | $TN=1$ | $FP=2$ | $N=3$ |
| Actual Positive (1) | $FN=0$ | $TP=3$ | $P=3$ |
| Total | $N^*=1$ | $P^*=5$ | |

  • No false negative remains: we find all positives
  • However, we send one more person home as a false positive
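
One way to pick such a threshold systematically is sketched below: try each observed score as a threshold, keep those without false negatives, and take the highest one so that as few false positives as possible remain. The helper names and the "score >= threshold means positive" rule are assumptions of this sketch.

```python
def false_negatives(actual, scores, threshold):
    """Number of actual positives whose score falls below the threshold."""
    return sum(1 for y, s in zip(actual, scores) if y == 1 and s < threshold)


def false_positives(actual, scores, threshold):
    """Number of actual negatives whose score reaches the threshold."""
    return sum(1 for y, s in zip(actual, scores) if y == 0 and s >= threshold)


has_corona = [0, 1, 0, 1, 0, 1]
test_result = [0.4, 0.9, 0.7, 0.7, 0.3, 0.4]

# Candidate thresholds: the observed scores; keep those without false negatives
candidates = sorted(set(test_result))
safe = [t for t in candidates if false_negatives(has_corona, test_result, t) == 0]
best = max(safe)  # highest safe threshold gives the fewest false positives

print(best, false_positives(has_corona, test_result, best))  # 0.4 2
```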
🧠 Thresholds

  • In most classification problems, we have to balance (at least) two different errors
    • False Positive ($FP$) (Type 1 Error 🤓)
    • False Negative ($FN$) (Type 2 Error 🤓)
  • By moving the threshold, we can calibrate a classifier (that predicts a class probability) so that it suits the use case
  • To compare classifiers, we can use a receiver operating characteristic
🧠 Receiver operating characteristic

  • to create a line, we use the same model but change the threshold
  • how many more $FP$ do we get if we increase the $TP$ rate by changing the threshold?
  • Sensitivity / Recall (true positive rate):
    $TPR=\frac{TP}{TP+FN} = \frac{TP}{P}$
  • False positive rate:
    $FPR=\frac{FP}{FP+TN}=\frac{FP}{N}$
  • the perfect classifier would achieve all $TP$ without any $FP$
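
The points of a ROC curve can be computed by sweeping the threshold over the observed scores. The sketch below does this for the Corona-Test data; the function name and the extra starting threshold above all scores are choices made for this illustration.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(actual)
    negatives = len(actual) - positives
    # One threshold above every score (nothing predicted positive), then each score
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(actual, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(actual, scores) if y == 0 and s >= t)
        points.append((fp / negatives, tp / positives))
    return points


has_corona = [0, 1, 0, 1, 0, 1]
test_result = [0.4, 0.9, 0.7, 0.7, 0.3, 0.4]

for fpr, tpr in roc_points(has_corona, test_result):
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Each printed pair is one point of the curve; lowering the threshold moves from (0, 0) towards (1, 1).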
Good Classifier
  • red: distribution of the predictions for negatives
  • green: distribution of the predictions for positives
  • only few of the guesses land on the wrong side of the threshold
Random Classifier
  • the classifier has no power to differentiate between the classes
  • no matter where we place the threshold, the $FP$ rate and the $TP$ rate are the same
🧠 AUC (Area Under the ROC Curve)
  • We can measure the area under the ROC curve to evaluate the skill of a model
  • A perfect classifier has an AUC of $1$
  • A random classifier has an AUC of $0.5$
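
The AUC can also be read as the share of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. The following sketch uses this pairwise reading instead of integrating the curve; the function name is illustrative.

```python
def auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(actual, scores) if y == 1]
    neg = [s for y, s in zip(actual, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


has_corona = [0, 1, 0, 1, 0, 1]
test_result = [0.4, 0.9, 0.7, 0.7, 0.3, 0.4]
print(round(auc(has_corona, test_result), 3))  # 0.778
```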
Other Accuracy Measures for Classifiers

  • for a model and a given threshold, we have the confusion matrix to get a first impression of its performance
  • it is helpful to have a single metric to describe a model's accuracy
What proportion of identifications was correct?
How often do we hit the target?

$$\text{accuracy} = \frac{TP+TN}{TP +TN + FP + FN}$$

  • 🤓 You don't have to learn the formulas, but should be able to calculate the values if formulas and data are given
What proportion of positive identifications was actually correct?
Of everything we predict to be positive, how many are really positive?

$$\text{precision} = \frac{TP}{TP + FP}$$

Sensitivity / Recall

What proportion of actual positives was identified correctly?
How many of the positives do we find?

$$\text{recall} = \frac{TP}{TP + FN}$$

  • The F1 score is the harmonic mean of the precision and recall.

$$F_1 = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
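
The four measures can be computed from the cells of a confusion matrix as sketched below. The counts in the example call are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 for one confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when nothing is predicted or actually positive
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration
print(classification_metrics(tp=8, tn=5, fp=2, fn=5))
```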

✍️ Task
  • Calculate accuracy, precision, recall and F1-Score for the following examples

⌛ 10 minutes

Corona Test I

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | $TN=2$ | $FP=1$ | $N=3$ |
| Actual Positive (1) | $FN=1$ | $TP=2$ | $P=3$ |
| Total | $N^*=3$ | $P^*=3$ | |

Corona Test II

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | $TN=998$ | $FP=0$ | $N=998$ |
| Actual Positive (1) | $FN=1$ | $TP=1$ | $P=2$ |
| Total | $N^*=999$ | $P^*=1$ | |

Corona Test I

  • $\text{accuracy} = \frac{TP + TN}{P + N}= \frac{2 + 2}{3 + 3}=\frac{2}{3}$
  • $\text{precision} = \frac{TP}{TP + FP}= \frac{2}{2 + 1}=\frac{2}{3}$
  • $\text{recall} = \frac{TP}{TP + FN}=\frac{2}{2 + 1}=\frac{2}{3}$
  • $F_1 = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}=2 \frac{\frac{2}{3} \cdot \frac{2}{3}}{\frac{2}{3} + \frac{2}{3}}=\frac{2}{3}$
Corona Test II

  • $\text{accuracy} = \frac{TP + TN}{P + N}= \frac{1 + 998}{2 + 998}=0.999$
  • $\text{precision} = \frac{TP}{TP + FP}= \frac{1}{1 + 0}=1$
  • $\text{recall} = \frac{TP}{TP + FN}=\frac{1}{1 + 1}=\frac{1}{2}$
  • $F_1 = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}=2 \frac{1 \cdot \frac{1}{2}}{1 + \frac{1}{2}}=\frac{2}{3}$
🧠 Unbalanced Data Sets

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | $TN$ | $FP$ | $N=998$ |
| Actual Positive (1) | $FN$ | $TP$ | $P=2$ |
| Total | $N^*$ | $P^*$ | |

Given this data: is it hard to train a model with high accuracy (a high share of correct predictions)?

  • The simple model
    • $P(x=0)=1$
    • always predicts negative

| | Predicted Negative (0) | Predicted Positive (1) | Total |
|---|---|---|---|
| Actual Negative (0) | $TN=998$ | $FP=0$ | $N=998$ |
| Actual Positive (1) | $FN=2$ | $TP=0$ | $P=2$ |
| Total | $N^*=1000$ | $P^*=0$ | |

  • Datasets where one group is much more common than the other are called unbalanced
  • This can lead to a special case of over-fitting (as in this example)
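
The effect of the simple model can be checked with a few lines of Python. The lists below recreate the 998 negatives and 2 positives from the example; the variable names are illustrative.

```python
# Unbalanced data from the example above: 998 negatives, 2 positives
actual = [0] * 998 + [1] * 2

# The simple model: always predict negative
predicted = [0] * len(actual)

correct = sum(1 for y, y_hat in zip(actual, predicted) if y == y_hat)
found_positives = sum(1 for y, y_hat in zip(actual, predicted) if y == 1 and y_hat == 1)

accuracy = correct / len(actual)
recall = found_positives / sum(actual)

print(accuracy, recall)  # 0.998 0.0
```

The accuracy is very high although the model does not find a single positive case, which is why accuracy alone is a poor measure on unbalanced data.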
🤓 Training on Unbalanced Data Sets
  • a common way to solve this problem is to create a balanced training data set
  • depending on the amount of data available
    • under-sampling: omit some instances of the majority class from the training data
    • over-sampling: duplicate some instances of the minority class in the training data
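
Both strategies can be sketched with Python's standard `random` module. The row format `(feature, label)`, the function names and the fixed seed are assumptions of this illustration; the resampling should only be applied to the training data.

```python
import random


def undersample(majority, minority, rng):
    """Keep all minority rows and an equally sized random subset of the majority."""
    return rng.sample(majority, len(minority)) + minority


def oversample(majority, minority, rng):
    """Keep all majority rows and draw minority rows with replacement to match."""
    return majority + rng.choices(minority, k=len(majority))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = [(x, 0) for x in range(998)]  # hypothetical (feature, label) rows
positives = [(x, 1) for x in range(2)]

print(len(undersample(negatives, positives, rng)))  # 4
print(len(oversample(negatives, positives, rng)))   # 1996
```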

