2.6 Learning


2.6.1 Gradient Descent

🎯 Learning objectives

You will be able to

  • explain how gradient descent works by describing it as a pseudo-algorithm
  • spot common problems of gradient descent and provide solutions
  • use learning curves to identify over-fitting

Cost Functions

  • Most models are trained by minimizing a cost function on the training data to obtain good estimates for the parameters $\hat{\beta}_i$

$\hat{y} \approx \hat{\beta}_0 + \hat{\beta}_1 \cdot x_1$

  • A typical cost function is the $RSS$ (i.e., the sum of squared prediction errors) over all $n$ observations

$RSS = \sum^n_{j=1}(y_j - \hat{y_j})^2$

Minimizing Cost Functions
  • We saw that there is an analytical solution to minimize the $RSS$
    • so we can directly calculate the optimal parameters $\beta_i$
  • But we could also evaluate the cost function for different parameter values and iteratively approach a minimum of the cost function

https://medium.com/analytics-vidhya/cost-function-explained-in-less-than-5-minutes-c5d8a44b918c
🧠 Minimizing Complex Cost Functions
  • $\theta_i$ ... parameters of a model that is not a linear regression (think $\beta_i$)
  • $J(\vec{\theta})$ ... cost function of any machine learning model (think $RSS$)
https://medium.com/analytics-vidhya/cost-function-explained-in-less-than-5-minutes-c5d8a44b918c
  • We aim to find a global minimum
  • How can we do this?

Gradient Descent Heuristic

  • Start at a random position in the cost function
  • Calculate the slope of the cost function at the current position
  • If the slope is negative, move to the right; if it is positive, move to the left
  • Repeat

🧠

  • Example with only one parameter $\theta$ (theta)
  • and only one local minimum of the cost function $J$, which is the global minimum, where $J' = 0$
    start at random position theta in the cost function
    
    while |J'(theta)| > 0:
        calculate the slope of J: J'(theta) at the current position
        if the slope is negative:
            move theta to the right by step width alpha
        if the slope is positive:
            move theta to the left by step width alpha
    
  • the step width $\alpha$ is called the learning rate (see the Python sketch below)
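  • A minimal Python sketch of this loop (illustrative only, using a made-up quadratic cost $J(\theta) = (\theta - 3)^2$; the usual update $\theta \leftarrow \theta - \alpha \cdot J'(\theta)$ automatically moves right for a negative slope and left for a positive one):

    # toy cost function and its derivative (assumed for this sketch, not lecture data)
    def J(theta):
        return (theta - 3) ** 2

    def J_prime(theta):
        return 2 * (theta - 3)

    alpha = 0.1      # learning rate (step width)
    theta = 10.0     # start at some position in the cost function
    for step in range(100):
        slope = J_prime(theta)
        if abs(slope) < 1e-6:          # stop once the slope is (almost) zero
            break
        theta = theta - alpha * slope  # negative slope -> theta increases, positive -> decreases

    print(theta, J(theta))             # theta is close to 3, cost is close to 0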
Gradient
  • the multi-dimensional generalization of the derivative
  • gives us the direction of the steepest incline
  • tells us which parameter $\theta_i$ has the biggest influence on the cost (see the numerical sketch below)

$\nabla J = \frac{\partial J}{\partial \theta} = \begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \vdots \\ \frac{\partial J}{\partial \theta_p} \end{bmatrix}$

  • with only one $\theta$:
    $\nabla J(\theta) = \frac{dJ}{d\theta} = J'$
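  • As a small illustration (all names and numbers here are made up for this sketch), the partial derivatives in the gradient can be approximated numerically with finite differences:

    import numpy as np

    def J(theta):
        # example cost with two parameters theta_0 and theta_1 (assumed for this sketch)
        return (theta[0] - 1) ** 2 + 2 * (theta[1] + 3) ** 2

    def numerical_gradient(J, theta, eps=1e-6):
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            step = np.zeros_like(theta)
            step[i] = eps
            grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)  # dJ/dtheta_i
        return grad

    print(numerical_gradient(J, np.array([0.0, 0.0])))  # approx. [-2., 12.]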
Example with a single $\theta = \theta_1$
  • Example data

    Data points: $[4, 2]$ and $[4, 4]$
    $y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}$, $X = \begin{bmatrix} x_{1,1} \\ x_{2,1} \end{bmatrix} = \begin{bmatrix} 4 \\ 4 \end{bmatrix}$

  • We want to fit a line with no intercept
    $\hat{y}_j = \theta_1 \cdot x_{j,1}$

  • we can write down the cost function:

    • $J(\theta) = RSS = \sum^n_{j=1}(y_j - \hat{y_j})^2$
    • $= \sum^n_{j=1}(y_j - \theta_1 \cdot x_{j,1})^2$
  • we can calculate the derivative:

    • $\nabla J(\theta) = \frac{dJ}{d\theta}$
    • $= \sum^n_{j=1} -2 x_{j,1}(y_j - \theta_1 \cdot x_{j,1})$
  • what if we use a flat line ($\theta = 0$)?
    • $J(\theta = 0) = \sum^n_{j=1}(y_j - \theta_1 \cdot x_{j,1})^2 = (2-0)^2 + (4-0)^2 = 20$
    • $\nabla J(\theta = 0) = \sum^n_{j=1} -2 x_{j,1}(y_j - \theta_1 \cdot x_{j,1})$
      $= -(2 \cdot 4 \cdot 2 + 2 \cdot 4 \cdot 4) = -48$
  • We make a large error ($J(\theta) = 20$)
  • The slope is negative ($J'(\theta) = -48$), so we must increase $\theta$
  • what if we use a line with a slope of $1$ ($\theta = 1$)?
    • $J(\theta = 1) = \sum^n_{j=1}(y_j - \theta_1 \cdot x_{j,1})^2 = (2 - 1 \cdot 4)^2 + (4 - 1 \cdot 4)^2 = 4$
    • $\nabla J(\theta = 1) = \sum^n_{j=1} -2 x_{j,1}(y_j - \theta_1 \cdot x_{j,1})$
      $= -(2 \cdot 4 \cdot (2-4) + 2 \cdot 4 \cdot (4-4)) = 16$
  • We make a much smaller error ($J(\theta) = 4$)
  • The slope is positive ($J'(\theta) = 16$), so we must decrease $\theta$ (see the check below)
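  • A quick check of these numbers in Python (nothing assumed beyond the two data points from the example):

    import numpy as np

    x = np.array([4.0, 4.0])   # x_{1,1}, x_{2,1}
    y = np.array([2.0, 4.0])   # y_1, y_2

    def J(theta):               # RSS cost for the no-intercept line y_hat = theta * x
        return np.sum((y - theta * x) ** 2)

    def grad_J(theta):          # derivative of the RSS cost
        return np.sum(-2 * x * (y - theta * x))

    print(J(0.0), grad_J(0.0))  # 20.0 -48.0 -> slope negative, increase theta
    print(J(1.0), grad_J(1.0))  # 4.0   16.0 -> slope positive, decrease theta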
🧠 Key Takeaways
  • if we have any cost function $J(\vec{\theta})$ (error measure) that is differentiable, we have an algorithm to estimate "good" parameters $\vec{\theta}$
  • this algorithm converges to a local minimum of the cost function, depending on the starting parameters
  • we are not guaranteed to find a global minimum
https://www.fromthegenesis.com/gradient-descent-part1/

Teaching Computers

  • Machine Learning is about teaching computers to learn from data
  • In many cases, we want to minimize the error of a model
  • Gradient Descent is a general algorithm to minimize errors
  • There are many variations of Gradient Descent
🧠 Learning Rates
  • the slope/gradient shows the direction, but we must choose a step width (the learning rate $\alpha$), as illustrated below
  • Learning rate too big: we can overshoot the target
  • Learning rate too small: slow learning

https://www.fromthegenesis.com/gradient-descent-part1/
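  • A small sketch of this effect on the toy cost $J(\theta) = \theta^2$ (values chosen only for illustration):

    def run(alpha, theta=5.0, steps=20):
        for _ in range(steps):
            theta = theta - alpha * 2 * theta   # J'(theta) = 2 * theta
        return theta

    print(run(alpha=0.01))   # too small: after 20 steps still far from the minimum at 0
    print(run(alpha=0.1))    # reasonable: close to the minimum
    print(run(alpha=1.1))    # too big: theta overshoots and diverges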
🧠 Improvement: Adaptable Learning Rates
  • Steps are larger in steeper areas
  • Steps get smaller over time (see the decay sketch below)
https://wiki.tum.de/display/lfdv/Adaptive+Learning+Rate+Method?preview=/23573655/25008837/perceptron_learning_rate local minima.png
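  • One simple way to shrink the steps over time is a decay schedule (the formula below is just one common choice, not a specific method from the slide):

    alpha_0, decay = 0.5, 0.1
    for t in range(5):
        alpha_t = alpha_0 / (1 + decay * t)   # learning rate shrinks with iteration t
        print(t, round(alpha_t, 3))           # 0.5, 0.455, 0.417, 0.385, 0.357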
🧠 Other Improvements
  • Momentum strategy - take the direction of the last steps into account
  • Stochastic Gradient Descent - compute each step on a random subset (mini-batch) of the training data (see the sketch below)

https://www.researchgate.net/publication/2295939_Lecture_Notes_in_Computer_Science/figures?lo=1
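  • A rough sketch combining both ideas on a synthetic one-parameter regression (data, batch size, and constants are made up for this example):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)                       # synthetic training data
    y = 2.0 * x + rng.normal(scale=0.5, size=200)  # true slope is 2

    theta, velocity = 0.0, 0.0
    alpha, beta = 0.05, 0.9                        # learning rate and momentum factor
    for step in range(200):
        idx = rng.choice(len(x), size=20)          # stochastic: random mini-batch
        xb, yb = x[idx], y[idx]
        grad = np.mean(-2 * xb * (yb - theta * xb))
        velocity = beta * velocity + (1 - beta) * grad  # momentum: remember past directions
        theta = theta - alpha * velocity

    print(theta)   # close to the true slope of 2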

When to stop learning?

  • Learning curves plot the development of the training error over the steps (iterations of gradient descent)

  • The loss (the value of the cost function $J$) will never be zero
  • except for a perfect (over)fit on the training data
🧠 Solution: Stopping criteria
  • maximum number of iterations
  • early stopping: no (big) improvement for $x$ iterations (see the sketch below)
https://ai.stackexchange.com/questions/22369/why-the-cost-loss-starts-to-increase-for-some-iterations-during-the-training-pha
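  • A sketch of both criteria (the function `step_fn` is a hypothetical placeholder that performs one training step and returns the current validation loss):

    def train_with_early_stopping(step_fn, max_iter=1000, patience=10, min_delta=1e-4):
        best_loss, best_iter = float("inf"), 0
        for i in range(max_iter):                  # criterion 1: maximum number of iterations
            val_loss = step_fn(i)                  # one training step, returns validation loss
            if val_loss < best_loss - min_delta:   # (big enough) improvement
                best_loss, best_iter = val_loss, i
            elif i - best_iter >= patience:        # criterion 2: early stopping
                break
        return best_loss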

Different Learning Algorithms

  • different machine learning models use different learning algorithms
  • almost all follow the basic principle of gradient descent
  • they can differ in both speed and accuracy
  • take the default or test them during model selection (see the sketch below)
https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
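  • A hedged scikit-learn sketch (assuming the built-in breast-cancer data set): the learning algorithm (solver) can be treated like any other hyper-parameter during model selection:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # compare different solvers via cross-validation
    grid = GridSearchCV(pipe, {"logisticregression__solver": ["lbfgs", "liblinear", "saga"]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, round(grid.best_score_, 3))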

✍️ Task


2.6.2 Accuracy Measures and Learning Curves

🎯 Learning objectives

You will be able to

  • use different accuracy measures
  • use learning curves to identify over-fitting

What is a good model?

  • A good model makes an accurate prediction
  • Models with low flexibility (high bias) are not accurate because they are too simple to model the real world
  • Models with high flexibility (high variance) are not accurate because they are too complex and overfit the training data
https://medium.com/@ivanreznikov/stop-using-the-same-image-in-bias-variance-trade-off-explanation-691997a94a54

🧠 Cost and Accuracy Measures

  • Residual Sum of Squares
    • Squared error over the whole data set with $n$ observations

    $\text{RSS} = \sum^n_{i=1}(y_i - \hat{y_i})^2$

  • Mean Squared Error
    • Squared error for the average observation

    $\text{MSE} = \frac{1}{n}\sum^n_{i=1}(y_i - \hat{y_i})^2$

  • Root Mean Squared Error
    • $\text{MSE}$ corrected for the dimension (same unit as $y$)

    $\text{RMSE} = \sqrt{\frac{1}{n}\sum^n_{i=1}(y_i - \hat{y_i})^2}$

https://scikit-learn.org/stable/modules/classes.html#regression-metrics
  • Mean Absolute Error

    • keeps the unit of the predicted variable

    $\text{MAE} = \frac{1}{n}\sum^n_{i=1}|y_i - \hat{y_i}|$

  • Mean Absolute Percentage Error

    • allows comparison between different data sets

    $\text{MAPE} = \frac{1}{n}\sum^n_{i=1}\left|\frac{y_i - \hat{y_i}}{y_i}\right|$

    • not defined for $y_i = 0$, biased for small $y_i$ (see the sketch below for all measures)
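  • A small NumPy sketch computing all of these measures on made-up values (scikit-learn provides the same metrics, see the link above):

    import numpy as np

    y     = np.array([3.0, 5.0, 8.0, 10.0])   # observed values (made up)
    y_hat = np.array([2.5, 5.5, 7.0, 11.0])   # predictions (made up)

    rss  = np.sum((y - y_hat) ** 2)
    mse  = np.mean((y - y_hat) ** 2)
    rmse = np.sqrt(mse)
    mae  = np.mean(np.abs(y - y_hat))
    mape = np.mean(np.abs((y - y_hat) / y))   # undefined if any y_i == 0

    print(rss, mse, rmse, mae, mape)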

🧠 Learning Curves

  • A learning curve is a plot of model learning performance over experience or time (e.g., number of iterations of gradient descent or amount of training data).
  • To really understand how a model behaves, the data must be split into a training and a validation set
    • Training Learning Curve: Learning curve calculated from the training dataset that gives an idea of how well the model is learning
    • Validation Learning Curve: Learning curve calculated from a hold-out validation dataset that gives an idea of how well the model is generalizing
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
The Learning Curve Plot
  • Learning Curve: Line plot of learning (y-axis)
    • Loss/Score can be any accuracy measure
  • over experience (x-axis), for instance
    • the number of iterations
    • the size of the training set
    • the duration of training (see the sketch below)
https://upload.wikimedia.org/wikipedia/commons/2/24/Learning_Curves_(Naive_Bayes).png
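  • A sketch of such a plot with scikit-learn and matplotlib, using training set size as the x-axis (the diabetes data set and ridge regression are only example choices):

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import learning_curve

    X, y = load_diabetes(return_X_y=True)
    sizes, train_scores, val_scores = learning_curve(
        Ridge(), X, y, train_sizes=np.linspace(0.1, 1.0, 8),
        cv=5, scoring="neg_mean_squared_error")

    plt.plot(sizes, -train_scores.mean(axis=1), label="training loss (MSE)")
    plt.plot(sizes, -val_scores.mean(axis=1), label="validation loss (MSE)")
    plt.xlabel("training set size"); plt.ylabel("loss"); plt.legend(); plt.show()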
🧠 Examples
  • Model too flexible and over-fits the data
  • small training error but large validation error
🧠 Examples
  • Model is flexible, but training was stopped too early
  • We would expect a better fit with more training
🧠 Examples
  • Model is very flexible and over-fits the training data after some time
  • so the loss on the unseen validation data increases a little
🧠 Examples
  • Example for a good fit
  • Usually we would expect a better fit on the training set than on the validation set
🧠 Examples
  • The training set might be too small
  • The model fits the training set well, but this does not translate to the validation set
🧠 Examples
  • The validation set might be too small
  • Depending on which observations end up in the validation set, the loss fluctuates a lot (high variance)

✍️ Case Study

  • we train different models on a synthetic data set
  • we use learning curves to see how the models behave during training

6.3 Accuracy Measures and Learning Curves

⌛ 25 min
