Gradient Boosting Machine
Machine Learning
Jesus A. Gonzalez
August 10, 2019
Gradient Boosted Trees
- Reference: Machine Learning with Boosting: A Beginner's Guide, by Scott Hartshorn
Gradient Boosted Trees
- Combines multiple decision trees sequentially
- If-then questions
- Split the data
- Improves on single decision trees with respect to overfitting
- Simple decision trees tend to overfit
Boosting with Trees
Boosting
- Combining weak learners to create a stronger learner
- Weak learner
- A learner that is only slightly correlated with the true classification
- Labels examples better than random guessing
- Good weak learners for boosting?
- Those that are very accurate, even if only over a limited scope of the problem
- Boosting combines weak learners so that each one solves a limited section of the problem
How Does Boosting Work?
- Learns how to use the feature values to create groups of data points with similar final answers
- Learns how to adjust the result in each group to get better results at the end
- The resulting trees, groups, and adjustment values are used to predict
The Abstract Boosting Process
Splitting the Groups
- Computationally expensive
- Uses regression trees to group the data
- Groups according to how much error remains in each data point (see the sketch below)
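A minimal sketch of this grouping step, assuming scikit-learn's DecisionTreeRegressor; the data and parameter values here are illustrative, not from the slides.

```python
# Group data points by remaining error using a regression tree;
# tree.apply returns the leaf (group) each point lands in.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
error = np.sin(X).ravel()                     # stand-in for the remaining error
tree = DecisionTreeRegressor(max_depth=2).fit(X, error)
groups = tree.apply(X)                        # leaf id = group membership
print(np.unique(groups))                      # at most 2**2 = 4 groups
```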
Adjusting the Output for the Next Iteration
- For each group
- Compute a value to add to or subtract from all data points in the group
- We get a new value for each point, ready for the next iteration (continued in the sketch below)
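Continuing the sketch above (reusing `groups` and `error` from it), the adjustment step might look like this:

```python
# For each group, add its average error to the current prediction,
# then recompute the error in preparation for the next iteration.
pred = np.zeros(len(error))                   # current prediction (starts at 0.0)
for g in np.unique(groups):
    mask = groups == g
    pred[mask] += error[mask].mean()          # add the group's average error
new_error = error - pred                      # ready for the next iteration
```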
Result of a Set of Gradient Boosted Trees
- The splits that created the groups for each tree
- The amount of change applied to each group
Regression vs Classification
- Both use the same method for splitting the data into groups
- The method to get the amount of change to apply is different
- Regression is straightforward
- Split training data based on error at each step
- Classification
- Change categories into real values and back again
- For binary classification
- For three or more categories
- One-vs-the-rest strategy
- A series of two-category problems
The Boosting Regression Process
1. Predict an initial estimate of 0.0
2. Use the true values to calculate the error in the initial prediction
3. Split the data into groups using the features of the data
- The goal is putting data with similar error in the same group
4. For each group, find the average error
5. For every data point in that group, add the average error to the current prediction
6. Calculate the new error for each point under the new prediction
7. Repeat steps 3 to 6 as many times as required (see the sketch after this list)
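A minimal sketch of the whole loop, again assuming scikit-learn's DecisionTreeRegressor for the grouping step; the function names and default parameter values are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_regression(X, y, n_rounds=50, depth=2, learning_rate=0.5):
    pred = np.zeros_like(y, dtype=float)          # step 1: initial estimate of 0.0
    trees = []
    for _ in range(n_rounds):                     # step 7: repeat steps 3 to 6
        error = y - pred                          # steps 2 and 6: current error
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, error)                        # steps 3-4: each leaf is a group,
        trees.append(tree)                        #   predicting its average error
        pred += learning_rate * tree.predict(X)   # step 5: adjust each point
    return trees

def boosted_predict(trees, X, learning_rate=0.5):
    pred = np.zeros(X.shape[0])
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```

Fitting the tree to the current error covers steps 3 and 4 at once: each leaf is a group whose predicted value is that group's average error. The learning rate scales each adjustment, anticipating the parameter discussed below.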
Parameters
- Depth of the trees
- Depth 1: fits the error with 2 horizontal lines
- Depth 2: fits the error with 4 horizontal lines
- Depth 3: fits the error with 8 horizontal lines
- Depth \(d\): fits the error with \(2^d\) horizontal lines
- Increasing the depth increases the time to fit (see the check below)
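An illustrative check of the depth claim, under the same assumed scikit-learn setup: a depth-\(d\) regression tree has at most \(2^d\) leaves, i.e. it fits the error with at most \(2^d\) horizontal lines.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
error = np.sin(X).ravel()
for d in (1, 2, 3):
    tree = DecisionTreeRegressor(max_depth=d).fit(X, error)
    print(d, tree.get_n_leaves())             # prints 2, 4, 8
```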
Parameters
- Learning rate
- How much of the correction factor should we apply?
- A high learning rate may lead to overfitting
- Especially if our data contains noise
- A high learning rate leads to faster convergence
- A low learning rate is preferred if we can afford the longer training time
- The learning rate takes values in \((0, 1]\): greater than 0, at most 1 (a usage sketch follows)
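A usage sketch reusing `boosted_regression` from the earlier example: on noisy data, a low learning rate with more rounds tends to generalize better than a high learning rate with few rounds (the parameter values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X).ravel() + 1 + rng.normal(scale=0.2, size=300)   # noisy targets

fast = boosted_regression(X, y, n_rounds=20, learning_rate=0.9)    # converges fast, may overfit noise
slow = boosted_regression(X, y, n_rounds=200, learning_rate=0.1)   # slower, usually safer
```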
How Does a GBM Focus on Parts of the Problem?
- Two approaches
- Subtracts the current prediction from the true value at each iteration
- Then fits the regression tree to the current error
- An approach embedded in the regression tree itself
- Minimizes the summed squared error (restated in symbols below)
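In symbols (the notation \(F\) for the current model is mine, not the slides'): the first approach fits the next tree to the residuals
\[ r_i = y_i - F(x_i), \]
while the second is the tree's own split criterion, which chooses the groups \(g\) that minimize the summed squared error
\[ \sum_{g} \sum_{i \in g} \left( r_i - \bar{r}_g \right)^2, \]
where \(\bar{r}_g\) is the average residual in group \(g\).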
Regression Boosting Example
- Function: \(y = \sin(x) + 1\) (fitted in the sketch below)
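A usage sketch for the running example, assuming the `boosted_regression` and `boosted_predict` functions defined earlier in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + 1

trees = boosted_regression(X, y, n_rounds=100, depth=2, learning_rate=0.3)
fit = boosted_predict(trees, X, learning_rate=0.3)
print(np.mean((fit - y) ** 2))                # training error shrinks as rounds grow
```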
GBM for Classification
- Now we have a discrete class
- More complicated than regression
- Change categories into numbers
- Treat categories as one-vs-the-rest
- Deal with two ranges of numbers
- 0 to 1: to represent the two classes
- \(-\infty\) to \(+\infty\): to keep updating the error after each iteration, for as many iterations as needed
- Convert from one range to the other
- 0-1 to infinite
- Logit function: \(\ln\frac{p}{1-p}\)
- Infinite to 0-1
- Logistic (sigmoid) function: \(\frac{1}{1 + e^{-x}}\) (both conversions are sketched below)
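A minimal sketch of the two range conversions named above.

```python
import numpy as np

def logit(p):
    """0-1 range -> (-inf, +inf)."""
    return np.log(p / (1 - p))

def sigmoid(x):
    """(-inf, +inf) -> 0-1 range."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(logit(0.8)))                    # ~0.8: the two functions are inverses
```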
GBM for Classification - Steps
- Assign 0 and 1 to the two categories
- Set the initial prediction equal to the average value
- Compute the current error
- Split into groups using a (regression) decision tree
- Compute the amount of change for each point
- Convert all points to the infinite range
- Add the amount of change to each point
- Convert all points back to the 0-1 range
- Iterate as many times as needed
- Finish fitting the model (the loop is sketched below)
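A minimal sketch of this training loop, reusing `logit` and `sigmoid` from the previous example; the update rule is simplified relative to full gradient boosting, and all names and default values are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_classifier_fit(X, y01, n_rounds=50, depth=2, learning_rate=0.5):
    p = np.full(y01.shape, y01.mean(), dtype=float)   # initial prediction = average value
    trees = []
    for _ in range(n_rounds):
        error = y01 - p                               # current error, in the 0-1 range
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, error)  # grouping step
        trees.append(tree)
        raw = logit(np.clip(p, 1e-6, 1 - 1e-6))       # convert to the infinite range
        raw += learning_rate * tree.predict(X)        # add the amount of change
        p = sigmoid(raw)                              # convert back to the 0-1 range
    return trees, p                                   # fitted trees, final training probabilities
```

Note that this sketch only tracks training-set probabilities; turning it into a predictor for new points would require replaying the same conversions at prediction time.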