Gradient Boosting Machine
Machine Learning
Jesus A. Gonzalez
August 10, 2019
Gradient Boosted Trees
- Reference: Machine Learning with Boosting: A Beginner's Guide, by Scott Hartshorn
Gradient Boosted Trees
- Combines multiple decision trees sequentially
- If-then questions
- Split the data
- Improves on single decision trees with respect to overfitting
- Simple decision trees tend to overfit
Boosting with Trees
Boosting
- Combining weak learners to create a stronger learner
- Weak learner
- A learner that is only slightly correlated with the true classification
- Labels examples better than random guessing
- Good weak learners for boosting?
- Those that are very accurate, even if only over a limited scope of the problem
- Boosting combines weak learners so that each one solves a limited section of the problem
How Does Boosting Work?
- Learns how to use the feature values to create groups of data points with similar final answers
- Learns how to adjust the result in each group to get better results at the end
- The resulting trees, groups, and adjustment values are used to predict
The Abstract Boosting Process
Splitting the Groups
- Computationally expensive
- Uses regression trees to group the data
- Groups according to how much error remains in each data point (see the sketch below)
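A minimal sketch of this grouping step, assuming scikit-learn's DecisionTreeRegressor; the data and parameter values here are illustrative, not from the slides.

```python
# Group data points by remaining error using a regression tree;
# tree.apply returns the leaf (group) each point lands in.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
error = np.sin(X).ravel()                     # stand-in for the remaining error
tree = DecisionTreeRegressor(max_depth=2).fit(X, error)
groups = tree.apply(X)                        # leaf id = group membership
print(np.unique(groups))                      # at most 2**2 = 4 groups
```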
Adjusting the Output for the Next Iteration
- For each group
- Compute a value to add to or subtract from all data points in the group
- We get a new value for each point, ready for the next iteration (continued in the sketch below)
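Continuing the sketch above (reusing `groups` and `error` from it), the adjustment step might look like this:

```python
# For each group, add its average error to the current prediction,
# then recompute the error in preparation for the next iteration.
pred = np.zeros(len(error))                   # current prediction (starts at 0.0)
for g in np.unique(groups):
    mask = groups == g
    pred[mask] += error[mask].mean()          # add the group's average error
new_error = error - pred                      # ready for the next iteration
```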
Result of a Set of Gradient Boosted Trees
- The splits that created the groups for each tree
- The amount of change applied to each group
Regression vs Classification
- Both use the same method for splitting the data into groups
- The method to get the amount of change to apply is different
- Regression is straightforward
- Split training data based on error at each step
- Classification
- Change categories into real values and back again
- For binary classification
- For three or more categories
- One-vs-the-rest strategy
- A series of two-category problems
The Boosting Regression Process
1. Predict an initial estimate of 0.0
2. Use the true values to calculate the error in the initial prediction
3. Split the data into groups using the features of the data
- The goal is putting data with similar error in the same group
4. For each group, find the average error
5. For every data point in that group, add the average error to the current prediction
6. Calculate the new error for each point under the new prediction
7. Repeat steps 3 to 6 as many times as required (see the sketch after this list)
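A minimal sketch of the whole loop, again assuming scikit-learn's DecisionTreeRegressor for the grouping step; the function names and default parameter values are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_regression(X, y, n_rounds=50, depth=2, learning_rate=0.5):
    pred = np.zeros_like(y, dtype=float)          # step 1: initial estimate of 0.0
    trees = []
    for _ in range(n_rounds):                     # step 7: repeat steps 3 to 6
        error = y - pred                          # steps 2 and 6: current error
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, error)                        # steps 3-4: each leaf is a group,
        trees.append(tree)                        #   predicting its average error
        pred += learning_rate * tree.predict(X)   # step 5: adjust each point
    return trees

def boosted_predict(trees, X, learning_rate=0.5):
    pred = np.zeros(X.shape[0])
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```

Fitting the tree to the current error covers steps 3 and 4 at once: each leaf is a group whose predicted value is that group's average error. The learning rate scales each adjustment, anticipating the parameter discussed below.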
Parameters
- Depth of the trees
- Depth 1: fits the error with 2 horizontal lines
- Depth 2: fits the error with 4 horizontal lines
- Depth 3: fits the error with 8 horizontal lines
- Depth \(d\): fits the error with \(2^d\) horizontal lines
- Increasing the depth increases the time to fit (see the check below)
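An illustrative check of the depth claim, under the same assumed scikit-learn setup: a depth-\(d\) regression tree has at most \(2^d\) leaves, i.e. it fits the error with at most \(2^d\) horizontal lines.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
error = np.sin(X).ravel()
for d in (1, 2, 3):
    tree = DecisionTreeRegressor(max_depth=d).fit(X, error)
    print(d, tree.get_n_leaves())             # prints 2, 4, 8
```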
Parameters
- Learning rate
- How much of the correction factor should we apply?
- A high learning rate may lead to overfitting
- Especially if our data contains noise
- A high learning rate leads to faster convergence
- A low learning rate is preferred if we can afford the longer training time
- The learning rate takes values in \((0, 1]\): greater than 0, at most 1 (a usage sketch follows)
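A usage sketch reusing `boosted_regression` from the earlier example: on noisy data, a low learning rate with more rounds tends to generalize better than a high learning rate with few rounds (the parameter values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X).ravel() + 1 + rng.normal(scale=0.2, size=300)   # noisy targets

fast = boosted_regression(X, y, n_rounds=20, learning_rate=0.9)    # converges fast, may overfit noise
slow = boosted_regression(X, y, n_rounds=200, learning_rate=0.1)   # slower, usually safer
```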
How Does a GBM Focus on Parts of the Problem?
- Two approaches
- Subtracts the current prediction from the true value at each iteration
- Then fits the regression tree to the current error
- An approach embedded in the regression tree itself
- Minimizes the summed squared error (restated in symbols below)
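In symbols (the notation \(F\) for the current model is mine, not the slides'): the first approach fits the next tree to the residuals
\[ r_i = y_i - F(x_i), \]
while the second is the tree's own split criterion, which chooses the groups \(g\) that minimize the summed squared error
\[ \sum_{g} \sum_{i \in g} \left( r_i - \bar{r}_g \right)^2, \]
where \(\bar{r}_g\) is the average residual in group \(g\).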
Regression Boosting Example
- Function: \(y = \sin(x) + 1\) (fitted in the sketch below)
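A usage sketch for the running example, assuming the `boosted_regression` and `boosted_predict` functions defined earlier in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + 1

trees = boosted_regression(X, y, n_rounds=100, depth=2, learning_rate=0.3)
fit = boosted_predict(trees, X, learning_rate=0.3)
print(np.mean((fit - y) ** 2))                # training error shrinks as rounds grow
```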
GBM for Classification
- Now we have a discrete class
- More complicated than regression
- Change categories into numbers
- Treat categories as one-vs-the-rest
- Deal with two ranges of numbers
- 0 to 1: to represent the two classes
- \(-\infty\) to \(+\infty\): to keep updating the error after each iteration, for as many iterations as needed
- Convert from one range to the other
- 0-1 to infinite
- Logit function: \(\ln\frac{p}{1-p}\)
- Infinite to 0-1
- Logistic (sigmoid) function: \(\frac{1}{1 + e^{-x}}\) (both conversions are sketched below)
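A minimal sketch of the two range conversions named above.

```python
import numpy as np

def logit(p):
    """0-1 range -> (-inf, +inf)."""
    return np.log(p / (1 - p))

def sigmoid(x):
    """(-inf, +inf) -> 0-1 range."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(logit(0.8)))                    # ~0.8: the two functions are inverses
```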
GBM for Classification - Steps
- Assign 0 and 1 to the two categories
- Set the initial prediction equal to the average value
- Compute the current error
- Split into groups using a (regression) decision tree
- Compute the amount of change for each point
- Convert all points to the infinite range
- Add the amount of change to each point
- Convert all points back to the 0-1 range
- Iterate as many times as needed
- Finish fitting the model (the loop is sketched below)
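A minimal sketch of this training loop, reusing `logit` and `sigmoid` from the previous example; the update rule is simplified relative to full gradient boosting, and all names and default values are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_classifier_fit(X, y01, n_rounds=50, depth=2, learning_rate=0.5):
    p = np.full(y01.shape, y01.mean(), dtype=float)   # initial prediction = average value
    trees = []
    for _ in range(n_rounds):
        error = y01 - p                               # current error, in the 0-1 range
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, error)  # grouping step
        trees.append(tree)
        raw = logit(np.clip(p, 1e-6, 1 - 1e-6))       # convert to the infinite range
        raw += learning_rate * tree.predict(X)        # add the amount of change
        p = sigmoid(raw)                              # convert back to the 0-1 range
    return trees, p                                   # fitted trees, final training probabilities
```

Note that this sketch only tracks training-set probabilities; turning it into a predictor for new points would require replaying the same conversions at prediction time.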