Ensemble Models: Bagging Techniques
Overview
We covered what bagging is in Ensemble Models: Overview. To recap:
- In bagging (Bootstrap Aggregating), a set of weak learners is combined to create a strong learner that performs better than any single one.
- Bagging helps to decrease the model’s variance.
- Combinations of multiple classifiers decrease variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier.
In this blog post, we will use random forest as an example to illustrate how bagging works.
Bagging works as follows (a minimal code sketch is shown after the list):
- Multiple subsets are created from the original dataset, selecting observations with replacement.
- A base model (weak model) is created on each of these subsets.
- The models run in parallel and are independent of each other.
- The final predictions are determined by combining the predictions from all the models.
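As a minimal sketch of these steps, the snippet below uses scikit-learn's BaggingClassifier on a synthetic dataset; the data and parameter values are illustrative assumptions, not taken from the original post.

```python
# Minimal bagging sketch (illustrative data and settings, not from the original post).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 50 base learners (decision trees by default), each fit on a bootstrap sample drawn
# with replacement; n_jobs=-1 trains them in parallel; predictions are combined by vote.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, n_jobs=-1, random_state=42)
bag.fit(X_train, y_train)
print("Bagging test accuracy:", bag.score(X_test, y_test))
```

By default each base learner is a decision tree; setting n_jobs=-1 lets the independent models train in parallel, matching the third bullet above.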
Next, let's consider random forest, a model that fully utilizes the idea of bagging in its procedure.
Random Forest
1. Definition
- A random forest consists of multiple random decision trees. Two types of randomness are built into the trees.
- First, each tree is built on a random sample from the original data.
- Second, at each tree node, a subset of features is randomly selected to generate the best split. (Key difference from plain bagging algorithms)
- An ensemble model that is widely applied (in part because the tree building can be parallelized)
- Designed to address the overfitting issue of decision trees. The idea is that, by training each tree on a different sample, any individual tree might have high variance with respect to its particular set of training data, but the forest as a whole has lower variance, and not at the cost of increased bias.
- Procedure
- Repeat the following steps to grow each tree in the forest:
  - Bootstrapping: create a bootstrapped dataset by randomly selecting samples from the original dataset, with replacement, until it reaches the same size as the original sample (the same sample may be picked more than once);
  - Decision Tree Construction: build a decision tree on the bootstrapped dataset, but use only a random subset of variables/features at each step (i.e. at each decision node);
- Bagging: defined as bootstrapping the data plus using aggregation to make a decision:
  - given a new instance, run it through all the decision trees (the entire random forest) and obtain the vote counts for y = 1 and y = 0 (this step is called "aggregation");
  - return the class with the higher vote count;
- Choose the most accurate random forest:
  - Measure accuracy based on the Out-of-Bag samples (CV): compute the Out-of-Bag Error as the number of OOB samples that the forest classifies wrongly (see the sketch below);
  - The number of variables used per step affects accuracy (it is optimized during CV): usually start with the square root of the total number of variables and try a few settings above and below that value.
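To make the procedure concrete, here is a hedged from-scratch sketch: bootstrap sampling, per-split feature subsampling via max_features="sqrt", majority-vote aggregation, and an out-of-bag error estimate. The data and settings are illustrative, not from the original post.

```python
# From-scratch random forest sketch (illustrative only, not the post's code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data (not from the original post).
X, y = make_classification(n_samples=500, n_features=16, random_state=0)
n_samples = X.shape[0]
rng = np.random.default_rng(0)

trees = []
oob_votes = np.zeros((n_samples, 2))  # per-sample vote counts for classes 0 and 1

for i in range(100):
    # Bootstrapping: draw row indices with replacement, same size as the original data.
    idx = rng.integers(0, n_samples, size=n_samples)
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[idx] = False  # rows never drawn are "out-of-bag" for this tree

    # Decision tree construction: max_features="sqrt" considers a random feature
    # subset at every split -- the extra randomness that defines a random forest.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

    # Accumulate out-of-bag votes so accuracy can be estimated without a test set.
    if oob_mask.any():
        preds = tree.predict(X[oob_mask])
        oob_votes[np.where(oob_mask)[0], preds] += 1

# Out-of-Bag Error: fraction of samples the aggregated OOB votes classify wrongly.
has_votes = oob_votes.sum(axis=1) > 0
oob_error = np.mean(oob_votes.argmax(axis=1)[has_votes] != y[has_votes])
print("Out-of-bag error:", oob_error)

# Aggregation for a new instance: every tree votes and the majority wins.
new_x = X[:1]
votes = np.array([t.predict(new_x)[0] for t in trees])
print("Forest prediction for the first row:", np.bincount(votes).argmax())
```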
- The low correlation between the individual trees is the key to the ensemble's performance
WARNING: RF is often considered a bagging model, but that is not always accurate; see this link
2. Pros & Cons
Pros
- The ability to handle large data sets with high dimensionality (since each tree selects far fewer features during its construction)
- The model outputs variable importances (rf.feature_importances_), which can be a very handy feature
- It helps balance errors in data sets where classes are imbalanced.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
- Using the out-of-bag error estimate to select the most accurate random forest removes the need for a set-aside test set (see the sketch below).
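A short sketch of these points with scikit-learn's RandomForestClassifier (the attribute is actually named feature_importances_, and oob_score_ holds the out-of-bag accuracy); the data here is synthetic and only illustrative.

```python
# Illustrative sketch of the pros above (synthetic data, not from the original post).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced synthetic data: roughly 80% class 0, 20% class 1.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=1)

rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,            # accuracy estimated on out-of-bag samples, no test set needed
    class_weight="balanced",   # re-weight classes to balance errors on imbalanced data
    n_jobs=-1,
    random_state=1,
)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)
print("Variable importances:", rf.feature_importances_)
```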
Cons
- It has very poor interpretability
- It does not extrapolate well: it cannot reliably predict for data that lies outside the bounds of the original training data (illustrated in the sketch after this list)
- Random forest can feel like a black-box approach for statistical modelers: we have very little control over what the model does; at best, we can try different parameters and random seeds.
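A tiny illustration of the extrapolation weakness, on toy data of my own choosing rather than anything from the post: a forest fit on x in [0, 10] can only average targets it saw during training, so its predictions flatten out beyond that range.

```python
# Toy demonstration of the extrapolation con (illustrative data, not from the post).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

x_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 3.0 * x_train.ravel()                      # simple linear trend, max target = 30

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)

print(rf.predict([[5.0]]))    # interpolation: close to the true value 15
print(rf.predict([[20.0]]))   # extrapolation: stays near 30, far from the true value 60
```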
3. Simple Implementation
This is a template inspired by Kaggle notebooks; thanks to the writers whose code I borrowed from.
Also note that an AWS S3 connection is made here to load the data, and the tree-building process is run in parallel.
```python
import pandas as pd
```
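The original template is cut off after the pandas import, so here is a hedged sketch of how such a notebook typically continues; the file name "train.csv" and the "target" column are placeholders, not names from the original post.

```python
# Hedged sketch of a typical pandas + scikit-learn random forest workflow.
# "train.csv" and "target" are hypothetical placeholders, not from the original post.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")            # hypothetical training file
X = df.drop(columns=["target"])          # hypothetical feature columns
y = df["target"]                         # hypothetical label column

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# sqrt(#features) per split is the usual default starting point mentioned above.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

print("Validation accuracy:", rf.score(X_valid, y_valid))
print(pd.Series(rf.feature_importances_, index=X.columns)
      .sort_values(ascending=False).head())
```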