# Ensemble Models: Bagging Techniques

### Overview

We introduced bagging in *Ensemble Models: Overview*. To recap:

- In bagging (Bootstrap Aggregating), a set of weak learners is combined to create a strong learner that obtains better performance than any single one.
- Bagging helps to decrease the model's variance.
- Combining multiple classifiers decreases variance, especially for unstable classifiers, and may produce a more reliable classification than a single classifier.

In this blog, we will use random forest as an example to illustrate how bagging works. **Bagging** works as follows:

- Multiple subsets are created from the original dataset, selecting observations with replacement.
- A base model (weak model) is created on each of these subsets.
- The models run in parallel and are independent of each other.
- The final predictions are determined by combining the predictions from all the models.
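The four steps above can be sketched in a few lines. This is a minimal illustration (not from the original post), assuming scikit-learn decision trees as the weak learners and a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

n_models = 25
models = []
for _ in range(n_models):
    # Step 1: create a subset by sampling rows with replacement (bootstrap)
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit a weak base model on the subset
    models.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx]))
    # Step 3: each model is independent, so this loop could run in parallel

# Step 4: combine the predictions of all models by majority vote
votes = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_samples)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
acc = (y_pred == y).mean()
print("ensemble training accuracy:", acc)
```

In practice you would use a ready-made ensemble class instead of the manual loop, but the loop makes the bootstrap-then-aggregate structure explicit.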

Next, let's consider random forest, a model that fully utilizes the idea of bagging in its procedure.

### Random Forest

#### 1. Definition

- A random forest consists of multiple random decision trees, with two types of randomness built into the trees:
- First, each tree is built on a random bootstrap sample of the original data.
- Second, at each tree node, a subset of features is randomly selected to generate the best split. (This is the key difference from plain bagging.)

- An ensemble model that is widely applied (as its trees can be trained in parallel).
- Designed to address the overfitting issue of a single decision tree. The idea is that by training each tree on a different sample, although each tree might have high variance with respect to a particular set of the training data, the entire forest will have lower variance, and not at the cost of increased bias.
**Procedure**

Repeat the following until the desired number of trees has been built:

- Bootstrapping: create a bootstrapped dataset by randomly selecting samples from the original dataset, with replacement, until it reaches the same size as the original sample (the same sample may be picked more than once).
- Decision tree construction: build a decision tree on the bootstrapped dataset, but use only a random subset of variables/features at each step (i.e., at each decision node).

**Bagging** is defined as bootstrapping the data plus using aggregation to make a decision:

- Given a new instance, run it through all the decision trees (the entire random forest) and obtain the sum of votes for y = 1 and y = 0 (this step is called "aggregation").
- Decide by taking the result with the higher vote.

Then choose the most accurate random forest:

- Measure accuracy based on the out-of-bag samples (a form of cross-validation): compute the out-of-bag error as the number of samples that the forest classifies wrongly.
- The number of variables used per step affects accuracy and is optimized during this validation: usually start from the square root of the total number of variables and try a few settings above and below that value.

The low correlation between the individual trees is the key.

**WARNING**: RF is often considered a **bagging** model, but that is **not always true**; see this link.
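The tuning step above can be sketched with scikit-learn's `RandomForestClassifier`, which exposes the out-of-bag score directly via `oob_score=True`. The dataset and the candidate values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# Try a few values of max_features around sqrt(16) = 4 and keep the one
# with the lowest out-of-bag (OOB) error
best = None
for m in (2, 4, 8):
    rf = RandomForestClassifier(
        n_estimators=200, max_features=m, oob_score=True, random_state=0
    ).fit(X, y)
    oob_error = 1 - rf.oob_score_  # fraction of OOB samples classified wrongly
    print(f"max_features={m}: OOB error = {oob_error:.3f}")
    if best is None or oob_error < best[1]:
        best = (m, oob_error)
print("chosen max_features:", best[0])
```

Because each tree never sees its out-of-bag samples during training, the OOB error behaves like a built-in cross-validation estimate and no separate validation split is needed.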

#### 2. Pros & Cons

**Pros**

- The ability to handle large data sets with high dimensionality (as each tree selects far fewer features during its construction).
- The model outputs variable importances, which can be a very handy feature (`rf.feature_importances_` in scikit-learn).
- It balances errors in data sets where classes are imbalanced.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
- Using the out-of-bag error estimate for selecting the most accurate random forest removes the need for a set-aside test set.
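The variable-importance output is easy to inspect. A small sketch (using scikit-learn's built-in iris dataset as an illustrative example):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(iris.data, iris.target)

# Importances are impurity-based, non-negative, and sum to 1;
# a higher value means the feature contributed more to the splits
imp = pd.Series(rf.feature_importances_, index=iris.feature_names)
print(imp.sort_values(ascending=False))
```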

**Cons**

- It has very poor interpretability.
- It does not extrapolate well: it cannot predict values outside the bounds of the original training data.
- Random forest can feel like a black-box approach for statistical modelers; we have very little control over what the model does. At best, you can try different parameters and random seeds.
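The extrapolation weakness is easy to demonstrate on a toy regression problem (an illustrative sketch, not from the original post):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on x in [0, 10] with a perfectly linear target y = 2x
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Inside the training range the fit is good...
p_in = rf.predict([[5.0]])[0]
# ...but outside it, each tree can only return values stored in its leaves,
# so the prediction saturates near max(y_train) = 20 instead of reaching 200
p_out = rf.predict([[100.0]])[0]
print(p_in, p_out)
```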

#### 3. Simple Implementation

This is a template inspired by Kaggle notebooks; my thanks to the writers whose code I borrowed from.

Also note that an AWS S3 connection is made here to load the data; the random-forest training itself can be parallelized (e.g., via scikit-learn's `n_jobs` parameter).

```python
import pandas as pd
```
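Only the opening import of the original snippet survived, so here is a minimal, self-contained sketch of what such a template typically looks like. The dataset (scikit-learn's built-in breast-cancer data, standing in for the original S3-hosted data) and the parameter values are illustrative assumptions:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the original (S3-hosted) data
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    df, data.target, test_size=0.25, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # random feature subset at each split
    n_jobs=-1,            # build the trees in parallel
    random_state=42,
).fit(X_train, y_train)

acc = accuracy_score(y_test, rf.predict(X_test))
print("test accuracy:", acc)
```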
