Blogs · Boosting · Supervised Learning

Ensemble Models: Boosting Techniques

You know XGBoost, but do you KNOW XGBoost?

2021-02-04 · 11 min read · by Zhenlin Wang · updated 2021-10-11

Overview

We will discuss three major boosting models: AdaBoost, Gradient Boosting, and XGBoost, and close with a look at XGBoost's main competitor, LightGBM.

AdaBoost

1. Definition
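AdaBoost (Adaptive Boosting) fits a sequence of weak learners (by default, one-split decision trees called stumps) on repeatedly reweighted versions of the training data: after each round the misclassified samples receive larger weights, so the next learner concentrates on the hard cases, and the final prediction is a weighted vote of all learners. With labels $y_i \in \{-1, +1\}$ and normalized sample weights $w_i$, round $t$ computes

$$\epsilon_t = \sum_i w_i\,\mathbb{1}[h_t(x_i) \neq y_i], \qquad \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \qquad w_i \leftarrow w_i\,e^{-\alpha_t y_i h_t(x_i)},$$

then renormalizes the weights; the final classifier is $H(x) = \operatorname{sign}\left(\sum_t \alpha_t h_t(x)\right)$.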

2. Pros and Cons

Pros

Cons

3. Comparison with Random Forest
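Random forest trains many deep trees independently on bootstrap samples and averages them (bagging, which mainly reduces variance); AdaBoost trains shallow stumps sequentially, each focusing on the previous rounds' mistakes, and combines them with unequal weights (boosting, which mainly reduces bias). Random forest therefore parallelizes trivially and tolerates noisy labels and outliers better, while AdaBoost can wring more accuracy out of very weak learners but is more sensitive to noise.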

4. Sample Code

import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler


train = pd.read_pickle("train.pkl")

X = train.drop(['Survived'], axis=1)
y = train['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Feature Scaling
## We use StandardScaler to give each feature zero mean and unit variance

st_scale = StandardScaler()

## fit the scaler on the training split only, then apply it to both splits
X_train = st_scale.fit_transform(X_train)
X_test = st_scale.transform(X_test)

## base_estimator is left at its default: a depth-1 decision tree (a "stump")
adaBoost = AdaBoostClassifier(learning_rate=1.0,
                              n_estimators=100)

adaBoost.fit(X_train, y_train)

y_pred = adaBoost.predict(X_test)


accuracy_score(y_test, y_pred)



n_estimators = [100, 140, 145, 150, 160, 170, 175, 180, 185]
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.30, random_state=15)
learning_r = [0.1, 1, 0.01, 0.5]

parameters = {'n_estimators': n_estimators,
              'learning_rate': learning_r}

grid = GridSearchCV(AdaBoostClassifier(),  ## the default base estimator is a decision stump
                    param_grid=parameters,
                    cv=cv,
                    n_jobs=-1)
grid.fit(X, y)  ## tree-based learners don't need feature scaling, so fitting on the raw X is fine

print (grid.best_score_)
print (grid.best_params_)
print (grid.best_estimator_)


adaBoost_grid = grid.best_estimator_
## note: this scores on data the grid search already saw, so it is an optimistic estimate
adaBoost_grid.score(X, y)
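A handy sanity check on top of the code above: scikit-learn's staged_predict replays the ensemble's predictions after each boosting round, so we can watch test accuracy evolve as stumps are added. A minimal sketch, reusing the fitted adaBoost model and the scaled X_test/y_test from earlier:

import matplotlib.pyplot as plt

# test accuracy after each of the 100 boosting rounds
staged_acc = [accuracy_score(y_test, pred)
              for pred in adaBoost.staged_predict(X_test)]

plt.plot(range(1, len(staged_acc) + 1), staged_acc)
plt.xlabel("boosting round")
plt.ylabel("test accuracy")
plt.show()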

GBM (Gradient Boosting)

1. Pros & Cons

Pros

Cons

2. AdaBoost vs GBM
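Both build an additive model stage by stage, but they steer each stage differently: AdaBoost reweights the training samples (equivalent to minimizing an exponential loss), while gradient boosting fits each new learner to the negative gradient of any differentiable loss (the pseudo-residuals). AdaBoost can thus be seen as a special case of gradient boosting with exponential loss, and GBM generalizes the idea to regression, ranking, and robust losses.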

3. Random Forest vs GBM
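This is again bagging versus boosting: random forest averages independent, fully grown, decorrelated trees with equal weight, whereas GBM sums shallow trees built sequentially on the residuals, damped by a learning rate. A well-tuned GBM often wins on accuracy, but it trains sequentially, overfits more readily as trees are added, and is more sensitive to hyperparameters than a random forest.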

4. Application

5. Sample Code implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc


train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train.set_index("PassengerId", inplace=True)
test.set_index("PassengerId", inplace=True)

# generate training target set (y_train)
y_train = train["Survived"]
# delete column "Survived" from train set
train.drop(labels="Survived", axis=1, inplace=True)

# stack train and test so both get identical preprocessing
train_test = pd.concat([train, test])  # DataFrame.append was removed in pandas 2.0
# delete columns that are not used as features for training and prediction
columns_to_drop = ["Name", "Age", "SibSp", "Ticket", "Cabin", "Parch", "Embarked"]
train_test.drop(labels=columns_to_drop, axis=1, inplace=True)
# convert objects to numbers by pandas.get_dummies
train_test_dummies = pd.get_dummies(train_test, columns=["Sex"])

train_test_dummies.fillna(value=0.0, inplace=True)
# generate feature sets (X)
# the first 891 rows are the original training set, the rest is the test set
X_train = train_test_dummies.values[:891]
X_test = train_test_dummies.values[891:]

scaler = MinMaxScaler()
X_train_scale = scaler.fit_transform(X_train)
X_test_scale = scaler.transform(X_test)

X_train_sub, X_validation_sub, y_train_sub, y_validation_sub = train_test_split(X_train_scale, y_train, random_state=0)

learning_rates = [0.05, 0.1, 0.25, 0.5, 0.75, 1]
for learning_rate in learning_rates:
    gb = GradientBoostingClassifier(n_estimators=20, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    gb.fit(X_train_sub, y_train_sub)
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb.score(X_train_sub, y_train_sub)))
    print("Accuracy score (validation): {0:.3f}".format(gb.score(X_validation_sub, y_validation_sub)))
    print()
    
gb = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5, max_features=2, max_depth=2, random_state=0)
gb.fit(X_train_sub, y_train_sub)
predictions = gb.predict(X_validation_sub)

print("Confusion Matrix:")
print(confusion_matrix(y_validation_sub, predictions))
print()
print("Classification Report")
print(classification_report(y_validation_sub, predictions))

y_scores_gb = gb.decision_function(X_validation_sub)
fpr_gb, tpr_gb, _ = roc_curve(y_validation_sub, y_scores_gb)
roc_auc_gb = auc(fpr_gb, tpr_gb)
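The ROC quantities above are computed but never shown; here is a minimal plot, reusing fpr_gb, tpr_gb, and roc_auc_gb from above along with the matplotlib import at the top of the snippet:

plt.plot(fpr_gb, tpr_gb, label="GBM (AUC = {:.3f})".format(roc_auc_gb))
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()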

XGBoost

1. Advantage of XGBoost
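XGBoost is a heavily engineered implementation of gradient boosting. Its main additions over a plain GBM are explicit L1/L2 regularization on the leaf weights, a second-order (Newton-style) approximation of the loss when evaluating splits, sparsity-aware split finding that routes missing values down a learned default direction, parallelized and cache-aware tree construction, and built-in cross-validation and early stopping.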

LightGBM

1. Advantages of LightGBM
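LightGBM is built for speed and memory efficiency: it bins continuous features into histograms (controlled by max_bin), grows trees leaf-wise (best-first) instead of level-wise, and adds Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to cut down the rows and features examined per split. On large datasets it is typically several times faster than XGBoost at comparable accuracy, and it handles categorical features natively.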

2. Code sample: XGBoost vs LightGBM

#importing standard libraries 
import numpy as np 
import pandas as pd 
from pandas import Series, DataFrame 

#import lightgbm and xgboost 
import lightgbm as lgb 
import xgboost as xgb 

#loading our training dataset 'adult.csv' with name 'data' using pandas 
data=pd.read_csv('adult.csv',header=None) 

#Assigning names to the columns 
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income'] 

#glimpse of the dataset 
data.head() 

# Label Encoding our target variable 
from sklearn.preprocessing import LabelEncoder
l=LabelEncoder() 
l.fit(data.Income) 

l.classes_ 
data.Income=Series(l.transform(data.Income))  #label encoding our target variable 
data.Income.value_counts() 

 

#One Hot Encoding of the Categorical features 
one_hot_workclass=pd.get_dummies(data.workclass) 
one_hot_education=pd.get_dummies(data.education) 
one_hot_marital_Status=pd.get_dummies(data.marital_Status) 
one_hot_occupation=pd.get_dummies(data.occupation)
one_hot_relationship=pd.get_dummies(data.relationship) 
one_hot_race=pd.get_dummies(data.race) 
one_hot_sex=pd.get_dummies(data.sex) 
one_hot_native_country=pd.get_dummies(data.native_country) 

#removing categorical features 
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True) 

 

#Merging one hot encoded features with our dataset 'data' 
data=pd.concat([data,one_hot_workclass,one_hot_education,one_hot_marital_Status,one_hot_occupation,one_hot_relationship,one_hot_race,one_hot_sex,one_hot_native_country],axis=1) 

#removing duplicate columns (several features share category values, so the dummies repeat column names) 
_,i = np.unique(data.columns, return_index=True) 
data=data.iloc[:, i] 

#Here our target variable is 'Income' with values as 1 or 0.  
#Separating our data into features dataset x and our target dataset y 
x=data.drop('Income',axis=1) 
y=data.Income 

 

#Imputing missing values in our target variable (a no-op for this dataset, kept as a safeguard) 
y.fillna(y.mode()[0],inplace=True) 

#Now splitting our dataset into test and train 
from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)

#The data is stored in a DMatrix object 
#label is used to define our outcome variable
dtrain=xgb.DMatrix(x_train,label=y_train)
dtest=xgb.DMatrix(x_test)

#setting parameters for xgboost
#('eta' is just an alias of 'learning_rate', so we set only one; the deprecated 'silent' flag is dropped)
parameters={'max_depth':7, 'objective':'binary:logistic', 'eval_metric':'auc', 'learning_rate':.05}

#training our model 
num_round=50
from datetime import datetime 
start = datetime.now() 
xg=xgb.train(parameters,dtrain,num_round) 
stop = datetime.now()

#Execution time of the model 
execution_time_xgb = stop-start 
print(f'execution_time_xgb: {execution_time_xgb}')
#a datetime.timedelta is displayed as (days, seconds, microseconds) 

#now predicting our model on test set 
ypred=xg.predict(dtest) 
display(ypred)

#Converting probabilities into 1 or 0 with a 0.5 threshold 
ypred = np.where(ypred >= .5, 1, 0)
        
#calculating accuracy of our model 
from sklearn.metrics import accuracy_score 
accuracy_xgb = accuracy_score(y_test,ypred) 
print(f'accuracy_xgb: {accuracy_xgb}')

train_data=lgb.Dataset(x_train,label=y_train)

#setting parameters for lightgbm
#(note: with max_depth=7 at most 2**7 = 128 leaves are reachable, so num_leaves=150 is effectively capped)
param = {'num_leaves':150, 'objective':'binary', 'max_depth':7, 'learning_rate':.05, 'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']

#Here we have set max_depth in xgb and LightGBM to 7 to have a fair comparison between the two.

#training our model using light gbm
num_round=50
start=datetime.now()
lgbm=lgb.train(param,train_data,num_round)
stop=datetime.now()

#Execution time of the model
execution_time_lgbm = stop-start
print(f'execution_time_lgbm: {execution_time_lgbm}')

#predicting on test set
ypred2=lgbm.predict(x_test)
display(ypred2[0:5])  # showing first 5 predictions

#converting probabilities into 0 or 1, same 0.5 threshold as for xgboost
ypred2 = np.where(ypred2 >= .5, 1, 0)
        
#calculating accuracy
accuracy_lgbm = accuracy_score(y_test, ypred2)
print(f'accuracy_lgbm: {accuracy_lgbm}')
display(y_test.value_counts())

from sklearn.metrics import roc_auc_score
#calculating roc_auc_score for xgboost
#(note: we score the thresholded 0/1 predictions to match the accuracy comparison; scoring the raw probabilities would give a finer-grained AUC)
auc_xgb = roc_auc_score(y_test,ypred)
print(f'auc_xgb: {auc_xgb}')

#calculating roc_auc_score for light gbm. 
auc_lgbm = roc_auc_score(y_test,ypred2)
print(f'auc_lgbm: {auc_lgbm}')
comparison_dict = {'accuracy score':(accuracy_lgbm, accuracy_xgb),'auc score':(auc_lgbm,auc_xgb),'execution time':(execution_time_lgbm, execution_time_xgb)}

#Creating a dataframe ‘comparison_df’ for comparing the performance of Lightgbm and xgb. 
comparison_df = DataFrame(comparison_dict) 
comparison_df.index= ['LightGBM','xgboost'] 
display(comparison_df)

3. General Pros and cons of boosting

Pros

Cons

Conclusion

Here we end the discussion of ensemble models. It was a fun and challenging topic. While most users of these models won’t need to understand every last detail, the underlying theory laid significant foundations for later research on supervised ensemble learning (and even meta-learning). Next month I’ll share some posts about unsupervised learning. That is an even larger topic, and I expect the content to go even deeper. Good luck, me and everyone!