Feature Selection & Model Selections
Overview
Running machine learning models has become much easier in recent years. The abundance of tutorials and model packages makes it convenient for anyone to apply theoretically complex algorithms to their datasets. To excel in data science, however, one cannot simply know how to run models; one must also appreciate each model's significance and choose models wisely. That is where feature selection and model selection come in. Both turn out to be challenging and extremely useful at the same time. In light of this, I want to write down some key aspects of these two topics that I learned through practice and tutorials.
Feature Selection
- Benefits
- It enables the machine learning algorithm to train faster.
- It reduces the complexity of a model and makes it easier to interpret.
- It improves the accuracy of a model if the right subset is chosen.
- It reduces overfitting.
- Methods
Here we discuss some widely used methods for feature selection. To set up the demo code, we load the following packages and prepare the data:
```python
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, Ridge, Lasso
from sklearn.feature_selection import RFE
%matplotlib inline

# Loading the dataset
x = load_boston()
df = pd.DataFrame(x.data, columns=x.feature_names)
df["MEDV"] = x.target
X = df.drop("MEDV", axis=1)  # Feature matrix
y = df["MEDV"]               # Target variable
```
Filter Methods
- No mining algorithm is involved
- Uses assessment criteria such as distance, information, dependency, and consistency
- Relies on ranking techniques: variables are scored and selected by rank ordering
- Generally used as a data preprocessing step
- The main filter methods are chosen based on the variable types (e.g. Pearson's correlation, ANOVA, chi-square); a minimal example follows below
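As an illustration, here is a minimal sketch of a correlation-based filter on the Boston data loaded above; the 0.5 cutoff is an arbitrary threshold chosen for demonstration:

```python
# Absolute Pearson correlation of each feature with the target
cor = df.corr()
cor_target = abs(cor["MEDV"]).drop("MEDV")

# Keep only features whose correlation with MEDV exceeds the (arbitrary) cutoff
relevant_features = cor_target[cor_target > 0.5]
print(relevant_features)
```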
Wrapper Methods
- Workflow: use a subset of features to train a model; based on the inferences drawn from that model, decide whether to add or remove features from the subset
- Computationally expensive
- 3 Types:
- Forward Selection: an iterative method (a code sketch appears at the end of this list)
    - Start with no features in the model.
    - In each iteration, add the feature that best improves the model, until adding a new variable no longer improves performance.
- Backward Elimination: an iterative method
    - Start with all features and, at each iteration, remove the least significant feature, as long as doing so improves the performance of the model.
    - Repeat until no improvement is observed on removing a feature.
    - E.g. if a feature's p-value is above 0.05 we remove it, otherwise we keep it.
```python
# Adding a constant column of ones, mandatory for the sm.OLS model
X_1 = sm.add_constant(X)
# Fitting the sm.OLS model
model = sm.OLS(y, X_1).fit()
display(model.pvalues)

# Backward Elimination
cols = list(X.columns)
pmax = 1
while len(cols) > 0:
    X_1 = sm.add_constant(X[cols])
    model = sm.OLS(y, X_1).fit()
    p = pd.Series(model.pvalues.values[1:], index=cols)
    pmax = max(p)
    feature_with_p_max = p.idxmax()
    if pmax > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features_BE = cols
print(selected_features_BE)
```
- Recursive Feature Elimination (RFE): a greedy optimization algorithm
    - It repeatedly builds models and sets aside the best or worst performing feature at each iteration.
    - It constructs the next model with the remaining features until all features are exhausted.
    - It then ranks the features based on the order of their elimination.
```python
model = LinearRegression()
# Initializing the RFE model with 7 features
rfe = RFE(model, n_features_to_select=7)
# Transforming data using RFE
X_rfe = rfe.fit_transform(X, y)
# Fitting the data to the model
model.fit(X_rfe, y)
print(rfe.support_)
print(rfe.ranking_)
```
```
[False False False  True  True  True False  True  True False  True False  True]
[2 4 3 1 1 1 7 1 1 5 1 6 1]
```

Here we took a LinearRegression model with 7 features, and RFE gave the feature ranking above; however, the choice of the number 7 was arbitrary. We now need to find the optimum number of features, for which the accuracy is the highest. We do that by looping from 1 feature up to 13 and keeping the number of features that gives the highest score.
```python
# Candidate numbers of features (1 to 13)
nof_list = np.arange(1, 14)
high_score = 0
# Variable to store the optimum number of features
nof = 0
score_list = []
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select=nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if score > high_score:
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" % nof)
print("Score with %d features: %f" % (nof, high_score))
```

As seen from the above code, the optimum number of features is 10. We now feed 10 as the number of features to RFE and get the final set of features given by the RFE method:
```python
cols = list(X.columns)
model = LinearRegression()
# Initializing the RFE model with the optimum number of features
rfe = RFE(model, n_features_to_select=10)
# Transforming data using RFE
X_rfe = rfe.fit_transform(X, y)
# Fitting the data to the model
model.fit(X_rfe, y)
temp = pd.Series(rfe.support_, index=cols)
selected_features_rfe = temp[temp == True].index
print(selected_features_rfe)
```
- (*) Bidirectional Elimination: a combination of Forward Selection and Backward Elimination, which alternately adds the most significant feature and removes any feature that has become insignificant.
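Since the post only shows code for backward elimination and RFE, here is a minimal sketch of forward selection in the same p-value style, assuming the same `X`, `y`, and `statsmodels` import as above; the 0.05 threshold is an arbitrary choice:

```python
# Forward selection by OLS p-values (illustrative sketch, not the post's original code)
remaining = list(X.columns)
selected = []
while remaining:
    pvals = pd.Series(index=remaining, dtype=float)
    for candidate in remaining:
        X_1 = sm.add_constant(X[selected + [candidate]])
        model = sm.OLS(y, X_1).fit()
        pvals[candidate] = model.pvalues[candidate]
    best = pvals.idxmin()
    # Add the most significant candidate only if it clears the (arbitrary) threshold
    if pvals[best] < 0.05:
        selected.append(best)
        remaining.remove(best)
    else:
        break
print(selected)
```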
Self-defined Methods
There are many other interesting methods that can be applied directly in experiments. One method that caught my eye is the Boruta method, which uses shadow features and a random forest:
- The main reason I like it is its applicability to Random Forest and XGBoost models.
- It generally works well with well-structured data and relatively small datasets.
- On the other hand, it is still relatively slow compared to simpler selection criteria, and it does not handle multicollinearity out of the box.
- Check out this Python tutorial for more details.
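As a rough sketch of how Boruta could be applied to the same Boston data, assuming the third-party `boruta` package (BorutaPy) is installed; the estimator settings are arbitrary illustrative choices:

```python
from sklearn.ensemble import RandomForestRegressor
from boruta import BorutaPy  # third-party package: pip install Boruta

# Boruta compares each real feature against randomly shuffled "shadow" copies
rf = RandomForestRegressor(n_jobs=-1, max_depth=5)
boruta_selector = BorutaPy(rf, n_estimators='auto', random_state=42)
boruta_selector.fit(X.values, y.values)  # BorutaPy expects numpy arrays

selected_features_boruta = X.columns[boruta_selector.support_]
print(selected_features_boruta)
```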
Embedded Methods
Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection step.
In the demo code below we do feature selection using Lasso regularization. If a feature is irrelevant, Lasso penalizes its coefficient and shrinks it to 0. Hence the features with coefficient = 0 are removed and the rest are kept.
```python
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" % reg.score(X, y))

coef = pd.Series(reg.coef_, index=X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +
      str(sum(coef == 0)) + " variables")

# Plot the coefficients as a measure of feature importance
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Feature importance using Lasso Model")
```

Output:

```
Best alpha using built-in LassoCV: 0.724820
Best score using built-in LassoCV: 0.702444
Lasso picked 10 variables and eliminated the other 3 variables
```
Filter vs Wrapper
Now let us compare filter methods and wrapper methods, the two most commonly used approaches to feature selection.

| Characteristic | Filter Methods | Wrapper Methods |
| --- | --- | --- |
| Measure of feature relevance | Correlation with the dependent variable | Actually training a model on a subset of features |
| Speed | Much faster | Slower due to model training |
| Performance evaluation | Statistical methods | Model results / cross-validation |
| Quality of selected feature set | May be suboptimal | Usually optimal or near-optimal |
| Prone to overfitting? | Less likely | Much more prone |
Model Selection
Here we must clear up one important conceptual misunderstanding:
Note: classical model selection mainly focuses on evaluating metrics across different models, tuning model hyperparameters, and varying the training datasets. The final choice of model is often made manually. Hence, it differs from the automated model selection procedure in which the final choice of model is also made automatically. The latter is known as AutoML, and has quickly gained wide popularity in recent years.
We now consider the main strategies for improving model performance:
- Use a more complicated/more flexible model
- Use a less complicated/less flexible model
- Tuning hyperparameters
- Gather more training samples
- Gather more data to add features to each sample
Clearly, the first 4 are model selection strategies, and the last one is feature selection.
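To make the hyperparameter tuning strategy concrete, here is a small sketch using scikit-learn's GridSearchCV on a Ridge model with the Boston data from earlier; the alpha grid and scoring choice are arbitrary illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV

# Search over regularization strengths with 5-fold cross-validation
param_grid = {"alpha": np.logspace(-3, 3, 13)}
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```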
When we make these adjustments, we must keep in mind the bias-variance trade-off:
- High bias: usually the case where the model underfits, i.e. it does not have enough flexibility to suitably account for all the features in the data.
- High variance: usually the case where the model overfits, i.e. it has so much flexibility that it ends up accounting for random errors as well as the underlying data distribution.
- For high-bias models, the performance of the model on the validation set is similar to the performance on the training set.
- For high-variance models, the performance of the model on the validation set is far worse than the performance on the training set.
We can easily visualize this via a learning curve, which plots training and validation scores against the training set size.
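Since the original learning-curve figure is not reproduced here, the sketch below shows one way such a curve could be generated with scikit-learn's `learning_curve` helper; the CV folds and train-size grid are arbitrary illustrative choices:

```python
from sklearn.model_selection import learning_curve

# Training vs. validation scores as a function of training set size
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size")
plt.ylabel("R^2 score")
plt.legend()
plt.title("Learning curve")
```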
Meanwhile, a validation curve, which plots training and validation scores against a model complexity hyperparameter, shows that model complexity and hyperparameter choices affect model performance as well.
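As another hedged sketch (the original validation-curve figure is not shown here), a validation curve over Ridge's regularization strength could be computed with scikit-learn's `validation_curve`; the alpha range is an arbitrary choice:

```python
from sklearn.model_selection import validation_curve

# Training vs. validation scores as a function of Ridge's alpha
alphas = np.logspace(-3, 3, 13)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=alphas, cv=5)

plt.semilogx(alphas, train_scores.mean(axis=1), label="training score")
plt.semilogx(alphas, val_scores.mean(axis=1), label="validation score")
plt.xlabel("alpha")
plt.ylabel("R^2 score")
plt.legend()
plt.title("Validation curve")
```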
For more details on metrics evaluation and hyperparameter tuning with feedback from validation sets, interested readers can read my blogs on these topics as well.