# Feature Selection & Model Selection

### Overview

Running machine learning models has become much easier in recent years. The prevalence of tutorials and model packages makes it convenient to apply theoretically complex algorithms to a dataset and get results. To excel in data science, however, one cannot simply KNOW how to use models; one must also **appreciate** each model's significance and **select** proper models wisely. That's where feature selection and model selection come in. Both turn out to be challenging and extremely useful at the same time. In light of this, I want to note down some key aspects of these two topics that I learned through practice and tutorials.

### Feature Selection

- Benefits
  - It enables the machine learning algorithm to train faster.
  - It reduces the complexity of a model and makes it easier to interpret.
  - It improves the accuracy of a model if the right subset is chosen.
  - It reduces overfitting.

- Methods

Here we discuss some widely used methods for feature selection. The demo code relies on the following import and on the data prepared from it (note that `load_boston` was removed in scikit-learn 1.2, so this line needs an older release):

```python
from sklearn.datasets import load_boston
```
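The later snippets assume a feature matrix `X` (a pandas DataFrame) and a target `y`. As a minimal sketch of that setup, the block below uses `load_diabetes` as an assumed stand-in, chosen only because it ships with scikit-learn. It is not the dataset used in this post, so counts such as the 13 features and the scores quoted later will differ.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Assumption: load_diabetes replaces the housing data purely for illustration.
data = load_diabetes()

# Feature matrix as a DataFrame so that X.columns is available later on.
X = pd.DataFrame(data.data, columns=data.feature_names)
# Target variable.
y = pd.Series(data.target, name="target")

print(X.shape, y.shape)
```

Any other tabular regression dataset with named columns would serve the same purpose.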

#### Filter Methods

- No mining algorithm is involved.
- Features are assessed with criteria computed from the data itself, such as distance, information, dependency, and consistency.
- Variables are ranked by the chosen criterion, and the rank ordering is used for variable selection.
- Generally used as a data preprocessing step
- The appropriate filter method depends on the attributes of the variables involved.
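As one possible illustration of a dependency-based filter, the sketch below ranks features by their absolute Pearson correlation with the target and keeps those above a threshold. The synthetic data, the helper name `correlation_filter`, and the threshold of 0.3 are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical data: four features, only the first two drive the target.
X = rng.normal(size=(n_samples, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)

def correlation_filter(X, y, threshold):
    """Return the indices of features with |Pearson r| above threshold, and all scores."""
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    selected = np.flatnonzero(scores > threshold)
    return selected, scores

selected, scores = correlation_filter(X, y, threshold=0.3)
print("selected feature indices:", selected)
print("absolute correlations:", scores.round(2))
```

No model is trained here, which is why filters of this kind are cheap enough to run as a preprocessing step.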

#### Wrapper Methods

- Workflow: use a subset of features to train a model. Based on the inferences drawn from that model, decide whether to add features to or remove features from the subset.
- Computationally expensive
- Main types:

`Forward Selection`: an iterative method

- Start with no features in the model.
- In each iteration, add the feature that best improves the model, until adding a new variable no longer improves the performance of the model.

`Backward Elimination`: an iterative method

- Start with all the features and, at each iteration, remove the least significant feature, i.e. the one whose removal improves the performance of the model.
- Repeat this until no improvement is observed on removal of features.
- E.g. if a feature's p-value is above 0.05 we remove it, else we keep it.
```python
import pandas as pd
import statsmodels.api as sm

# Adding constant column of ones, mandatory for sm.OLS model
X_1 = sm.add_constant(X)
# Fitting sm.OLS model
model = sm.OLS(y, X_1).fit()
print(model.pvalues)

# Backward Elimination
cols = list(X.columns)
pmax = 1
while len(cols) > 0:
    X_1 = sm.add_constant(X[cols])
    model = sm.OLS(y, X_1).fit()
    # p-values of the features only (index 0 is the constant)
    p = pd.Series(model.pvalues.values[1:], index=cols)
    pmax = max(p)
    feature_with_p_max = p.idxmax()
    if pmax > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features_BE = cols
print(selected_features_BE)
```

`Recursive Feature Elimination`: a greedy optimization algorithm

- It repeatedly creates models and sets aside the best or the worst performing feature at each iteration.
- It constructs the next model with the remaining features until all the features are exhausted.
- It then ranks the features based on the order of their elimination.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

model = LinearRegression()
# Initializing RFE model
rfe = RFE(model, n_features_to_select=7)
# Transforming data using RFE
X_rfe = rfe.fit_transform(X, y)
# Fitting the data to model
model.fit(X_rfe, y)
print(rfe.support_)
print(rfe.ranking_)
```

```
[False False False  True  True  True False  True  True False  True False  True]
[2 4 3 1 1 1 7 1 1 5 1 6 1]
```

Here we took a LinearRegression model with 7 features and RFE gave the feature ranking above, but the choice of '7' was arbitrary. Now we need to find the optimum number of features, i.e. the one for which the test score (R² for this regression model) is the highest. We do that with a loop starting at 1 feature and going up to 13, and then take the number for which the score is highest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Candidate numbers of features: 1 to 13
nof_list = np.arange(1, 14)
high_score = 0
# Variable to store the optimum number of features
nof = 0
score_list = []
# The split is fixed by random_state, so it is computed once outside the loop
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
for n in nof_list:
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select=int(n))
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if score > high_score:
        high_score = score
        nof = n
print("Optimum number of features: %d" % nof)
print("Score with %d features: %f" % (nof, high_score))
```

As seen from the above code, the optimum number of features is 10. We now feed 10 as the number of features to RFE and get the final set of features given by the RFE method.

```python
import pandas as pd

cols = list(X.columns)
model = LinearRegression()
# Initializing RFE model
rfe = RFE(model, n_features_to_select=10)
# Transforming data using RFE
X_rfe = rfe.fit_transform(X, y)
# Fitting the data to model
model.fit(X_rfe, y)
temp = pd.Series(rfe.support_, index=cols)
selected_features_rfe = temp[temp].index
print(selected_features_rfe)
```

`Bidirectional Elimination`: a combination of *Forward Selection* & *Backward Elimination*
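Forward selection, described earlier in this list, has no demo code of its own, so here is a minimal sketch of the greedy loop. It assumes synthetic data, ordinary least squares as the model, hold-out R² as the improvement criterion, and a hypothetical stopping threshold `min_gain`. None of these choices come from the post itself.

```python
import numpy as np

def r2_holdout(X_tr, y_tr, X_va, y_va, cols):
    """Fit least squares on the chosen columns and return R^2 on the hold-out split."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    resid = y_va - A_va @ beta
    return 1.0 - (resid @ resid) / np.sum((y_va - y_va.mean()) ** 2)

def forward_selection(X_tr, y_tr, X_va, y_va, min_gain=0.005):
    """Greedily add the feature that most improves hold-out R^2; stop on small gains."""
    selected, best = [], -np.inf
    remaining = list(range(X_tr.shape[1]))
    while remaining:
        scores = {
            j: r2_holdout(X_tr, y_tr, X_va, y_va, selected + [j]) for j in remaining
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < min_gain:
            break  # no candidate improves the model enough
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

# Hypothetical data: six features, only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=300)

selected, best = forward_selection(X[:200], y[:200], X[200:], y[200:])
print("selected feature indices:", selected)
```

Backward elimination follows the same pattern in reverse, and a bidirectional variant would re-check already selected features for removal after each addition.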

#### Self-defined Methods

There are many interesting methods that can be directly applied in experiments. One method that caught my eye is the Boruta method:

- Boruta Method (using shadow features and random forest)
  - The main reason I like it is that it can be applied with Random Forest and XGBoost models.
  - It generally works well with well-structured data and relatively small datasets.
  - On the downside, it is still slower than some simpler selection criteria, and it does not handle **multicollinearity** directly.
  - Check out this python tutorial for more details.

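To make the shadow-feature idea concrete, here is a heavily simplified sketch rather than the reference Boruta algorithm, which uses a statistical test over many iterations and progressively drops rejected features. The synthetic data, the number of trials, and the 80% hit threshold are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples, n_features = 300, 5

# Hypothetical data: only the first two features influence the target.
X = rng.normal(size=(n_samples, n_features))
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)

n_trials = 10
hits = np.zeros(n_features, dtype=int)
for trial in range(n_trials):
    # Shadow features: every column shuffled independently, which destroys any
    # relationship with y while keeping each column's marginal distribution.
    X_shadow = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(n_features)]
    )
    forest = RandomForestRegressor(n_estimators=50, random_state=trial)
    forest.fit(np.hstack([X, X_shadow]), y)
    importances = forest.feature_importances_
    # A real feature scores a hit when it beats the best shadow feature.
    hits += importances[:n_features] > importances[n_features:].max()

# Simplified decision rule (assumption): keep features that win most trials.
confirmed = np.flatnonzero(hits >= 0.8 * n_trials)
print("hits per feature:", hits)
print("confirmed feature indices:", confirmed)
```

Repeating the comparison matters because a pure-noise feature can beat the shadows in a single run by chance.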
#### Embedded Methods

Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection.

Here in the demo code we will do feature selection using Lasso regularization. If a feature is irrelevant, lasso penalizes its coefficient and can shrink it to exactly 0. Hence the features with coefficient = 0 are removed and the rest are kept.

```python
from sklearn.linear_model import LassoCV

reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" % reg.score(X, y))
```

```
Best alpha using built-in LassoCV: 0.724820
Best score using built-in LassoCV: 0.702444
```

```python
coef = pd.Series(reg.coef_, index=X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
```

```
Lasso picked 10 variables and eliminated the other 3 variables
```

```python
import matplotlib
import matplotlib.pyplot as plt

# Sort coefficients so the bar chart reads from most negative to most positive
imp_coef = coef.sort_values()
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Feature importance using Lasso Model")
```

#### Filter vs Wrapper

Now let us compare filter methods and wrapper methods, the two most commonly used approaches to feature selection.

| Characteristics | Filter Methods | Wrapper Methods |
| --- | --- | --- |
| Measure of feature relevance | Correlation with the dependent variable | Actually training a model on a subset of features |
| Speed | Much faster | Slower due to model training |
| Performance evaluation | Statistical methods | Cross-validation of model results |
| Quality of feature set selected | May be suboptimal | Usually optimal or near-optimal for the chosen model, though greedy searches give no strict guarantee |
| Overfitting? | Less likely | Much more prone to it |

### Model Selection

Here we must clarify one important conceptual misunderstanding:

**Note**: Classical model selection mainly focuses on evaluating metrics across different models, tuning model parameters, and varying the training datasets. The final choice of model is often *manual*. Hence, it differs from automated model selection, where the final choice of model is also made automatically. The latter is often known as AutoML, and has quickly gained wide popularity in recent years.

We now consider the main strategies to improve model performance:

- Use a more complicated/more flexible model
- Use a less complicated/less flexible model
- Tune hyperparameters
- Gather more training samples
- Gather more data to add features to each sample

Clearly, the first 4 are model selection strategies, and the last one is feature selection.

When we make these adjustments, we must keep in mind the `bias-variance trade-off`:

- `bias`: usually the case where the model `underfits`, i.e. it does not have enough flexibility to suitably account for all the features in the data
  - For high-bias models, the performance of the model on the validation set is similar to the performance on the training set.
- `variance`: usually the case where the model `overfits`, i.e. it has so much flexibility that it ends up accounting for random errors as well as the underlying data distribution
  - For high-variance models, the performance of the model on the validation set is far worse than the performance on the training set.
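The two symptoms above can be reproduced with a small experiment. The sketch below fits polynomials of increasing degree to hypothetical noisy samples of `sin(3x)` and compares training and validation R². The data-generating function, the sample sizes, and the degrees are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical noisy samples of y = sin(3x) on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

x_train, y_train = make_data(30)
x_val, y_val = make_data(200)

scores = {}
for degree in (1, 4, 15):
    # Least-squares polynomial fit; a higher degree means a more flexible model.
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    scores[degree] = (r2(y_train, poly(x_train)), r2(y_val, poly(x_val)))
    print(
        f"degree={degree:2d}  train R^2={scores[degree][0]:.3f}  "
        f"validation R^2={scores[degree][1]:.3f}"
    )
```

The low-degree fit is the high-bias case, while the degree-15 fit, trained on only 30 points, is the high-variance case.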

We can easily visualize this via the **learning curve**, which plots training and validation scores against the number of training samples.
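A learning curve can be computed with scikit-learn's `learning_curve` helper. The following is a minimal sketch on synthetic data. The estimator, the data, and the five training sizes are assumptions for illustration, and the plotting step is left out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Hypothetical regression data: 5 features, 3 of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=1.0, size=400)

# Scores are computed for growing fractions of each training fold.
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)

# Rows correspond to training sizes, columns to cross-validation folds.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train R^2={tr:.3f}  validation R^2={va:.3f}")
```

A persistent gap between the two curves hints at high variance, while two curves that converge at a low score hint at high bias.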

Meanwhile, the **validation curve**, which plots training and validation scores against a hyperparameter or model-complexity setting, shows that these choices affect model performance as well.
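Similarly, scikit-learn's `validation_curve` helper evaluates one hyperparameter over a range of values. The sketch below varies the regularization strength `alpha` of a `Ridge` model on synthetic data. The model, data, and parameter grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Hypothetical data: many features relative to samples, only three informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(scale=1.0, size=80)

# Larger alpha means stronger regularization, i.e. a less flexible model.
alphas = np.logspace(-3, 3, 7)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=alphas, cv=5
)

mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
best_alpha = alphas[mean_val.argmax()]
print("mean train R^2:", mean_train.round(3))
print("mean validation R^2:", mean_val.round(3))
print("alpha with the best validation score:", best_alpha)
```

The training score can only fall as `alpha` grows, so the setting worth picking is the one where the validation score peaks.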

For more details on metrics evaluation and hyperparameter tuning with feedback from validation sets, interested readers can read my blog posts on these topics as well.

https://criss-wang.github.io/post/blogs/mlops/feature-and-model-selections/