
Some Supervised Learning Models

So, give me a taste of supervised learning

2019.06.01 · 12 min read · by Zhenlin Wang · updated 2021-08-19

Overview

Almost everyone who has studied data science or machine learning knows what supervised learning is, but far fewer have dug into the details of the well-known models. In this blog, I will share some critical (mainly mathematical) aspects of these models that are helpful in both research and practical work. One note on functionality: all of these models work for both regression and classification problems.

KNN

1. Definition

2. Choice of K

3. Strengths and Weaknesses

4. Suitable scenarios

5. Interview Questions

  1. Describe KNN in one line
    • Answer: KNN works by computing the distances between a query point and all examples in the data, selecting the K examples closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).
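
To make the one-liner concrete, here is a minimal from-scratch sketch of that exact procedure (the function knn_predict and the toy data are my own, for illustration; the library version follows in the next section):

import numpy as np

def knn_predict(query, X, y, k=3, classification=True):
    # Distances between the query and every training example
    dists = np.linalg.norm(X - query, axis=1)
    # Indices of the k closest examples
    nearest = np.argsort(dists)[:k]
    if classification:
        # Vote for the most frequent label among the neighbors
        labels, counts = np.unique(y[nearest], return_counts=True)
        return labels[np.argmax(counts)]
    # Average the neighbors' labels for regression
    return y[nearest].mean()

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
y = np.array([0, 0, 1, 1])
print(knn_predict(np.array([1.2, 1.4]), X, y, k=3))  # -> 0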

6. Simple implementations

## For Regression
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

## Randomly generate some data
data = pd.DataFrame(np.random.randint(low=2, high=100, size=(1000, 4)),
                    columns=["Target", "A", "B", "C"])
data.head()

train_x, test_x, train_y, test_y = train_test_split(data.iloc[:, 1:], data.Target, test_size=0.2)
print(train_x.shape, test_x.shape)

# Scale features to [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))

scaler.fit(train_x)
scaled_train_x = pd.DataFrame(scaler.transform(train_x), columns=["A", "B", "C"])
scaled_test_x = pd.DataFrame(scaler.transform(test_x), columns=["A", "B", "C"])

### Basic performance testing
knn_regressor = KNeighborsRegressor(n_neighbors=3, algorithm="brute", weights="distance")
knn_regressor.fit(scaled_train_x, train_y)

train_pred = knn_regressor.predict(scaled_train_x)
test_pred = knn_regressor.predict(scaled_test_x)

# Note: with weights="distance", the training error is ~0 because each
# training point is its own zero-distance nearest neighbor
print(mean_squared_error(train_y, train_pred))
print(mean_squared_error(test_y, test_pred))

### Grid search to determine K
knn_regressor = KNeighborsRegressor(algorithm="brute", weights="distance")
params = {"n_neighbors": [1, 3, 5], "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(knn_regressor, param_grid=params, scoring="neg_mean_squared_error", cv=5)

grid.fit(scaled_train_x, train_y)
print(grid.best_params_)
print(grid.best_score_)  # negative MSE: sklearn maximizes scores, so closer to 0 is better

best_knn = grid.best_estimator_
train_pred = best_knn.predict(scaled_train_x)
test_pred = best_knn.predict(scaled_test_x)
print(mean_squared_error(test_y, test_pred))
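
One design note on the snippet above: KNN is distance-based, so feature scaling matters, which is why the features go through MinMaxScaler before fitting. The scaler is fit on the training set only and then applied to the test set, since fitting it on test data would leak information into the evaluation.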

SVM

1. Definition

2. Pros & Cons

Pros

Cons

3. Application

4. Simple Implementation

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test are assumed to come from a train/test split
svc = SVC(kernel='linear')  # Choices include 'rbf', 'poly', 'sigmoid'
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print('Accuracy Score:')
print(accuracy_score(y_test, y_pred))
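
In practice the kernel and the regularization parameter C are tuned rather than fixed. Here is a minimal sketch reusing the GridSearchCV pattern from the KNN section; the parameter grid is just an illustrative choice, not a recommendation:

from sklearn.model_selection import GridSearchCV

params = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid=params, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)  # mean cross-validated accuracy of the best parameters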

Decision Tree

1. Definition

2. Types of DT

  1. CHAID (Chi-squared Automatic Interaction Detection)
    • multiway DT
    • chooses the independent variable that has the strongest interaction with the dependent variable.
    • The selection criteria:
      1. For regression: F-test
      2. For classification: chi-square test
    • Has no pruning function
  2. CART (Classification And Regression Tree)
    • binary DT
    • handles data in its raw form (no preprocessing needed)
    • can use the same variables more than once in different parts of the same DT, which may uncover complex interdependencies between sets of variables.
    • The selection metric:
      1. For Classification: Gini Impurity Index
        • $1 - \sum_{i = 1}^{C}(P_i)^2$ where $P_i$ is the proportion of data with label $i$ in the split and $C$ is the number of classes
        • A lower value indicates a better split
      2. For Regression: Least Square Deviation (LSD)
        • the sum of the squared distances (or deviations) between the observed values and the predicted values
        • Often referred to as the 'squared residual'; a lower LSD means a better split
    • doesn’t use an internal performance measure for Tree selection/testing
  3. Iterative Dichotomiser 3 (ID3)
    • classification DT
    • Entropy:
      • Single Attribute: $E(S) = \sum_{i = 1}^{c} -p_i\log_2 p_i$
      • Multiple Attribute: $E(T,X) = \sum_{c\in X}P(c)E(c)$ where $T$ → Current state and $X$ → Selected attribute
      • The higher the entropy, the harder it is to draw any conclusions from that information.
    • Follows the rule: a branch with an entropy of zero is a leaf node, and a branch with entropy greater than zero needs further splitting
    • The selection metric:
      • Information Gain: $Gain(before,after) = Entropy(before) - \sum_{j = 1}^{K}P(j)\,Entropy(j,after)$ where $K$ is the number of splits, $j$ is a particular split, and $P(j)$ is the fraction of examples in split $j$
      • The higher the gain, the better the split (see the worked sketch after this list)
    • Limitation: it can handle neither numeric attributes nor missing values
  4. C4.5
    • The successor of ID3 and represents an improvement in several aspects
      • can handle both continuous and categorical data (regression + classification)
      • can deal with missing values by ignoring instances that contain missing data
    • The selection metric:
      • Gain ratio: a modification of Information Gain that reduces its bias toward attributes with many values, and is usually the best option
      • $\text{Gain ratio}(before, after) = \frac{Gain(before,after)}{-\sum_{j = 1}^{K}P(j)\log_2 P(j)}$
    • Windowing: the algorithm randomly selects a subset of the training data (called a “window”) and builds a DT from that selection.
      • This DT is then used to classify the remaining training data; if it performs a correct classification, the DT is finished. Otherwise, all the misclassified data points are added to the window, and the cycle repeats until every instance in the training set is correctly classified by the current DT.
      • It captures all the “rare” instances together with sufficient “ordinary” cases.
    • Can be pruned: pruning method is based on estimating the error rate of every internal node, and replacing it with a leaf node if the estimated error of the leaf is lower.
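
To make these splitting metrics concrete, here is a minimal sketch (the helpers gini, entropy, and information_gain are my own names, not a library API) that computes Gini impurity, entropy, and weighted information gain for a toy binary split:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy: -sum(p_i * log2(p_i)) over class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, splits):
    # Entropy(before) minus the P(j)-weighted entropy of each split
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])
print(gini(parent))                             # 0.5 (perfectly mixed)
print(entropy(parent))                          # 1.0 bit
print(information_gain(parent, [left, right]))  # ~0.19 bits

On this perfectly mixed parent node the Gini impurity is 0.5 and the entropy is 1 bit; the 3-to-1 split recovers roughly 0.19 bits of information, so it is better than no split but far from a clean separation.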

3. Strengths and Weaknesses

4. Suitable scenarios

Consideration:

Use cases:

5. Simple implementations

from sklearn import tree

dt = tree.DecisionTreeClassifier(random_state=1, max_depth=4)
dt.fit(data_train, label_train)
dt_score_train = dt.score(data_train, label_train)
print("Training score: ", dt_score_train)
dt_score_test = dt.score(data_test, label_test)
print("Testing score: ", dt_score_test)

# Predict on new, unseen data
dt.predict(data_pred)
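
A note on the design choice above: max_depth caps how deep the tree can grow and acts as a simple form of pre-pruning; without it, a decision tree keeps splitting until it memorizes the training data, which usually shows up as a large gap between the training and testing scores printed above.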

Naive Bayes

1. Definition

2. Pros & Cons

Pros

Cons

3. Applications

4. Simple Implementation

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# train_x and train_y are assumed to come from a train/test split
model = GaussianNB()

# fit the model with the training data
model.fit(train_x, train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data', predict_train)
print('Train accuracy:', accuracy_score(train_y, predict_train))
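
One caveat worth remembering: GaussianNB assumes each feature is conditionally Gaussian given the class (on top of the naive conditional-independence assumption). For discrete count features, MultinomialNB or BernoulliNB from the same sklearn.naive_bayes module are usually the better choice.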