
Dimensionality Reduction: Life Savers

We must dispel the curse of dimensionality!

2020.03.04 · 7 min read · by Zhenlin Wang · updated 2022-01-11

Overview

High-dimensional data modeling has always been a popular topic of discussion. Much of the research and engineering in this field exists simply because our computational power is limited. Even if quantum technology greatly boosts that power in the near future, we will still face a seemingly unlimited flow of data and features. It is therefore extremely important to reduce the dimensionality of the data we feed into a model.

In this blog, we explore several well-known tools for dimensionality reduction, namely Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), and Nonnegative Matrix Factorization (NMF).

LDA

1. Definition

2. Pros & Cons

Pros

Cons

3. Application

4. Code implementation

from sklearn.datasets import load_wine
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Loading Data
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Categorical.from_codes(wine.target, wine.target_names)
# Merge X and y (Training set)
df = X.join(pd.Series(y, name='class'))

## The simple way: using sklearn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X, y)

## The hard way: understand the logic step by step
# Compute the per-class feature means (drop the label column before averaging)
class_feature_means = pd.DataFrame(columns=wine.target_names)
for c, rows in df.groupby('class'):
    class_feature_means[c] = rows.drop('class', axis=1).mean()
class_feature_means

# Compute the within-class scatter matrix using the per-class means
within_class_scatter_matrix = np.zeros((13, 13))
for c, rows in df.groupby('class'):
    rows = rows.drop(['class'], axis=1)
    s = np.zeros((13, 13))
    for index, row in rows.iterrows():
        x, mc = row.values.reshape(13, 1), class_feature_means[c].values.reshape(13, 1)
        s += (x - mc).dot((x - mc).T)
    within_class_scatter_matrix += s

# Compute the Between Class Scatter Matrix:
feature_means = df.drop('class', axis=1).mean()
between_class_scatter_matrix = np.zeros((13,13))
for c in class_feature_means:
    n = len(df.loc[df['class'] == c].index)

    mc, m = class_feature_means[c].values.reshape(13,1), feature_means.values.reshape(13,1)

    between_class_scatter_matrix += n * (mc - m).dot((mc - m).T)

# Compute the Eigenvalues & Eigenvectors, then sort accordingly

eigen_values, eigen_vectors = np.linalg.eig(np.linalg.inv(within_class_scatter_matrix).dot(between_class_scatter_matrix))

pairs = [(np.abs(eigen_values[i]), eigen_vectors[:,i]) for i in range(len(eigen_values))]
pairs = sorted(pairs, key=lambda x: x[0], reverse=True)

# Print the Explained Variance
eigen_value_sums = sum(eigen_values)
print('Explained Variance')
for i, pair in enumerate(pairs):
    print('Eigenvector {}: {}'.format(i, (pair[0]/eigen_value_sums).real))


# Keep the top k = 2 eigenvector pairs (the principal discriminants); project X onto the new feature space X_lda
w_matrix = np.hstack((pairs[0][1].reshape(13,1), pairs[1][1].reshape(13,1))).real
X_lda = np.array(X.dot(w_matrix))

le = LabelEncoder()
y = le.fit_transform(df['class']) # Here the y is just the encoded label set

# Visualize
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.scatter(
    X_lda[:,0],
    X_lda[:,1],
    c=y,
    cmap='rainbow',
    alpha=0.7,
    edgecolors='b'
)
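
As a quick sanity check (an extra snippet, not part of the original walkthrough), the eigenvalue ratios printed above should roughly match what the sklearn estimator fitted earlier reports:

# The manual ratios should agree with sklearn's explained_variance_ratio_;
# the projected coordinates themselves may differ by sign or scale, which is expected.
print(lda.explained_variance_ratio_)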

PCA

1. Definition

2. Pros & Cons

Pros

Cons

3. Application

4. Comparison

PCA vs LDA

PCA vs ICA

5. Steps

  1. Standardize the data so that every variable has mean 0 and standard deviation 1.
  2. Compute the covariance matrix, which captures how each variable varies with every other variable in the dataset.
  3. Eigen-decompose the covariance matrix and keep the top-k eigenvectors, i.e. those with the largest eigenvalues.
  4. Project the standardized data onto those eigenvectors to obtain the transformed features (see the formulas below).
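
In symbols, the four steps amount to the following (a compact restatement, assuming a standardized n × d data matrix X):

$$ C = \frac{1}{n-1} X^\top X, \qquad C = V \Lambda V^\top, \qquad Z = X V_k $$

where V_k collects the eigenvectors belonging to the k largest eigenvalues in Λ and Z is the reduced representation.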

6. Code implementation

  1. sklearn version
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style='white')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from sklearn import decomposition
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D

digits = datasets.load_digits()
X = digits.data
y = digits.target

# Show the first ten 8x8 digit images
plt.figure(figsize=(16, 6))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X[i,:].reshape([8,8]), cmap='gray')

pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print('Projecting %d-dimensional data to 2D' % X.shape[1])

plt.figure(figsize=(12,10))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y,
            edgecolor='none', alpha=0.7, s=40,
            cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.colorbar()
plt.title('Digits dataset. PCA projection')

# Fit PCA with all components to inspect the cumulative explained variance
pca = decomposition.PCA().fit(X)

plt.figure(figsize=(10,7))
plt.plot(np.cumsum(pca.explained_variance_ratio_), color='k', lw=2)
plt.xlabel('Number of components')
plt.ylabel('Total explained variance')
plt.xlim(0, 63)
plt.yticks(np.arange(0, 1.1, 0.1))
plt.axvline(21, c='b')
plt.axhline(0.9, c='r')
plt.show()
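
One convenient variant worth knowing (assuming a reasonably recent scikit-learn): passing a fraction instead of an integer to PCA makes it keep just enough components to reach that share of the variance, so the threshold does not have to be read off the plot.

pca_90 = decomposition.PCA(n_components=0.9)  # keep enough components for 90% of the variance
X_reduced_90 = pca_90.fit_transform(X)
print(pca_90.n_components_)  # should land near the ~21 components marked on the plot above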
  2. numpy version
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as rnd

# Create random 2d data
mu = np.array([10,13])
sigma = np.array([[3.5, -1.8], [-1.8,3.5]])

print("Mu ", mu.shape)
print("Sigma ", sigma.shape)

# Create 1000 samples using mean and sigma
x = rnd.multivariate_normal(mu, sigma, size=(1000))
print("Data shape ", x.shape)

# normalize / standardize (keep the mean and std around so the transform can be undone later)
x_mean = x.mean(axis=0, keepdims=True)
x_std = x.std(axis=0, keepdims=True)
x = (x - x_mean) / x_std

# compute covariance
cov = np.cov(x.T)  # IMPT: don't forget the transpose here
eig_val, eig_vec = np.linalg.eig(cov)

indices = np.argsort(eig_val * -1)
eig_val = eig_val[indices]
eig_vec = eig_vec[:,indices]

print("Sorted Eigen vectors ", eig_vec)
print("Sorted Eigen values ", eig_val, "\n")

# Get explained variance
sum_eig_val = np.sum(eig_val)
explained_variance = eig_val / sum_eig_val
print(explained_variance)
cumulative_variance = np.cumsum(explained_variance)
print(cumulative_variance)

pca_data = np.dot(x, eig_vec)
print("Transformed data ", pca_data.shape)

# Reverse the PCA transformation (undo the projection, then un-standardize)
recon_data = pca_data.dot(eig_vec.T) * x_std + x_mean
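
A quick sanity check (an extra snippet, relying on the x_mean and x_std variables stored during standardization above): with no components dropped, reversing the projection should recover the original samples up to floating-point error.

# The eigenvector matrix is orthogonal, so the full round trip is numerically lossless.
original = x * x_std + x_mean
print("Max reconstruction error: ", np.max(np.abs(recon_data - original)))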

NMF

1. Definition

2. Application

3. Code Implementation

from operator import itemgetter
from concurrent.futures import ProcessPoolExecutor
from time import time
import os
import gensim
import pandas as pd
import itertools
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer, LabelEncoder

lemmatizer = WordNetLemmatizer()

data_path = 'matrix_factorization_arxiv_query_result.json'
articles_df = pd.read_json(data_path)
articles_df.head()


def stem(text):
    return lemmatizer.lemmatize(text)


def map_parallel(f, iterable, **kwargs):
    with ProcessPoolExecutor() as pool:
        result = pool.map(f, iterable, **kwargs)
    return result


# Only needed to regenerate the dataset; requires the `arxiv` package and a
# `search_query` string defined elsewhere (the JSON file above was produced this way).
def retrieve_articles(start, chunksize=1000):
    return arxiv.query(
        search_query=search_query,
        start=start,
        max_results=chunksize
    )

def vectorize_text(examples_df, categories, vectorized_column='summary', vectorizer=CountVectorizer):
    # Turn the text column into a feature matrix and encode the labels,
    # taking the categories as an argument instead of relying on a global.
    vectorizer = vectorizer(min_df=2)
    features = vectorizer.fit_transform(examples_df[vectorized_column])

    le = LabelEncoder()
    ohe = OneHotEncoder()
    labels = le.fit_transform(categories).reshape(-1, 1)
    labels_ohe = ohe.fit_transform(labels).todense()
    vectorized_data = {
        'features': features,
        'labels': labels,
        'labels_onehot': labels_ohe
    }
    return vectorized_data, (vectorizer, ohe, le)


def extract_keywords(text):
    """
    Use gensim's textrank-based approach
    """
    return gensim.summarization.keywords(
        text=stem(text),
        lemmatize=True
    )

def filter_out_small_categories(df, categories, threshold=200):

    class_counts = categories.value_counts()
    too_small_classes = class_counts[class_counts < threshold].index

    valid_example_indices = ~categories.isin(too_small_classes)
    valid_examples = df[valid_example_indices]
    valid_example_categories = categories[valid_example_indices]

    return valid_examples, valid_example_categories

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % (topic_idx + 1)
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

categories = articles_df['arxiv_primary_category'].apply(itemgetter('term'))

main_categories = categories.apply(lambda s: s.split('.')[0].split('-')[0])

main_categories_counts = main_categories.value_counts(ascending=True)
main_categories_counts.plot.barh()
plt.show()

main_categories_counts[main_categories_counts > 200].plot.barh()
plt.show()

categories.value_counts(ascending=True)[-10:].plot.barh()
plt.show()


articles_df['summary_keywords'] = [extract_keywords(i) for i in articles_df['summary']]

article_keyword_lengths = articles_df['summary_keywords'].apply(lambda kws: len(kws.split('\n')))

valid_examples, valid_example_categories = filter_out_small_categories(articles_df, main_categories)

vectorized_data, (vectorizer, ohe, le) = vectorize_text(
    valid_examples,
    valid_example_categories,
    vectorized_column='summary_keywords',
    vectorizer=TfidfVectorizer
)

x_train, x_test, y_train, y_test, y_train_labels, y_test_labels = train_test_split(
    vectorized_data['features'],
    vectorized_data['labels_onehot'],
    vectorized_data['labels'],
    stratify=vectorized_data['labels'],
    test_size=0.2,
    random_state=0
)

nmf = NMF(n_components=5, solver='mu', beta_loss='kullback-leibler')

topics = nmf.fit_transform(x_train)

n_top_words = 10

tfidf_feature_names = vectorizer.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
print_top_words(nmf, tfidf_feature_names, n_top_words)

dominant_topics = topics.argmax(axis=1) + 1
categories = le.inverse_transform(y_train_labels[:,0])
pd.crosstab(dominant_topics, categories)
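
The held-out split created above is not used in the snippet itself; as a hypothetical follow-up, the fitted factorization can be applied to it with nmf.transform to check that the dominant topics line up with categories on unseen documents too.

# Project the held-out documents onto the learned topics and cross-tabulate.
test_topics = nmf.transform(x_test)
test_dominant_topics = test_topics.argmax(axis=1) + 1
test_categories = le.inverse_transform(y_test_labels[:, 0])
pd.crosstab(test_dominant_topics, test_categories)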