Dimension Reduction: Life savers
Overview
High-dimensional data modeling has always been a popular topic of discussion. A great deal of research and engineering effort goes into this field simply because we have limited computational power. Even if quantum technology greatly boosts this power in the near future, we will still face the curse of an unlimited flow of data and features. It is therefore extremely important to reduce the amount of data we feed into a model.
In this blog, we explore several well-known tools for dimensionality reduction, namely Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA) and Nonnegative Matrix Factorization (NMF).
LDA
1. Definition
Can be either a predictive modeling algorithm for multi-class classification or a dimensionality reduction technique
A generative learning algorithm based on labeled data
Assumes a Gaussian distribution for each attribute; each attribute has the same variance (mean removal / feature engineering with log or root functions / Box-Cox transformation may be needed)
The class whose discriminant function $\delta_k(x) = x^T\hat{\Sigma}^{-1}\hat{\mu}_k - \frac{1}{2}\hat{\mu}_k^T\hat{\Sigma}^{-1}\hat{\mu}_k + \log\hat{\pi}_k$ has the largest value is the output classification ($\hat{y} = \arg\max_k \delta_k(x)$)
LDA creates a new axis based on two criteria:
- Maximize the distance between the class means
- Minimize the variation within each category
Procedure
For Dimensionality Reduction (reduced-rank LDA)
- Compute the $d$-dimensional mean vectors for the different classes from the dataset.
- Compute the scatter matrices (between-class and within-class scatter matrices).
- Compute the eigenvectors ($e_1, e_2, \dots, e_d$) and corresponding eigenvalues ($\lambda_1, \lambda_2, \dots, \lambda_d$) for the scatter matrices.
- Sort the eigenvectors by decreasing eigenvalues and choose the $k$ eigenvectors with the largest eigenvalues to form a $d \times k$ matrix $W$ (where every column represents an eigenvector).
- Use this $d \times k$ eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication $Y = XW$ (where $X$ is an $n \times d$ matrix representing the $n$ samples, and $Y$ holds the transformed $n \times k$ samples in the new subspace).
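A minimal NumPy sketch of these steps; the iris dataset and the choice of $k = 2$ are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]
overall_mean = X.mean(axis=0)

# Within-class (S_W) and between-class (S_B) scatter matrices
S_W = np.zeros((n_features, n_features))
S_B = np.zeros((n_features, n_features))
for c in np.unique(y):
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += X_c.shape[0] * diff @ diff.T

# Eigen-decomposition of S_W^{-1} S_B, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real      # keep k = 2 discriminant directions

Y = X @ W                           # project the samples: Y = XW, shape (n, 2)
print(Y.shape)
```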
For classification (essentially, LDA classifies the sphered data to the closest class mean.)
- Perform eigen-decomposition on the pooled covariance matrix: $\hat{\Sigma} = UDU^T$
- Sphere the data: $X^* = D^{-1/2}U^T X$ to produce an identity covariance matrix in the transformed space
- Obtain the group means $\hat{\mu}_1^*, \dots, \hat{\mu}_K^*$ in the transformed space
- Classify $x^*$ according to $\hat{G}(x) = \arg\min_k \big[\tfrac{1}{2}\lVert x^* - \hat{\mu}_k^* \rVert^2 - \log\hat{\pi}_k\big]$, where $\hat{\pi}_k$ is the group's prior probability
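For intuition, here is a rough sketch of the sphering route; the pooled-covariance estimator, uniform handling of priors, and the helper names `fit_lda_sphered` / `predict_lda_sphered` are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris

def fit_lda_sphered(X, y):
    classes = np.unique(y)
    # Pooled (within-class) covariance matrix
    Sigma = sum((X[y == c] - X[y == c].mean(axis=0)).T
                @ (X[y == c] - X[y == c].mean(axis=0)) for c in classes)
    Sigma /= (len(X) - len(classes))
    # Eigen-decomposition and sphering transform X* = D^{-1/2} U^T X
    D, U = np.linalg.eigh(Sigma)
    sphere = np.diag(D ** -0.5) @ U.T
    means = {c: sphere @ X[y == c].mean(axis=0) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    return sphere, means, priors

def predict_lda_sphered(X, sphere, means, priors):
    Xs = X @ sphere.T
    classes = list(means)
    # Closest sphered class mean, adjusted by the log prior
    scores = np.stack([-0.5 * np.sum((Xs - means[c]) ** 2, axis=1) + np.log(priors[c])
                       for c in classes], axis=1)
    return np.array(classes)[np.argmax(scores, axis=1)]

X, y = load_iris(return_X_y=True)
sphere, means, priors = fit_lda_sphered(X, y)
print((predict_lda_sphered(X, sphere, means, priors) == y).mean())  # training accuracy
```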
Extensions to LDA
- Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are multiple input variables).
- Flexible Discriminant Analysis (FDA): Non-linear combinations of inputs are used, such as splines.
- Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.
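In scikit-learn, QDA is available out of the box, and LDA with shrinkage behaves like a regularized (RDA-style) variant; the breast-cancer dataset and 5-fold cross-validation below are arbitrary illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

qda = QuadraticDiscriminantAnalysis()                                # per-class covariance
rlda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")   # shrunk covariance

print("QDA         :", cross_val_score(qda, X, y, cv=5).mean())
print("RDA-like LDA:", cross_val_score(rlda, X, y, cv=5).mean())
```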
2. Pros & Cons
Pros
- Need Less Data
- Simple prototype classifier: the distance to the class mean is used, so it is simple to interpret.
- The decision boundary is linear: It's simple to implement and the classification is robust.
Cons
- Linear decision boundaries may not adequately separate the classes. Support for more general boundaries is desired.
- In a high-dimensional setting, LDA uses too many parameters. A regularized version of LDA is desired.
- Support for more complex prototype classification is desired.
3. Application
- Bankruptcy prediction
- Facial recognition
4. Code implementation
```python
from sklearn.datasets import load_wine
```
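A fuller sketch built around this import; the 70/30 split, `random_state`, and `n_components=2` are assumptions:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# As a classifier
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, lda.predict(X_test)))

# As a dimensionality reducer: project the 13 features onto 2 discriminants
X_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print("Reduced shape:", X_2d.shape)   # (178, 2)
```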
PCA
1. Definition
A linear model (base version) aimed at unsupervised dimensionality reduction.
Basic intuition: projecting data onto its orthogonal feature subspace
It is a technique for feature extraction: it combines our input variables in a specific way, then we can drop the "least important" variables while still retaining the most valuable parts of all of them (high variance, independent, few in number)
Each of the "new" variables after PCA is uncorrelated with the others (independent under the linear model assumption)
PCA effectively minimizes error orthogonal to the model itself
In its basic form it only captures linear structure in the data (non-linear structure requires extensions such as kernel PCA)
Complete procedure
- Standardize the data.
- Obtain the Eigenvectors and Eigenvalues from the covariance matrix or correlation matrix (by performing eigendecomposition), or perform Singular Value Decomposition (SVD) [SVD preferred due to efficiency and automatic eigenvalue sorting]
- Find the hyperplane that accounts for most of the variation using SVD (that hyperplane represents the 1st component);
- This basically means it tries to minimize the points' orthogonal distance to the hyperplane, which is equivalent to maximizing the distance from each projected point to the origin (the variance along the component)
- Find the orthogonal subspace, and within it use SVD again to get the hyperplane with the largest variation (the 2nd component, which is by construction uncorrelated with the 1st)
- Repeat to find all the components
- Use the eigenvalues to determine the proportion of variation that each component accounts for;
- Construct the new coordinate system using the principal (top k) components; plot the samples using their projections on the components as coordinates
- We should keep the component if it contributes substantially to the variation
- It eventually clusters samples that are highly correlated;
It is possible to restore all the samples if each component corresponds to one distinct variable (note that # of PCs < # of samples)
Warning: scaling and centering the data is very important!
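A quick NumPy check of the eigendecomposition-vs-SVD equivalence mentioned in step 2; the random data here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize (center and scale)

# Route 1: eigen-decomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the standardized data (singular values come pre-sorted)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = s ** 2 / (len(Xc) - 1)

print(np.allclose(eigvals, svd_eigvals))             # same explained variance
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))    # same principal axes (up to sign)
```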
Kernel PCA
- An extension of PCA to non-linear datasets by projecting the dataset into a higher-dimensional feature space using the kernel trick (recall SVM)
- Some popular kernels are
  - Gaussian/RBF: $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$
  - Polynomial: $k(x, y) = (x^T y + c)^d$
  - Hyperbolic tangent: $k(x, y) = \tanh(\kappa\, x^T y + c)$
- Note that the kernel matrix still needs to be centered/normalized for PCA to use
- Kernel PCA builds and decomposes an $n \times n$ kernel matrix, so kernel PCA will have difficulties if we have lots of data points
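A small scikit-learn sketch of kernel PCA on a toy non-linear dataset; the RBF kernel and `gamma=10` are illustrative choices:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_linear = PCA(n_components=2).fit_transform(X)       # cannot "unroll" the circles
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)                        # circles become linearly separable

print(X_linear.shape, X_kpca.shape)
```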
Robust PCA
- An extension of PCA to deal with sparsity (e.g. corrupted entries) in the matrix
- It factorizes a matrix into the sum of 2 matrices, $M = L + S$, where $M$ is the original matrix, $L$ is the low-rank matrix (with lots of redundant information) and $S$ is a sparse matrix (in the case of corrupted data, $S$ often captures the corrupted entries)
- Application: Latent semantic indexing => $L$ captures all common words while $S$ captures the key words that best identify each document
- The minimization is over $\lVert L \rVert_* + \lambda \lVert S \rVert_1$ subject to $L + S = M$. Minimizing the $L_1$-norm results in sparse values, while minimizing the nuclear norm $\lVert L \rVert_*$ (sometimes the Frobenius norm is also used) leads to sparse singular values (hence low rank)
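scikit-learn has no Robust PCA estimator, so the following is only a compact sketch of principal component pursuit via an inexact augmented Lagrangian scheme; the step-size heuristic and $\lambda = 1/\sqrt{\max(n_1, n_2)}$ are conventional choices rather than anything prescribed above:

```python
import numpy as np

def soft_threshold(X, tau):
    # Entry-wise shrinkage: proximal operator of the L1 norm
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    # Singular-value shrinkage: proximal operator of the nuclear norm
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def robust_pca(M, max_iter=500, tol=1e-7):
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2))            # conventional sparsity weight
    mu = n1 * n2 / (4.0 * np.abs(M).sum())      # common step-size heuristic
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)     # low-rank update
        S = soft_threshold(M - L + Y / mu, lam / mu)    # sparse update
        residual = M - L - S
        Y += mu * residual                              # dual variable update
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S

# Toy example: a rank-1 matrix plus a few large corruptions
rng = np.random.default_rng(0)
base = np.outer(rng.normal(size=50), rng.normal(size=40))
noise = np.zeros_like(base)
noise.flat[rng.integers(0, base.size, 30)] = 10.0
L, S = robust_pca(base + noise)
print(np.linalg.norm(L - base) / np.linalg.norm(base))  # relative recovery error of L
```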
2. Pros & Cons
Pros
- Removes Correlated Features
- Improves Algorithm Performance
- Reduces Overfitting
Cons
- Independent variables become less interpretable
- Data standardization is a must before PCA
- Information Loss (if PCs chosen are not sufficient)
3. Application
- When interpretability is not an issue, use PCA
- When the dimension is too large or you want to identify features that are independent, use PCA
4. Comparison
PCA vs LDA
- Not as good as LDA for clustering/classification, yet ideal for factor analysis
- PCA projects the entire dataset onto a different feature (sub)space, and LDA tries to determine a suitable feature (sub)space in order to distinguish between patterns that belong to different classes
PCA vs ICA
- In PCA the basis you want to find is the one that best explains the variability of your data; In ICA the basis you want to find is the one in which each vector is an independent component of your data (which usually has mixed signals/mixed features)
5. Steps
- Standardize the data to ensure that all variables have a mean of 0 and a standard deviation of 1.
- Compute covariance matrix: This matrix shows how each variable is related to every other variable in the dataset
- Eigen-decomp + feature selection
- Projection / transformation
6. Code implementation
sklearn version
```python
import numpy as np
```
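A minimal sketch of how the scikit-learn version might continue; the iris dataset and two retained components are assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)      # standardization is a must before PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("Principal axes:\n", np.round(pca.components_, 3))
```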
numpy version
```python
import numpy as np
```
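And a from-scratch sketch following the SVD procedure above; the function name `pca_numpy` and the iris data are assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris

def pca_numpy(X, n_components=2):
    # Standardize, then take the SVD of the standardized data
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(X_std, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)       # proportion of variance per component
    components = Vt[:n_components]            # principal axes (rows)
    return X_std @ components.T, explained[:n_components]

X, _ = load_iris(return_X_y=True)
X_proj, ratios = pca_numpy(X, n_components=2)
print(X_proj.shape, np.round(ratios, 3))
```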
NMF
1. Definition
- It automatically extracts sparse and meaningful (easily interpretable) features from a set of nonnegative data vectors.
- We basically factorize $V$ into 2 smaller non-negative matrices $W$ and $H$ such that $V \approx WH$, where $V$ is $n \times m$, $W$ is $n \times r$ and $H$ is $r \times m$ with $r \ll \min(n, m)$ (a low-rank approximate factorization)
- Interpretation of $W$ and $H$:
  - Basically, we can interpret each row of $V$ as a weighted sum of some components, where each row in $H$ is a component, and each row in $W$ contains the weights of each component
- Idea of the algo:
- Formalize an objective function and iteratively optimize it
- A local minimum is sufficient for the solution
- Objective function to minimize:
  - Frobenius norm: $\min_{W, H} \frac{1}{2}\lVert V - WH \rVert_F^2$ w.r.t. $W, H$ s.t. $W \ge 0,\ H \ge 0$
  - Generalized Kullback-Leibler divergence: $D(V \,\|\, WH) = \sum_{i,j}\Big(V_{ij}\log\frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij}\Big)$
- Choices of optimization technique used:
  - Coordinate descent (alternative: gradient descent, which fixes $W$ and optimizes $H$, then fixes $H$ and optimizes $W$, until the tolerance is met)
  - Multiplicative update (see the sketch after this list): $H \leftarrow H \odot \dfrac{W^T V}{W^T W H}$, $W \leftarrow W \odot \dfrac{V H^T}{W H H^T}$
- Method to choose the optimal factorization rank $r$:
  - General guidelines:
    - Trial and error
    - Estimation using SVD, based on the decay of the singular values
    - Insights from domain experts
- Tricks
  - Initialization: use SVD to compute a rough estimate of the matrices $W$ and $H$. If the non-negativity condition did not exist, taking the top $k$ singular values and their corresponding vectors would construct the best rank-$k$ estimate, measured by the Frobenius norm. Since $W$ and $H$ must be non-negative, we must slightly modify the vectors we use.
  - Regularization: since $W$ represents the weights of the components, it may produce weights that are too high/low. The classical way is to add $L_1$ or $L_2$ regularization losses.
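A bare-bones sketch of the multiplicative update rule for the Frobenius objective (referenced in the optimization bullet above); the random initialization and iteration count are arbitrary choices:

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, eps=1e-10):
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H),  W <- W * (V H^T) / (W H H^T)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).random((20, 10)))
W, H = nmf_multiplicative(V, r=3)
print(np.linalg.norm(V - W @ H))   # Frobenius reconstruction error
```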
2. Application
- NMF is suited for tasks where the underlying factors can be interpreted as non-negative
- Image processing
- Topic Modeling
- Text mining
- Hyperspectral unmixing
3. Code Implementation
```python
from operator import itemgetter
```
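A self-contained topic-modeling sketch in that spirit; the toy corpus, `n_components=2`, the NNDSVD initialization and the use of `itemgetter` to rank words are illustrative assumptions:

```python
from operator import itemgetter

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the stock market fell as investors sold shares",
    "investors watch the stock price and market trends",
    "the team won the match with a late goal",
    "fans cheered as the team scored another goal",
]

tfidf = TfidfVectorizer(stop_words="english")
V = tfidf.fit_transform(docs)                  # non-negative document-term matrix

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(V)                       # document-topic weights
H = nmf.components_                            # topic-term components

words = tfidf.get_feature_names_out()
for k, topic in enumerate(H):
    top = sorted(zip(words, topic), key=itemgetter(1), reverse=True)[:4]
    print(f"Topic {k}:", [w for w, _ in top])
```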