Dimension Reduction and Visualization with PCA, SVD, EVD and more
DimRed - Dimension Reduction Package
DimRed Introduction
DimRed is a python package that enables Dimension Reduction leveraging various algorithms with the default being PCA (Principal Component Analysis). The algorithms supported so far are:
- numpy: EVD, SVD
- sklearn: PCA, SparsePCA and TruncatedSVD
This package also offers some visualization capabilities to explore the principal components (up to 2 or 3 PCs, in 2D or 3D).
DimRed has some built-in functions written in numpy and others leveraging the well known sklearn built-in functions:
- Internally built SVD and EVD methods with numpy:
  - `dimred_svd` - Dimension reduction using the Singular Value Decomposition: X . V = U . S ==> X = U . S . Vt. This should return the same results as `sklearn_pca` and it uses `np.linalg.svd`.
  - `dimred_evd` - Dimension reduction using the Eigen Value Decomposition, based on C being the covariance matrix of X: C = X^T . X / (n - 1) and C = Q Λ Q^T, where Λ is a diagonal matrix with the eigenvalues in decreasing order on the diagonal. It uses `np.linalg.eig`.
- `sklearn.decomposition` algorithms:
  - `sklearn_pca` - leverages sklearn PCA(), a linear dimension reduction function that uses SVD. This should return the same results as the numpy based internal implementation of SVD: `dimred_svd`.
  - `sklearn_sparse_pca` - uses sklearn SparsePCA(), also great for sparse matrices that are not of type `scipy.sparse`.
  - `sklearn_truncated_svd` - leverages sklearn TruncatedSVD(), great for handling sparse matrices (with lots of 0s) that are of type `scipy.sparse` (`X.sp_issparse` is True).
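As a quick orientation, here is how you would typically pick one of these algorithms explicitly when instantiating DimRed. This is a sketch based on the algorithm names above and the examples further below; if `algo` is omitted, DimRed picks an algorithm automatically, with `sklearn_pca` as the default.

from dimred import DimRed

# Explicitly select one of the supported algorithms
dimred = DimRed(algo='dimred_svd', n_components=2)          # numpy SVD implementation
# dimred = DimRed(algo='dimred_evd', n_components=2)        # numpy EVD implementation
# dimred = DimRed(algo='sklearn_pca', n_components=2)       # sklearn PCA (the default)
# dimred = DimRed(algo='sklearn_sparse_pca', n_components=2)
# dimred = DimRed(algo='sklearn_truncated_svd', n_components=2)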
Here is an example with PCA (Principal Component Analysis), a linear dimension reduction algorithm that projects your data to a lower dimensional space. It is an unsupervised technique for feature extraction: it combines the input variables in a specific way so that the output "new" variables (or components) are all independent of one another.
PCA aims to find linearly uncorrelated orthogonal axes, which are also known as principal components (PCs) in the m dimensional space to project the data points onto those PCs. The first PC captures the largest variance in the data. Let’s intuitively understand PCA by fitting it on a 2-D data matrix, which can be conveniently represented by a 2-D scatter plot:
Making sense of PCA by fitting on a 2-D dataset (source)
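To make this concrete, here is a minimal sketch (using sklearn directly rather than the DimRed package) that fits PCA on a synthetic 2-D dataset and prints the principal axes and the variance captured by each:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D data with two correlated features
rng = np.random.RandomState(0)
x = rng.normal(size=500)
X_2d = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=500)])

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X_2d)

print(pca.components_)                # principal axes (PC1, PC2)
print(pca.explained_variance_ratio_)  # most of the variance lies along PC1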
And here is an example of dimension reduction on the famous iris dataset using the DimRed package:
# Load the data
import matplotlib.pyplot as plt
from sklearn import datasets
from dimred import DimRed

iris = datasets.load_iris()
X = iris.data
target = iris.target
Dimension Reduction Example - 2D plot / 2 Components:
# Reduce it to 2 principal components
dimred = DimRed(algo='dimred_svd', n_components=2)
X_transf = dimred.fit_transform(X)
# Plot with DimRed - 2d
fig, ax = dimred.draw_scatterplot(X_transf, y=target,
PC=2,
title='Reduced Iris Dataset with DimRed - 2 principal components',
figsize=(8, 6),
legend=True)
plt.show()
Scatter Plot of Iris Dataset reduced to 2 components with DimRed
Dimension Reduction Example - 2D plot / 3 Components (using the 3rd component as bubble size):
# Reduce it to 3 principal components
dimred = DimRed(algo='dimred_svd', n_components=3)
X_transf = dimred.fit_transform(X)
# Plot with DimRed - 2d
fig, ax = dimred.draw_scatterplot(X_transf, y=target,
PC=3,
title='Reduced Iris Dataset with DimRed - 3 principal components',
figsize=(8, 6),
legend=True)
plt.show()
Scatter Plot of Iris Dataset reduced to 3 components with DimRed (2D plot, 3rd component as bubble size)
Dimension Reduction Example - Cumulative Variance:
Below we will reduce the MNIST-style digits dataset (8 x 8 pixel images, i.e. 64 original dimensions) to retain 60% of the variance:
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits(as_frame=True)
X = digits.data

scaler = StandardScaler()
X = scaler.fit_transform(X)

dimred = DimRed(algo='dimred_svd', n_components=0.60)
X_pca = dimred.fit_transform(X)

fig, ax = dimred.draw_varianceplot('MNIST Data')
plt.show()
Cumulative Variance Plot of MNIST Dataset based on 60% Variance retained with DimRed
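As a sanity check (a short sketch, relying on the `explained_variance_ratio_` and `n_components_` attributes shown in the examples further below), you can confirm that the retained components cover at least 60% of the variance:

import numpy as np

# Cumulative variance explained by the components DimRed kept
cum_var = np.cumsum(dimred.explained_variance_ratio_)
print(dimred.n_components_)     # number of components actually kept
print(cum_var[-1] >= 0.60)      # True: at least 60% of the variance is retained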
Dimension Reduction Example - 3D plot:
# Reduce it to 3 principal components
dimred = DimRed(algo='dimred_svd', n_components=3)
X_transf = dimred.fit_transform(X)
# Plot with DimRed - 3d
# give larger values for bubble plot from [1-10] range to [100-1000]
fig, ax = dimred.draw_scatterplot(X_transf, y=target,
PC=3,
title='Reduced Iris Dataset with DimRed - 3 principal components ',
figsize=(8, 6),
legend=True,
dim3=True)
plt.show()
Scatter Plot of Iris Dataset reduced to 3 components with DimRed (3D plot)
Table of contents
- DimRed Installation
- DimRed Examples
- Refresher on Dimension Reduction
- Dimension Reduction Notebooks
- Dimension Reduction Algorithms - Intuition and Mathematics
- Resources
1. DimRed Installation
DimRed
is built as a python package.
You can install it from PyPI as a pip installable package.
Or you can also download the code and do a local install.
1.1 Pip Install
You need to run Python 3.X, and you should set up a virtual environment with conda or virtualenv.
Then run:
pip install -i https://pypi.org/simple/ dimred
1.2 Local Install
You need to run Python 3.X, and you should set up a virtual environment with conda or virtualenv.
- Go to: dimred pypi
- Click on the Download files link
- Download either the Wheel or the Source code
To locally install from Source code, run:
> pip install -r requirements
Finally, don't forget to set your $PYTHONPATH variable to the root of your project if you want to run the tests.
> export PYTHONPATH=/to/root/of/your/project
It should map to: /your/path/dimred/dimred
Tests
For Unit Tests, run:
> pytest
Don't forget to set your $PYTHONPATH
to the root of your project
If you also want to see the print output to stdout, run:
> pytest --capture=tee-sys
For Unit Tests Coverage, run:
> pytest --cov=dimred tests/
Or:
> pytest --capture=tee-sys --cov=dimred tests/
To run a particular test, run:
> pytest --capture=tee-sys --cov=dimred tests/ -k '<your test>'
We should aim for a minimum of 90% code coverage, and preferably close to 100%.
Packaging and uploading to PyPI
See dimred-packaging
2. DimRed Examples
2.1 DimRed on Iris dataset (automatic selection)
Reducing the (150x4) iris matrix to (150x2) with DimRed, letting the package pick the right algorithm (in that case sklearn_pca, which is the default algorithm):
from dimred import DimRed
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
dimred = DimRed(n_components=2)
X_pca = dimred.fit_transform(X)
# Algorithm selected for Dimension Reduction
dimred.algo
> 'sklearn_pca'
# Shapes of both the input matrix and the reduced matrix
X.shape
> (150, 4)
X_pca.shape
> (150, 2)
# Number of components
# (here we specified 2; if not specified, it is estimated from the input data)
dimred.n_components_
> 2
# Principal axes in feature space, representing the directions of maximum variance in the data.
# The components are sorted by `explained_variance_`.
dimred.components_
> array([[ 0.36138659, -0.08452251, 0.85667061, 0.3582892 ],
[ 0.65658877, 0.73016143, -0.17337266, -0.07548102]])
# Amount of variance explained by each of the selected components.
dimred.explained_variance_
> array([4.22824171, 0.24267075])
# Percentage of variance explained by each of the selected components.
dimred.explained_variance_ratio_
> array([0.92461872, 0.05306648])
2.2a DimRed on Friedman Sparse dataset (automatic selection)
Reducing the (30x30) sparse matrix to (30x5) with DimRed, letting the package pick the right algorithm (in that case sklearn_sparse_pca, which uses sklearn SparsePCA()).
from dimred import DimRed
from sklearn.datasets import make_sparse_spd_matrix
X = make_sparse_spd_matrix(dim=30, alpha = .95, random_state=10)
dimred = DimRed(n_components=5, random_int=0)
X_pca = dimred.fit_transform(X)
# Check the algorithm automatically picked is `SparsePCA`
dimred.algo
> 'sklearn_sparse_pca'
X.shape
> (30, 30)
X_pca.shape
> (30, 5)
2.2b DimRed on Friedman Sparse dataset (forced selection)
Reducing the (30x30) sparse matrix to (30x5) with DimRed, this time forcing the algorithm (in that case sklearn_pca, which uses Singular Value Decomposition).
from dimred import DimRed
from sklearn.datasets import make_sparse_spd_matrix
X = make_sparse_spd_matrix(dim=30, alpha = .95, random_state=10)
dimred = DimRed(algo = 'sklearn_pca', n_components=5, random_int=0)
#dimred = DimRed(algo = 'dimred_svd', n_components=5, random_int=0)
X_pca = dimred.fit_transform(X)
# Check the algorithm is the one we specified: `sklearn_pca`
dimred.algo
> 'sklearn_pca'
X.shape
> (30, 30)
X_pca.shape
> (30, 5)
3. Refresher on Dimension Reduction
Dimension reduction (or Dimensionality reduction) refers to techniques for reducing the number of input variables in training data.
When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data. This is called dimensionality reduction.
— Page 11, Machine Learning: A Probabilistic Perspective, 2012.
High-dimensionality might mean hundreds, thousands, or even millions of input variables.
Fewer input dimensions often means correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and may not perform well on new data.
It is desirable to have simple models that generalize well, and in turn, input data with few input variables. This is particularly true for linear models where the number of inputs and the degrees of freedom of the model are often closely related.
Why is Dimension Reduction useful?
- Reduces training time — due to smaller dataset
- Removes noise — by keeping only what’s relevant
- Makes visualization possible — in cases where you have a maximum of 3 principal components
4. Dimension Reduction Notebooks
- DimRed Demo notebook
- PCA implementation with EVD and SVD => provides implementation of PCA with EVD and SVD and shows SVD is a better implementation
- PCA vs LDA and PCA visualization on Iris data
5. Dimension Reduction Algorithms - Intuition and Mathematics
5.1 Dimensionality Reduction Algorithms
There are many algorithms that can be used for dimensionality reduction.
Two main classes of methods are those drawn from linear algebra and those drawn from manifold learning:
=> Linear Algebra Methods
Matrix factorization methods drawn from the field of linear algebra can be used for dimensionality reduction. Some of the more popular methods include:
- PCA: Principal Components Analysis => process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
- SVD: Singular Value Decomposition
- NMF: Non-Negative Matrix Factorization
For more on matrix factorization, see this tutorial
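As a quick illustration of a matrix factorization method other than PCA/SVD, here is a minimal sklearn sketch (independent of the DimRed package) that uses NMF to reduce a non-negative matrix:

import numpy as np
from sklearn.decomposition import NMF

# NMF requires non-negative data
rng = np.random.RandomState(0)
X = np.abs(rng.normal(size=(100, 20)))

nmf = NMF(n_components=5, init='nndsvda', random_state=0, max_iter=500)
W = nmf.fit_transform(X)   # (100, 5) reduced representation
H = nmf.components_        # (5, 20) components
print(W.shape, H.shape)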
=> Manifold Learning Methods
Manifold learning methods seek a lower-dimensional projection of high dimensional input that captures the salient properties of the input data.
Some of the more popular methods include:
- Isomap Embedding
- LLE: Locally Linear Embedding
- MDS: Multidimensional Scaling
- Spectral Embedding
- t-SNE: t-distributed Stochastic Neighbor Embedding
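All of these manifold methods are available in sklearn.manifold. Here is a minimal sketch (independent of DimRed) showing how they are typically invoked to obtain a 2-D embedding:

from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, LocallyLinearEmbedding, MDS, SpectralEmbedding, TSNE

X, _ = load_digits(return_X_y=True)
X = X[:500]   # subset to keep the example fast

embedders = {
    'Isomap': Isomap(n_components=2),
    'LLE': LocallyLinearEmbedding(n_components=2),
    'MDS': MDS(n_components=2),
    'Spectral': SpectralEmbedding(n_components=2),
    't-SNE': TSNE(n_components=2),
}

for name, emb in embedders.items():
    X_2d = emb.fit_transform(X)
    print(name, X_2d.shape)   # each method returns a (n_samples, 2) embedding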
Each algorithm offers a different approach to the challenge of discovering natural relationships in data at lower dimensions.
There is no best dimensionality reduction algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.
5.2 DimRed Package - Supported Algorithms
PCA
When using PCA (Principal Component Analysis), you are using a linear dimension reduction algorithm that projects your data to a lower dimensional space. It is an unsupervised technique for feature extraction: it combines the input variables in a specific way so that the output "new" variables (or components) are all independent of one another. This independence is a benefit given the assumptions of a linear model.
PCA aims to find linearly uncorrelated orthogonal axes, which are also known as principal components (PCs) in the m dimensional space to project the data points onto those PCs. The first PC captures the largest variance in the data.
Let’s intuitively understand PCA by fitting it on a 3-D data matrix, which can be conveniently represented by a 3-D scatter plot:
3D to 2D dimension reduction with PCA (source)
Here you can see:
- the first image has three dimensional data with X, Y and Z axes.
- the second image is a two dimensional space with PC1 and PC2 as axes.

Note these PC1 and PC2 are not our regular dimensions and we cannot name them with any of the previous attribute names. They represent the orthogonal projections along which the variance of the data is high. We will understand more about this while dealing with PCA below.
The PCs can be determined via eigen decomposition of the covariance matrix C. After all, the geometrical meaning of eigen decomposition is to find a new coordinate system made of the eigenvectors of C through rotations. Eigendecomposition of the covariance matrix C:

C = W Λ W^T

In the equation above, the covariance matrix C (m×m) is decomposed into a matrix of eigenvectors W (m×m) and a diagonal matrix of m eigenvalues Λ. The eigenvectors, which are the column vectors in W, are in fact the PCs we are seeking. We can then use matrix multiplication to project the data onto the PC space. For the purpose of dimensionality reduction, we can project the data points onto the first k PCs as the representation of the data:

X_pca = X . W_k, where W_k holds the first k eigenvectors (columns of W)

Notes: PCA is an analysis approach. You can do PCA using SVD, or you can do PCA doing the eigen-decomposition, or you can do PCA using many other methods. In fact, most implementations of PCA actually perform SVD under the hood rather than doing eigen decomposition on the covariance matrix, because SVD can be much more efficient and is able to handle sparse matrices. In addition, there are reduced forms of SVD which are even more economical to compute.
From a high-level view PCA using EVD (eigen-decomposition) has three main steps:
- (1) Compute the covariance matrix of the data
- (2) Compute the eigen values and vectors of this covariance matrix
- (3) Use the eigen values and vectors to select only the most important feature vectors and then transform your data onto those vectors for reduced dimensionality!
PCA can be very easily implemented with numpy, as the key function performing the eigen decomposition, np.linalg.eig, is already built-in:
import numpy as np

def pca(X):
    # Data matrix X, assumes 0-centered
    n, m = X.shape
    assert np.allclose(X.mean(axis=0), np.zeros(m))
    # Compute covariance matrix
    C = np.dot(X.T, X) / (n - 1)
    # Eigen decomposition (note: np.linalg.eig does not sort the eigenvalues)
    eigen_vals, eigen_vecs = np.linalg.eig(C)
    # Project X onto PC space
    X_pca = np.dot(X, eigen_vecs)
    return X_pca
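Note that the function above projects onto all eigenvectors and does not sort them. Here is a minimal sketch of the extra reduction step, sorting the eigenpairs by decreasing eigenvalue and keeping only the first k components (shown on the iris data as an assumed example):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X = X - X.mean(axis=0)          # center the data first
n, m = X.shape

C = np.dot(X.T, X) / (n - 1)
eigen_vals, eigen_vecs = np.linalg.eig(C)

# Sort eigenvectors by decreasing eigenvalue and keep the first k
order = np.argsort(eigen_vals)[::-1]
k = 2
W_k = eigen_vecs[:, order[:k]]

X_reduced = np.dot(X, W_k)      # (150, 2) projection onto the first 2 PCs
print(X_reduced.shape)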
LDA
Both LDA and PCA are linear transformation techniques: LDA is a supervised whereas PCA is unsupervised – PCA ignores class labels.
LDA is very useful to find dimensions which aim at separating clusters, so you have to know the clusters (class labels) beforehand. LDA is not necessarily a classifier, but it can be used as one. Thus LDA can only be used in supervised learning.
=> *LDA is used to carve up multidimensional space.* LDA is for classification; it almost always outperforms Logistic Regression when modeling small data with well separated clusters. It also handles multi-class data and class imbalances.
To contrast with LDA, PCA is a general approach for denoising and dimensionality reduction and does not require any further information such as class labels in supervised learning. Therefore it can be used in unsupervised learning.
=> PCA is used to collapse multidimensional space. PCA allows the collapsing of hundreds of spatial dimensions into a handful of lower spatial dimensions while preserving 70% - 90% of the important information. 3D objects cast 2D shadows: we can see the shape of an object from its shadow, but we can't know everything about the shape from a single shadow. By having a small collection of shadows from different (globally optimal) angles, we can know most things about the shape of an object. PCA helps reduce the 'Curse of Dimensionality' when modeling.
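To make the contrast concrete, here is a small sketch (using sklearn directly, not DimRed) that reduces the iris dataset to 2 dimensions with both PCA (unsupervised) and LDA (supervised, using the class labels):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                            # ignores the labels y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses the labels y

print(X_pca.shape, X_lda.shape)   # (150, 2) (150, 2)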
EVD and SVD
SVD - Singular Value Decomposition
The SVD allows us to describe the effect of a matrix A on a vector (via the matrix-vector product) as a three-step process A = U Σ V†:
- A first rotation in the input space (V)
- A simple positive scaling that takes a vector in the input space to the output space (Σ)
- And another rotation in the output space (U)
Note that V† denotes the conjugate of V^T, hence the two are equal when V is real.
Note that the conditions above are mathematically the following constraints:
- V†V = I (i.e. V is a rotation matrix)
- Σ = diag(σ⃗) and σ⃗ ≥ 0⃗ (diag just returns a diagonal matrix with the given diagonal)
- U†U = I (i.e. U is a rotation matrix)
The fundamental theorem of linear algebra says that such a decomposition always exists.
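These properties are easy to check numerically; here is a small sketch using np.linalg.svd on a random matrix:

import numpy as np

rng = np.random.RandomState(0)
A = rng.normal(size=(5, 3))

U, Sigma, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(U.T @ U, np.eye(3)))          # U has orthonormal columns
print(np.allclose(Vt @ Vt.T, np.eye(3)))        # V has orthonormal columns
print(np.all(Sigma >= 0))                       # singular values are non-negative
print(np.allclose(A, U @ np.diag(Sigma) @ Vt))  # A = U Σ V^T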
SVD is another decomposition method for both real and complex matrices. It decomposes a matrix into the product of two unitary matrices (U, V*) and a rectangular diagonal matrix of singular values (Σ): X = U Σ V*.
Here is a friendlier way to visualize the SVD formula:
Illustration of SVD (source)
In most cases, we work with a real matrix X, and the resultant unitary matrices U and V will also be real matrices. Hence, the conjugate transpose of U is simply the regular transpose.
What is SVD used for?
Wikipedia has a nice list, but here are a couple of examples:
- One of the most common applications is obtaining a low-rank approximation to a matrix (see PCA), which is used for compression, speed-up, and also actual data analysis.
- The other one is for characterizing the pseudo-inverse for analysis or proofs, since inverting it automatically gives a formula that's the inverse when the matrix is invertible, and the pseudo-inverse when it is not.
SVD has also already been implemented in numpy as np.linalg.svd
. To use SVD to transform your data:
import numpy as np

def svd(X):
    # Data matrix X; X doesn't need to be 0-centered
    n, m = X.shape
    # Compute the (thin) SVD
    U, Sigma, Vh = np.linalg.svd(X,
                                 full_matrices=False,  # no need to compute the full U or V matrices
                                 compute_uv=True)
    # Transform X with the SVD components
    X_svd = np.dot(U, np.diag(Sigma))
    return X_svd
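For example, a small usage sketch (assuming you center the data first, so the result matches PCA up to the sign of each component):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
Xc = X - X.mean(axis=0)             # center the data to make SVD equivalent to PCA

X_svd = svd(Xc)                     # SVD-based transform from the function above
X_pca = PCA(n_components=4).fit_transform(X)

# Same projection up to the sign of each principal component
print(np.allclose(np.abs(X_svd), np.abs(X_pca)))  # True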
Relationship between PCA and SVD
PCA and SVD are closely related approaches and can both be applied to decompose any rectangular matrix. We can look into their relationship by performing SVD on the covariance matrix C. With X = U Σ V^T (X being 0-centered):

C = X^T . X / (n - 1) = V Σ U^T U Σ V^T / (n - 1) = V (Σ^2 / (n - 1)) V^T

From the above derivation, we notice that the result has the same form as the eigen decomposition of C, so we can easily see the relationship between the singular values (Σ) and the eigenvalues (Λ):

Λ = Σ^2 / (n - 1)
To confirm that with numpy (a sketch, assuming X is a 0-centered data matrix with n rows, e.g. the centered iris data):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X = X - X.mean(axis=0)   # center the data
n = X.shape[0]

# Compute covariance matrix
C = np.dot(X.T, X) / (n - 1)
# Eigen decomposition
eigen_vals, eigen_vecs = np.linalg.eig(C)
# SVD
U, Sigma, Vh = np.linalg.svd(X,
                             full_matrices=False,
                             compute_uv=True)
# Relationship between singular values and eigen values
# (sort both, since np.linalg.eig gives no ordering guarantee):
print(np.allclose(np.sort(np.square(Sigma) / (n - 1)), np.sort(eigen_vals)))  # True
So what does this imply?
It suggests that we can actually perform PCA using SVD, or vice versa. In fact, most implementations of PCA actually perform SVD under the hood rather than doing eigen decomposition on the covariance matrix, because SVD can be much more efficient and is able to handle sparse matrices. In addition, there are reduced forms of SVD which are even more economical to compute.
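For instance, here is a minimal sketch (independent of DimRed) showing a truncated SVD applied directly to a scipy.sparse matrix, something a dense covariance-based eigen decomposition cannot do without first materializing the full matrix:

import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# A large, very sparse random matrix (1% non-zero entries)
X_sparse = sp.random(10000, 500, density=0.01, format='csr', random_state=0)

svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_sparse)    # works directly on the sparse matrix

print(X_reduced.shape)                     # (10000, 10)
print(svd.explained_variance_ratio_.sum())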
EVD - Eigenvalue (spectral) decomposition
Similarly, for the eigendecomposition (also known as eigenvalue decomposition, spectral decomposition, or diagonalization):
An eigendecomposition describes the effect of a matrix A on a vector as a different 3-step process A = Q Λ Q^-1:
- An invertible linear transformation (Q^-1)
- A scaling (Λ)
- The inverse of the initial transformation (Q)
Correspondingly, these conditions imply the following constraints:
- Q is invertible
- Λ = diag(λ⃗)

This decomposition doesn't always exist, but the spectral theorem describes the conditions under which such a decomposition exists.
Note the most basic requirement is that A be a square matrix (but this is not enough).
What is EVD used for?
- It gives you the ability to efficiently raise a matrix to a large power: A^n = Q Λ^n Q^-1. For this reason (and others) it's used heavily in engineering to, say, efficiently analyze and predict the behavior of a linear dynamical system at a future point in time, since this is much faster than manually exponentiating the matrix directly.
- It's also used to analyze the response of a linear system at different frequencies. (Sinusoids of different frequencies are orthogonal, so you get the orthogonal diagonalizability for free.)
- Furthermore, it's also a problem that repeatedly comes up when solving differential equations analytically.
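As an illustration of the matrix-power use case, a small numpy sketch (on a symmetric matrix, so the eigendecomposition is real and well behaved):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # symmetric, hence diagonalizable with real eigenvalues

eigvals, Q = np.linalg.eig(A)

# A^10 computed via the eigendecomposition: A^n = Q Λ^n Q^-1
A_pow_10 = Q @ np.diag(eigvals ** 10) @ np.linalg.inv(Q)

print(np.allclose(A_pow_10, np.linalg.matrix_power(A, 10)))  # True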
EVD vs SVD
Consider the eigendecomposition A = P D P^-1 and the SVD A = U Σ V*. Some key differences are as follows:
- The vectors in the eigendecomposition matrix P are not necessarily orthogonal, so the change of basis isn't a simple rotation. On the other hand, the vectors in the matrices U and V in the SVD are orthonormal, so they do represent rotations (and possibly flips).
- In the SVD, the non-diagonal matrices U and V are not necessarily the inverse of one another. They are usually not related to each other at all. In the eigendecomposition the non-diagonal matrices P and P^-1 are inverses of each other.
- In the SVD the entries in the diagonal matrix Σ are all real and non-negative. In the eigendecomposition, the entries of D can be any complex number - negative, positive, imaginary, whatever.
- The SVD always exists for any sort of rectangular or square matrix, whereas the eigendecomposition can only exist for square matrices, and even among square matrices it sometimes doesn't exist.
Using the SVD to perform PCA makes much better sense numerically than forming the covariance matrix (as in EVD) to begin with, since the explicit formation of X^T X can cause a loss of precision.
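A classic illustration of this loss of precision is a sketch with a Läuchli-type matrix, where the small singular value is squared away when the covariance matrix is formed in floating point:

import numpy as np

eps = 1e-8
X = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

# Singular values computed directly from X: the small one (eps) is recovered
print(np.linalg.svd(X, compute_uv=False))          # approx [1.41421356e+00, 1.00000000e-08]

# Forming X^T X first squares the singular values; eps^2 = 1e-16 is lost next to 1.0,
# so the small eigenvalue comes back as (numerically) 0
eigvals = np.linalg.eigvalsh(X.T @ X)
print(np.sqrt(np.maximum(eigvals, 0)))             # approx [0., 1.41421356]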
6. Resources
Articles
- scikit learn PCA
- MIT open source PCA package
- PCA and SVD explained with numpy
- Mathematical explanation of PCA and SVD
- Mathematical explanation - how to use SVD to perform PCA
- Tutorial on Principal Component Analysis - White paper
- implement PCA using SVD with sklearn and numpy
- EVD and SVD white paper
- Difference between EVD and SVD
- Dimension reduction algorithms in python
- PCA explained on wine data with chart animation
- Dimensionality Reduction with PCA and t-SNE in R
- What are Eigenvalues and eigenvectors
- PCA with numpy