Dimensionality Reduction Package

Python package for plug-and-play dimensionality reduction techniques, data clustering, and visualization in a reduced space. Using this package, you can reduce your data set and plot it according to a target variable with a 2D or 3D chart and a matrix plot, without worrying about normalizing or scaling your dataset for the different techniques.

If you like the idea or find this repo useful in your job, please leave a star to support this personal project.

The available techniques are: t-SNE, LDA, UMAP, PCA, FA, Truncated SVD, Kernel PCA, MDS, and Isomap.

At the moment the package is not available using pip install <PACKAGE-NAME>.

To install from the source code, see the Installation section below.

Each available method returns a Pandas dataframe whose columns are the selected number of components plus the target column: the leading columns are the components obtained from the dimensionality reduction technique, and the last column is the target variable passed as input. Moreover, each method creates two figures: the first is a scatter plot (2D or 3D) of the reduced data points, and the second is a pair plot. The 2D and 3D plots are displayed only if the requested number of components is 2 or 3, respectively.
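The returned layout described above can be sketched as follows. This is only an illustration: the helper name to_component_frame and the column names component_1, component_2 and target are assumptions, not the names the package actually uses.

```python
import numpy as np
import pandas as pd


def to_component_frame(components, target):
    """Hypothetical helper: component columns first, target column last."""
    components = np.asarray(components)
    # Assumed naming scheme; the package may label its columns differently.
    columns = [f"component_{i + 1}" for i in range(components.shape[1])]
    frame = pd.DataFrame(components, columns=columns)
    frame["target"] = np.asarray(target)
    return frame


# Two made-up reduced components for four samples, plus their labels.
reduced = np.array([[0.1, 1.2], [0.3, 0.9], [2.1, -0.4], [1.8, -0.7]])
labels = np.array([0, 0, 1, 1])

df = to_component_frame(reduced, labels)
print(df.columns.tolist())
```

With two components requested, a frame like this has three columns, and the target is always the last one.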

Summary table

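| Technique     | Supervised Dataset | Unsupervised Dataset | Numerical Feature | Categorical Feature |
|---------------|--------------------|----------------------|-------------------|---------------------|
| t-SNE         | v                  | v                    | v                 | x                   |
| LDA           | v                  | x                    | v                 | x                   |
| UMAP          | v                  | v                    | v                 | x                   |
| PCA           | v                  | v                    | v                 | x                   |
| FA            | v                  | v                    | v                 | x                   |
| Truncated SVD | v                  | v                    | v                 | x                   |
| Kernel PCA    | v                  | v                    | v                 | x                   |
| MDS           | v                  | v                    | v                 | x                   |
| Isomap        | v                  | v                    | v                 | x                   |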

Note: since none of the proposed techniques can deal with categorical variables, categorical variables must first be transformed into numerical ones. I recommend the One Hot Encoding approach (sklearn.preprocessing.OneHotEncoder) when the categorical variable takes on a small number of distinct values (that is, its unique values are few). For more information check out this Kaggle article.
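A minimal sketch of the encoding step mentioned in the note, using scikit-learn's OneHotEncoder on a toy dataset. The data and the variable names (sizes, colors) are invented for illustration; the resulting matrix X is what you would then pass to one of the reduction methods.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data: one numerical column and one categorical column.
sizes = np.array([[1.0], [2.5], [0.7], [3.1]])
colors = np.array([["red"], ["green"], ["red"], ["blue"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
colors_encoded = encoder.fit_transform(colors).toarray()

# Stack the numerical and the encoded columns into one all-numerical matrix.
X = np.hstack([sizes, colors_encoded])

print(encoder.categories_[0])
print(X.shape)
```

Each distinct category becomes one extra column, which is why this approach suits variables with few unique values.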

t-distributed Stochastic Neighbor Embedding (t-SNE)

Description

t-SNE is a nonlinear, unsupervised dimensionality reduction method for visualizing high-dimensional data by giving each data point a location in a low-dimensional space, typically of two or three dimensions. It reduces dimensionality while trying to keep similar instances close and dissimilar instances apart, so clusters present in the data tend to remain visible in the embedding. As a result, much of the original information is retained in the final output. On the other hand, the parameterization of this method is not easy and affects the final result, so a good understanding of the t-SNE parameters is necessary.

In this package, the input parameters are left at their defaults, which is usually good enough for a first exploration of the data; for finer tuning of the input parameters I recommend reading the article How to Use t-SNE Effectively [2].
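To get a feeling for how much the parameters matter, the sketch below runs scikit-learn's TSNE directly with two perplexity values. It assumes scikit-learn is installed and does not go through this package; whether the package's tsne method accepts a perplexity argument is not documented here.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

X = datasets.load_iris().data

embeddings = {}
for perplexity in (5, 30):
    # random_state makes the stochastic optimization repeatable.
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```

Plotting the two embeddings side by side usually shows visibly different cluster shapes for the same data.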

t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research, natural language processing, music analysis, cancer research, bioinformatics, geological domain interpretation, and biomedical signal processing [Wikipedia].

Note: the t-SNE output is just a projection of the multi-dimensional input space into a lower dimensional space. Thus, the output components are a mixture of the input features, and the physical meaning and the measurement units of the input features are lost.

References

  1. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).
  2. Wattenberg, M., Viégas, F., & Johnson, I. (2016). How to use t-SNE effectively. Distill, 1(10), e2.

Examples

from dimensionality_reduction.dimensionality_reduction import DimensionalityReduction
from sklearn import datasets

dr = DimensionalityReduction()

iris_dataset = datasets.load_iris()

X = iris_dataset.data[:, :3]
y = iris_dataset.target

df = dr.tsne(X, y, n_components=2)

Linear Discriminant Analysis (LDA)

Description

LDA is a method used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes of objects or events. This dimensionality reduction technique is supervised, so you need to feed the algorithm with both the feature matrix and the target column. The dimensionality of the result is at most n_classes - 1 [Wikipedia].

LDA explicitly attempts to model the difference between the classes of data.

LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis, but at the moment this method is not implemented in the package.
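The limit on the number of output dimensions can be checked with scikit-learn's LinearDiscriminantAnalysis. This is a sketch that calls scikit-learn directly, not this package: the iris dataset has three classes, so the reduced data has at most two columns.

```python
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X, y = iris.data, iris.target

n_classes = len(set(y))
# LDA is supervised: the target y is required to fit the projection.
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
X_reduced = lda.fit_transform(X, y)

print(n_classes, X_reduced.shape)
```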

Examples

from dimensionality_reduction.dimensionality_reduction import DimensionalityReduction
from sklearn import datasets

dr = DimensionalityReduction()

iris_dataset = datasets.load_iris()

X = iris_dataset.data[:, :3]
y = iris_dataset.target

df = dr.lda(X, y)

Uniform Manifold Approximation and Projection (UMAP)

Description

UMAP is a nonlinear, unsupervised dimensionality reduction technique and an effective way of visualizing groups of data points and their relative proximities. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance [1]. To read more about the math behind this dimensionality reduction method I recommend this page: How UMAP Works.

References

  1. McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

Examples

from dimensionality_reduction.dimensionality_reduction import DimensionalityReduction
from sklearn import datasets

dr = DimensionalityReduction()

iris_dataset = datasets.load_iris()

X = iris_dataset.data[:, :3]
y = iris_dataset.target

df = dr.umap(X, y, n_components=2)

Principal Component Analysis (PCA)

Description

PCA is a statistical method based on the eigen-decomposition of the covariance matrix of the data. This process remaps the input data into a lower dimensional space, where each new dimension is a linear combination of the original features; the linear combination is chosen to maximize the variance of the data in the new space. There are as many principal components as features in the dataset, but each principal component retains a different share of the original information, ordered so that the first principal components retain the majority of it [Wikipedia]. PCA can be considered an unsupervised linear dimensionality reduction technique.
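The eigen-decomposition view of PCA can be sketched with NumPy alone. The toy data and variable names below are invented for illustration, and this is not the code the package runs internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 3 features; feature 1 is strongly tied to feature 0.
X = rng.normal(size=(200, 3))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]

# Center the data, then compute the covariance matrix of the features.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh suits symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of the total variance carried by each principal component.
explained_ratio = eigenvalues / eigenvalues.sum()

# Project onto the first two principal components.
X_reduced = X_centered @ eigenvectors[:, :2]

print(explained_ratio)
print(X_reduced.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue orders the components by retained information.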

Examples

from dimensionality_reduction.dimensionality_reduction import DimensionalityReduction
from sklearn import datasets

dr = DimensionalityReduction()

iris_dataset = datasets.load_iris()

X = iris_dataset.data[:, :3]
y = iris_dataset.target

df = dr.pca(X, y, n_components=2)

Factor Analysis (FA)

Description

The Factor Analysis description is coming soon.

Examples

from dimensionality_reduction.dimensionality_reduction import DimensionalityReduction
from sklearn import datasets

dr = DimensionalityReduction()

iris_dataset = datasets.load_iris()

X = iris_dataset.data[:, :3]
y = iris_dataset.target

df = dr.factor_analysis(X, y, n_components=2)

Truncated Singular Value Decomposition (Truncated SVD)

Description

The Truncated SVD description is coming soon.

Examples

from dimensionality_reduction.dimensionality_reduction import DimensionalityReduction
from sklearn import datasets

dr = DimensionalityReduction()

iris_dataset = datasets.load_iris()

X = iris_dataset.data[:, :3]
y = iris_dataset.target

df = dr.truncated_svd(X, y, n_components=2)

Kernel Principal Component Analysis (Kernel PCA)

Description

The Kernel PCA description is coming soon.

Examples

from dimensionality_reduction.dimensionality_reduction import DimensionalityReduction
from sklearn import datasets

dr = DimensionalityReduction()

iris_dataset = datasets.load_iris()

X = iris_dataset.data[:, :3]
y = iris_dataset.target

df = dr.kernel_pca(X, y, n_components=2)

Multidimensional Scaling (MDS)

Description

The Multidimensional Scaling description is coming soon.

Examples

from dimensionality_reduction.dimensionality_reduction import DimensionalityReduction
from sklearn import datasets

dr = DimensionalityReduction()

iris_dataset = datasets.load_iris()

X = iris_dataset.data[:, :3]
y = iris_dataset.target

df = dr.multidim_scaling(X, y, n_components=2)

Isometric Mapping (Isomap)

Description

The Isomap description is coming soon.

Examples

from dimensionality_reduction.dimensionality_reduction import DimensionalityReduction
from sklearn import datasets

dr = DimensionalityReduction()

iris_dataset = datasets.load_iris()

X = iris_dataset.data[:, :3]
y = iris_dataset.target

df = dr.isomap(X, y, n_components=2)

Installation

To install from the source code, type one of these commands into your terminal window:

pip install git+<repository-link>

or

python -m pip install git+<repository-link>

or

python3 -m pip install git+<repository-link>
