
pydiodon

What it is

A numpy library for linear dimension reduction, part of the diodon project. Plotting of the results is provided with matplotlib.

Companions

This library comes with several companions, provided when installing pydiodon, plus a companion GitLab repository:

  • the online documentation, available at https://diodon.gitlabpages.inria.fr/pydiodon/
  • a directory datasets with some datasets which are used for tutorials
  • a directory jupyter with jupyter notebooks, as tutorials (currently for running PCA, CoA, MDS), see file jupyter.md
  • a directory demos with python programs which can be used as demonstrations or tutorials (currently for PCA, CoA, MDS)
  • a DSL (Domain Specific Language) written in Python, called diosh.py, which makes the use of pydiodon friendly (no line of Python code to write); its documentation is available in file diosh.md. What it can do is shown in a gallery
  • the presentation of the methods, from linear algebra to pseudocodes, available at https://arxiv.org/abs/2209.13597

The methods are also available in C++ for very large datasets, with distributed memory and task-based programming, in the GitLab repository https://gitlab.inria.fr/diodon/cppdiodon

Overview

The library provides functions to run the most common linear dimension reduction methods, currently:

  • PCA (Principal Component Analysis)
  • CoA (Correspondence Analysis)
  • MDS (Multidimensional Scaling)

These three can be considered part of the release.
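
All three methods boil down to the singular value decomposition (SVD) of a suitably transformed matrix. As a rough plain-numpy illustration (a sketch only, not pydiodon's API or exact conventions), PCA on a column-centered matrix can be written as:

```python
import numpy as np

def pca_sketch(A, k=2):
    """Toy PCA via the SVD of the column-centered matrix (illustration only)."""
    X = A - A.mean(axis=0)                # center each column (variable)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Y = U[:, :k] * s[:k]                  # principal components (scores)
    L = s**2 / (A.shape[0] - 1)           # eigenvalues of the covariance matrix
    V = Vt[:k].T                          # new axes (loadings)
    return Y, L, V

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
Y, L, V = pca_sketch(A, k=2)
print(Y.shape, V.shape)                   # (100, 2) (50, 2)
```

The variance of the i-th component equals the i-th eigenvalue; pydiodon's `dio.pca` returns an analogous triple `Y, L, V`.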

Other methods have been coded too (sometimes only partially), but tests are ongoing and the results are not guaranteed:

  • PCA-IV (PCA with Instrumental Variables, equivalent to PLS)
  • PCAmet (PCA with metrics on spaces spanned by the rows and the columns)

while others are currently under development for a further release:

  • CCA (Canonical Correlation Analysis)
  • MCoA (Multiple Correspondence Analysis)
  • MCCA (Multiple Canonical Correspondence Analysis)

Finally, a few tools are available (like plotting or computing indices) to facilitate the interpretation of the results:

  • plotting components for PCA or MDS, and simultaneous plotting of row and column components for CoA
  • plotting the old variables in the space spanned by the new axes, for PCA
  • plotting the eigenvalues for PCA
  • plotting the quality of projection of each item on each new axis, and the cumulated values, for PCA and MDS.
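
As a hedged sketch of what such a quality index typically computes (squared cosines in plain numpy; pydiodon's own `dio.quality` function may differ in details):

```python
import numpy as np

def quality_sketch(Y):
    """Per-axis and cumulated quality of projection of each item.

    Squared-cosine style index: the share of each item's squared norm
    carried by each new axis (illustration, not pydiodon's exact code).
    """
    sq = Y**2
    qual_axis = sq / sq.sum(axis=1, keepdims=True)   # per-axis share; each row sums to 1
    qual_cum = np.cumsum(qual_axis, axis=1)          # cumulated over the first axes
    return qual_axis, qual_cum

Y = np.array([[3.0, 4.0], [1.0, 0.0]])
qa, qc = quality_sketch(Y)
print(qa[0])   # [0.36 0.64]
```

An item perfectly represented by the first r axes has a cumulated quality of 1 at rank r.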

Install

The installation procedure is given for Linux Ubuntu 20 and above. The user must have pip (version pip3) on his/her computer, as the command pip install [...] will be used.

Diodon is written in Python 3.10. The user must have Python 3.8 or later on his/her computer. The following Python scientific libraries will be used:

  • numpy
  • scipy

and, for plots:

  • matplotlib.pyplot

Installation of pydiodon is fairly simple. The required dependencies will be installed when installing pydiodon as follows (to install the latest version of pydiodon from the git repository):

git clone https://gitlab.inria.fr/diodon/pydiodon.git
cd pydiodon
pip install .

Checking the installation

To check that the installation has been successful, open a terminal and type

# call python3 interpreter
python3

Then you have access to the python interactive console, where you can type

# import pydiodon as in
>>> import pydiodon as dio

The following information should be displayed

loading pydiodon - version 23.05.04

Online sphinx documentation

pydiodon has an online Sphinx documentation (per function) accessible at

https://diodon.gitlabpages.inria.fr/pydiodon/

To get started

Here is a simple toy example of Principal Component Analysis on a small random matrix.

First, create a toy matrix:

# importing library
>>> import numpy as np # for creating the random matrix
>>> import pydiodon as dio
# creating a random matrix
>>> m = 100
>>> n = 50
>>> A = np.random.randn(m,n)

Then, the diodon command to perform PCA:

# running PCA
>>> Y, L, V = dio.pca(A)

# this is the command with default values; see the documentation for more options

Followed by a few functions for plotting the results:

# plotting the results
>>> dio.plot_components_scatter(Y, dot_size=5, title="Principal components")
>>> dio.plot_var(V, varnames=None)
# and the quality of the results
>>> dio.plot_eig(L, frac=True, cum=True, dot_size=20, title = "cumulated eigenvalues")
>>> Qual_axis, Qual_cum = dio.quality(Y)
>>> dio.plot_components_quality(Y, Qual_cum, r=2)

Why another library for Linear Dimension Reduction?

There exist several excellent libraries for PCA and related methods, especially in R, and some methods are available in scikit-learn in Python (see https://scikit-learn.org/stable/modules/decomposition.html#decompositions).

A specific effort has been made on efficiency when analysing large datasets, which motivates the development and dissemination of the Diodon library. The limiting factors are currently:

  • the time for I/O
  • the available RAM

and not the calculation time. The effort has focused on computing the SVD of a given matrix, which is a key step providing the results for any method.

Progress in efficiency has been obtained through three choices, available when useful:

  • use of random projection methods for computing the SVD of a large matrix
  • binding numpy function calls to code written in C++ with xxxx
  • task-based programming with Chameleon (for MDS only, on HPC architectures with distributed memory)

Using random projection methods is not new here; see e.g. https://scikit-learn.org/stable/modules/random_projection.html in scikit-learn. In diodon, only Gaussian random projection has been implemented.
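
To illustrate the idea, here is a minimal Halko-style randomized SVD with a Gaussian test matrix, in plain numpy (a sketch under these assumptions, not diodon's or cppdiodon's implementation):

```python
import numpy as np

def rsvd_gaussian(A, k, oversample=10, seed=None):
    """Randomized SVD via Gaussian random projection (illustration only)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal basis of the sampled range
    B = Q.T @ A                                       # small (k+oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))  # rank-20 matrix
U, s, Vt = rsvd_gaussian(A, k=20, seed=1)
print(np.allclose((U * s) @ Vt, A))  # True: rank 20 fits within the sketch size
```

Only the small matrix B is decomposed exactly; the dominant cost is the projection A @ Omega, which is why the approach pays off on large matrices.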

For a presentation of the methods, see

  • A. Franc. Linear Dimensionality Reduction, Lecture Notes in Statistics, vol. 228, Springer, 2025.

For the connection between MDS and rSVD, see

  • P. Blanchard, P. Chaumeil, J.-M. Frigerio, F. Rimet, F. Salin, S. Thérond, O. Coulaud, and A. Franc. A geometric view of biodiversity: scaling to metagenomics. Research Report RR-9144, INRIA; INRA, January 2018.

For the development of this approach with task-based programming, distributed memory and Chameleon, see

  • E. Agullo, O. Coulaud, A. Denis, M. Faverge, A. Franc, J.-M. Frigerio, N. Furmento, A. Guilbaud, E. Jeannot, R. Peressoni, F. Pruvost, and S. Thibault. Task-based randomized singular value decomposition and multidimensional scaling. Research Report RR-9482, Inria Bordeaux - Sud-Ouest; Inrae - BioGeCo, September 2022.

Datasets for tutorials

Some small data sets are available for tutorials, demos or Jupyter notebooks for PCA, CoA and MDS.

See the documentation in file datasets.md

ID card

authors: Alain Franc & Jean-Marc Frigerio

contributors:

  • Olivier Coulaud
  • Violaine Louvet
  • Romain Peressoni
  • Florent Pruvost

maintainer and contact: Alain Franc

mail: alain.franc@inria.fr

started: 21/02/17 version: 25.04.04 release: 0.1.0

licence: GPL-3
