pydiodon

What it is

A numpy-based Python library for linear dimension reduction, part of the diodon project.

Overview

The library provides functions for the most common linear dimension reduction methods, such as

  • PCA (Principal Component Analysis)
  • CoA (Correspondence Analysis)
  • MDS (Multidimensional Scaling)

These three methods can be considered part of the release.

Other methods have been coded too, but tests are ongoing and the results are not guaranteed:

  • PCA-IV (PCA with Instrumental Variables, equivalent to PLS)
  • PCAmet (PCA with metrics on spaces spanned by the rows and the columns)
  • Can (Canonical Analysis)
  • MCoA (Multiple Correspondence Analysis)
  • MCan (Multiple Canonical Analysis)

Finally, a few tools are available (such as plotting or computing indices) to facilitate the interpretation of the results.

Install

The installation procedure is given for Linux Ubuntu 20 and up.

Diodon is written in Python 3.8; this version or a later one must be present on the computer. The following Python libraries must be available (time, os and sys ship with Python's standard library; the others must be installed):

  • time
  • os
  • sys
  • h5py
  • numpy
  • scipy
  • matplotlib.pyplot
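
Before installing, it can help to check which of the modules listed above are already importable. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical and is not part of pydiodon.

```python
import importlib.util

# Module names taken from the list above.
REQUIRED = ["time", "os", "sys", "h5py", "numpy", "scipy", "matplotlib.pyplot"]

def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # ImportError is raised when the parent package of a dotted
            # name (e.g. matplotlib for matplotlib.pyplot) is absent.
            found = False
        if not found:
            missing.append(name)
    return missing

print("Missing modules:", missing_modules(REQUIRED) or "none")
```

Any name reported as missing would have to be installed before pydiodon can be used.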

To install pydiodon, the user must have a directory named diodon somewhere on their computer. Installation proceeds along the following steps:

# create directory diodon
[...]$mkdir diodon

# go into this directory,
$cd diodon

# clone pydiodon
$git clone git@gitlab.inria.fr:diodon/pydiodon.git

# go into pydiodon subdirectory
$cd pydiodon

# install pydiodon with a setup.py
$sudo python3 setup.py install

To get started

Here is a simple toy example of Principal Component Analysis on a small random matrix.

First, create a toy matrix:

# importing library
>>> import numpy as np # for creating the random matrix
>>> import pydiodon as dio
# creating a random matrix
>>> m = 10
>>> n = 5
>>> A = np.random.randn(m,n)

Then, the diodon command to perform PCA:

# running PCA
>>> Y, L, V = dio.pca(A, pretreatment="standard", k=-1, meth="svd")

Then, a few functions plot the results:

# plotting the results
>>> dio.plot_eig(L, frac=True, cum=True, dot_size=20, title = "cumulated eigenvalues")
>>> dio.plot_components_scatter(Y, dot_size=5, title="Principal components")
>>> dio.plot_var(V, varnames=None)
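
For readers who want to see what a PCA with standardisation computed through an SVD amounts to, here is a minimal numpy sketch of the textbook computation. It is not pydiodon's implementation: the function name pca_svd_sketch is hypothetical, and the conventions used here (eigenvalues taken as squared singular values, no division by the number of rows) are assumptions that may differ from what dio.pca returns.

```python
import numpy as np

def pca_svd_sketch(A, k=None):
    """Textbook PCA of a standardised matrix via SVD (illustrative only)."""
    A = np.asarray(A, dtype=float)
    # Standardise: centre each column, then scale it to unit variance.
    Z = (A - A.mean(axis=0)) / A.std(axis=0)
    # Thin SVD: Z = U @ diag(S) @ Vt, singular values in decreasing order.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    if k is None:
        k = S.size
    Y = U[:, :k] * S[:k]   # principal components (coordinates of the rows)
    L = S[:k] ** 2         # eigenvalues of Z.T @ Z (assumed convention)
    V = Vt[:k].T           # principal axes, one per column
    return Y, L, V

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 5))
Y, L, V = pca_svd_sketch(A)

# Cumulated fraction of variance carried by the first axes.
cum_frac = np.cumsum(L) / L.sum()
print(cum_frac)
```

The cumulated fraction is the kind of quantity a plot of cumulated eigenvalues displays; the last value is 1 by construction.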

Why another library for Linear Dimension Reduction?

There exist several excellent libraries for PCA and related methods, especially in R, as well as some methods in scikit-learn in Python (see https://scikit-learn.org/stable/modules/decomposition.html#decompositions).

A specific effort has been made for efficiency when analysing large datasets, which motivates the development and dissemination of the Diodon library. The limiting factors are currently:

  • the time for I/O
  • the available RAM

and not the calculation time. The effort has focused on computing the SVD of a given matrix, which is a key step providing the results for any method.

Gains in efficiency have been obtained through three choices, available when useful:

  • use random projection methods for computing the SVD of a large matrix
  • bind numpy function calls to code written in C++ with xxxx
  • task-based programming with Chameleon (for MDS only, on HPC architectures with distributed memory)

Using random projection methods is not new. See e.g. https://scikit-learn.org/stable/modules/random_projection.html in scikit-learn. In diodon, only Gaussian random projection has been implemented.
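
To make the idea concrete, here is a minimal numpy sketch of a randomized SVD based on a Gaussian random projection. It follows the standard scheme (project, orthonormalise, decompose the small matrix) and is not taken from diodon's code; the function name gaussian_rsvd and the oversample parameter are assumptions made for illustration.

```python
import numpy as np

def gaussian_rsvd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD of A using a Gaussian random projection."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(n, k + oversample)           # size of the random sketch
    Omega = rng.standard_normal((n, ell))  # Gaussian test matrix
    Y = A @ Omega                          # sample the range of A
    Q, _ = np.linalg.qr(Y)                 # orthonormal basis of the sample
    B = Q.T @ A                            # small (ell x n) matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                             # lift back to the original space
    return U[:, :k], S[:k], Vt[:k, :]

# Toy check on a matrix of exact rank 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
U, S, Vt = gaussian_rsvd(A, k=5)
rel_err = np.linalg.norm(A - (U * S) @ Vt) / np.linalg.norm(A)
print(rel_err)
```

The full SVD is only computed on the small matrix B, which is where the saving in time and memory comes from when A is large and k is small.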

For the connection between MDS and rSVD, see

  • P. Blanchard, P. Chaumeil, J.-M. Frigerio, F. Rimet, F. Salin, S. Thérond, O. Coulaud, and A. Franc. A geometric view of Biodiversity: scaling to metagenomics. Research Report RR-9144, INRIA ; INRA, January 2018.

For the development of this approach with task-based programming, distributed memory and Chameleon, see

  • E. Agullo, O. Coulaud, A. Denis, M. Faverge, A. Franc, J.-M. Frigerio, N. Furmento, A. Guilbaud, E. Jeannot, R. Peressoni, F. Pruvost, and S. Thibault. Task-based randomized singular value decomposition and multidimensional scaling. Research Report RR-9482, Inria Bordeaux - Sud Ouest ; Inrae - BioGeCo, September 2022.

Datasets for tutorials

Three small datasets are available for learning how to use the library:

  • diatoms_sweden: An array species x environment for diatoms in Scandinavia for PCA
  • example_coa: An example from the book Lebart, Morineau & Fénelon, 1982, for CoA
  • guiana_trees: A dissimilarity array between barcodes of Amazonian trees in French Guiana for MDS.

These datasets are available in a dedicated git repository named data4tests. To get it, clone it at the same level as pydiodon, i.e. in the diodon directory. The procedure is as follows:

# go into directory diodon
$cd [..]/diodon

# clone the git
$git clone git@gitlab.inria.fr:diodon/data4tests.git

Then, datasets are in directory .../diodon/data4tests.

To load a dataset, for example the one for PCA, work from the directory pydiodon/jupyter (the reason is given below) and simply type:

>>> import pydiodon as dio
>>> A, rn, cn = dio.load_dataset("diatoms_sweden")

The array is then in A, and the row names and column names are in rn and cn respectively. Here is a simple example for MDS:

>>> import pydiodon as dio
>>> A, rn, cn = dio.load_dataset("guiana_trees")
>>> X, S = dio.mds(A)
>>> dio.plot_components_scatter(X)
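
As background for the call above, here is a minimal numpy sketch of classical MDS: double centring of the squared dissimilarities followed by an eigendecomposition. It is a textbook version written for illustration, not pydiodon's implementation; the function name classical_mds and its return values are assumptions and need not match those of dio.mds.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS of a dissimilarity matrix D, keeping k axes."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    G = -0.5 * J @ (D ** 2) @ J              # Gram matrix by double centring
    w, U = np.linalg.eigh(G)                 # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:k]          # keep the k largest
    w, U = w[order], U[:, order]
    X = U * np.sqrt(np.clip(w, 0.0, None))   # coordinates of the points
    return X, w

# Toy check: distances between 2D points are recovered exactly.
rng = np.random.default_rng(2)
P = rng.standard_normal((6, 2))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
X, w = classical_mds(D, k=2)
D_back = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.allclose(D, D_back))
```

The recovered coordinates X match the original points only up to a rotation, reflection and translation, which is why the check compares distances rather than coordinates.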

A dataset can be loaded from any directory, provided the path to diodon/data4tests is given as the second argument of load_dataset(). By default, this argument is set to the path from the directory holding the Jupyter notebooks, so that any notebook can load a dataset easily. Suppose the user has created a directory diodon/pydiodon/myproject and is working in myproject: load_dataset() then works with the default setting. Suppose instead that the user has their own directory diodon/myprojects/thisproject. Then a dataset is loaded with:

>>> import pydiodon as dio
>>> A, rn, cn = dio.load_dataset("guiana_trees", datadir="../../data4tests/")

Do not forget the "/" at the end of the name of the directory.
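
If the relative path is hard to work out by hand, it can be computed. The sketch below uses only the standard library and assumes a Linux system and the directory layout described above; the helper name datadir_from is hypothetical and is not part of pydiodon.

```python
import os

def datadir_from(project_dir, data_dir="diodon/data4tests"):
    """Relative path from project_dir to data_dir, with a trailing '/'."""
    rel = os.path.relpath(data_dir, start=project_dir)
    return rel.rstrip("/") + "/"

print(datadir_from("diodon/myprojects/thisproject"))  # → ../../data4tests/
```

The returned string can then be passed as the datadir argument of load_dataset().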

Tutorials with Jupyter notebooks

Several notebooks are available as tutorials, one per method, as follows:

Notebook          Method   ipynb   html
pca_with_diodon   PCA      yes     yes
coa_with_diodon   CoA      yes     yes
mds_with_diodon   MDS      yes     yes

The notebooks are given in two formats:

  • ipynb, where they are interactive
  • html, where they are frozen

For those who wish to play with the interactive ipynb version, it is advised to:

  • create a directory not tracked by git and copy the notebooks into it, as
# be in $pydiodon
$mkdir my_notebooks

# copy the notebooks
$cd my_notebooks
$cp ../jupyter/*.ipynb .
  • change the names, parameters and values in this directory only. Indeed, the diodon team may update a Jupyter notebook and push the change to the git repository. If the user has changed the same notebook, there will be a version conflict.

Documentation

The library is documented with Sphinx. The HTML file is available at

ID card

maintainer: Alain Franc

mail: alain.franc@inrae.fr

contributors:

  • Olivier Coulaud
  • Alain Franc
  • Jean-Marc Frigerio
  • Romain Peressoni
  • Florent Pruvost

started: 21/02/17

version: 22.11.07

release: ongoing

licence: GPL-3
