pydiodon
What it is
A NumPy library for linear dimension reduction, part of the diodon project.
Overview
The library provides functions for the most common linear dimension reduction methods:
- PCA (Principal Component Analysis)
- CoA (Correspondence Analysis)
- MDS (Multidimensional Scaling)
These three are considered part of the release.
Other methods have been coded too, but tests are ongoing and their results are not guaranteed:
- PCA-IV (PCA with Instrumental Variables, equivalent to PLS)
- PCAmet (PCA with metrics on spaces spanned by the rows and the columns)
- Can (Canonical Analysis)
- MCoA (Multiple Correspondence Analysis)
- MCan (Multiple Canonical Analysis)
Finally, a few tools (such as plotting or computing indices) are available to facilitate the interpretation of the results.
Install
The installation procedure is given for Linux Ubuntu 20 and up.
Diodon is written in Python 3.8; that version or a later one must be present on the computer. The following Python libraries must be available (time, os and sys ship with the Python standard library; the others must be installed):
- time
- os
- sys
- h5py
- numpy
- scipy
- matplotlib.pyplot
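Before installing, it can help to check which of these libraries the current Python interpreter can find. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and is not part of pydiodon. It checks `matplotlib` rather than `matplotlib.pyplot`, since pyplot is a submodule of matplotlib.

```python
import importlib.util

# Libraries listed above (matplotlib.pyplot is provided by matplotlib)
REQUIRED = ["time", "os", "sys", "h5py", "numpy", "scipy", "matplotlib"]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec may raise for broken or partially installed packages
            found = False
        if not found:
            missing.append(name)
    return missing


print("missing libraries:", missing_modules(REQUIRED))
```

An empty list means all the prerequisites are visible to this interpreter.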
To install pydiodon, the user must have a directory named diodon somewhere on their computer. Installation follows these steps:
# create directory diodon
[...]$mkdir diodon
# go into this directory,
$cd diodon
# clone pydiodon
$git clone git@gitlab.inria.fr:diodon/pydiodon.git
# go into pydiodon subdirectory
$cd pydiodon
# install pydiodon with a setup.py
$sudo python3 setup.py install
To get started
Here is a simple toy example of Principal Component Analysis on a small random matrix.
First, create a toy matrix:
# importing library
>>> import numpy as np # for creating the random matrix
>>> import pydiodon as dio
# creating a random matrix
>>> m = 10
>>> n = 5
>>> A = np.random.randn(m,n)
Then, the diodon command to perform PCA:
# running PCA
>>> Y, L, V = dio.pca(A, pretreatment="standard", k=-1, meth="svd")
Followed by a few functions for plotting the results:
# plotting the results
>>> dio.plot_eig(L, frac=True, cum=True, dot_size=20, title = "cumulated eigenvalues")
>>> dio.plot_components_scatter(Y, dot_size=5, title="Principal components")
>>> dio.plot_var(V, varnames=None)
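To make the three returned arrays more concrete, here is a NumPy-only sketch of PCA computed through the SVD. It is not pydiodon's implementation. It assumes that pretreatment="standard" means centering each column and scaling it to unit variance, that Y holds the principal components, L the eigenvalues and V the principal axes; the normalisation of the eigenvalues used by pydiodon may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 5))

# Assumed meaning of pretreatment="standard":
# centre each column, then scale it to unit variance
Z = (A - A.mean(axis=0)) / A.std(axis=0)

# Thin SVD: Z = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

V = Vt.T      # principal axes, one per column
Y = U * s     # principal components, equal to Z @ V
L = s ** 2    # eigenvalues of Z.T @ Z (not divided by the number of rows)

print(np.allclose(Y, Z @ V))   # components are the projections on the axes
print(np.cumsum(L) / L.sum())  # cumulated fraction of inertia per axis
```

The cumulated fractions printed at the end are the kind of quantity a call such as plot_eig(L, frac=True, cum=True) would be expected to display.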
Why another library for Linear Dimension Reduction?
There exist several excellent libraries for PCA and related methods, especially in R, and some methods are available in scikit-learn in Python (see https://scikit-learn.org/stable/modules/decomposition.html#decompositions).
A specific effort has been made for efficiency when analysing large datasets, and it motivates the development and dissemination of the Diodon library. The limiting factors are currently:
- the time for I/O
- the available RAM
rather than the calculation time. The effort has focused on computing the SVD of a given matrix, which is a key step providing the results for any method.
Efficiency gains have been obtained through three choices, available when useful:
- use random projection methods for computing the SVD of a large matrix
- bind NumPy function calls to code written in C++ with xxxx
- task-based programming with Chameleon (for MDS only, on HPC architectures with distributed memory)
Using random projection methods is not new. See e.g. https://scikit-learn.org/stable/modules/random_projection.html in scikit-learn. In diodon, only Gaussian random projection has been implemented.
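The idea of an SVD computed through Gaussian random projection can be sketched in a few lines of NumPy. This is a generic textbook version and not the code used in diodon; the function name `randomized_svd` and the parameters `k`, `oversample` and `seed` are hypothetical names chosen for the illustration.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD of A through a Gaussian random projection."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    ell = min(n, k + oversample)           # size of the random sketch
    Omega = rng.standard_normal((n, ell))  # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)         # orthonormal basis of the sampled range
    B = Q.T @ A                            # small (ell x n) projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                             # lift left singular vectors back
    return U[:, :k], s[:k], Vt[:k]


# Test matrix of exact rank 5, so a rank-5 approximation should be exact
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))

U, s, Vt = randomized_svd(A, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
print(np.allclose(s, s_exact))
```

Only the product A @ Omega and the small matrix B involve the full data, which is why this approach helps when RAM rather than computing time is the limiting factor.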
For the connection between MDS and rSVD, see
- P. Blanchard, P. Chaumeil, J.-M. Frigerio, F. Rimet, F. Salin, S. Thérond, O. Coulaud, and A. Franc. A geometric view of Biodiversity: scaling to metagenomics. Research Report RR-9144, INRIA ; INRA, January 2018.
For the development of this approach with task-based programming, distributed memory and Chameleon, see
- E. Agullo, O. Coulaud, A. Denis, M. Faverge, A. Franc, J.-M. Frigerio, N. Furmento, A. Guilbaud, E. Jeannot, R. Peressoni, F. Pruvost, and S. Thibault. Task-based randomized singular value decomposition and multidimensional scaling. Research Report RR-9482, Inria Bordeaux - Sud Ouest ; Inrae - BioGeCo, September 2022.
Datasets for tutorials
Three small datasets are available for learning how to use the library:
- diatoms_sweden: An array species x environment for diatoms in Scandinavia for PCA
- example_coa: An example from the book Lebart, Morineau & Fénelon, 1982, for CoA
- guiana_trees: A dissimilarity array between barcodes of Amazonian trees in French Guiana for MDS.
These datasets are available in a dedicated git repository named data4tests. To get it, clone it at the same level as pydiodon, i.e. in the diodon directory. The procedure is as follows:
# go into directory diodon
$cd [..]/diodon
# clone the git
$git clone git@gitlab.inria.fr:diodon/data4tests.git
Then, datasets are in directory .../diodon/data4tests.
To load a dataset, for example the dataset for PCA, be in directory pydiodon/jupyter (see why below), and simply type:
>>> import pydiodon as dio
>>> A, rn, cn = dio.load_dataset("diatoms_sweden")
Then the array is in A, and the row names and column names are in rn and cn respectively. Here is a simple example for MDS:
>>> import pydiodon as dio
>>> A, rn, cn = dio.load_dataset("guiana_trees")
>>> X, S = dio.mds(A)
>>> dio.plot_components_scatter(X)
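For readers unfamiliar with MDS, the following NumPy sketch shows classical MDS on a small synthetic distance matrix. It is a generic version under stated assumptions, not pydiodon's code: the function name `classical_mds` is hypothetical, the input is assumed to hold plain (not squared) distances, and the second returned value is taken here to be the eigenvalues of the Gram matrix, which may not match what dio.mds returns as S.

```python
import numpy as np


def classical_mds(D, k=2):
    """Classical MDS: coordinates in k dimensions from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    G = -0.5 * J @ (D ** 2) @ J           # Gram matrix by double centering
    w, V = np.linalg.eigh(G)              # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:k]       # keep the k largest
    w, V = w[order], V[:, order]
    X = V * np.sqrt(np.maximum(w, 0.0))   # clip tiny negative eigenvalues
    return X, w


# Distances between 6 random points in the plane
rng = np.random.default_rng(0)
P = rng.standard_normal((6, 2))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)

X, w = classical_mds(D, k=2)
D_rec = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.allclose(D, D_rec))  # distances are recovered from the coordinates
```

Because the points were drawn in the plane, two dimensions are enough to reproduce the distances; with real dissimilarities such as barcode distances, the leading axes only approximate them.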
A dataset can be loaded from any directory, provided the path to diodon/data4tests is given as the second argument of load_dataset(). By default, this argument is set to the path from the directory holding the Jupyter notebooks, so that datasets load easily from any notebook. Suppose the user has created a directory diodon/pydiodon/myproject and is working in myproject. Then load_dataset() works with the default setting. Suppose now that the user has their own directory diodon/myprojects/thisproject. Then a dataset is loaded with:
>>> import pydiodon as dio
>>> A, rn, cn = dio.load_dataset("guiana_trees", datadir="../../data4tests/")
Do not forget the "/" at the end of the name of the directory.
Tutorials with Jupyter notebooks
Several notebooks are available as tutorials, one per method:
| Notebook | method | ipynb | html |
|---|---|---|---|
| pca_with_diodon | PCA | yes | yes |
| coa_with_diodon | CoA | yes | yes |
| mds_with_diodon | MDS | yes | yes |
The notebooks are given in two formats:
- ipynb, where they are interactive
- html, where they are frozen
For those who wish to work with the interactive ipynb version, it is advised to:
- create a directory not tracked by git and copy the notebooks into it, as follows
# be in $pydiodon
$mkdir my_notebooks
# copy the notebooks
$cd my_notebooks
$cp ../jupyter/*.ipynb .
- change names, parameters and values in this directory only. Indeed, the diodon team may update a Jupyter notebook and push the change to the git repository. If the user has changed the same notebook, there will be a version conflict.
Documentation
The library is documented with Sphinx. The HTML documentation is available at
ID card
maintainer: Alain Franc
mail: alain.franc@inrae.fr
contributors:
- Olivier Coulaud
- Alain Franc
- Jean-Marc Frigerio
- Romain Peressoni
- Florent Pruvost
started: 21/02/17
version: 22.11.07
release: ongoing
licence: GPL-3