
ABC random forests for model choice and parameter estimation, python wrapper

Project description

ABC random forests for model choice and parameter estimation


Random forest methodologies for:

  • Model choice (Pudlo et al. 2015)
  • Parameter estimation (Raynal et al. 2018)

Libraries we use:

  • Ranger (Wright and Ziegler 2015)[^2]
  • Eigen3 (Guennebaud, Jacob, et al. 2010)

Note that we use our own implementations of LDA and PLS from (Friedman, Hastie, and Tibshirani 2001, 1:81, 114); the PLS is optimized for the univariate case, see 5.1. For linear algebra performance on large reftables, the Linux binaries (standalone and python wheel) are statically linked with Intel’s Math Kernel Library, in order to leverage multicore and SIMD extensions on modern CPUs.

There is one set of binaries, with a macOS/Linux/Windows (x64 only) binary for each platform. They are available in the “Releases” tab, under the “Assets” section (unfold it to see the list).

These are pure command-line binaries; there are no prerequisites or library dependencies to run them. Just download them and launch them from your terminal software of choice. The usual caveats with command-line executables apply here: if you’re not proficient with the command-line interface of your platform, please learn some basics or ask someone who can help you with these matters.

The standalone binary is part of DIYABC-RF, a specialized population-genetics graphical interface presented in a Molecular Ecology Resources special issue (Collin et al. 2021).

Python

Installation

pip install pyabcranger
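
The module should then import under the package name (a quick sanity check):

python -c "import pyabcranger"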

Notebook examples

Usage

 - ABC Random Forest - Model choice or parameter estimation command line options
Usage:
  ../build/abcranger [OPTION...]

  -h, --header arg        Header file (default: headerRF.txt)
  -r, --reftable arg      Reftable file (default: reftableRF.bin)
  -b, --statobs arg       Statobs file (default: statobsRF.txt)
  -o, --output arg        Prefix output (modelchoice_out or estimparam_out by
                          default)
  -n, --nref arg          Number of samples, 0 means all (default: 0)
  -m, --minnodesize arg   Minimal node size. 0 means 1 for classification or
                          5 for regression (default: 0)
  -t, --ntree arg         Number of trees (default: 500)
  -j, --threads arg       Number of threads, 0 means all (default: 0)
  -s, --seed arg          Seed, generated by default (default: 0)
  -c, --noisecolumns arg  Number of noise columns (default: 5)
      --nolinear          Disable LDA for model choice or PLS for parameter
                          estimation
      --plsmaxvar arg     Percentage of maximum explained Y-variance for
                          retaining pls axis (default: 0.9)
      --chosenscen arg    Chosen scenario (mandatory for parameter
                          estimation)
      --noob arg          number of oob testing samples (mandatory for
                          parameter estimation)
      --parameter arg     name of the parameter of interest (mandatory for
                          parameter estimation)
  -g, --groups arg        Groups of models
      --help              Print help
  • If you provide --chosenscen, --parameter and --noob, parameter estimation mode is selected.
  • Otherwise by default it’s model choice mode.
  • Linear additions are LDA for model choice and PLS for parameter estimation; the --nolinear option disables them in both cases.
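
For scripted runs, the standalone can also be driven from Python. Below is a minimal sketch (not part of the package), assuming the abcranger binary is on your PATH and the default input files are in the current directory; only the documented options above are used:

import subprocess

# Model choice mode: none of --chosenscen/--parameter/--noob is given
result = subprocess.run(
    ["abcranger", "-t", "1000", "-j", "8"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)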

Model Choice

(Screencast: terminal model choice session)

Example


abcranger -t 10000 -j 8

Header, reftable and statobs files should be in the current directory.

Groups

With the option -g (or --groups), you may “group” your models into several groups. For example, if you have six models, labeled from 1 to 6: -g "1,2,3;4,5,6"

Generated files

Four files are created:

  • modelchoice_out.ooberror : OOB Error rate vs number of trees (line number is the number of trees)
  • modelchoice_out.importance : variable importance (sorted)
  • modelchoice_out.predictions : votes, prediction and posterior error rate
  • modelchoice_out.confusion : OOB Confusion matrix of the classifier
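
Since the line number in modelchoice_out.ooberror encodes the number of trees, a short illustrative Python snippet (assuming one error value per line, as described above) can plot the error decay:

import matplotlib.pyplot as plt

with open("modelchoice_out.ooberror") as f:
    oob = [float(line) for line in f if line.strip()]

plt.plot(range(1, len(oob) + 1), oob)  # line number = number of trees
plt.xlabel("Number of trees")
plt.ylabel("OOB error rate")
plt.show()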

Parameter Estimation

(Screencast: terminal parameter estimation session)

Composite parameters

When specifying the parameter (option --parameter), one may specify a simple composite parameter as a division, addition, or multiplication of two existing parameters, like t/N or T1+T2.
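
For example (flags as documented above; values illustrative):

abcranger -t 1000 -j 8 --chosenscen 1 --noob 50 --parameter t/N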

A note about PLS heuristic

The --plsmaxvar option (defaulting to 0.90) fixes the number of selected PLS axes so that we get at least the specified percentage of the maximum explained variance of the output. The explained variance of the output for the first m axes is defined by the R-squared of the output:

Yvar^{m} = \frac{\sum_{i=1}^{N}{(\hat{y}^{m}_{i}-\bar{y})^2}}{\sum_{i=1}^{N}{(y_{i}-\bar{y})^2}}

where \hat{y}^{m} is the output Y scored by the PLS with the first m components. So, only the first n_{comp} axes are kept, with:

n_{comp} = \underset{m \,:\, Yvar^{m} \leq 0.90 \, Yvar^{M}}{\operatorname{argmax}} \; m

Note that if you specify 0 as --plsmaxvar, an “elbow” heuristic is activated, where the following condition is tested for every computed axis:

\frac{Yvar^{k+1}+Yvar^{k}}{2} \geq 0.99(N-k)\left(Yvar^{k+1}-Yvar^{k}\right)

If this condition holds over a window of previous axes, sized to 10% of the total number of possible axes, then we stop the PLS axis computation.

In practice, we find this n_{heur} close enough to the n_{comp} obtained with the 99% threshold, but this isn’t guaranteed.
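
As an illustration only (the shipped implementation is the optimized C++ one), both selection rules read directly off the formulas above. In this sketch, yvar stands for the vector (Yvar^1, …, Yvar^M), and the function names and window logic are our reading of the description, not the library’s exact code:

import numpy as np

def ncomp_threshold(yvar, plsmaxvar=0.90):
    """Smallest m with Yvar^m >= plsmaxvar * Yvar^M (the --plsmaxvar rule)."""
    yvar = np.asarray(yvar, dtype=float)
    return int(np.argmax(yvar >= plsmaxvar * yvar[-1])) + 1

def ncomp_elbow(yvar):
    """'Elbow' rule (--plsmaxvar 0): stop once the condition above holds
    over a window sized to 10% of the total number of possible axes."""
    yvar = np.asarray(yvar, dtype=float)
    N = len(yvar)
    window, hits = max(1, N // 10), 0
    for k in range(N - 1):
        lhs = (yvar[k + 1] + yvar[k]) / 2
        rhs = 0.99 * (N - k) * (yvar[k + 1] - yvar[k])
        hits = hits + 1 if lhs >= rhs else 0  # count consecutive hits
        if hits >= window:
            return k + 1
    return N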

The meaning of the noob parameter

The median global/local statistics and confidence interval (global) measures for parameter estimation need a number of OOB samples (--noob) to be reliable (typically 30% of the dataset size is sufficient; e.g. --noob 3000 for a reftable of 10,000 samples). Be aware that computing the whole set (i.e. assigning --noob the same value as --nref) for weight predictions (Raynal et al. 2018) can be very costly, memory- and CPU-wise, if your dataset has a large number of samples, so it is advisable to compute them only for a subset of size noob.

Example (parameter estimation)

Example (working with the dataset in test/data):

abcranger -t 1000 -j 8 --parameter ra --chosenscen 1 --noob 50

Header, reftable and statobs files should be in the current directory.

Generated files (parameter estimation)

Five files (or seven if PLS is activated) are created:

  • estimparam_out.ooberror : OOB MSE rate vs number of trees (line number is the number of trees)
  • estimparam_out.importance : variable importance (sorted)
  • estimparam_out.predictions : expectation, variance and 0.05, 0.5, 0.95 quantiles of the prediction
  • estimparam_out.predweights : csv of the value/weights pairs of the prediction (for the density plot sketched below)
  • estimparam_out.oobstats : various statistics on oob (MSE, NMSE, NMAE etc.)

If PLS is enabled:

  • estimparam_out.plsvar : variance explained by number of components
  • estimparam_out.plsweights : variable weight in the first component (sorted by absolute value)
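
For instance, a posterior density plot can be drawn from estimparam_out.predweights. The snippet below assumes a two-column csv of value/weight pairs, as described above; the exact column layout is an assumption:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("estimparam_out.predweights")
values, weights = df.iloc[:, 0], df.iloc[:, 1]  # value/weight pairs

plt.hist(values, bins=100, weights=weights, density=True)
plt.xlabel("Parameter value")
plt.ylabel("Posterior density")
plt.show()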

Various

Partial Least Squares algorithm

  1. X_{0}=X ; y_{0}=y
  2. For k=1,2,...,s :
    1. w_{k}=\frac{X_{k-1}^{T} y_{k-1}}{y_{k-1}^{T} y_{k-1}}
    2. Normalize w_k to 1
    3. t_{k}=\frac{X_{k-1} w_{k}}{w_{k}^{T} w_{k}}
    4. p_{k}=\frac{X_{k-1}^{T} t_{k}}{t_{k}^{T} t_{k}}
    5. X_{k}=X_{k-1}-t_{k} p_{k}^{T}
    6. q_{k}=\frac{y_{k-1}^{T} t_{k}}{t_{k}^{T} t_{k}}
    7. u_{k}=\frac{y_{k-1}}{q_{k}}
    8. y_{k}=y_{k-1}-q_{k} t_{k}

Comment: when there is no missing data, stages 2.1 and 2.2 can be replaced by w_{k}=\frac{X_{k-1}^{T} y_{k-1}}{\left\|X_{k-1}^{T} y_{k-1}\right\|} and 2.3 by t_{k}=X_{k-1}w_{k}

To get W such that T=XW, we compute:

\mathbf{W}=\mathbf{W}^{*}\left(\widetilde{\mathbf{P}} \mathbf{W}^{*}\right)^{-1}

where \widetilde{\mathbf{P}}_{K \times p}=\left[p_{1}, \ldots, p_{K}\right]^{T} and \mathbf{W}^{*}_{p \times K} = \left[w_{1}, \ldots, w_{K}\right]
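
Transcribed directly from the steps above, a compact NumPy sketch (for exposition only; the shipped implementation is the optimized C++ one, and the function name here is ours) could read:

import numpy as np

def pls_univariate(X, y, s):
    """Univariate PLS following steps 1-2.8 above.
    X: (N, p) predictor matrix, y: (N,) response, s: number of components."""
    Xk = X.astype(float).copy()   # step 1: X_0 = X
    yk = y.astype(float).copy()   # step 1: y_0 = y
    W, P, T = [], [], []
    for _ in range(s):
        w = Xk.T @ yk / (yk @ yk)      # 2.1
        w /= np.linalg.norm(w)         # 2.2: normalize w_k to 1
        t = Xk @ w / (w @ w)           # 2.3 (w @ w == 1 after 2.2)
        p = Xk.T @ t / (t @ t)         # 2.4
        Xk = Xk - np.outer(t, p)       # 2.5: deflate X
        q = (yk @ t) / (t @ t)         # 2.6 (2.7's u_k is not needed here)
        yk = yk - q * t                # 2.8: deflate y
        W.append(w); P.append(p); T.append(t)
    Wstar = np.column_stack(W)               # W*_{p x K}
    Ptil = np.vstack(P)                      # P~_{K x p}
    Wrot = Wstar @ np.linalg.inv(Ptil @ Wstar)  # W = W*(P~ W*)^{-1}, so T = X W
    return np.column_stack(T), Wrot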

TODO

Input/Output

  • Integrate hdf5 (or exdir? msgpack?) routines to save/load reftables/observed stats with associated metadata
  • Provide R code to save/load the data
  • Provide Python code to save/load the data

C++ standalone

  • Merge the two methodologies in a single executable with (almost) the same options
  • (Optional) Possibly move to another options parser (CLI?)

External interfaces

  • R package
  • Python package

Documentation

  • Code documentation
  • Document the build

Continuous integration

  • Linux CI build with intel/MKL optimizations
  • macOS CI build
  • Windows CI build

Long/Mid term TODO

  • auto-tuning of methodology parameters
    • auto-discovering the optimal number of trees by monitoring OOB error
    • auto-limiting number of threads by available memory
  • Streamline the two methodologies (model choice and then parameter estimation)
  • Write our own tree/rf implementation with better storage efficiency than ranger
  • Make functional tests for the two methodologies
  • Possible use of Mondrian forests for online batches? See (Lakshminarayanan, Roy, and Teh 2014)

References

This work was the subject of a JOBIM 2020 proceedings paper, with PDF and video (in French) (Collin et al. 2020).

Collin, François-David, Ghislain Durif, Louis Raynal, Eric Lombaert, Mathieu Gautier, Renaud Vitalis, Jean-Michel Marin, and Arnaud Estoup. 2021. “Extending Approximate Bayesian Computation with Supervised Machine Learning to Infer Demographic History from Genetic Polymorphisms Using DIYABC Random Forest.” Molecular Ecology Resources 21 (8): 2598–2613. https://doi.org/10.1111/1755-0998.13413.

Collin, François-David, Arnaud Estoup, Jean-Michel Marin, and Louis Raynal. 2020. “Bringing ABC Inference to the Machine Learning Realm: AbcRanger, an Optimized Random Forests Library for ABC.” In JOBIM 2020, 2020:66. JOBIM. Montpellier, France. https://hal.archives-ouvertes.fr/hal-02910067.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York, NY: Springer.

Guennebaud, Gaël, Benoît Jacob, et al. 2010. “Eigen V3.” http://eigen.tuxfamily.org.

Lakshminarayanan, Balaji, Daniel M Roy, and Yee Whye Teh. 2014. “Mondrian Forests: Efficient Online Random Forests.” In Advances in Neural Information Processing Systems, 3140–48.

Lintusaari, Jarno, Henri Vuollekoski, Antti Kangasrääsiö, Kusti Skytén, Marko Järvenpää, Pekka Marttinen, Michael U. Gutmann, Aki Vehtari, Jukka Corander, and Samuel Kaski. 2018. “ELFI: Engine for Likelihood-Free Inference.” Journal of Machine Learning Research 19 (16): 1–7. http://jmlr.org/papers/v19/17-374.html.

Pudlo, Pierre, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, and Christian P Robert. 2015. “Reliable ABC Model Choice via Random Forests.” Bioinformatics 32 (6): 859–66.

Raynal, Louis, Jean-Michel Marin, Pierre Pudlo, Mathieu Ribatet, Christian P Robert, and Arnaud Estoup. 2018. “ABC random forests for Bayesian parameter inference.” Bioinformatics 35 (10): 1720–28. https://doi.org/10.1093/bioinformatics/bty867.

Wright, Marvin N, and Andreas Ziegler. 2015. “Ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” arXiv Preprint arXiv:1508.04409.

[^1]: The term “online” here and in the code does not have its usual meaning, as coined in “online machine learning”: we still need the entire training dataset at once. Our implementation is “online” not in the sequential order of the input data, but in the sequential order of computation of the trees in the random forests, which are computed sequentially and then discarded.

[^2]: We only use the C++ Core of ranger, which is under MIT License, same as ours.
