ABC random forests for model choice and parameter estimation, Python wrapper


Random forests methodologies for:

  • ABC model choice (Pudlo et al. 2015)
  • ABC Bayesian parameter inference (Raynal et al. 2018)

Libraries we use:

  • Ranger (Wright and Ziegler 2015): we use our own fork, tuned to make "online"[^1] computations (growing trees and making predictions in the same pass, which removes the need for in-memory storage of the whole forest)[^2]
  • Eigen3 (Guennebaud, Jacob, et al. 2010)

Note that we use our own implementation of LDA and PLS from (Friedman, Hastie, and Tibshirani 2001, 1:81, 114); the PLS implementation is optimized for the univariate case (see 5.1). For linear algebra optimization purposes on large reftables, the Linux binaries (standalone and python wheel) are statically linked with Intel's Math Kernel Library, in order to leverage multicore and SIMD extensions on modern CPUs.

There is one set of binaries, containing a macOS/Linux/Windows (x64 only) binary for each platform. They are available within the "Releases" tab, under the "Assets" section (unfold it to see the list).

These are pure command line binaries, and there are no prerequisites or library dependencies required to run them. Just download them and launch them from your terminal software of choice. The usual caveats with command line executables apply here: if you're not proficient with the command line interface of your platform, please learn some basics or ask someone who can help you with those matters.

The standalone binary is part of DIYABC-RF, a specialized Population Genetics graphical interface, presented in MER (Molecular Ecology Resources, Special Issue) (Collin et al. 2021).

Python

Installation

pip install pyabcranger
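
After installation, a quick sanity check is to import the module; a minimal sketch, assuming the top-level module shares the package name pyabcranger:

import pyabcranger  # top-level module assumed to match the package name
print(pyabcranger.__file__)  # show where the compiled extension landed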

Notebook examples

Usage

 - ABC Random Forest - Model choice or parameter estimation command line options
Usage:
  ../build/abcranger [OPTION...]

  -h, --header arg        Header file (default: headerRF.txt)
  -r, --reftable arg      Reftable file (default: reftableRF.bin)
  -b, --statobs arg       Statobs file (default: statobsRF.txt)
  -o, --output arg        Prefix output (modelchoice_out or estimparam_out by
                          default)
  -n, --nref arg          Number of samples, 0 means all (default: 0)
  -m, --minnodesize arg   Minimal node size. 0 means 1 for classification or
                          5 for regression (default: 0)
  -t, --ntree arg         Number of trees (default: 500)
  -j, --threads arg       Number of threads, 0 means all (default: 0)
  -s, --seed arg          Seed, generated by default (default: 0)
  -c, --noisecolumns arg  Number of noise columns (default: 5)
      --nolinear          Disable LDA for model choice or PLS for parameter
                          estimation
      --plsmaxvar arg     Percentage of maximum explained Y-variance for
                          retaining pls axis (default: 0.9)
      --chosenscen arg    Chosen scenario (mandatory for parameter
                          estimation)
      --noob arg          number of oob testing samples (mandatory for
                          parameter estimation)
      --parameter arg     name of the parameter of interest (mandatory for
                          parameter estimation)
  -g, --groups arg        Groups of models
      --help              Print help
  • If you provide --chosenscen, --parameter and --noob, parameter estimation mode is selected.
  • Otherwise, model choice mode is selected by default.
  • Linear additions are LDA for model choice and PLS for parameter estimation; the --nolinear option disables them in both cases.

Model Choice

Terminal model choice

Example

abcranger -t 10000 -j 8

Header, reftable and statobs files should be in the current directory.

Groups

With the option -g (or --groups), you may "group" your models into several groups. For example, if you have six models, labeled from 1 to 6: `-g "1,2,3;4,5,6"` (see the full command below).
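
For instance, to run model choice with models 1-3 grouped against models 4-6, reusing the documented options:

abcranger -t 10000 -j 8 -g "1,2,3;4,5,6"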

Generated files

Four files are created:

  • modelchoice_out.ooberror : OOB error rate vs. number of trees (the line number is the number of trees; see the plotting sketch below)
  • modelchoice_out.importance : variable importances (sorted)
  • modelchoice_out.predictions : votes, prediction and posterior error rate
  • modelchoice_out.confusion : OOB confusion matrix of the classifier
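
A quick way to check that the forest has enough trees is to plot the OOB error against the number of trees; a minimal sketch, assuming modelchoice_out.ooberror holds one error value per line (the line number being the number of trees, as stated above):

import matplotlib.pyplot as plt

# Read one OOB error value per line; line i corresponds to i trees.
with open("modelchoice_out.ooberror") as f:
    oob = [float(line.split()[-1]) for line in f if line.strip()]

plt.plot(range(1, len(oob) + 1), oob)
plt.xlabel("number of trees")
plt.ylabel("OOB error rate")
plt.show()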

Parameter Estimation

Terminal estim param

Composite parameters

When specifying the parameter (option --parameter), one may specify simple composite parameters, as the division, addition or multiplication of two existing parameters, like t/N or T1+T2.
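
For example, to estimate the composite parameter t/N on scenario 1 (a hypothetical invocation reusing the documented options; the quotes merely protect the expression from the shell):

abcranger -t 1000 -j 8 --parameter "t/N" --chosenscen 1 --noob 50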

A note about PLS heuristic

The --plsmaxvar option (defaulting to 0.90) fixes the number of selected PLS axes so that at least the specified percentage of the maximum explained variance of the output is retained. The explained variance of the output for the first m axes is defined by the R-squared of the output:

Yvar^m = \frac{\sum_{i=1}^{N}{(\hat{y}^{m}_{i}-\bar{y})^2}}{\sum_{i=1}^{N}{(y_{i}-\bar{y})^2}}

where \hat{y}^{m} is the output y estimated by the PLS with the first m components. So, only the first n_{comp} axes are kept, where:

n_{comp} = \min\left\{ m \in \{1,\ldots,M\} : Yvar^{m} \geq 0.90 \cdot Yvar^{M} \right\}
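
A minimal sketch of this selection rule, assuming yvar is the array of Yvar^m values for m = 1, ..., M:

import numpy as np

def select_ncomp(yvar, plsmaxvar=0.90):
    # Smallest number of PLS axes whose explained variance reaches
    # at least plsmaxvar times the maximum explained variance Yvar^M.
    yvar = np.asarray(yvar)
    threshold = plsmaxvar * yvar[-1]  # Yvar^M, the variance with all axes
    return int(np.argmax(yvar >= threshold)) + 1  # first index meeting it, 1-based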

Note that if you specify 0 as --plsmaxvar, an "elbow" heuristic is activated, where the following condition is tested for every computed axis:

\frac{Yvar^{k+1}+Yvar^{k}}{2} \geq 0.99(N-k)\left(Yvar^{k+1}-Yvar^{k}\right)

If this condition holds over a window of previous axes, sized to 10% of the total number of possible axes, the PLS axis computation stops.

In practice, we find this n_{heur} close enough to the previous n_{comp} in about 99% of cases, but this isn't guaranteed.
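
A sketch of this stopping rule as described above; it assumes N is the number of samples and that the "window of previous axes" counts consecutive axes satisfying the condition:

import numpy as np

def elbow_ncomp(yvar, n_samples, window_frac=0.10):
    # Stop the PLS axis computation once the elbow condition has held
    # over a trailing window sized to 10% of the total possible axes.
    yvar = np.asarray(yvar)
    window = max(1, int(window_frac * len(yvar)))
    hits = 0
    for k in range(len(yvar) - 1):  # test each consecutive pair (k, k+1)
        flat = (yvar[k + 1] + yvar[k]) / 2 >= 0.99 * (n_samples - k) * (yvar[k + 1] - yvar[k])
        hits = hits + 1 if flat else 0
        if hits >= window:
            return k + 1  # keep the axes computed so far
    return len(yvar)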

The meaning of the --noob parameter

The median global/local statistics and confidence interval (global) measures for parameter estimation need a number of OOB samples (--noob) to be reliable (typically 30% of the dataset size is sufficient). Be aware that computing the whole set (i.e. giving --noob the same value as --nref) for the weights predictions (Raynal et al. 2018) could be very costly, memory- and cpu-wise, if your dataset has a large number of samples, so it may be advisable to compute them on a subset of size --noob only.

Example (parameter estimation)

Example (working with the dataset in test/data):

abcranger -t 1000 -j 8 --parameter ra --chosenscen 1 --noob 50

Header, reftable and statobs files should be in the current directory.

Generated files (parameter estimation)

Five files (or seven if PLS is enabled) are created:

  • estimparam_out.ooberror : OOB MSE vs. number of trees (the line number is the number of trees)
  • estimparam_out.importance : variable importances (sorted)
  • estimparam_out.predictions : expectation, variance and 0.05, 0.5, 0.95 quantiles of the prediction
  • estimparam_out.predweights : csv of the value/weight pairs of the prediction (for density plots; see the sketch after this list)
  • estimparam_out.oobstats : various statistics on OOB (MSE, NMSE, NMAE, etc.)

If PLS is enabled:

  • estimparam_out.plsvar : variance explained by number of components
  • estimparam_out.plsweights : variable weight in the first component (sorted by absolute value)
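
The predweights file can be used to draw a posterior density of the parameter; a minimal sketch, assuming a two-column csv of value/weight pairs (adjust if a header row is present):

import csv
import matplotlib.pyplot as plt

values, weights = [], []
with open("estimparam_out.predweights") as f:
    for row in csv.reader(f):
        try:
            values.append(float(row[0]))
            weights.append(float(row[1]))
        except (ValueError, IndexError):
            continue  # skip a header or malformed row

# Weighted histogram as a crude density plot (a KDE would also work).
plt.hist(values, bins=50, weights=weights, density=True)
plt.xlabel("parameter value")
plt.ylabel("posterior density")
plt.show()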

Various

Partial Least Squares algorithm

  1. X_{0}=X ; y_{0}=y
  2. For k=1,2,...,s :
    1. w_{k}=\frac{X_{k-1}^{T} y_{k-1}}{y_{k-1}^{T} y_{k-1}}
    2. Normalize w_k to 1
    3. t_{k}=\frac{X_{k-1} w_{k}}{w_{k}^{T} w_{k}}
    4. p_{k}=\frac{X_{k-1}^{T} t_{k}}{t_{k}^{T} t_{k}}
    5. X_{k}=X_{k-1}-t_{k} p_{k}^{T}
    6. q_{k}=\frac{y_{k-1}^{T} t_{k}}{t_{k}^{T} t_{k}}
    7. u_{k}=\frac{y_{k-1}}{q_{k}}
    8. y_{k}=y_{k-1}-q_{k} t_{k}

Comment: when there isn't any missing data, stages 2.1 and 2.2 could be replaced by w_{k}=\frac{X_{k-1}^{T} y_{k-1}}{\left\|X_{k-1}^{T} y_{k-1}\right\|} and 2.3 by t_{k}=X_{k-1}w_{k}

To get W so that T=XW, we compute:

\mathbf{W}=\mathbf{W}^{*}\left(\widetilde{\mathbf{P}} \mathbf{W}^{*}\right)^{-1}

where \widetilde{\mathbf{P}}_{K \times p}={}^{t}\left[p_{1}, \ldots, p_{K}\right] and \mathbf{W}^{*}_{p \times K} = \left[w_{1}, \ldots, w_{K}\right]
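
A compact NumPy transcription of the algorithm above, using the no-missing-data simplification from the comment; a sketch only, assuming X and y are already centered:

import numpy as np

def pls1(X, y, n_comp):
    # Univariate PLS as in steps 1-2.8 above (u_k is not needed to recover T and W).
    Xk, yk = X.copy(), y.copy()
    W, P, T = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)      # steps 2.1-2.2 (normalized form)
        t = Xk @ w                  # step 2.3 (simplified, w now unit-norm)
        p = Xk.T @ t / (t @ t)      # step 2.4
        Xk = Xk - np.outer(t, p)    # step 2.5: deflate X
        q = yk @ t / (t @ t)        # step 2.6
        yk = yk - q * t             # step 2.8: deflate y
        W.append(w); P.append(p); T.append(t)
    Wstar = np.column_stack(W)      # W* is p x K
    Ptilde = np.vstack(P)           # P~ is K x p, rows are the p_k
    Wrot = Wstar @ np.linalg.inv(Ptilde @ Wstar)  # W = W*(P~ W*)^{-1}, so T = X W
    return np.column_stack(T), Wrot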

TODO

Input/Output

  • Integrate hdf5 (or exdir? msgpack?) routines to save/load reftables/observed stats with associated metadata
  • Provide R code to save/load the data
  • Provide Python code to save/load the data

C++ standalone

  • Merge the two methodologies in a single executable with (almost) the same options
  • (Optional) Possibly move to another options parser (CLI?)

External interfaces

  • R package
  • Python package

Documentation

  • Code documentation
  • Document the build

Continuous integration

  • Linux CI build with intel/MKL optimizations
  • osX CI build
  • Windows CI build

Long/Mid term TODO

  • auto-tuning of methodology parameters
    • auto-discovering the optimal number of trees by monitoring OOB error
    • auto-limiting number of threads by available memory
  • Streamline the two methodologies (model choice and then parameter estimation)
  • Write our own tree/rf implementation with better storage efficiency than ranger
  • Make functional tests for the two methodologies
  • Could Mondrian forests be used for online batches? See (Lakshminarayanan, Roy, and Teh 2014)

References

This work was the subject of a proceedings paper at JOBIM 2020, with PDF and video available (in French) (Collin et al. 2020).

Collin, François-David, Ghislain Durif, Louis Raynal, Eric Lombaert, Mathieu Gautier, Renaud Vitalis, Jean-Michel Marin, and Arnaud Estoup. 2021. “Extending Approximate Bayesian Computation with Supervised Machine Learning to Infer Demographic History from Genetic Polymorphisms Using DIYABC Random Forest.” Molecular Ecology Resources 21 (8): 2598–2613. https://doi.org/10.1111/1755-0998.13413.

Collin, François-David, Arnaud Estoup, Jean-Michel Marin, and Louis Raynal. 2020. “Bringing ABC inference to the machine learning realm : AbcRanger, an optimized random forests library for ABC.” In JOBIM 2020, 2020:66. JOBIM. Montpellier, France. https://hal.archives-ouvertes.fr/hal-02910067.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York, NY: Springer.

Guennebaud, Gaël, Benoît Jacob, et al. 2010. “Eigen V3.” http://eigen.tuxfamily.org.

Lakshminarayanan, Balaji, Daniel M Roy, and Yee Whye Teh. 2014. “Mondrian Forests: Efficient Online Random Forests.” In Advances in Neural Information Processing Systems, 3140–48.

Lintusaari, Jarno, Henri Vuollekoski, Antti Kangasrääsiö, Kusti Skytén, Marko Järvenpää, Pekka Marttinen, Michael U. Gutmann, Aki Vehtari, Jukka Corander, and Samuel Kaski. 2018. “ELFI: Engine for Likelihood-Free Inference.” Journal of Machine Learning Research 19 (16): 1–7. http://jmlr.org/papers/v19/17-374.html.

Pudlo, Pierre, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, and Christian P Robert. 2015. “Reliable ABC Model Choice via Random Forests.” Bioinformatics 32 (6): 859–66.

Raynal, Louis, Jean-Michel Marin, Pierre Pudlo, Mathieu Ribatet, Christian P Robert, and Arnaud Estoup. 2018. “ABC random forests for Bayesian parameter inference.” Bioinformatics 35 (10): 1720–28. https://doi.org/10.1093/bioinformatics/bty867.

Wright, Marvin N, and Andreas Ziegler. 2015. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” arXiv Preprint arXiv:1508.04409.

[^1]: The term “online” here and in the code does not have the usual meaning it has in “online machine learning”: we still need the entire training data set at once. Our implementation is “online” not in the sequential order of the input data, but in the sequential order of computation of the trees in the random forests, which are sequentially computed and then discarded.

[^2]: We only use the C++ core of ranger, which is under the MIT License, the same as ours.
