
ABC random forests for model choice and parameter estimation, Python wrapper

Project description

ABC random forests for model choice and parameter estimation


Random forests methodologies for :

  • ABC model choice (Pudlo et al. 2015)
  • ABC Bayesian parameter inference (Raynal et al. 2018)

Libraries we use :

  • Ranger (Wright and Ziegler 2015): we use our own fork, tuned for “online”[^1] computation (growing the trees and making predictions in the same pass, which removes the need to store the whole forest in memory)[^2]
  • Eigen3 (Guennebaud, Jacob, et al. 2010)

Of note, we use our own implementations of LDA and PLS from (Friedman, Hastie, and Tibshirani 2001, 1:81, 114); the PLS implementation is optimized for the univariate case, see 5.1. For linear algebra optimization purposes on large reftables, the Linux binaries (standalone and python wheel) are statically linked with Intel’s Math Kernel Library, in order to leverage multicore and SIMD extensions on modern CPUs.

There is one set of binaries, with a macOS/Linux/Windows (x64 only) binary for each platform. They are available in the “Releases” tab, under the “Assets” section (unfold it to see the list).

These are pure command line binaries, and there are no prerequisites or library dependencies required to run them. Just download them and launch them from your terminal software of choice. The usual caveats with command line executables apply here : if you’re not proficient with the command line interface of your platform, please learn some basics or ask someone who can help you with those matters.

The standalone binary is part of DIYABC-RF, a specialized population genetics graphical interface, presented in a Molecular Ecology Resources special issue (Collin et al. 2021).

Python

Installation

pip install pyabcranger

Notebook examples

Usage

ABC Random Forest - Model choice or parameter estimation command line options
Usage:
  ../build/abcranger [OPTION...]

  -h, --header arg        Header file (default: headerRF.txt)
  -r, --reftable arg      Reftable file (default: reftableRF.bin)
  -b, --statobs arg       Statobs file (default: statobsRF.txt)
  -o, --output arg        Prefix output (modelchoice_out or estimparam_out by
                          default)
  -n, --nref arg          Number of samples, 0 means all (default: 0)
  -m, --minnodesize arg   Minimal node size. 0 means 1 for classification or
                          5 for regression (default: 0)
  -t, --ntree arg         Number of trees (default: 500)
  -j, --threads arg       Number of threads, 0 means all (default: 0)
  -s, --seed arg          Seed, generated by default (default: 0)
  -c, --noisecolumns arg  Number of noise columns (default: 5)
      --nolinear          Disable LDA for model choice or PLS for parameter
                          estimation
      --plsmaxvar arg     Percentage of maximum explained Y-variance for
                          retaining pls axis (default: 0.9)
      --chosenscen arg    Chosen scenario (mandatory for parameter
                          estimation)
      --noob arg          number of oob testing samples (mandatory for
                          parameter estimation)
      --parameter arg     name of the parameter of interest (mandatory for
                          parameter estimation)
  -g, --groups arg        Groups of models
      --help              Print help
  • If you provide --chosenscen, --parameter and --noob, parameter estimation mode is selected.
  • Otherwise by default it’s model choice mode.
  • Linear additions are LDA for model choice and PLS for parameter estimation; the --nolinear option disables them in both cases.
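
For instance, with the flags documented above, the two modes would be invoked as follows (the parameter name N is purely illustrative; use a parameter defined in your header file):

abcranger -t 500
abcranger -t 500 --chosenscen 1 --noob 50 --parameter N

The first call runs model choice with 500 trees; the second switches to parameter estimation because the three mandatory options are present.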

Model Choice

Terminal model choice

Example

Example :

abcranger -t 10000 -j 8

Header, reftable and statobs files should be in the current directory.
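
If those files live elsewhere or under different names, you may point to them explicitly with the -h, -r and -b options (file names below are illustrative):

abcranger -h myheader.txt -r myreftable.bin -b mystatobs.txt -t 10000 -j 8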

Groups

With the option -g (or --groups), you may “group” your models into several groups. For example, if you have six models, labeled from 1 to 6 : -g "1,2,3;4,5,6"
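
A complete model choice invocation with grouping would then look like :

abcranger -t 10000 -j 8 -g "1,2,3;4,5,6"

Models 1 to 3 form the first group and models 4 to 6 the second, and the classification is then done between the two groups.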

Generated files

Four files are created :

  • modelchoice_out.ooberror : OOB Error rate vs number of trees (line number is the number of trees)
  • modelchoice_out.importance : variables importance (sorted)
  • modelchoice_out.predictions : votes, prediction and posterior error rate
  • modelchoice_out.confusion : OOB Confusion matrix of the classifier

Parameter Estimation

Terminal estim param

Composite parameters

When specifying the parameter (option --parameter), one may specify simple composite parameters, i.e. the division, addition or multiplication of two existing parameters, like t/N or T1+T2.
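
For instance, assuming t and N are parameters defined in your header file, estimating their ratio would read :

abcranger -t 1000 -j 8 --chosenscen 1 --noob 50 --parameter t/N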

A note about PLS heuristic

The --plsmaxvar option (defaulting to 0.90) sets the number of selected PLS axes so that at least the specified percentage of the maximum explained variance of the output is reached. The explained variance of the output for the first m axes is defined by the R-squared of the output:

Yvar^m = \frac{\sum_{i=1}^{N}{(\hat{y}^{m}_{i}-\bar{y})^2}}{\sum_{i=1}^{N}{(y_{i}-\bar{y})^2}}

where \hat{y}^{m} is the output y as predicted by the PLS with the first m components. So, only the first n_{comp} axes are kept, with :

n_{comp} = \min\left\{\, m : Yvar^{m} \geq 0.90 \cdot Yvar^{M} \right\}

Note that if you specify 0 as --plsmaxvar, an “elbow” heuristic is activated instead, where the following condition is tested for every computed axis :

\frac{Yvar^{k+1}+Yvar^{k}}{2} \geq 0.99(N-k)\left(Yvar^{k+1}-Yvar^{k}\right)

If this condition holds over a window of previous axes, sized at 10% of the total number of possible axes, then PLS axis computation stops.

In practice, we find this n_{heur} close enough to the n_{comp} obtained with a 0.99 threshold, but this isn’t guaranteed.
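
As an illustration, here is a minimal Python sketch of the two heuristics above (not the actual abcranger code); it assumes yvar is the nondecreasing vector of explained output variances Yvar^1, ..., Yvar^M, and takes N in the elbow condition to be the total number of possible axes:

import numpy as np

def select_ncomp(yvar, plsmaxvar=0.9):
    # Sketch of the axis-selection heuristics; assumptions as stated above.
    yvar = np.asarray(yvar, dtype=float)
    M = len(yvar)
    if plsmaxvar > 0:
        # --plsmaxvar mode: first m such that Yvar^m >= plsmaxvar * Yvar^M
        return int(np.argmax(yvar >= plsmaxvar * yvar[-1])) + 1
    # --plsmaxvar 0: "elbow" mode, stop once the gain flattens over a window
    window = max(1, M // 10)  # 10% of the total possible axes
    flat = 0
    for k in range(M - 1):
        if (yvar[k + 1] + yvar[k]) / 2 >= 0.99 * (M - k) * (yvar[k + 1] - yvar[k]):
            flat += 1
            if flat >= window:
                return k + 1
        else:
            flat = 0
    return M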

The meaning of the noob parameter

The median global/local statistics and confidence interval (global) measures for parameter estimation need a number of OOB samples (--noob) to be reliable (typically 30% of the dataset size is sufficient). Be aware that computing weight predictions (Raynal et al. 2018) on the whole set (i.e. setting --noob to the same value as --nref) could be very costly, memory- and CPU-wise, if your dataset has a large number of samples, so it may be advisable to compute them only on a subset of size noob.
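
For example, with a reftable of 100,000 samples, around 30,000 OOB samples would typically be enough (parameter name N illustrative):

abcranger -n 100000 -t 1000 --chosenscen 1 --noob 30000 --parameter N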

Example (parameter estimation)

Example (working with the dataset in test/data) :

abcranger -t 1000 -j 8 --parameter ra --chosenscen 1 --noob 50

Header, reftable and statobs files should be in the current directory.

Generated files (parameter estimation)

Five files (or seven if PLS is activated) are created :

  • estimparam_out.ooberror : OOB MSE rate vs number of trees (line number is the number of trees)
  • estimparam_out.importance : variables importance (sorted)
  • estimparam_out.predictions : expectation, variance and 0.05, 0.5, 0.95 quantile for prediction
  • estimparam_out.predweights : csv of the value/weights pairs of the prediction (for density plots; see the sketch below)
  • estimparam_out.oobstats : various statistics on oob (MSE, NMSE, NMAE etc.)

If PLS is enabled :

  • estimparam_out.plsvar : variance explained by number of components
  • estimparam_out.plsweights : variable weight in the first component (sorted by absolute value)
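
As a sketch, a posterior density plot could be drawn from estimparam_out.predweights like this (assuming a two-column value/weight layout; check the actual file for its exact format and separator):

import numpy as np
import matplotlib.pyplot as plt

# Assumes estimparam_out.predweights is a two-column CSV of value,weight rows
vals, weights = np.loadtxt("estimparam_out.predweights", delimiter=",", unpack=True)
plt.hist(vals, bins=50, weights=weights, density=True)  # weighted histogram as a density estimate
plt.xlabel("parameter value")
plt.ylabel("posterior density")
plt.show()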

Various

Partial Least Squares algorithm

  1. X_{0}=X ; y_{0}=y
  2. For k=1,2,...,s :
    1. w_{k}=\frac{X_{k-1}^{T} y_{k-1}}{y_{k-1}^{T} y_{k-1}}
    2. Normalize w_k to 1
    3. t_{k}=\frac{X_{k-1} w_{k}}{w_{k}^{T} w_{k}}
    4. p_{k}=\frac{X_{k-1}^{T} t_{k}}{t_{k}^{T} t_{k}}
    5. X_{k}=X_{k-1}-t_{k} p_{k}^{T}
    6. q_{k}=\frac{y_{k-1}^{T} t_{k}}{t_{k}^{T} t_{k}}
    7. u_{k}=\frac{y_{k-1}}{q_{k}}
    8. y_{k}=y_{k-1}-q_{k} t_{k}

Comment: when there is no missing data, steps 2.1 and 2.2 can be replaced by w_{k}=\frac{X_{k-1}^{T} y_{k-1}}{\left\|X_{k-1}^{T} y_{k-1}\right\|} and step 2.3 by t_{k}=X_{k-1}w_{k}

To get W so that T=XW we compute :

\mathbf{W}=\mathbf{W}^{*}\left(\widetilde{\mathbf{P}} \mathbf{W}^{*}\right)^{-1}

where \widetilde{\mathbf{P}}_{K \times p}={}^{t}\left[p_{1}, \ldots, p_{K}\right] and \mathbf{W}^{*}_{p \times K} = [w_1, \ldots, w_K]
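
For reference, here is a self-contained NumPy sketch of the algorithm above (illustrative only, not the library’s optimized implementation; step 2.7 is omitted since u_k is not needed to recover T and W):

import numpy as np

def pls1(X, y, s):
    Xk, yk = X.astype(float), y.astype(float)  # 1. X_0 = X, y_0 = y
    Ws, Ps, Ts = [], [], []
    for _ in range(s):                  # 2. for k = 1, ..., s
        w = Xk.T @ yk / (yk @ yk)       # 2.1
        w /= np.linalg.norm(w)          # 2.2 normalize w_k to 1
        t = Xk @ w / (w @ w)            # 2.3 (w @ w == 1 after 2.2)
        p = Xk.T @ t / (t @ t)          # 2.4
        Xk = Xk - np.outer(t, p)        # 2.5 deflate X
        q = yk @ t / (t @ t)            # 2.6
        yk = yk - q * t                 # 2.8 deflate y
        Ws.append(w); Ps.append(p); Ts.append(t)
    Wstar = np.column_stack(Ws)                   # W* (p x K)
    Ptilde = np.vstack(Ps)                        # P~ (K x p) = t[p_1, ..., p_K]
    W = Wstar @ np.linalg.inv(Ptilde @ Wstar)     # W = W* (P~ W*)^{-1}
    return np.column_stack(Ts), Ptilde, Wstar, W  # T = X W holds for the training X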

TODO

Input/Output

  • Integrate hdf5 (or exdir? msgpack?) routines to save/load reftables/observed stats with associated metadata
  • Provide R code to save/load the data
  • Provide Python code to save/load the data

C++ standalone

  • Merge the two methodologies in a single executable with (almost) the same options
  • (Optional) Possibly move to another options parser (CLI?)

External interfaces

  • R package
  • Python package

Documentation

  • Code documentation
  • Document the build

Continuous integration

  • Linux CI build with intel/MKL optimizations
  • macOS CI build
  • Windows CI build

Long/Mid term TODO

  • Auto-tuning of methodology parameters
    • auto-discovering the optimal number of trees by monitoring OOB error
    • auto-limiting number of threads by available memory
  • Streamline the two methodologies (model choice and then parameter estimation)
  • Write our own tree/rf implementation with better storage efficiency than ranger
  • Make functional tests for the two methodologies
  • Could Mondrian forests be used for online batches? See (Lakshminarayanan, Roy, and Teh 2014)

References

This work was the subject of a JOBIM 2020 proceedings, with PDF and video (in French) (Collin et al. 2020).

Collin, François-David, Ghislain Durif, Louis Raynal, Eric Lombaert, Mathieu Gautier, Renaud Vitalis, Jean-Michel Marin, and Arnaud Estoup. 2021. “Extending Approximate Bayesian Computation with Supervised Machine Learning to Infer Demographic History from Genetic Polymorphisms Using DIYABC Random Forest.” Molecular Ecology Resources 21 (8): 2598–2613. https://doi.org/10.1111/1755-0998.13413.

Collin, François-David, Arnaud Estoup, Jean-Michel Marin, and Louis Raynal. 2020. “Bringing ABC inference to the machine learning realm : AbcRanger, an optimized random forests library for ABC.” In JOBIM 2020, 2020:66. JOBIM. Montpellier, France. https://hal.archives-ouvertes.fr/hal-02910067.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York, NY: Springer.

Guennebaud, Gaël, Benoît Jacob, et al. 2010. “Eigen V3.” http://eigen.tuxfamily.org.

Lakshminarayanan, Balaji, Daniel M Roy, and Yee Whye Teh. 2014. “Mondrian Forests: Efficient Online Random Forests.” In Advances in Neural Information Processing Systems, 3140–48.

Lintusaari, Jarno, Henri Vuollekoski, Antti Kangasrääsiö, Kusti Skytén, Marko Järvenpää, Pekka Marttinen, Michael U. Gutmann, Aki Vehtari, Jukka Corander, and Samuel Kaski. 2018. “ELFI: Engine for Likelihood-Free Inference.” Journal of Machine Learning Research 19 (16): 1–7. http://jmlr.org/papers/v19/17-374.html.

Pudlo, Pierre, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, and Christian P Robert. 2015. “Reliable ABC Model Choice via Random Forests.” Bioinformatics 32 (6): 859–66.

Raynal, Louis, Jean-Michel Marin, Pierre Pudlo, Mathieu Ribatet, Christian P Robert, and Arnaud Estoup. 2018. “ABC random forests for Bayesian parameter inference.” Bioinformatics 35 (10): 1720–28. https://doi.org/10.1093/bioinformatics/bty867.

Wright, Marvin N, and Andreas Ziegler. 2015. “Ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” arXiv Preprint arXiv:1508.04409.

[^1]: The term “online” here and in the code does not have its usual meaning, as coined in “online machine learning”: we still need the entire training data set at once. Our implementation is “online” not in the sequential order of the input data, but in the sequential order of computation of the trees in the random forests, which are computed and then discarded one by one.

[^2]: We only use the C++ Core of ranger, which is under MIT License, same as ours.
