LAVASET: Latent Variable Stochastic Ensemble of Trees. An ensemble method for correlated datasets with spatial, spectral, and temporal dependencies

LAVASET

License: GPL v3 | DOI:10.1101/2023.10.20.563223 | DOI:10.1093/bioinformatics/btae101

LAVASET (Latent Variable Stochastic Ensemble of Trees) is a Python package designed for ensemble learning in datasets with complex spatial, spectral, and temporal dependencies. The main method is described in our Bioinformatics paper: https://doi.org/10.1093/bioinformatics/btae101. An updated version of LAVASET, v1, includes CLIFI, a class-based directional feature importance metric, as well as boosting algorithms; both are described in our preprint: https://www.biorxiv.org/content/10.1101/2024.08.01.605982v1.full.pdf.

Features

  • Efficient Handling of Correlated Data: Optimized for datasets where traditional models struggle.
  • Cython-Powered Performance: Critical computations are implemented in Cython for efficiency.
  • Cross-Platform Compatibility: Tested and deployable across Linux, macOS, and Windows.
  • Integrated directional feature importance metric (v1.0.0): class-based and suitable for multi-class classification.

Installation

You can install LAVASET directly from PyPI:

pip install lavaset

Requirements

  • Python >= 3.7
  • NumPy
  • pandas
  • scikit-learn
  • scipy
  • Cython
  • joblib

Cython and NumPy are build dependencies of LAVASET and are installed before the package setup runs. If you encounter any issues during installation, especially regarding Cython or NumPy, consider installing these two packages manually before installing LAVASET.
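If installation fails, it can help to first confirm which of the listed requirements are already present in the environment. The snippet below is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8); the helper name installed_versions is hypothetical and is not part of LAVASET.

```python
from importlib import metadata

# Distribution names as listed under Requirements (PyPI names).
REQUIREMENTS = ["numpy", "pandas", "scikit-learn", "scipy", "Cython", "joblib"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIREMENTS).items():
        print(f"{name}: {version or 'not installed'}")
```

Any entry reported as not installed can then be installed manually with pip before retrying the LAVASET installation.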

For macOS or Windows users

LAVASET is built on Linux and is compatible with various Linux platforms via a Docker image. Builds for the following macOS and Windows platforms are configured via cibuildwheel:

[macos]
python_configurations = [
  { identifier = "cp3x-macosx_x86_64", version = "3.7-3.12"},
  { identifier = "cp3x-macosx_arm64", version = "3.7-3.12"},
  { identifier = "cp3x-macosx_universal2", version = "3.7-3.12"}]

[windows]
python_configurations = [
  { identifier = "cp3x-win32", version = "3.7-3.12", arch = "32" },
  { identifier = "cp3x-win_amd64", version = "3.7-3.12", arch = "64" },
  { identifier = "cp3x-win_arm64", version = "3.9-3.12", arch = "ARM64" }]
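To see how a local interpreter compares with the configurations above, the standard library can report the platform and architecture. This is a rough sketch only: the helper in_listed_range is hypothetical, and its defaults assume the 3.7-3.12 range listed above (the Windows ARM64 entry starts at 3.9). It does not check whether a matching wheel has actually been published.

```python
import platform
import sys
import sysconfig


def in_listed_range(major, minor, lowest=(3, 7), highest=(3, 12)):
    """Return True if (major, minor) lies within the listed Python version range."""
    return lowest <= (major, minor) <= highest


if __name__ == "__main__":
    # Platform string of the running interpreter (format differs per OS).
    print("platform:", sysconfig.get_platform())
    # CPU architecture as reported by the OS.
    print("machine:", platform.machine())
    # CPython, PyPy, ...
    print("implementation:", platform.python_implementation())
    print("in listed range:", in_listed_range(*sys.version_info[:2]))
```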

Example Usage

A Jupyter notebook with examples of how to import and use the LAVASET package can be found here. Briefly, the LAVASET model can be called as below:

model = LAVASET(ntrees=100, n_neigh=10, distance=False, nvartosample='sqrt', nsamtosample=0.5, oobe=True) 
  • ntrees: number of trees (or estimators) for the ensemble (int)
  • n_neigh: number of neighbors used to calculate the latent variable; this excludes the feature selected for the split, so the latent variable is calculated from a total of n_neigh + 1 features (int)
  • distance: parameter indicating whether the input for neighbor calculation is a distance matrix, default is False; if True, then n_neigh should be 0 (boolean)
  • nvartosample: the number of features sampled for each split; 'sqrt' uses the square root of the total number of features, and an int uses that exact number of features (string or int)
  • nsamtosample: the number of samples to consider for each tree; a float (such as 0.5) uses float * total number of samples, and an int uses that exact number of samples (float or int)
  • oobe: whether to calculate the out-of-bag score, default is True (boolean)
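To make the parameter descriptions above concrete, the sketch below shows how nvartosample and nsamtosample could translate into actual counts, and how n_neigh relates to the number of features behind each latent variable. The helper functions are hypothetical and written only for illustration; in particular, rounding down is an assumption, and LAVASET's internal handling may differ.

```python
import math


def resolve_nvartosample(nvartosample, n_features):
    """Number of features sampled per split, following the description above."""
    if nvartosample == "sqrt":
        # Assumption: round the square root down, keeping at least one feature.
        return max(1, int(math.sqrt(n_features)))
    if isinstance(nvartosample, int):
        return nvartosample
    raise ValueError("nvartosample must be 'sqrt' or an int")


def resolve_nsamtosample(nsamtosample, n_samples):
    """Number of samples drawn per tree, following the description above."""
    if isinstance(nsamtosample, float):
        # Assumption: a float is a fraction of all samples, rounded down.
        return int(nsamtosample * n_samples)
    if isinstance(nsamtosample, int):
        return nsamtosample
    raise ValueError("nsamtosample must be a float or an int")


def latent_group_size(n_neigh):
    """The selected split feature plus its n_neigh neighbors."""
    return n_neigh + 1


if __name__ == "__main__":
    print(resolve_nvartosample("sqrt", 100))  # 10
    print(resolve_nsamtosample(0.5, 200))     # 100
    print(latent_group_size(10))              # 11
```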

If the input to the knn_calculation function is a distance matrix then:

model = LAVASET(ntrees=100, n_neigh=0, distance=True, nvartosample='sqrt', nsamtosample=0.5, oobe=True) 

knn = model.knn_calculation(distance_matrix, data_type='distance_matrix')
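The layout of distance_matrix is not spelled out here. As a rough illustration, and assuming it is a square, symmetric matrix of pairwise distances between features, such a matrix could be built from feature coordinates as follows. The function name and the coordinates are hypothetical, math.dist requires Python 3.8 or later, and the result may need converting to a NumPy array or DataFrame before use.

```python
import math


def pairwise_distance_matrix(coords):
    """Build a symmetric matrix of Euclidean distances between points."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            # Distances are symmetric, so fill both halves at once.
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    # Hypothetical 2D coordinates for three features.
    feature_coords = [(0, 0), (3, 4), (0, 8)]
    for row in pairwise_distance_matrix(feature_coords):
        print(row)
```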

If the neighbors need to be calculated from the 1D spectrum ppm values of an HNMR dataset, then the input is the 1D array of ppm values. Here the model parameters should be set to distance=False and n_neigh=k, and the data_type parameter of knn_calculation should be set to '1D'. The available data_type options are:

  • 'distance_matrix' is used for distance matrix input,
  • '1D' is used for 1D data like signals or spectra,
  • 'VCG' is used for VCG data,
  • 'other' is used for any other type of data, where it calculates the nearest neighbors based on the 2D data input.

For the 1D case, for example:

knn = model.knn_calculation(mtbls1.columns[1:], data_type='1D')
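For intuition about the '1D' option, the sketch below finds, for each position along a 1D axis (such as ppm values), the k closest other positions. It is a plain-Python illustration of nearest neighbors in one dimension, not LAVASET's knn_calculation; the function name and the ppm values are made up for the example.

```python
def nearest_neighbors_1d(positions, k):
    """For each position, return the indices of the k closest other positions."""
    neighbors = []
    for i, p in enumerate(positions):
        others = [j for j in range(len(positions)) if j != i]
        # Sort by absolute distance along the axis; ties are broken by index.
        others.sort(key=lambda j: (abs(positions[j] - p), j))
        neighbors.append(others[:k])
    return neighbors


if __name__ == "__main__":
    ppm = [0.5, 1.0, 1.2, 4.0]  # hypothetical ppm values
    print(nearest_neighbors_1d(ppm, 2))
```

With k=2, each feature is grouped with its two closest positions on the ppm axis, which matches the idea of n_neigh excluding the selected feature itself.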

Depending on whether you wish to use our new feature importance method (CLIFI) or the traditional RF method, you can call different functions for your feature importance calculations. These are described in detail in the example notebooks.

Citing us

@article{10.1093/bioinformatics/btae101,
    author = {Kasapi, Melpomeni and Xu, Kexin and Ebbels, Timothy M D and O’Regan, Declan P and Ware, James S and Posma, Joram M},
    title = "{LAVASET: Latent variable stochastic ensemble of trees. An ensemble method for correlated datasets with spatial, spectral, and temporal dependencies}",
    journal = {Bioinformatics},
    pages = {btae101},
    year = {2024},
    month = {02},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btae101},
    url = {https://doi.org/10.1093/bioinformatics/btae101},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btae101/56732749/btae101.pdf},
}

Contributing

Contributions to LAVASET are always welcome.

Issues

Please submit any issues or bugs via the GitHub issues page. Please include details about the LAVASET minor version used (lavaset.__version__) as well as any relevant input data.

Contributions

Please submit any changes via a pull request. These will be reviewed by the LAVASET team and merged in due course.

License

LAVASET is released under the GNU General Public License v3 (GPL v3). See the LICENSE file for more details.

Contact

For questions or feedback, please contact Melpi Kasapi at mk218@ic.ac.uk.

Visit our GitHub repository for more information and updates.
