LAVASET: Latent Variable Stochastic Ensemble of Trees. An ensemble method for correlated datasets with spatial, spectral, and temporal dependencies
LAVASET
LAVASET (Latent Variable Stochastic Ensemble of Trees) is a Python package for ensemble learning on datasets with complex spatial, spectral, and temporal dependencies. The main method is described in our Bioinformatics paper: https://doi.org/10.1093/bioinformatics/btae101. Version 1.0.0 adds CLIFI, a class-based directional feature importance metric, as well as boosting algorithms; both are described in our preprint: https://www.biorxiv.org/content/10.1101/2024.08.01.605982v1.full.pdf.
Features
- Efficient Handling of Correlated Data: Optimized for datasets where traditional models struggle.
- Cython-Powered Performance: Critical computations are implemented in Cython for efficiency.
- Cross-Platform Compatibility: Tested and deployable across Linux, macOS, and Windows.
- Integrated directional feature importance metric (v1.0.0): class-based and suitable for multi-class classification.
Installation
You can install LAVASET directly from PyPI:
pip install lavaset
Requirements
- Python >= 3.7
- NumPy
- pandas
- scikit-learn
- scipy
- Cython
- joblib
Cython and NumPy are incorporated as build dependencies for LAVASET and are pre-installed before the package setup. If you encounter any issues during installation, especially regarding Cython or NumPy, consider installing these packages manually before proceeding with the LAVASET installation.
For macOS or Windows users
LAVASET is built on Linux and is compatible with various Linux platforms via a Docker image. It also has builds for the following macOS and Windows platforms via cibuildwheel:
[macos]
python_configurations = [
{ identifier = "cp3x-macosx_x86_64", version = "3.7-3.12"},
{ identifier = "cp3x-macosx_arm64", version = "3.7-3.12"},
{ identifier = "cp3x-macosx_universal2", version = "3.7-3.12"}]
[windows]
python_configurations = [
{ identifier = "cp3x-win32", version = "3.7-3.12", arch = "32" },
{ identifier = "cp3x-win_amd64", version = "3.7-3.12", arch = "64" },
{ identifier = "cp3x-win_arm64", version = "3.9-3.12", arch = "ARM64" }]
Example Usage
A Jupyter notebook with examples of how to import and use the LAVASET package can be found here. Briefly, the LAVASET model can be called as below:
model = LAVASET(ntrees=100, n_neigh=10, distance=False, nvartosample='sqrt', nsamtosample=0.5, oobe=True)
- ntrees: number of trees (or estimators) for the ensemble (int)
- n_neigh: number of neighbors to take for the calculation of the latent variable; this excludes the feature that has been selected for split, therefore the latent variable is calculated by the total of n+1 features (int)
- distance: parameter indicating whether the input for neighbor calculation is a distance matrix, default is False; if True, then n_neigh should be 0 (boolean)
- nvartosample: the number of features picked for each split; 'sqrt' indicates the square root of the total number of features, and an int sets that specific number of features (string or int)
- nsamtosample: the number of samples to consider for each tree; a float (like 0.5) means float * total number of samples, and an int sets that specific number of samples (float or int)
- oobe: parameter for calculating the out-of-bag score, default=True (boolean)
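As a rough illustration of the parameter descriptions above, the sketch below shows how 'sqrt', float, and int values could translate into concrete counts. The helper names are hypothetical and are not part of the LAVASET API, and the rounding behaviour (rounding down) is an assumption; the package itself may round differently.

```python
import math


def resolve_nvartosample(nvartosample, n_features):
    """Hypothetical helper: number of features drawn at each split."""
    if nvartosample == 'sqrt':
        # Assumption: the square root is rounded down, with a minimum of 1.
        return max(1, int(math.sqrt(n_features)))
    return int(nvartosample)


def resolve_nsamtosample(nsamtosample, n_samples):
    """Hypothetical helper: number of samples drawn for each tree."""
    if isinstance(nsamtosample, float):
        # Assumption: float * total number of samples, rounded down.
        return int(nsamtosample * n_samples)
    return int(nsamtosample)


def latent_variable_size(n_neigh):
    """The selected feature plus its n_neigh neighbors."""
    return n_neigh + 1


n_samples, n_features = 200, 100
print(resolve_nvartosample('sqrt', n_features))  # 10
print(resolve_nsamtosample(0.5, n_samples))      # 100
print(latent_variable_size(10))                  # 11
```

With the defaults shown earlier (n_neigh=10), each latent variable would therefore be computed from 11 features.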
If the input to the knn_calculation function is a distance matrix then:
model = LAVASET(ntrees=100, n_neigh=0, distance=True, nvartosample='sqrt', nsamtosample=0.5, oobe=True)
knn = model.knn_calculation(distance_matrix, data_type='distance_matrix')
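For readers who do not yet have a distance matrix, the following minimal sketch builds a symmetric pairwise Euclidean distance matrix from per-feature coordinates using only the standard library. The function name and the coordinates are hypothetical, and the container type that knn_calculation expects (nested list, NumPy array, or DataFrame) is not stated here, so converting the result may be necessary.

```python
import math


def pairwise_distance_matrix(coords):
    """Hypothetical helper: symmetric Euclidean distances between features."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
            # Fill both triangles so the matrix stays symmetric.
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Illustrative 2D positions for three features (made-up values).
feature_coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distance_matrix = pairwise_distance_matrix(feature_coords)
for row in distance_matrix:
    print(row)
```

The diagonal is zero because each feature is at distance 0 from itself.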
If the neighbors need to be calculated from the 1D spectrum ppm values of an HNMR dataset, then the input is the 1D array of ppm values. Here the model parameters should be set to distance=False and n_neigh=k, and the data_type parameter of knn_calculation should be set to '1D'. The available data_type options are:
- 'distance_matrix' is used for distance matrix input,
- '1D' is used for 1D data like signals or spectra,
- 'VCG' is used for VCG data,
- 'other' is used for any other type of data, where it calculates the nearest neighbors based on the 2D data input.
knn = model.knn_calculation(mtbls1.columns[1:], data_type='1D')
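To make the idea of neighbors on a 1D axis concrete, here is a small standalone sketch that finds, for each position, the k closest other positions by absolute difference. It is not LAVASET's implementation of knn_calculation; the function name, the ppm values, and the output format (a list of index lists) are assumptions chosen for illustration.

```python
def nearest_neighbors_1d(positions, k):
    """Hypothetical helper: indices of the k closest other positions."""
    neighbors = []
    for i, p in enumerate(positions):
        # Exclude the feature itself, mirroring how n_neigh excludes
        # the feature selected for the split.
        others = [j for j in range(len(positions)) if j != i]
        others.sort(key=lambda j: abs(positions[j] - p))
        neighbors.append(others[:k])
    return neighbors


# Made-up ppm values standing in for spectrum column labels.
ppm = [0.5, 0.6, 0.8, 1.2, 2.0]
knn_sketch = nearest_neighbors_1d(ppm, k=2)
for i, idx in enumerate(knn_sketch):
    print(ppm[i], '->', [ppm[j] for j in idx])
```

Features at the edge of the spectrum have neighbors on one side only, so their neighborhoods are not centred on the selected feature.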
Depending on whether you wish to use our new feature importance method (CLIFI) or the traditional RF method you can call different functions for your feature importance calculations that can be found in detail in the example notebooks.
Citing us
@article{10.1093/bioinformatics/btae101,
author = {Kasapi, Melpomeni and Xu, Kexin and Ebbels, Timothy M D and O’Regan, Declan P and Ware, James S and Posma, Joram M},
title = "{LAVASET: Latent variable stochastic ensemble of trees. An ensemble method for correlated datasets with spatial, spectral, and temporal dependencies}",
journal = {Bioinformatics},
pages = {btae101},
year = {2024},
month = {02},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btae101},
url = {https://doi.org/10.1093/bioinformatics/btae101},
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btae101/56732749/btae101.pdf},
}
Contributing
Contributions to LAVASET are always welcome.
Issues
Please submit any issues or bugs via the GitHub issues page. Please include details about the LAVASET minor version used (lavaset.__version__) as well as any relevant input data.
Contributions
Please submit any changes via a pull request. These will be reviewed by the LAVASET team and merged in due course.
License
LAVASET is released under the GNU License. See the LICENSE file for more details.
Contact
For questions or feedback, please contact Melpi Kasapi at mk218@ic.ac.uk.
Visit our GitHub repository for more information and updates.