Train:Test Algorithmic Sampling for Molecules and Arbitrary Arrays

astartes

Train:Validation:Test Algorithmic Sampling for Molecules and Arbitrary Arrays

Installing `astartes`

We recommend installing astartes within a virtual environment, using either venv or conda (or other tools) to simplify dependency management. Python versions 3.7, 3.8, 3.9, 3.10, and 3.11 are supported on all platforms.

astartes is available on PyPI and can be installed using pip:

To include the featurization options for chemical data, use pip install astartes[molecules].
To install only the sampling algorithms, use pip install astartes (this install will have fewer dependencies and may be more readily compatible in environments with existing workflows).

Note for Windows PowerShell or macOS Catalina or newer: On these systems the shell will complain about square brackets, so you will need to wrap the package name in double quotes (i.e. pip install "astartes[molecules]").

Using `astartes`

astartes is designed as a drop-in replacement for sklearn's train_test_split function. To switch to astartes, change from sklearn.model_selection import train_test_split to from astartes import train_test_split.

By default, astartes will split data randomly. Additionally, a variety of algorithmic sampling approaches can be used by specifying the sampler argument to the function:

from astartes import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    sampler="kennard_stone",  # any of the supported samplers
)

Paper

For a comprehensive walkthrough of the theory and implementation of astartes, follow this link to read the companion paper.

Example Notebooks

Click the badges in the table below to be taken to a live, interactive demo of astartes:

Demo	Link
Using `train_val_test_split` with the `sklearn` example datasets
Comparing Sampling Algorithms with Fast Food
Cheminformatics sample set partitioning with `astartes`
Comparing partitioning approaches for alkanes

Rational Splitting Algorithms

While much machine learning is done with a random choice between training/validation/test data, an alternative is the use of so-called "rational" splitting algorithms. These approaches use some similarity-based algorithm to divide data into sets. Some of these algorithms include Kennard-Stone, minimal test set dissimilarity, and sphere exclusion algorithms, as discussed by Tropsha et al., as well as OptiSim, as discussed in Applied Chemoinformatics: Achievements and Future Opportunities. Some clustering-based splitting techniques have also been incorporated, such as DBSCAN.
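To make the idea of a similarity-based split concrete, here is a minimal textbook-style sketch of Kennard-Stone selection in pure Python. It is an illustration only, not the astartes implementation: the function names `kennard_stone_order` and `kennard_stone_split` are hypothetical, Euclidean distance is assumed, and ties are broken by lowest index.

```python
import math


def kennard_stone_order(points):
    """Return the indices of `points` in Kennard-Stone selection order."""
    n = len(points)
    if n < 2:
        return list(range(n))
    # Full pairwise Euclidean distance matrix (fine for small inputs).
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Seed the selection with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    while remaining:
        # Next pick: the point whose nearest selected neighbour is farthest away.
        nxt = max(
            sorted(remaining),
            key=lambda i: min(dist[i][j] for j in selected),
        )
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


def kennard_stone_split(points, train_size=0.75):
    """Assign the first-selected points to train and the rest to test."""
    order = kennard_stone_order(points)
    n_train = round(train_size * len(points))
    return order[:n_train], order[n_train:]


points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
train_idx, test_idx = kennard_stone_split(points, train_size=0.6)
print(train_idx, test_idx)  # [0, 4, 3] [1, 2]
```

Because the extremes of the data are selected first, the test points end up inside the region spanned by the training set.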

There are two broad categories of sampling algorithms implemented in astartes: extrapolative and interpolative. The former will force your model to predict on out-of-sample data, which creates a more challenging task than interpolative sampling. See the table below for all of the sampling approaches currently implemented in astartes, as well as the hyperparameters that each algorithm accepts (which are passed in with hopts) and a helpful reference for understanding how the hyperparameters work. Note that random_state is defined as a keyword argument in train_test_split itself, even though these algorithms will use the random_state in their own work. Do not provide a random_state in the hopts dictionary - it will be overwritten by the random_state you provide for train_test_split (or the default if none is provided).
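The difference between the two categories can be seen on toy data. The sketch below does not use any astartes sampler; it hand-builds one interpolative-style and one extrapolative-style split of a one-dimensional dataset with two clusters, and measures how far each test point is from its nearest training point. The helper `nearest_train_distance` is a hypothetical name introduced for this illustration.

```python
def nearest_train_distance(train, test):
    """For each test value, the distance to the closest training value."""
    return [min(abs(t - x) for x in train) for t in test]


# Two well-separated clusters of one-dimensional samples.
data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]

# Interpolative style: test points come from regions the training set also covers.
interp_train = [x for i, x in enumerate(data) if i % 4 != 1]
interp_test = [x for i, x in enumerate(data) if i % 4 == 1]

# Extrapolative style: an entire cluster is withheld from training.
extrap_train = [x for x in data if x < 1.0]
extrap_test = [x for x in data if x >= 1.0]

interp_gap = max(nearest_train_distance(interp_train, interp_test))
extrap_gap = min(nearest_train_distance(extrap_train, extrap_test))
print(f"largest interpolative gap: {interp_gap:.2f}")
print(f"smallest extrapolative gap: {extrap_gap:.2f}")
```

In the extrapolative case every test point is far from anything seen in training, which is what makes the prediction task harder.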

Implemented Sampling Algorithms

Sampler Name	Usage String	Type	Hyperparameters	Reference	Notes
Random	'random'	Interpolative	`shuffle`	`sklearn train_test_split`	This sampler is a direct passthrough to `sklearn`'s `train_test_split`, though it does not currently reproduce splits identically.
Kennard-Stone	'kennard_stone'	Interpolative	`metric`	Kennard & Stone	Euclidean distance is used by default, as described in the original paper.
Sample set Partitioning based on joint X-Y distances (SPXY)	'spxy'	Interpolative	`distance_metric`	Saldanha et al. original paper	Extension of Kennard-Stone that also includes the response when sampling distances.
Scaffold	'scaffold'	Extrapolative	`include_chirality`	Bemis-Murcko Scaffold as implemented in RDKit	This sampler requires SMILES strings as input (use the `molecules` subpackage)
Sphere Exclusion	'sphere_exclusion'	Extrapolative	`metric`, `distance_cutoff`	custom implementation	Variation on Sphere Exclusion for arbitrary-valued vectors.
Time Based	'time_based'	Extrapolative	none	Chen et al., Sheridan, R. P, Feinberg et al., Struble et al.	This sampler requires `labels` to be an iterable of either date or datetime objects.
Optimizable K-Dissimilarity Selection (OptiSim)	'optisim'	Extrapolative	`n_clusters`, `max_subsample_size`, `distance_cutoff`	custom implementation	Variation on OptiSim for arbitrary-valued vectors.
K-Means	'kmeans'	Extrapolative	`n_clusters`, `n_init`	`sklearn KMeans`	Passthrough to `sklearn`'s `KMeans`.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)	'dbscan'	Extrapolative	`eps`, `min_samples`, `algorithm`, `metric`, `leaf_size`	`sklearn DBSCAN`	Passthrough to `sklearn`'s `DBSCAN`.
Minimum Test Set Dissimilarity (MTSD)	~	~	upcoming in `astartes` v1.x	~	~
Restricted Boltzmann Machine (RBM)	~	~	upcoming in `astartes` v1.x	~	~
Kohonen Self-Organizing Map (SOM)	~	~	upcoming in `astartes` v1.x	~	~
SPlit Method	~	~	upcoming in `astartes` v1.x	~	~
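As an illustration of the Time Based row above, which requires `labels` to be date or datetime objects, here is a minimal sketch of the general idea behind a time-based split: sort by label and train on the earliest samples. This is an assumption about the general technique, not the astartes implementation, and `time_based_split` is a hypothetical name.

```python
from datetime import date


def time_based_split(X, labels, train_size=0.75):
    """Split so that every training sample predates every test sample."""
    # Indices of the samples ordered from oldest to newest label.
    order = sorted(range(len(X)), key=lambda i: labels[i])
    n_train = int(train_size * len(X))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return [X[i] for i in train_idx], [X[i] for i in test_idx]


X = ["a", "b", "c", "d"]
labels = [date(2021, 5, 1), date(2019, 1, 1), date(2023, 3, 1), date(2020, 7, 1)]
X_train, X_test = time_based_split(X, labels, train_size=0.75)
print(X_train, X_test)  # ['b', 'd', 'a'] ['c']
```

The model is then evaluated only on samples newer than anything it was trained on, which is why this approach is classed as extrapolative.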

Using the `astartes.molecules` Subpackage

After installing with pip install astartes[molecules] one can import the new train/test splitting function like this: from astartes.molecules import train_test_split_molecules

The usage of this function is identical to train_test_split but with the addition of new arguments to control how the molecules are featurized:

train_test_split_molecules(
    molecules=smiles,
    y=y,
    test_size=0.2,
    train_size=0.8,
    fingerprint="daylight_fingerprint",
    fprints_hopts={
        "minPath": 2,
        "maxPath": 5,
        "fpSize": 200,
        "bitsPerHash": 4,
        "useHs": 1,
        "tgtDensity": 0.4,
        "minSize": 64,
    },
    sampler="random",
    random_state=42,
    hopts={
        "shuffle": True,
    },
)

To see a complete example of using train_test_split_molecules with actual chemical data, take a look in the examples directory.

Configuration options for the featurization scheme can be found in the documentation for AIMSim, though most of the critical configuration options are shown above.

Reproducibility

astartes aims to be completely reproducible across different platforms, Python versions, and dependency configurations: any version of astartes v1.x should result in the exact same splits, always. To that end, astartes uses 42 as the default random seed and always sets it, so running astartes with the default settings will always produce the exact same results. We have verified this behavior on Debian Ubuntu, Windows, and Intel Macs from Python versions 3.7 through 3.11 (with appropriate dependencies for each version). We are limited in our ability to test on M1 Macs, but in our limited manual testing we achieve perfect reproducibility in all cases except, occasionally, KMeans on Apple silicon. KMeans has produced slightly different results between platforms regardless of random_state, with up to two clusters being assigned differently, resulting in data splits which are >99% identical. astartes is still consistent between runs on the same platform in all cases.
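The "always set a default seed" approach can be sketched with the standard library alone. The function below is a hypothetical stand-in rather than astartes code: it seeds a local generator with a default `random_state` of 42, so repeated calls on the same platform and Python version return the same split unless a different seed is requested.

```python
import random


def seeded_random_split(n_samples, test_size=0.25, random_state=42):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(random_state)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


first = seeded_random_split(20)
second = seeded_random_split(20)
print(first == second)  # True: same seed, same split
```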

Online Documentation

The online documentation contains everything you see in this README with an additional tutorial for moving from train_test_split in sklearn to astartes.

Contributing & Developer Notes

Pull Requests, Bug Reports, and all Contributions are welcome! Please use the appropriate issue or pull request template when making a contribution.

We make use of the GitHub Discussions page to go over potential features to add. Please feel free to stop by if you are looking for something to develop or have an idea for a useful feature!

When submitting a PR, please mark your PR with the "PR Ready for Review" label when you are finished making changes so that the GitHub actions bots can work their magic!

Developer Install

To contribute to the astartes source code, start by cloning the repository (i.e. git clone git@github.com:JacksonBurns/astartes.git) and then inside the repository run pip install -e .[molecules,dev]. This will set you up with all the required dependencies to run astartes and conform to our formatting standards (black and isort), which you can configure to run automatically in vscode like this.

Unit Testing

All of the tests in astartes are written using the built-in Python unittest module (to allow running without pytest), but we highly recommend using pytest. To execute the tests from the astartes repository, simply run pytest after completing the developer install (or alternatively, pytest -v for more detailed output).
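A test written in this style might look like the sketch below. It is not taken from the astartes test suite: `toy_split` and `TestToySplit` are hypothetical names used only to show a `unittest.TestCase` that pytest can also collect.

```python
import unittest


def toy_split(data, train_size=0.75):
    """Hypothetical function under test: a simple ordered split."""
    n_train = int(train_size * len(data))
    return data[:n_train], data[n_train:]


class TestToySplit(unittest.TestCase):
    def test_sizes(self):
        train, test = toy_split(list(range(8)), train_size=0.75)
        self.assertEqual(len(train), 6)
        self.assertEqual(len(test), 2)

    def test_no_overlap(self):
        data = list(range(8))
        train, test = toy_split(data)
        # Every sample lands in exactly one of the two sets.
        self.assertEqual(set(train) & set(test), set())
        self.assertEqual(train + test, data)
```

Saved in a file whose name starts with `test_`, this class can be run with either `python -m unittest` or `pytest`.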

Adding New Samplers

A new sampler should extend the abstract base class defined in abstract_sampler.py.

It can be as simple as a passthrough to another train_test_split, or it can be an original implementation that results in X and y being split into two lists. Take a look at astartes/samplers/random_split.py for a basic example!

After the sampler has been implemented, add it to __init__.py in astartes/samplers and it will automatically be unit tested. Additional unit tests to verify that hyperparameters can be properly passed, etc. are also recommended.
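The general pattern of extending an abstract base class is sketched below. Everything in this block is hypothetical: `SketchAbstractSampler`, its constructor arguments, and the `_sample` hook are stand-ins invented for illustration, so consult abstract_sampler.py for the real interface before writing a sampler.

```python
from abc import ABC, abstractmethod


class SketchAbstractSampler(ABC):
    """Hypothetical stand-in for an abstract sampler base class."""

    def __init__(self, X, y=None, hopts=None):
        self.X = X
        self.y = y
        self.hopts = hopts or {}
        # Subclasses decide the order in which samples are assigned.
        self.sample_idxs = self._sample()

    @abstractmethod
    def _sample(self):
        """Return sample indices in the order they should be assigned."""


class EveryOtherSampler(SketchAbstractSampler):
    """Toy sampler: takes every `step`-th index, then the skipped ones."""

    def _sample(self):
        step = self.hopts.get("step", 2)
        n = len(self.X)
        return [i for start in range(step) for i in range(start, n, step)]


sampler = EveryOtherSampler(X=list("abcdef"), hopts={"step": 2})
print(sampler.sample_idxs)  # [0, 2, 4, 1, 3, 5]
```

The abstract method forces each concrete sampler to supply its own ordering logic, while shared bookkeeping stays in the base class.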

For historical reasons, and as a guide for any developers who would like to add new samplers, below is a running list of samplers which have been considered for addition to astartes but ultimately not added for various reasons.

Not Implemented Sampling Algorithms

Sampler Name	Reasoning	Relevant Link(s)
D-Optimal	Requires a priori knowledge of the test and train size, which does not fit in the `astartes` framework (samplers are all agnostic to the size of the sets), and it is questionable whether the use of the Fisher information matrix is actually meaningful in the context of sampling existing data rather than tuning for ideal data.	The Wikipedia article for optimal design does a good job explaining why this is difficult, and points at some potential alternatives.
Duplex	Requires knowing test and train size before execution, and can only partition data into two sets which would make it incompatible with `train_val_test_split`.	This implementation in R includes helpful references and a reference implementation.

Adding New Featurization Schemes

All of the sampling methods implemented in astartes accept arbitrary arrays of numbers and return the sampled groups (with the exception of Scaffold.py). If you have an existing featurization scheme (i.e. take an arbitrary input and turn it into an array of numbers), we would be thrilled to include it in astartes.

Adding a new interface should take on this format:

import numpy as np

from astartes import train_test_split


def train_test_split_INTERFACE(
    INTERFACE_input,
    INTERFACE_ARGS,
    y: np.ndarray = None,
    labels: np.ndarray = None,
    test_size: float = 0.25,
    train_size: float = 0.75,
    sampler: str = "random",
    hopts: dict = None,
    INTERFACE_hopts: dict = None,
):
    # use None defaults to avoid sharing mutable dicts between calls
    hopts = hopts or {}
    INTERFACE_hopts = INTERFACE_hopts or {}

    # turn the INTERFACE_input into an input X
    # based on INTERFACE_ARGS, where INTERFACE_hopts
    # specifies additional behavior
    X = []

    # call train_test_split with this input
    return train_test_split(
        X,
        y=y,
        labels=labels,
        test_size=test_size,
        train_size=train_size,
        sampler=sampler,
        hopts=hopts,
    )

If possible, we would like to also add an example Jupyter Notebook with any new interface to demonstrate to new users how it functions. See our other examples in the examples directory.

Contact @JacksonBurns if you need assistance adding an existing workflow to astartes. If this featurization scheme requires additional dependencies to function, we may add it as an additional extra package in the same way that molecules is installed.

JOSS Branch

The corresponding JOSS paper for astartes is stored in this repository on a separate branch. You can find paper.md on the aptly named joss-paper branch.

Note for Maintainers: To push changes from the main branch into the joss-paper branch, run the Update JOSS Branch workflow.
