Train:Test Algorithmic Sampling for Molecules, Images, and Arbitrary Arrays

astartes

Installing `astartes`

We recommend installing astartes within a virtual environment, using either venv or conda (or other tools) to simplify dependency management.

astartes is available on PyPI and can be installed using pip:

To include the featurization options for chemical data, use pip install astartes[molecules].
To install only the sampling algorithms, use pip install astartes (this install has fewer dependencies and may be more readily compatible with environments that have existing workflows).

Using `astartes`

astartes is designed as a drop-in replacement for sklearn's train_test_split function. To switch to astartes, change `from sklearn.model_selection import train_test_split` to `from astartes import train_test_split`.

By default, astartes uses a random splitting approach identical to the one implemented in sklearn. A variety of deterministic sampling approaches can be used by specifying one additional argument to the function:

X_train, X_test, y_train, y_test = train_test_split(
  X,
  y,
  sampler = 'kennard_stone',  # any of the supported samplers
)

There are two broad categories of sampling algorithms implemented in astartes: supervised (requires labeled data) and unsupervised. All can be accessed via train_test_split, but supervised algorithms require an additional argument, labels, to be specified:

X_train, X_test, y_train, y_test = train_test_split(
  X,
  y,
  labels = labels,  # one label per sample; required by supervised samplers
  sampler = 'time_split',  # any of the supported samplers
)
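
To illustrate why a supervised sampler needs labels, here is a minimal sketch of a time-based split. It assumes the labels are sortable values such as dates and that the oldest samples go to the training set. The function name `time_split_sketch` and its signature are hypothetical and are not part of the astartes API.

```python
from datetime import date


def time_split_sketch(X, y, labels, train_size=0.75):
    """Split X and y so that the oldest samples (by label) form the training set."""
    # Order sample indices from the oldest label to the newest.
    order = sorted(range(len(X)), key=lambda i: labels[i])
    n_train = int(train_size * len(X))
    train_idx, test_idx = order[:n_train], order[n_train:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X = [[1], [2], [3], [4]]
y = [10, 20, 30, 40]
labels = [date(2023, 3, 1), date(2021, 1, 1), date(2022, 6, 1), date(2020, 5, 5)]
X_train, X_test, y_train, y_test = time_split_sketch(X, y, labels)
```

In this sketch the newest sample is held out for testing, which mimics predicting on data collected after the training data.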

Here is a list of all implemented sampling algorithms:

| Sampler Name | Usage String | Type | Reference | Notes |
| --- | --- | --- | --- | --- |
| Random | 'random' | Interpolative | sklearn `train_test_split` | This sampler is a direct passthrough to sklearn's `train_test_split`. |
| Scaffold | 'scaffold' | Extrapolative | `chemprop`'s `scaffold_split` | This sampler is |
| Sphere Exclusion | 'sphere_exclusion' | Extrapolative | custom implementation | Variation on Sphere Exclusion for arbitrary-valued vectors |

Using the `astartes.molecules` Subpackage

After installing with pip install astartes[molecules] one can import the new train/test splitting function like this: from astartes.molecules import train_test_split_molecules

The usage of this function is identical to train_test_split but with the addition of new arguments to control how the molecules are featurized:

train_test_split_molecules(
    smiles=smiles,
    y=y,
    test_size=0.2,
    train_size=0.8,
    fingerprint="daylight_fingerprint",
    fprints_hopts={
        "minPath": 2,
        "maxPath": 5,
        "fpSize": 200,
        "bitsPerHash": 4,
        "useHs": 1,
        "tgtDensity": 0.4,
        "minSize": 64,
    },
    splitter="random",
    hopts={
        "random_state": 42,
        "shuffle": True,
    },
)

Configuration options for the featurization scheme can be found in the documentation for AIMSim.

Online Documentation

Click here to read the documentation

Background

Rational Splitting Algorithms

While much machine learning is done with a random split of the data into training, validation, and test sets, an alternative is the use of so-called "rational" splitting algorithms. These approaches use a similarity-based algorithm to divide the data into sets. They include the Kennard-Stone, minimal test set dissimilarity, and sphere exclusion algorithms discussed by Tropsha et al., as well as DUPLEX, OptiSim, and D-optimal, discussed in Applied Chemoinformatics: Achievements and Future Opportunities. Clustering-based splitting techniques, such as DBSCAN, have also been introduced.
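
As a concrete example of a similarity-based split, the following is a minimal pure-Python sketch of the Kennard-Stone idea: start from the two most distant points, then repeatedly add the point that is farthest from everything selected so far. It assumes Euclidean distance and assigns the first selected points to the training set. The function names are hypothetical, and this is not the astartes implementation.

```python
import math


def kennard_stone_order(points):
    """Return the indices of `points` in Kennard-Stone selection order."""
    n = len(points)
    if n < 2:
        return list(range(n))
    # Start with the pair of points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = {
        i: min(math.dist(points[i], points[s]) for s in selected)
        for i in remaining
    }
    while remaining:
        # Pick the remaining point that is farthest from the selected set.
        nxt = max(sorted(remaining), key=lambda i: min_dist[i])
        selected.append(nxt)
        remaining.remove(nxt)
        del min_dist[nxt]
        for i in remaining:
            min_dist[i] = min(min_dist[i], math.dist(points[i], points[nxt]))
    return selected


def kennard_stone_split(points, train_size=0.75):
    """Return (train_indices, test_indices) using the selection order above."""
    order = kennard_stone_order(points)
    n_train = int(round(train_size * len(points)))
    return order[:n_train], order[n_train:]


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_idx, test_idx = kennard_stone_split(points, train_size=0.75)
```

Because the selection is driven only by distances, the same input always yields the same split, unlike a random split with an unspecified seed.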

Sampling Algorithms

Random
Kennard-Stone (KS)
Minimal Test Set Dissimilarity
Sphere Exclusion
DUPLEX
OptiSim
D-Optimal
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
KMEANS Split
SPXY
RBM
Time Split

Development

To install the most recent release of astartes for development purposes, use pip install -e --target=. astartes[molecules] or clone this repository. Pull requests are welcome!

Adding New Samplers

A new sampler should extend the abstract base class defined in abstract_sampler.py.

It can be as simple as a passthrough to another train_test_split, or it can be an original implementation that results in X and y being split into two lists. Take a look at astartes/samplers/random_split.py for a basic example!
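
The general pattern can be sketched as follows. This is an illustration only: the class names, the `split_indices` method, and the constructor arguments are all hypothetical, and the actual base class in abstract_sampler.py may define a different interface.

```python
import random
from abc import ABC, abstractmethod


class SamplerSketch(ABC):
    """Hypothetical stand-in for an abstract sampler base class."""

    def __init__(self, X, y=None):
        self.X = list(X)
        self.y = None if y is None else list(y)

    @abstractmethod
    def split_indices(self, train_size):
        """Return (train_indices, test_indices) for the stored data."""

    def split(self, train_size=0.75):
        """Apply the subclass's index split to X (and y, if given)."""
        train_idx, test_idx = self.split_indices(train_size)
        X_train = [self.X[i] for i in train_idx]
        X_test = [self.X[i] for i in test_idx]
        if self.y is None:
            return X_train, X_test
        y_train = [self.y[i] for i in train_idx]
        y_test = [self.y[i] for i in test_idx]
        return X_train, X_test, y_train, y_test


class RandomSketch(SamplerSketch):
    """Concrete sampler: shuffle the indices, then cut at train_size."""

    def __init__(self, X, y=None, random_state=None):
        super().__init__(X, y)
        self._rng = random.Random(random_state)

    def split_indices(self, train_size):
        indices = list(range(len(self.X)))
        self._rng.shuffle(indices)
        n_train = int(train_size * len(indices))
        return indices[:n_train], indices[n_train:]


sampler = RandomSketch(range(8), y=range(8), random_state=0)
X_train, X_test, y_train, y_test = sampler.split(train_size=0.75)
```

Keeping the index selection in one abstract method means each new sampler only has to decide which samples go where, while the shared `split` method handles slicing X and y.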

Adding New Featurization Schemes

All of the sampling methods implemented in astartes accept arbitrary arrays of numbers and return the sampled groups -- if you have an existing featurization scheme (i.e., a way to take an arbitrary input and turn it into an array of numbers), we would be thrilled to include it in astartes.

Adding a new interface should take on this format:

import numpy as np

from astartes import train_test_split

def train_test_split_INTERFACE(
    INTERFACE_input,
    INTERFACE_ARGS,
    y: np.ndarray = None,
    test_size: float = 0.25,
    train_size: float = 0.75,
    splitter: str = 'random',
    hopts: dict = {},
    INTERFACE_hopts: dict = {},
):
    # turn the INTERFACE_input into an input X
    # based on INTERFACE_ARGS, where INTERFACE_hopts
    # specifies additional behavior
    X = []
    
    # call train test split with this input
    return train_test_split(
        X,
        y=y,
        test_size=test_size,
        train_size=train_size,
        splitter=splitter,
        hopts=hopts,
    )
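
As a filled-in illustration of this template, the sketch below featurizes strings as letter-count vectors and hands them to a splitter. Everything here is hypothetical: `featurize_text` and `train_test_split_text` are made-up names, and `split_stand_in` is a simple random splitter used so that the example runs without astartes installed. A real interface would call train_test_split from astartes instead.

```python
import random
import string


def split_stand_in(X, y=None, test_size=0.25, random_state=None):
    """Hypothetical stand-in for a train/test split function (not astartes)."""
    indices = list(range(len(X)))
    random.Random(random_state).shuffle(indices)
    n_test = int(round(test_size * len(indices)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    if y is None:
        return X_train, X_test
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


def featurize_text(texts):
    """Turn each string into a 26-element vector of letter counts."""
    return [
        [text.lower().count(ch) for ch in string.ascii_lowercase]
        for text in texts
    ]


def train_test_split_text(texts, y=None, test_size=0.25, random_state=None):
    """Featurize the input, then delegate the actual split."""
    X = featurize_text(texts)
    return split_stand_in(X, y=y, test_size=test_size, random_state=random_state)


texts = ["abc", "bcd", "cde", "def"]
X_train, X_test, y_train, y_test = train_test_split_text(
    texts, y=[0, 1, 2, 3], test_size=0.25, random_state=1
)
```

The interface function itself stays thin: it only converts the raw input into numbers and forwards the remaining arguments to the splitter.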

JORS Branch

paper.tex is stored in a separate branch aptly named jors-paper. To push changes from the main branch into the jors-paper branch, run the Update JORS Branch workflow.

Release history

1.2.2

Mar 1, 2024

1.2.1

Feb 16, 2024

1.2.0 yanked

Feb 15, 2024

Reason this release was yanked:

regression affecting minimal install

1.1.5

Nov 27, 2023

1.1.4

Nov 12, 2023

1.1.3.post1

Oct 16, 2023

1.1.3

Oct 11, 2023

1.1.2

Jul 17, 2023

1.1.1

Jul 2, 2023

1.1.0

Jun 28, 2023

1.0.3

Jun 13, 2023

1.0.2

Jun 6, 2023

1.0.1

Jun 4, 2023

1.0.0

May 1, 2023

1.0.0rc1 pre-release

Mar 24, 2023

1.0.0b2 pre-release

Mar 15, 2023

1.0.0b1 pre-release

Mar 15, 2023

1.0.0b0 pre-release

Mar 7, 2023

1.0.0a4 pre-release

Feb 16, 2023

0.0.0

Apr 22, 2022

Download files

Source Distribution

astartes-1.0.0a4.tar.gz (8.7 kB view hashes)

Uploaded Feb 16, 2023 Source

Built Distribution

astartes-1.0.0a4-py3-none-any.whl (7.6 kB view hashes)

Uploaded Feb 16, 2023 Python 3

Hashes for astartes-1.0.0a4.tar.gz
Algorithm	Hash digest
SHA256	`10ba7cec55f4c490469bc069cc37c1e8d87fa28c690ae162182f7cb3af06e844`
MD5	`1ce6cdc894277a05da72e1777e273f70`
BLAKE2b-256	`906ee1e4ff016adcb58dcacd2411a000035a5afdc317e2a0eb5d3a07f9bf4813`

Hashes for astartes-1.0.0a4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`85ca427ed08b9fa4f8aeee271ad1c3258d6ca4f2c61a2054a38dad70a4ab0d82`
MD5	`784773c8703a3f39ed5691e87bd9717a`
BLAKE2b-256	`dec939eab70f6967094d22a63fd9e0e655917a0bc1914969a4ed6a402cc1e528`
