Train:Test Algorithmic Sampling for Molecules, Images, and Arbitrary Arrays
astartes
Installing astartes
We recommend installing astartes within a virtual environment, using either venv or conda (or other tools), to simplify dependency management.
astartes is available on PyPI and can be installed using pip:
- To include the featurization options for chemical data, use pip install astartes[molecules].
- To install only the sampling algorithms, use pip install astartes (this install will have fewer dependencies and may be more readily compatible in environments with existing workflows).
Note for Windows PowerShell or macOS Catalina or newer: On these systems the shell will complain about square brackets, so you will need to wrap the package specifier in double quotes (i.e. pip install "astartes[molecules]").
Using astartes
astartes is designed as a drop-in replacement for sklearn's train_test_split function. To switch to astartes, change from sklearn.model_selection import train_test_split to from astartes import train_test_split.
By default, astartes uses the same random splitting approach that sklearn implements; a variety of deterministic sampling approaches can be selected by passing one additional argument to the function:
from astartes import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    sampler="kennard_stone",  # any of the supported samplers
)
Example Notebooks
Click the badges in the table below to be taken to a live, interactive demo of astartes:
Demo | Link |
---|---|
Using train_val_test_split with the sklearn example datasets | |
Cheminformatics sample set partitioning with astartes | |
Rational Splitting Algorithms
While much machine learning is done with a random split of the data into training, validation, and test sets, an alternative is the use of so-called "rational" splitting algorithms. These approaches use a similarity-based algorithm to divide the data into sets. Examples include Kennard-Stone, minimal test set dissimilarity, and sphere exclusion, as discussed by Tropsha et al., as well as OptiSim, as discussed in Applied Chemoinformatics: Achievements and Future Opportunities. Clustering-based splitting techniques, such as DBSCAN, have also been introduced.
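To make the idea of similarity-based selection concrete, here is a minimal pure-Python sketch of the Kennard-Stone idea: start from the two most distant points, then repeatedly pick the point farthest from everything already selected. The function names and toy data are assumptions made for illustration; this is a simplified version for small inputs, not the implementation that astartes uses.

```python
import math


def kennard_stone_order(points):
    """Return indices of `points` in Kennard-Stone selection order (toy sketch)."""
    n = len(points)
    if n < 2:
        return list(range(n))
    # Start with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    # min_dist[i]: distance from point i to its nearest already-selected point.
    min_dist = {
        i: min(math.dist(points[i], points[s]) for s in selected)
        for i in remaining
    }
    while remaining:
        # Pick the point farthest from everything selected so far
        # (ties are broken by the lowest index).
        nxt = max(sorted(remaining), key=lambda i: min_dist[i])
        selected.append(nxt)
        remaining.remove(nxt)
        for i in remaining:
            min_dist[i] = min(min_dist[i], math.dist(points[i], points[nxt]))
    return selected


def kennard_stone_split(points, train_size=0.75):
    """Assign the first points in selection order to train, the rest to test."""
    order = kennard_stone_order(points)
    n_train = round(train_size * len(points))
    return order[:n_train], order[n_train:]


# Made-up 1-D data: four nearby samples and one outlier.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
train_idx, test_idx = kennard_stone_split(points, train_size=0.6)
```

In this toy case the extreme samples land in the training set and the held-out samples sit between them, which is why the selection is fully deterministic for a given dataset.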
There are two broad categories of sampling algorithms implemented in astartes: extrapolative and interpolative. The former forces your model to predict on out-of-sample data, effectively asking a "harder question" than interpolative sampling does. See the table below for all of the sampling approaches currently implemented in astartes.
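One way to picture the difference is to measure how far each test sample sits from its nearest training sample. The sketch below uses made-up 1-D numbers and a hypothetical helper, nearest_train_distance; it illustrates the concept only and does not call astartes.

```python
def nearest_train_distance(train, test):
    """For each test value, return the distance to the closest training value."""
    return [min(abs(t - x) for x in train) for t in test]


# Two well-separated groups of 1-D samples (made-up numbers).
group_a = [0.0, 0.1, 0.2, 0.3]
group_b = [5.0, 5.1, 5.2, 5.3]

# Interpolative-style split: test samples are drawn from inside both groups.
interp_train = [0.0, 0.2, 0.3, 5.0, 5.1, 5.3]
interp_test = [0.1, 5.2]

# Extrapolative-style split: one whole group is withheld for testing.
extrap_train = group_a
extrap_test = group_b

# Largest train-test gap for the interpolative split,
# smallest train-test gap for the extrapolative split.
interp_gap = max(nearest_train_distance(interp_train, interp_test))
extrap_gap = min(nearest_train_distance(extrap_train, extrap_test))
```

Even the closest test sample in the extrapolative split is far from any training sample, so a model evaluated on it has to predict outside the region it was trained on.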
Implemented Sampling Algorithms
Sampler Name | Usage String | Type | Hyperparameters | Reference | Notes |
---|---|---|---|---|---|
Random | 'random' | Interpolative | random_state, shuffle | sklearn train_test_split | This sampler is a direct passthrough to sklearn's train_test_split, though it does not currently reproduce splits identically. |
Kennard-Stone | 'kennard_stone' | Interpolative | none | yu9824's kennard_stone | Fully deterministic, no hyperparameters accepted. |
Sample set Partitioning based on joint X-Y distances (SPXY) | 'spxy' | Interpolative | distance_metric | Saldanha et al. original paper | Extension of Kennard-Stone that also includes the response when sampling distances. |
Scaffold | 'scaffold' | Extrapolative | explicit_hydrogens, include_chirality | Bemis-Murcko Scaffold as implemented in RDKit | This sampler requires SMILES strings as input (use the molecules subpackage). |
Sphere Exclusion | 'sphere_exclusion' | Extrapolative | metric, random_state, distance_cutoff | custom implementation | Variation on Sphere Exclusion for arbitrary-valued vectors. |
Optimizable K-Dissimilarity Selection (OptiSim) | 'optisim' | Extrapolative | random_state, n_clusters, max_subsample_size, distance_cutoff | custom implementation | Variation on OptiSim for arbitrary-valued vectors. |
K-Means | 'kmeans' | Extrapolative | random_state, n_clusters, n_init | sklearn KMeans | Passthrough to sklearn's KMeans. |
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | 'dbscan' | Extrapolative | eps, min_samples, algorithm, metric, leaf_size | sklearn DBSCAN | Passthrough to sklearn's DBSCAN. |
Minimum Test Set Dissimilarity | ~ | ~ | will be released with astartes v1.0.0 | ~ | ~ |
RBM Sampler | ~ | ~ | will be released with astartes v1.0.0 | ~ | ~ |
Using the astartes.molecules Subpackage
After installing with pip install astartes[molecules], one can import the new train/test splitting function like this: from astartes.molecules import train_test_split_molecules
The usage of this function is identical to train_test_split but with the addition of new arguments to control how the molecules are featurized:
train_test_split_molecules(
    smiles=smiles,
    y=y,
    test_size=0.2,
    train_size=0.8,
    fingerprint="daylight_fingerprint",
    fprints_hopts={
        "minPath": 2,
        "maxPath": 5,
        "fpSize": 200,
        "bitsPerHash": 4,
        "useHs": 1,
        "tgtDensity": 0.4,
        "minSize": 64,
    },
    splitter="random",
    hopts={
        "random_state": 42,
        "shuffle": True,
    },
)
To see a complete example of using train_test_split_molecules with actual chemical data, take a look in the examples directory.
Configuration options for the featurization scheme can be found in the documentation for AIMSim, though most of the critical configuration options are shown above.
Online Documentation
The online documentation contains everything you see in this README with an additional tutorial for moving from train_test_split in sklearn to astartes.
Contributing & Developer Notes
Pull Requests, Bug Reports, and all Contributions are welcome! Please use the appropriate issue or pull request template when making a contribution.
When submitting a PR, please mark it with the "PR Ready for Review" label when you are finished making changes so that the GitHub Actions bots can work their magic!
Developer Install
To contribute to the astartes source code, start by cloning the repository (e.g. git clone git@github.com:JacksonBurns/astartes.git) and then, inside the repository, run pip install -e .[molecules,dev]. This will set you up with all the required dependencies to run astartes and conform to our formatting standards (black and isort), which you can configure to run automatically in VS Code like this.
Note for Windows PowerShell or macOS Catalina or newer: On these systems the shell will complain about square brackets, so you will need to wrap the package specifier in double quotes (i.e. pip install -e ".[molecules,dev]").
Unit Testing
All of the tests in astartes are written using the built-in Python unittest module (to allow running without pytest), but we highly recommend using pytest. To execute the tests from the astartes repository, simply type pytest after running the developer install (or alternatively, pytest -v for more helpful output).
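As a rough illustration of why tests written this way run under either tool, here is a small unittest-style test case. The helper split_indices and the test class are hypothetical examples, not code from the astartes test suite.

```python
import unittest


def split_indices(n_samples, train_size=0.75):
    """Hypothetical helper: first part of the indices is train, the rest is test."""
    n_train = int(n_samples * train_size)
    indices = list(range(n_samples))
    return indices[:n_train], indices[n_train:]


class TestSplitIndices(unittest.TestCase):
    """Plain unittest.TestCase classes are also collected by pytest."""

    def test_sizes(self):
        train, test = split_indices(8, train_size=0.75)
        self.assertEqual(len(train), 6)
        self.assertEqual(len(test), 2)

    def test_no_overlap(self):
        train, test = split_indices(8)
        self.assertFalse(set(train) & set(test))
```

Saved in a file named along the lines of test_example.py, a class like this can be run with python -m unittest or with pytest, with no changes to the test code.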
Adding New Samplers
A new sampler should extend the abstract base class defined in abstract_sampler.py.
It can be as simple as a passthrough to another train_test_split, or it can be an original implementation that results in X and y being split into two lists. Take a look at astartes/samplers/random_split.py for a basic example!
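The general pattern of an abstract base class with one method left for each sampler to fill in might look like the sketch below. ToyAbstractSampler, ToyRandomSampler, and the _sample method are hypothetical names invented for illustration; the real interface is whatever abstract_sampler.py defines, so check that file before writing a sampler. For brevity the sketch splits only X.

```python
import random
from abc import ABC, abstractmethod


class ToyAbstractSampler(ABC):
    """Hypothetical base class; NOT the actual astartes abstract sampler."""

    def __init__(self, X, **hopts):
        self.X = list(X)
        self.hopts = hopts
        # Each concrete sampler decides the order in which samples are drawn.
        self.order = self._sample()

    @abstractmethod
    def _sample(self):
        """Return all sample indices in the order they should be drawn."""

    def split(self, train_size=0.75):
        """Split X into two lists according to the sampling order."""
        n_train = int(len(self.X) * train_size)
        train, test = self.order[:n_train], self.order[n_train:]
        return [self.X[i] for i in train], [self.X[i] for i in test]


class ToyRandomSampler(ToyAbstractSampler):
    """Simplest possible sampler: a seeded shuffle of the indices."""

    def _sample(self):
        rng = random.Random(self.hopts.get("random_state"))
        order = list(range(len(self.X)))
        rng.shuffle(order)
        return order


sampler = ToyRandomSampler(range(8), random_state=42)
X_train, X_test = sampler.split(train_size=0.75)
```

Keeping the splitting logic in the base class means a new sampler only has to supply an ordering, which is one plausible reason to structure samplers this way.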
After the sampler has been implemented, add it to __init__.py in astartes/samplers and it will automatically be unit tested. Additional unit tests to verify that hyperparameters can be properly passed, etc. are also recommended.
For historical reasons, and as a guide for any developers who would like to add new samplers, below is a running list of samplers which have been considered for addition to astartes but ultimately not added for various reasons.
Not Implemented Sampling Algorithms
Sampler Name | Reasoning |
---|---|
D-Optimal | Requires a priori knowledge of the test and train size, which does not fit in the astartes framework (samplers are all agnostic to the size of the sets), and it is questionable whether the use of the Fisher information matrix is actually meaningful in the context of sampling existing data rather than tuning for ideal data. |
Duplex | Requires knowing test and train size before execution, and can only partition data into two sets, which would make it incompatible with train_val_test_split. |
Adding New Featurization Schemes
All of the sampling methods implemented in astartes accept arbitrary arrays of numbers and return the sampled groups (with the exception of Scaffold.py). If you have an existing featurization scheme (i.e. one that takes an arbitrary input and turns it into an array of numbers), we would be thrilled to include it in astartes.
A new interface should take this form:
import numpy as np

from astartes import train_test_split


def train_test_split_INTERFACE(
    INTERFACE_input,
    INTERFACE_ARGS,
    y: np.ndarray = None,
    labels: np.ndarray = None,
    test_size: float = 0.25,
    train_size: float = 0.75,
    splitter: str = "random",
    hopts: dict = {},
    INTERFACE_hopts: dict = {},
):
    # turn the INTERFACE_input into an input X
    # based on INTERFACE_ARGS, where INTERFACE_hopts
    # specifies additional behavior
    X = []
    # call train_test_split with this input
    return train_test_split(
        X,
        y=y,
        labels=labels,
        test_size=test_size,
        train_size=train_size,
        splitter=splitter,
        hopts=hopts,
    )
If possible, we would also like to add an example Jupyter Notebook with any new interface to demonstrate to new users how it functions. See our other examples in the examples directory.
Contact @JacksonBurns if you need assistance adding an existing workflow to astartes. If this featurization scheme requires additional dependencies to function, we may add it as an additional extra package in the same way that molecules is installed.
JORS Branch
The JORS paper corresponding to astartes is stored in this repository on a separate branch. You can find paper.tex on the aptly named jors-paper branch.
Note for Maintainers: To push changes from the main branch into the jors-paper branch, run the Update JORS Branch workflow.