Train:Test Algorithmic Sampling for Molecules, Images, and Arbitrary Arrays
astartes
Installing astartes
We recommend installing `astartes` within a virtual environment, using either `venv` or `conda` (or other tools), to simplify dependency management.
`astartes` is available on PyPI and can be installed using `pip`:
- To include the featurization options for chemical data, use `pip install astartes[molecules]`.
- To install only the sampling algorithms, use `pip install astartes` (this install will have fewer dependencies and may be more readily compatible in environments with existing workflows).
Using astartes
`astartes` is designed as a drop-in replacement for `sklearn`'s `train_test_split` function. To switch to `astartes`, change `from sklearn.model_selection import train_test_split` to `from astartes import train_test_split`.
By default, `astartes` will use a random splitting approach identical to the one implemented in `sklearn`, and a variety of deterministic sampling approaches can be used by specifying one additional argument to the function:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    sampler='kennard_stone',  # any of the supported samplers
)
There are two broad categories of sampling algorithms implemented in `astartes`: supervised (requires labeled data) and unsupervised. All can be accessed via `train_test_split`, but supervised algorithms require an additional argument, `labels`, to be specified:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    labels=labels,  # required by supervised samplers
    sampler='time_split',  # any of the supported samplers
)
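To illustrate why a supervised sampler needs `labels`, here is a minimal sketch of one common reading of a time-based split: samples are ordered by their label, the oldest go to training, and the newest go to testing. The function name `time_split_sketch` and its behavior are assumptions made for illustration, not the implementation used in `astartes`.

```python
def time_split_sketch(X, y, labels, train_size=0.75):
    """Illustrative only: oldest samples train, newest samples test."""
    # Order sample indices by their time label (ints, floats, or dates all sort).
    order = sorted(range(len(X)), key=lambda i: labels[i])
    n_train = int(round(train_size * len(order)))
    train_idx, test_idx = order[:n_train], order[n_train:]

    def take(seq, idx):
        return [seq[i] for i in idx]

    return take(X, train_idx), take(X, test_idx), take(y, train_idx), take(y, test_idx)


X = [[0.1], [0.2], [0.3], [0.4]]
y = [10, 20, 30, 40]
years = [2021, 2018, 2023, 2019]  # hypothetical time labels
X_train, X_test, y_train, y_test = time_split_sketch(X, y, years)
```

Without the labels there is nothing to order by, which is why this kind of sampler cannot run on `X` and `y` alone.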
Here is a list of all implemented sampling algorithms:
Sampler Name | Usage String | Type | Reference | Notes |
---|---|---|---|---|
Random | 'random' | Interpolative | sklearn train_test_split | This sampler is a direct passthrough to sklearn's train_test_split. |
Scaffold | 'scaffold' | Extrapolative | chemprop's scaffold_split | This sampler is |
Sphere Exclusion | 'sphere_exclusion' | Extrapolative | custom implementation | Variation on Sphere Exclusion for arbitrary-valued vectors |
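As a rough picture of what a sphere exclusion sampler does, the sketch below groups points into "spheres" of a fixed radius: pick an unassigned point as a center, claim everything within the radius, and repeat. Whole groups can then be assigned to the training or test set so that similar samples stay together. The function name, the Euclidean metric, and the choice of center are all assumptions; the custom implementation in `astartes` may differ in each of these respects.

```python
from math import dist


def sphere_exclusion_clusters(X, radius):
    """Illustrative only: greedy grouping of points into fixed-radius spheres."""
    unassigned = list(range(len(X)))
    clusters = []
    while unassigned:
        center = unassigned[0]  # assumption: the first unassigned point is the center
        members = [i for i in unassigned if dist(X[i], X[center]) <= radius]
        clusters.append(members)
        # Exclude everything inside the sphere from further consideration.
        unassigned = [i for i in unassigned if i not in members]
    return clusters


X = [[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.2, 5.0], [10.0, 0.0]]
clusters = sphere_exclusion_clusters(X, radius=1.0)
```

Here the five points fall into three groups: two tight pairs and one isolated point.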
Using the astartes.molecules Subpackage
After installing with `pip install astartes[molecules]`, one can import the new train/test splitting function like this: `from astartes.molecules import train_test_split_molecules`.
The usage of this function is identical to `train_test_split`, but with the addition of new arguments to control how the molecules are featurized:
train_test_split_molecules(
    smiles=smiles,
    y=y,
    test_size=0.2,
    train_size=0.8,
    fingerprint="daylight_fingerprint",
    fprints_hopts={
        "minPath": 2,
        "maxPath": 5,
        "fpSize": 200,
        "bitsPerHash": 4,
        "useHs": 1,
        "tgtDensity": 0.4,
        "minSize": 64,
    },
    splitter="random",
    hopts={
        "random_state": 42,
        "shuffle": True,
    },
)
Configuration options for the featurization scheme can be found in the documentation for AIMSim.
Online Documentation
Click here to read the documentation
Background
Rational Splitting Algorithms
Much machine learning work assigns data to training, validation, and test sets at random. An alternative is the use of so-called "rational" splitting algorithms, which divide the data into sets using some similarity-based criterion. Examples include Kennard-Stone, minimal test set dissimilarity, and sphere exclusion, as discussed by Tropsha et al., as well as DUPLEX, OptiSim, and D-optimal, as discussed in Applied Chemoinformatics: Achievements and Future Opportunities. Clustering-based splitting techniques, such as DBSCAN, have also been introduced.
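To make the idea of a similarity-based split concrete, here is a minimal sketch of the textbook Kennard-Stone selection: seed with the two mutually farthest points, then repeatedly add the point whose nearest already-selected neighbor is farthest away. The function name `kennard_stone_order` and the use of Euclidean distance are assumptions for illustration; this is not the `astartes` implementation.

```python
from math import dist


def kennard_stone_order(X):
    """Illustrative only: return sample indices in Kennard-Stone selection order."""
    n = len(X)
    d = [[dist(a, b) for b in X] for a in X]  # full pairwise distance matrix
    # Seed with the two points that are farthest from each other.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i0, j0 = max(pairs, key=lambda p: d[p[0]][p[1]])
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    while remaining:
        # Pick the point whose closest selected neighbor is farthest away.
        nxt = max(sorted(remaining), key=lambda k: min(d[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


X = [[0.0], [1.0], [2.0], [10.0], [5.0]]
order = kennard_stone_order(X)
train_idx, test_idx = order[:3], order[3:]  # first-selected points form the training set
```

Because the earliest selections cover the extremes of the data, the training set spans the feature space and the test points fall inside it, which is why this family of methods is deterministic rather than random.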
Sampling Algorithms
- Random
- Kennard-Stone (KS)
- Minimal Test Set Dissimilarity
- Sphere Exclusion
- DUPLEX
- OptiSim
- D-Optimal
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
- KMEANS Split
- SPXY
- RBM
- Time Split
Development
To install the most recent release of `astartes` for development purposes, use `pip install -e --target=. astartes[molecules]` or clone this repository. Pull requests are welcome!
Adding New Samplers
A new sampler should extend the abstract base class defined in `abstract_sampler.py`.
It can be as simple as a passthrough to another `train_test_split`, or it can be an original implementation that results in X and y being split into two lists. Take a look at `astartes/samplers/random_split.py` for a basic example!
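The general pattern of extending an abstract base class can be sketched as follows. Every name here (`SamplerSketch`, `_sample`, `split`, `RandomSketch`) is hypothetical and chosen only for illustration; the real interface is whatever `abstract_sampler.py` defines, so consult that file before writing a sampler.

```python
import random
from abc import ABC, abstractmethod


class SamplerSketch(ABC):
    """Hypothetical base class; NOT the real interface in abstract_sampler.py."""

    def __init__(self, X, y, **hopts):
        self.X, self.y, self.hopts = X, y, hopts
        # Subclasses decide the order in which samples are assigned.
        self.order = self._sample()

    @abstractmethod
    def _sample(self):
        """Return every sample index exactly once, training candidates first."""

    @staticmethod
    def _take(seq, idx):
        return [seq[i] for i in idx]

    def split(self, train_size=0.75):
        n_train = int(round(train_size * len(self.X)))
        train, test = self.order[:n_train], self.order[n_train:]
        return (self._take(self.X, train), self._take(self.X, test),
                self._take(self.y, train), self._take(self.y, test))


class RandomSketch(SamplerSketch):
    """Simplest possible sampler: a seeded shuffle of the indices."""

    def _sample(self):
        order = list(range(len(self.X)))
        random.Random(self.hopts.get("random_state")).shuffle(order)
        return order


X = [[float(i)] for i in range(8)]
y = list(range(8))
X_train, X_test, y_train, y_test = RandomSketch(X, y, random_state=42).split()
```

Keeping the splitting logic in the base class means a new sampler only has to supply an ordering of the samples.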
Adding New Featurization Schemes
All of the sampling methods implemented in `astartes` accept arbitrary arrays of numbers and return the sampled groups. If you have an existing featurization scheme (i.e., one that takes an arbitrary input and turns it into an array of numbers), we would be thrilled to include it in `astartes`.
Adding a new interface should take on this format:
import numpy as np

from astartes import train_test_split


def train_test_split_INTERFACE(
    INTERFACE_input,
    INTERFACE_ARGS,
    y: np.array = None,
    test_size: float = 0.25,
    train_size: float = 0.75,
    splitter: str = 'random',
    hopts: dict = {},
    INTERFACE_hopts: dict = {},
):
    # turn the INTERFACE_input into an input X
    # based on INTERFACE_ARGS, where INTERFACE_hopts
    # specifies additional behavior
    X = []
    # call train test split with this input
    return train_test_split(
        X,
        y=y,
        test_size=test_size,
        train_size=train_size,
        splitter=splitter,
        hopts=hopts,
    )
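The featurization step that the template leaves as `X = []` only needs to turn each input into a fixed-length list of numbers. A toy sketch, assuming string inputs and a made-up character-count scheme (the function `featurize_strings` and its `alphabet` option are hypothetical and not part of `astartes`):

```python
def featurize_strings(strings, alphabet="CNO"):
    """Illustrative only: count selected characters and append the string length."""
    return [[s.count(ch) for ch in alphabet] + [len(s)] for s in strings]


X = featurize_strings(["CCO", "CCN", "O"])
```

Each row of `X` has the same length regardless of the input, which is the property a sampler relies on when it computes distances between samples.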
JORS Branch
`paper.tex` is stored in a separate branch aptly named `jors-paper`. To push changes from the `main` branch into the `jors-paper` branch, run the `Update JORS Branch` workflow.