
Algorithmic train:test splitting for molecules, images, and arbitrary arrays.


extended_train_test_split


Online Documentation

Click here to read the documentation

Background

Rational Splitting Algorithms

While much machine learning is done with a random split into training, validation, and test sets, an alternative is to use so-called "rational" splitting algorithms, which divide the data according to some measure of similarity between samples. Examples include Kennard-Stone, minimal test set dissimilarity, and sphere exclusion, as discussed by Tropsha et al., as well as DUPLEX, OptiSim, and D-optimal, as discussed in Applied Chemoinformatics: Achievements and Future Opportunities. Clustering-based splitting techniques, such as those built on DBSCAN, have also been introduced.
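To make the idea of a similarity-based split concrete, here is a minimal pure-Python sketch of the Kennard-Stone selection order using Euclidean distance. The function names kennard_stone_order and kennard_stone_split are illustrative assumptions and are not taken from this package; the sketch is quadratic in the number of samples and is meant for reading, not for large datasets.

```python
import math


def kennard_stone_order(points):
    """Return indices of `points` in Kennard-Stone selection order."""
    n = len(points)
    if n < 2:
        return list(range(n))
    # Seed with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    # Track each remaining point's distance to its nearest selected point.
    nearest = {
        i: min(math.dist(points[i], points[first]),
               math.dist(points[i], points[second]))
        for i in remaining
    }
    while remaining:
        # Next pick: the point farthest from everything selected so far.
        nxt = max(sorted(remaining), key=lambda i: nearest[i])
        selected.append(nxt)
        remaining.remove(nxt)
        del nearest[nxt]
        for i in remaining:
            nearest[i] = min(nearest[i], math.dist(points[i], points[nxt]))
    return selected


def kennard_stone_split(points, train_size=0.75):
    """Put the first-selected (most spread out) points in the training set."""
    order = kennard_stone_order(points)
    n_train = round(train_size * len(points))
    return order[:n_train], order[n_train:]
```

Because the earliest picks span the extremes of the data, the training indices cover the feature space and the held-out points tend to lie in its interior.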

Splitting Algorithms

  • Random
  • Kennard-Stone (KS)
  • Minimal Test Set Dissimilarity
  • Sphere Exclusion
  • DUPLEX
  • OptiSim
  • D-Optimal
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
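
As an illustration of one entry in the list above, the following is a rough sketch of one simple variant of sphere exclusion: pick a center, group every unassigned point within a fixed radius of it, and repeat until no points remain. Whole groups are then assigned to the training set until it is full. The function names, the choice of center (lowest remaining index), and the largest-first assignment rule are all assumptions made for this example, not a description of how this package implements the method.

```python
import math


def sphere_exclusion_clusters(points, radius):
    """Group point indices into spheres of the given radius."""
    unassigned = list(range(len(points)))
    clusters = []
    while unassigned:
        center = unassigned[0]  # simplest choice: lowest remaining index
        members = [
            i for i in unassigned
            if math.dist(points[i], points[center]) <= radius
        ]
        clusters.append(members)
        # Exclude everything inside the sphere from further consideration.
        unassigned = [i for i in unassigned if i not in members]
    return clusters


def split_by_clusters(clusters, n_samples, train_size=0.75):
    """Assign whole clusters to train (largest first) until it is full."""
    target = train_size * n_samples
    train, test = [], []
    for members in sorted(clusters, key=len, reverse=True):
        if len(train) + len(members) <= target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test
```

Keeping each sphere intact means near-duplicate samples never straddle the train/test boundary, at the cost of only approximately hitting the requested train_size.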

Extending Functionality

A new splitting method should follow this format:

from sklearn.model_selection import train_test_split

def random(
    X,
    y=None,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
):
    return train_test_split(
        X,
        y,
        test_size=test_size,
        train_size=train_size,
        random_state=random_state,
        shuffle=shuffle,
        stratify=stratify,
    )

It can be as simple as a passthrough to another train_test_split, or it can be an original implementation that results in X and y each being split into two lists.
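
For the second case, an original implementation, a minimal sketch might look like the following. The name ordered_split is hypothetical and not part of this package; it holds out the last fraction of the samples without shuffling and returns the pieces in the same order that scikit-learn's train_test_split uses.

```python
def ordered_split(X, y=None, test_size=0.25):
    """Hold out the last `test_size` fraction of samples, keeping order."""
    if len(X) == 0:
        raise ValueError("X must contain at least one sample")
    # Always hold out at least one sample.
    n_test = max(1, round(test_size * len(X)))
    cut = len(X) - n_test
    if y is None:
        return X[:cut], X[cut:]
    # Same return order as sklearn: X_train, X_test, y_train, y_test.
    return X[:cut], X[cut:], y[:cut], y[cut:]
```

A call such as ordered_split(list(range(8)), y=list(range(0, 16, 2))) keeps the first six samples for training and the last two for testing.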

A new interface should follow this format:

import numpy as np

from extended_train_test_split import train_test_split

def train_test_split_INTERFACE(
    INTERFACE_input,
    INTERFACE_ARGS,
    y: np.ndarray = None,
    test_size: float = 0.25,
    train_size: float = 0.75,
    splitter: str = 'random',
    hopts: dict = None,
    INTERFACE_hopts: dict = None,
):
    # use fresh dicts so defaults are not shared between calls
    hopts = {} if hopts is None else hopts
    INTERFACE_hopts = {} if INTERFACE_hopts is None else INTERFACE_hopts

    # turn the INTERFACE_input into an input X
    # based on INTERFACE_ARGS, where INTERFACE_hopts
    # specifies additional behavior
    X = []

    # call train_test_split with this input
    return train_test_split(
        X,
        y=y,
        test_size=test_size,
        train_size=train_size,
        splitter=splitter,
        hopts=hopts,
    )
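
As a concrete, self-contained illustration of the interface template above, the sketch below featurizes strings as character counts and then delegates the split. Everything here is hypothetical: train_test_split_text, its alphabet argument, and the split_fn parameter are invented for the example, and holdout_split is only a stand-in for the package's train_test_split so the code runs without the package installed.

```python
def holdout_split(X, y=None, test_size=0.25, splitter="random", hopts=None):
    """Stand-in for the package's train_test_split (an assumption).

    Ignores `splitter` and `hopts` and holds out the last samples.
    """
    n_test = max(1, round(test_size * len(X)))
    cut = len(X) - n_test
    if y is None:
        return X[:cut], X[cut:]
    return X[:cut], X[cut:], y[:cut], y[cut:]


def train_test_split_text(
    texts,
    alphabet,
    y=None,
    test_size=0.25,
    splitter="random",
    hopts=None,
    split_fn=holdout_split,
):
    """Hypothetical interface: featurize strings, then delegate the split."""
    # Turn each string into a vector of character counts over `alphabet`.
    X = [[text.count(ch) for ch in alphabet] for text in texts]
    return split_fn(
        X,
        y=y,
        test_size=test_size,
        splitter=splitter,
        hopts={} if hopts is None else hopts,
    )
```

The interface itself only builds X; which samples land in which set is decided entirely by the function it delegates to.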

JOSS Branch

paper.md is stored in a separate branch aptly named joss-paper. To push changes from the main branch into the joss-paper branch, run the Update JOSS Branch workflow.

