Python package for stratified dataset splitting based on endpoint distributions

Project description

Ivers

Ivers provides tools for splitting datasets while preserving endpoint distributions, and introduces two temporal split techniques: 'Leaky' and 'All for Free'. The splits are intended to reflect realistic usage scenarios and to support rigorous model evaluation. The library was used to generate the data splits for the research described in the linked paper.

Features

  • Temporal Leaky: Simulates real-world scenarios by allowing forward leakage in the data, which may subtly influence later models.
  • Temporal AllForFree: Enforces strict temporal separation, keeping the training data completely independent of the test set; suited to evaluating long-term model predictions.
  • Temporal Fold Split: Progressively increases the training set size across multiple folds while following the temporal order, so that models can be assessed as more data accumulates over time.
  • Stratified Endpoint Split: Splits the data so that the endpoint distribution stays consistent across the resulting subsets, which is useful in fields such as cheminformatics and bioinformatics.
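
To make the idea behind the Temporal Fold Split concrete, here is a minimal sketch of an expanding-window split in plain Python. It is not the Ivers implementation: the function name `expanding_temporal_folds`, its parameters, and the record layout are all hypothetical, chosen only to illustrate how the training set can grow fold by fold while every test record stays later in time than the training records.

```python
from datetime import date


def expanding_temporal_folds(records, date_key, n_folds):
    """Yield (train, test) lists where the training set grows with each fold.

    Hypothetical helper for illustration; not part of the Ivers API.
    Records are sorted by date and cut into n_folds + 1 contiguous chunks;
    fold i trains on the first i chunks and tests on the chunk that follows.
    """
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    ordered = sorted(records, key=lambda r: r[date_key])
    n_chunks = n_folds + 1
    if len(ordered) < n_chunks:
        raise ValueError("need at least n_folds + 1 records")
    # Chunk boundaries, spread as evenly as integer division allows.
    bounds = [len(ordered) * i // n_chunks for i in range(n_chunks + 1)]
    for i in range(1, n_chunks):
        train = ordered[: bounds[i]]
        test = ordered[bounds[i] : bounds[i + 1]]
        yield train, test


# Eight made-up records, one per month.
records = [{"id": i, "date": date(2020, 1 + i, 1)} for i in range(8)]
folds = list(expanding_temporal_folds(records, "date", n_folds=3))
for train, test in folds:
    print(len(train), "train /", len(test), "test")
```

Cutting on sorted positions rather than on calendar intervals keeps the test chunks the same size even when measurements are unevenly spread in time; a date-threshold variant would be an equally reasonable choice.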

Code Functions

The library includes several functions tailored for different splitting strategies:

  • stratify_endpoint, stratify_split_and_cv: Generate train/test and cross-validation splits that preserve the endpoint distribution.
  • leaky_endpoint_split, allforone_endpoint_split: Generate a single train/test split using the corresponding temporal strategy.
  • allforone_folds_endpoint_split, leaky_folds_endpoint_split: Generate multiple temporal folds, with the training set growing from one fold to the next.
  • balanced_scaffold_cv: Performs balanced scaffold cross-validation, keeping the folds representative of the data.
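
As a rough illustration of what "respecting the endpoint distribution" can mean, the sketch below groups rows by which endpoints have a measured value and applies the same test fraction within each group. This is an assumption about the general technique, not the behaviour or signature of `stratify_endpoint`; the function `stratified_endpoint_split`, the endpoint names `logD` and `sol`, and the data are made up for the example.

```python
import random
from collections import defaultdict


def stratified_endpoint_split(rows, endpoints, test_fraction=0.2, seed=0):
    """Split rows so every endpoint-availability pattern keeps the same test fraction.

    Hypothetical helper for illustration; not part of the Ivers API.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        # The stratum is the tuple of endpoints that have a measured value.
        pattern = tuple(e for e in endpoints if row.get(e) is not None)
        groups[pattern].append(row)
    train, test = [], []
    # Iterate over strata in sorted order so the result is reproducible.
    for pattern in sorted(groups):
        members = groups[pattern][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Ten rows measured only for logD, ten measured for both endpoints.
rows = [{"id": i, "logD": 1.0, "sol": None} for i in range(10)]
rows += [{"id": 10 + i, "logD": 2.0, "sol": 0.5} for i in range(10)]
train, test = stratified_endpoint_split(rows, ["logD", "sol"])
print(len(train), "train /", len(test), "test")
```

With this grouping, a sparsely measured endpoint cannot end up entirely in the training set or entirely in the test set, which is the failure mode a purely random split risks on multi-endpoint data.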

Integration with Chemprop

  • Enabling the chemprop configuration makes the library generate splits that are directly compatible with the Chemprop framework.

Getting Started or Contributing

To begin using Ivers, clone the repository and set up the necessary dependencies:

git clone https://github.com/IversOhlsson/ivers.git
cd ivers
pip install -r requirements.txt

Installation via pip

You can also install the package via pip:

pip install ivers

We welcome contributions! Feel free to open issues or pull requests on our GitHub repository.

Guide

Reference

When using this library, please cite the following paper:

@article{Ivers_1,
  title={PlaceHolder},
  author={PlaceHolder},
  journal={PlaceHolder},
  volume={PlaceHolder},
  number={PlaceHolder},
  pages={PlaceHolder},
  year={PlaceHolder},
  publisher={PlaceHolder}
}
