A package to compute hard out-of-distribution data splits for machine learning, challenging generalization of models.

DataSAIL: Data Splitting Against Information Leakage

DataSAIL, short for Data Splitting Against Information Leakage, is a versatile tool designed to partition data while minimizing similarities between the partitions. Inter-sample similarities can lead to information leakage, resulting in an overestimation of the model's performance in certain training regimes.
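To make the leakage problem concrete, the following toy sketch compares a random split with a split that keeps whole clusters of similar samples together. It is an illustration only and not DataSAIL's actual algorithm; the sample names, the cluster labels, and the helper functions (leaked_pairs, random_split, cluster_split) are all hypothetical.

```python
import random
from itertools import combinations


def leaked_pairs(assignment, cluster_of):
    """Count pairs of samples that share a cluster but sit in different splits."""
    return sum(
        1
        for a, b in combinations(sorted(assignment), 2)
        if cluster_of[a] == cluster_of[b] and assignment[a] != assignment[b]
    )


def random_split(samples, test_fraction, seed=0):
    """Assign samples to train/test at random, ignoring similarity."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return {s: ("test" if i < n_test else "train") for i, s in enumerate(shuffled)}


def cluster_split(samples, cluster_of, test_fraction):
    """Greedily assign whole clusters to test until the target size is reached."""
    clusters = {}
    for s in samples:
        clusters.setdefault(cluster_of[s], []).append(s)
    target = round(len(samples) * test_fraction)
    assignment, n_test = {}, 0
    # Visit small clusters first so the test split can get close to the target.
    for _, members in sorted(clusters.items(), key=lambda kv: (len(kv[1]), kv[0])):
        split = "test" if n_test + len(members) <= target else "train"
        if split == "test":
            n_test += len(members)
        for s in members:
            assignment[s] = split
    return assignment


# Hypothetical data: ten samples grouped into four clusters of similar items.
samples = [f"s{i}" for i in range(10)]
cluster_of = dict(zip(samples, ["c0"] * 4 + ["c1"] * 3 + ["c2"] * 2 + ["c3"]))

by_chance = random_split(samples, test_fraction=0.3)
by_cluster = cluster_split(samples, cluster_of, test_fraction=0.3)

print("leaked pairs, random split: ", leaked_pairs(by_chance, cluster_of))
print("leaked pairs, cluster split:", leaked_pairs(by_cluster, cluster_of))  # 0 by construction
```

Because the cluster-based variant never separates members of one cluster, no similar pair straddles the train/test boundary, whereas a random split usually places near-duplicates on both sides.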

DataSAIL was initially developed for machine learning workflows involving biological datasets, but its utility extends to any type of dataset. It can be used through a command-line interface or integrated as a Python package, making it accessible and user-friendly. The tool is licensed under the MIT license, ensuring it remains open source and freely available on GitHub.

Detailed documentation of the package, including explanations, examples, and much more, is available on DataSAIL's ReadTheDocs page.

Installation

DataSAIL is available for all modern versions of Python (v3.9 or newer). We ship two versions of DataSAIL:

  • DataSAIL: The full version of DataSAIL, which includes all third-party clustering algorithms and is available on conda for Linux and macOS (called datasail).
  • DataSAIL-lite: A lightweight version of DataSAIL, which does not include any third-party clustering algorithms and is available on PyPI (called datasail) and conda (called datasail-lite).

NOTE: There is a naming inconsistency between the conda and PyPI versions of DataSAIL. The lite version is called datasail-lite on conda but datasail on PyPI. This will be fixed in the future; until then, please be aware of the inconsistency.

Usage

DataSAIL is installed as a command-line tool. In the conda environment where DataSAIL is installed, you can run

datasail --e-type P --e-data <path_to_fasta> --e-sim mmseqs --output <path_to_output_path> --technique C1e

to split a set of proteins that have been clustered using mmseqs. For a full list of arguments, run datasail -h or check out ReadTheDocs, which offers a more detailed explanation of the arguments as well as example notebooks.

The runtime largely depends on the number and type of splits to be computed and on the size of the dataset. For small datasets (fewer than 10k samples), DataSAIL finishes within minutes. On large datasets (more than 100k samples), it can take several hours to complete.

Regardless of which installation command was used, DataSAIL can be executed by running

datasail -h

on the command line to see the parameters DataSAIL takes. DataSAIL can also be included directly as a regular package in your Python program using

from datasail.sail import datasail
splits = datasail(...)

For more information about the parameters, please read through the documentation page.
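If the command-line tool needs to be driven from a script, one option is to assemble the call shown above and hand it to subprocess. The sketch below is a minimal example under stated assumptions: the helper names (build_datasail_command, run_datasail) and the file names are hypothetical, and only the flags that appear in the command above are used.

```python
import shutil
import subprocess


def build_datasail_command(fasta_path, output_dir, technique="C1e", similarity="mmseqs"):
    """Assemble the CLI call from the Usage section as an argument list."""
    return [
        "datasail",
        "--e-type", "P",            # entity type: proteins
        "--e-data", str(fasta_path),
        "--e-sim", similarity,      # similarity / clustering method
        "--output", str(output_dir),
        "--technique", technique,
    ]


def run_datasail(fasta_path, output_dir):
    """Run the CLI; raises if the datasail executable is not on PATH."""
    if shutil.which("datasail") is None:
        raise RuntimeError("datasail executable not found; activate the right environment")
    command = build_datasail_command(fasta_path, output_dir)
    return subprocess.run(command, check=True)


# Hypothetical paths, used here only to show the resulting command line.
command = build_datasail_command("proteins.fasta", "splits_out")
print(" ".join(command))
# datasail --e-type P --e-data proteins.fasta --e-sim mmseqs --output splits_out --technique C1e
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces; for the full set of options, datasail -h remains the authoritative source.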

When to use DataSAIL and when not to use it

DataSAIL offers a variety of ways to split one-dimensional and multi-dimensional data, shown here for a generic protein property prediction task and a protein-ligand interaction prediction dataset.

The data split should always reflect the situation the model will face at inference time. If the model is intended to perform well on unseen data, the validation and test sets should contain only data that is new relative to the training set.
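As a quick sanity check of whether a given split matches the intended inference scenario, one can list which entities occur on both sides of it. The sketch below does this for a protein-ligand interaction dataset; the function overlap_report and the identifiers are hypothetical and not part of DataSAIL, and the check covers exact identity only, not similarity.

```python
def overlap_report(train_pairs, test_pairs):
    """List proteins and ligands that appear in both train and test interactions."""
    train_proteins = {protein for protein, _ in train_pairs}
    train_ligands = {ligand for _, ligand in train_pairs}
    test_proteins = {protein for protein, _ in test_pairs}
    test_ligands = {ligand for _, ligand in test_pairs}
    return {
        "shared_proteins": sorted(train_proteins & test_proteins),
        "shared_ligands": sorted(train_ligands & test_ligands),
    }


# Hypothetical (protein, ligand) interaction pairs.
train = [("P1", "L1"), ("P1", "L2"), ("P2", "L1")]

# One-dimensional split: the test protein is new, the ligands are already known.
test_new_proteins = [("P3", "L1"), ("P3", "L2")]
# Two-dimensional split: both the protein and the ligand are new.
test_new_both = [("P3", "L3")]

report_1d = overlap_report(train, test_new_proteins)
report_2d = overlap_report(train, test_new_both)

print(report_1d)  # {'shared_proteins': [], 'shared_ligands': ['L1', 'L2']}
print(report_2d)  # {'shared_proteins': [], 'shared_ligands': []}
```

The first test set is appropriate if the model will only ever see new proteins paired with known ligands; the second is needed if both sides of an interaction can be unseen at inference time.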

For more information, please see our guidelines for selecting data splits in the documentation.

Citation

If you used DataSAIL to split your data, please cite DataSAIL in your publication.

@article{joeres2025datasail,
  title={Data splitting to avoid information leakage with DataSAIL},
  author={Joeres, Roman and Blumenthal, David B. and Kalinina, Olga V.},
  journal={Nature Communications},
  volume={16},
  pages={3337},
  year={2025},
  doi={10.1038/s41467-025-58606-8},
}
