Skip to main content

Chemistry-aware splitting of large datasets.

Project description

eluent

GitHub Workflow Status (with branch) PyPI - Python Version PyPI

eluent is a Python library and command-line tool for chemistry-aware splitting of large datasets for machine learning. It provides three grouping strategies — random, Murcko scaffold, and approximate spectral clustering via FAISS — along with out-of-core percentile annotation. All operations work on datasets that don't fit in memory, using 🤗 Hugging Face Datasets for lazy processing and caching.

Contents

Installation

The easy way

You can install the pre-compiled version directly using pip.

$ pip install eluent

For GPU-accelerated FAISS splitting, install the optional GPU extra:

$ pip install eluent[splits_gpu]

From source

Clone the repository, then cd into it. Then run:

$ pip install -e .

Usage

Input data formats

In all cases, the input dataset can be:

  • A path to a local file in CSV, Parquet, Arrow, or HF Dataset directory format (format is inferred from the file extension).
  • A remote dataset hosted on 🤗 Datasets Hub, specified as a hf:// URL:
hf://datasets/<owner>/<repo>~<config>:<split>

For example:

hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train

Output files are written in the format inferred from the extension of --output (CSV, Parquet, Arrow, or HF Dataset directory).

Command-line interface

eluent provides two sub-commands. Run eluent --help to see the top-level help:

usage: eluent [-h] [--version] {split,percentiles} ...

Chemistry-aware splitting of large datasets.

options:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit

Sub-commands:
  {split,percentiles}
    split               Make chemical train-test-val splits on out-of-core datasets.
    percentiles         Add columns indicating whether rows are in a percentile.

Dataset splitting (eluent split)

Partition a dataset into train / validation / test splits using one of three grouping methods. Groups are formed first, then packed into the requested split fractions using a bin-packing algorithm. Pass --seed for randomised packing; omit it for fully deterministic packing.

usage: eluent split [-h] [--type {random,scaffold,faiss}] [--n-neighbors N_NEIGHBORS]
                    [--start START] [--end END] [--structure STRUCTURE]
                    [--input-representation {smiles,selfies,inchi,aa_seq}]
                    [--plot PLOT] [--plot-sample PLOT_SAMPLE] [--plot-seed PLOT_SEED]
                    [--extras [EXTRAS ...]] [--seed SEED] [--cache CACHE]
                    --output OUTPUT [--batch BATCH] [--kfolds KFOLDS]
                    [--train TRAIN] [--validation VALIDATION] [--test TEST]
                    input_file

positional arguments:
  input_file            Input file or hf:// URL.

options:
  --type {random,scaffold,faiss}
                        Splitting method. Default: scaffold
  --structure, -S       Column containing chemical structure strings.
  --input-representation, -R {smiles,selfies,inchi,aa_seq}
                        Structure string type. Default: smiles
  --train               Fraction of examples for training. Required.
  --validation          Fraction of examples for validation. Default: infer.
  --test                Fraction of examples for test. Default: infer.
  --kfolds, -K          Number of k-folds (overrides --validation). Default: 1
  --seed, -i            Random seed. Omit for deterministic bin-packing.
  --n-neighbors, -k     Nearest neighbours for FAISS grouping. Default: 10
  --batch, -b           Batch size. Default: 16
  --cache               Cache directory. Default: current directory.
  --output, -o          Output filename. Required.
  --plot                Filename for UMAP embedding plot.
  --plot-sample, -n     Subsample size for UMAP. Default: 20000
  --plot-seed, -e       Random seed for UMAP. Default: 42
  --extras, -x          Extra columns to colour in the UMAP plot.
  --start               First row to process. Default: 0
  --end                 Last row to process. Default: end of dataset.
  -h, --help            show this help message and exit

Random splitting — each molecule is assigned to a group by hashing its row index. Groups are reproducible for the same --seed.

$ eluent split \
    hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
    --type random \
    --structure smiles \
    --train 0.7 \
    --validation 0.15 \
    --seed 42 \
    --output split/random.csv \
    --plot split/random-plot.png

Scaffold splitting — molecules are grouped by their Murcko scaffold, so structurally similar compounds always end up in the same split. This is the default and the recommended method for chemistry ML benchmarks.

$ eluent split \
    hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
    --type scaffold \
    --structure smiles \
    --train 0.7 \
    --validation 0.15 \
    --output split/scaffold.csv \
    --plot split/scaffold-plot.png

FAISS spectral splitting — Morgan fingerprints are computed for each molecule, a k-NN graph is built using FAISS binary index (Hamming distance), and connected components of the graph become groups. This produces splits where molecules in each component are all more similar to each other than to molecules in other components.

$ eluent split \
    hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
    --type faiss \
    --structure smiles \
    --n-neighbors 10 \
    --train 0.7 \
    --validation 0.15 \
    --seed 42 \
    --cache ./cache \
    --output split/faiss.csv \
    --plot split/faiss-plot.png

k-fold cross-validation — use --kfolds to generate multiple train/validation folds. The --validation fraction is split among folds; --train sets the remaining fraction. Output files are written to fold_0/, fold_1/, … sub-directories of the output path.

$ eluent split \
    hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
    --type scaffold \
    --structure smiles \
    --train 0.7 \
    --validation 0.15 \
    --kfolds 5 \
    --seed 42 \
    --output split/scaffold.csv

Percentile annotation (eluent percentiles)

Add boolean columns to a dataset flagging which rows fall within the top-k percentile of one or more numeric columns. Uses a T-Digest quantile sketch, so the full dataset never needs to be loaded into memory. A two-pass "definitely / maybe / definitely-not" strategy ensures exact counts.

usage: eluent percentiles [-h] [--columns [COLUMNS ...]] [--percentiles [PERCENTILES ...]]
                           [--reverse] [--compression COMPRESSION] [--delta DELTA]
                           [--start START] [--end END] [--cache CACHE] --output OUTPUT
                           [--batch BATCH] [--plot PLOT] [--structure STRUCTURE]
                           [--input-representation {smiles,selfies,inchi,aa_seq}]
                           [--plot-sample PLOT_SAMPLE] [--plot-seed PLOT_SEED]
                           [--extras [EXTRAS ...]]
                           input_file

positional arguments:
  input_file            Input file or hf:// URL.

options:
  --columns, -c         Columns to tag. Required.
  --percentiles, -p     Percentile thresholds. Default: 5
  --reverse, -r         Tag bottom percentiles instead of top.
  --compression, -z     T-Digest centroids (higher = more accurate). Default: 500
  --delta, -d           Buffer width around percentile cutoff. Default: 1.0
  --batch, -b           Batch size. Default: 16
  --cache               Cache directory. Default: current directory.
  --output, -o          Output filename. Required.
  --plot                Filename for UMAP embedding plot.
  --structure, -S       Structure column (required for --plot).
  --input-representation, -R {smiles,selfies,inchi,aa_seq}
                        Structure string type. Default: smiles
  --plot-sample, -n     Subsample size for UMAP. Default: 20000
  --plot-seed, -e       Random seed for UMAP. Default: 42
  --extras, -x          Extra columns to colour in the UMAP plot.
  --start               First row to process. Default: 0
  --end                 Last row to process. Default: end of dataset.
  -h, --help            show this help message and exit

Tag the top 1 %, 5 %, and 10 % of cLogP and TPSA values:

$ eluent percentiles \
    hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
    --columns clogp tpsa \
    --percentiles 1 5 10 \
    --batch 128 \
    --cache ./cache \
    --output percentiles/tagged.csv \
    --plot percentiles/tagged-plot.png \
    --structure smiles

For each requested column and percentile, a new boolean column is added to the output dataset with the name <column>_top_<percentile>_pc (e.g. clogp_top_5_pc).

Python API

Splitting datasets

The high-level entry point is split_dataset, which accepts a 🤗 Dataset or IterableDataset and returns a DatasetDict:

from datasets import load_dataset
from eluent.utils.splitting import split_dataset

ds = load_dataset(
    "scbirlab/fang-2023-biogen-adme",
    "scaffold-split",
    split="train",
)

# Random split — 70 % train, 15 % validation, 15 % test
split = split_dataset(
    ds,
    method="random",
    structure_column="smiles",
    train=0.7,
    validation=0.15,
    seed=42,
)
# split["train"], split["validation"], split["test"]

Pass method="scaffold" or method="faiss" for chemistry-aware grouping:

# Scaffold split
split = split_dataset(
    ds,
    method="scaffold",
    structure_column="smiles",
    train=0.7,
    validation=0.15,
)

# FAISS spectral split
split = split_dataset(
    ds,
    method="faiss",
    structure_column="smiles",
    n_neighbors=10,
    train=0.7,
    validation=0.15,
    seed=42,
    cache="./cache",
)

For k-fold cross-validation, pass kfolds > 1. The function then returns a tuple of DatasetDicts, one per fold:

folds = split_dataset(
    ds,
    method="scaffold",
    structure_column="smiles",
    train=0.7,
    validation=0.15,
    kfolds=5,
)
for i, fold in enumerate(folds):
    train_ds = fold["train"]
    val_ds   = fold["validation"]

For finer control, use the SplitDataset class directly:

from eluent.utils.splitting.splitter import SplitDataset

sd = SplitDataset(ds)
sd.group_and_split(
    method="scaffold",
    structure_column="smiles",
    train=0.7,
    validation=0.15,
)
result = sd.dataset  # DatasetDict

Annotating percentiles

from datasets import load_dataset
from eluent.utils.splitting.top_k import percentiles

ds = load_dataset(
    "scbirlab/fang-2023-biogen-adme",
    "scaffold-split",
    split="train",
    streaming=True,
)

tagged = percentiles(
    ds=ds,
    q={"clogp": [1, 5, 10], "tpsa": [5]},
    compression=500,  # T-Digest centroids
    delta=1.0,        # buffer width around cutoff
    cache="./cache",
)
# New columns: clogp_top_1_pc, clogp_top_5_pc, clogp_top_10_pc, tpsa_top_5_pc

Issues, problems, suggestions

Add to the issue tracker.

Documentation

(To come at ReadTheDocs.)

Roadmap

The following features are planned for future releases:

  • Additional grouping methods — pharmacophore-based, reaction-centre-based, and taxonomy-based grouping strategies, alongside the current random, scaffold, and FAISS methods.
  • Additional FAISS featurizers — plug-in support for alternative fingerprint types (ECFP with configurable radius, MACCS keys, RDKit topological) and learned molecular embeddings, so the k-NN graph can be built on richer similarity measures.
  • Additional split strategies — stratified splitting to preserve target-label class balance across splits; time-based splits for temporal datasets.
  • Multi-group / hierarchical splitting — chain multiple grouping passes (e.g. scaffold then random) to produce nested or combined splits in a single command.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

eluent-0.0.2.tar.gz (28.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

eluent-0.0.2-py3-none-any.whl (30.7 kB view details)

Uploaded Python 3

File details

Details for the file eluent-0.0.2.tar.gz.

File metadata

  • Download URL: eluent-0.0.2.tar.gz
  • Upload date:
  • Size: 28.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for eluent-0.0.2.tar.gz
Algorithm Hash digest
SHA256 49bf5f34de17df77ab1b512cd28e84137b79d0288112f13ba8435a871218fb8d
MD5 71c36040acdd1b0423dd936d8982bb78
BLAKE2b-256 92ae54ccb27d5f3e36da133136bafc266214244e2096f8bf0483efbd7eaed440

See more details on using hashes here.

File details

Details for the file eluent-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: eluent-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 30.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for eluent-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 d523c1043627b6146010b03bdddf9145f046ec6e1a9a2f715a882ccf67ebcfd9
MD5 e343ed2ac43b47f3aae7810c54197c2b
BLAKE2b-256 c2814a643187deb31722c305c2766fd957a540d62ebc64eb2c95264386136c3b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page