eluent
eluent is a Python library and command-line tool for chemistry-aware splitting of large datasets for machine learning. It provides three grouping strategies — random, Murcko scaffold, and approximate spectral clustering via FAISS — along with out-of-core percentile annotation. All operations work on datasets that don't fit in memory, using 🤗 Hugging Face Datasets for lazy processing and caching.
Installation
The easy way
Install the latest release directly from PyPI using pip:
$ pip install eluent
For GPU-accelerated FAISS splitting, install the optional GPU extra:
$ pip install eluent[splits_gpu]
From source
Clone the repository and cd into it, then run:
$ pip install -e .
Usage
Input data formats
In all cases, the input dataset can be:
- A path to a local file in CSV, Parquet, Arrow, or HF Dataset directory format (format is inferred from the file extension).
- A remote dataset hosted on the 🤗 Datasets Hub, specified as an hf:// URL:
hf://datasets/<owner>/<repo>~<config>:<split>
For example:
hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train
Output files are written in the format inferred from the extension of --output
(CSV, Parquet, Arrow, or HF Dataset directory).
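For illustration, such a URL can be decomposed with a small regular expression. Note that `parse_hf_url` is a hypothetical helper sketched here for clarity, not part of eluent's API:

```python
import re

# Illustrative only: eluent resolves these URLs internally.
HF_URL = re.compile(
    r"^hf://datasets/(?P<owner>[^/]+)/(?P<repo>[^~:]+)"
    r"(?:~(?P<config>[^:]+))?(?::(?P<split>.+))?$"
)

def parse_hf_url(url: str) -> dict:
    """Split an hf:// dataset URL into owner, repo, optional config and split."""
    m = HF_URL.match(url)
    if m is None:
        raise ValueError(f"Not an hf:// dataset URL: {url}")
    return m.groupdict()

parts = parse_hf_url("hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train")
# parts == {"owner": "scbirlab", "repo": "fang-2023-biogen-adme",
#           "config": "scaffold-split", "split": "train"}
```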
Command-line interface
eluent provides two sub-commands. Run eluent --help to see the top-level help:
usage: eluent [-h] [--version] {split,percentiles} ...
Chemistry-aware splitting of large datasets.
options:
-h, --help show this help message and exit
--version, -v show program's version number and exit
Sub-commands:
{split,percentiles}
split Make chemical train-test-val splits on out-of-core datasets.
percentiles Add columns indicating whether rows are in a percentile.
Dataset splitting (eluent split)
Partition a dataset into train / validation / test splits using one of three grouping methods.
Groups are formed first, then packed into the requested split fractions using a bin-packing
algorithm. Pass --seed for randomised packing; omit it for fully deterministic packing.
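The group-packing step can be sketched as a greedy fraction-filling pass over whole groups, largest first. This is a simplified illustration, not eluent's exact algorithm, and `pack_groups` is a hypothetical helper:

```python
from typing import Dict, List

def pack_groups(group_sizes: Dict[str, int],
                fractions: Dict[str, float]) -> Dict[str, List[str]]:
    """Greedily assign whole groups to splits, largest group first.

    A minimal sketch of fraction-respecting packing; eluent's actual
    bin-packing (and its --seed randomisation) may differ.
    """
    total = sum(group_sizes.values())
    targets = {name: frac * total for name, frac in fractions.items()}
    filled = {name: 0 for name in fractions}
    assignment: Dict[str, List[str]] = {name: [] for name in fractions}
    for group, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        # Send the group to the split furthest below its target size.
        dest = max(targets, key=lambda name: targets[name] - filled[name])
        assignment[dest].append(group)
        filled[dest] += size
    return assignment

splits = pack_groups(
    {"g1": 50, "g2": 30, "g3": 10, "g4": 10},
    {"train": 0.7, "validation": 0.15, "test": 0.15},
)
# splits["train"] == ["g1", "g2"]; g3 and g4 fill validation and test.
```

Because whole groups are assigned, the achieved fractions only approximate the requested ones when groups are large relative to the dataset.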
usage: eluent split [-h] [--type {random,scaffold,faiss}] [--n-neighbors N_NEIGHBORS]
[--start START] [--end END] [--structure STRUCTURE]
[--input-representation {smiles,selfies,inchi,aa_seq}]
[--plot PLOT] [--plot-sample PLOT_SAMPLE] [--plot-seed PLOT_SEED]
[--extras [EXTRAS ...]] [--seed SEED] [--cache CACHE]
--output OUTPUT [--batch BATCH] [--kfolds KFOLDS]
[--train TRAIN] [--validation VALIDATION] [--test TEST]
input_file
positional arguments:
input_file Input file or hf:// URL.
options:
--type {random,scaffold,faiss}
Splitting method. Default: scaffold
--structure, -S Column containing chemical structure strings.
--input-representation, -R {smiles,selfies,inchi,aa_seq}
Structure string type. Default: smiles
--train Fraction of examples for training. Required.
--validation Fraction of examples for validation. Default: infer.
--test Fraction of examples for test. Default: infer.
--kfolds, -K Number of k-folds (overrides --validation). Default: 1
--seed, -i Random seed. Omit for deterministic bin-packing.
--n-neighbors, -k Nearest neighbours for FAISS grouping. Default: 10
--batch, -b Batch size. Default: 16
--cache Cache directory. Default: current directory.
--output, -o Output filename. Required.
--plot Filename for UMAP embedding plot.
--plot-sample, -n Subsample size for UMAP. Default: 20000
--plot-seed, -e Random seed for UMAP. Default: 42
--extras, -x Extra columns to colour in the UMAP plot.
--start First row to process. Default: 0
--end Last row to process. Default: end of dataset.
-h, --help show this help message and exit
Random splitting — each molecule is assigned to a group by hashing its row index.
Groups are reproducible for the same --seed.
$ eluent split \
hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
--type random \
--structure smiles \
--train 0.7 \
--validation 0.15 \
--seed 42 \
--output split/random.csv \
--plot split/random-plot.png
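The seeded, index-hashing assignment described above can be sketched as follows; `random_group` is an illustrative stand-in, not part of eluent's API:

```python
import hashlib

def random_group(row_index: int, seed: int, n_groups: int = 100) -> int:
    """Map a row index to a reproducible group via a seeded hash.

    Illustrative only: eluent's actual hashing scheme may differ.
    """
    digest = hashlib.sha256(f"{seed}:{row_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_groups

groups = [random_group(i, seed=42) for i in range(5)]
# The same seed yields the same groups on every run and every machine;
# a different seed reshuffles the assignment.
```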
Scaffold splitting — molecules are grouped by their Murcko scaffold, so structurally similar compounds always end up in the same split. This is the default and the recommended method for chemistry ML benchmarks.
$ eluent split \
hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
--type scaffold \
--structure smiles \
--train 0.7 \
--validation 0.15 \
--output split/scaffold.csv \
--plot split/scaffold-plot.png
FAISS spectral splitting — Morgan fingerprints are computed for each molecule, a k-NN graph is built using FAISS binary index (Hamming distance), and connected components of the graph become groups. This produces splits where molecules in each component are all more similar to each other than to molecules in other components.
$ eluent split \
hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
--type faiss \
--structure smiles \
--n-neighbors 10 \
--train 0.7 \
--validation 0.15 \
--seed 42 \
--cache ./cache \
--output split/faiss.csv \
--plot split/faiss-plot.png
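The connected-components step can be sketched in pure Python with union-find. In eluent the neighbour lists come from a FAISS binary index over Morgan fingerprints; `connected_components` here is an illustrative stand-in operating on a precomputed adjacency dict:

```python
from typing import Dict, List

def connected_components(neighbors: Dict[int, List[int]]) -> List[List[int]]:
    """Union-find over a k-NN adjacency list; each component becomes one group."""
    parent = {i: i for i in neighbors}

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, nbrs in neighbors.items():
        for j in nbrs:
            parent[find(i)] = find(j)  # union the two components

    components: Dict[int, List[int]] = {}
    for i in neighbors:
        components.setdefault(find(i), []).append(i)
    return sorted(components.values())

# Two clusters of mutually-near molecules: {0, 1, 2} and {3, 4}.
groups = connected_components({0: [1], 1: [2], 2: [0], 3: [4], 4: [3]})
# groups == [[0, 1, 2], [3, 4]]
```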
k-fold cross-validation — use --kfolds to generate multiple train/validation folds.
The --validation fraction is split among folds; --train sets the remaining fraction.
Output files are written to fold_0/, fold_1/, … sub-directories of the output path.
$ eluent split \
hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
--type scaffold \
--structure smiles \
--train 0.7 \
--validation 0.15 \
--kfolds 5 \
--seed 42 \
--output split/scaffold.csv
Percentile annotation (eluent percentiles)
Add boolean columns to a dataset flagging which rows fall within the top-k percentile of one or more numeric columns. Uses a T-Digest quantile sketch, so the full dataset never needs to be loaded into memory. A two-pass "definitely / maybe / definitely-not" strategy ensures exact counts.
usage: eluent percentiles [-h] [--columns [COLUMNS ...]] [--percentiles [PERCENTILES ...]]
[--reverse] [--compression COMPRESSION] [--delta DELTA]
[--start START] [--end END] [--cache CACHE] --output OUTPUT
[--batch BATCH] [--plot PLOT] [--structure STRUCTURE]
[--input-representation {smiles,selfies,inchi,aa_seq}]
[--plot-sample PLOT_SAMPLE] [--plot-seed PLOT_SEED]
[--extras [EXTRAS ...]]
input_file
positional arguments:
input_file Input file or hf:// URL.
options:
--columns, -c Columns to tag. Required.
--percentiles, -p Percentile thresholds. Default: 5
--reverse, -r Tag bottom percentiles instead of top.
--compression, -z T-Digest centroids (higher = more accurate). Default: 500
--delta, -d Buffer width around percentile cutoff. Default: 1.0
--batch, -b Batch size. Default: 16
--cache Cache directory. Default: current directory.
--output, -o Output filename. Required.
--plot Filename for UMAP embedding plot.
--structure, -S Structure column (required for --plot).
--input-representation, -R {smiles,selfies,inchi,aa_seq}
Structure string type. Default: smiles
--plot-sample, -n Subsample size for UMAP. Default: 20000
--plot-seed, -e Random seed for UMAP. Default: 42
--extras, -x Extra columns to colour in the UMAP plot.
--start First row to process. Default: 0
--end Last row to process. Default: end of dataset.
-h, --help show this help message and exit
Tag the top 1 %, 5 %, and 10 % of cLogP and TPSA values:
$ eluent percentiles \
hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train \
--columns clogp tpsa \
--percentiles 1 5 10 \
--batch 128 \
--cache ./cache \
--output percentiles/tagged.csv \
--plot percentiles/tagged-plot.png \
--structure smiles
For each requested column and percentile, a new boolean column is added to the output dataset with
the name <column>_top_<percentile>_pc (e.g. clogp_top_5_pc).
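The two-pass "definitely / maybe / definitely-not" strategy can be sketched as follows. This is a simplification: a random sample stands in for eluent's T-Digest quantile sketch, and `tag_top_percentile` is a hypothetical helper, but the delta-buffered second pass mirrors the idea of only ranking values near the cutoff exactly:

```python
import random

def tag_top_percentile(values, percentile=5.0, delta=1.0, sample=1000, seed=0):
    """Tag exactly the top `percentile` % of values using two passes."""
    n = len(values)
    k = max(1, round(n * percentile / 100))        # exact number of rows to tag
    # Pass 1: approximate the cutoff from a sample (a T-Digest in eluent).
    rng = random.Random(seed)
    approx = sorted(rng.sample(values, min(sample, n)))
    cutoff = approx[-max(1, round(len(approx) * percentile / 100))]
    lo, hi = cutoff - delta, cutoff + delta        # 'maybe' buffer around cutoff
    # Pass 2: above the buffer is definitely in; inside it needs exact ranking.
    n_definite = sum(v > hi for v in values)
    maybes = sorted((v for v in values if lo <= v <= hi), reverse=True)
    need = k - n_definite                          # how many maybes to promote
    threshold = maybes[need - 1] if need > 0 else hi
    return [v > hi or (need > 0 and lo <= v <= hi and v >= threshold)
            for v in values]

tags = tag_top_percentile(list(range(100)))        # top 5 % of 0..99
# sum(tags) == 5: rows with values 95..99 are tagged.
```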
Python API
Splitting datasets
The high-level entry point is split_dataset, which accepts a 🤗 Dataset or IterableDataset
and returns a DatasetDict:
from datasets import load_dataset
from eluent.utils.splitting import split_dataset
ds = load_dataset(
"scbirlab/fang-2023-biogen-adme",
"scaffold-split",
split="train",
)
# Random split — 70 % train, 15 % validation, 15 % test
split = split_dataset(
ds,
method="random",
structure_column="smiles",
train=0.7,
validation=0.15,
seed=42,
)
# split["train"], split["validation"], split["test"]
Pass method="scaffold" or method="faiss" for chemistry-aware grouping:
# Scaffold split
split = split_dataset(
ds,
method="scaffold",
structure_column="smiles",
train=0.7,
validation=0.15,
)
# FAISS spectral split
split = split_dataset(
ds,
method="faiss",
structure_column="smiles",
n_neighbors=10,
train=0.7,
validation=0.15,
seed=42,
cache="./cache",
)
For k-fold cross-validation, pass kfolds > 1. The function then returns a tuple of
DatasetDicts, one per fold:
folds = split_dataset(
ds,
method="scaffold",
structure_column="smiles",
train=0.7,
validation=0.15,
kfolds=5,
)
for i, fold in enumerate(folds):
train_ds = fold["train"]
val_ds = fold["validation"]
For finer control, use the SplitDataset class directly:
from eluent.utils.splitting.splitter import SplitDataset
sd = SplitDataset(ds)
sd.group_and_split(
method="scaffold",
structure_column="smiles",
train=0.7,
validation=0.15,
)
result = sd.dataset # DatasetDict
Annotating percentiles
from datasets import load_dataset
from eluent.utils.splitting.top_k import percentiles
ds = load_dataset(
"scbirlab/fang-2023-biogen-adme",
"scaffold-split",
split="train",
streaming=True,
)
tagged = percentiles(
ds=ds,
q={"clogp": [1, 5, 10], "tpsa": [5]},
compression=500, # T-Digest centroids
delta=1.0, # buffer width around cutoff
cache="./cache",
)
# New columns: clogp_top_1_pc, clogp_top_5_pc, clogp_top_10_pc, tpsa_top_5_pc
Issues, problems, suggestions
Please report them on the issue tracker.
Documentation
(To come at ReadTheDocs.)
Roadmap
The following features are planned for future releases:
- Additional grouping methods — pharmacophore-based, reaction-centre-based, and taxonomy-based grouping strategies, alongside the current random, scaffold, and FAISS methods.
- Additional FAISS featurizers — plug-in support for alternative fingerprint types (ECFP with configurable radius, MACCS keys, RDKit topological) and learned molecular embeddings, so the k-NN graph can be built on richer similarity measures.
- Additional split strategies — stratified splitting to preserve target-label class balance across splits; time-based splits for temporal datasets.
- Multi-group / hierarchical splitting — chain multiple grouping passes (e.g. scaffold then random) to produce nested or combined splits in a single command.