Clustering and dataset splitting for chemical data.
Chalcedon
Fast, memory-efficient Butina clustering and train/validation/test splitting for chemical datasets. Splitting by cluster keeps structurally similar molecules in the same partition, which reduces data leakage and gives a more realistic picture of how well your models generalize.
Installation
uv pip install chalcedon
Quick start
Recommended
The recommended workflow starts directly from SMILES strings. Chalcedon computes Morgan fingerprints (radius 2, 2048 bits) internally and clusters in float32:
import chalcedon

smiles = [
    "CCO",
    "c1ccccc1",
    # ...your dataset
]

splits = chalcedon.butina_split(
    smiles,
    fractions={"train": 0.8, "val": 0.1, "test": 0.1},
    cutoff=0.65,
    dtype="float32",  # or np.float32
)

train_smiles = splits["train"]
val_smiles = splits["val"]
test_smiles = splits["test"]
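For readers new to Butina clustering, the following is a minimal pure-Python sketch of the general idea, not Chalcedon's implementation. It assumes fingerprints are given as sets of on-bit indices and uses an explicit similarity threshold; `butina_sketch`, `tanimoto`, and `sim_threshold` are illustrative names, and whether Chalcedon's `cutoff` is a distance or a similarity is not stated here.

```python
def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina_sketch(fps: list[frozenset[int]], sim_threshold: float) -> list[int]:
    """Return one cluster id per fingerprint (illustrative, O(n^2) time and memory)."""
    n = len(fps)
    # Neighbor lists: every other item at or above the similarity threshold.
    neighbors = [
        [j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= sim_threshold]
        for i in range(n)
    ]
    # Visit candidate centroids from most to fewest neighbors (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    cluster_ids = [-1] * n
    next_id = 0
    for i in order:
        if cluster_ids[i] != -1:
            continue  # already absorbed into an earlier cluster
        cluster_ids[i] = next_id
        for j in neighbors[i]:
            if cluster_ids[j] == -1:
                cluster_ids[j] = next_id
        next_id += 1
    return cluster_ids


# Hypothetical toy fingerprints: two obvious groups.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
]
ids = butina_sketch(fps, sim_threshold=0.5)
print(ids)  # [0, 0, 0, 1, 1]
```

Because every member of a cluster is a neighbor of its centroid, assigning whole clusters to a single split keeps close analogues out of the held-out sets.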
Using custom descriptors
We recommend dtype="float64" for non-binary descriptors, where dot-product magnitudes can exceed the range that float32 represents exactly.
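As a small illustration of that limit, the sketch below round-trips values through float32 using only the standard library. It relies on the fact that float32 has a 24-bit significand, so not every integer above 2**24 is representable; `to_float32` is a helper written for this example and is not part of Chalcedon.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


limit = 2**24  # float32 has a 24-bit significand

print(to_float32(float(limit)) == limit)          # True: still exact
print(to_float32(float(limit + 1)) == limit + 1)  # False: rounded to a neighboring value
```

Binary 2048-bit fingerprints stay far below this limit, since their dot products are counts of at most 2048. Real-valued or count-based descriptors can produce much larger dot products, which is where the extra precision of float64 helps.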
import chalcedon

descriptors = my_descriptor_generator(molecules)  # numpy.ndarray of shape (n, d)
cluster_ids = chalcedon.butina_cluster(descriptors, cutoff=0.65, dtype="float64")

splits = chalcedon.greedy_cluster_split(
    cluster_ids,
    fractions={"train": 0.8, "val": 0.1, "test": 0.1},
)

train_indices = splits["train"]  # numpy.ndarray of indices into `descriptors`
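To make the splitting step concrete, here is one possible greedy strategy in plain Python. It is an assumption for illustration only and may differ from what `greedy_cluster_split` actually does: whole clusters are assigned, largest first, to whichever split is currently furthest below its target size. `greedy_split_sketch` is a hypothetical name.

```python
from collections import defaultdict


def greedy_split_sketch(
    cluster_ids: list[int], fractions: dict[str, float]
) -> dict[str, list[int]]:
    """Assign whole clusters to splits, largest cluster first."""
    # Group item indices by cluster id.
    members: dict[int, list[int]] = defaultdict(list)
    for index, cid in enumerate(cluster_ids):
        members[cid].append(index)

    n = len(cluster_ids)
    targets = {name: frac * n for name, frac in fractions.items()}
    splits: dict[str, list[int]] = {name: [] for name in fractions}

    # Each cluster goes to the split with the largest remaining deficit.
    for cid in sorted(members, key=lambda c: (-len(members[c]), c)):
        name = max(splits, key=lambda s: targets[s] - len(splits[s]))
        splits[name].extend(members[cid])
    return splits


# Hypothetical cluster labels for ten molecules.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
splits = greedy_split_sketch(cluster_ids, {"train": 0.6, "val": 0.2, "test": 0.2})
print({name: len(indices) for name, indices in splits.items()})
```

Since clusters are never divided, the resulting split sizes only approximate the requested fractions, and the approximation is coarser when a few clusters are very large.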
pairwise_tanimoto(fingerprints) is also exposed if you only need the similarity matrix.
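For reference, a Tanimoto similarity matrix can be written in dot-product form as T(a, b) = a·b / (a·a + b·b − a·b). The sketch below implements that formula in plain Python for dense vectors; it is not Chalcedon's code, and `pairwise_tanimoto_sketch` is a hypothetical name chosen to avoid implying anything about the real function's signature or return type.

```python
def pairwise_tanimoto_sketch(vectors: list[list[float]]) -> list[list[float]]:
    """Tanimoto matrix via T(a, b) = a.b / (a.a + b.b - a.b)."""

    def dot(u: list[float], v: list[float]) -> float:
        return sum(x * y for x, y in zip(u, v))

    norms = [dot(v, v) for v in vectors]
    n = len(vectors)
    matrix = [[1.0] * n for _ in range(n)]  # self-similarity is 1 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            ab = dot(vectors[i], vectors[j])
            denom = norms[i] + norms[j] - ab
            # Two all-zero vectors are treated as identical (a convention of this sketch).
            sim = ab / denom if denom else 1.0
            matrix[i][j] = matrix[j][i] = sim
    return matrix


# Hypothetical 4-bit fingerprints.
fingerprints = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
sim = pairwise_tanimoto_sketch(fingerprints)
```

The full matrix needs O(n²) memory, so for large datasets it is usually only practical to materialize it for subsets or for inspection.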
Benchmarks
Chalcedon can quickly build Butina clusters for large chemical datasets on consumer hardware, with near-linear memory scaling.
See benchmarks/report.md for a detailed analysis of algorithm performance and benchmarks/ to reproduce results.
Citation
If you use Chalcedon in your research, please cite:
@software{chalcedon,
  title = {Chalcedon: Clustering and dataset splitting for chemical data.},
  year = {2026},
  url = {https://github.com/rowansci/chalcedon}
}
Acknowledgements
- RDKit for cheminformatics infrastructure and the CrystalFF torsion library (Riniker & Landrum, J. Chem. Inf. Model. 56, 2016)
- GEOM dataset for the benchmark SMILES (Axelrod & Gomez-Bombarelli, Sci Data 9, 185, 2022)
This package was created with Cookiecutter and the jevandezande/uv-cookiecutter project template.