Unofficial Python implementation of CSRML chemotype fingerprints (ToxPrint v2 and TxP_PFAS v1).

pyCSRML

pyCSRML is a pure-Python re-implementation of the Chemical Subgraph Representation Markup Language (CSRML). It parses CSRML XML files, converts the subgraph patterns to SMARTS, and computes binary chemotype fingerprints for molecules using RDKit.

The module is not an exact replica of the original CSRML implementation (see the Performance section). The original software should be preferred.

The module was implemented from two fingerprint descriptions:

Fingerprint	Bits	Description	Source
ToxPrint v2.0	729	General toxicologically relevant substructures	Yang et al. 2015
TxP_PFAS v1.0	129	Per- and polyfluoroalkyl substance (PFAS) chemotypes	Richard et al. 2023

Performance

Accuracy is measured by comparing pyCSRML bit vectors against the reference ChemoTyper tool output. Run pytest tests/test_chemotyper_concordance.py -v -s to reproduce; the full per-bit breakdown is written to tests/concordance_report.md.

Dataset	Compounds	Fingerprint	Overall accuracy	Bits ≥ 90 % acc	Macro MCC	Macro Bal Acc	Macro ROC-AUC
Richard et al. 2023 (PFAS set)	14 710	TxP_PFAS v1	99.99 %	129 / 129	0.9971	0.9989	0.9989
ToxCast (full)	9 014	ToxPrint v2	99.71 %	725 / 729	0.9326	0.9703	0.9703
ToxCast (CF-containing subset)	808	TxP_PFAS v1	99.98 %	129 / 129	0.9905	0.9924	0.9924
CLinventory	181 745	ToxPrint v2	99.77 %	726 / 729	0.9320	0.9710	0.9710
CLinventory	181 745	TxP_PFAS v1	100.00 %	129 / 129	0.9936	0.9946	0.9946

Reading the table: "CF-containing subset" means only the 808 ToxCast compounds for which ChemoTyper sets at least one TxP_PFAS bit — the meaningful subset for PFAS accuracy benchmarking. Full-dataset TxP_PFAS accuracy appears inflated (100 %) because the vast majority of compounds are all-zero for every PFAS bit.
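The inflation effect described above can be reproduced with a small calculation. The sketch below uses invented confusion counts for a single sparse bit (they are not taken from the concordance report) and a hypothetical helper, `bit_metrics`, that is not part of pyCSRML. It assumes the usual definitions of accuracy, balanced accuracy, and the Matthews correlation coefficient (MCC).

```python
import math

def bit_metrics(tp, fp, fn, tn):
    """Return (accuracy, balanced accuracy, MCC) for one fingerprint bit."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Balanced accuracy: mean of the recall on positives and on negatives.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced = (sensitivity + specificity) / 2
    # Matthews correlation coefficient; set to 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, balanced, mcc

# Illustrative counts: 10 000 compounds, the reference sets the bit for 10,
# and the re-implementation finds only 5 of them (no false positives).
acc, bal, mcc = bit_metrics(tp=5, fp=0, fn=5, tn=9990)
print(f"accuracy={acc:.4f}  balanced accuracy={bal:.2f}  MCC={mcc:.3f}")
```

With these counts the accuracy is 99.95 % even though half of the positives are missed, while balanced accuracy (0.75) and MCC (about 0.71) make the gap visible. This is why a subset with at least one PFAS bit set is the more informative benchmark.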

Known discrepancies

The four ToxPrint v2 bits below 90 % accuracy belong to the ring-heteroatom and chain chemotype groups; TxP_PFAS v1 has three bits below 100 % (all at or above 98.9 %).
Root causes (see tests/concordance_report.md):

Bit / category	Fingerprint	Accuracy	Direction	Root cause
`ring:hetero_[6]_Z_generic`	ToxPrint	54.6 %	False positives	Over-broad 6-membered heteroatom-ring SMARTS; pyCSRML prevalence 69.6 % vs ChemoTyper 24.3 %
`chain:alkaneBranch_isopropyl_C3`	ToxPrint	74.1 %	False positives	Ring-attachment SMARTS permissive on `noZ` (not-connected-to-heteroatom) modifier; pyCSRML prevalence 37.5 % vs ChemoTyper 11.6 %
`chain:alkaneCyclic_ethyl_C2_(connect_noZ)`	ToxPrint	75.9 %	False positives	Same `noZ` over-matching; pyCSRML prevalence 41.8 % vs ChemoTyper 17.7 %
`chain:alkeneCyclic_ethene_generic`	ToxPrint	87.0 %	False positives	Cyclic alkene SMARTS over-matches; pyCSRML prevalence 17.4 % vs ChemoTyper 10.3 %
`pfas_chain:alkeneLinear_mono-ene_ethylene_generic_F`	TxP_PFAS	98.9 %	False negatives (recall 40 %)	RDKit perceives the C=C of tautomeric fluoropyrimidines (5-fluorouracil) as aromatic; the SMARTS `[#9]-[#6;A]=[#6;A]` requires aliphatic atoms and misses them
`pfas_bond:C=N_imine_FCN`	TxP_PFAS	99.5 %	False negatives (recall 33 %)	Same aromaticity issue: the C=N bond in fluorinated heterocycles is perceived as aromatic by RDKit, so the aliphatic imine SMARTS does not match
`pfas_bond:aromatic_FCc1c`	TxP_PFAS	99.5 %	Slight false positives (precision 97.2 %)	Aromatic F-C pattern slightly over-matches due to SMARTS approximation of the exception clause
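The accuracy and prevalence figures in the table can be cross-checked against each other. If both prevalences and the accuracy refer to the same compound set (an assumption; the table does not state which dataset the prevalences come from), then the prevalence difference equals FP minus FN and the disagreement rate equals FP plus FN. The helper `error_rates` below is hypothetical and only illustrates this arithmetic.

```python
def error_rates(accuracy_pct, prev_pycsrml_pct, prev_reference_pct):
    """Split the disagreement rate into false-positive and false-negative shares.

    All arguments and results are percentages of the same compound set.
    """
    disagreement = 100.0 - accuracy_pct              # FP + FN
    excess = prev_pycsrml_pct - prev_reference_pct   # FP - FN
    fp = (disagreement + excess) / 2
    fn = (disagreement - excess) / 2
    return fp, fn

# (accuracy, pyCSRML prevalence, ChemoTyper prevalence) copied from the table.
rows = {
    "ring:hetero_[6]_Z_generic": (54.6, 69.6, 24.3),
    "chain:alkaneBranch_isopropyl_C3": (74.1, 37.5, 11.6),
    "chain:alkaneCyclic_ethyl_C2_(connect_noZ)": (75.9, 41.8, 17.7),
    "chain:alkeneCyclic_ethene_generic": (87.0, 17.4, 10.3),
}
for name, values in rows.items():
    fp, fn = error_rates(*values)
    print(f"{name}: FP ~ {fp:.2f} %, FN ~ {fn:.2f} %")
```

Under that assumption, the first three bits disagree almost exclusively through false positives, whereas `chain:alkeneCyclic_ethene_generic` would carry roughly 10 % false positives and 3 % false negatives. Because the published figures are rounded, results within about 0.1 percentage points of zero should be read as zero.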

Timing benchmark

Five molecule-size-stratified sets are extracted from the CLinventory and used to compare pyCSRML speed against ChemoTyper on realistic chemical diversity.

Set	Heavy-atom range	Molecules
`bench_tiny`	1 – 10	auto
`bench_small`	11 – 20	auto
`bench_medium`	21 – 35	auto
`bench_large`	36 – 60	auto
`bench_xlarge`	61 +	auto
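The stratification rule in the table amounts to a lookup on the heavy-atom count. The following is a minimal sketch of that rule only; `size_bin`, `UPPER_BOUNDS`, and `SET_NAMES` are illustrative names, and the actual logic in scripts/create_size_benchmarks.py may differ.

```python
from bisect import bisect_left

# Inclusive upper heavy-atom bound of every set except the open-ended last one.
UPPER_BOUNDS = [10, 20, 35, 60]
SET_NAMES = ["bench_tiny", "bench_small", "bench_medium", "bench_large", "bench_xlarge"]

def size_bin(n_heavy_atoms):
    """Map a heavy-atom count to the benchmark set whose range contains it."""
    if n_heavy_atoms < 1:
        raise ValueError("a molecule needs at least one heavy atom")
    # bisect_left returns the index of the first bound >= n_heavy_atoms,
    # or len(UPPER_BOUNDS) when the count exceeds every bound.
    return SET_NAMES[bisect_left(UPPER_BOUNDS, n_heavy_atoms)]

for n in (1, 10, 11, 35, 36, 60, 61, 150):
    print(n, size_bin(n))
```

With RDKit, the count passed to such a function could come from `mol.GetNumHeavyAtoms()`.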

Timing results (ms / molecule)

Set	Heavy atoms	pyCSRML ToxPrint v2	pyCSRML TxP_PFAS v1	ChemoTyper ToxPrint v2	ChemoTyper TxP_PFAS v1
bench_tiny	1 – 10	3.76	0.73	13.83	4.29
bench_small	11 – 20	5.47	1.01	27.70	7.74
bench_medium	21 – 35	8.23	1.53	59.63	17.87
bench_large	36 – 60	12.32	2.19	114.64	30.53
bench_xlarge	61 +	23.20	4.46	322.33	139.09

pyCSRML measured on Snapdragon X Elite X1E78100 (ARM64, 12 cores, ~32 GB RAM), Python 3.14.2, RDKit 2025.09.3, NumPy 2.3.5; 5 repetitions, median reported. ChemoTyper timings measured manually on the same machine, 3 repetitions, mean reported; values are of limited precision due to the manual measurement procedure. 500 molecules per set.
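To read the table as relative speed, the ChemoTyper timings can be divided by the pyCSRML timings. The snippet below only restates the figures from the table; `TIMINGS_MS` and `speedups` are illustrative names, and the ratios inherit the limited precision of the manual ChemoTyper measurements.

```python
# Timings in ms per molecule, copied from the table above:
# (pyCSRML ToxPrint, pyCSRML TxP_PFAS, ChemoTyper ToxPrint, ChemoTyper TxP_PFAS)
TIMINGS_MS = {
    "bench_tiny": (3.76, 0.73, 13.83, 4.29),
    "bench_small": (5.47, 1.01, 27.70, 7.74),
    "bench_medium": (8.23, 1.53, 59.63, 17.87),
    "bench_large": (12.32, 2.19, 114.64, 30.53),
    "bench_xlarge": (23.20, 4.46, 322.33, 139.09),
}

def speedups(timings):
    """Return {set: (ToxPrint ratio, TxP_PFAS ratio)} as ChemoTyper / pyCSRML."""
    return {
        name: (ct_tox / py_tox, ct_pfas / py_pfas)
        for name, (py_tox, py_pfas, ct_tox, ct_pfas) in timings.items()
    }

for name, (tox, pfas) in speedups(TIMINGS_MS).items():
    print(f"{name:13s} ToxPrint x{tox:.1f}   TxP_PFAS x{pfas:.1f}")
```

On these numbers the ToxPrint v2 ratio grows from about 3.7 for the smallest molecules to about 13.9 for the largest, and the TxP_PFAS v1 ratio from about 5.9 to about 31.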

How to reproduce

1. Extract benchmark sets (one-time):

python scripts/create_size_benchmarks.py

Outputs tests/test_data/size_benchmarks/bench_*.smiles, bench_metadata.csv, and chemotyper_timing_template.csv.

2. Time pyCSRML (saves pycsrml_timing_baseline.json):

python scripts/benchmark_pycsrml_timing.py          # 5 reps by default
python scripts/benchmark_pycsrml_timing.py --reps 3 # faster

3. Run ChemoTyper on each .smiles file (ToxPrint V2 and TxP_PFAS v1), export results as TSV, and place zips in tests/test_data/size_benchmarks/:

bench_tiny_toxprint.zip    bench_tiny_txppfas.zip
bench_small_toxprint.zip   bench_small_txppfas.zip
...

Fill in the three-repetition ChemoTyper timing in chemotyper_timing_template.csv.

4. Run regression tests:

pytest tests/test_benchmark_regression.py -v -m slow

Timing regression: fails if any set is >30 % slower than the saved baseline.
Accuracy regression: fails if overall bit accuracy drops >0.1 pp from baseline. Both test types skip gracefully until their respective baseline / zip files exist.
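The two thresholds can be expressed as simple comparisons. The functions below are a hypothetical sketch of the stated rules, not the code in tests/test_benchmark_regression.py; the baseline values are borrowed from the tables above and the "current" values are invented.

```python
def timing_regressed(current_ms, baseline_ms, tolerance=0.30):
    """True if the current timing is more than `tolerance` slower than baseline."""
    return current_ms > baseline_ms * (1.0 + tolerance)

def accuracy_regressed(current_pct, baseline_pct, max_drop_pp=0.1):
    """True if accuracy fell by more than `max_drop_pp` percentage points."""
    return (baseline_pct - current_pct) > max_drop_pp

# Timing: the limit for a baseline of 8.23 ms is 8.23 * 1.30 ms.
print(timing_regressed(current_ms=10.0, baseline_ms=8.23))   # within the limit
print(timing_regressed(current_ms=11.0, baseline_ms=8.23))   # beyond the limit

# Accuracy: a drop of 0.06 pp is tolerated, a drop of 0.21 pp is not.
print(accuracy_regressed(current_pct=99.65, baseline_pct=99.71))
print(accuracy_regressed(current_pct=99.50, baseline_pct=99.71))
```

Note that the accuracy rule is stated in percentage points (an absolute difference), whereas the timing rule is relative to the baseline.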

Installation

The module requires RDKit. If necessary, first install an environment manager (e.g. Conda/Mamba, such as Miniforge3) and create an environment, e.g.:

mamba create -n rdkit python
mamba activate rdkit
mamba install -y -c rdkit rdkit

Then install pyCSRML via PyPI:

pip install pyCSRML

Quick start

Single molecule (ToxPrint v2.0, 729 bits)

from pyCSRML import Fingerprinter, TOXPRINT_PATH
from rdkit import Chem

fp = Fingerprinter(TOXPRINT_PATH)

mol = Chem.MolFromSmiles("c1ccccc1")   # benzene
arr, names = fp.fingerprint(mol)

print(f"Bits set: {arr.sum()} / {fp.n_bits}")
on_bits = [name for name, bit in zip(names, arr) if bit]   # names of the bits that are set
print(on_bits[:5])

TxP_PFAS fingerprints (129 bits)

from pyCSRML import Fingerprinter, TXPPFAS_PATH
from rdkit import Chem

fp = Fingerprinter(TXPPFAS_PATH)
mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O")  # PFOA
arr, names = fp.fingerprint(mol)
print(f"Bits set: {arr.sum()} / {fp.n_bits}")

Batch fingerprinting

from pyCSRML import Fingerprinter, TXPPFAS_PATH
from rdkit import Chem

smiles_list = ["CCO", "FC(F)(F)C(=O)O"]   # example input: ethanol, trifluoroacetic acid
mols = [Chem.MolFromSmiles(s) for s in smiles_list]
fp = Fingerprinter(TXPPFAS_PATH)
matrix = fp.fingerprint_batch(mols)   # shape (n_mols, 129), dtype bool
print(matrix.shape)

Low-level CSRML parsing

from pyCSRML._csrml import parse_csrml_xml, ordered_bit_list

data = parse_csrml_xml("path/to/my_fingerprints.xml")
bits = ordered_bit_list(data)
for bit in bits[:3]:
    print(bit["id"], bit["smarts"])

API overview

Symbol	Module	Description
`Fingerprinter`	`pyCSRML`	Compute fingerprints from any CSRML XML, JSON, or YAML definition
`TOXPRINT_PATH`	`pyCSRML`	Path to the bundled ToxPrint v2.0 JSON (729 bits)
`TXPPFAS_PATH`	`pyCSRML`	Path to the bundled TxP_PFAS v1.0.4 JSON (129 bits)
`parse_csrml_xml`	`pyCSRML._csrml`	Parse raw CSRML XML → Python dict
`ordered_bit_list`	`pyCSRML._csrml`	Return all bits in order from a parsed dict

Full API reference: pycsrml.readthedocs.io

CSRML features supported

Feature	Status
`substructureMatch` → SMARTS	✅ Full
`substructureException` (global)	✅ Full
`matchingQueryAtom` → `[!$(...)]` folding	✅ Full
`combineAtomFeatures` (OR-of-AND trees)	✅ Full
`atomList` with `negate="true"`	✅ Full
`attachedHydrogenCount` ranges	✅ Full
`ringCountAtom` / `ringAtom` / `chainAtom`	✅ Full
Pseudo-elements G, Z, Q, X	✅ Full
`mustMatch` / `mustNotMatch` (test cases)	parsed, not used for matching
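The `[!$(...)]` folding listed above relies on recursive SMARTS: an atom expression can carry a negated environment, so an exception becomes part of the atom it applies to. The sketch below shows only the string composition; `fold_exception` is a hypothetical helper written for illustration, and the converter inside pyCSRML may build its patterns differently.

```python
def fold_exception(atom_primitive, exception_smarts):
    """Attach a 'must not sit in this environment' clause to one SMARTS atom.

    `atom_primitive` is the content of the atom brackets (e.g. "#6");
    `exception_smarts` is a pattern whose first atom is the atom being tested.
    """
    return f"[{atom_primitive};!$({exception_smarts})]"

# A carbon that is not a carbonyl carbon, bonded to fluorine.
carbon = fold_exception("#6", "[#6]=[#8]")
pattern = f"[#9]-{carbon}"
print(pattern)   # [#9]-[#6;!$([#6]=[#8])]
```

The resulting string can be passed to any SMARTS engine, for example `Chem.MolFromSmarts` in RDKit.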

Development

git clone https://github.com/LucMiaz/pyCSRML
cd pyCSRML
pip install -e ".[dev]"

# Run tests (fast)
pytest -m "not slow"

# Run concordance test (~45 s)
pytest tests/test_chemotyper_concordance.py -v -s

# Pylint
pylint pyCSRML/

Citation

If you use pyCSRML in academic work, please cite the original ToxPrint / ChemoTyper paper and the TxP_PFAS reference:

Yang, C., et al. (2015). New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and modelling. J. Chem. Inf. Model. 55, 510–528.
Richard, A.M., et al. (2023). ToxPrint chemotypes and ChemoTyper portal. Chem. Res. Toxicol. 36, 488–510.

Licence

Acknowledgments

This project is part of the ZeroPM project (WP2) and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.

Release history

Version	Released
0.4 (this version)	Apr 8, 2026
0.3	Apr 7, 2026
0.2	Apr 3, 2026
0.1.1	Apr 3, 2026
