
Unofficial Python implementation of CSRML chemotype fingerprints (ToxPrint v2 and TxP_PFAS v1).


pyCSRML


pyCSRML is a pure-Python re-implementation of the Chemical Subgraph Representation Markup Language (CSRML). It parses CSRML XML files, converts the subgraph patterns to SMARTS, and computes binary chemotype fingerprints for molecules using RDKit.

The module is not an exact replica of the original CSRML implementation (see the Performance section). The original software should be preferred.

The module was implemented from two fingerprint definitions:

Fingerprint | Bits | Description | Source
ToxPrint v2.0 | 729 | General toxicologically relevant substructures | Yang et al. 2015
TxP_PFAS v1.0 | 129 | Per- and polyfluoroalkyl substance (PFAS) chemotypes | Richard et al. 2023

Performance

Accuracy is measured by comparing pyCSRML bit vectors against the reference ChemoTyper tool output. Run pytest tests/test_chemotyper_concordance.py -v -s to reproduce; the full per-bit breakdown is written to tests/concordance_report.md.

Dataset | Compounds | Fingerprint | Overall accuracy | Bits ≥ 90 % acc | Macro MCC | Macro Bal Acc | Macro ROC-AUC
Richard et al. 2023 (PFAS set) | 14 710 | TxP_PFAS v1 | 99.99 % | 129 / 129 | 0.9971 | 0.9989 | 0.9989
ToxCast (full) | 9 014 | ToxPrint v2 | 99.71 % | 725 / 729 | 0.9326 | 0.9703 | 0.9703
ToxCast (CF-containing subset) | 808 | TxP_PFAS v1 | 99.98 % | 129 / 129 | 0.9905 | 0.9924 | 0.9924
CLinventory | 181 745 | ToxPrint v2 | 99.77 % | 726 / 729 | 0.9320 | 0.9710 | 0.9710
CLinventory | 181 745 | TxP_PFAS v1 | 100.00 % | 129 / 129 | 0.9936 | 0.9946 | 0.9946

Reading the table: "CF-containing subset" means only the 808 ToxCast compounds for which ChemoTyper sets at least one TxP_PFAS bit — the meaningful subset for PFAS accuracy benchmarking. Full-dataset TxP_PFAS accuracy appears inflated (100 %) because the vast majority of compounds are all-zero for every PFAS bit.
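The inflation effect can be illustrated with a small, self-contained sketch. This is not pyCSRML's test code: the function names and the toy bit matrices are hypothetical. It compares overall bit accuracy on a full set with the accuracy on the subset of compounds for which the reference sets at least one bit.

```python
def overall_accuracy(pred, ref):
    """Fraction of (compound, bit) cells where the two fingerprints agree."""
    agree = total = 0
    for p_row, r_row in zip(pred, ref):
        for p, r in zip(p_row, r_row):
            agree += int(p == r)
            total += 1
    return agree / total


def active_subset(pred, ref):
    """Keep only compounds for which the reference sets at least one bit."""
    keep = [i for i, r_row in enumerate(ref) if any(r_row)]
    return [pred[i] for i in keep], [ref[i] for i in keep]


# Toy data (hypothetical): 2 PFAS-like compounds plus 98 compounds with no bit set.
ref = [[1, 1, 0], [0, 1, 1]] + [[0, 0, 0]] * 98
pred = [[1, 0, 0], [0, 1, 1]] + [[0, 0, 0]] * 98  # one missed bit

full_acc = overall_accuracy(pred, ref)
sub_acc = overall_accuracy(*active_subset(pred, ref))
print(f"full set: {full_acc:.4f}, active subset: {sub_acc:.4f}")
```

A single missed bit barely moves the full-set figure (299 of 300 cells agree) but is clearly visible on the active subset (5 of 6 cells agree).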

Known discrepancies

The 4 bits below 90 % accuracy in ToxPrint v2 belong to the ring heteroatom and chain chemotype groups; TxP_PFAS v1 has 3 bits below 100 % (all at 98.9 % or above).
Root causes (see tests/concordance_report.md):

Bit / category | Fingerprint | Accuracy | Direction | Root cause
ring:hetero_[6]_Z_generic | ToxPrint | 54.6 % | False positives | Over-broad 6-membered heteroatom-ring SMARTS; pyCSRML prevalence 69.6 % vs ChemoTyper 24.3 %
chain:alkaneBranch_isopropyl_C3 | ToxPrint | 74.1 % | False positives | Ring-attachment SMARTS permissive on noZ (not-connected-to-heteroatom) modifier; pyCSRML prevalence 37.5 % vs ChemoTyper 11.6 %
chain:alkaneCyclic_ethyl_C2_(connect_noZ) | ToxPrint | 75.9 % | False positives | Same noZ over-matching; pyCSRML prevalence 41.8 % vs ChemoTyper 17.7 %
chain:alkeneCyclic_ethene_generic | ToxPrint | 87.0 % | False positives | Cyclic alkene SMARTS over-matches; pyCSRML prevalence 17.4 % vs ChemoTyper 10.3 %
pfas_chain:alkeneLinear_mono-ene_ethylene_generic_F | TxP_PFAS | 98.9 % | False negatives (recall 40 %) | RDKit perceives the C=C of tautomeric fluoropyrimidines (5-fluorouracil) as aromatic; the SMARTS [#9]-[#6;A]=[#6;A] requires aliphatic atoms and misses them
pfas_bond:C=N_imine_FCN | TxP_PFAS | 99.5 % | False negatives (recall 33 %) | Same aromaticity issue: the C=N bond in fluorinated heterocycles is perceived as aromatic by RDKit, so the aliphatic imine SMARTS does not match
pfas_bond:aromatic_FCc1c | TxP_PFAS | 99.5 % | Slight false positives (precision 97.2 %) | Aromatic F-C pattern slightly over-matches due to SMARTS approximation of the exception clause
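The accuracy, precision, recall and prevalence figures in the table above all follow from per-bit confusion counts. The sketch below shows one way to derive them, taking pyCSRML as the prediction and ChemoTyper as the reference. `bit_stats` is a hypothetical helper, not part of pyCSRML, and the toy column is invented for illustration.

```python
import math


def bit_stats(pred, ref):
    """Confusion-matrix statistics for one bit across many compounds."""
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)
    tn = sum(1 for p, r in zip(pred, ref) if not p and not r)
    n = tp + fp + fn + tn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "fp": fp,
        "fn": fn,
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "pred_prevalence": (tp + fp) / n,
        "ref_prevalence": (tp + fn) / n,
    }


# Toy column (hypothetical): 10 compounds, the prediction over-matches twice.
ref = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
stats = bit_stats(pred, ref)

# The dominant error type gives the "Direction" of the discrepancy.
if stats["fp"] > stats["fn"]:
    direction = "false positives"
elif stats["fn"] > stats["fp"]:
    direction = "false negatives"
else:
    direction = "balanced"
print(stats, direction)
```

A predicted prevalence well above the reference prevalence, as in the first four rows of the table, is the signature of an over-matching SMARTS pattern.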

Timing Benchmark

Five molecule-size-stratified sets are extracted from the CLinventory and used to compare pyCSRML speed against ChemoTyper on realistic chemical diversity.

Set | Heavy-atom range | Molecules
bench_tiny | 1 – 10 | auto
bench_small | 11 – 20 | auto
bench_medium | 21 – 35 | auto
bench_large | 36 – 60 | auto
bench_xlarge | 61 + | auto
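The stratification amounts to binning molecules by heavy-atom count. A minimal sketch of that binning, using the ranges from the table; `size_bin` and `BINS` are hypothetical names, and the actual logic in scripts/create_size_benchmarks.py may differ.

```python
# Inclusive upper bounds of each heavy-atom range, taken from the table above.
BINS = [
    ("bench_tiny", 10),
    ("bench_small", 20),
    ("bench_medium", 35),
    ("bench_large", 60),
]


def size_bin(n_heavy_atoms):
    """Return the benchmark set name for a given heavy-atom count."""
    if n_heavy_atoms < 1:
        raise ValueError("a molecule needs at least one heavy atom")
    for name, upper in BINS:
        if n_heavy_atoms <= upper:
            return name
    return "bench_xlarge"  # 61 heavy atoms or more


print(size_bin(6), size_bin(35), size_bin(61))
```

With RDKit, the heavy-atom count of a molecule is available as `mol.GetNumHeavyAtoms()`.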

Timing results (ms / molecule)

Set | Heavy atoms | pyCSRML ToxPrint v2 | pyCSRML TxP_PFAS v1 | ChemoTyper ToxPrint v2 | ChemoTyper TxP_PFAS v1
bench_tiny | 1 – 10 | 3.76 | 0.73 | 13.83 | 4.29
bench_small | 11 – 20 | 5.47 | 1.01 | 27.70 | 7.74
bench_medium | 21 – 35 | 8.23 | 1.53 | 59.63 | 17.87
bench_large | 36 – 60 | 12.32 | 2.19 | 114.64 | 30.53
bench_xlarge | 61 + | 23.20 | 4.46 | 322.33 | 139.09

pyCSRML measured on Snapdragon X Elite X1E78100 (ARM64, 12 cores, ~32 GB RAM), Python 3.14.2, RDKit 2025.09.3, NumPy 2.3.5; 5 repetitions, median reported. ChemoTyper timings measured manually on the same machine, 3 repetitions, mean reported; values are of limited precision due to the manual measurement procedure. 500 molecules per set.
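A "median over repetitions, per molecule" measurement of this kind can be sketched as follows. This is an illustration only, not the code in scripts/benchmark_pycsrml_timing.py; `ms_per_item` is a hypothetical helper and the workload is a stand-in.

```python
import statistics
import time


def ms_per_item(func, items, reps=5):
    """Median wall-clock time per item, in milliseconds, over `reps` repetitions."""
    per_item = []
    for _ in range(reps):
        start = time.perf_counter()
        for item in items:
            func(item)
        elapsed = time.perf_counter() - start
        per_item.append(1000.0 * elapsed / len(items))
    return statistics.median(per_item)


# Stand-in workload (hypothetical); in practice `func` would fingerprint one molecule.
result = ms_per_item(lambda s: sorted(s), ["c1ccccc1"] * 500, reps=5)
print(f"{result:.4f} ms / item")
```

Reporting the median rather than the mean makes the figure less sensitive to a single slow repetition, for example one affected by a cold cache.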

How to reproduce

1. Extract benchmark sets (one-time):

python scripts/create_size_benchmarks.py

Outputs tests/test_data/size_benchmarks/bench_*.smiles, bench_metadata.csv, and chemotyper_timing_template.csv.

2. Time pyCSRML (saves pycsrml_timing_baseline.json):

python scripts/benchmark_pycsrml_timing.py          # 5 reps by default
python scripts/benchmark_pycsrml_timing.py --reps 3 # faster

3. Run ChemoTyper on each .smiles file (ToxPrint V2 and TxP_PFAS v1), export results as TSV, and place zips in tests/test_data/size_benchmarks/:

bench_tiny_toxprint.zip    bench_tiny_txppfas.zip
bench_small_toxprint.zip   bench_small_txppfas.zip
...

Fill in the three-repetition ChemoTyper timing in chemotyper_timing_template.csv.

4. Run regression tests:

pytest tests/test_benchmark_regression.py -v -m slow

Timing regression: fails if any set is >30 % slower than the saved baseline.
Accuracy regression: fails if overall bit accuracy drops >0.1 pp from baseline.
Both test types skip gracefully until their respective baseline / zip files exist.
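The two thresholds can be expressed as simple comparisons. The sketch below only illustrates the stated rules (30 % slowdown, 0.1 percentage-point accuracy drop); the function names are hypothetical and the checks in tests/test_benchmark_regression.py may be written differently.

```python
def timing_regressed(current_ms, baseline_ms, tolerance=0.30):
    """True if the current timing is more than `tolerance` (30 %) slower than baseline."""
    return current_ms > baseline_ms * (1.0 + tolerance)


def accuracy_regressed(current_pct, baseline_pct, max_drop_pp=0.1):
    """True if accuracy dropped by more than `max_drop_pp` percentage points."""
    return (baseline_pct - current_pct) > max_drop_pp


print(timing_regressed(4.80, 3.76))      # within 30 % of a 3.76 ms baseline
print(timing_regressed(5.00, 3.76))      # more than 30 % slower
print(accuracy_regressed(99.65, 99.71))  # 0.06 pp drop
print(accuracy_regressed(99.50, 99.71))  # 0.21 pp drop
```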


Installation

The module requires RDKit. If necessary, first install an environment manager (e.g. Conda/Mamba, for instance via Miniforge3) and create an environment, e.g.:

mamba create -n rdkit python
mamba activate rdkit
mamba install -y -c rdkit rdkit

Then install pyCSRML via PyPI:

pip install pyCSRML

Quick start

Single molecule (ToxPrint v2.0, 729 bits)

from pyCSRML import Fingerprinter, TOXPRINT_PATH
from rdkit import Chem

fp = Fingerprinter(TOXPRINT_PATH)

mol = Chem.MolFromSmiles("c1ccccc1")   # benzene
arr, names = fp.fingerprint(mol)

print(f"Bits set: {arr.sum()} / {fp.n_bits}")
on_bits = [names[i] for i in range(len(arr)) if arr[i]]
print(on_bits[:5])

TxP_PFAS fingerprints (129 bits)

from pyCSRML import Fingerprinter, TXPPFAS_PATH
from rdkit import Chem

fp = Fingerprinter(TXPPFAS_PATH)
mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O")  # PFOA
arr, names = fp.fingerprint(mol)
print(f"Bits set: {arr.sum()} / {fp.n_bits}")

Batch fingerprinting

from pyCSRML import Fingerprinter, TXPPFAS_PATH
from rdkit import Chem

smiles_list = ["c1ccccc1", "OC(=O)C(F)(F)F"]   # example input: benzene, trifluoroacetic acid
mols = [Chem.MolFromSmiles(s) for s in smiles_list]
fp = Fingerprinter(TXPPFAS_PATH)
matrix = fp.fingerprint_batch(mols)   # shape (n_mols, 129), dtype bool
print(matrix.shape)

Low-level CSRML parsing

from pyCSRML._csrml import parse_csrml_xml, ordered_bit_list

data = parse_csrml_xml("path/to/my_fingerprints.xml")
bits = ordered_bit_list(data)
for bit in bits[:3]:
    print(bit["id"], bit["smarts"])

API overview

Symbol | Module | Description
Fingerprinter | pyCSRML | Compute fingerprints from any CSRML XML, JSON, or YAML definition
TOXPRINT_PATH | pyCSRML | Path to the bundled ToxPrint v2.0 JSON (729 bits)
TXPPFAS_PATH | pyCSRML | Path to the bundled TxP_PFAS v1.0.4 JSON (129 bits)
parse_csrml_xml | pyCSRML._csrml | Parse raw CSRML XML → Python dict
ordered_bit_list | pyCSRML._csrml | Return all bits in order from a parsed dict

Full API reference: pycsrml.readthedocs.io


CSRML features supported

Feature | Status
substructureMatch → SMARTS | ✅ Full
substructureException (global) | ✅ Full
matchingQueryAtom[!$(...)] folding | ✅ Full
combineAtomFeatures (OR-of-AND trees) | ✅ Full
atomList with negate="true" | ✅ Full
attachedHydrogenCount ranges | ✅ Full
ringCountAtom / ringAtom / chainAtom | ✅ Full
Pseudo-elements G, Z, Q, X | ✅ Full
mustMatch / mustNotMatch (test cases) | parsed, not used for matching
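To give a feel for what an atom-list conversion involves, the sketch below turns a list of element symbols into a SMARTS atom expression, with an optional negation. It is an assumption-laden illustration, not pyCSRML's converter: `atom_list_to_smarts` and the small `ATOMIC_NUMBERS` table are hypothetical, and only standard SMARTS syntax is relied on (`#n` for atomic number, `,` for OR, `!` for NOT, `;` for AND).

```python
# Small lookup table for illustration only (hypothetical, not exhaustive).
ATOMIC_NUMBERS = {"C": 6, "N": 7, "O": 8, "F": 9, "S": 16, "Cl": 17}


def atom_list_to_smarts(symbols, negate=False):
    """Build a SMARTS atom expression matching any (or none) of the given elements."""
    primitives = [f"#{ATOMIC_NUMBERS[s]}" for s in symbols]
    if negate:
        # "not N and not O": every primitive is negated and AND-ed together.
        return "[" + ";".join("!" + p for p in primitives) + "]"
    # "N or O": primitives are OR-ed together.
    return "[" + ",".join(primitives) + "]"


any_of = atom_list_to_smarts(["N", "O"])
none_of = atom_list_to_smarts(["N", "O"], negate=True)
print(any_of, none_of)
```

Using atomic numbers rather than element symbols keeps the expression independent of aromaticity, since `#7` matches both aliphatic and aromatic nitrogen.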

Development

git clone https://github.com/LucMiaz/pyCSRML
cd pyCSRML
pip install -e ".[dev]"

# Run tests (fast)
pytest -m "not slow"

# Run concordance test (~45 s)
pytest tests/test_chemotyper_concordance.py -v -s

# Pylint
pylint pyCSRML/

Citation

If you use pyCSRML in academic work, please cite the original ToxPrint / ChemoTyper paper and the TxP_PFAS reference:

  • Yang, C., et al. (2015). New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and modelling. J. Chem. Inf. Model. 55, 510–528.
  • Richard, A.M., et al. (2023). ToxPrint chemotypes and ChemoTyper portal. Chem. Res. Toxicol. 36, 488–510.

Licence

pyCSRML © 1999 by Luc T. Miaz is licensed under CC BY 4.0

Acknowledgments

This project is part of the ZeroPM project (WP2) and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.


Powered by RDKit
