Unofficial Python implementation of CSRML chemotype fingerprints (ToxPrint v2 and TxP_PFAS v1).
pyCSRML
pyCSRML is a pure-Python re-implementation of the Chemical Subgraph Representation Markup Language (CSRML). It parses CSRML XML files, converts the subgraph patterns to SMARTS, and computes binary chemotype fingerprints for molecules using RDKit.
The module is not an exact replica of the original CSRML implementation (see the Performance section). The original software should be preferred.
The module was implemented from two fingerprint descriptions:
| Fingerprint | Bits | Description | Source |
|---|---|---|---|
| ToxPrint v2.0 | 729 | General toxicologically relevant substructures | Yang et al. 2015 |
| TxP_PFAS v1.0 | 129 | Per- and polyfluoroalkyl substance (PFAS) chemotypes | Richard et al. 2023 |
Performance
Accuracy is measured by comparing pyCSRML bit vectors against the reference
ChemoTyper tool output.
Run pytest tests/test_chemotyper_concordance.py -v -s to reproduce; the full
per-bit breakdown is written to tests/concordance_report.md.
| Dataset | Compounds | Fingerprint | Overall accuracy | Bits ≥ 90 % acc | Macro MCC | Macro Bal Acc | Macro ROC-AUC |
|---|---|---|---|---|---|---|---|
| Richard et al. 2023 (PFAS set) | 14 710 | TxP_PFAS v1 | 99.99 % | 129 / 129 | 0.9971 | 0.9989 | 0.9989 |
| ToxCast (full) | 9 014 | ToxPrint v2 | 99.71 % | 725 / 729 | 0.9326 | 0.9703 | 0.9703 |
| ToxCast (CF-containing subset) | 808 | TxP_PFAS v1 | 99.98 % | 129 / 129 | 0.9905 | 0.9924 | 0.9924 |
| CLinventory | 181 745 | ToxPrint v2 | 99.77 % | 726 / 729 | 0.9320 | 0.9710 | 0.9710 |
| CLinventory | 181 745 | TxP_PFAS v1 | 100.00 % | 129 / 129 | 0.9936 | 0.9946 | 0.9946 |
Reading the table: "CF-containing subset" means only the 808 ToxCast compounds for which ChemoTyper sets at least one TxP_PFAS bit — the meaningful subset for PFAS accuracy benchmarking. Full-dataset TxP_PFAS accuracy appears inflated (100 %) because the vast majority of compounds are all-zero for every PFAS bit.
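The effect described above can be reproduced with a small, dependency-free sketch of the per-bit metrics. This is not the code used by the concordance test: the function names are hypothetical, and the conventions for undefined cases (MCC set to 0 when a margin is empty, recall or specificity set to 1 when there are no positives or negatives) are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def bit_metrics(ref, pred):
    """Accuracy, MCC and balanced accuracy for one bit (two equal-length 0/1 lists)."""
    tp = sum(1 for r, p in zip(ref, pred) if r and p)
    tn = sum(1 for r, p in zip(ref, pred) if not r and not p)
    fp = sum(1 for r, p in zip(ref, pred) if not r and p)
    fn = sum(1 for r, p in zip(ref, pred) if r and not p)
    accuracy = (tp + tn) / len(ref)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 1.0  # assumed convention
    specificity = tn / (tn + fp) if tn + fp else 1.0  # assumed convention
    return accuracy, mcc, (recall + specificity) / 2


def macro_average(ref_columns, pred_columns):
    """Unweighted mean of each metric over all bits (one column per bit)."""
    per_bit = [bit_metrics(r, p) for r, p in zip(ref_columns, pred_columns)]
    return tuple(mean(values) for values in zip(*per_bit))


# Toy data: a rare bit that is never predicted, and a common bit predicted perfectly.
rare_ref, rare_pred = [1, 1] + [0] * 98, [0] * 100
common_ref = common_pred = [1, 0] * 50

rare_metrics = bit_metrics(rare_ref, rare_pred)
macro = macro_average([rare_ref, common_ref], [rare_pred, common_pred])
print(rare_metrics, macro)
```

The rare bit is missed every time, yet its accuracy stays at 98 % because 98 of the 100 toy molecules are true negatives. MCC and balanced accuracy expose the miss, which is why the table reports them alongside overall accuracy.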
Known discrepancies
The 4 bits below 90 % accuracy in ToxPrint v2 are in the ring heteroatom and chain
chemotype groups; TxP_PFAS v1 has 3 bits below 100 % (all at or above 98.9 %).
Root causes (see tests/concordance_report.md):
| Bit / category | Fingerprint | Accuracy | Direction | Root cause |
|---|---|---|---|---|
| ring:hetero_[6]_Z_generic | ToxPrint | 54.6 % | False positives | Over-broad 6-membered heteroatom-ring SMARTS; pyCSRML prevalence 69.6 % vs ChemoTyper 24.3 % |
| chain:alkaneBranch_isopropyl_C3 | ToxPrint | 74.1 % | False positives | Ring-attachment SMARTS permissive on noZ (not-connected-to-heteroatom) modifier; pyCSRML prevalence 37.5 % vs ChemoTyper 11.6 % |
| chain:alkaneCyclic_ethyl_C2_(connect_noZ) | ToxPrint | 75.9 % | False positives | Same noZ over-matching; pyCSRML prevalence 41.8 % vs ChemoTyper 17.7 % |
| chain:alkeneCyclic_ethene_generic | ToxPrint | 87.0 % | False positives | Cyclic alkene SMARTS over-matches; pyCSRML prevalence 17.4 % vs ChemoTyper 10.3 % |
| pfas_chain:alkeneLinear_mono-ene_ethylene_generic_F | TxP_PFAS | 98.9 % | False negatives (recall 40 %) | RDKit perceives the C=C of tautomeric fluoropyrimidines (5-fluorouracil) as aromatic; the SMARTS [#9]-[#6;A]=[#6;A] requires aliphatic atoms and misses them |
| pfas_bond:C=N_imine_FCN | TxP_PFAS | 99.5 % | False negatives (recall 33 %) | Same aromaticity issue: the C=N bond in fluorinated heterocycles is perceived as aromatic by RDKit, so the aliphatic imine SMARTS does not match |
| pfas_bond:aromatic_FCc1c | TxP_PFAS | 99.5 % | Slight false positives (precision 97.2 %) | Aromatic F-C pattern slightly over-matches due to SMARTS approximation of the exception clause |
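The accuracy and prevalence figures in the ToxPrint rows can be combined to estimate how the errors split between false positives and false negatives. The sketch below assumes that all three numbers in a row were measured on the same compound set, which the table does not state. Under that assumption, the prevalence difference equals FP − FN and the error rate equals FP + FN, so both rates follow by simple arithmetic. `split_errors` is a hypothetical helper, not part of pyCSRML.

```python
def split_errors(accuracy, prev_pycsrml, prev_chemotyper):
    """Return (fp_rate, fn_rate) in percent of all molecules.

    Uses FP + FN = 100 - accuracy and FP - FN = prevalence difference.
    Rounding in the published percentages can push a rate marginally
    below zero, so both results are clamped at 0.
    """
    error = 100.0 - accuracy
    excess = prev_pycsrml - prev_chemotyper
    return max((error + excess) / 2, 0.0), max((error - excess) / 2, 0.0)


# (accuracy %, pyCSRML prevalence %, ChemoTyper prevalence %) from the table above
rows = {
    "ring:hetero_[6]_Z_generic": (54.6, 69.6, 24.3),
    "chain:alkaneBranch_isopropyl_C3": (74.1, 37.5, 11.6),
    "chain:alkaneCyclic_ethyl_C2_(connect_noZ)": (75.9, 41.8, 17.7),
    "chain:alkeneCyclic_ethene_generic": (87.0, 17.4, 10.3),
}

for name, values in rows.items():
    fp_rate, fn_rate = split_errors(*values)
    print(f"{name}: FP {fp_rate:.2f} %, FN {fn_rate:.2f} %")
```

If the same-dataset assumption holds, the first three bits come out as almost purely false positives, while chain:alkeneCyclic_ethene_generic would also carry roughly 3 % false negatives.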
Timing Benchmark
Five molecule-size-stratified sets are extracted from the CLinventory and used to compare pyCSRML speed against ChemoTyper on realistic chemical diversity.
| Set | Heavy-atom range | Molecules |
|---|---|---|
| bench_tiny | 1 – 10 | auto |
| bench_small | 11 – 20 | auto |
| bench_medium | 21 – 35 | auto |
| bench_large | 36 – 60 | auto |
| bench_xlarge | 61 + | auto |
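As an illustration of the stratification, a heavy-atom count can be mapped to its set with a sorted list of upper bounds. This is a sketch only; how scripts/create_size_benchmarks.py assigns molecules is not shown here, and `size_bin` is a hypothetical name.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first four sets, taken from the table above.
UPPER_BOUNDS = [10, 20, 35, 60]
SET_NAMES = ["bench_tiny", "bench_small", "bench_medium", "bench_large", "bench_xlarge"]


def size_bin(heavy_atoms: int) -> str:
    """Return the benchmark set for a molecule with `heavy_atoms` heavy atoms."""
    if heavy_atoms < 1:
        raise ValueError("expected at least one heavy atom")
    # bisect_left keeps a count equal to an upper bound inside that set.
    return SET_NAMES[bisect_left(UPPER_BOUNDS, heavy_atoms)]


for count in (1, 10, 11, 35, 36, 60, 61, 150):
    print(count, size_bin(count))
```

With RDKit, the count for a parsed molecule would typically come from `mol.GetNumHeavyAtoms()`.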
Timing results (ms / molecule)
| Set | Heavy atoms | pyCSRML ToxPrint v2 | pyCSRML TxP_PFAS v1 | ChemoTyper ToxPrint v2 | ChemoTyper TxP_PFAS v1 |
|---|---|---|---|---|---|
| bench_tiny | 1 – 10 | 3.76 | 0.73 | 13.83 | 4.29 |
| bench_small | 11 – 20 | 5.47 | 1.01 | 27.70 | 7.74 |
| bench_medium | 21 – 35 | 8.23 | 1.53 | 59.63 | 17.87 |
| bench_large | 36 – 60 | 12.32 | 2.19 | 114.64 | 30.53 |
| bench_xlarge | 61 + | 23.20 | 4.46 | 322.33 | 139.09 |
pyCSRML measured on Snapdragon X Elite X1E78100 (ARM64, 12 cores, ~32 GB RAM), Python 3.14.2, RDKit 2025.09.3, NumPy 2.3.5; 5 repetitions, median reported. ChemoTyper timings measured manually on the same machine, 3 repetitions, mean reported; values are of limited precision due to the manual measurement procedure. 500 molecules per set.
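The measurement protocol (several full passes over a set, median reported, normalised per molecule) can be sketched as follows. This is not the code in scripts/benchmark_pycsrml_timing.py. `ms_per_molecule` and `speedup` are hypothetical helpers, and the workload is a stand-in function so the sketch runs without RDKit.

```python
import time
from statistics import median


def ms_per_molecule(func, items, reps=5):
    """Median wall-clock time per item, in milliseconds, over `reps` full passes."""
    timings = []
    for _ in range(reps):
        start = time.perf_counter()
        for item in items:
            func(item)
        elapsed = time.perf_counter() - start
        timings.append(1000.0 * elapsed / len(items))
    return median(timings)


def speedup(reference_ms, candidate_ms):
    """How many times faster the candidate is than the reference."""
    return reference_ms / candidate_ms


# Stand-in workload: sorting the characters of 500 SMILES strings.
demo_ms = ms_per_molecule(sorted, ["CCO", "c1ccccc1"] * 250, reps=5)

# ToxPrint v2 values (ms / molecule) from the timing table above.
tiny_speedup = speedup(13.83, 3.76)
xlarge_speedup = speedup(322.33, 23.20)
print(f"{demo_ms:.4f} ms/item, {tiny_speedup:.1f}x, {xlarge_speedup:.1f}x")
```

For ToxPrint v2, the table values correspond to a speed-up of roughly 3.7× on bench_tiny and roughly 13.9× on bench_xlarge, subject to the limited precision of the manual ChemoTyper timings.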
How to reproduce
1. Extract benchmark sets (one-time):
python scripts/create_size_benchmarks.py
Outputs tests/test_data/size_benchmarks/bench_*.smiles,
bench_metadata.csv, and chemotyper_timing_template.csv.
2. Time pyCSRML (saves pycsrml_timing_baseline.json):
python scripts/benchmark_pycsrml_timing.py # 5 reps by default
python scripts/benchmark_pycsrml_timing.py --reps 3 # faster
3. Run ChemoTyper on each .smiles file (ToxPrint V2 and TxP_PFAS v1),
export results as TSV, and place zips in tests/test_data/size_benchmarks/:
bench_tiny_toxprint.zip bench_tiny_txppfas.zip
bench_small_toxprint.zip bench_small_txppfas.zip
...
Fill in the three-repetition ChemoTyper timing in chemotyper_timing_template.csv.
4. Run regression tests:
pytest tests/test_benchmark_regression.py -v -m slow
Timing regression: fails if any set is >30 % slower than the saved baseline.
Accuracy regression: fails if overall bit accuracy drops >0.1 pp from baseline.
Both test types skip gracefully until their respective baseline / zip files exist.
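The two regression criteria amount to simple threshold checks. The sketch below restates them under the thresholds given above. The function names are hypothetical and the actual logic in tests/test_benchmark_regression.py may differ in detail.

```python
def timing_regressed(current_ms, baseline_ms, tolerance=0.30):
    """True if the current timing is more than `tolerance` (30 %) slower than baseline."""
    return current_ms > baseline_ms * (1.0 + tolerance)


def accuracy_regressed(current_pct, baseline_pct, max_drop_pp=0.1):
    """True if accuracy fell by more than `max_drop_pp` percentage points."""
    return (baseline_pct - current_pct) > max_drop_pp


# Illustrative values only; the baselines reuse figures from the tables above.
print(timing_regressed(5.0, 3.76))        # well beyond the 30 % margin
print(timing_regressed(4.5, 3.76))        # within the margin
print(accuracy_regressed(99.50, 99.71))   # drop of about 0.21 pp
print(accuracy_regressed(99.65, 99.71))   # drop of about 0.06 pp
```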
Installation
The module requires RDKit. If necessary, first install an environment manager (e.g. Conda/Mamba, such as Miniforge3) and create an environment, e.g.:
mamba create -n rdkit python
mamba activate rdkit
mamba install -y -c rdkit rdkit
Then install pyCSRML via PyPI:
pip install pyCSRML
Quick start
Single molecule (ToxPrint v2.0, 729 bits)
from pyCSRML import Fingerprinter, TOXPRINT_PATH
from rdkit import Chem
fp = Fingerprinter(TOXPRINT_PATH)
mol = Chem.MolFromSmiles("c1ccccc1") # benzene
arr, names = fp.fingerprint(mol)
print(f"Bits set: {arr.sum()} / {fp.n_bits}")
on_bits = [names[i] for i in range(len(arr)) if arr[i]]
print(on_bits[:5])
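The list of on-bit names can be summarised by chemotype group. The sketch below assumes that bit names follow the `category:label` pattern seen in the names quoted in this README (for example `ring:...` and `chain:...`). `group_by_category` is a hypothetical helper, and the example names stand in for a real `on_bits` list.

```python
from collections import defaultdict


def group_by_category(bit_names):
    """Group bit names by the text before the first ':' (assumed naming pattern)."""
    groups = defaultdict(list)
    for name in bit_names:
        category, _, label = name.partition(":")
        groups[category].append(label)
    return dict(groups)


# Stand-in for `on_bits`; these names are quoted elsewhere in this README.
example_names = [
    "ring:hetero_[6]_Z_generic",
    "chain:alkaneBranch_isopropyl_C3",
    "chain:alkeneCyclic_ethene_generic",
]
grouped = group_by_category(example_names)
for category, labels in grouped.items():
    print(category, len(labels), labels)
```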
TxP_PFAS fingerprints (129 bits)
from pyCSRML import Fingerprinter, TXPPFAS_PATH
from rdkit import Chem
fp = Fingerprinter(TXPPFAS_PATH)
mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O") # PFOA
arr, names = fp.fingerprint(mol)
print(f"Bits set: {arr.sum()} / {fp.n_bits}")
Batch fingerprinting
from pyCSRML import Fingerprinter, TXPPFAS_PATH
from rdkit import Chem
smiles_list = ["CCO", "FC(F)(F)C(=O)O"]  # example input; replace with your own SMILES
mols = [Chem.MolFromSmiles(s) for s in smiles_list]
fp = Fingerprinter(TXPPFAS_PATH)
matrix = fp.fingerprint_batch(mols) # shape (n_mols, 129), dtype bool
print(matrix.shape)
Low-level CSRML parsing
from pyCSRML._csrml import parse_csrml_xml, ordered_bit_list
data = parse_csrml_xml("path/to/my_fingerprints.xml")
bits = ordered_bit_list(data)
for bit in bits[:3]:
    print(bit["id"], bit["smarts"])
API overview
| Symbol | Module | Description |
|---|---|---|
| Fingerprinter | pyCSRML | Compute fingerprints from any CSRML XML, JSON, or YAML definition |
| TOXPRINT_PATH | pyCSRML | Path to the bundled ToxPrint v2.0 JSON (729 bits) |
| TXPPFAS_PATH | pyCSRML | Path to the bundled TxP_PFAS v1.0.4 JSON (129 bits) |
| parse_csrml_xml | pyCSRML._csrml | Parse raw CSRML XML → Python dict |
| ordered_bit_list | pyCSRML._csrml | Return all bits in order from a parsed dict |
Full API reference: pycsrml.readthedocs.io
CSRML features supported
| Feature | Status |
|---|---|
| substructureMatch → SMARTS | ✅ Full |
| substructureException (global) | ✅ Full |
| matchingQueryAtom → [!$(...)] folding | ✅ Full |
| combineAtomFeatures (OR-of-AND trees) | ✅ Full |
| atomList with negate="true" | ✅ Full |
| attachedHydrogenCount ranges | ✅ Full |
| ringCountAtom / ringAtom / chainAtom | ✅ Full |
| Pseudo-elements G, Z, Q, X | ✅ Full |
| mustMatch / mustNotMatch (test cases) | parsed, not used for matching |
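To illustrate how an OR-of-AND tree and a negated atom list can be expressed in SMARTS, the sketch below relies only on standard SMARTS operator precedence (`!` binds tightest, then `&`, then `,`, then `;`). It is not pyCSRML's converter. Both functions are hypothetical and assume that each primitive is a simple SMARTS atom primitive such as `#6`, `R` or `H1`.

```python
def or_of_and_to_smarts(alternatives):
    """Render [[prim, prim], [prim, ...]] as one bracketed SMARTS atom.

    Primitives inside a group are joined with '&' (high-precedence AND) and
    groups are joined with ',' (OR), so no extra parentheses are needed.
    """
    if not alternatives or any(not group for group in alternatives):
        raise ValueError("every alternative needs at least one primitive")
    return "[" + ",".join("&".join(group) for group in alternatives) + "]"


def negated_atom_list(atomic_numbers):
    """Render 'none of these elements' as a bracketed SMARTS atom."""
    if not atomic_numbers:
        raise ValueError("expected at least one atomic number")
    return "[" + ";".join(f"!#{z}" for z in atomic_numbers) + "]"


# Ring carbon OR nitrogen carrying exactly one hydrogen.
ring_c_or_nh = or_of_and_to_smarts([["#6", "R"], ["#7", "H1"]])
# Any atom that is neither nitrogen nor oxygen.
not_n_or_o = negated_atom_list([7, 8])
print(ring_c_or_nh, not_n_or_o)
```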
Development
git clone https://github.com/LucMiaz/pyCSRML
cd pyCSRML
pip install -e ".[dev]"
# Run tests (fast)
pytest -m "not slow"
# Run concordance test (~45 s)
pytest tests/test_chemotyper_concordance.py -v -s
# Pylint
pylint pyCSRML/
Citation
If you use pyCSRML in academic work, please cite the original ToxPrint / ChemoTyper paper and the TxP_PFAS reference:
- Yang, C., et al. (2015). New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and modelling. J. Chem. Inf. Model. 55, 510–528.
- Richard, A.M., et al. (2023). ToxPrint chemotypes and ChemoTyper portal. Chem. Res. Toxicol. 36, 488–510.
Licence
pyCSRML © 1999 by Luc T. Miaz is licensed under CC BY 4.0
Acknowledgments
This project is part of the ZeroPM project (WP2) and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.