Unofficial Python implementation of CSRML chemotype fingerprints (ToxPrint v2 and TxP_PFAS v1).

pyCSRML

pyCSRML is a pure-Python re-implementation of the Chemical Subgraph Representation Markup Language (CSRML). It parses CSRML XML files, converts the subgraph patterns to SMARTS, and computes binary chemotype fingerprints for molecules using RDKit.

The module is not an exact replica of the original CSRML implementation (see the Performance section). The original software should be preferred.

The module was implemented from two fingerprint descriptions:

| Fingerprint | Bits | Description | Source |
| --- | --- | --- | --- |
| ToxPrint v2.0 | 729 | General toxicologically relevant substructures | Yang et al. 2015 |
| TxP_PFAS v1.0 | 129 | Per- and polyfluoroalkyl substance (PFAS) chemotypes | Richard et al. 2023 |

Disclaimer

This module is not affiliated with the authors of the original CSRML project. Visit the ChemoTyper repository if you want to access and use the original CSRML.

Performance

Accuracy is measured by comparing pyCSRML bit vectors against the reference ChemoTyper tool output for ToxCast invitrodb v4.3 substances and against data from Richard et al. 2023 (SI Tables S2 and S5).
Run pytest tests/test_chemotyper_concordance.py -v -s to reproduce; the full per-bit breakdown is written to tests/concordance_report.md.

| Dataset | Compounds | Fingerprint | Overall accuracy | Bits ≥ 90 % acc |
| --- | --- | --- | --- | --- |
| Richard et al. 2023 (PFAS set) | 14 710 | TxP_PFAS v1 | 99.4 % | |
| ToxCast (full) | 9 014 | ToxPrint v2 | 98.17 % | 711 / 729 |
| ToxCast (CF-containing subset) | 808 | TxP_PFAS v1 | 99.98 % | 129 / 129 |

Reading the table: "CF-containing subset" means only the 808 ToxCast compounds for which ChemoTyper sets at least one TxP_PFAS bit — the meaningful subset for PFAS accuracy benchmarking. Full-dataset TxP_PFAS accuracy appears inflated (100 %) because the vast majority of compounds are all-zero for every PFAS bit.
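
The inflation effect described above can be reproduced with a toy example. The sketch below uses synthetic bit matrices (not pyCSRML or ChemoTyper output), and the helper `bit_accuracy` is a hypothetical name introduced only for this illustration. It compares accuracy over all compounds with accuracy over the compounds that have at least one reference bit set.

```python
# Toy illustration with synthetic data (not real pyCSRML / ChemoTyper output).

def bit_accuracy(predicted, reference):
    """Fraction of (compound, bit) cells where the two matrices agree."""
    total = 0
    agree = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            total += 1
            agree += int(p == r)
    return agree / total

# Reference: 2 compounds with some bits set, 8 compounds that are all-zero.
reference = [[1, 1, 0, 0], [0, 1, 1, 0]] + [[0, 0, 0, 0] for _ in range(8)]
# Prediction: identical except for one missed bit in the first compound.
predicted = [[1, 0, 0, 0], [0, 1, 1, 0]] + [[0, 0, 0, 0] for _ in range(8)]

# Accuracy over every compound: the all-zero rows agree trivially.
full = bit_accuracy(predicted, reference)

# Accuracy restricted to compounds with at least one reference bit set.
idx = [i for i, row in enumerate(reference) if any(row)]
subset = bit_accuracy([predicted[i] for i in idx], [reference[i] for i in idx])

print(f"full: {full:.3f}, subset: {subset:.3f}")  # full: 0.975, subset: 0.875
```

The same single error costs 2.5 points of accuracy on the full set but 12.5 points on the subset, which is why the subset is the more informative benchmark.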

Known discrepancies

Of the 18 ToxPrint v2 bits below 90 % accuracy, most are in the metal / inorganic element-group chemotypes; the table below also lists two generic nitrogen-heterocycle ring bits. TxP_PFAS v1 has 4 bits below 100 % (none lower than 98.9 %).
Root causes (see tests/_check_tsv_alignment.py and tests/concordance_report.md):

| Bit / category | Fingerprint | Accuracy | Direction | Root cause |
| --- | --- | --- | --- | --- |
| atom:element_noble_gas | ToxPrint | 0.0 % | False positives | Noble-gas SMARTS approximated as [*], which matches every atom |
| atom:element_metal_group_III, atom:element_metal_poor_metal, etc. | ToxPrint | 0.1 – 5 % | False positives | Metal / metalloid element-group patterns use G/Q pseudo-elements that are approximated as [*], causing widespread false positives |
| ring:hetero_[6]_N_tetrazine_generic, ring:hetero_[6]_N_triazine_generic | ToxPrint | 30 – 32 % | False positives | Nitrogen-count constraints in 6-membered heteroaromatic rings use atom-count SMARTS that over-match similar rings |
| pfas_chain:alkeneLinear_mono-ene_ethylene_generic_F | TxP_PFAS | 98.9 % | False negatives (recall 40 %) | RDKit perceives the C=C of tautomeric fluoropyrimidines (5-fluorouracil) as aromatic; the SMARTS [#9]-[#6;A]=[#6;A] requires aliphatic atoms and misses them |
| pfas_bond:C=N_imine_FCN | TxP_PFAS | 99.5 % | False negatives (recall 33 %) | Same aromaticity issue: the C=N bond in fluorinated heterocycles is perceived as aromatic by RDKit, so the aliphatic imine SMARTS does not match |
| pfas_bond:aromatic_FCc1c | TxP_PFAS | 99.5 % | Slight false positives (precision 97.2 %) | Aromatic F-C pattern slightly over-matches due to SMARTS approximation of the exception clause |
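
The aromaticity-related false negatives can be illustrated directly in RDKit. The sketch below is a minimal demonstration, not part of pyCSRML: it applies the aliphatic SMARTS quoted in the table to 5-fluorouracil and to fluoroethene as an aliphatic control. The aromatic variant of the query is a hypothetical pattern written only for comparison, and the behaviour shown assumes RDKit's default aromaticity model.

```python
from rdkit import Chem

# SMARTS quoted in the table above: fluorine on an aliphatic C=C.
aliphatic_query = Chem.MolFromSmarts("[#9]-[#6;A]=[#6;A]")
# Hypothetical aromatic counterpart, written only for this illustration.
aromatic_query = Chem.MolFromSmarts("[#9]-[#6;a]:[#6;a]")

# 5-fluorouracil entered in Kekulé form; RDKit re-perceives aromaticity on parsing.
fluorouracil = Chem.MolFromSmiles("O=C1NC=C(F)C(=O)N1")
# Fluoroethene as a plain aliphatic control.
fluoroethene = Chem.MolFromSmiles("FC=C")

# How RDKit writes the molecule back shows whether the ring was aromatised.
print(Chem.MolToSmiles(fluorouracil))

results = {
    "5-FU, aliphatic query": fluorouracil.HasSubstructMatch(aliphatic_query),
    "5-FU, aromatic query": fluorouracil.HasSubstructMatch(aromatic_query),
    "fluoroethene, aliphatic query": fluoroethene.HasSubstructMatch(aliphatic_query),
}
for label, matched in results.items():
    print(f"{label}: {matched}")
```

Under the default model the pyrimidinedione ring is flagged aromatic, so the aliphatic query matches the control but not 5-fluorouracil, which is the false-negative pattern reported in the table.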

Installation

The module requires RDKit. If necessary, first install an environment manager (e.g. Conda/Mamba, such as Miniforge3) and create an environment, e.g.:

mamba create -n rdkit python
mamba activate rdkit
mamba install -y -c rdkit rdkit

Then install pyCSRML via PyPI:

pip install pyCSRML

Quick start

Single molecule

from pyCSRML import PFASFingerprinter
from rdkit import Chem

fp = PFASFingerprinter()

mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O")  # PFOA
arr, names = fp.fingerprint(mol)

print(f"Bits set: {arr.sum()} / {fp.n_bits}")
on_bits = [names[i] for i in range(len(arr)) if arr[i]]
print(on_bits[:5])
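
A list of on-bit names can be long, so it may help to summarise it by category. The sketch below is a small post-processing helper, not part of the pyCSRML API: `count_by_category` is a hypothetical name, and it assumes bit names follow the `category:detail` form seen in the Known discrepancies table. It works on any list of strings, such as `on_bits` above.

```python
from collections import Counter

def count_by_category(bit_names):
    """Count bit names per category, i.e. the text before the first ':'."""
    return Counter(name.split(":", 1)[0] for name in bit_names)

# Example names copied from the Known discrepancies table in this README.
example_on_bits = [
    "pfas_chain:alkeneLinear_mono-ene_ethylene_generic_F",
    "pfas_bond:C=N_imine_FCN",
    "pfas_bond:aromatic_FCc1c",
]

summary = count_by_category(example_on_bits)
print(summary.most_common())  # [('pfas_bond', 2), ('pfas_chain', 1)]
```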

Multiple molecules with analysis

from pyCSRML import PFASFingerprinter, from_fingerprinter

fp = PFASFingerprinter()

smiles = [
    "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O",   # PFOA
    "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)S(=O)(=O)O",  # PFOS
    "FCCCF",    # 1,3-difluoropropane (simple difluoro compound)
    "CCO",      # negative control
]

eset = from_fingerprinter(fp, smiles_list=smiles, names=["PFOA", "PFOS", "1,3-difluoropropane", "EtOH"])
eset.plot(kind="heatmap")

ToxPrint fingerprints (729 bits)

from pyCSRML import ToxPrintFingerprinter
from rdkit import Chem

fp = ToxPrintFingerprinter()
mol = Chem.MolFromSmiles("c1ccccc1")
arr, names = fp.fingerprint(mol)
print(f"Benzene: {arr.sum()} bits set")

Low-level CSRML parsing

from pyCSRML._csrml import parse_csrml_xml, ordered_bit_list

data = parse_csrml_xml("path/to/my_fingerprints.xml")
bits = ordered_bit_list(data)
for bit in bits[:3]:
    print(bit["id"], bit["smarts"])

API overview

| Class / function | Module | Description |
| --- | --- | --- |
| PFASFingerprinter | pyCSRML | 129-bit TxP_PFAS fingerprinter |
| ToxPrintFingerprinter | pyCSRML | 729-bit ToxPrint v2 fingerprinter |
| Fingerprinter | pyCSRML | Base class; load any CSRML XML or JSON |
| Embedding | pyCSRML | Single-compound fingerprint container with metadata |
| EmbeddingSet | pyCSRML | Multi-compound container with heatmap / UMAP / clustering |
| from_fingerprinter | pyCSRML | Convenience factory: list of SMILES → EmbeddingSet |
| parse_csrml_xml | pyCSRML._csrml | Parse raw CSRML XML → Python dict |
| ordered_bit_list | pyCSRML._csrml | Return all bits in order from a parsed dict |

Full API reference: pycsrml.readthedocs.io


CSRML features supported

| Feature | Status |
| --- | --- |
| substructureMatch → SMARTS | ✅ Full |
| substructureException (global) | ✅ Full |
| matchingQueryAtom[!$(...)] folding | ✅ Full |
| combineAtomFeatures (OR-of-AND trees) | ✅ Full |
| atomList with negate="true" | ✅ Full |
| attachedHydrogenCount ranges | ✅ Full |
| ringCountAtom / ringAtom / chainAtom | ✅ Full |
| Pseudo-elements G, Z, Q, X | ✅ Full |
| mustMatch / mustNotMatch (test cases) | Parsed, not used for matching |

Development

git clone https://github.com/luc-miaz/pyCSRML
cd pyCSRML
pip install -e ".[dev]"

# Run tests (fast)
pytest -m "not slow"

# Run concordance test (~45 s)
pytest tests/test_chemotyper_concordance.py -v -s

# Pylint
pylint pyCSRML/

Citation

If you use pyCSRML in academic work, please cite the original ToxPrint / ChemoTyper paper and the TxP_PFAS reference:

  • Yang, C., et al. (2015). New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and modelling. J. Chem. Inf. Model. 55, 510–528.
  • Richard, A.M., et al. (2023). ToxPrint chemotypes and ChemoTyper portal. Chem. Res. Toxicol. 36, 488–510.

Licence

pyCSRML © 1999 by Luc T. Miaz is licensed under CC BY 4.0

Acknowledgments

While working on this project I was part of the ZeroPM project (WP2) and received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.

Powered by RDKit
