Python implementation of CSRML chemotype fingerprints (ToxPrint v2 and TxP_PFAS v1)

pyCSRML

pyCSRML is a pure-Python re-implementation of the Chemical Subgraph Representation Markup Language (CSRML). It parses CSRML XML files, converts the subgraph patterns to SMARTS, and computes binary chemotype fingerprints for molecules using RDKit.

The module is not an exact replica of the original CSRML software (see the Performance section). The original software should be preferred.

The module was implemented from two fingerprint descriptions:

| Fingerprint | Bits | Description | Source |
|---|---|---|---|
| ToxPrint v2.0 | 729 | General toxicologically relevant substructures | Yang et al. 2015 |
| TxP_PFAS v1.0 | 129 | Per- and polyfluoroalkyl substance (PFAS) chemotypes | Richard et al. 2023 |

Performance

Accuracy is measured by comparing pyCSRML bit vectors against the reference ChemoTyper tool output for ToxCast invitrodb v4.3 substances and against data from Richard et al. 2023 (SI Tables S2 and S5).
Run pytest tests/test_chemotyper_concordance.py -v -s to reproduce; the full per-bit breakdown is written to tests/concordance_report.md.

| Dataset | Compounds | Fingerprint | Overall accuracy | Bits with ≥ 90 % accuracy |
|---|---|---|---|---|
| Richard et al. 2023 (PFAS set) | 14 710 | TxP_PFAS v1 | 99.4 % | |
| ToxCast (full) | 9 014 | ToxPrint v2 | 98.17 % | 711 / 729 |
| ToxCast (CF-containing subset) | 808 | TxP_PFAS v1 | 99.98 % | 129 / 129 |

Reading the table: "CF-containing subset" means only the 808 ToxCast compounds for which ChemoTyper sets at least one TxP_PFAS bit — the meaningful subset for PFAS accuracy benchmarking. Full-dataset TxP_PFAS accuracy appears inflated (100 %) because the vast majority of compounds are all-zero for every PFAS bit.
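
The inflation effect can be reproduced with a few lines of arithmetic. The sketch below uses made-up toy bit vectors (not benchmark data) and a hypothetical helper, overall_accuracy, that is not part of pyCSRML. It compares agreement over all compounds with agreement over only those compounds for which the reference sets at least one bit.

```python
def overall_accuracy(reference, predicted):
    """Fraction of (compound, bit) cells on which two bit matrices agree."""
    total = 0
    agree = 0
    for ref_row, pred_row in zip(reference, predicted):
        for r, p in zip(ref_row, pred_row):
            total += 1
            agree += int(r == p)
    return agree / total


# Toy data (illustrative only): 2 fluorinated compounds plus
# 98 compounds for which no PFAS bit is set at all.
reference = [[1, 1, 0, 1], [1, 0, 1, 0]] + [[0, 0, 0, 0]] * 98
# Hypothetical predictions: one wrong bit in each fluorinated compound.
predicted = [[1, 0, 0, 1], [1, 0, 0, 0]] + [[0, 0, 0, 0]] * 98

full = overall_accuracy(reference, predicted)

# Keep only compounds for which the reference sets at least one bit.
subset = [(r, p) for r, p in zip(reference, predicted) if any(r)]
sub = overall_accuracy([r for r, _ in subset], [p for _, p in subset])

print(f"full set: {full:.1%}, subset: {sub:.1%}")
# full set: 99.5%, subset: 75.0%
```

The two errors are the same in both cases; only the number of trivially correct all-zero cells in the denominator changes.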

Known discrepancies

Of the 18 ToxPrint v2 bits below 90 % accuracy, most are in the metal / inorganic chemotype groups; the table below also lists two generic nitrogen-heterocycle ring bits. TxP_PFAS v1 has 4 bits below 100 % (all at 98.9 % or higher).
Root causes (see tests/_check_tsv_alignment.py and tests/concordance_report.md):

| Bit / category | Fingerprint | Accuracy | Direction | Root cause |
|---|---|---|---|---|
| atom:element_noble_gas | ToxPrint | 0.0 % | False positives | Noble-gas SMARTS approximated as [*], which matches every atom |
| atom:element_metal_group_III, atom:element_metal_poor_metal, etc. | ToxPrint | 0.1 – 5 % | False positives | Metal / metalloid element-group patterns use G/Q pseudo-elements that are approximated as [*], causing widespread false positives |
| ring:hetero_[6]_N_tetrazine_generic, ring:hetero_[6]_N_triazine_generic | ToxPrint | 30 – 32 % | False positives | Nitrogen-count constraints in 6-membered heteroaromatic rings use atom-count SMARTS that over-match similar rings |
| pfas_chain:alkeneLinear_mono-ene_ethylene_generic_F | TxP_PFAS | 98.9 % | False negatives (recall 40 %) | RDKit perceives the C=C of tautomeric fluoropyrimidines (e.g. 5-fluorouracil) as aromatic; the SMARTS [#9]-[#6;A]=[#6;A] requires aliphatic atoms and misses them |
| pfas_bond:C=N_imine_FCN | TxP_PFAS | 99.5 % | False negatives (recall 33 %) | Same aromaticity issue: the C=N bond in fluorinated heterocycles is perceived as aromatic by RDKit, so the aliphatic imine SMARTS does not match |
| pfas_bond:aromatic_FCc1c | TxP_PFAS | 99.5 % | Slight false positives (precision 97.2 %) | Aromatic F-C pattern slightly over-matches due to SMARTS approximation of the exception clause |
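
A bit can score close to 99 % accuracy and still have low recall when it is rarely set, because true negatives dominate the count. The sketch below shows the arithmetic with hypothetical confusion counts chosen for illustration; they are not the actual counts from the concordance report, and bit_metrics is not a pyCSRML function.

```python
def bit_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) for one fingerprint bit."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators for bits that are never set.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall


# Hypothetical counts for a rare bit over 808 compounds: the reference
# tool sets it 15 times and the re-implementation finds 6 of those.
acc, prec, rec = bit_metrics(tp=6, fp=0, fn=9, tn=793)
print(f"accuracy {acc:.1%}, precision {prec:.1%}, recall {rec:.1%}")
# accuracy 98.9%, precision 100.0%, recall 40.0%
```

This is why the Direction column reports recall or precision alongside accuracy for the TxP_PFAS bits.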

Installation

pyCSRML requires RDKit. If necessary, first install an environment manager (e.g. Conda or Mamba, for instance via Miniforge3) and create an environment:

mamba create -n rdkit python
mamba activate rdkit
mamba install -y -c rdkit rdkit

Then install pyCSRML via PyPI:

pip install pyCSRML
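
To confirm that both packages are visible from the active environment, a small check like the following can help. It is a sketch that relies only on the standard library; missing_modules is a hypothetical helper, and the import names rdkit and pyCSRML are taken from the examples in this README.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["rdkit", "pyCSRML"])
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("RDKit and pyCSRML are both importable.")
```

If RDKit is reported as missing, the most likely cause is that the environment created above has not been activated.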

Quick start

Single molecule

from pyCSRML import PFASFingerprinter
from rdkit import Chem

fp = PFASFingerprinter()

mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O")  # PFOA
arr, names = fp.fingerprint(mol)

print(f"Bits set: {arr.sum()} / {fp.n_bits}")
on_bits = [names[i] for i in range(len(arr)) if arr[i]]
print(on_bits[:5])

Multiple molecules with analysis

from pyCSRML import PFASFingerprinter, from_fingerprinter

fp = PFASFingerprinter()

smiles = [
    "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O",   # PFOA
    "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)S(=O)(=O)O",  # PFOS
    "FCCCF",    # 1,3-difluoropropane
    "CCO",      # ethanol, negative control
]

eset = from_fingerprinter(fp, smiles_list=smiles, names=["PFOA", "PFOS", "1,3-difluoropropane", "EtOH"])
eset.plot(kind="heatmap")

ToxPrint fingerprints (729 bits)

from pyCSRML import ToxPrintFingerprinter
from rdkit import Chem

fp = ToxPrintFingerprinter()
mol = Chem.MolFromSmiles("c1ccccc1")
arr, names = fp.fingerprint(mol)
print(f"Benzene: {arr.sum()} bits set")

Low-level CSRML parsing

from pyCSRML._csrml import parse_csrml_xml, ordered_bit_list

data = parse_csrml_xml("path/to/my_fingerprints.xml")
bits = ordered_bit_list(data)
for bit in bits[:3]:
    print(bit["id"], bit["smarts"])
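
Once SMARTS strings are available, a bit vector amounts to one substructure test per pattern. The following is a minimal sketch using RDKit directly, with two hand-written patterns standing in for parsed bits. The pattern names, the SMARTS and the helper simple_fingerprint are illustrative assumptions; they are not taken from the ToxPrint or TxP_PFAS definitions and do not reflect how pyCSRML handles exceptions or pseudo-elements.

```python
from rdkit import Chem

# Illustrative patterns only, not real ToxPrint / TxP_PFAS bits.
PATTERNS = {
    "C-F bond": "[#6]-[#9]",
    "carboxylic acid": "[CX3](=O)[OX2H1]",
}
# Compile each SMARTS once so it can be reused for every molecule.
COMPILED = {name: Chem.MolFromSmarts(s) for name, s in PATTERNS.items()}


def simple_fingerprint(smiles):
    """Return one 0/1 entry per pattern, in the order of PATTERNS."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return [int(mol.HasSubstructMatch(query)) for query in COMPILED.values()]


print(simple_fingerprint("OC(=O)C(F)(F)F"))  # trifluoroacetic acid -> [1, 1]
print(simple_fingerprint("CCO"))             # ethanol -> [0, 0]
```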

API overview

| Class / function | Module | Description |
|---|---|---|
| PFASFingerprinter | pyCSRML | 129-bit TxP_PFAS fingerprinter |
| ToxPrintFingerprinter | pyCSRML | 729-bit ToxPrint v2 fingerprinter |
| Fingerprinter | pyCSRML | Base class; load any CSRML XML or JSON |
| Embedding | pyCSRML | Single-compound fingerprint container with metadata |
| EmbeddingSet | pyCSRML | Multi-compound container with heatmap / UMAP / clustering |
| from_fingerprinter | pyCSRML | Convenience factory: list of SMILES → EmbeddingSet |
| parse_csrml_xml | pyCSRML._csrml | Parse raw CSRML XML → Python dict |
| ordered_bit_list | pyCSRML._csrml | Return all bits in order from a parsed dict |

Full API reference: pycsrml.readthedocs.io


CSRML features supported

| Feature | Status |
|---|---|
| substructureMatch → SMARTS | ✅ Full |
| substructureException (global) | ✅ Full |
| matchingQueryAtom[!$(...)] folding | ✅ Full |
| combineAtomFeatures (OR-of-AND trees) | ✅ Full |
| atomList with negate="true" | ✅ Full |
| attachedHydrogenCount ranges | ✅ Full |
| ringCountAtom / ringAtom / chainAtom | ✅ Full |
| Pseudo-elements G, Z, Q, X | Partial: G and Q are approximated as [*] (see Known discrepancies) |
| mustMatch / mustNotMatch (test cases) | Parsed, not used for matching |
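
To give a feel for what two of these features look like once expressed as SMARTS, here is a rough sketch of how a negated element list and a hydrogen-count range could be written as SMARTS atom expressions. The function names and the mapping are assumptions made for illustration and are not pyCSRML's actual converter; only standard SMARTS primitives are used (#n for atomic number, Hn for total hydrogen count, "," for OR, ";" for low-precedence AND, "!" for NOT).

```python
def element_list_smarts(atomic_numbers, negate=False):
    """SMARTS atom matching any listed element, or none of them if negated."""
    if negate:
        # AND of negations: the atom must differ from every listed element.
        body = ";".join(f"!#{z}" for z in atomic_numbers)
    else:
        # OR of alternatives.
        body = ",".join(f"#{z}" for z in atomic_numbers)
    return f"[{body}]"


def hydrogen_range_smarts(atomic_number, h_min, h_max):
    """SMARTS atom for one element with a total H count in [h_min, h_max]."""
    if h_min > h_max:
        raise ValueError("h_min must not exceed h_max")
    h_part = ",".join(f"H{n}" for n in range(h_min, h_max + 1))
    # ';' binds more loosely than ',', so this reads: element AND (H1 OR H2 ...).
    return f"[#{atomic_number};{h_part}]"


print(element_list_smarts([7, 8]))               # [#7,#8]
print(element_list_smarts([7, 8], negate=True))  # [!#7;!#8]
print(hydrogen_range_smarts(6, 1, 2))            # [#6;H1,H2]
```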

Development

git clone https://github.com/luc-miaz/pyCSRML
cd pyCSRML
pip install -e ".[dev]"

# Run tests (fast)
pytest -m "not slow"

# Run concordance test (~45 s)
pytest tests/test_chemotyper_concordance.py -v -s

# Pylint
pylint pyCSRML/

Citation

If you use pyCSRML in academic work, please cite the original ToxPrint / ChemoTyper paper and the TxP_PFAS reference:

  • Yang, C., et al. (2015). New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and data modeling. J. Chem. Inf. Model. 55, 510–528.
  • Richard, A.M., et al. (2023). A new CSRML structure-based fingerprint method for profiling and categorizing per- and polyfluoroalkyl substances (PFAS). Chem. Res. Toxicol. 36, 508–534.

Licence

pyCSRML © 1999 by Luc T. Miaz is licensed under CC BY 4.0

Acknowledgments

While working on this project I was part of the ZeroPM project (WP2) and received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.

Powered by RDKit
