
A comprehensive cheminformatics package for automated detection, classification, and analysis of halogenated substances, with a focus on per- and polyfluoroalkyl substances (PFAS). Combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding to identify 119 groups. Supports the creation of embeddings with graph metrics for machine-learning workflows.


PFASGroups

A comprehensive cheminformatics package for automated detection, classification, and analysis of halogenated substances, in particular per- and polyfluoroalkyl substances (PFAS) in chemical databases.

Overview

PFASGroups combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding (using RDKit and NetworkX) to identify and classify PFAS compounds. The package enables systematic PFAS universe mapping and environmental monitoring applications.

Key Features

Core Capabilities

  • Halogen Group Identification: Automated detection of 119 functional groups (114 compiled for fluorine-only embedding):
    • 27 PFAS OECD groups
    • 48 generic functional groups (IDs 29–73, 117–119; the last 3 are halogen-context-dependent or recently added)
    • 43 fluorotelomer-specific groups with CH₂ linker validation
    • 1 aggregate pattern-matching group (Group 116: Telomers, compute=False)
  • Atom Reference Requirement: For non-telomer groups, SMARTS patterns must match atoms that are part of or directly connected to the fluorinated component (per/polyfluorinated carbons), respecting the max_dist_from_comp constraint
  • Linker Validation: CH₂-specific validation for 40 fluorotelomer groups to distinguish from direct-attachment analogues. Telomer groups use linker_smarts to allow functional groups separated from perfluoro chains by non-fluorinated linkers
  • Aggregate Groups: Pattern-matching groups that collect related PFAS groups via regex (e.g., Group 116 matches all "telomer" groups)
  • Component Length Analysis: Quantification of per- and polyfluorinated alkyl components with CF₂ unit counting
  • Graph Metrics: Comprehensive structural characterization (branching, eccentricity, diameter, resistance, centrality, telomer spacer length, ring size)
  • Customizable Definitions: Easy extension to additional PFAS groups and halogenated chemical classes via JSON configuration. Component type definitions in data/component_smarts_halogens.json support optional per-type constraints (e.g. {"gte": {"F": 2}}) that are evaluated against the component's full atom count (backbone carbons plus attached halogens) to enforce minimum halogen counts or element exclusions — without repeating the check in every group definition
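
The per-type constraints described in the last bullet can be pictured as a comparison against element counts. The following is a minimal sketch, not the package's implementation: `check_constraints` is a hypothetical helper, only the `gte` key is taken from this README, and the `exclude` key name is an assumption used to illustrate element exclusions.

```python
from collections import Counter


def check_constraints(atom_symbols, constraints):
    """Return True if the element counts satisfy every constraint.

    Hypothetical helper: only the "gte" key appears in the README;
    "exclude" is an assumed key name used here for illustration.
    """
    counts = Counter(atom_symbols)
    # "gte": each listed element must occur at least the given number of times
    for element, minimum in constraints.get("gte", {}).items():
        if counts[element] < minimum:
            return False
    # "exclude" (assumed): none of the listed elements may be present
    for element in constraints.get("exclude", []):
        if counts[element] > 0:
            return False
    return True


# A CF3-CF2 component: two backbone carbons plus five attached fluorines
component = ["C", "C", "F", "F", "F", "F", "F"]
print(check_constraints(component, {"gte": {"F": 2}}))    # True
print(check_constraints(["C", "F"], {"gte": {"F": 2}}))   # False
```

Counting over the full atom list (backbone plus attached halogens) mirrors the description above; how the package itself collects those atoms is not shown here.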

Additional Tools

  • Homologue Series Generation: Iterative component shortening to explore theoretical chemical space
  • Fingerprint Generation: PFAS fingerprints for machine learning applications
  • Visualization: Assign and visualize PFAS groupings
  • Multiple Interfaces: Python API, command-line tool, and browser-based JavaScript version (RDKitJS)
  • Batch Processing: Efficient analysis of large chemical databases

Installation

The recommended installation method is via pip. RDKit must already be available in the target environment. We recommend using an environment manager such as Conda or Mamba (e.g. via Miniforge) and installing RDKit with:

mamba install -y -c conda-forge rdkit

From PyPI

PFASGroups is available on PyPI:

pip install PFASGroups

From source (recommended for development)

Prerequisites: Python ≥ 3.7, RDKit (install via conda or pip before the steps below).

# Clone the repository
git clone https://github.com/lucmiaz/PFASGroups.git
cd PFASGroups

# Install in editable mode (development install)
pip install -e .

Note for conda users: install RDKit via conda before running pip:

conda install -c conda-forge rdkit

After installation, both HalogenGroups (all halogens by default) and PFASgroups (fluorine only by default) are importable, and the pfasgroups CLI command is available in your terminal.

Verify installation

from PFASGroups import parse_smiles
results = parse_smiles("FC(F)(F)C(F)(F)C(=O)O")   # → PFASEmbeddingSet
print(results)        # prints PFASEmbeddingSet summary (molecule count, matched groups, …)
print(results[0])     # prints PFASEmbedding summary for the first molecule
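
If the import above fails, it can help to check which modules are visible to the interpreter before debugging further. This is a small sketch using only the standard library; `find_missing` is a hypothetical helper name, and the module names are the ones used in this README.

```python
import importlib.util


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(["rdkit", "PFASGroups", "HalogenGroups"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All modules found")
```

A missing `rdkit` usually means the package was installed into a different environment than the one RDKit lives in.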

Graphical User Interface (GUI)

A GUI is available to run the main commands from the module. To launch the GUI, call the following command from the PFASGroups folder:

python -m gui

Note that both PyQt6 and pyCSRML must be installed to benefit from all features.

Binary release

To be done...

Repository Structure

PFASGroups/
├── HalogenGroups/                   # Multi-halogen wrapper package
│   └── __init__.py                  #   Wraps PFASgroups; defaults halogens=['F','Cl','Br','I']
├── PFASGroups/                      # Core implementation package (also importable as PFASgroups)
│   ├── __init__.py                  #   Public API
│   ├── core.py                      #   SMARTS matching engine, component detection, decorators
│   ├── parser.py                    #   parse_smiles / parse_mols entry points
│   ├── embeddings.py                #   presets
│   ├── PFASEmbeddings.py            #   PFASEmbeddingSet, PFASEmbedding
│   ├── HalogenGroupModel.py         #   HalogenGroup data model
│   ├── PFASDefinitionModel.py       #   PFASDefinition model
│   ├── ComponentsSolverModel.py     #   Graph-based path-finding solver
│   ├── getter.py                    #   get_HalogenGroups, get_componentSMARTSs, …
│   ├── cli.py                       #   Command-line interface (halogengroups / pfasgroups)
│   ├── prioritise.py                #   prioritise_molecules
│   ├── generate_homologues.py       #   Homologue series generation
│   ├── fragmentation.py             #   Fragment utilities
│   ├── draw_mols.py                 #   Molecule and group visualisation helpers
│   └── data/                        #   JSON configuration files (bundled with the package)
│       ├── Halogen_groups_smarts.json      # 113 halogen group definitions
│       ├── component_smarts.json           # Fluorinated component SMARTS patterns
│       ├── component_smarts_halogens.json  # Multi-halogen component SMARTS patterns
│       └── PFAS_definitions_smarts.json    # PFAS regulatory definitions
├── tests/                           # Pytest test suite (25+ test modules)
│   ├── test_halogen_groups_smarts.py
│   ├── test_results_fingerprint.py
│   ├── test_results_model.py
│   ├── test_prioritise.py
│   └── …
├── examples/                        # Standalone usage example scripts
│   ├── results_fingerprint_analysis.py
│   ├── prioritization_examples.py
│   └── …
├── docs/                            # Sphinx documentation source
│   ├── quickstart.rst
│   ├── algorithm.rst
│   ├── customization.rst
│   ├── ResultsFingerprint_Guide.md
│   └── …
├── benchmark/                       # Benchmarking scripts and timing reports
│   ├── data/
│   ├── reports/
│   └── …
└── pyproject.toml                   # Package metadata and build configuration

Two importable packages, one codebase:

| Package | Default halogens | Typical use |
|---|---|---|
| HalogenGroups | ['F', 'Cl', 'Br', 'I'] (all) | Multi-halogen analysis |
| PFASgroups | 'F' (fluorine only) | PFAS / fluorine-focused analysis |

Benchmark Summary (Mar 2026, v. 3.2.2, F groups)

Benchmarks were run on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (4C/8T), 15.5 GB RAM, Python 3.9.23, RDKit 2025.09.2, NetworkX 3.2.1 (the old version of Python was taken for compatibility with PFAS-atlas).

| Dataset/Profile | Count | Atom range | PFASgroups mean/median (ms) | PFAS-Atlas mean/median (ms) | Relative speed | Notes |
|---|---|---|---|---|---|---|
| OECD reference (real compounds) | 3,707 | varied | 26.3 / 20.8 | 32.4 / 37.0 | 1.23x faster | Real-world OECD dataset. |
| CLInventory, F only (real compounds) | 28,328 | varied | 23.2 / 17.5 | 5.1 / 1.0 | 4.6x slower | Large real-world dataset; Atlas faster due to smaller typical molecule size. |
| Large PFAS (≥35 atoms) | 1,242 | ≥35 | 526.0 / 253.4 | 93.8 / 72.5 | 5.6x slower | Large-molecule subset; heavy-tail runtime. |
| Stress-test (full metrics) | 2,500 | 11-622 | 271.8 / 35.9 | 67.0 / 38.7 | 4.1x slower | Synthetic stress-test; heavy-tail runtime. |
| Stress-test (no EGR) | 2,500 | 11-625 | 272.6 / 31.6 | 67.2 / 38.7 | ≈ full | Disables effective graph resistance; negligible speedup. |
| Stress-test (no metrics) | 2,500 | 9-621 | 205.1 / 28.7 | 67.0 / 38.4 | 1.33x faster vs full | Disables all component graph metrics. |
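
The "Relative speed" column appears to be the ratio of the two mean runtimes. The sketch below reproduces a few entries from the rounded means in the table; `relative_speed` is a hypothetical helper, and figures computed from rounded means can differ from the published ones in the last digit.

```python
def relative_speed(pfasgroups_mean_ms, other_mean_ms):
    """Describe how PFASgroups compares with the other tool, using mean times."""
    if pfasgroups_mean_ms <= other_mean_ms:
        return f"{other_mean_ms / pfasgroups_mean_ms:.2f}x faster"
    return f"{pfasgroups_mean_ms / other_mean_ms:.1f}x slower"


print(relative_speed(26.3, 32.4))    # OECD reference -> 1.23x faster
print(relative_speed(526.0, 93.8))   # Large PFAS     -> 5.6x slower
print(relative_speed(271.8, 67.0))   # Stress-test    -> 4.1x slower
```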

[Figure: timing profile plots (full vs no EGR vs no metrics)]

Disable or limit graph metrics in the Python API:

from PFASGroups import parse_smiles

smiles_list = ['C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O',
               'C1=CC(=CC=C1[N+](=O)[O-])OC2=C(C=C(C=C2)C(F)(F)F)[N+](=O)[O-]',
               'C(C(C(C(F)(F)S(=O)(=O)O)(F)F)(F)F)(C(C(F)(F)F)(F)F)(C(F)(F)F)C(F)(F)F'
                ]

# Skip all component graph metrics (fastest)
rNoMetrics = parse_smiles(smiles_list, compute_component_metrics=False)
print(rNoMetrics)
rNoMetrics.summary()

# Keep metrics but skip effective graph resistance entirely
rNoEGR = parse_smiles(smiles_list, limit_effective_graph_resistance=0)
print(rNoEGR)

# Compute resistance only for components below a size threshold
rLimitEGR = parse_smiles(smiles_list, limit_effective_graph_resistance=20)
print(rLimitEGR)

CLI equivalents:

# Skip all component graph metrics (fastest)
pfasgroups parse --no-component-metrics "C(C(F)(F)F)F"

# Skip effective graph resistance entirely
pfasgroups parse --limit-effective-graph-resistance 0 "C(C(F)(F)F)F"

# Compute resistance only for components below a size threshold
pfasgroups parse --limit-effective-graph-resistance 200 "C(C(F)(F)F)F"

Quick Start

Python API

from PFASGroups import parse_smiles

# Parse PFAS structures — returns a PFASEmbeddingSet
smiles_list = ["C[Si](C)(Cl)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(=C(C(F)(F)F)C(F)(F)F)C(F)(F)C(F)(F)F",
        "FC(F)(F)C(F)(C(F)(F)F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)(F)F)C(F)(F)F",
        "C[Si](Cl)(Cl)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "CC(=O)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)=C(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)(F)N1C(F)(F)C(F)(F)C(F)(F)C(F)(F)C1(F)F",
        "FC(F)(F)C(F)(F)C(F)(F)CCl",
        "OCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "OS(=O)(=O)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)COC(=O)C=C",
        "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)OC=C",
        "FC(F)(F)C(I)(C(F)(F)F)C(F)(F)F",
        "OC(C(F)(F)C(F)(F)F)(C(F)(F)C(F)(F)F)C(F)(F)C(F)(F)F",
        "C[Si](Cl)(Cl)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)=C(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "CCN(CC)CC.NS(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "[OH-].C[N+](C)(C)CCCNS(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "[K+].[O-]S(=O)(=O)C(F)(F)C(F)(C(F)(F)F)C(F)(F)F",
        "[NH4+].[O-]P(=O)(OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F)OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "OCCN(CCO)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)F",
        "OP(O)(=O)OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)(F)F)C(F)(F)F",
        "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)N(C(F)(F)C(F)(F)C(F)(F)C(F)(F)F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "COC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)OCC=C",
        "CC(C)CCOC(=O)CC(NC(=O)C(F)(F)C(F)(F)C(F)(F)F)C(=O)OCCC(C)C",
        "CN(C(=O)C(F)(F)C(F)(F)C(F)(F)F)C(=O)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)(F)C(F)(F)C(F)(F)C=CI",
        "OC(=O)C1=CC=CC=C1NC(=O)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)(F)C1(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)(F)F)C1(F)F"]

results = parse_smiles(smiles_list)           # → PFASEmbeddingSet
print(results)                                # summary: molecule count, top groups, …
print(results[0])                             # per-molecule summary for first entry

# Embedding from a pre-parsed set (avoids re-parsing)
arr  = results.to_array()                     # default: binary, all groups
arr  = results.to_array(group_selection='oecd', component_metrics=['binary'])
cols = results.column_names()                 # matching column labels

# Filter components by halogen, form, and saturation
results_f      = parse_smiles(smiles_list, halogens='F')  # Fluorine only
results_pfa    = parse_smiles(smiles_list, halogens='F', saturation='per', form='alkyl')  # Perfluoroalkyl only
results_cyclic = parse_smiles(smiles_list, form='cyclic')  # Cyclic forms only

# Dimensionality reduction (methods on PFASEmbeddingSet)
pca_result  = results.perform_pca(n_components=5, plot=True)
tsne_result = results.perform_tsne(perplexity=30, plot=True)
umap_result = results.perform_umap(n_neighbors=15, plot=True)

# Colour each point by the PFAS group with the highest match count
pca_result  = results.perform_pca(n_components=5,  color_by='top_group')
tsne_result = results.perform_tsne(perplexity=30,   color_by='top_group')
umap_result = results.perform_umap(n_neighbors=15,  color_by='top_group')

# Or supply your own per-molecule labels (e.g. from external metadata)
labels = ['industrial'] * 10 + ['environmental'] * (len(smiles_list) - 10)
pca_result  = results.perform_pca(color_by=labels)

# Compare two datasets using KL divergence
other_smiles_list = smiles_list[0:10] + \
    ["OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(O)=O",
     "OCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
     "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)COCC=C"]
other_results = parse_smiles(other_smiles_list)
similarity = results.compare_kld(other_results, method='minmax')
print(f"Similarity between the lists of compounds: {similarity:2.3f}")

# Save to SQL database
results.to_sql(filename='results.db')

# Prioritization tool for screening and ranking
from PFASGroups import prioritise_molecules

# Prioritize by similarity to reference list (e.g., known persistent PFAS)
reference = ["FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O"]
prioritised, scores = prioritise_molecules(
    molecules=smiles_list,
    reference=reference,
    return_scores=True
)
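# Build the joined string first: backslashes inside f-string expressions are a SyntaxError before Python 3.12
order = "\n".join(x['smiles'] for x in prioritised)
print(f"Prioritisation order: {order}")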

# Prioritize by intrinsic fluorination properties (long chains, high F content)
prioritised = prioritise_molecules(
    molecules=smiles_list,
    a=1.0,  # Weight for total fluorination
    b=5.0,  # Weight for longest chains
    percentile=90,  # Focus on 90th percentile of component sizes
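    return_scores=False
)
# Build the joined string first: backslashes inside f-string expressions are a SyntaxError before Python 3.12
order = "\n".join(x['smiles'] for x in prioritised)
print(f"Prioritisation order: {order}")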

Command Line

# Parse SMILES strings
pfasgroups parse "C(C(F)(F)F)F" "FC(F)(F)C(F)(F)C(=O)O"

# Generate fingerprints
pfasgroups fingerprint "C(C(F)(F)F)F" --format dict

# List available PFAS groups
pfasgroups list-groups

Filtering Components by Halogen, Form, and Saturation

Filter component matches by specific halogens, chemical forms, or saturation levels:

from HalogenGroups import parse_smiles

smiles_list = ["C(C(F)(F)F)F", "FC(F)(F)C(F)(F)C(=O)O", "C(C(Cl)(Cl)Cl)Cl"]

# Filter only fluorine components
results_f = parse_smiles(smiles_list, halogens='F')

# Filter perfluorinated alkyl compounds
results_pfa = parse_smiles(smiles_list, halogens='F', saturation='per', form='alkyl')

# Filter polyfluorinated cyclic compounds
results_polyf_cyclic = parse_smiles(smiles_list, halogens='F', saturation='poly', form='cyclic')

# Filter multiple halogens (F and Cl)
results_multi = parse_smiles(smiles_list, halogens=['F', 'Cl'])

# Valid filter options:
# - halogens: 'F', 'Cl', 'Br', 'I', or list like ['F', 'Cl']
# - saturation: 'per' or 'poly' (or list like ['per', 'poly'] for both)
# - form: 'alkyl' or 'cyclic' (or list like ['alkyl', 'cyclic'] for both)

Embedding with Graph Metrics

The to_array() / to_fingerprint() methods accept a component_metrics list that stacks one block of N_G columns per metric (default N_G = 114 for fluorine-only).

import numpy as np
from PFASGroups import parse_smiles, EMBEDDING_PRESETS

smiles = [
    "O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",  # PFOA
    "O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",          # PFHpA
    "OCC(F)(F)C(F)(F)C(F)(F)C(F)(F)F",                              # 4:1 FTOH (single CH₂ spacer)
    "OCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",               # 6:1 FTOH (single CH₂ spacer)
]
results = parse_smiles(smiles)

# --- binary embedding (default) ---
arr_bin = results.to_array()                    # shape (4, 114)

# --- preset 'best': binary + effective_graph_resistance ---
# Best discrimination (mean Tanimoto 0.184, outperforms TxP-PFAS 129-bit)
print(EMBEDDING_PRESETS['best']['description'])
arr_best = results.to_array(preset='best')      # shape (4, 228) = 114 × 2

print(arr_best.shape)

# --- effective graph resistance directly ---
arr_egr = results.to_array(
    component_metrics=['binary', 'effective_graph_resistance']
)                                               # shape (4, 228)

print(arr_egr.shape)

# --- n_spacer: telomer CH₂ spacer length (encodes 'm' in 'm:n' notation) ---
# Zero for non-telomers; distinguishes 2:1, 4:2, 6:2 FTOHs in a single column
arr_ns = results.to_array(component_metrics=['n_spacer'])  # shape (4, 114)

print(arr_ns.shape)

# --- ring_size: smallest ring overlapping each matched component ---
# Zero for acyclic groups; 5 for azoles/furans; 6 for benzene/cyclohexane
arr_rs = results.to_array(component_metrics=['ring_size'])  # shape (4, 114)

print(arr_rs.shape)

# --- combined: EGR + n_spacer + ring_size + molecule-wide descriptors ---
arr_combined = results.to_array(
    component_metrics=['binary', 'effective_graph_resistance',
                       'n_spacer', 'ring_size'],
    molecule_metrics=['n_components', 'max_size',
                      'mean_branching', 'max_branching',
                      'mean_component_fraction', 'max_component_fraction'],
)  # shape (4, 4*114 + 6) = (4, 462)

print(arr_combined.shape)

# --- multi-halogen: parse with all halogens, then request the same halogens in to_array ---
halogens = ['F','Cl','Br','I']
r_hal = parse_smiles(smiles, halogens=halogens)
arr_4x = r_hal.to_array(component_metrics=['effective_graph_resistance'], molecule_metrics=['n_components','max_size','mean_branching'], halogens=halogens)

print(arr_4x.shape)

# --- column names ---
cols = results.column_names(
    component_metrics=['binary', 'effective_graph_resistance']
)
print(cols[:4])   # e.g. ['Perfluoroalkyl [binary]', ..., 'Perfluoroalkyl [EGR]', ...]

print(arr_4x[:4])

See examples/embedding_with_graph_metrics.py for a complete runnable script covering all options.
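
The shapes annotated in the example follow from simple arithmetic: one block of N_G columns per component metric, plus one column per molecule-wide metric. A minimal sketch of that bookkeeping, where `embedding_width` is a hypothetical helper and N_G = 114 is the fluorine-only count stated above:

```python
def embedding_width(n_groups, component_metrics, molecule_metrics=()):
    """One block of n_groups columns per component metric, plus one column per molecule metric."""
    return n_groups * len(component_metrics) + len(molecule_metrics)


N_G = 114  # fluorine-only group count stated in this README

print(embedding_width(N_G, ["binary"]))                                # 114
print(embedding_width(N_G, ["binary", "effective_graph_resistance"]))  # 228
print(embedding_width(
    N_G,
    ["binary", "effective_graph_resistance", "n_spacer", "ring_size"],
    ["n_components", "max_size", "mean_branching", "max_branching",
     "mean_component_fraction", "max_component_fraction"],
))                                                                     # 462
```

This only covers the fluorine-only case; how the width scales when several halogens are requested is not described here.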

Multi-Halogen Analysis

PFASGroups supports fluorine, chlorine, bromine, and iodine. There are two ways to analyse all halogens at once:

Option A – import HalogenGroups (all halogens by default)

from HalogenGroups import parse_smiles

smiles_list = [
    "C(C(F)(F)F)F",          # fluorinated
    "ClC(Cl)(Cl)C(Cl)(Cl)Cl", # chlorinated
    "BrC(Br)(Br)CBr",         # brominated
]

# halogens defaults to ['F','Cl','Br','I'] — no extra argument needed
results = parse_smiles(smiles_list)           # → PFASEmbeddingSet

# to_array() reads the halogen info already captured during parsing
arr  = results.to_array(group_selection='oecd', component_metrics=['binary'], halogens = ['F','Cl','Br','I'])
cols = results.column_names(group_selection='oecd')

print(arr.shape)

Option B – import PFASgroups and specify halogens explicitly

from PFASGroups import parse_smiles

smiles_list = [
    "C(C(F)(F)F)F",
    "ClC(Cl)(Cl)C(Cl)(Cl)Cl",
    "BrC(Br)(Br)CBr",
]

# Explicitly pass all halogens
results = parse_smiles(smiles_list, halogens=['F', 'Cl', 'Br', 'I'])  # → PFASEmbeddingSet

arr  = results.to_array(group_selection='oecd', component_metrics=['binary'], halogens=['F', 'Cl', 'Br', 'I'])
cols = results.column_names(group_selection='oecd')

Custom Configuration

Use custom pathtype definitions and PFAS groups:

# Load custom files entirely
from PFASGroups import get_componentSMARTSs, get_HalogenGroups, parse_smiles
from PFASGroups.core import preprocess_componentsSmarts
from PFASGroups.parser import preprocess_HalogenGroups

# You can load component smarts
custom_paths = get_componentSMARTSs()
# Add or modify componentSMARTSs as you want
# ....
# then preprocess them
custom_paths = preprocess_componentsSmarts(custom_paths)


# You can load HalogenGroups
custom_groups = get_HalogenGroups()
# Add or modify HalogenGroups as you want
# ....
# then preprocess them
pfas_groups, agg_groups = preprocess_HalogenGroups(custom_groups)


results = parse_smiles(
    ["C(C(F)(F)F)F"],
    componentSmartss=custom_paths,
    pfas_groups=pfas_groups,
    agg_pfas_groups = agg_groups
)
# Or extend defaults with your custom groups
from PFASGroups import get_compiled_HalogenGroups, HalogenGroup, parse_smiles, get_componentSMARTSs
from PFASGroups.core import preprocess_componentsSmarts

# Add custom PFAS groups
groups = get_compiled_HalogenGroups()  # Get defaults, precompiled
# Create a new group
custom_group = HalogenGroup(
    id=999,
    name="My Custom Group",
    smarts={"[C](F)(F)F": 1},
    componentSmarts=None,
    componentSaturation="per",
    componentHalogens="F",
    componentForm="alkyl",
    constraints={"gte": {"F": 1}}
)

# then add it to the list of groups
groups.append(custom_group)

# pass the new set of groups as argument to parse_smiles
results = parse_smiles(["FC(F)(F)C(F)(F)[N+](=O)[O-]"], pfas_groups=groups)

# Alternatively, select and modify some existing default_groups
from PFASGroups import get_HalogenGroups, HalogenGroup
# get and subset the default groups
default_groups = get_HalogenGroups()[2:4]
# extract one group
modified_group = default_groups[1]
# modify its parameters, here simply require presence of at least two structures
modified_group['smarts'] = {pattern: 2 for pattern in modified_group['smarts']}
# replace the modified group in the groups list
default_groups[1] = modified_group
# compile the groups
compiled_groups = [HalogenGroup(**x) for x in default_groups]

# Possibly add another group with max_dist_from_comp = 3
compiled_groups.append(HalogenGroup(
    id=998,
    name="Extended Distance Group",
    smarts={"[#6$([#6][OH1])]": 1},
    componentSmarts=None,
    constraints={},
    max_dist_from_comp=3  # Allow up to 3 bonds from fluorinated carbon
))

# parse the groups:
results = parse_smiles(["FC(F)(F)C(F)(F)[N+](=O)[O-]"], pfas_groups=compiled_groups)

# Add or modify custom path types (e.g., chlorinated analogs)
paths = get_componentSMARTSs()
paths['Cl']['alkyl'] = {"per":{
                "component":"[C;X4](Cl)(Cl)!@!=!#[C;X4](Cl)(Cl)",
                "name":"Custom Chloro"}}
paths = preprocess_componentsSmarts(paths)

results = parse_smiles(["ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O"], halogens = 'Cl', componentSmartss=paths)
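
The max_dist_from_comp setting used in the "Extended Distance Group" above can be pictured as a bounded breadth-first search from the matched atom towards the fluorinated component. The sketch below is an illustration on a toy adjacency list, not the package's code; `within_distance` is a hypothetical helper, and the exact atoms between which the package measures distance are an assumption.

```python
from collections import deque


def within_distance(adjacency, start, targets, max_dist):
    """Return True if any atom in `targets` is at most `max_dist` bonds from `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom in targets:
            return True
        if dist == max_dist:
            continue  # do not expand beyond the allowed number of bonds
        for neighbour in adjacency[atom]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return False


# Heavy-atom chain of OCCC(F)(F)C(F)(F)F as a toy graph (fluorines omitted):
# 0 = O, 1 = CH2, 2 = CH2, 3 = CF2, 4 = CF3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
fluorinated = {3, 4}

print(within_distance(chain, 1, fluorinated, max_dist=3))  # True: CF2 is two bonds away
print(within_distance(chain, 0, fluorinated, max_dist=2))  # False: CF2 is three bonds away
```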

Documentation

Usage Examples

See USER_GUIDE.md for comprehensive examples including:

  • Basic PFAS parsing and analysis
  • Fingerprint generation for machine learning
  • Custom configuration files
  • Batch processing
  • Integration with pandas and scikit-learn

Summary of changes by version

  • Version 3.2.2: Fixed polyhalogenated alkyl components matching fewer than 2 halogens. Added an option to pass 'halogens' to formula constraints. Added options to include component-wide formula constraints (on dist-1 neighbours from matched C-only components).

  • Version 3.2.1: Added n-spacer metric for telomers and ring size for aryl and cyclic groups. These can be used in the embeddings.

  • Version 3.2.0: Use of BDE for computing graph resistance.

  • Version 3.1.4: Changed fingerprint parameters to molecule wide and component wide, with pre-configuration for best combinations.

  • Version 3.1.0: Added support for other halogens, changed names to be more generic (with some support for backward compatibility). Added component smarts for other halogens, cyclic and alkyl components.

  • Version 2.2.4 (Feb 2026): Advanced fingerprint analysis with dimensionality reduction (PCA, kernel-PCA, t-SNE, UMAP), KL divergence comparison for dataset similarity assessment, and SQL persistence for results. Added molecule prioritization tool for screening applications, ranking by similarity to reference lists or by intrinsic fluorination properties. Introduced PFASEmbeddingSet and PFASEmbedding with comprehensive analysis methods, automated plot generation, and extensive documentation.

  • Version 2.2.3 (Feb 2026): Added PFASEmbeddingSet container to offer easier plotting and summarising capabilities for results.

  • Version 2.2 (Feb 2026): Added linker_smarts option to specify a restriction on path between smarts groups and fluorinated component. Added new PFASgroups (telomers). v2.2.2 Fixed telomers and added examples and counter-examples to each PFASgroup. Removed boundary O in fluorinated components (for both Per and Polyalkyl components).

  • Version 2.1 (Jan 2026): Added support for multiple smarts, with individual minimum count, per PFASgroup.

  • Version 2.0 (Jan 2026): Major expansion of graph‑based component metrics, new coverage statistics, schema updates, and richer per‑component outputs.

Version 2.2.4 (February 2026) - Advanced Fingerprint Analysis

Major enhancement adding comprehensive dimensionality reduction and statistical comparison capabilities:

New Features:

  • PFASEmbeddingSet / PFASEmbedding: Unified result container with flexible embedding generation
  • Dimensionality Reduction: PCA, kernel-PCA, t-SNE, and UMAP with automatic plotting
  • Statistical Comparison: KL divergence for comparing dataset compositions
  • Database Persistence: SQL save/load for results
  • Molecule Prioritization: Screening and ranking tool for PFAS datasets
  • Comprehensive Documentation: Complete API reference, examples, and 100+ tests

Key Methods:

  • PFASEmbeddingSet.to_array(): Generate numeric embedding matrix from parsed results
  • PFASEmbeddingSet.column_names(): Column labels matching to_array() output
  • PFASEmbeddingSet.perform_pca(): Principal Component Analysis
  • PFASEmbeddingSet.perform_kernel_pca(): Non-linear kernel PCA
  • PFASEmbeddingSet.perform_tsne(): t-SNE visualization
  • PFASEmbeddingSet.perform_umap(): Fast UMAP dimensionality reduction
  • PFASEmbeddingSet.compare_kld(): Dataset similarity via KL divergence
  • prioritise_molecules(): Rank molecules by similarity or fluorination properties
  • PFASEmbeddingSet.to_sql(): Persist results to a SQLite/PostgreSQL database

Prioritization Strategies:

  • Reference-based: Rank by distributional similarity to known PFAS (e.g., persistent chemicals)
  • Intrinsic properties: Score by fluorination characteristics (total F content, chain length)
  • Flexible weighting: Tune parameters for different screening objectives

Use Cases:

  • Exploratory data analysis of PFAS inventories
  • Database comparison and compositional analysis
  • Cluster identification and pattern recognition
  • Machine learning preprocessing
  • Chemical space visualization
  • Priority screening for environmental monitoring
  • Regulatory watchlist generation

See docs/ResultsFingerprint_Guide.md, docs/prioritization.rst, examples/results_fingerprint_analysis.py, and examples/prioritization_examples.py for details.

Version 2.2.3 (February 2026) - PFASEmbeddingSet Container

New Features:

  • PFASEmbeddingSet container with visualization helpers
  • Enhanced component plotting utilities
  • Improved documentation and examples

Version 2.0 (January 2026) - Comprehensive Graph Metrics

Major enhancement adding comprehensive NetworkX graph theory metrics for detailed component analysis:

New Features:

  • Component-Level Metrics: Each fluorinated component now includes 15+ graph metrics:
    • diameter and radius - Graph eccentricity bounds
    • center, periphery, barycenter - Structural node sets
    • effective_graph_resistance - Sum of resistance distances
    • component_fraction - Fraction of molecule covered by component (includes all attached H, F, Cl, Br, I)
    • Distance metrics from functional groups to structural features
  • Molecular Coverage Metrics: New fraction-based metrics quantify fluorination extent:
    • mean_component_fraction - Average coverage per component
    • total_components_fraction - Total coverage by union of all components (accounts for overlaps)
  • Summary Statistics: Aggregated metrics across all components per PFAS group
  • Enhanced Database Models: New Components model stores individual component data with all metrics
  • Improved Analysis: Better understanding of molecular topology, branching, functional group positioning, and fluorination extent

Breaking Changes:

  • parse_mols output now includes additional summary metric fields (mean_diameter, mean_radius, etc.)
  • Database schema changes require migration (see DATABASE_MIGRATION_GUIDE.md)

Metrics Explained:

  • branching (0-1): Measures linearity (1.0 = linear, 0.0 = highly branched) - renamed from "eccentricity"
  • mean_eccentricity, median_eccentricity: Graph-theoretic eccentricity statistics for component nodes
  • smarts_centrality (0-1): Functional group position (1.0 = central, 0.0 = peripheral)
  • n_spacer (int ≥ 0): Fluorotelomer CH₂ spacer length — the "m" in "m:n" telomer notation; 0 for all non-telomeric groups
  • ring_size (int ≥ 0): Smallest ring overlapping the matched component; 0 for acyclic chains, 5 for azoles/furans, 6 for benzene/cyclohexane derivatives
  • component_fraction (0-1): Fraction of total molecule atoms in this component (includes all attached atoms)
  • total_components_fraction (0-1): Fraction of molecule covered by union of all components
  • diameter: Maximum distance between any two atoms in component
  • radius: Minimum eccentricity across component nodes
  • barycenter: Nodes minimizing total distance to all other nodes
  • center: Nodes with minimum eccentricity
  • periphery: Nodes with maximum eccentricity

See COMPREHENSIVE_METRICS_SUMMARY.md for complete documentation.
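
The eccentricity-based definitions above can be checked by hand on a small graph. The sketch below uses a plain breadth-first search on an unweighted adjacency list instead of NetworkX so that it runs without dependencies; `graph_summary` is a hypothetical helper, and the package's own metrics may be computed differently (for example with bond weights).

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search distances (in bonds) from `source` to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def graph_summary(adjacency):
    """Eccentricity-based metrics for a connected, unweighted graph."""
    all_dist = {n: shortest_path_lengths(adjacency, n) for n in adjacency}
    ecc = {n: max(d.values()) for n, d in all_dist.items()}
    total = {n: sum(d.values()) for n, d in all_dist.items()}
    diameter, radius = max(ecc.values()), min(ecc.values())
    smallest_total = min(total.values())
    return {
        "diameter": diameter,
        "radius": radius,
        "center": sorted(n for n in ecc if ecc[n] == radius),
        "periphery": sorted(n for n in ecc if ecc[n] == diameter),
        "barycenter": sorted(n for n in total if total[n] == smallest_total),
    }


# Four-carbon linear backbone C0-C1-C2-C3
linear = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
# Branched backbone: C0 bonded to C1, C2 and C3
branched = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(graph_summary(linear))
print(graph_summary(branched))
```

For the linear backbone the sketch reports diameter 3 and radius 2, with the two middle atoms as both center and barycenter; for the branched backbone the diameter drops to 2 and the branching atom is the sole center.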

  • Version 1.x: Shift to component‑based analysis with improved SMARTS matching and better handling of branched/cyclic structures.

Version 1.x - Component-Based Analysis

  • Replaced chain-finding with connected component analysis
  • Added support for branched and cyclic structures
  • Improved SMARTS pattern matching for diverse PFAS classes

Version 0.x - Path-Based Analysis

  • Find SMARTS match connected to either a second SMARTS or a default path-related SMARTS using networkx shortest_path.

Licence

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Contact me in case you want an exception to the No Derivatives term.

Acknowledgments

This project is part of the ZeroPM project (WP2) and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.


Powered by RDKit
