A comprehensive cheminformatics package for automated detection, classification, and analysis of halogenated substances, with a focus on per- and polyfluoroalkyl substances (PFAS). Combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding to identify 119 groups. Supports the creation of embeddings with graph metrics for machine learning workflows.

PFASGroups

A comprehensive cheminformatics package for automated detection, classification, and analysis of halogenated substances, in particular per- and polyfluoroalkyl substances (PFAS) in chemical databases.

Overview

PFASGroups combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding (using RDKit and NetworkX) to identify and classify PFAS compounds. The package enables systematic PFAS universe mapping and environmental monitoring applications.

Key Features

Core Capabilities

Halogen Group Identification: Automated detection of 119 functional groups (114 compiled for fluorine-only embedding):
- 27 PFAS OECD groups
- 48 generic functional groups (IDs 29–73, 117–119; the last 3 are halogen-context-dependent or recently added)
- 43 fluorotelomer-specific groups with CH₂ linker validation
- 1 aggregate pattern-matching group (Group 116: Telomers, compute=False)
Atom Reference Requirement: For non-telomer groups, SMARTS patterns must match atoms that are part of or directly connected to the fluorinated component (per/polyfluorinated carbons), respecting the max_dist_from_comp constraint
Linker Validation: CH₂-specific validation for 40 fluorotelomer groups to distinguish them from direct-attachment analogues. Telomer groups use linker_smarts to allow functional groups separated from perfluoro chains by non-fluorinated linkers
Aggregate Groups: Pattern-matching groups that collect related PFAS groups via regex (e.g., Group 116 matches all "telomer" groups)
Component Length Analysis: Quantification of per- and polyfluorinated alkyl components with CF₂ unit counting
Graph Metrics: Comprehensive structural characterization (branching, eccentricity, diameter, resistance, centrality, telomer spacer length, ring size)
Customizable Definitions: Easy extension to additional PFAS groups and halogenated chemical classes via JSON configuration. Component type definitions in data/component_smarts_halogens.json support optional per-type constraints (e.g. {"gte": {"F": 2}}). These are evaluated against the component's full atom count (backbone carbons plus attached halogens) to enforce minimum halogen counts or element exclusions, so the check does not have to be repeated in every group definition
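
The per-type constraint check described under Customizable Definitions can be pictured with a small sketch. It assumes that "gte" means "each listed element must occur at least this many times" and that a component's atom counts are available as a plain dictionary. The function name satisfies_gte and the example counts are illustrative only and are not part of the PFASGroups API.

```python
from collections import Counter

def satisfies_gte(atom_counts, constraints):
    """Return True if every element listed under 'gte' occurs at least the
    required number of times (assumed meaning of the 'gte' key)."""
    required = constraints.get("gte", {})
    return all(atom_counts.get(element, 0) >= minimum
               for element, minimum in required.items())

# Hypothetical component: a CF3-CF2 fragment, counted as backbone carbons
# plus attached halogens.
component = Counter({"C": 2, "F": 5})
constraint = {"gte": {"F": 2}}

print(satisfies_gte(component, constraint))                  # enough fluorine
print(satisfies_gte(Counter({"C": 2, "F": 1}), constraint))  # too little fluorine
```

Keeping the check at the component-type level means a single rule such as "at least two fluorines" applies to every group that uses that component type.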

Additional Tools

Homologue Series Generation: Iterative component shortening to explore theoretical chemical space
Fingerprint Generation: PFAS fingerprints for machine learning applications
Visualization: Assign and visualize PFAS groupings
Multiple Interfaces: Python API, command-line tool, and browser-based JavaScript version (RDKitJS)
Batch Processing: Efficient analysis of large chemical databases
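
As a rough illustration of what "iterative component shortening" means for homologue series, the sketch below removes one CF₂ unit at a time from a linear perfluoroalkyl chain. It works purely on the SMILES text and assumes the chain is written as repeated C(F)(F) tokens; the helper name is hypothetical, and the package itself presumably operates on molecular graphs rather than strings.

```python
def shorten_linear_perfluoroalkyl(smiles, unit="C(F)(F)"):
    """Yield successively shorter homologues by deleting one CF2 token at a
    time from a linear chain written as repeated 'C(F)(F)' tokens."""
    current = smiles
    # Continue only while more than one token remains, so that the terminal
    # CF3 (which also contains the token) is never deleted.
    while current.count(unit) > 1:
        current = current.replace(unit, "", 1)
        yield current

pfoa = "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F"
homologues = list(shorten_linear_perfluoroalkyl(pfoa))
for homologue in homologues:
    print(homologue)
```

This toy version only handles unbranched chains; branched or cyclic components need a graph-based approach.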

Installation

The recommended installation method is pip. RDKit must already be available in the target environment. Using an environment manager (Conda or Mamba, e.g. via Miniforge) is recommended; RDKit can then be installed with:

mamba install -y -c conda-forge rdkit

From PyPI

PFASGroups is available on PyPI:

pip install PFASGroups

From source (recommended for development)

Prerequisites: Python ≥ 3.7, RDKit (install via conda or pip before the steps below).

# Clone the repository
git clone https://github.com/lucmiaz/PFASGroups.git
cd PFASGroups

# Install in editable mode (development install)
pip install -e .

Note for conda users: install RDKit via conda before running pip:
conda install -c conda-forge rdkit

After installation, both HalogenGroups (all halogens by default) and PFASgroups (fluorine only by default) are importable, and the pfasgroups CLI command is available in your terminal.

Verify installation

from PFASGroups import parse_smiles
results = parse_smiles("FC(F)(F)C(F)(F)C(=O)O")   # → PFASEmbeddingSet
print(results)        # prints PFASEmbeddingSet summary (molecule count, matched groups, …)
print(results[0])     # prints PFASEmbedding summary for the first molecule

Graphical User Interface (GUI)

A GUI is available to run the main commands from the module. To launch the GUI, call the following command from the PFASGroups folder:

python -m gui

Note that PyQt6 and pyCSRML must both be installed to benefit from all features.

Binary release

To be done...

Repository Structure

PFASGroups/
├── HalogenGroups/                   # Multi-halogen wrapper package
│   └── __init__.py                  #   Wraps PFASgroups; defaults halogens=['F','Cl','Br','I']
├── PFASGroups/                      # Core implementation package (also importable as PFASgroups)
│   ├── __init__.py                  #   Public API
│   ├── core.py                      #   SMARTS matching engine, component detection, decorators
│   ├── parser.py                    #   parse_smiles / parse_mols entry points
│   ├── embeddings.py                #   presets
│   ├── PFASEmbeddings.py            #   PFASEmbeddingSet, PFASEmbedding
│   ├── HalogenGroupModel.py         #   HalogenGroup data model
│   ├── PFASDefinitionModel.py       #   PFASDefinition model
│   ├── ComponentsSolverModel.py     #   Graph-based path-finding solver
│   ├── getter.py                    #   get_HalogenGroups, get_componentSMARTSs, …
│   ├── cli.py                       #   Command-line interface (halogengroups / pfasgroups)
│   ├── prioritise.py                #   prioritise_molecules
│   ├── generate_homologues.py       #   Homologue series generation
│   ├── fragmentation.py             #   Fragment utilities
│   ├── draw_mols.py                 #   Molecule and group visualisation helpers
│   └── data/                        #   JSON configuration files (bundled with the package)
│       ├── Halogen_groups_smarts.json      # 119 halogen group definitions
│       ├── component_smarts.json           # Fluorinated component SMARTS patterns
│       ├── component_smarts_halogens.json  # Multi-halogen component SMARTS patterns
│       └── PFAS_definitions_smarts.json    # PFAS regulatory definitions
├── tests/                           # Pytest test suite (25+ test modules)
│   ├── test_halogen_groups_smarts.py
│   ├── test_results_fingerprint.py
│   ├── test_results_model.py
│   ├── test_prioritise.py
│   └── …
├── examples/                        # Standalone usage example scripts
│   ├── results_fingerprint_analysis.py
│   ├── prioritization_examples.py
│   └── …
├── docs/                            # Sphinx documentation source
│   ├── quickstart.rst
│   ├── algorithm.rst
│   ├── customization.rst
│   ├── ResultsFingerprint_Guide.md
│   └── …
├── benchmark/                       # Benchmarking scripts and timing reports
│   ├── data/
│   ├── reports/
│   └── …
└── pyproject.toml                   # Package metadata and build configuration

Two importable packages, one codebase:

| Package | Default `halogens` | Typical use |
|---|---|---|
| `HalogenGroups` | `['F', 'Cl', 'Br', 'I']` (all) | Multi-halogen analysis |
| `PFASgroups` | `'F'` (fluorine only) | PFAS / fluorine-focused analysis |
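
The "two packages, one codebase" arrangement can be sketched as a thin wrapper that re-exports the same function with a different default. The stand-in parser below only echoes its arguments; parse_smiles_stub and the two wrapper names are hypothetical and do not reproduce the actual HalogenGroups source.

```python
from functools import partial

def parse_smiles_stub(smiles, halogens="F"):
    """Stand-in for a parser entry point; it only echoes its arguments."""
    if isinstance(halogens, str):
        halogens = [halogens]
    return {"smiles": list(smiles), "halogens": list(halogens)}

# Fluorine-focused entry point: keeps the default halogens='F'.
parse_fluorine = parse_smiles_stub

# Multi-halogen entry point: same function, different default.
parse_all_halogens = partial(parse_smiles_stub, halogens=["F", "Cl", "Br", "I"])

print(parse_fluorine(["C(C(F)(F)F)F"])["halogens"])
print(parse_all_halogens(["ClC(Cl)(Cl)C(Cl)(Cl)Cl"])["halogens"])
# An explicit argument still overrides the wrapper's default.
print(parse_all_halogens(["C(C(F)(F)F)F"], halogens="F")["halogens"])
```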

Benchmark Summary (Mar 2026, v. 3.2.2, F groups)

Benchmarks were run on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (4C/8T) with 15.5 GB RAM, using Python 3.9.23, RDKit 2025.09.2, and NetworkX 3.2.1. This older Python version was chosen for compatibility with PFAS-Atlas.

| Dataset/Profile | Count | Atom range | PFASgroups mean / median (ms) | PFAS-Atlas mean / median (ms) | Relative speed | Notes |
|---|---|---|---|---|---|---|
| OECD reference (real compounds) | 3,707 | varied | 26.3 / 20.8 | 32.4 / 37.0 | 1.23x faster | Real-world OECD dataset. |
| CLInventory, F only (real compounds) | 28,328 | varied | 23.2 / 17.5 | 5.1 / 1.0 | 4.6x slower | Large real-world dataset; Atlas faster due to smaller typical molecule size. |
| Large PFAS (≥35 atoms) | 1,242 | ≥35 | 526.0 / 253.4 | 93.8 / 72.5 | 5.6x slower | Large-molecule subset; heavy-tail runtime. |
| Stress-test (full metrics) | 2,500 | 11-622 | 271.8 / 35.9 | 67.0 / 38.7 | 4.1x slower | Synthetic stress-test; heavy-tail runtime. |
| Stress-test (no EGR) | 2,500 | 11-625 | 272.6 / 31.6 | 67.2 / 38.7 | ≈ full | Disables effective graph resistance; negligible speedup. |
| Stress-test (no metrics) | 2,500 | 9-621 | 205.1 / 28.7 | 67.0 / 38.4 | 1.33x faster vs full | Disables all component graph metrics. |
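
The "Relative speed" column appears to be the ratio of the two mean runtimes. The sketch below recomputes it from the rounded means in the table, under that assumption. Because the inputs are already rounded, a recomputed factor can differ from the published one in the last digit.

```python
# Mean runtimes in ms per molecule, copied from the benchmark table:
# (PFASgroups mean, PFAS-Atlas mean)
means = {
    "OECD reference": (26.3, 32.4),
    "CLInventory, F only": (23.2, 5.1),
    "Large PFAS (>=35 atoms)": (526.0, 93.8),
    "Stress-test (full metrics)": (271.8, 67.0),
}

def relative_speed(pfasgroups_ms, atlas_ms):
    """Return (factor, label): factor >= 1 is the slower mean divided by the
    faster mean, and label says how PFASgroups compares with PFAS-Atlas."""
    if pfasgroups_ms <= atlas_ms:
        return atlas_ms / pfasgroups_ms, "faster"
    return pfasgroups_ms / atlas_ms, "slower"

for name, (ours, atlas) in means.items():
    factor, label = relative_speed(ours, atlas)
    print(f"{name}: {factor:.2f}x {label}")
```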

Timing profile plots (full vs no EGR vs no metrics):

benchmark/imgs/si/timing_profiles_comparison.png

Disable or limit graph metrics in the Python API:

from PFASGroups import parse_smiles

smiles_list = ['C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O',
               'C1=CC(=CC=C1[N+](=O)[O-])OC2=C(C=C(C=C2)C(F)(F)F)[N+](=O)[O-]',
               'C(C(C(C(F)(F)S(=O)(=O)O)(F)F)(F)F)(C(C(F)(F)F)(F)F)(C(F)(F)F)C(F)(F)F'
                ]

# Skip all component graph metrics (fastest)
rNoMetrics = parse_smiles(smiles_list, compute_component_metrics=False)
print(rNoMetrics)
rNoMetrics.summary()

# Keep metrics but skip effective graph resistance entirely
rNoEGR = parse_smiles(smiles_list, limit_effective_graph_resistance=0)
print(rNoEGR)

# Compute resistance only for components below a size threshold
rLimitEGR = parse_smiles(smiles_list, limit_effective_graph_resistance=20)
print(rLimitEGR)

CLI equivalents:

# Skip all component graph metrics (fastest)
pfasgroups parse --no-component-metrics "C(C(F)(F)F)F"

# Skip effective graph resistance entirely
pfasgroups parse --limit-effective-graph-resistance 0 "C(C(F)(F)F)F"

# Compute resistance only for components below a size threshold
pfasgroups parse --limit-effective-graph-resistance 20 "C(C(F)(F)F)F"

Quick Start

Python API

from PFASGroups import parse_smiles

# Parse PFAS structures — returns a PFASEmbeddingSet
smiles_list = ["C[Si](C)(Cl)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(=C(C(F)(F)F)C(F)(F)F)C(F)(F)C(F)(F)F",
        "FC(F)(F)C(F)(C(F)(F)F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)(F)F)C(F)(F)F",
        "C[Si](Cl)(Cl)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "CC(=O)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)=C(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)(F)N1C(F)(F)C(F)(F)C(F)(F)C(F)(F)C1(F)F",
        "FC(F)(F)C(F)(F)C(F)(F)CCl",
        "OCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "OS(=O)(=O)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)COC(=O)C=C",
        "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)OC=C",
        "FC(F)(F)C(I)(C(F)(F)F)C(F)(F)F",
        "OC(C(F)(F)C(F)(F)F)(C(F)(F)C(F)(F)F)C(F)(F)C(F)(F)F",
        "C[Si](Cl)(Cl)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)=C(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "CCN(CC)CC.NS(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "[OH-].C[N+](C)(C)CCCNS(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "[K+].[O-]S(=O)(=O)C(F)(F)C(F)(C(F)(F)F)C(F)(F)F",
        "[NH4+].[O-]P(=O)(OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F)OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "OCCN(CCO)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)F",
        "OP(O)(=O)OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)(F)F)C(F)(F)F",
        "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)N(C(F)(F)C(F)(F)C(F)(F)C(F)(F)F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "COC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)OCC=C",
        "CC(C)CCOC(=O)CC(NC(=O)C(F)(F)C(F)(F)C(F)(F)F)C(=O)OCCC(C)C",
        "CN(C(=O)C(F)(F)C(F)(F)C(F)(F)F)C(=O)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)(F)C(F)(F)C(F)(F)C=CI",
        "OC(=O)C1=CC=CC=C1NC(=O)C(F)(F)C(F)(F)C(F)(F)F",
        "FC(F)(F)C1(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)(F)F)C1(F)F"]

results = parse_smiles(smiles_list)           # → PFASEmbeddingSet
print(results)                                # summary: molecule count, top groups, …
print(results[0])                             # per-molecule summary for first entry

# Embedding from a pre-parsed set (avoids re-parsing)
arr  = results.to_array()                     # default: binary, all groups
arr  = results.to_array(group_selection='oecd', component_metrics=['binary'])
cols = results.column_names()                 # matching column labels

# Filter components by halogen, form, and saturation
results_f      = parse_smiles(smiles_list, halogens='F')  # Fluorine only
results_pfa    = parse_smiles(smiles_list, halogens='F', saturation='per', form='alkyl')  # Perfluoroalkyl only
results_cyclic = parse_smiles(smiles_list, form='cyclic')  # Cyclic forms only

# Dimensionality reduction (methods on PFASEmbeddingSet)
pca_result  = results.perform_pca(n_components=5, plot=True)
tsne_result = results.perform_tsne(perplexity=30, plot=True)
umap_result = results.perform_umap(n_neighbors=15, plot=True)

# Colour each point by the PFAS group with the highest match count
pca_result  = results.perform_pca(n_components=5,  color_by='top_group')
tsne_result = results.perform_tsne(perplexity=30,   color_by='top_group')
umap_result = results.perform_umap(n_neighbors=15,  color_by='top_group')

# Or supply your own per-molecule labels (e.g. from external metadata)
labels = ['industrial'] * 10 + ['environmental'] * (len(smiles_list) - 10)
pca_result  = results.perform_pca(color_by=labels)

# Compare two datasets using KL divergence
other_smiles_list = smiles_list[0:10] + \
    ["OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(O)=O",
     "OCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
     "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)COCC=C"]
other_results = parse_smiles(other_smiles_list)
similarity = results.compare_kld(other_results, method='minmax')
print(f"Similarity between the lists of compounds: {similarity:2.3f}")

# Save to SQL database
results.to_sql(filename='results.db')

# Prioritization tool for screening and ranking
from PFASGroups import prioritise_molecules

# Prioritize by similarity to reference list (e.g., known persistent PFAS)
reference = ["FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O"]
prioritised, scores = prioritise_molecules(
    molecules=smiles_list,
    reference=reference,
    return_scores=True
)
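# Join outside the f-string: backslashes inside f-string expressions are a SyntaxError before Python 3.12
print("Prioritisation order:\n" + "\n".join(x['smiles'] for x in prioritised))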

# Prioritize by intrinsic fluorination properties (long chains, high F content)
prioritised = prioritise_molecules(
    molecules=smiles_list,
    a=1.0,  # Weight for total fluorination
    b=5.0,  # Weight for longest chains
    percentile=90,  # Focus on 90th percentile of component sizes
    return_scores = False
)
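# Join outside the f-string: backslashes inside f-string expressions are a SyntaxError before Python 3.12
print("Prioritisation order:\n" + "\n".join(x['smiles'] for x in prioritised))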
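
The compare_kld call in the Quick Start compares two datasets through a KL divergence. The underlying quantity can be sketched in plain Python as follows. This is a generic, smoothed KL divergence over hypothetical group counts. It does not model the method='minmax' option, and the package's own implementation may differ.

```python
import math

def kl_divergence(p_counts, q_counts, pseudocount=1.0):
    """KL(P || Q) in nats between two group-count dictionaries, with additive
    smoothing so that groups missing from one dataset stay finite."""
    groups = sorted(set(p_counts) | set(q_counts))
    p_total = sum(p_counts.get(g, 0) + pseudocount for g in groups)
    q_total = sum(q_counts.get(g, 0) + pseudocount for g in groups)
    divergence = 0.0
    for g in groups:
        p = (p_counts.get(g, 0) + pseudocount) / p_total
        q = (q_counts.get(g, 0) + pseudocount) / q_total
        divergence += p * math.log(p / q)
    return divergence

# Hypothetical group counts for two datasets (not real PFASGroups output).
dataset_a = {"carboxylic acid": 12, "sulfonic acid": 5, "telomer alcohol": 3}
dataset_b = {"carboxylic acid": 4, "sulfonic acid": 9, "ether": 7}

print(kl_divergence(dataset_a, dataset_a))      # identical datasets
print(kl_divergence(dataset_a, dataset_b) > 0)  # differing datasets
```

KL divergence is not symmetric, so the order of the two datasets matters when interpreting the value.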

Command Line

# Parse SMILES strings
pfasgroups parse "C(C(F)(F)F)F" "FC(F)(F)C(F)(F)C(=O)O"

# Generate fingerprints
pfasgroups fingerprint "C(C(F)(F)F)F" --format dict

# List available PFAS groups
pfasgroups list-groups

Filtering Components by Halogen, Form, and Saturation

Filter component matches by specific halogens, chemical forms, or saturation levels:

from HalogenGroups import parse_smiles

smiles_list = ["C(C(F)(F)F)F", "FC(F)(F)C(F)(F)C(=O)O", "C(C(Cl)(Cl)Cl)Cl"]

# Filter only fluorine components
results_f = parse_smiles(smiles_list, halogens='F')

# Filter perfluorinated alkyl compounds
results_pfa = parse_smiles(smiles_list, halogens='F', saturation='per', form='alkyl')

# Filter polyfluorinated cyclic compounds
results_polyf_cyclic = parse_smiles(smiles_list, halogens='F', saturation='poly', form='cyclic')

# Filter multiple halogens (F and Cl)
results_multi = parse_smiles(smiles_list, halogens=['F', 'Cl'])

# Valid filter options:
# - halogens: 'F', 'Cl', 'Br', 'I', or list like ['F', 'Cl']
# - saturation: 'per' or 'poly' (or list like ['per', 'poly'] for both)
# - form: 'alkyl' or 'cyclic' (or list like ['alkyl', 'cyclic'] for both)

Embedding with Graph Metrics

The to_array() / to_fingerprint() methods accept a component_metrics list that stacks one block of N_G columns per metric (default N_G = 114 for fluorine-only).

import numpy as np
from PFASGroups import parse_smiles, EMBEDDING_PRESETS

smiles = [
    "O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",  # PFOA
    "O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",          # PFHpA
    "OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)F",                             # 4:2 FTOH (two CH2 carbons)
    "OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",              # 6:2 FTOH (two CH2 carbons)
]
results = parse_smiles(smiles)

# --- binary embedding (default) ---
arr_bin = results.to_array()                    # shape (4, 114)

# --- preset 'best': binary + effective_graph_resistance ---
# Best discrimination (mean Tanimoto 0.184, outperforms TxP-PFAS 129-bit)
print(EMBEDDING_PRESETS['best']['description'])
arr_best = results.to_array(preset='best')      # shape (4, 228) = 114 × 2

print(arr_best.shape)

# --- effective graph resistance directly ---
arr_egr = results.to_array(
    component_metrics=['binary', 'effective_graph_resistance']
)                                               # shape (4, 228)

print(arr_egr.shape)

# --- n_spacer: telomer CH₂ spacer length (the second number in X:Y telomer notation, e.g. 2 in 6:2 FTOH) ---
# Zero for non-telomers; separates, e.g., a 2:1 FTOH from 4:2 and 6:2 FTOHs in a single column
arr_ns = results.to_array(component_metrics=['n_spacer'])  # shape (4, 114)

print(arr_ns.shape)

# --- ring_size: smallest ring overlapping each matched component ---
# Zero for acyclic groups; 5 for azoles/furans; 6 for benzene/cyclohexane
arr_rs = results.to_array(component_metrics=['ring_size'])  # shape (4, 114)

print(arr_rs.shape)

# --- combined: EGR + n_spacer + ring_size + molecule-wide descriptors ---
arr_combined = results.to_array(
    component_metrics=['binary', 'effective_graph_resistance',
                       'n_spacer', 'ring_size'],
    molecule_metrics=['n_components', 'max_size',
                      'mean_branching', 'max_branching',
                      'mean_component_fraction', 'max_component_fraction'],
)  # shape (4, 4*114 + 6) = (4, 462)

print(arr_combined.shape)

# --- multi-halogen: parse once with all four halogens, then request them in to_array() ---
halogens = ['F', 'Cl', 'Br', 'I']
r_hal = parse_smiles(smiles, halogens=halogens)
arr_4x = r_hal.to_array(
    component_metrics=['effective_graph_resistance'],
    molecule_metrics=['n_components', 'max_size', 'mean_branching'],
    halogens=halogens,
)

print(arr_4x.shape)

# --- column names ---
cols = results.column_names(
    component_metrics=['binary', 'effective_graph_resistance']
)
print(cols[:4])   # e.g. ['Perfluoroalkyl [binary]', ..., 'Perfluoroalkyl [EGR]', ...]

print(arr_4x[:4])

See examples/embedding_with_graph_metrics.py for a complete runnable script covering all options.
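
The array shapes quoted in the comments of this section follow from simple arithmetic: one block of N_G columns per component metric, plus one column per molecule-wide metric. The helper below is a hypothetical illustration of that bookkeeping for the fluorine-only case (N_G = 114) and does not call the package.

```python
def embedding_width(n_groups, component_metrics, molecule_metrics=()):
    """One block of n_groups columns per component metric, followed by one
    column per molecule-wide metric."""
    return n_groups * len(component_metrics) + len(molecule_metrics)

N_G = 114  # fluorine-only group count quoted in this section

print(embedding_width(N_G, ["binary"]))                                # 114
print(embedding_width(N_G, ["binary", "effective_graph_resistance"]))  # 228
print(embedding_width(
    N_G,
    ["binary", "effective_graph_resistance", "n_spacer", "ring_size"],
    ["n_components", "max_size", "mean_branching", "max_branching",
     "mean_component_fraction", "max_component_fraction"],
))                                                                     # 462
```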

Multi-Halogen Analysis

PFASGroups supports fluorine, chlorine, bromine, and iodine. There are two ways to analyse all halogens at once:

Option A – import `HalogenGroups` (all halogens by default)

from HalogenGroups import parse_smiles

smiles_list = [
    "C(C(F)(F)F)F",          # fluorinated
    "ClC(Cl)(Cl)C(Cl)(Cl)Cl", # chlorinated
    "BrC(Br)(Br)CBr",         # brominated
]

# halogens defaults to ['F','Cl','Br','I'] — no extra argument needed
results = parse_smiles(smiles_list)           # → PFASEmbeddingSet

# to_array() reads the halogen info already captured during parsing
arr  = results.to_array(group_selection='oecd', component_metrics=['binary'], halogens = ['F','Cl','Br','I'])
cols = results.column_names(group_selection='oecd')

print(arr.shape)

Option B – import `PFASgroups` and specify `halogens` explicitly

from PFASGroups import parse_smiles

smiles_list = [
    "C(C(F)(F)F)F",
    "ClC(Cl)(Cl)C(Cl)(Cl)Cl",
    "BrC(Br)(Br)CBr",
]

# Explicitly pass all halogens
results = parse_smiles(smiles_list, halogens=['F', 'Cl', 'Br', 'I'])  # → PFASEmbeddingSet

arr  = results.to_array(group_selection='oecd', component_metrics=['binary'], halogens=['F', 'Cl', 'Br', 'I'])
cols = results.column_names(group_selection='oecd')

Custom Configuration

Use custom pathtype definitions and PFAS groups:

# Load custom files entirely
from PFASGroups import get_componentSMARTSs, get_HalogenGroups, parse_smiles
from PFASGroups.core import preprocess_componentsSmarts
from PFASGroups.parser import preprocess_HalogenGroups

# You can load component smarts
custom_paths = get_componentSMARTSs()
# Add or modify componentSMARTSs as you want
# ....
# then preprocess them
custom_paths = preprocess_componentsSmarts(custom_paths)


# You can load HalogenGroups
custom_groups = get_HalogenGroups()
# Add or modify HalogenGroups as you want
# ....
# then preprocess them
pfas_groups, agg_groups = preprocess_HalogenGroups(custom_groups)


results = parse_smiles(
    ["C(C(F)(F)F)F"],
    componentSmartss=custom_paths,
    pfas_groups=pfas_groups,
    agg_pfas_groups = agg_groups
)

# Or extend defaults with your custom groups
from PFASGroups import get_compiled_HalogenGroups, HalogenGroup, parse_smiles, get_componentSMARTSs
from PFASGroups.core import preprocess_componentsSmarts

# Add custom PFAS groups
groups = get_compiled_HalogenGroups()  # Get defaults, precompiled
# Create a new group
custom_group = HalogenGroup(
    id=999,
    name="My Custom Group",
    smarts={"[C](F)(F)F": 1},
    componentSmarts=None,
    componentSaturation="per",
    componentHalogens="F",
    componentForm="alkyl",
    constraints={"gte": {"F": 1}}
)

# then add it to the list of groups
groups.append(custom_group)

# pass the new set of groups as argument to parse_smiles
results = parse_smiles(["FC(F)(F)C(F)(F)[N+](=O)[O-]"], pfas_groups=groups)

# Alternatively, select and modify some existing default_groups
from PFASGroups import get_HalogenGroups, HalogenGroup
# get and subset the default groups
default_groups = get_HalogenGroups()[2:4]
# extract one group
modified_group = default_groups[1]
# modify its parameters, here simply require at least two matches of each SMARTS pattern
modified_group['smarts'] = {pattern: 2 for pattern in modified_group['smarts']}
# replace the modified group in the groups list
default_groups[1] = modified_group
# compile the groups
compiled_groups = [HalogenGroup(**x) for x in default_groups]

# Possibly add another group with max_dist_from_comp = 3
compiled_groups.append(HalogenGroup(
    id=998,
    name="Extended Distance Group",
    smarts={"[#6$([#6][OH1])]": 1},
    componentSmarts=None,
    constraints={},
    max_dist_from_comp=3  # Allow up to 3 bonds from fluorinated carbon
))

# parse the groups:
results = parse_smiles(["FC(F)(F)C(F)(F)[N+](=O)[O-]"], pfas_groups=compiled_groups)

# Add or modify custom path types (e.g., chlorinated analogs)
paths = get_componentSMARTSs()
paths['Cl']['alkyl'] = {"per":{
                "component":"[C;X4](Cl)(Cl)!@!=!#[C;X4](Cl)(Cl)",
                "name":"Custom Chloro"}}
paths = preprocess_componentsSmarts(paths)

results = parse_smiles(["ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O"], halogens = 'Cl', componentSmartss=paths)

Documentation

USER_GUIDE.md - Complete documentation with examples
QUICK_REFERENCE.md - Quick reference for common tasks

Usage Examples

See USER_GUIDE.md for comprehensive examples including:

Basic PFAS parsing and analysis
Fingerprint generation for machine learning
Custom configuration files
Batch processing
Integration with pandas and scikit-learn

Summary of changes by version

Version 3.2.2: Fixed polyhalogenated alkyl components matching fewer than 2 halogens. Added an option to pass 'halogens' to formula constraints. Added options for component-wide formula constraints (on dist-1 neighbours of matched C-only components).
Version 3.2.1: Added n-spacer metric for telomers and ring size for aryl and cyclic groups. These can be used in the embeddings.
Version 3.2.0: Use of BDE for computing graph resistance.
Version 3.1.4: Changed fingerprint parameters to molecule wide and component wide, with pre-configuration for best combinations.
Version 3.1.0: Added support for other halogens, changed names to be more generic (with some support for backward compatibility). Added component smarts for other halogens, cyclic and alkyl components.
Version 2.2.4 (Feb 2026): Advanced fingerprint analysis with dimensionality reduction (PCA, kernel-PCA, t-SNE, UMAP), KL divergence comparison for dataset similarity assessment, and SQL persistence for results. Added molecule prioritization tool for screening applications, ranking by similarity to reference lists or by intrinsic fluorination properties. Introduced PFASEmbeddingSet and PFASEmbedding with comprehensive analysis methods, automated plot generation, and extensive documentation.
Version 2.2.3 (Feb 2026): Added PFASEmbeddingSet container to offer easier plotting and summarising capabilities for results.
Version 2.2 (Feb 2026): Added the linker_smarts option to restrict the path between SMARTS groups and the fluorinated component. Added new PFASgroups (telomers). Version 2.2.2: fixed telomers and added examples and counter-examples to each PFASgroup; removed boundary O in fluorinated components (for both per- and polyfluorinated alkyl components).
Version 2.1 (Jan 2026): Added support for multiple smarts, with individual minimum count, per PFASgroup.
Version 2.0 (Jan 2026): Major expansion of graph‑based component metrics, new coverage statistics, schema updates, and richer per‑component outputs.

Version 2.2.4 (February 2026) - Advanced Fingerprint Analysis

Major enhancement adding comprehensive dimensionality reduction and statistical comparison capabilities:

New Features:

PFASEmbeddingSet / PFASEmbedding: Unified result container with flexible embedding generation
Dimensionality Reduction: PCA, kernel-PCA, t-SNE, and UMAP with automatic plotting
Statistical Comparison: KL divergence for comparing dataset compositions
Database Persistence: SQL save/load for results
Molecule Prioritization: Screening and ranking tool for PFAS datasets
Comprehensive Documentation: Complete API reference, examples, and 100+ tests

Key Methods:

PFASEmbeddingSet.to_array(): Generate numeric embedding matrix from parsed results
PFASEmbeddingSet.column_names(): Column labels matching to_array() output
PFASEmbeddingSet.perform_pca(): Principal Component Analysis
PFASEmbeddingSet.perform_kernel_pca(): Non-linear kernel PCA
PFASEmbeddingSet.perform_tsne(): t-SNE visualization
PFASEmbeddingSet.perform_umap(): Fast UMAP dimensionality reduction
PFASEmbeddingSet.compare_kld(): Dataset similarity via KL divergence
prioritise_molecules(): Rank molecules by similarity or fluorination properties
PFASEmbeddingSet.to_sql(): Persist results to a SQLite/PostgreSQL database

Prioritization Strategies:

Reference-based: Rank by distributional similarity to known PFAS (e.g., persistent chemicals)
Intrinsic properties: Score by fluorination characteristics (total F content, chain length)
Flexible weighting: Tune parameters for different screening objectives
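
To make the intrinsic-property strategy more concrete, here is one possible reading of the a, b, and percentile parameters: a weighted sum of the total fluorine count and a high percentile of the fluorinated component sizes. The formula, the function names, and the example molecules are all assumptions made for illustration. The actual scoring inside prioritise_molecules is not documented here and may differ.

```python
import math

def percentile_nearest_rank(values, percentile):
    """Nearest-rank percentile of a non-empty list (0 < percentile <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]

def intrinsic_score(n_fluorine, component_sizes, a=1.0, b=5.0, percentile=90):
    """Hypothetical score: a * total fluorine count plus b * a high
    percentile of the fluorinated component sizes."""
    if not component_sizes:
        return 0.0
    return a * n_fluorine + b * percentile_nearest_rank(component_sizes, percentile)

# Hypothetical molecules: (label, fluorine count, component sizes in carbons)
candidates = [
    ("short chain", 7, [3]),
    ("long chain", 17, [8]),
    ("two short chains", 14, [3, 3]),
]
ranked = sorted(candidates, key=lambda c: intrinsic_score(c[1], c[2]), reverse=True)
print([label for label, _, _ in ranked])
```

With b larger than a, a single long chain outranks several short chains of similar total fluorine content, which matches the stated aim of emphasising long chains.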

Use Cases:

Exploratory data analysis of PFAS inventories
Database comparison and compositional analysis
Cluster identification and pattern recognition
Machine learning preprocessing
Chemical space visualization
Priority screening for environmental monitoring
Regulatory watchlist generation

See docs/ResultsFingerprint_Guide.md, docs/prioritization.rst, examples/results_fingerprint_analysis.py, and examples/prioritization_examples.py for details.

Version 2.2.3 (February 2026) - PFASEmbeddingSet Container

New Features:

PFASEmbeddingSet container with visualization helpers
Enhanced component plotting utilities
Improved documentation and examples

Version 2.0 (January 2026) - Comprehensive Graph Metrics

Major enhancement adding comprehensive NetworkX graph theory metrics for detailed component analysis:

New Features:

Component-Level Metrics: Each fluorinated component now includes 15+ graph metrics:
- diameter and radius - Graph eccentricity bounds
- center, periphery, barycenter - Structural node sets
- effective_graph_resistance - Sum of resistance distances
- component_fraction - Fraction of molecule covered by component (includes all attached H, F, Cl, Br, I)
- Distance metrics from functional groups to structural features
Molecular Coverage Metrics: New fraction-based metrics quantify fluorination extent:
- mean_component_fraction - Average coverage per component
- total_components_fraction - Total coverage by union of all components (accounts for overlaps)
Summary Statistics: Aggregated metrics across all components per PFAS group
Enhanced Database Models: New Components model stores individual component data with all metrics
Improved Analysis: Better understanding of molecular topology, branching, functional group positioning, and fluorination extent
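
The effective graph resistance listed among the component metrics is the sum of resistance distances over all node pairs. For an acyclic graph with unit resistances, the resistance distance between two nodes equals the number of bonds on the path between them, so the metric can be sketched with a breadth-first search. The sketch below covers only that special case. It ignores rings and any bond weighting (the 3.2.0 notes mention bond dissociation energies), and the function names are illustrative.

```python
from collections import deque
from itertools import combinations

def hop_distances(adjacency, source):
    """Breadth-first search distances (in bonds) from source to every node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances

def tree_graph_resistance(adjacency):
    """Sum of pairwise resistance distances for an acyclic graph with unit
    resistances, where resistance distance equals path length."""
    all_distances = {node: hop_distances(adjacency, node) for node in adjacency}
    return sum(all_distances[u][v] for u, v in combinations(adjacency, 2))

# Hypothetical carbon skeletons: a linear C4 chain and a branched C4
# skeleton with one central carbon.
linear = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
branched = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(tree_graph_resistance(linear))    # 10
print(tree_graph_resistance(branched))  # 9
```

The more compact, branched skeleton yields the lower value, which is why this metric helps separate linear from branched components of the same size.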

Breaking Changes:

parse_mols output now includes additional summary metric fields (mean_diameter, mean_radius, etc.)
Database schema changes require migration (see DATABASE_MIGRATION_GUIDE.md)

Metrics Explained:

branching (0-1): Measures linearity (1.0 = linear, 0.0 = highly branched) - renamed from "eccentricity"
mean_eccentricity, median_eccentricity: Graph-theoretic eccentricity statistics for component nodes
smarts_centrality (0-1): Functional group position (1.0 = central, 0.0 = peripheral)
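n_spacer (int ≥ 0): Fluorotelomer CH₂ spacer length, i.e. the number of non-fluorinated CH₂ carbons between the fluorinated chain and the functional group (the second number in X:Y telomer notation, e.g. 2 in 6:2 FTOH); 0 for all non-telomeric groups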
ring_size (int ≥ 0): Smallest ring overlapping the matched component; 0 for acyclic chains, 5 for azoles/furans, 6 for benzene/cyclohexane derivatives
component_fraction (0-1): Fraction of total molecule atoms in this component (includes all attached atoms)
total_components_fraction (0-1): Fraction of molecule covered by union of all components
diameter: Maximum distance between any two atoms in component
radius: Minimum eccentricity across component nodes
barycenter: Nodes minimizing total distance to all other nodes
center: Nodes with minimum eccentricity
periphery: Nodes with maximum eccentricity
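
The eccentricity-based definitions above can be checked on a small example. The following sketch computes diameter, radius, center, periphery, and barycenter for an unweighted graph using only the standard library. It is a minimal illustration of the definitions on a hypothetical five-carbon chain, not the package's NetworkX-based implementation.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path distances (in bonds) from source to every node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances

def eccentricity_metrics(adjacency):
    """Diameter, radius, center, periphery and barycenter of a connected graph."""
    distances = {n: bfs_distances(adjacency, n) for n in adjacency}
    eccentricity = {n: max(d.values()) for n, d in distances.items()}
    total_distance = {n: sum(d.values()) for n, d in distances.items()}
    diameter = max(eccentricity.values())
    radius = min(eccentricity.values())
    smallest_total = min(total_distance.values())
    return {
        "diameter": diameter,
        "radius": radius,
        "center": sorted(n for n, e in eccentricity.items() if e == radius),
        "periphery": sorted(n for n, e in eccentricity.items() if e == diameter),
        "barycenter": sorted(n for n, t in total_distance.items() if t == smallest_total),
    }

# Hypothetical five-carbon chain 0-1-2-3-4
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
metrics = eccentricity_metrics(chain)
print(metrics)
```

For this chain the middle carbon is both the center and the barycenter, while the two terminal carbons form the periphery.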

See COMPREHENSIVE_METRICS_SUMMARY.md for complete documentation.

Version 1.x: Shift to component‑based analysis with improved SMARTS matching and better handling of branched/cyclic structures.

Version 1.x - Component-Based Analysis

Replaced chain-finding with connected component analysis
Added support for branched and cyclic structures
Improved SMARTS pattern matching for diverse PFAS classes

Version 0.x - Path-Based Analysis

Find SMARTS match connected to either a second SMARTS or a default path-related SMARTS using networkx shortest_path.

Licence

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Contact me in case you want an exception to the No Derivatives term.

Acknowledgments

This project is part of the ZeroPM project (WP2) and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.



Release history

3.3.1 (this version): May 5, 2026
3.2.2: Apr 2, 2026
2.2.3: Apr 2, 2026
