A comprehensive cheminformatics package for automated detection, classification, and analysis of halogenated substances, with a focus on per- and polyfluoroalkyl substances (PFAS). Combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding to identify 119 groups. Supports creation of embeddings with graph metrics for machine learning workflows.
PFASGroups
A comprehensive cheminformatics package for automated detection, classification, and analysis of halogenated substances, in particular per- and polyfluoroalkyl substances (PFAS) in chemical databases.
Overview
PFASGroups combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding (using RDKit and NetworkX) to identify and classify PFAS compounds. The package enables systematic PFAS universe mapping and environmental monitoring applications.
Key Features
Core Capabilities
- Halogen Group Identification: Automated detection of 119 functional groups (114 compiled for fluorine-only embedding):
  - 27 PFAS OECD groups
  - 48 generic functional groups (IDs 29–73, 117–119; the last 3 are halogen-context-dependent or recently added)
  - 43 fluorotelomer-specific groups with CH₂ linker validation
  - 1 aggregate pattern-matching group (Group 116: Telomers, compute=False)
- Atom Reference Requirement: For non-telomer groups, SMARTS patterns must match atoms that are part of or directly connected to the fluorinated component (per/polyfluorinated carbons), respecting the max_dist_from_comp constraint
- Linker Validation: CH₂-specific validation for 40 fluorotelomer groups to distinguish them from direct-attachment analogues. Telomer groups use linker_smarts to allow functional groups separated from perfluoro chains by non-fluorinated linkers
- Aggregate Groups: Pattern-matching groups that collect related PFAS groups via regex (e.g., Group 113 matches all "telomer" groups)
- Component Length Analysis: Quantification of per- and polyfluorinated alkyl components with CF₂ unit counting
- Graph Metrics: Comprehensive structural characterization (branching, eccentricity, diameter, resistance, centrality, telomer spacer length, ring size)
- Customizable Definitions: Easy extension to additional PFAS groups and halogenated chemical classes via JSON configuration. Component type definitions in data/component_smarts_halogens.json support optional per-type constraints (e.g. {"gte": {"F": 2}}) that are evaluated against the component's full atom count (backbone carbons plus attached halogens) to enforce minimum halogen counts or element exclusions, without repeating the check in every group definition
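The per-type constraints check can be pictured as a comparison between a constraint dictionary and the element counts of a component. The snippet below is a minimal sketch, not the package's implementation: the helper name satisfies_constraints is hypothetical, and only the "gte" key from the example above is handled.

```python
from collections import Counter

def satisfies_constraints(atom_symbols, constraints):
    """Return True if the element counts meet every 'gte' minimum.

    atom_symbols: element symbols for the full component
                  (backbone carbons plus attached halogens).
    constraints:  dict such as {"gte": {"F": 2}}; only 'gte' is handled here.
    """
    counts = Counter(atom_symbols)
    for element, minimum in constraints.get("gte", {}).items():
        if counts[element] < minimum:
            return False
    return True

# CF3-CF2 fragment: 2 carbons, 5 fluorines
perfluoroethyl = ["C", "C", "F", "F", "F", "F", "F"]
# A single carbon carrying one fluorine
monofluoro = ["C", "F"]

rule = {"gte": {"F": 2}}
print(satisfies_constraints(perfluoroethyl, rule))  # True
print(satisfies_constraints(monofluoro, rule))      # False
```

Keeping the rule on the component type means each group definition inherits it automatically, which is the motivation given in the bullet above.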
Additional Tools
- Homologue Series Generation: Iterative component shortening to explore theoretical chemical space
- Fingerprint Generation: PFAS fingerprints for machine learning applications
- Visualization: Assign and visualize PFAS groupings
- Multiple Interfaces: Python API, command-line tool, and browser-based JavaScript version (RDKitJS)
- Batch Processing: Efficient analysis of large chemical databases
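To illustrate what "iterative component shortening" means for a homologue series, here is a toy sketch that works purely on the text of a linear SMILES string. It is an assumption-laden illustration: the function shorten_series is hypothetical, it only handles chains written as repeated C(F)(F) tokens, and the package itself is described as operating on molecular graphs rather than strings.

```python
def shorten_series(smiles, unit="C(F)(F)"):
    """Return the SMILES strings obtained by removing one CF2 unit at a time.

    Works on the text of a linear SMILES only: each step collapses the
    first pair of adjacent `unit` tokens into a single one.
    """
    series = [smiles]
    pair = unit + unit
    current = smiles
    while pair in current:
        current = current.replace(pair, unit, 1)
        series.append(current)
    return series

# Perfluorobutanoic acid, shortened down to trifluoroacetic acid
pfba = "OC(=O)C(F)(F)C(F)(F)C(F)(F)F"
for s in shorten_series(pfba):
    print(s)
```

Each printed string is one CF₂ unit shorter than the previous one, which is the sense in which a homologue series explores neighbouring chemical space.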
Installation
From PyPI
pip install PFASGroups
From source (recommended for development)
Prerequisites: Python ≥ 3.7, RDKit (install via conda or pip before the steps below).
# Clone the repository
git clone https://github.com/lucmiaz/PFASGroups.git
cd PFASGroups
# Install in editable mode (development install)
pip install -e .
Note for conda users: install RDKit via conda before running pip:
conda install -c conda-forge rdkit
After installation, both HalogenGroups (all halogens by default) and PFASgroups (fluorine only by default) are importable, and the pfasgroups CLI command is available in your terminal.
Verify installation
from PFASGroups import parse_smiles
results = parse_smiles("FC(F)(F)C(F)(F)C(=O)O") # → PFASEmbeddingSet
print(results) # prints PFASEmbeddingSet summary (molecule count, matched groups, …)
print(results[0]) # prints PFASEmbedding summary for the first molecule
Repository Structure
PFASGroups/
├── HalogenGroups/ # Multi-halogen wrapper package
│ └── __init__.py # Wraps PFASgroups; defaults halogens=['F','Cl','Br','I']
├── PFASGroups/ # Core implementation package (also importable as PFASgroups)
│ ├── __init__.py # Public API
│ ├── core.py # SMARTS matching engine, component detection, decorators
│ ├── parser.py # parse_smiles / parse_mols entry points
│ ├── embeddings.py # presets
│ ├── PFASEmbeddings.py # PFASEmbeddingSet, PFASEmbedding
│ ├── HalogenGroupModel.py # HalogenGroup data model
│ ├── PFASDefinitionModel.py # PFASDefinition model
│ ├── ComponentsSolverModel.py # Graph-based path-finding solver
│ ├── getter.py # get_HalogenGroups, get_componentSMARTSs, …
│ ├── cli.py # Command-line interface (halogengroups / pfasgroups)
│ ├── prioritise.py # prioritise_molecules
│ ├── generate_homologues.py # Homologue series generation
│ ├── fragmentation.py # Fragment utilities
│ ├── draw_mols.py # Molecule and group visualisation helpers
│ └── data/ # JSON configuration files (bundled with the package)
│     ├── Halogen_groups_smarts.json # 113 halogen group definitions
│     ├── component_smarts.json # Fluorinated component SMARTS patterns
│     ├── component_smarts_halogens.json # Multi-halogen component SMARTS patterns
│     └── PFAS_definitions_smarts.json # PFAS regulatory definitions
├── tests/ # Pytest test suite (25+ test modules)
│ ├── test_halogen_groups_smarts.py
│ ├── test_results_fingerprint.py
│ ├── test_results_model.py
│ ├── test_prioritise.py
│ └── …
├── examples/ # Standalone usage example scripts
│ ├── results_fingerprint_analysis.py
│ ├── prioritization_examples.py
│ └── …
├── docs/ # Sphinx documentation source
│ ├── quickstart.rst
│ ├── algorithm.rst
│ ├── customization.rst
│ ├── ResultsFingerprint_Guide.md
│ └── …
├── benchmark/ # Benchmarking scripts and timing reports
│ ├── data/
│ ├── reports/
│ └── …
└── pyproject.toml # Package metadata and build configuration
Two importable packages, one codebase:
| Package | Default halogens | Typical use |
|---|---|---|
| HalogenGroups | ['F', 'Cl', 'Br', 'I'] (all) | Multi-halogen analysis |
| PFASgroups | 'F' (fluorine only) | PFAS / fluorine-focused analysis |
Benchmark Summary (Mar 2026, v. 3.2.2, F groups)
Benchmarks were run on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (4C/8T) with 15.5 GB RAM, using Python 3.9.23, RDKit 2025.09.2, and NetworkX 3.2.1. The older Python version was chosen for compatibility with PFAS-Atlas.
| Dataset/Profile | Count | Atom range | PFASgroups mean/median (ms) | PFAS-Atlas mean/median (ms) | Relative speed | Notes |
|---|---|---|---|---|---|---|
| OECD reference (real compounds) | 3,707 | varied | 26.3 / 20.8 | 32.4 / 37.0 | 1.23x faster | Real-world OECD dataset. |
| CLInventory, F only (real compounds) | 28,328 | varied | 23.2 / 17.5 | 5.1 / 1.0 | 4.6x slower | Large real-world dataset; Atlas faster due to smaller typical molecule size. |
| Large PFAS (≥35 atoms) | 1,242 | ≥35 | 526.0 / 253.4 | 93.8 / 72.5 | 5.6x slower | Large-molecule subset; heavy-tail runtime. |
| Stress-test (full metrics) | 2,500 | 11-622 | 271.8 / 35.9 | 67.0 / 38.7 | 4.1x slower | Synthetic stress-test; heavy-tail runtime. |
| Stress-test (no EGR) | 2,500 | 11-625 | 272.6 / 31.6 | 67.2 / 38.7 | ≈ full | Disables effective graph resistance; negligible speedup. |
| Stress-test (no metrics) | 2,500 | 9-621 | 205.1 / 28.7 | 67.0 / 38.4 | 1.33x faster vs full | Disables all component graph metrics. |
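The "Relative speed" column appears to be the ratio of the two mean runtimes, although the table does not say so explicitly. The sketch below recomputes it under that assumption; the helper relative_speed is hypothetical. Because the inputs are the rounded means from the table, a recomputed factor can differ from the published one in the last digit.

```python
# (PFASgroups mean ms, PFAS-Atlas mean ms), copied from the table above
mean_ms = {
    "OECD reference": (26.3, 32.4),
    "CLInventory, F only": (23.2, 5.1),
    "Large PFAS (>=35 atoms)": (526.0, 93.8),
    "Stress-test (full metrics)": (271.8, 67.0),
}

def relative_speed(pfasgroups_ms, atlas_ms):
    """Return (factor, label), where factor >= 1 and label names the direction."""
    if pfasgroups_ms <= atlas_ms:
        return atlas_ms / pfasgroups_ms, "faster"
    return pfasgroups_ms / atlas_ms, "slower"

for name, (ours, atlas) in mean_ms.items():
    factor, label = relative_speed(ours, atlas)
    print(f"{name}: {factor:.2f}x {label}")
```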
Timing profile plots (full vs no EGR vs no metrics):
Disable or limit graph metrics in the Python API:
from PFASGroups import parse_smiles
smiles_list = ['C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O',
'C1=CC(=CC=C1[N+](=O)[O-])OC2=C(C=C(C=C2)C(F)(F)F)[N+](=O)[O-]',
'C(C(C(C(F)(F)S(=O)(=O)O)(F)F)(F)F)(C(C(F)(F)F)(F)F)(C(F)(F)F)C(F)(F)F'
]
# Skip all component graph metrics (fastest)
rNoMetrics = parse_smiles(smiles_list, compute_component_metrics=False)
print(rNoMetrics)
rNoMetrics.summary()
# Keep metrics but skip effective graph resistance entirely
rNoEGR = parse_smiles(smiles_list, limit_effective_graph_resistance=0)
print(rNoEGR)
# Compute resistance only for components below a size threshold
rLimitEGR = parse_smiles(smiles_list, limit_effective_graph_resistance=20)
print(rLimitEGR)
CLI equivalents:
# Skip all component graph metrics (fastest)
pfasgroups parse --no-component-metrics "C(C(F)(F)F)F"
# Skip effective graph resistance entirely
pfasgroups parse --limit-effective-graph-resistance 0 "C(C(F)(F)F)F"
# Compute resistance only for components below a size threshold
pfasgroups parse --limit-effective-graph-resistance 200 "C(C(F)(F)F)F"
Quick Start
Python API
from PFASGroups import parse_smiles
# Parse PFAS structures — returns a PFASEmbeddingSet
smiles_list = ["C[Si](C)(Cl)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"FC(=C(C(F)(F)F)C(F)(F)F)C(F)(F)C(F)(F)F",
"FC(F)(F)C(F)(C(F)(F)F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)(F)F)C(F)(F)F",
"C[Si](Cl)(Cl)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"FC(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"CC(=O)C(F)(F)C(F)(F)C(F)(F)F",
"FC(F)=C(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"FC(F)(F)N1C(F)(F)C(F)(F)C(F)(F)C(F)(F)C1(F)F",
"FC(F)(F)C(F)(F)C(F)(F)CCl",
"OCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"OS(=O)(=O)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"FC(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)COC(=O)C=C",
"FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)OC=C",
"FC(F)(F)C(I)(C(F)(F)F)C(F)(F)F",
"OC(C(F)(F)C(F)(F)F)(C(F)(F)C(F)(F)F)C(F)(F)C(F)(F)F",
"C[Si](Cl)(Cl)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"FC(F)=C(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"CCN(CC)CC.NS(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"[OH-].C[N+](C)(C)CCCNS(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"[K+].[O-]S(=O)(=O)C(F)(F)C(F)(C(F)(F)F)C(F)(F)F",
"[NH4+].[O-]P(=O)(OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F)OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"OCCN(CCO)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)F",
"OP(O)(=O)OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)(F)F)C(F)(F)F",
"FC(F)(F)C(F)(F)C(F)(F)C(F)(F)N(C(F)(F)C(F)(F)C(F)(F)C(F)(F)F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"COC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)OCC=C",
"CC(C)CCOC(=O)CC(NC(=O)C(F)(F)C(F)(F)C(F)(F)F)C(=O)OCCC(C)C",
"CN(C(=O)C(F)(F)C(F)(F)C(F)(F)F)C(=O)C(F)(F)C(F)(F)C(F)(F)F",
"FC(F)(F)C(F)(F)C(F)(F)C=CI",
"OC(=O)C1=CC=CC=C1NC(=O)C(F)(F)C(F)(F)C(F)(F)F",
"FC(F)(F)C1(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)(F)F)C1(F)F"]
results = parse_smiles(smiles_list) # → PFASEmbeddingSet
print(results) # summary: molecule count, top groups, …
print(results[0]) # per-molecule summary for first entry
# Embedding from a pre-parsed set (avoids re-parsing)
arr = results.to_array() # default: binary, all groups
arr = results.to_array(group_selection='oecd', component_metrics=['binary'])
cols = results.column_names() # matching column labels
# Filter components by halogen, form, and saturation
results_f = parse_smiles(smiles_list, halogens='F') # Fluorine only
results_pfa = parse_smiles(smiles_list, halogens='F', saturation='per', form='alkyl') # Perfluoroalkyl only
results_cyclic = parse_smiles(smiles_list, form='cyclic') # Cyclic forms only
# Dimensionality reduction (methods on PFASEmbeddingSet)
pca_result = results.perform_pca(n_components=5, plot=True)
tsne_result = results.perform_tsne(perplexity=30, plot=True)
umap_result = results.perform_umap(n_neighbors=15, plot=True)
# Colour each point by the PFAS group with the highest match count
pca_result = results.perform_pca(n_components=5, color_by='top_group')
tsne_result = results.perform_tsne(perplexity=30, color_by='top_group')
umap_result = results.perform_umap(n_neighbors=15, color_by='top_group')
# Or supply your own per-molecule labels (e.g. from external metadata)
labels = ['industrial'] * 10 + ['environmental'] * (len(smiles_list) - 10)
pca_result = results.perform_pca(color_by=labels)
# Compare two datasets using KL divergence
other_smiles_list = smiles_list[0:10] + \
["OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(O)=O",
"OCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
"FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)COCC=C"]
other_results = parse_smiles(other_smiles_list)
similarity = results.compare_kld(other_results, method='minmax')
print(f"Similarity between the lists of compounds: {similarity:2.3f}")
# Save to SQL database
results.to_sql(filename='results.db')
# Prioritization tool for screening and ranking
from PFASGroups import prioritise_molecules
# Prioritize by similarity to reference list (e.g., known persistent PFAS)
reference = ["FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(=O)O"]
prioritised, scores = prioritise_molecules(
molecules=smiles_list,
reference=reference,
return_scores=True
)
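# Join outside the f-string: backslashes inside f-string expressions require Python ≥ 3.12
order = '\n'.join(x['smiles'] for x in prioritised)
print(f"Prioritisation order: {order}")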
# Prioritize by intrinsic fluorination properties (long chains, high F content)
prioritised = prioritise_molecules(
molecules=smiles_list,
a=1.0, # Weight for total fluorination
b=5.0, # Weight for longest chains
percentile=90, # Focus on 90th percentile of component sizes
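return_scores=False
)
# Join outside the f-string: backslashes inside f-string expressions require Python ≥ 3.12
order = '\n'.join(x['smiles'] for x in prioritised)
print(f"Prioritisation order: {order}")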
Command Line
# Parse SMILES strings
pfasgroups parse "C(C(F)(F)F)F" "FC(F)(F)C(F)(F)C(=O)O"
# Generate fingerprints
pfasgroups fingerprint "C(C(F)(F)F)F" --format dict
# List available PFAS groups
pfasgroups list-groups
Filtering Components by Halogen, Form, and Saturation
Filter component matches by specific halogens, chemical forms, or saturation levels:
from HalogenGroups import parse_smiles
smiles_list = ["C(C(F)(F)F)F", "FC(F)(F)C(F)(F)C(=O)O", "C(C(Cl)(Cl)Cl)Cl"]
# Filter only fluorine components
results_f = parse_smiles(smiles_list, halogens='F')
# Filter perfluorinated alkyl compounds
results_pfa = parse_smiles(smiles_list, halogens='F', saturation='per', form='alkyl')
# Filter polyfluorinated cyclic compounds
results_polyf_cyclic = parse_smiles(smiles_list, halogens='F', saturation='poly', form='cyclic')
# Filter multiple halogens (F and Cl)
results_multi = parse_smiles(smiles_list, halogens=['F', 'Cl'])
# Valid filter options:
# - halogens: 'F', 'Cl', 'Br', 'I', or list like ['F', 'Cl']
# - saturation: 'per' or 'poly' (or list like ['per', 'poly'] for both)
# - form: 'alkyl' or 'cyclic' (or list like ['alkyl', 'cyclic'] for both)
Embedding with Graph Metrics
The to_array() / to_fingerprint() methods accept a component_metrics list that
stacks one block of N_G columns per metric (default N_G = 114 for fluorine-only).
import numpy as np
from PFASGroups import parse_smiles, EMBEDDING_PRESETS
smiles = [
    "O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F", # PFOA
    "O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F", # PFHpA
    "OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)F", # 4:2 FTOH (two CH₂ units before the fluorinated chain)
    "OCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F", # 6:2 FTOH
]
results = parse_smiles(smiles)
# --- binary embedding (default) ---
arr_bin = results.to_array() # shape (4, 114)
# --- preset 'best': binary + effective_graph_resistance ---
# Best discrimination (mean Tanimoto 0.184, outperforms TxP-PFAS 129-bit)
print(EMBEDDING_PRESETS['best']['description'])
arr_best = results.to_array(preset='best') # shape (4, 228) = 114 × 2
print(arr_best.shape)
# --- effective graph resistance directly ---
arr_egr = results.to_array(
component_metrics=['binary', 'effective_graph_resistance']
) # shape (4, 228)
print(arr_egr.shape)
# --- n_spacer: telomer CH₂ spacer length (the second number in telomer notation, e.g. 2 in 6:2 FTOH) ---
# Zero for non-telomers; distinguishes 2:1, 4:2, 6:2 FTOHs in a single column
arr_ns = results.to_array(component_metrics=['n_spacer']) # shape (4, 114)
print(arr_ns.shape)
# --- ring_size: smallest ring overlapping each matched component ---
# Zero for acyclic groups; 5 for azoles/furans; 6 for benzene/cyclohexane
arr_rs = results.to_array(component_metrics=['ring_size']) # shape (4, 114)
print(arr_rs.shape)
# --- combined: EGR + n_spacer + ring_size + molecule-wide descriptors ---
arr_combined = results.to_array(
component_metrics=['binary', 'effective_graph_resistance',
'n_spacer', 'ring_size'],
molecule_metrics=['n_components', 'max_size',
'mean_branching', 'max_branching',
'mean_component_fraction', 'max_component_fraction'],
) # shape (4, 4*114 + 6) = (4, 462)
print(arr_combined.shape)
# --- multi-halogen: parse once with all four halogens, then pass the same list to to_array() ---
halogens = ['F', 'Cl', 'Br', 'I']
r_hal = parse_smiles(smiles, halogens=halogens)
arr_4x = r_hal.to_array(
    component_metrics=['effective_graph_resistance'],
    molecule_metrics=['n_components', 'max_size', 'mean_branching'],
    halogens=halogens,
)
print(arr_4x.shape)
# --- column names ---
cols = results.column_names(
component_metrics=['binary', 'effective_graph_resistance']
)
print(cols[:4]) # e.g. ['Perfluoroalkyl [binary]', ..., 'Perfluoroalkyl [EGR]', ...]
print(arr_4x[:4])
See examples/embedding_with_graph_metrics.py for a complete runnable script covering all options.
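The array shapes quoted in the comments above follow from simple arithmetic: one block of N_G columns per component metric, plus one column per molecule-wide metric. The helper below is a hypothetical illustration of that bookkeeping for the single-halogen case only; it does not call the package and makes no claim about how multi-halogen arrays are laid out.

```python
def embedding_width(n_groups, component_metrics, molecule_metrics=()):
    """Columns = one block of n_groups per component metric, plus one per molecule metric."""
    return n_groups * len(component_metrics) + len(molecule_metrics)

N_G = 114  # fluorine-only group count quoted above

print(embedding_width(N_G, ["binary"]))  # 114
print(embedding_width(N_G, ["binary", "effective_graph_resistance"]))  # 228
print(embedding_width(
    N_G,
    ["binary", "effective_graph_resistance", "n_spacer", "ring_size"],
    ["n_components", "max_size", "mean_branching", "max_branching",
     "mean_component_fraction", "max_component_fraction"],
))  # 462
```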
Multi-Halogen Analysis
PFASGroups supports fluorine, chlorine, bromine, and iodine. There are two ways to analyse all halogens at once:
Option A – import HalogenGroups (all halogens by default)
from HalogenGroups import parse_smiles
smiles_list = [
"C(C(F)(F)F)F", # fluorinated
"ClC(Cl)(Cl)C(Cl)(Cl)Cl", # chlorinated
"BrC(Br)(Br)CBr", # brominated
]
# halogens defaults to ['F','Cl','Br','I'] — no extra argument needed
results = parse_smiles(smiles_list) # → PFASEmbeddingSet
# to_array() reads the halogen info already captured during parsing
arr = results.to_array(group_selection='oecd', component_metrics=['binary'], halogens = ['F','Cl','Br','I'])
cols = results.column_names(group_selection='oecd')
print(arr.shape)
Option B – import PFASgroups and specify halogens explicitly
from PFASgroups import parse_smiles
smiles_list = [
"C(C(F)(F)F)F",
"ClC(Cl)(Cl)C(Cl)(Cl)Cl",
"BrC(Br)(Br)CBr",
]
# Explicitly pass all halogens
results = parse_smiles(smiles_list, halogens=['F', 'Cl', 'Br', 'I']) # → PFASEmbeddingSet
arr = results.to_array(group_selection='oecd', component_metrics=['binary'], halogens=['F', 'Cl', 'Br', 'I'])
cols = results.column_names(group_selection='oecd')
Custom Configuration
Use custom pathtype definitions and PFAS groups:
# Load custom files entirely
from PFASGroups import get_componentSMARTSs, get_HalogenGroups, parse_smiles
from PFASGroups.core import preprocess_componentsSmarts
from PFASGroups.parser import preprocess_HalogenGroups
# You can load component smarts
custom_paths = get_componentSMARTSs()
# Add or modify componentSMARTSs as you want
# ....
# then preprocess them
custom_paths = preprocess_componentsSmarts(custom_paths)
# You can load HalogenGroups
custom_groups = get_HalogenGroups()
# Add or modify HalogenGroups as you want
# ....
# then preprocess them
pfas_groups, agg_groups = preprocess_HalogenGroups(custom_groups)
results = parse_smiles(
["C(C(F)(F)F)F"],
componentSmartss=custom_paths,
pfas_groups=pfas_groups,
agg_pfas_groups = agg_groups
)
# Or extend defaults with your custom groups
from PFASGroups import get_compiled_HalogenGroups, HalogenGroup, parse_smiles, get_componentSMARTSs
from PFASGroups.core import preprocess_componentsSmarts
# Add custom PFAS groups
groups = get_compiled_HalogenGroups() # Get defaults, precompiled
# Create a new group
custom_group = HalogenGroup(
id=999,
name="My Custom Group",
smarts={"[C](F)(F)F": 1},
componentSmarts=None,
componentSaturation="per",
componentHalogens="F",
componentForm="alkyl",
constraints={"gte": {"F": 1}}
)
# then add it to the list of groups
groups.append(custom_group)
# pass the new set of groups as argument to parse_smiles
results = parse_smiles(["FC(F)(F)C(F)(F)[N+](=O)[O-]"], pfas_groups=groups)
# Alternatively, select and modify some existing default_groups
from PFASGroups import get_HalogenGroups, HalogenGroup
# get and subset the default groups
default_groups = get_HalogenGroups()[2:4]
# extract one group
modified_group = default_groups[1]
# modify its parameters, here simply require presence of at least two structures
modified_group['smarts']={x:2 for x, i in modified_group['smarts'].items()}
# replace the modified group in the groups list
default_groups[1] = modified_group
# compile the groups
compiled_groups = [HalogenGroup(**x) for x in default_groups]
# Possibly add another group with max_dist_from_comp = 3
compiled_groups.append(HalogenGroup(
id=998,
name="Extended Distance Group",
smarts={"[#6$([#6][OH1])]": 1},
componentSmarts=None,
constraints={},
max_dist_from_comp=3 # Allow up to 3 bonds from fluorinated carbon
))
# parse the groups:
results = parse_smiles(["FC(F)(F)C(F)(F)[N+](=O)[O-]"], pfas_groups=compiled_groups)
# Add or modify custom path types (e.g., chlorinated analogs)
paths = get_componentSMARTSs()
paths['Cl']['alkyl'] = {"per":{
"component":"[C;X4](Cl)(Cl)!@!=!#[C;X4](Cl)(Cl)",
"name":"Custom Chloro"}}
paths = preprocess_componentsSmarts(paths)
results = parse_smiles(["ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O"], halogens = 'Cl', componentSmartss=paths)
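The max_dist_from_comp parameter used above is described as a limit on the number of bonds between a matched atom and a fluorinated carbon. One way to picture that check is a breadth-first search over the molecular graph. This is a sketch under that assumption, with a hypothetical helper name and a hand-written adjacency list; the package's own distance logic may differ.

```python
from collections import deque

def bonds_to_component(adjacency, start, component_atoms):
    """Breadth-first search: fewest bonds from `start` to any atom in `component_atoms`.

    Returns None if no component atom is reachable.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom in component_atoms:
            return dist
        for neighbour in adjacency[atom]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Heavy-atom skeleton of HO-CH2-CH2-CF2-CF3, atoms indexed 0..4
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
fluorinated = {3, 4}  # the two fluorinated carbons

# Atom 1 is the carbon bearing the hydroxyl group
dist = bonds_to_component(adjacency, start=1, component_atoms=fluorinated)
print(dist)       # 2
print(dist <= 3)  # True: within a maximum distance of 3 bonds
print(dist <= 1)  # False: too far for a maximum distance of 1
```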
Documentation
- USER_GUIDE.md - Complete documentation with examples
- QUICK_REFERENCE.md - Quick reference for common tasks
Usage Examples
See USER_GUIDE.md for comprehensive examples including:
- Basic PFAS parsing and analysis
- Fingerprint generation for machine learning
- Custom configuration files
- Batch processing
- Integration with pandas and scikit-learn
Summary of changes by version
- Version 3.2.2: Fixed polyhalogenated alkyl components matching less than 2 halogens. Added option to pass 'halogens' to formula constraints. Added options to include component-wide formula constraints (on dist-1 neighbours from matched C-only components).
- Version 3.2.1: Added n-spacer metric for telomers and ring size for aryl and cyclic groups. These can be used in the embeddings.
- Version 3.2.0: Use of BDE for computing graph resistance.
- Version 3.1.4: Changed fingerprint parameters to molecule wide and component wide, with pre-configuration for best combinations.
- Version 3.1.0: Added support for other halogens, changed names to be more generic (with some support for backward compatibility). Added component smarts for other halogens, cyclic and alkyl components.
- Version 2.2.4 (Feb 2026): Advanced fingerprint analysis with dimensionality reduction (PCA, kernel-PCA, t-SNE, UMAP), KL divergence comparison for dataset similarity assessment, and SQL persistence for results. Added molecule prioritization tool for screening applications, ranking by similarity to reference lists or by intrinsic fluorination properties. Introduced PFASEmbeddingSet and PFASEmbedding with comprehensive analysis methods, automated plot generation, and extensive documentation.
- Version 2.2.3 (Feb 2026): Added PFASEmbeddingSet container to offer easier plotting and summarising capabilities for results.
- Version 2.2 (Feb 2026): Added linked_smarts option to specify a restriction on path between smarts groups and fluorinated component. Added new PFASgroups (telomers). v2.2.2 Fixed telomers and added examples and counter-examples to each PFASgroup. Removed boundary O in fluorinated components (for both Per and Polyalkyl components).
- Version 2.1 (Jan 2026): Added support for multiple smarts, with individual minimum count, per PFASgroup.
- Version 2.0 (Jan 2026): Major expansion of graph‑based component metrics, new coverage statistics, schema updates, and richer per‑component outputs.
Version 2.2.4 (February 2026) - Advanced Fingerprint Analysis
Major enhancement adding comprehensive dimensionality reduction and statistical comparison capabilities:
New Features:
- PFASEmbeddingSet / PFASEmbedding: Unified result container with flexible embedding generation
- Dimensionality Reduction: PCA, kernel-PCA, t-SNE, and UMAP with automatic plotting
- Statistical Comparison: KL divergence for comparing dataset compositions
- Database Persistence: SQL save/load for results
- Molecule Prioritization: Screening and ranking tool for PFAS datasets
- Comprehensive Documentation: Complete API reference, examples, and 100+ tests
Key Methods:
- PFASEmbeddingSet.to_array(): Generate numeric embedding matrix from parsed results
- PFASEmbeddingSet.column_names(): Column labels matching to_array() output
- PFASEmbeddingSet.perform_pca(): Principal Component Analysis
- PFASEmbeddingSet.perform_kernel_pca(): Non-linear kernel PCA
- PFASEmbeddingSet.perform_tsne(): t-SNE visualization
- PFASEmbeddingSet.perform_umap(): Fast UMAP dimensionality reduction
- PFASEmbeddingSet.compare_kld(): Dataset similarity via KL divergence
- prioritise_molecules(): Rank molecules by similarity or fluorination properties
- PFASEmbeddingSet.to_sql(): Persist results to a SQLite/PostgreSQL database
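For readers unfamiliar with KL divergence as a dataset comparison, the sketch below computes it between two vectors of group match counts. It is a generic illustration with additive smoothing, not a reproduction of compare_kld: the function name, the pseudocount, and the example counts are all assumptions, and the 'minmax' method used elsewhere in this README is not modelled here.

```python
import math

def kl_divergence(p_counts, q_counts, pseudocount=1.0):
    """KL(P || Q) in nats between two count vectors of equal length.

    A pseudocount is added to every bin so that empty bins do not
    produce log(0) or division by zero.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("count vectors must have the same length")
    p = [c + pseudocount for c in p_counts]
    q = [c + pseudocount for c in q_counts]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )

# Hypothetical match counts for three groups in two datasets
dataset_a = [40, 10, 0]
dataset_b = [10, 30, 10]

print(kl_divergence(dataset_a, dataset_a))      # 0.0
print(kl_divergence(dataset_a, dataset_b) > 0)  # True
```

Identical group distributions give a divergence of zero, and the value grows as the two datasets are dominated by different groups. Note that KL divergence is asymmetric, so swapping the arguments generally changes the result.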
Prioritization Strategies:
- Reference-based: Rank by distributional similarity to known PFAS (e.g., persistent chemicals)
- Intrinsic properties: Score by fluorination characteristics (total F content, chain length)
- Flexible weighting: Tune parameters for different screening objectives
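The intrinsic-property strategy can be read as a weighted sum of total fluorination and a high percentile of component sizes. The scoring formula below is purely hypothetical: it borrows the parameter names a, b, and percentile from the prioritise_molecules call shown earlier, but how the package actually combines them is not documented here, so treat this as one plausible reading rather than the real scoring rule.

```python
import math

def nearest_rank_percentile(values, percentile):
    """Nearest-rank percentile of a non-empty list (percentile in (0, 100])."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]

def intrinsic_score(component_sizes, a=1.0, b=5.0, percentile=90):
    """Hypothetical score: a * total size + b * high-percentile size."""
    if not component_sizes:
        return 0.0
    total = sum(component_sizes)
    long_chain = nearest_rank_percentile(component_sizes, percentile)
    return a * total + b * long_chain

# Hypothetical fluorinated component sizes (carbon counts) per molecule
molecules = {
    "one long chain": [8],
    "two short chains": [4, 4],
    "many tiny groups": [1, 1, 1, 1, 1, 1, 1, 1],
}
ranked = sorted(molecules, key=lambda m: intrinsic_score(molecules[m]), reverse=True)
print(ranked)
```

All three toy molecules have the same total size, so the ranking is driven by the b term: raising b favours long chains, while raising a favours overall fluorination.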
Use Cases:
- Exploratory data analysis of PFAS inventories
- Database comparison and compositional analysis
- Cluster identification and pattern recognition
- Machine learning preprocessing
- Chemical space visualization
- Priority screening for environmental monitoring
- Regulatory watchlist generation
See docs/ResultsFingerprint_Guide.md, docs/prioritization.rst, examples/results_fingerprint_analysis.py, and examples/prioritization_examples.py for details.
Version 2.2.3 (February 2026) - PFASEmbeddingSet Container
New Features:
- PFASEmbeddingSet container with visualization helpers
- Enhanced component plotting utilities
- Improved documentation and examples
Version 2.0 (January 2026) - Comprehensive Graph Metrics
Major enhancement adding comprehensive NetworkX graph theory metrics for detailed component analysis:
New Features:
- Component-Level Metrics: Each fluorinated component now includes 15+ graph metrics:
  - diameter and radius - Graph eccentricity bounds
  - center, periphery, barycenter - Structural node sets
  - effective_graph_resistance - Sum of resistance distances
  - component_fraction - Fraction of molecule covered by component (includes all attached H, F, Cl, Br, I)
  - Distance metrics from functional groups to structural features
- Molecular Coverage Metrics: New fraction-based metrics quantify fluorination extent:
  - mean_component_fraction - Average coverage per component
  - total_components_fraction - Total coverage by union of all components (accounts for overlaps)
- Summary Statistics: Aggregated metrics across all components per PFAS group
- Enhanced Database Models: New Components model stores individual component data with all metrics
- Improved Analysis: Better understanding of molecular topology, branching, functional group positioning, and fluorination extent
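The difference between the two coverage metrics is easiest to see when components overlap. The sketch below treats each component as a set of atom indices; the function name and the example molecule are hypothetical, and the definitions follow the wording of the list above rather than the package source.

```python
def coverage_metrics(components, n_atoms):
    """Return (mean_component_fraction, total_components_fraction).

    components: list of sets of atom indices, one set per component.
    n_atoms:    total number of atoms in the molecule.
    """
    if not components:
        return 0.0, 0.0
    fractions = [len(c) / n_atoms for c in components]
    mean_fraction = sum(fractions) / len(fractions)
    union = set().union(*components)  # overlapping atoms are counted once
    total_fraction = len(union) / n_atoms
    return mean_fraction, total_fraction

# Hypothetical 10-atom molecule with two components sharing atoms 3 and 4
components = [{0, 1, 2, 3, 4}, {3, 4, 5, 6}]
mean_frac, total_frac = coverage_metrics(components, n_atoms=10)
print(mean_frac)   # 0.45
print(total_frac)  # 0.7
```

Summing the individual fractions would give 0.9 here, whereas the union covers only 7 of 10 atoms, which is why the total is computed on the union.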
Breaking Changes:
- parse_mols output now includes additional summary metric fields (mean_diameter, mean_radius, etc.)
- Database schema changes require migration (see DATABASE_MIGRATION_GUIDE.md)
Metrics Explained:
- branching (0-1): Measures linearity (1.0 = linear, 0.0 = highly branched) - renamed from "eccentricity"
- mean_eccentricity, median_eccentricity: Graph-theoretic eccentricity statistics for component nodes
- smarts_centrality (0-1): Functional group position (1.0 = central, 0.0 = peripheral)
- n_spacer (int ≥ 0): Fluorotelomer CH₂ spacer length, i.e. the second number in telomer notation (2 in 6:2 FTOH); 0 for all non-telomeric groups
- ring_size (int ≥ 0): Smallest ring overlapping the matched component; 0 for acyclic chains, 5 for azoles/furans, 6 for benzene/cyclohexane derivatives
- component_fraction (0-1): Fraction of total molecule atoms in this component (includes all attached atoms)
- total_components_fraction (0-1): Fraction of molecule covered by union of all components
- diameter: Maximum distance between any two atoms in component
- radius: Minimum eccentricity across component nodes
- barycenter: Nodes minimizing total distance to all other nodes
- center: Nodes with minimum eccentricity
- periphery: Nodes with maximum eccentricity
See COMPREHENSIVE_METRICS_SUMMARY.md for complete documentation.
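The eccentricity-based definitions above (diameter, radius, center, periphery) can be checked by hand on a small graph. The sketch below uses a plain breadth-first search instead of NetworkX so that it runs without dependencies; the example skeleton is hypothetical and the graph is assumed to be connected and unweighted.

```python
from collections import deque

def eccentricities(adjacency):
    """Eccentricity of each node: its largest shortest-path distance to any other node."""
    result = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        result[source] = max(dist.values())
    return result

# Carbon skeleton of a branched component: chain 0-1-2-3 with a branch 1-4
adjacency = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}

ecc = eccentricities(adjacency)
diameter = max(ecc.values())   # largest eccentricity
radius = min(ecc.values())     # smallest eccentricity
center = sorted(n for n, e in ecc.items() if e == radius)
periphery = sorted(n for n, e in ecc.items() if e == diameter)

print(diameter, radius)  # 3 2
print(center)            # [1, 2]
print(periphery)         # [0, 3, 4]
```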
- Version 1.x: Shift to component‑based analysis with improved SMARTS matching and better handling of branched/cyclic structures.
Version 1.x - Component-Based Analysis
- Replaced chain-finding with connected component analysis
- Added support for branched and cyclic structures
- Improved SMARTS pattern matching for diverse PFAS classes
Version 0.x - Path-Based Analysis
- Find SMARTS match connected to either a second SMARTS or a default path-related SMARTS using networkx shortest_path.
Licence
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Contact me in case you want an exception to the No Derivatives term.
Acknowledgments
This project is part of the ZeroPM project (WP2) and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.