
A comprehensive cheminformatics package for automated detection, classification, and analysis of per- and polyfluoroalkyl substances (PFAS). Combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding to identify 55 PFAS group classifications (28 OECD-defined groups and 27 generic categories) with fluorinated chain length determination.


PFASgroups

A comprehensive cheminformatics package for automated detection, classification, and analysis of per- and polyfluoroalkyl substances (PFAS) in chemical databases.

Overview

PFASgroups combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding (using RDKit and NetworkX) to identify and classify PFAS compounds. The package enables systematic PFAS universe mapping and environmental monitoring applications.

Key Features

Core Capabilities

  • PFAS Group Identification: Automated detection of 113 functional groups:
    • 72 non-telomer groups (OECD-defined and generic categories)
    • 40 fluorotelomer groups with linker validation (Groups 69-112)
    • 1 aggregate pattern-matching group (Group 113: Telomers)
  • Atom Reference Requirement: For non-telomer groups, SMARTS patterns must match atoms that are part of or directly connected to the fluorinated component (per/polyfluorinated carbons), respecting the max_dist_from_CF constraint
  • Linker Validation: CH₂-specific validation for 40 fluorotelomer groups to distinguish from direct-attachment analogues. Telomer groups use linker_smarts to allow functional groups separated from perfluoro chains by non-fluorinated linkers
  • Aggregate Groups: Pattern-matching groups that collect related PFAS groups via regex (e.g., Group 113 matches all "telomer" groups)
  • Component Length Analysis: Quantification of per- and polyfluorinated alkyl components with CF₂ unit counting
  • Graph Metrics: Comprehensive structural characterization (branching, eccentricity, diameter, resistance, centrality)
  • Customizable Definitions: Easy extension to additional PFAS groups and halogenated chemical classes via JSON configuration
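The aggregate-group idea above (collecting related groups by a regex on their names) can be illustrated with a minimal standard-library sketch. The group ids and names below are placeholders invented for illustration, and `aggregate_members` is a hypothetical helper, not part of the PFASgroups API.

```python
import re

# Hypothetical group table: ids and names are placeholders,
# not the package's real group definitions.
GROUPS = {
    1: "Perfluoroalkyl carboxylic acids",
    2: "Perfluoroalkyl sulfonic acids",
    69: "Fluorotelomer alcohols",
    70: "Fluorotelomer sulfonic acids",
}


def aggregate_members(groups, pattern):
    """Return the sorted ids of groups whose name matches the regex (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(gid for gid, name in groups.items() if regex.search(name))


# An aggregate "Telomers" group would collect every group with "telomer" in its name.
telomer_ids = aggregate_members(GROUPS, r"telomer")
print(telomer_ids)
```

Matching on names keeps the aggregate definition short, at the cost of depending on consistent naming across group definitions.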

Additional Tools

  • Homologue Series Generation: Iterative component shortening to explore theoretical chemical space
  • Fingerprint Generation: PFAS fingerprints for machine learning applications
  • Visualization: Assign and visualize PFAS groupings
  • Multiple Interfaces: Python API, command-line tool, and browser-based JavaScript version (RDKitJS)
  • Batch Processing: Efficient analysis of large chemical databases
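To make "iterative component shortening" concrete, here is a toy sketch that builds a homologue series of linear perfluorocarboxylic acids by dropping one CF₂ unit at a time. It works on SMILES strings only; the package itself presumably operates on molecular graphs, and both function names are hypothetical.

```python
def pfca_smiles(n_cf2):
    """SMILES of a linear perfluorocarboxylic acid: CF3, then n_cf2 CF2 units, then COOH."""
    return "FC(F)(F)" + "C(F)(F)" * n_cf2 + "C(=O)O"


def homologue_series(n_cf2):
    """Shorten the chain one CF2 unit at a time, from n_cf2 units down to zero."""
    return [pfca_smiles(n) for n in range(n_cf2, -1, -1)]


series = homologue_series(2)
for smiles in series:
    print(smiles)
```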

Installation

Clone the repository, then install the package in editable mode from the repository root:

pip install -e .

After installation, the pfasgroups command will be available in your terminal.

Benchmark Summary (Feb 2026)

Benchmarks were run on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (4C/8T) with 15.5 GB RAM, using Python 3.9.23, RDKit 2025.09.2, and NetworkX 3.2.1. The older Python version was chosen for compatibility with PFAS-Atlas.

| Dataset/Profile | Count | Atom range | PFASgroups mean / median (ms) | PFAS-Atlas mean / median (ms) | Relative speed | Notes |
|---|---|---|---|---|---|---|
| OECD reference (real compounds) | 3,414 | Typical small/medium | 19.2 / 14.8 | 38.8 / 37.7 | 2.02x faster | Real-world dataset representing existing compounds. |
| Timing stress-test (full metrics) | 2,500 | 11-625 | 251.8 / 24.4 | 58.5 / 34.2 | 0.23x | Synthetic stress-test with large molecules; heavy-tail runtime. |
| Timing stress-test (no resistance) | 2,500 | 11-619 | 176.7 / 24.8 | N/A | 1.43x faster vs full | Disables effective graph resistance only. |
| Timing stress-test (no metrics) | 2,500 | 11-619 | 97.7 / 19.7 | N/A | 2.58x faster vs full | Disables all component graph metrics. |
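The "Relative speed" column appears to be a ratio of mean runtimes. The short check below reproduces the four reported figures from the means in the table; the variable and key names are chosen here for illustration.

```python
# Mean runtimes in milliseconds, copied from the benchmark table.
PFASGROUPS_OECD, ATLAS_OECD = 19.2, 38.8
FULL_METRICS, ATLAS_STRESS = 251.8, 58.5
NO_RESISTANCE, NO_METRICS = 176.7, 97.7


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline (ratio of means)."""
    return round(baseline_ms / candidate_ms, 2)


ratios = {
    "oecd_vs_atlas": speedup(ATLAS_OECD, PFASGROUPS_OECD),
    "stress_vs_atlas": speedup(ATLAS_STRESS, FULL_METRICS),
    "no_resistance_vs_full": speedup(FULL_METRICS, NO_RESISTANCE),
    "no_metrics_vs_full": speedup(FULL_METRICS, NO_METRICS),
}
print(ratios)
```

A ratio below 1 (the 0.23x stress-test row) means PFASgroups was slower on average for that profile. The medians are much closer (24.4 ms vs 34.2 ms), consistent with the heavy-tail note in the table.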

[Figure: timing profile plots (full vs no resistance vs no metrics)]

Disable or limit graph metrics in the Python API:

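from PFASgroups import parse_smiles

# Example input: the same two SMILES used in the Quick Start
smiles_list = ["C(C(F)(F)F)F", "FC(F)(F)C(F)(F)C(=O)O"]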

# Skip all component graph metrics (fastest)
parse_smiles(smiles_list, compute_component_metrics=False)

# Keep metrics but skip effective graph resistance entirely
parse_smiles(smiles_list, limit_effective_graph_resistance=0)

# Compute resistance only for components below a size threshold
parse_smiles(smiles_list, limit_effective_graph_resistance=200)

CLI equivalents:

# Skip all component graph metrics (fastest)
pfasgroups parse --no-component-metrics "C(C(F)(F)F)F"

# Skip effective graph resistance entirely
pfasgroups parse --limit-effective-graph-resistance 0 "C(C(F)(F)F)F"

# Compute resistance only for components below a size threshold
pfasgroups parse --limit-effective-graph-resistance 200 "C(C(F)(F)F)F"

Quick Start

Python API

from PFASgroups import parse_smiles, generate_fingerprint

# Parse PFAS structures
smiles_list = ["C(C(F)(F)F)F", "FC(F)(F)C(F)(F)C(=O)O"]
results = parse_smiles(smiles_list)

# Generate fingerprints
fingerprints, group_info = generate_fingerprint(smiles_list)

Command Line

# Parse SMILES strings
pfasgroups parse "C(C(F)(F)F)F" "FC(F)(F)C(F)(F)C(=O)O"

# Generate fingerprints
pfasgroups fingerprint "C(C(F)(F)F)F" --format dict

# List available PFAS groups
pfasgroups list-groups

Custom Configuration

Use custom pathtype definitions and PFAS groups:

# Load custom files entirely
from PFASgroups import get_componentSmartss, get_PFASGroups, parse_smiles

custom_paths = get_componentSmartss(filename='my_component_smartss.json')
custom_groups = get_PFASGroups(filename='my_groups.json')

results = parse_smiles(
    ["C(C(F)(F)F)F"],
    componentSmartss=custom_paths,
    pfas_groups=custom_groups
)

# Or extend defaults with your custom groups
from PFASgroups import get_PFASGroups, PFASGroup, parse_smiles, compile_componentSmarts, get_componentSmartss

# Add custom PFAS groups
groups = get_PFASGroups()  # Get defaults
groups.append(PFASGroup(
    id=999,
    name="My Custom Group",
    smarts1="[C](F)(F)F",
    smarts2="[N+](=O)[O-]",
    componentSmarts="Perfluoroalkyl",
    constraints={"nF": [3, None]}
))

results = parse_smiles(["FC(F)(F)C(F)(F)[N+](=O)[O-]"], pfas_groups=groups)

# Custom max_dist_from_CF parameter
# For functional groups without formula constraints, when bycomponent=True,
# the max_dist_from_CF parameter limits the maximum bond distance between
# a functional group match and a fluorinated carbon terminal atom (default: 0)
groups.append(PFASGroup(
    id=998,
    name="Extended Distance Group",
    smarts1="[#6$([#6][OH1])]",
    smarts2=None,
    componentSmarts=None,
    constraints={},
    max_dist_from_CF=3  # Allow up to 3 bonds from fluorinated carbon
))

# Add custom path types (e.g., chlorinated analogs)
paths = get_componentSmartss()
paths['Perchlorinated'] = compile_componentSmarts(
    "[C;X4](Cl)(Cl)!@!=!#[C;X4](Cl)(Cl)",  # component pattern
    "[C;X4](Cl)(Cl)Cl"                     # end pattern
)

results = parse_smiles(["ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O"], componentSmartss=paths)

# Via command line
pfasgroups parse --groups-file my_custom_groups.json "C(C(F)(F)F)F"

# List available groups and paths
pfasgroups list-groups
pfasgroups list-paths
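The max_dist_from_CF constraint described in the custom configuration comments can be pictured as a bond-distance check on the molecular graph. The sketch below is a simplified standard-library illustration, not the package's implementation: it measures the distance to the nearest fluorinated carbon with a breadth-first search, and the adjacency dict, atom indices, and helper names are all assumptions made for the example.

```python
from collections import deque


def bond_distance_to_set(adjacency, start, targets):
    """Breadth-first search: number of bonds from start to the nearest atom in targets."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom in targets:
            return dist
        for neighbour in adjacency[atom]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # no target atom is reachable


def within_max_dist(adjacency, match_atom, fluorinated, max_dist_from_CF):
    """True if the matched atom lies within max_dist_from_CF bonds of a fluorinated carbon."""
    dist = bond_distance_to_set(adjacency, match_atom, fluorinated)
    return dist is not None and dist <= max_dist_from_CF


# Carbon skeleton of CF3-CF2-CH2-CH2-OH: atoms 0 and 1 are fluorinated,
# atom 3 is the carbon bearing the hydroxyl group.
skeleton = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
fluorinated_carbons = {0, 1}

distance = bond_distance_to_set(skeleton, 3, fluorinated_carbons)
print(distance)
print(within_max_dist(skeleton, 3, fluorinated_carbons, 0))
print(within_max_dist(skeleton, 3, fluorinated_carbons, 3))
```

In this toy case the hydroxyl-bearing carbon is two bonds from the fluorinated component, so it is rejected at a limit of 0 and accepted at a limit of 3.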

Documentation

Usage Examples

See USER_GUIDE.md for comprehensive examples including:

  • Basic PFAS parsing and analysis
  • Fingerprint generation for machine learning
  • Custom configuration files
  • Batch processing
  • Integration with pandas and scikit-learn

Summary of changes by version

  • Version 2.2 (Feb 2026): Added the linker_smarts option to restrict the path between SMARTS-matched groups and the fluorinated component. Added new PFAS groups (telomers).
    • v2.2.2: Fixed telomer groups and added examples and counter-examples for each PFAS group. Removed the boundary O from fluorinated components (both perfluoroalkyl and polyfluoroalkyl).
    • v2.2.3: Added resultsModel for easier plotting and summarising of results.

  • Version 2.1 (Jan 2026): Added support for multiple SMARTS patterns per PFAS group, each with its own minimum match count.

  • Version 2.0 (Jan 2026): Major expansion of graph‑based component metrics, new coverage statistics, schema updates, and richer per‑component outputs.

Version 2.0 (January 2026) - Comprehensive Graph Metrics

Major enhancement adding comprehensive NetworkX graph theory metrics for detailed component analysis:

New Features:

  • Component-Level Metrics: Each fluorinated component now includes 15+ graph metrics:
    • diameter and radius - Graph eccentricity bounds
    • center, periphery, barycenter - Structural node sets
    • effective_graph_resistance - Sum of resistance distances
    • component_fraction - Fraction of molecule covered by component (includes all attached H, F, Cl, Br, I)
    • Distance metrics from functional groups to structural features
  • Molecular Coverage Metrics: New fraction-based metrics quantify fluorination extent:
    • mean_component_fraction - Average coverage per component
    • total_components_fraction - Total coverage by union of all components (accounts for overlaps)
  • Summary Statistics: Aggregated metrics across all components per PFAS group
  • Enhanced Database Models: New Components model stores individual component data with all metrics
  • Improved Analysis: Better understanding of molecular topology, branching, functional group positioning, and fluorination extent
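The difference between the per-component fractions and total_components_fraction shows up when components overlap. The sketch below uses plain sets of atom indices to illustrate it; the function name and input layout are assumptions for illustration, not the package's data model.

```python
def coverage_metrics(n_atoms, components):
    """Coverage fractions for components given as sets of atom indices.

    Each set is assumed to already include the attached H and halogen atoms.
    """
    fractions = [len(component) / n_atoms for component in components]
    union = set().union(*components)  # overlapping atoms are counted once
    return {
        "component_fractions": fractions,
        "mean_component_fraction": sum(fractions) / len(fractions) if fractions else 0.0,
        "total_components_fraction": len(union) / n_atoms,
    }


# Toy molecule with 8 atoms and two components sharing atom 3.
metrics = coverage_metrics(8, [{0, 1, 2, 3}, {3, 4}])
print(metrics)
```

The individual fractions sum to 0.75, but the union covers only 5 of 8 atoms (0.625), because the shared atom is counted once.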

Breaking Changes:

  • parse_mols output now includes additional summary metric fields (mean_diameter, mean_radius, etc.)
  • Database schema changes require migration (see DATABASE_MIGRATION_GUIDE.md)

Metrics Explained:

  • branching (0-1): Measures linearity (1.0 = linear, 0.0 = highly branched) - renamed from "eccentricity"
  • mean_eccentricity, median_eccentricity: Graph-theoretic eccentricity statistics for component nodes
  • smarts_centrality (0-1): Functional group position (1.0 = central, 0.0 = peripheral)
  • component_fraction (0-1): Fraction of total molecule atoms in this component (includes all attached atoms)
  • total_components_fraction (0-1): Fraction of molecule covered by union of all components
  • diameter: Maximum distance between any two atoms in component
  • radius: Minimum eccentricity across component nodes
  • barycenter: Nodes minimizing total distance to all other nodes
  • center: Nodes with minimum eccentricity
  • periphery: Nodes with maximum eccentricity

See COMPREHENSIVE_METRICS_SUMMARY.md for complete documentation.
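The definitions of diameter, radius, center, periphery, and barycenter can be checked by hand on a small graph. The following standard-library sketch computes them for a connected graph given as an adjacency dict; it is an illustration of the definitions only (NetworkX provides equivalents such as `diameter`, `radius`, `center`, `periphery`, and `barycenter`), and the helper names are hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def graph_metrics(adjacency):
    """Eccentricity-based metrics for a connected, unweighted graph."""
    all_dist = {node: bfs_distances(adjacency, node) for node in adjacency}
    ecc = {node: max(d.values()) for node, d in all_dist.items()}
    total = {node: sum(d.values()) for node, d in all_dist.items()}
    diameter, radius = max(ecc.values()), min(ecc.values())
    min_total = min(total.values())
    return {
        "diameter": diameter,
        "radius": radius,
        "center": sorted(n for n in ecc if ecc[n] == radius),
        "periphery": sorted(n for n in ecc if ecc[n] == diameter),
        "barycenter": sorted(n for n in total if total[n] == min_total),
    }


# Linear five-atom chain 0-1-2-3-4.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
chain_metrics = graph_metrics(chain)
print(chain_metrics)
```

For the linear chain, the two end atoms form the periphery and the middle atom is both the center and the barycenter.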

  • Version 1.x: Shift to component‑based analysis with improved SMARTS matching and better handling of branched/cyclic structures.

Version 1.x - Component-Based Analysis

  • Replaced chain-finding with connected component analysis
  • Added support for branched and cyclic structures
  • Improved SMARTS pattern matching for diverse PFAS classes
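As a rough picture of connected component analysis, the sketch below groups fluorinated carbons into components by walking only the bonds between fluorinated atoms. The adjacency dict, atom labels, and function name are assumptions for illustration; the package's own component detection is SMARTS-driven and more involved.

```python
def fluorinated_components(adjacency, fluorinated):
    """Connected components of the subgraph induced by the fluorinated atoms."""
    remaining = set(fluorinated)
    components = []
    while remaining:
        stack = [min(remaining)]  # lowest remaining index gives deterministic ordering
        component = set()
        while stack:
            atom = stack.pop()
            if atom in component:
                continue
            component.add(atom)
            # Only follow bonds that stay inside the fluorinated set.
            stack.extend(n for n in adjacency[atom] if n in remaining and n not in component)
        remaining -= component
        components.append(sorted(component))
    return components


# Carbon skeleton of CF3-CF2-CH2-CF2-CF3: the CH2 at index 2 splits the fluorinated atoms.
skeleton = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
components = fluorinated_components(skeleton, {0, 1, 3, 4})
print(components)
```

Because components are found by graph traversal rather than by following a single chain, branched and cyclic fluorinated fragments are handled the same way as linear ones.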

Version 0.x - Path-Based Analysis

  • Found a SMARTS match connected to either a second SMARTS match or a default path-related SMARTS, using NetworkX shortest_path.

Licence

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Contact me in case you want an exception to the No Derivatives term.

Acknowledgments

This project is part of the ZeroPM project (WP2) and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.


Powered by RDKit
