A comprehensive cheminformatics package for automated detection, classification, and analysis of per- and polyfluoroalkyl substances (PFAS). Combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding to identify 55 PFAS group classifications (28 OECD-defined groups and 27 generic categories) with fluorinated chain length determination.
PFASgroups
A comprehensive cheminformatics package for automated detection, classification, and analysis of per- and polyfluoroalkyl substances (PFAS) in chemical databases.
Overview
PFASgroups combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding (using RDKit and NetworkX) to identify and classify PFAS compounds. The package enables systematic PFAS universe mapping and environmental monitoring applications.
Key Features
Core Capabilities
- PFAS Group Identification: Automated detection of 113 functional groups:
  - 72 non-telomer groups (OECD-defined and generic categories)
  - 40 fluorotelomer groups with linker validation (Groups 69-112)
  - 1 aggregate pattern-matching group (Group 113: Telomers)
- Atom Reference Requirement: For non-telomer groups, SMARTS patterns must match atoms that are part of or directly connected to the fluorinated component (per/polyfluorinated carbons), respecting the `max_dist_from_CF` constraint
- Linker Validation: CH₂-specific validation for 40 fluorotelomer groups to distinguish from direct-attachment analogues. Telomer groups use `linker_smarts` to allow functional groups separated from perfluoro chains by non-fluorinated linkers
- Aggregate Groups: Pattern-matching groups that collect related PFAS groups via regex (e.g., Group 113 matches all "telomer" groups)
- Component Length Analysis: Quantification of per- and polyfluorinated alkyl components with CF₂ unit counting
- Graph Metrics: Comprehensive structural characterization (branching, eccentricity, diameter, resistance, centrality)
- Customizable Definitions: Easy extension to additional PFAS groups and halogenated chemical classes via JSON configuration
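The atom reference requirement can be pictured as a bounded breadth-first search over the molecular graph. The sketch below is not the package's internal code: it uses a plain adjacency dictionary instead of an RDKit molecule, `within_bond_distance` is a hypothetical helper, and counting the distance in bonds from the matched atom to the nearest fluorinated carbon is an assumption about how `max_dist_from_CF` is measured.

```python
from collections import deque

def bond_distance(adjacency, start, targets):
    """Fewest bonds from `start` to any atom in `targets`, or None if unreachable."""
    targets = set(targets)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom in targets:
            return dist
        for neighbour in adjacency[atom]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def within_bond_distance(adjacency, match_atom, fluorinated_carbons, max_dist_from_CF=0):
    """Hypothetical check: is the matched atom close enough to a fluorinated carbon?"""
    dist = bond_distance(adjacency, match_atom, fluorinated_carbons)
    return dist is not None and dist <= max_dist_from_CF

# Heavy-atom skeleton of CF3-CF2-CH2-CH2-OH (fluorine atoms omitted):
# 0 = CF3 carbon, 1 = CF2 carbon, 2 = CH2, 3 = CH2 bearing the OH, 4 = O
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
fluorinated = {0, 1}

strict = within_bond_distance(adjacency, 3, fluorinated, max_dist_from_CF=0)
relaxed = within_bond_distance(adjacency, 3, fluorinated, max_dist_from_CF=3)
```

Under these assumptions the hydroxyl-bearing carbon sits two bonds from the nearest fluorinated carbon, so it is rejected with the default of 0 and accepted with a limit of 3.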
Additional Tools
- Homologue Series Generation: Iterative component shortening to explore theoretical chemical space
- Fingerprint Generation: PFAS fingerprints for machine learning applications
- Visualization: Assign and visualize PFAS groupings
- Multiple Interfaces: Python API, command-line tool, and browser-based JavaScript version (RDKitJS)
- Batch Processing: Efficient analysis of large chemical databases
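To give a feel for what a homologue series looks like, here is a toy sketch that builds SMILES strings for linear perfluoroalkyl carboxylic acids of decreasing length. It is plain string construction for a single family, not the graph-based component shortening the package performs, and `pfca_smiles` and `homologue_series` are hypothetical names.

```python
def pfca_smiles(n_cf2):
    """SMILES for CF3-(CF2)n-COOH with n internal CF2 units."""
    if n_cf2 < 0:
        raise ValueError("n_cf2 must be non-negative")
    return "FC(F)(F)" + "C(F)(F)" * n_cf2 + "C(=O)O"

def homologue_series(n_cf2):
    """Shorten the chain one CF2 unit at a time, longest member first."""
    return [pfca_smiles(n) for n in range(n_cf2, -1, -1)]

series = homologue_series(2)
for smiles in series:
    print(smiles)
```

Each member differs from the next by exactly one CF₂ unit, which is the property a homologue series is built around.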
Installation
Clone the repository and install dependencies:
pip install -e .
After installation, the pfasgroups command will be available in your terminal.
Benchmark Summary (Feb 2026)
Benchmarks were run on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (4C/8T), 15.5 GB RAM, Python 3.9.23, RDKit 2025.09.2, NetworkX 3.2.1. This older Python version was chosen for compatibility with PFAS-Atlas.
| Dataset/Profile | Count | Atom range | PFASgroups mean/median (ms) | PFAS-Atlas mean/median (ms) | Relative speed | Notes |
|---|---|---|---|---|---|---|
| OECD reference (real compounds) | 3,414 | Typical small/medium | 19.2 / 14.8 | 38.8 / 37.7 | 2.02x faster | Real-world dataset representing existing compounds. |
| Timing stress-test (full metrics) | 2,500 | 11-625 | 251.8 / 24.4 | 58.5 / 34.2 | 0.23x | Synthetic stress-test with large molecules; heavy-tail runtime. |
| Timing stress-test (no resistance) | 2,500 | 11-619 | 176.7 / 24.8 | N/A | 1.43x faster vs full | Disables effective graph resistance only. |
| Timing stress-test (no metrics) | 2,500 | 11-619 | 97.7 / 19.7 | N/A | 2.58x faster vs full | Disables all component graph metrics. |
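The "Relative speed" column can be reproduced from the mean timings in the table. The short sketch below assumes the ratio is reference mean divided by candidate mean, which matches the reported figures; `speedup` is a hypothetical helper. It also computes the median ratio for the full-metrics stress test, a value not shown in the table, to illustrate how the heavy tail drives the mean.

```python
def speedup(reference_ms, candidate_ms):
    """How many times faster the candidate is than the reference."""
    return round(reference_ms / candidate_ms, 2)

# Means taken from the table above (ms per molecule)
oecd_vs_atlas = speedup(38.8, 19.2)            # PFAS-Atlas / PFASgroups
stress_vs_atlas = speedup(58.5, 251.8)         # PFAS-Atlas / PFASgroups (full metrics)
no_resistance_vs_full = speedup(251.8, 176.7)  # full / no resistance
no_metrics_vs_full = speedup(251.8, 97.7)      # full / no metrics

# Medians for the full-metrics stress test tell a different story than the means
stress_median_vs_atlas = speedup(34.2, 24.4)

print(oecd_vs_atlas, stress_vs_atlas, no_resistance_vs_full, no_metrics_vs_full)
print(stress_median_vs_atlas)
```

On the stress test the mean ratio is below 1 while the median ratio is above 1, so the slowdown comes from a minority of very large molecules rather than from typical inputs.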
[Figure: timing profile plots comparing full metrics, no resistance, and no metrics]
Disable or limit graph metrics in the Python API:
from PFASgroups import parse_smiles
# Example input
smiles_list = ["C(C(F)(F)F)F", "FC(F)(F)C(F)(F)C(=O)O"]
# Skip all component graph metrics (fastest)
parse_smiles(smiles_list, compute_component_metrics=False)
# Keep metrics but skip effective graph resistance entirely
parse_smiles(smiles_list, limit_effective_graph_resistance=0)
# Compute resistance only for components below a size threshold
parse_smiles(smiles_list, limit_effective_graph_resistance=200)
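A rough illustration of why effective graph resistance dominates runtime on large molecules: it is the sum of resistance distances over all atom pairs, so the work grows at least quadratically with component size. The sketch below is an assumption-laden simplification, not the package's implementation. It treats every bond as a unit resistor and only handles acyclic components, where resistance distance equals shortest-path length; cyclic components need the Laplacian pseudo-inverse, which is not shown.

```python
from collections import deque

def path_graph(n):
    """Adjacency dict for an unbranched chain of n atoms."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def bfs_distances(adjacency, source):
    """Shortest-path length (in bonds) from `source` to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def tree_effective_resistance(adjacency):
    """Sum of pairwise distances; valid as resistance only for acyclic graphs."""
    total = sum(sum(bfs_distances(adjacency, s).values()) for s in adjacency)
    return total // 2  # each unordered pair was counted twice

small = tree_effective_resistance(path_graph(4))
larger = tree_effective_resistance(path_graph(8))
print(small, larger)
```

Doubling the chain from 4 to 8 atoms raises the value from 10 to 84, and the number of atom pairs to visit grows with it, which is why a size threshold is a practical lever.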
CLI equivalents:
# Skip all component graph metrics (fastest)
pfasgroups parse --no-component-metrics "C(C(F)(F)F)F"
# Skip effective graph resistance entirely
pfasgroups parse --limit-effective-graph-resistance 0 "C(C(F)(F)F)F"
# Compute resistance only for components below a size threshold
pfasgroups parse --limit-effective-graph-resistance 200 "C(C(F)(F)F)F"
Quick Start
Python API
from PFASgroups import parse_smiles, generate_fingerprint
# Parse PFAS structures
smiles_list = ["C(C(F)(F)F)F", "FC(F)(F)C(F)(F)C(=O)O"]
results = parse_smiles(smiles_list)
# Generate fingerprints
fingerprints, group_info = generate_fingerprint(smiles_list)
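The return format of `generate_fingerprint` is not described here, so the following is only a conceptual sketch of one common encoding: a fixed-length vector with one position per PFAS group, set to 1 when the molecule matches that group. The group IDs, molecule names, and `group_bits` helper are all hypothetical.

```python
def group_bits(matched_ids, all_ids):
    """One bit per group ID, in the order given by `all_ids`."""
    matched = set(matched_ids)
    return [1 if gid in matched else 0 for gid in all_ids]

all_ids = [1, 2, 3, 4, 5]                   # hypothetical group IDs
matches = {"mol_a": [2, 5], "mol_b": []}    # hypothetical group assignments

fingerprints = {name: group_bits(ids, all_ids) for name, ids in matches.items()}
print(fingerprints)
```

Vectors of this shape can be stacked into a feature matrix for machine learning, which is the use case the fingerprint tool targets.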
Command Line
# Parse SMILES strings
pfasgroups parse "C(C(F)(F)F)F" "FC(F)(F)C(F)(F)C(=O)O"
# Generate fingerprints
pfasgroups fingerprint "C(C(F)(F)F)F" --format dict
# List available PFAS groups
pfasgroups list-groups
Custom Configuration
Use custom component SMARTS (path type) definitions and PFAS groups:
# Load custom files entirely
from PFASgroups import get_componentSmartss, get_PFASGroups, parse_smiles
custom_paths = get_componentSmartss(filename='my_component_smartss.json')
custom_groups = get_PFASGroups(filename='my_groups.json')
results = parse_smiles(
    ["C(C(F)(F)F)F"],
    componentSmartss=custom_paths,
    pfas_groups=custom_groups
)
# Or extend defaults with your custom groups
from PFASgroups import get_PFASGroups, PFASGroup, parse_smiles, compile_componentSmarts, get_componentSmartss
# Add custom PFAS groups
groups = get_PFASGroups() # Get defaults
groups.append(PFASGroup(
    id=999,
    name="My Custom Group",
    smarts1="[C](F)(F)F",
    smarts2="[N+](=O)[O-]",
    componentSmarts="Perfluoroalkyl",
    constraints={"nF": [3, None]}
))
results = parse_smiles(["FC(F)(F)C(F)(F)[N+](=O)[O-]"], pfas_groups=groups)
# Custom max_dist_from_CF parameter
# For functional groups without formula constraints, when bycomponent=True,
# the max_dist_from_CF parameter limits the maximum bond distance between
# a functional group match and a fluorinated carbon terminal atom (default: 0)
groups.append(PFASGroup(
    id=998,
    name="Extended Distance Group",
    smarts1="[#6$([#6][OH1])]",
    smarts2=None,
    componentSmarts=None,
    constraints={},
    max_dist_from_CF=3  # Allow up to 3 bonds from fluorinated carbon
))
# Add custom path types (e.g., chlorinated analogs)
paths = get_componentSmartss()
paths['Perchlorinated'] = compile_componentSmarts(
    "[C;X4](Cl)(Cl)!@!=!#[C;X4](Cl)(Cl)",  # component pattern
    "[C;X4](Cl)(Cl)Cl"  # end pattern
)
results = parse_smiles(["ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O"], componentSmartss=paths)
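The `constraints={"nF": [3, None]}` entry in the custom group above reads naturally as a molecular formula constraint. The sketch below shows how such a check could work, assuming each value is an inclusive `[min, max]` range with `None` meaning unbounded and each key is `n` followed by an element symbol. Both the interpretation and the helper names are assumptions, and the parser only handles simple formulas without parentheses or charges.

```python
import re

def element_counts(formula):
    """Count atoms per element in a simple formula such as 'C2F5NO2'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

def satisfies(formula, constraints):
    """Check inclusive [min, max] element-count ranges (assumed semantics)."""
    counts = element_counts(formula)
    for key, (low, high) in constraints.items():
        n = counts.get(key[1:], 0)  # "nF" -> "F" (assumed naming scheme)
        if low is not None and n < low:
            return False
        if high is not None and n > high:
            return False
    return True

constraints = {"nF": [3, None]}
nitro_example = satisfies("C2F5NO2", constraints)   # FC(F)(F)C(F)(F)[N+](=O)[O-]
difluoro_example = satisfies("C2H4F2", constraints)
print(nitro_example, difluoro_example)
```

With at least three fluorine atoms required, the pentafluoro nitro compound passes and a difluoroethane does not.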
# Via command line
pfasgroups parse --groups-file my_custom_groups.json "C(C(F)(F)F)F"
# List available groups and paths
pfasgroups list-groups
pfasgroups list-paths
Documentation
- USER_GUIDE.md - Complete documentation with examples
- QUICK_REFERENCE.md - Quick reference for common tasks
Usage Examples
See USER_GUIDE.md for comprehensive examples including:
- Basic PFAS parsing and analysis
- Fingerprint generation for machine learning
- Custom configuration files
- Batch processing
- Integration with pandas and scikit-learn
Summary of changes by version
- Version 2.2 (Feb 2026): Added the linked_smarts option to restrict the path between SMARTS matches and the fluorinated component. Added new PFAS groups (telomers).
  - v2.2.2: Fixed telomers and added examples and counter-examples to each PFAS group. Removed the boundary O in fluorinated components (for both per- and polyfluoroalkyl components).
  - v2.2.3: Added resultsModel for easier plotting and summarising of results.
- Version 2.1 (Jan 2026): Added support for multiple SMARTS per PFAS group, each with its own minimum count.
- Version 2.0 (Jan 2026): Major expansion of graph-based component metrics, new coverage statistics, schema updates, and richer per-component outputs.
Version 2.0 (January 2026) - Comprehensive Graph Metrics
Major enhancement adding comprehensive NetworkX graph theory metrics for detailed component analysis:
New Features:
- Component-Level Metrics: Each fluorinated component now includes 15+ graph metrics:
  - `diameter` and `radius` - Graph eccentricity bounds
  - `center`, `periphery`, `barycenter` - Structural node sets
  - `effective_graph_resistance` - Sum of resistance distances
  - `component_fraction` - Fraction of molecule covered by component (includes all attached H, F, Cl, Br, I)
  - Distance metrics from functional groups to structural features
- Molecular Coverage Metrics: New fraction-based metrics quantify fluorination extent:
  - `mean_component_fraction` - Average coverage per component
  - `total_components_fraction` - Total coverage by union of all components (accounts for overlaps)
- Summary Statistics: Aggregated metrics across all components per PFAS group
- Enhanced Database Models: New `Components` model stores individual component data with all metrics
- Improved Analysis: Better understanding of molecular topology, branching, functional group positioning, and fluorination extent
Breaking Changes:
- `parse_mols` output now includes additional summary metric fields (`mean_diameter`, `mean_radius`, etc.)
- Database schema changes require migration (see `DATABASE_MIGRATION_GUIDE.md`)
Metrics Explained:
- `branching` (0-1): Measures linearity (1.0 = linear, 0.0 = highly branched) - renamed from "eccentricity"
- `mean_eccentricity`, `median_eccentricity`: Graph-theoretic eccentricity statistics for component nodes
- `smarts_centrality` (0-1): Functional group position (1.0 = central, 0.0 = peripheral)
- `component_fraction` (0-1): Fraction of total molecule atoms in this component (includes all attached atoms)
- `total_components_fraction` (0-1): Fraction of molecule covered by union of all components
- `diameter`: Maximum distance between any two atoms in component
- `radius`: Minimum eccentricity across component nodes
- `barycenter`: Nodes minimizing total distance to all other nodes
- `center`: Nodes with minimum eccentricity
- `periphery`: Nodes with maximum eccentricity
See COMPREHENSIVE_METRICS_SUMMARY.md for complete documentation.
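The distance-based definitions above can be checked by hand on a small component. The sketch below computes them with plain breadth-first search on an adjacency dictionary. It is an illustration of the definitions, not the package's code (which relies on NetworkX), and it assumes a connected component with unit-length bonds; `component_metrics` is a hypothetical helper.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path length (in bonds) from `source` to every atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def component_metrics(adjacency):
    """Diameter, radius, center, periphery and barycenter of a connected graph."""
    ecc, total = {}, {}
    for node in adjacency:
        dist = bfs_distances(adjacency, node)
        ecc[node] = max(dist.values())    # eccentricity: farthest atom from node
        total[node] = sum(dist.values())  # total distance to all other atoms
    radius, diameter = min(ecc.values()), max(ecc.values())
    best_total = min(total.values())
    return {
        "diameter": diameter,
        "radius": radius,
        "center": sorted(n for n, e in ecc.items() if e == radius),
        "periphery": sorted(n for n, e in ecc.items() if e == diameter),
        "barycenter": sorted(n for n, t in total.items() if t == best_total),
    }

# Chain 0-1-2-3 with a one-atom branch (4) on atom 1
branched = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
metrics = component_metrics(branched)
print(metrics)
```

For this branched skeleton the center contains two atoms while the barycenter contains only the branch point, which shows that the two node sets can differ.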
- Version 1.x: Shift to component‑based analysis with improved SMARTS matching and better handling of branched/cyclic structures.
Version 1.x - Component-Based Analysis
- Replaced chain-finding with connected component analysis
- Added support for branched and cyclic structures
- Improved SMARTS pattern matching for diverse PFAS classes
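Connected component analysis can be sketched as a graph traversal restricted to fluorinated carbons. The example below is a simplified illustration on a plain adjacency dictionary, not the package's implementation; which atoms count as fluorinated is supplied by hand here, and `fluorinated_components` is a hypothetical helper.

```python
def fluorinated_components(adjacency, fluorinated):
    """Group fluorinated atoms into connected components (depth-first search)."""
    fluorinated = set(fluorinated)
    seen = set()
    components = []
    for start in sorted(fluorinated):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            atom = stack.pop()
            component.append(atom)
            for neighbour in adjacency[atom]:
                # Only walk through atoms that are themselves fluorinated
                if neighbour in fluorinated and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components

# Carbon skeleton of CF3-CF2-CH2-CF2-CF3: atom 2 (CH2) is not fluorinated
skeleton = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
components = fluorinated_components(skeleton, fluorinated={0, 1, 3, 4})
print(components)
```

Because the traversal follows every neighbour rather than a single chain, the same routine handles branched and cyclic skeletons without special cases.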
Version 0.x - Path-Based Analysis
- Found SMARTS matches connected to either a second SMARTS match or a default path-related SMARTS, using NetworkX shortest_path.
Licence
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Contact me in case you want an exception to the No Derivatives term.
Acknowledgments
This project is part of the ZeroPM project (WP2) and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the Department of Environmental Science at Stockholm University.