Chemical structure-label pair extraction from scientific documents.

structflo.cser

Figure: structflo.cser detection and pairing example.

Chemical structure and label extraction from scientific documents.

structflo.cser extracts chemical structure–label pairs from images and PDF pages. A fine-tuned YOLO detector, trained on synthetic chemical structure data, locates structures and compound labels on a page; a matcher then pairs them using either a Learned Pair Scorer (LPS) model or a simpler Hungarian matcher.

The extracted crops can be passed to any structure-to-SMILES converter (DECIMER, MolScribe) and any OCR engine for label text. DECIMER and EasyOCR are bundled for convenience, but any downstream tools can be swapped in.

The pipeline runs in two steps:

1. Detect: a fine-tuned YOLO detector finds all chemical structures and compound labels in the image.
2. Match: a matcher pairs each structure with its corresponding label, producing cropped image pairs.

	`LearnedMatcher` (default)	`HungarianMatcher`
Approach	Neural Pair Scorer (LPS)	Geometric (centroid distance)
Setup	Auto-downloads weights	Zero config
Speed	Fast (GPU accelerated)	Instantaneous
Accuracy	Better for complex or crowded pages	Good for simple layouts
Output	`CompoundPair`	`CompoundPair` (identical)

Installation

pip install structflo-cser

# or with uv
uv add structflo-cser

This also installs DECIMER and EasyOCR for downstream SMILES and text extraction. The core pipeline does not depend on them — any extractor implementation can be swapped in.

Quick Start

One call from image to (SMILES, label) pairs:

from structflo.cser.pipeline import ChemPipeline
from structflo.cser.lps import LearnedMatcher

pipeline = ChemPipeline(matcher=LearnedMatcher())
results = pipeline.process("page.png")

for pair in results:
    print(pair.smiles, pair.label_text)

Weights for both the detector and the LPS are auto-downloaded from Hugging Face Hub on first use.

Export to a pandas DataFrame or JSON:

df   = ChemPipeline.to_dataframe(results)
data = ChemPipeline.to_json(results)

   match_distance  match_confidence                              smiles     label_text
0          135.19            0.9844  CN1CCC2=C(C1)SC(=N2)C(=O)NC3=...      7178-39-6
1          208.40            0.9973  C1=CC(=CC=C1C2=C(C(=O)O)N=NN2...     72804-12-9
2          126.25            0.9997  COC1=CC=C(C=C1)C=C2C(=O)N(C3=...   ZINC2978 720
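Once exported, the records can be filtered with ordinary Python. The sketch below assumes each record is a dict carrying the four columns shown above; the exact structure returned by to_json() is not specified here, so the record layout, the `keep_confident` helper, and the 0.99 cut-off are all illustrative.

```python
# Illustrative records mirroring the columns shown above (assumed layout,
# not necessarily the exact structure returned by to_json()).
records = [
    {"match_distance": 135.19, "match_confidence": 0.9844,
     "smiles": "CN1CCC2=C(C1)SC(=N2)C(=O)NC3=C(C=CC=C3)CNC(=O)C4=CC=CC(=C4)Cl",
     "label_text": "7178-39-6"},
    {"match_distance": 208.40, "match_confidence": 0.9973,
     "smiles": "C1=CC(=CC=C1C2=C(C(=O)O)N=NN2C3=CC=C(C=C3)S(=O)(=O)N)Br",
     "label_text": "72804-12-9"},
]


def keep_confident(rows, min_confidence):
    """Return rows whose match_confidence is at least min_confidence, best first."""
    kept = [r for r in rows if r["match_confidence"] >= min_confidence]
    return sorted(kept, key=lambda r: r["match_confidence"], reverse=True)


confident = keep_confident(records, min_confidence=0.99)
for row in confident:
    print(row["label_text"], row["match_confidence"])
```

A stricter cut-off trades recall for precision, so the right value depends on how noisy the source pages are.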

PDF input

For PDFs, use process_pdf() — it renders each page and returns one result list per page:

from structflo.cser.pipeline import ChemPipeline
from structflo.cser.lps import LearnedMatcher

pipeline = ChemPipeline(matcher=LearnedMatcher())

# Returns list[list[CompoundPair]] — one inner list per page
all_pages = pipeline.process_pdf("paper.pdf")

for page_num, pairs in enumerate(all_pages):
    print(f"Page {page_num + 1}: {len(pairs)} compound pairs")
    for pair in pairs:
        print(f"  {pair.label_text:20s}  {pair.smiles}")

Pass output_pdf to save an annotated copy with bounding boxes and extracted data overlaid:

pipeline.process_pdf("paper.pdf", output_pdf="paper_annotated.pdf")
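The nested per-page result is convenient for iteration, but a flat list of rows is often easier to export. The sketch below flattens a list of pages into dict rows tagged with a 1-based page number. `Pair` is a hypothetical stand-in that only mimics the `smiles` and `label_text` attributes used above, and `flatten_pages` is not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """Hypothetical stand-in exposing the two attributes used above."""
    smiles: str
    label_text: str


def flatten_pages(all_pages):
    """Turn list[list[pair]] into flat dict rows tagged with a 1-based page number."""
    rows = []
    for page_num, pairs in enumerate(all_pages, start=1):
        for pair in pairs:
            rows.append(
                {"page": page_num, "label_text": pair.label_text, "smiles": pair.smiles}
            )
    return rows


# Toy input: page 1 has two pairs, page 2 has none, page 3 has one.
all_pages = [
    [Pair("CCO", "1a"), Pair("c1ccccc1", "1b")],
    [],
    [Pair("CC(=O)O", "2")],
]
rows = flatten_pages(all_pages)
print(rows)
```

Keeping the page number on each row makes it possible to trace a compound back to its source page after the rows are merged or filtered.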

Step-by-Step Pipeline

For finer control, each stage is exposed individually.

1. Create the pipeline

from structflo.cser.pipeline import ChemPipeline

# Default: LearnedMatcher — auto-downloads LPS weights on first use
pipeline = ChemPipeline(tile=False, conf=0.70)

For a heuristic-based approach, use HungarianMatcher:

from structflo.cser.pipeline import ChemPipeline, HungarianMatcher

pipeline = ChemPipeline(
    tile=False,
    conf=0.70,
    matcher=HungarianMatcher(max_distance=500),
)

The pipeline is lazy — detector weights, DECIMER, and EasyOCR are loaded on first use only.
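Lazy loading of this kind is commonly built on a cached property: the expensive object is created on first access and reused afterwards. The sketch below shows the general pattern only. `LazyPipeline` and its attributes are hypothetical and say nothing about how the library implements it internally.

```python
from functools import cached_property


class LazyPipeline:
    """Toy pipeline; names are hypothetical and unrelated to the library's internals."""

    def __init__(self):
        self.load_count = 0  # counts how many times the expensive load ran

    @cached_property
    def detector(self):
        # Stands in for an expensive step such as reading model weights.
        self.load_count += 1
        return lambda image: f"detections for {image}"

    def detect(self, image):
        return self.detector(image)  # first access triggers the load


pipe = LazyPipeline()
assert pipe.load_count == 0      # nothing loaded at construction time
first = pipe.detect("page.png")
second = pipe.detect("other.png")
print(pipe.load_count)           # the loader ran once across both calls
```

The practical consequence is that constructing a pipeline is cheap, while the first call to each stage pays the one-time loading cost.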

2. Detect

detections = pipeline.detect("page.png")

n_struct = sum(1 for d in detections if d.class_id == 0)
n_label  = sum(1 for d in detections if d.class_id == 1)
print(f"Found {n_struct} structures and {n_label} labels")
# Found 6 structures and 6 labels

Class IDs: class_id 0 is a chemical structure; class_id 1 is a compound label.

3. Match

pairs = pipeline.match(detections)
# Matched 6 structure–label pairs
#   Pair 0: distance=135px  structure@(490,421)  label@(489,285)
#   Pair 1: distance=208px  structure@(258,194)  label@(466,195)

4. Visualise

from structflo.cser.viz import plot_detections, plot_pairs, plot_crops, plot_results

# img is the loaded page image (not created in the earlier steps);
# results is the list returned by pipeline.process().
fig = plot_detections(img, detections)   # green = structure, blue = label
fig = plot_pairs(img, pairs)             # orange lines connect matched pairs
fig = plot_crops(img, pairs)             # cropped structure and label regions
fig = plot_results(img, results)         # final annotated output

Figure: detection and pairing visualisation.

5. Enrich — SMILES and label text

enriched = pipeline.enrich(pairs, "page.png")

for i, p in enumerate(enriched):
    print(f"Pair {i}:")
    print(f"  SMILES:     {p.smiles}")
    print(f"  Label text: {p.label_text}")

Pair 0:
  SMILES:     CN1CCC2=C(C1)SC(=N2)C(=O)NC3=C(C=CC=C3)CNC(=O)C4=CC=CC(=C4)Cl
  Label text: 7178-39-6

Pair 1:
  SMILES:     C1=CC(=CC=C1C2=C(C(=O)O)N=NN2C3=CC=C(C=C3)S(=O)(=O)N)Br
  Label text: 72804-12-9

Matchers

Learned Pair Scorer — `LearnedMatcher` (default)

A neural matcher trained to score structure–label compatibility using both visual crops and geometric features. It replaces the raw distance cost matrix with a learned association probability, then solves global assignment with the Hungarian algorithm.

Weights are auto-downloaded from Hugging Face Hub on first use, so no manual setup is needed. Models are hosted at:

Detector: huggingface.co/sidxz/structflo-cser-detector
LPS scorer: huggingface.co/sidxz/structflo-cser-lps

from structflo.cser.pipeline import ChemPipeline
from structflo.cser.lps import LearnedMatcher

pipeline = ChemPipeline(
    matcher=LearnedMatcher(
        min_score=0.5,      # drop pairs below this confidence
        max_dist_px=None,   # optional centroid pre-filter to save compute
    )
)

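min_score: pairs scoring below this threshold are discarded, and the structure is treated as unlabelled.

max_dist_px: optional centroid-distance pre-filter, in pixels, that skips distant candidates to save compute.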
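The idea of scoring every candidate pair and then solving a global assignment can be sketched in plain Python. The example assumes a precomputed score matrix and maximises the summed score; the library's actual cost transformation is not described here, and brute force over permutations stands in for the Hungarian algorithm. `assign_by_score` is a hypothetical helper.

```python
from itertools import permutations


def assign_by_score(scores, min_score=0.5):
    """Pick the one-to-one assignment with the highest total score.

    scores[i][j] is the (assumed) model probability that structure i belongs
    to label j. Brute force over permutations stands in for the Hungarian
    algorithm; it reaches the same optimum but only scales to tiny inputs.
    Assumes there are at least as many labels as structures.
    """
    n_struct, n_label = len(scores), len(scores[0])
    best = max(
        permutations(range(n_label), n_struct),
        key=lambda cols: sum(scores[i][j] for i, j in enumerate(cols)),
    )
    # Pairs below min_score are dropped, leaving that structure unlabelled.
    return [(i, j, scores[i][j]) for i, j in enumerate(best) if scores[i][j] >= min_score]


scores = [
    [0.90, 0.60, 0.05],
    [0.85, 0.10, 0.02],
    [0.03, 0.04, 0.20],
]
pairs = assign_by_score(scores)
print(pairs)  # [(0, 1, 0.6), (1, 0, 0.85)]
```

In this toy matrix a greedy pick would give label 0 to structure 0 (score 0.90) and leave structure 1 with a poor match. The global assignment instead gives label 0 to structure 1, and structure 2 ends up unlabelled because its best remaining score is below min_score.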

Hungarian Matcher — `HungarianMatcher` (fallback)

Pairs structures and labels by minimising the total centroid-to-centroid distance. It needs no configuration and no weight download, which makes it useful for simple document layouts or as a fast sanity check.

from structflo.cser.pipeline import ChemPipeline, HungarianMatcher

pipeline = ChemPipeline(
    matcher=HungarianMatcher(max_distance=500),
)

max_distance — maximum pixel distance for a valid pair. Increase for large pages; reduce to avoid false pairings on dense layouts.
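Distance-based matching with a cut-off can be illustrated as follows. The sketch assumes boxes are given as (x1, y1, x2, y2) tuples in pixels, which may differ from the library's detection objects, and uses brute force over permutations in place of the Hungarian algorithm. `centroid` and `match_by_distance` are hypothetical helpers.

```python
from itertools import permutations
from math import hypot


def centroid(box):
    """Centre (x, y) of an (x1, y1, x2, y2) box (assumed box format)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def match_by_distance(structures, labels, max_distance=500):
    """Minimise total centroid distance, then drop pairs farther than max_distance.

    Brute force over permutations replaces the Hungarian algorithm for brevity;
    assumes len(structures) <= len(labels).
    """
    s_pts = [centroid(b) for b in structures]
    l_pts = [centroid(b) for b in labels]
    dist = [[hypot(sx - lx, sy - ly) for lx, ly in l_pts] for sx, sy in s_pts]
    best = min(
        permutations(range(len(labels)), len(structures)),
        key=lambda cols: sum(dist[i][j] for i, j in enumerate(cols)),
    )
    return [(i, j, dist[i][j]) for i, j in enumerate(best) if dist[i][j] <= max_distance]


# Two structures with a label directly below each, plus one stray label far away.
structures = [(0, 0, 100, 100), (300, 0, 400, 100)]
labels = [(320, 120, 380, 140), (20, 120, 80, 140), (900, 900, 960, 920)]

matched = match_by_distance(structures, labels, max_distance=500)
print(matched)  # [(0, 1, 80.0), (1, 0, 80.0)]
```

Lowering max_distance to 50 in this example rejects both pairs, which shows how a tight threshold suppresses pairings at the cost of leaving structures unlabelled.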

Downstream Processing

structflo.cser outputs cropped image pairs. Plug in any converter for SMILES and any OCR for label text.

SMILES extraction

DECIMER is bundled by default. Swap it for MolScribe or any custom BaseSmilesExtractor subclass:

from structflo.cser.pipeline import ChemPipeline
from structflo.cser.pipeline.smiles_extractor import BaseSmilesExtractor

class MyExtractor(BaseSmilesExtractor):
    def extract(self, image) -> str:
        # my_model is a placeholder for your own structure-to-SMILES model
        return my_model.predict(image)

pipeline = ChemPipeline(smiles_extractor=MyExtractor())

OCR

EasyOCR is bundled by default. Swap it for any custom BaseOCR subclass:

from structflo.cser.pipeline import ChemPipeline
from structflo.cser.pipeline.ocr import BaseOCR

class MyOCR(BaseOCR):
    def extract(self, image) -> str:
        # my_ocr is a placeholder for your own OCR engine
        return my_ocr.read(image)

pipeline = ChemPipeline(ocr=MyOCR())

CLI

Run extraction directly from the terminal:

# Detect and pair structures/labels in a directory of images
sf-detect --image_dir data/test_images/ --conf 0.60 --no_tile --pair --max_dist 500

# Full pipeline: detect → match → SMILES + OCR
sf-extract page.png

All available commands:

Command	Description
`sf-detect`	Run YOLO detection on images
`sf-extract`	Full pipeline: detect → match → extract
`sf-generate`	Generate synthetic training data
`sf-train`	Train the YOLO detection model
`sf-train-lps`	Train the Learned Pair Scorer
`sf-eval-lps`	Evaluate LPS on a test set
`sf-fetch-smiles`	Download SMILES from ChEMBL
`sf-download-distractors`	Download distractor images for generation
`sf-annotate`	Launch the web annotation server

Notebooks

Notebook	Description
01-quickstart.ipynb	Step-by-step pipeline walkthrough: detect → match → enrich, then one-call convenience API
02-LPS.ipynb	Using the Learned Pair Scorer for improved matching on complex document pages

License

Apache License 2.0

Project details

Release history

Version	Released
0.4.1 (this version)	Jun 5, 2026
0.4.0	Jun 1, 2026
0.3.0	May 29, 2026
0.2.0	Feb 26, 2026
0.1.0	Feb 22, 2026
