
Signature detection and role attribution for PDFs


CaseWorks.Automation.CaseDocumentIntake

sigdetect

sigdetect is a small Python library + CLI that detects e-signature evidence in PDFs and infers the signer role (e.g., patient, attorney, representative).

It looks for:

  • Real signature form fields (/Widget annotations with /FT /Sig)
  • AcroForm signature fields present only at the document level
  • Common vendor markers (e.g., DocuSign, “Signature Certificate”)
  • Page labels (like “Signature of Patient” or “Signature of Parent/Guardian”)

It returns a structured summary per file (pages, counts, roles, hints, etc.) that can be used downstream.
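As a rough illustration of the first check in the list above, the sketch below tests whether an annotation dictionary is a signature widget. It is not sigdetect's implementation: the helper names `inherited` and `is_signature_widget` are hypothetical, and the annotations are modeled as plain mappings. With a real PDF library, indirect references would need to be resolved first.

```python
from typing import Any, Mapping, Optional


def inherited(obj: Mapping[str, Any], key: str) -> Optional[Any]:
    """Look up `key` on the object, then walk up its /Parent chain.

    /FT is an inheritable field attribute, so a widget may carry it
    on a parent field rather than on the annotation itself.
    """
    node: Optional[Mapping[str, Any]] = obj
    for _ in range(32):  # guard against cyclic /Parent chains
        if node is None:
            return None
        if key in node:
            return node[key]
        node = node.get("/Parent")
    return None


def is_signature_widget(annot: Mapping[str, Any]) -> bool:
    """True for a /Widget annotation whose (possibly inherited) /FT is /Sig."""
    return annot.get("/Subtype") == "/Widget" and inherited(annot, "/FT") == "/Sig"


# Illustrative annotation dictionaries (hypothetical data).
direct = {"/Subtype": "/Widget", "/FT": "/Sig", "/T": "sig_patient"}
via_parent = {"/Subtype": "/Widget", "/Parent": {"/FT": "/Sig", "/T": "sig_firm"}}
text_field = {"/Subtype": "/Widget", "/FT": "/Tx"}

print([is_signature_widget(a) for a in (direct, via_parent, text_field)])
```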


Quick start

Requirements

  • Python 3.9+ (developed & tested on 3.11)
  • macOS / Linux / WSL

Setup

# 1) Create and activate a virtualenv (example uses Python 3.11)
python3.11 -m venv .venv
source .venv/bin/activate

# 2) Install in editable (dev) mode
python -m pip install --upgrade pip
pip install -e .

Sanity check

# Run unit & smoke tests
pytest -q

CLI usage

The project ships a Typer-based CLI (exposed either as sigdetect or runnable via python -m sigdetect.cli, depending on how it is installed).

sigdetect --help
# or
python -m sigdetect.cli --help

Detect (per-file summary)

# Execute detection according to the YAML configuration
sigdetect detect \
  --config ./sample_data/config.yml \
  --profile hipaa            # or: retainer

Notes

  • The config file controls pdf_root, out_dir, engine, pseudo_signatures, recurse_xobjects, etc.
  • --engine supports pypdf2 (default); a pymupdf engine placeholder exists and may be included in a future build.
  • --pseudo-signatures enables a vendor/Acro-only pseudo-signature when no actual /Widget is present (useful for DocuSign / Acrobat Sign receipts).
  • --recurse-xobjects allows scanning Form XObjects for vendor markers and labels embedded in page resources.
  • --profile selects tuned role logic:
    • hipaa → patient / representative / attorney
    • retainer → client / firm (prefers detecting two signatures)
  • If the executable is not on PATH, you can always fall back to python -m sigdetect.cli with the same arguments.

EDA (quick aggregate stats)

sigdetect eda \
  --config ./sample_data/config.yml

Library usage

from pathlib import Path
from sigdetect.config import DetectConfiguration
from sigdetect.detector.pypdf2_engine import PyPDF2Detector

configuration = DetectConfiguration(
    PdfRoot=Path("/path/to/pdfs"),
    OutputDirectory=Path("./out"),
    Engine="pypdf2",
    PseudoSignatures=True,
    RecurseXObjects=True,
    Profile="retainer",   # or "hipaa"
)

detector = PyPDF2Detector(configuration)
result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
print(result.to_dict())

Detect(Path) returns a FileResult dataclass; call .to_dict() for the JSON-friendly representation (see Result schema).


Library API (embed in another script)

A minimal, plug-and-play API: import from sigdetect.api and get plain dicts out (JSON-ready), with no I/O side effects by default:

from sigdetect.api import DetectPdf, DetectMany, ScanDirectory, ToCsvRow, Version

print("sigdetect", Version())

# 1) Single file → dict
result = DetectPdf(
    "/path/to/file.pdf",
    profileName="retainer",
    includePseudoSignatures=True,
    recurseXObjects=True,
)
print(
    result["file"],
    result["pages"],
    result["esign_found"],
    result["sig_count"],
    result["sig_pages"],
    result["roles"],
    result["hints"],
)


# 2) Directory walk (generator of dicts)
for res in ScanDirectory(
    "/path/to/pdfs",
    profileName="hipaa",
    includePseudoSignatures=True,
    recurseXObjects=True,
):
    # store in DB, print, etc.
    pass
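
One way to "store" the dicts is to write the top-level summary fields to CSV. The sketch below uses only the standard library and works on any iterable of result dicts; `write_summary_csv` and `SUMMARY_COLUMNS` are hypothetical names, not part of sigdetect (the package's own ToCsvRow helper is not used here because its signature is not shown above).

```python
import csv
import io
from typing import Any, Dict, Iterable, TextIO

# Top-level keys taken from the fields printed in the single-file example.
SUMMARY_COLUMNS = ["file", "pages", "esign_found", "sig_count", "sig_pages", "roles", "hints"]


def write_summary_csv(results: Iterable[Dict[str, Any]], stream: TextIO) -> int:
    """Write one CSV row per result dict and return the number of rows written.

    Keys outside SUMMARY_COLUMNS (such as the nested "signatures" list)
    are ignored; missing keys are written as empty cells.
    """
    writer = csv.DictWriter(stream, fieldnames=SUMMARY_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for result in results:
        writer.writerow(result)
        count += 1
    return count


# Hypothetical result dict standing in for the output of ScanDirectory(...).
sample = {
    "file": "example.pdf",
    "pages": 3,
    "esign_found": True,
    "sig_count": 2,
    "sig_pages": "1,3",
    "roles": "patient;representative",
    "hints": "AcroSig:sig_patient",
    "signatures": [],
}
buffer = io.StringIO()
write_summary_csv([sample], buffer)
print(buffer.getvalue())
```

In practice the generator returned by ScanDirectory could be passed in place of the sample list, with an open file as the stream.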

Result schema

High-level summary (per file):

{
  "file": "example.pdf",
  "size_kb": 123.4,
  "pages": 3,
  "esign_found": true,
  "scanned_pdf": false,
  "mixed": false,
  "sig_count": 2,
  "sig_pages": "1,3",
  "roles": "patient;representative",
  "hints": "AcroSig:sig_patient;VendorText:DocuSign\\s+Envelope\\s+ID",
  "signatures": [
    {
      "page": 1,
      "field_name": "sig_patient",
      "role": "patient",
      "score": 5,
      "scores": { "field": 3, "page_label": 2 },
      "evidence": ["field:patient", "page_label:patient"],
      "hint": "AcroSig:sig_patient"
    },
    {
      "page": null,
      "field_name": "vendor_or_acro_detected",
      "role": "representative",
      "score": 6,
      "scores": { "page_label": 4, "general": 2 },
      "evidence": ["page_label:representative(parent/guardian)", "pseudo:true"],
      "hint": "VendorOrAcroOnly"
    }
  ]
}

Field notes

  • esign_found is true if any signature widget, AcroForm /Sig field, or vendor marker is detected.
  • scanned_pdf is a heuristic: pages with images only and no extractable text.
  • mixed means both esign_found and scanned_pdf are true.
  • roles summarizes unique non-unknown roles across signatures.
  • In the retainer profile, the emitter prefers two signatures (client + firm), often on the same page.
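
Because sig_pages and roles are delimited strings in the summary, downstream code usually splits them before use. The following is a minimal sketch under that assumption; `split_field` and `normalize_summary` are hypothetical helpers, and the recomputed mixed flag simply restates the rule from the field notes above.

```python
from typing import Any, Dict, List


def split_field(value: str, separator: str) -> List[str]:
    """Split a delimited summary string, dropping empty parts."""
    return [part.strip() for part in value.split(separator) if part.strip()]


def normalize_summary(result: Dict[str, Any]) -> Dict[str, Any]:
    """Turn the delimited string fields of a result dict into lists."""
    pages = [int(p) for p in split_field(result.get("sig_pages") or "", ",")]
    roles = split_field(result.get("roles") or "", ";")
    esign = bool(result.get("esign_found"))
    scanned = bool(result.get("scanned_pdf"))
    return {
        "sig_pages": pages,          # "1,3" becomes [1, 3]
        "roles": roles,              # "patient;representative" becomes a list
        "mixed": esign and scanned,  # both flags true, per the field notes
    }


summary = {
    "esign_found": True,
    "scanned_pdf": False,
    "sig_pages": "1,3",
    "roles": "patient;representative",
}
print(normalize_summary(summary))
```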

Configuration & rules

Built-in rules live under src/sigdetect/data/:

  • vendor_patterns.yml – vendor byte/text patterns (e.g., DocuSign, Acrobat Sign).
  • role_rules.yml – signer-role logic:
    • labels – strong page labels (e.g., “Signature of Patient”, including Parent/Guardian cases)
    • general – weaker role hints in surrounding text
    • field_hints – field-name keywords (e.g., sig_patient)
    • doc_hard – strong document-level triggers (relationship to patient, “minor/unable to sign”, first-person consent)
    • weights – scoring weights for the above
  • role_rules.retainer.yml – retainer-specific rules (labels for client/firm, general tokens, and field hints).
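
To make the role of weights concrete, here is a small sketch of weighted scoring: each piece of evidence adds its source's weight to a role, and the role with the highest total wins. The weight values, the `score_roles` and `best_role` names, and the choice to accumulate repeated evidence are all assumptions for illustration; the real values and logic live in role_rules.yml and the detector.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical weights; the shipped values are defined under `weights` in role_rules.yml.
WEIGHTS = {"field": 3, "page_label": 2, "general": 1, "doc_hard": 4}

Evidence = Tuple[str, str]  # (source, role), e.g. ("field", "patient")


def score_roles(evidence: Iterable[Evidence]) -> Dict[str, Dict[str, int]]:
    """Accumulate per-source scores for each role."""
    scores: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for source, role in evidence:
        scores[role][source] += WEIGHTS.get(source, 0)
    return {role: dict(by_source) for role, by_source in scores.items()}


def best_role(evidence: Iterable[Evidence]) -> Tuple[str, int]:
    """Return (role, total score), or ("unknown", 0) when there is no evidence."""
    totals = {role: sum(by.values()) for role, by in score_roles(evidence).items()}
    if not totals:
        return ("unknown", 0)
    role = max(totals, key=lambda r: totals[r])
    return (role, totals[role])


found: List[Evidence] = [
    ("field", "patient"),
    ("page_label", "patient"),
    ("general", "representative"),
]
print(score_roles(found))
print(best_role(found))
```

With these assumed weights, a field hint plus a page label for the same role gives the same breakdown shape as the "scores" object in the result schema.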

You can keep one config YAML per dataset, e.g.:

# ./sample_data/config.yml (example)
pdf_root: ./pdfs
out_dir: ./sigdetect_out
engine: pypdf2
pseudo_signatures: true
recurse_xobjects: true
profile: retainer    # or: hipaa

YAML files can be customized or loaded at runtime (see the CLI --config option, if available, or import the patterns and pass them into the engine).

Key detection behaviors

  • Widget-first in mixed docs: if a real /Widget exists, no pseudo “VendorOrAcroOnly” signature is emitted.
  • Acro-only dedupe: multiple /Sig fields at the document level collapse to a single pseudo signature.
  • Parent/Guardian label: “Signature of Parent/Guardian” maps to the representative role.
  • Field-name fallbacks: role hints are pulled from /T, /TU, or /TM (in that order).
  • Retainer heuristics:
    • Looks for client and firm labels/tokens; boosts pages with law-firm markers (LLP/LLC/PA/PC) and “By:” blocks.
    • Applies an anti-front-matter rule to reduce page-1 false positives (e.g., letterheads, firm mastheads).
    • When only vendor/Acro clues exist (no widgets), it will emit two pseudo signatures targeting likely pages.
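
The first, second, and fourth behaviors above can be sketched as two small functions. This is an illustration under stated assumptions, not the detector's code: `field_name_hint` and `emit_signatures` are hypothetical names, fields are modeled as plain mappings, and the retainer case (two pseudo signatures) is deliberately left out.

```python
from typing import Any, Dict, List, Mapping, Optional


def field_name_hint(field: Mapping[str, Any]) -> Optional[str]:
    """Return the first non-empty name among /T, /TU, /TM (in that order)."""
    for key in ("/T", "/TU", "/TM"):
        value = field.get(key)
        if value:
            return str(value)
    return None


def emit_signatures(
    widgets: List[Dict[str, Any]],
    acro_sig_fields: int,
    vendor_hits: int,
    pseudo_enabled: bool = True,
) -> List[Dict[str, Any]]:
    """Apply widget-first and Acro-only dedupe.

    - If any real widget exists, return the widgets and no pseudo signature.
    - Otherwise, any number of document-level /Sig fields or vendor markers
      collapses to a single pseudo signature (when pseudo signatures are enabled).
    """
    if widgets:
        return list(widgets)
    if pseudo_enabled and (acro_sig_fields > 0 or vendor_hits > 0):
        return [
            {
                "page": None,
                "field_name": "vendor_or_acro_detected",
                "hint": "VendorOrAcroOnly",
            }
        ]
    return []


print(field_name_hint({"/TU": "Signature of Patient", "/TM": "sig_patient"}))
print(emit_signatures([], acro_sig_fields=3, vendor_hits=0))
```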

Smoke tests

Drop-in smoke tests live under tests/ and cover:

  • Vendor-only (multiple markers)
  • Acro-only (single pseudo with multiple /Sig)
  • Mixed (real widget + vendor markers → widget role, no pseudo)
  • Field-name fallbacks (/TU, /TM)
  • Parent/Guardian label → representative
  • Encrypted PDFs (graceful handling)

Run a subset:

pytest -q -k smoke
# or specific files:
pytest -q tests/test_mixed_widget_vendor_smoke.py

Debugging

If you need to debug or inspect the detection logic, you can run the CLI with --debug. To inspect a single file from Python, run the detector directly and print the result:

from pathlib import Path
from sigdetect.config import DetectConfiguration
from sigdetect.detector.pypdf2_engine import PyPDF2Detector

pdf = Path("/path/to/one.pdf")
configuration = DetectConfiguration(
    PdfRoot=pdf.parent,
    OutputDirectory=Path("."),
    Engine="pypdf2",
    Profile="retainer",
    PseudoSignatures=True,
    RecurseXObjects=True,
)
print(PyPDF2Detector(configuration).Detect(pdf).to_dict())

Dev workflow

Project layout

src/
  sigdetect/
    detector/
      base.py
      pypdf2_engine.py
    data/
      role_rules.yml
      vendor_patterns.yml
    cli.py
tests/
pyproject.toml
.pre-commit-config.yaml

Formatting & linting (pre-commit)

# one-time
pip install pre-commit
pre-commit install

# run on all files
pre-commit run --all-files

Hooks: black, isort, ruff, plus pytest (optional).
Ensure your virtualenv folders are excluded in .pre-commit-config.yaml (e.g., ^\.venv).

Typical loop

# run tests
pytest -q

# run only smoke tests while iterating
pytest -q -k smoke

Troubleshooting

Using the wrong Python

which python
python -V

If you see 3.8 or system Python, recreate the venv with 3.11.

ModuleNotFoundError: typer / click / pytest

pip install typer click pytest

Pre-commit reformats files in .venv

exclude: '^(\.venv|\.venv311|dist|build)/'
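
pre-commit treats exclude as a Python regular expression matched with re.search against each repository-relative path, so the pattern can be checked in isolation before committing it. The sketch below does that with the pattern written as a single-line string; `is_excluded` is a hypothetical helper and the paths are made-up examples.

```python
import re

# Same pattern as the exclude setting, as a single-line regular expression.
EXCLUDE = re.compile(r"^(\.venv|\.venv311|dist|build)/")


def is_excluded(path: str) -> bool:
    """Mirror pre-commit's matching: re.search on the repo-relative path."""
    return EXCLUDE.search(path) is not None


for candidate in (
    ".venv/lib/python3.11/site-packages/typer/main.py",
    ".venv311/bin/activate",
    "build/lib/sigdetect/cli.py",
    "src/sigdetect/cli.py",
    "docs/build/index.html",
):
    print(candidate, is_excluded(candidate))
```

Because the pattern is anchored with ^, only top-level .venv, .venv311, dist, and build directories are skipped; a nested directory such as docs/build is still checked.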

Vendor markers not detected
Set --recurse-xobjects true and enable pseudo signatures. Many providers embed markers in Form XObjects or compressed streams.

Parent/Guardian not recognized
The rules already include a fallback for “Signature of Parent/Guardian”; if your variant differs, add it to role_rules.yml → labels.representative.


License

MIT

