The Digital Registrar — a schema-first framework for multi-cancer, privacy-preserving pathology abstraction via local LLMs.

Digital Registrar

License: MIT · Python 3.11+

TL;DR: pip install digital-registrar-gui && registrar-infer-gui → paste a pathology report → get structured JSON. Requires either a local Ollama server with one of gpt-oss:20b / qwen3:30b / gemma3:27b pulled, or an OPENAI_API_KEY. See Quickstart.

Digital Registrar transforms free-text surgical pathology reports into machine-readable registry records using a College of American Pathologists (CAP)-aligned clinical ontology, encoded as strictly-typed DSPy signatures. The system covers 10 major cancer types across 192 per-organ registry field cells (60 unique field names) — including complex variable-length structures like lymph-node groups and surgical margins — and is model-agnostic: any local LLM can serve as the inference engine. Designed for on-premise deployment on a single 48 GB GPU, it keeps sensitive clinical text inside the institution.
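To make "strictly typed, with variable-length structures" concrete, here is a minimal sketch that uses only the standard library. The class names, field names, and allowed values (LymphNodeGroup, Margin, CaseRecord, MarginStatus) are hypothetical and chosen for illustration; the project's actual schema is encoded as DSPy signatures and is not reproduced here.

```python
from dataclasses import asdict, dataclass, field
from typing import Literal, Optional, get_args
import json

# Hypothetical controlled vocabulary, for illustration only.
MarginStatus = Literal["negative", "positive", "cannot_be_assessed"]


@dataclass
class LymphNodeGroup:
    site: str
    examined: int
    involved: int

    def __post_init__(self) -> None:
        # A typed schema can reject impossible counts at construction time.
        if not 0 <= self.involved <= self.examined:
            raise ValueError("involved must be between 0 and examined")


@dataclass
class Margin:
    name: str
    status: MarginStatus
    distance_mm: Optional[float] = None

    def __post_init__(self) -> None:
        # Enforce the closed vocabulary at runtime as well.
        if self.status not in get_args(MarginStatus):
            raise ValueError(f"unknown margin status: {self.status!r}")


@dataclass
class CaseRecord:
    organ: str
    histologic_type: Optional[str] = None
    # Variable-length structures: any number of node groups and margins.
    lymph_node_groups: list[LymphNodeGroup] = field(default_factory=list)
    margins: list[Margin] = field(default_factory=list)


record = CaseRecord(
    organ="colon",
    histologic_type="adenocarcinoma",
    lymph_node_groups=[LymphNodeGroup("pericolic", examined=14, involved=2)],
    margins=[Margin("proximal", "negative", 35.0), Margin("radial", "positive")],
)
print(json.dumps(asdict(record), indent=2))
```

The point of the sketch is that lists of nested, validated objects serialize to the same kind of structured JSON the toolkit emits, while malformed values fail early instead of reaching the registry.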

Highlights

  • Schema-first architecture — the clinical ontology is the durable contribution; LLMs are interchangeable engines.
  • CAP-aligned, registry-grade — 10 cancer types, 192 per-organ field cells (60 unique), validated against gold-standard human annotations.
  • Privacy-preserving by design — local LLMs only, single 48 GB GPU, no cloud round-trip required.
  • Validated generalizability: 92.0 % macro-mean exact-match on 893 internal reports (10 organs); 77.5 % on the external TCGA cohort of 242 reports, rising to 88.0 % after excluding structurally-silent fields (paper).
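As a rough illustration of how a macro-mean exact-match figure can be computed, the sketch below averages per-organ exact-match rates without weighting by organ size. This is an assumption about the metric's definition; the paper is the authoritative source. The toy rows, the field names, and the way "structurally-silent" fields are modeled (a named set to skip) are all made up for the example.

```python
from statistics import mean


def exact_match_rate(pairs):
    """Fraction of (predicted, gold) pairs that are exactly equal."""
    pairs = list(pairs)
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def macro_mean_exact_match(per_organ_rows, skip_fields=frozenset()):
    """Unweighted mean of per-organ exact-match rates.

    per_organ_rows maps organ -> list of (field, predicted, gold).
    Fields named in skip_fields are dropped before scoring.
    """
    rates = {}
    for organ, rows in per_organ_rows.items():
        kept = [(pred, gold) for name, pred, gold in rows if name not in skip_fields]
        if kept:  # an organ with no scorable fields contributes nothing
            rates[organ] = exact_match_rate(kept)
    return mean(list(rates.values())), rates


# Toy data, not taken from the paper.
toy = {
    "breast": [("grade", "2", "2"), ("laterality", "left", "left"),
               ("lvi", None, "present"), ("t_category", "pT2", "pT2")],
    "colon": [("grade", "1", "2"), ("lvi", None, "absent")],
}
overall, by_organ = macro_mean_exact_match(toy)
filtered, _ = macro_mean_exact_match(toy, skip_fields={"lvi"})
print(by_organ, overall, filtered)
```

On the toy data the per-organ rates are 0.75 and 0.0, so the macro mean is 0.375; skipping the "lvi" field raises it to 0.5. Each organ counts equally regardless of how many fields or reports it contributes.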

Quickstart (end users)

Prerequisites — pick one LLM backend

The toolkit is bring-your-own-LLM (BYO-LLM). You need either:

  • Local Ollama with one of the three paper-benchmarked models pulled:

    ollama pull gpt-oss:20b      # default — best accuracy on internal validation
    ollama pull qwen3:30b        # alt MoE (Qwen3-30B-A3B)
    ollama pull gemma3:27b       # alt dense
    

    Sized for a single ~48 GB GPU. Smaller VRAM works with quantised tags but is unbenchmarked.

  • OpenAI: set OPENAI_API_KEY in your env (or in ~/.config/digital-registrar/.env), then pick gpt5_4_mini in the GUI's model dropdown.
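Before launching the GUI, it can help to confirm which backend is actually available. The sketch below queries Ollama's model-listing endpoint (GET /api/tags on the default port 11434) and falls back to checking OPENAI_API_KEY. The function names and the preference order are assumptions for illustration, not part of the toolkit, and the sketch does not read ~/.config/digital-registrar/.env.

```python
import json
import os
import urllib.request

# The three model tags named in this README.
BENCHMARKED = ("gpt-oss:20b", "qwen3:30b", "gemma3:27b")


def ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return model tags reported by a local Ollama server, or [] if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return []
    return [m.get("name", "") for m in payload.get("models", [])]


def choose_backend(local_models, env):
    """Prefer a benchmarked local model; fall back to OpenAI if a key is set."""
    for tag in BENCHMARKED:
        if tag in local_models:
            return ("ollama", tag)
    if env.get("OPENAI_API_KEY"):
        return ("openai", "gpt5_4_mini")
    return (None, None)


if __name__ == "__main__":
    print(choose_backend(ollama_models(), os.environ))
```

A result of (None, None) means neither prerequisite is met: pull one of the listed models or export an API key first.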

Run the GUI in 30 seconds

# Path A — local Ollama (default)
pip install digital-registrar-gui
registrar-infer-gui                 # opens http://localhost:8502 with gpt-oss:20b

# Path B — OpenAI
export OPENAI_API_KEY=sk-...
pip install digital-registrar-gui
registrar-infer-gui                 # then change the model selector to gpt5_4_mini

Paste a report (or point at a folder of .txt files) and the structured JSON appears on the right. The expander shows the full DSPy LM trace (router + group extractors).
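The folder mode can be pictured as a simple loop over .txt files that writes one JSON file per report. The sketch below shows that pattern with a placeholder extractor; extract_folder and fake_extract are hypothetical names, and the real extraction call (whose signature is not shown in this README) would take the placeholder's place.

```python
import json
from pathlib import Path


def extract_folder(folder, extract, out_dir=None):
    """Run extract(text) -> dict on every .txt file and write one .json per report."""
    folder = Path(folder)
    out_dir = Path(out_dir) if out_dir else folder
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for report in sorted(folder.glob("*.txt")):
        result = extract(report.read_text(encoding="utf-8"))
        target = out_dir / f"{report.stem}.json"
        target.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(target)
    return written


def fake_extract(text):
    # Placeholder standing in for the real extraction step (an assumption).
    return {"n_characters": len(text), "organ": None}
```

Calling extract_folder("reports/", fake_extract) would leave a report_name.json beside each report_name.txt; files with other extensions are ignored.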

Other packages

The toolkit ships as four pip-installable packages. Pick the apps you need:

# Inference GUI — paste a report, see the structured extraction
pip install digital-registrar-gui
registrar-infer-gui                 # opens http://localhost:8502

# Annotation tool — review pipeline output against gold
pip install digital-registrar-annotator
registrar-annotate-workspace

# Schema editor — curate the CAP-aligned per-organ schema
pip install digital-registrar-schema-editor
registrar-schema-gui

# Core only (CLI + Python API) — for pipelines, scripts, and downstream tools
pip install digital-registrar
registrar-pipeline --input <folder>

Each app depends on digital-registrar (the core), so installing any of the apps brings the pipeline along automatically.
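To see which of the four packages ended up in the current environment (for example, to confirm that installing an app pulled in the core), a small check with the standard library's importlib.metadata is enough. The helper name installed_versions is an assumption for illustration; the distribution names are the ones listed above.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = (
    "digital-registrar",
    "digital-registrar-gui",
    "digital-registrar-annotator",
    "digital-registrar-schema-editor",
)


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name:35} {ver or 'not installed'}")
```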

Audience

Built for cancer registrars, pathology informatics teams, and clinical researchers who need registry-grade structured extraction from narrative pathology reports without sending PHI off-premise.

Repository layout

drr-next/
├── src/digital_registrar/      ← THE core (pipeline, schemas, signatures, eval, paths)
├── apps/
│   ├── infer-gui/              ← digital-registrar-gui (Streamlit inference)
│   ├── schema-editor/          ← digital-registrar-schema-editor
│   └── annotator/              ← digital-registrar-annotator
├── attic/                      ← research scaffolding (benchmarks, ablations, baselines, obfuscator)
├── packaging/                  ← release pipeline (PyInstaller, Docker, hosted demo)
├── workspace/                  ← gitignored runtime data (data, results, runs)
├── examples/                   ← small read-only fixtures
├── tests/                      ← core tests
└── docs/                       ← architecture, API, eval, release

Dev install (cloners)

git clone https://github.com/kblab2024/digitalregistrar.git digitalregistrar
cd digitalregistrar
make install-dev      # installs core + 3 apps + dev tooling
make test             # core + app test suites
make lint             # ruff

make install-dev installs the vendored tnmhelper wheel first, then runs pip install -e . (core), then pip install -e apps/<each> for the three downstream apps. Anyone with pip can clone and install in one command; uv is not required.

Public Python API

from digital_registrar import (
    run_pipeline, setup_pipeline,                  # extraction
    load_pydantic_model, load_json_schema,         # schemas
    list_organs, CASE_MODELS, build_case_model,
    build_extraction_signatures, ExtractionStep,   # signatures
    field_metrics, nested_field_metrics,           # eval
    pairwise_compare, completeness, score_case,
    WORKSPACE_ROOT, workspace_root, results_root,  # paths
)

See docs/api.md for the full reference.
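Because the call signatures live in docs/api.md rather than here, the sketch below avoids calling any of these functions and only checks that the documented names are importable from an installed copy. The helper names missing_names and check_install are hypothetical; the list of names is copied from the import statement above.

```python
import importlib

# Names listed in the import statement above.
DOCUMENTED = (
    "run_pipeline", "setup_pipeline",
    "load_pydantic_model", "load_json_schema",
    "list_organs", "CASE_MODELS", "build_case_model",
    "build_extraction_signatures", "ExtractionStep",
    "field_metrics", "nested_field_metrics",
    "pairwise_compare", "completeness", "score_case",
    "WORKSPACE_ROOT", "workspace_root", "results_root",
)


def missing_names(module, names=DOCUMENTED):
    """Return the documented names that the module does not expose."""
    return [name for name in names if not hasattr(module, name)]


def check_install(module_name="digital_registrar"):
    """Return missing names, or None if the package cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return missing_names(module)


if __name__ == "__main__":
    print(check_install())
```

An empty list means every documented name resolved; None means the core package is not installed in this environment.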

Documentation

| Topic | Where |
| --- | --- |
| Pipeline architecture (v1 legacy, v2 factory) | docs/architecture/pipeline.md |
| Three-layer schema architecture | docs/architecture/schemas.md |
| AJCC TNM staging via tnmhelper | docs/architecture/staging.md |
| DSPy deep dive | docs/architecture/dspy_deep_dive.md |
| Annotation workflow | docs/workflows/annotation.md |
| Eval (prediction vs annotation) | docs/eval/index.md |
| Public Python API | docs/api.md |
| Release pipeline (PyPI / hosted demo / bundles / Docker) | docs/release.md |
| Research scaffolding (benchmarks, ablations, obfuscator) | attic/README.md |

Releasing

The project supports three distribution paths for end users (see docs/release.md):

  • PyPI: pip install digital-registrar-gui, for Python users.
  • Hosted Streamlit demo — public URL for paper reviewers / casual visitors. Safety checklist in docs/release.md.
  • Native bundles + Docker: .dmg / .exe / Docker images for non-technical end users, built via make bundle and make docker-build.

Citation

If you use the Digital Registrar in your research, please cite:

Chow N-H, Chang H, Chen H-K, et al. Digital Registrar: A Schema-First Framework for Multi-Cancer Privacy-Preserving Pathology Abstraction via Local LLMs. Diagnostics. 2026;16(11):1644. doi: 10.3390/diagnostics16111644

Machine-readable metadata in CITATION.cff.

License

MIT — see LICENSE.
