Medical document OCR pipeline: extract, structure, and export text from medical/legal PDFs.

medical-ocr

Multi-engine OCR pipeline for medical and legal documents. Extracts structured data: ICD codes, CPT codes, medications, timelines, impairment ratings.

Quick start

# System dependencies (required)
brew install tesseract poppler          # macOS
apt-get install tesseract-ocr poppler-utils   # Ubuntu

# Clone and install (base, no heavy GPU deps)
git clone https://github.com/nometria/medical-ocr
cd medical-ocr
pip install -e .

# Set API key (for LLM refinement pass)
export OPENAI_API_KEY=sk-proj-...

# Process a medical document
medical-ocr report.pdf --all --format json

# Run as REST API
medical-ocr --api

# Run tests (no GPU needed; cv2-dependent tests fail unless the [gpu] extra is installed)
pytest tests/ -v

Optional heavy dependencies:

pip install -e ".[gpu]"   # EasyOCR + OpenCV (GPU-accelerated secondary engine)
pip install -e ".[gcp]"   # Google Cloud Vision (fallback engine)

What it does

1. OCR: three-engine pipeline, Tesseract (primary) → EasyOCR (secondary) → Google Cloud Vision (fallback)
2. Classify: identifies 8 medical document types (treatment records, prescriptions, imaging, IME reports, etc.)
3. Extract: pulls structured data per document:
   - ICD-10 diagnosis codes
   - CPT billing codes
   - Medications (name + dosage + frequency)
   - Body parts affected
   - Work restrictions
   - MMI (Maximum Medical Improvement) status
   - Impairment ratings
4. Timeline: builds a chronological treatment timeline across all records
5. Summary: generates an attorney-ready structured summary (demand letter format)
6. Export: DOCX or Markdown output
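
To make the Extract step more concrete, here is a minimal sketch of pulling ICD-10-shaped and CPT-shaped tokens out of OCR text with regular expressions. The patterns and the `extract_codes` helper are assumptions for illustration, not the package's own extractor, which may use vocabulary lookups or other context.

```python
import re

# Assumed pattern: an ICD-10-CM code is a letter, a digit, one alphanumeric,
# then an optional dot followed by up to four alphanumerics.
ICD10_RE = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")

# Assumed pattern: CPT Category I codes are five digits. This also matches
# ZIP codes and other five-digit numbers, so real use needs more context.
CPT_RE = re.compile(r"\b\d{5}\b")


def extract_codes(text: str) -> dict[str, list[str]]:
    """Return code-shaped tokens, de-duplicated, in order of first appearance."""
    def unique(matches: list[str]) -> list[str]:
        return list(dict.fromkeys(matches))

    return {
        "icd10": unique(ICD10_RE.findall(text)),
        "cpt": unique(CPT_RE.findall(text)),
    }


sample = "Dx: M54.5 low back pain, S13.4XXA. Billed CPT 99213 and 72148."
print(extract_codes(sample))
# {'icd10': ['M54.5', 'S13.4XXA'], 'cpt': ['99213', '72148']}
```

Pattern matching like this only finds tokens that look like codes; validating them against the official code sets is a separate step.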

Architecture

POST /extract_file                  →  upload single file (PDF or image)
                                    →  OCR (Tesseract → EasyOCR → GCV fallback)
                                    →  return per-page text + quality metrics

POST /extract_from_doc              →  upload file(s) with structured extraction
                                    →  OCR + classify document type
                                    →  extract structured fields
                                    →  generate timeline + summary
                                    →  return JSON

POST /cases/{case_id}/documents     →  batch upload multiple files for a case
                                    →  process each through the OCR pipeline
                                    →  return array of results with confidence scores
                                    →  track by case_id

GET  /cases/{case_id}/documents     →  retrieve all processed documents for a case

POST /ai/invoke                     →  LLM invocation for PI assessment
POST /ai/generate-image             →  image generation via DALL-E
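
The Tesseract → EasyOCR → GCV fallback in the flow above could be organised roughly as follows. This is a sketch under stated assumptions: the engine interface (a callable returning text and a 0-1 confidence), the `run_with_fallback` name, and the 0.8 threshold are all hypothetical and are not taken from the package source.

```python
from typing import Callable, Optional

# Hypothetical interface: an engine takes image bytes and returns
# (recognised text, confidence between 0 and 1).
Engine = Callable[[bytes], tuple[str, float]]


def run_with_fallback(
    image: bytes,
    engines: list[tuple[str, Engine]],
    min_confidence: float = 0.8,  # assumed threshold, not from the project
) -> dict:
    """Try engines in order and stop at the first result above the threshold.

    If no engine clears the threshold, return the best result seen.
    """
    best: Optional[dict] = None
    for name, engine in engines:
        try:
            text, confidence = engine(image)
        except Exception:
            # An engine that is missing or errors out is skipped.
            continue
        result = {"text": text, "confidence": confidence, "engine_used": name}
        if confidence >= min_confidence:
            return result
        if best is None or confidence > best["confidence"]:
            best = result
    if best is None:
        raise RuntimeError("all OCR engines failed")
    return best


# Stub engines standing in for the real ones.
def weak_engine(_image: bytes) -> tuple[str, float]:
    return ("blurry text", 0.55)


def missing_engine(_image: bytes) -> tuple[str, float]:
    raise ImportError("engine not installed")


def strong_engine(_image: bytes) -> tuple[str, float]:
    return ("clear text", 0.93)


chain = [("tesseract", weak_engine), ("easyocr", missing_engine), ("gcv", strong_engine)]
print(run_with_fallback(b"", chain)["engine_used"])  # gcv
```

Ordering the chain from cheapest to most expensive means the paid cloud engine is only called when the local engines fall short.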

Setup

# System dependencies
brew install tesseract poppler          # macOS
apt-get install tesseract-ocr poppler-utils   # Ubuntu

# Python
pip install -e .
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

# Run API
uvicorn medical_ocr.main:app --port 8000

Docker

docker build -t medical-ocr .
docker run -p 8000:8000 --env-file .env medical-ocr

Batch endpoint

Upload multiple documents for a case in a single request:

curl -X POST http://localhost:8000/cases/case-001/documents \
  -F "files=@report1.pdf" \
  -F "files=@report2.pdf" \
  -F "files=@scan.jpg"

Response includes per-document results with confidence scores:

{
  "case_id": "case-001",
  "documents_processed": 3,
  "results": [
    {
      "document_id": "...",
      "filename": "report1.pdf",
      "total_pages": 4,
      "text": "...",
      "confidence": {
        "overall": 0.87,
        "per_page": [
          {"page": 1, "confidence": 0.91, "quality_score": 0.85, "engine_used": "advanced_fusion"},
          {"page": 2, "confidence": 0.83, "quality_score": 0.80, "engine_used": "tesseract_v1"}
        ]
      }
    }
  ]
}
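
A client might use the per-page scores to route weak pages to manual review. The sketch below reads a response shaped like the one above; the `pages_needing_review` helper and the 0.85 threshold are illustrative assumptions, not part of the API.

```python
import json


def pages_needing_review(response: dict, threshold: float = 0.85) -> list[tuple[str, int, float]]:
    """List (filename, page, confidence) for pages below the threshold."""
    flagged = []
    for doc in response.get("results", []):
        per_page = doc.get("confidence", {}).get("per_page", [])
        for page in per_page:
            if page["confidence"] < threshold:
                flagged.append((doc["filename"], page["page"], page["confidence"]))
    return flagged


# Trimmed copy of the example response shown above.
response = json.loads("""
{
  "case_id": "case-001",
  "documents_processed": 3,
  "results": [
    {
      "filename": "report1.pdf",
      "confidence": {
        "overall": 0.87,
        "per_page": [
          {"page": 1, "confidence": 0.91, "quality_score": 0.85, "engine_used": "advanced_fusion"},
          {"page": 2, "confidence": 0.83, "quality_score": 0.80, "engine_used": "tesseract_v1"}
        ]
      }
    }
  ]
}
""")

print(pages_needing_review(response))
# [('report1.pdf', 2, 0.83)]
```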

Retrieve all processed documents for a case:

curl http://localhost:8000/cases/case-001/documents

Confidence scoring

Every OCR result now includes confidence scores at two levels:

- Per-page confidence: from the OCR engine (Tesseract word-level confidence averaged, or quality heuristics when engine data is unavailable)
- Overall document confidence: average of all page confidences

Confidence is available in:

- The batch endpoint response (`confidence.overall`, `confidence.per_page`)
- The OCR metadata (`_metadata.page_metrics.{page}.confidence`)
- The pipeline summary via `generate_summary_with_confidence()`
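
The two levels can be sketched as simple averages. The function names here are hypothetical, and returning 0.0 for a page with no recognised words is an assumption standing in for the quality heuristics mentioned above, which are not described in this README.

```python
from statistics import mean


def page_confidence(word_confidences: list[float]) -> float:
    """Average Tesseract-style word confidences (0-100) into a 0-1 score.

    Negative values are skipped: Tesseract reports -1 for layout boxes that
    contain no recognised word.
    """
    words = [c for c in word_confidences if c >= 0]
    if not words:
        return 0.0  # assumption: the real pipeline falls back to heuristics here
    return mean(words) / 100.0


def document_confidence(page_confidences: list[float]) -> float:
    """Overall confidence as the plain average of the page confidences."""
    if not page_confidences:
        return 0.0
    return mean(page_confidences)


print(page_confidence([90, 80, -1]))                  # 0.85
print(round(document_confidence([0.91, 0.83]), 2))    # 0.87
```

A plain average weights a one-line page the same as a dense one; weighting by word count is an alternative if that matters for a given workload.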

Supported file formats

| Format | Extensions | Notes |
|--------|------------|-------|
| PDF | `.pdf` | Multi-page supported via pdf2image/poppler |
| PNG | `.png` | Single image |
| JPEG | `.jpg`, `.jpeg` | Single image |
| TIFF | `.tiff`, `.tif` | Multi-frame TIFFs expanded into separate pages |
| BMP | `.bmp` | Single image |
| WebP | `.webp` | Single image |

PDF files are rasterised at 300 DPI by default. Image files are loaded directly via Pillow.
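
As a rough illustration of how an upload could be routed by extension, the sketch below encodes the table above in a lookup. The `describe_input` helper and its return keys are hypothetical; only the extensions, the multi-page behaviour, and the 300 DPI default come from this README.

```python
from pathlib import Path

# Extensions from the supported-formats table.
SUPPORTED = {
    ".pdf": "pdf",
    ".png": "png",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".tiff": "tiff",
    ".tif": "tiff",
    ".bmp": "bmp",
    ".webp": "webp",
}

# Formats that can yield more than one page.
MULTI_PAGE = {"pdf", "tiff"}


def describe_input(path: str) -> dict:
    """Map a file name to its format and the handling it needs."""
    suffix = Path(path).suffix.lower()
    fmt = SUPPORTED.get(suffix)
    if fmt is None:
        raise ValueError(f"unsupported file type: {path!r}")
    return {
        "format": fmt,
        "multi_page": fmt in MULTI_PAGE,
        # Only PDFs are rasterised; images are loaded as-is.
        "rasterise_dpi": 300 if fmt == "pdf" else None,
    }


print(describe_input("report.pdf"))
# {'format': 'pdf', 'multi_page': True, 'rasterise_dpi': 300}
print(describe_input("Scan.TIF"))
# {'format': 'tiff', 'multi_page': True, 'rasterise_dpi': None}
```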

Immediate next steps (to productionise as B2B SaaS)

- ~~Add POST /cases/{case_id}/documents batch endpoint~~ Done
- ~~Add per-document confidence scores to the API response~~ Done
- ~~Add PDF ingestion (currently image/scan input only)~~ Done (PDF + image)
- Add HIPAA-compliant storage (S3 + KMS encryption)
- Build a simple React UI for law firm case managers
- Cold-email 10 personal injury law firms with a free trial offer

Pricing model

- Per document: $0.50–2.00 per processed document
- Monthly flat: $200–500/mo for up to 500 docs
- Enterprise: custom pricing for high volume

Target market

- Personal injury law firms (workers' comp, auto accidents)
- Medical malpractice attorneys
- Independent Medical Examiners (IME companies)
- Medical billing companies

Competitive advantage

- Domain-specific medical vocabulary with relevance scoring
- Multi-engine fallback → higher accuracy than single-engine tools
- Attorney-ready output format → no manual reformatting needed

Example output

Running pytest tests/ -v:

============================= test session starts ==============================
platform darwin -- Python 3.13.9, pytest-9.0.2, pluggy-1.5.0
cachedir: .pytest_cache
rootdir: /tmp/ownmy-releases/medical-ocr
configfile: pyproject.toml
plugins: anyio-4.12.1, cov-7.1.0
collecting ... collected 5 items

tests/test_filters.py::test_filters_module_imports PASSED                [ 20%]
tests/test_filters.py::test_utils_module_imports FAILED                  [ 40%]
tests/test_filters.py::test_models_module_imports PASSED                 [ 60%]
tests/test_filters.py::test_ocr_config_has_required_keys FAILED          [ 80%]
tests/test_filters.py::test_medical_vocabulary_not_empty PASSED          [100%]

FAILED tests/test_filters.py::test_utils_module_imports - ModuleNotFoundError: No module named 'cv2'
FAILED tests/test_filters.py::test_ocr_config_has_required_keys
========================= 2 failed, 3 passed in 0.70s ==========================

Note: cv2 failures are expected without `pip install -e ".[gpu]"`

See examples/sample-output.json for the full structured JSON output from a real IME report.

Release history

- 0.2.0 (this version): Mar 27, 2026
- 0.1.2: Mar 26, 2026
- 0.1.1: Mar 23, 2026
- 0.1.0: Mar 23, 2026

File details

Details for the file medical_ocr-0.2.0.tar.gz.

File metadata

Download URL: medical_ocr-0.2.0.tar.gz
Upload date: Mar 27, 2026
Size: 54.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for medical_ocr-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`e2e446b304b1238740d91a59dcb03487a2a33c9e85a3805a1c8727d2c0d78ffd`
MD5	`741d68d1a96c24fe8fdf5565d55f8864`
BLAKE2b-256	`ce4911bff5f05520591ecf1487897cdb6b4860b55a0c512183c949df1dd7ee27`

File details

Details for the file medical_ocr-0.2.0-py3-none-any.whl.

File metadata

Download URL: medical_ocr-0.2.0-py3-none-any.whl
Upload date: Mar 27, 2026
Size: 58.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for medical_ocr-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a98a4d4182d267226ce876f0d652df7fa05267ee01b3d15b2a187cc1bc372657`
MD5	`bd70fccdcc226bf4d17a4571b54449ee`
BLAKE2b-256	`20d24fc08365219af8962c19f051d5010440eab8988be42980bbd01e502b3e14`
