Skip to main content

Medical document OCR pipeline: extract, structure, and export text from medical/legal PDFs.

Project description

medical-ocr

Multi-engine OCR pipeline for medical and legal documents. Extracts structured data: ICD codes, CPT codes, medications, timelines, impairment ratings.


Quick start

# System dependencies (required)
brew install tesseract poppler          # macOS
apt-get install tesseract-ocr poppler-utils   # Ubuntu

# Clone and install (base, no heavy GPU deps)
git clone https://github.com/ownmy-app/medical-ocr
cd medical-ocr
pip install -e .

# Set API key (for LLM refinement pass)
export OPENAI_API_KEY=sk-proj-...

# Process a medical document
medical-ocr report.pdf --all --format json

# Run as REST API
medical-ocr --api

# Run tests (no GPU/OCR dependencies needed)
pytest tests/ -v

Optional heavy dependencies:

pip install -e ".[gpu]"   # EasyOCR + OpenCV (GPU-accelerated secondary engine)
pip install -e ".[gcp]"   # Google Cloud Vision (fallback engine)

What it does

  1. OCR — Three-engine pipeline: Tesseract (primary) → EasyOCR (secondary) → Google Cloud Vision (fallback)
  2. Classify — Identifies 8 medical document types (treatment records, prescriptions, imaging, IME reports, etc.)
  3. Extract — Pulls structured data per document:
    • ICD-10 diagnosis codes
    • CPT billing codes
    • Medications (name + dosage + frequency)
    • Body parts affected
    • Work restrictions
    • MMI (Maximum Medical Improvement) status
    • Impairment ratings
  4. Timeline — Builds chronological treatment timeline across all records
  5. Summary — Generates attorney-ready structured summary (demand letter format)
  6. Export — DOCX or Markdown output

Architecture

POST /process-document   →  upload file (PDF/image)
                         →  OCR (Tesseract → EasyOCR → GCV fallback)
                         →  classify document type
                         →  extract structured fields
                         →  update timeline
                         →  return JSON

POST /generate-summary   →  takes all processed documents for a case
                         →  builds chronological timeline
                         →  generates attorney summary
                         →  returns DOCX + JSON

GET  /health

Setup

# System dependencies
brew install tesseract          # macOS
apt-get install tesseract-ocr   # Ubuntu

# Python
pip install -e .
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

# Run API
uvicorn medical_ocr.main:app --port 8000

Docker

docker build -t medical-ocr .
docker run -p 8000:8000 --env-file .env medical-ocr

Immediate next steps (to productionise as B2B SaaS)

  1. Add POST /cases/{case_id}/documents batch endpoint
  2. Add per-document confidence scores to the API response
  3. Add PDF ingestion (currently image/scan input only)
  4. Add HIPAA-compliant storage (S3 + KMS encryption)
  5. Build a simple React UI for law firm case managers
  6. Cold-email 10 personal injury law firms with a free trial offer

Pricing model

  • Per document: $0.50–2.00 per processed document
  • Monthly flat: $200–500/mo for up to 500 docs
  • Enterprise: custom pricing for high volume

Target market

  • Personal injury law firms (workers' comp, auto accidents)
  • Medical malpractice attorneys
  • Independent Medical Examiners (IME companies)
  • Medical billing companies

Competitive advantage

  • Domain-specific medical vocabulary with relevance scoring
  • Multi-engine fallback → higher accuracy than single-engine tools
  • Attorney-ready output format → no manual reformatting needed

Example output

Running pytest tests/ -v:

============================= test session starts ==============================
platform darwin -- Python 3.13.9, pytest-9.0.2, pluggy-1.5.0
cachedir: .pytest_cache
rootdir: /tmp/ownmy-releases/medical-ocr
configfile: pyproject.toml
plugins: anyio-4.12.1, cov-7.1.0
collecting ... collected 5 items

tests/test_filters.py::test_filters_module_imports PASSED                [ 20%]
tests/test_filters.py::test_utils_module_imports FAILED                  [ 40%]
tests/test_filters.py::test_models_module_imports PASSED                 [ 60%]
tests/test_filters.py::test_ocr_config_has_required_keys FAILED          [ 80%]
tests/test_filters.py::test_medical_vocabulary_not_empty PASSED          [100%]

FAILED tests/test_filters.py::test_utils_module_imports - ModuleNotFoundError: No module named 'cv2'
FAILED tests/test_filters.py::test_ocr_config_has_required_keys
========================= 2 failed, 3 passed in 0.70s ==========================

Note: cv2 failures are expected without `pip install -e ".[gpu]"`

See examples/sample-output.json for the full structured JSON output from a real IME report.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

medical_ocr-0.1.0.tar.gz (86.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

medical_ocr-0.1.0-py3-none-any.whl (50.8 kB view details)

Uploaded Python 3

File details

Details for the file medical_ocr-0.1.0.tar.gz.

File metadata

  • Download URL: medical_ocr-0.1.0.tar.gz
  • Upload date:
  • Size: 86.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for medical_ocr-0.1.0.tar.gz
Algorithm Hash digest
SHA256 54dc8cb86267c721376bfc2a1bd5eb3d7c80434d7446c1c6eeef6b9fbd2e9166
MD5 16ca0bd53183fac3429eebd97a5bfb99
BLAKE2b-256 cfe7a69e18babfbdb9be1ca4e3c3cf8d4dcc8e6af60c6afcaf7bd094c78a4e02

See more details on using hashes here.

File details

Details for the file medical_ocr-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: medical_ocr-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 50.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for medical_ocr-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9286131af0842493702a4772c4ffa2b401ae3a99da270d0a424d32af51944641
MD5 2b8b90178e3e406f1c19e8401e09d6cf
BLAKE2b-256 04e1d754ac9906b74dc6028d5a9d23c1f2f5bfed8aecc782cff2478b66557faa

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page