Extract and summarize medical exam reports (X-rays, MRIs, ultrasounds, etc.) with AI precision

parsemedicalexams


🏥 Extract and summarize medical exam reports from PDFs using Vision AI 📄

Features · Quick Start · Configuration · Output Format


Features

  • Vision-powered extraction: Uses Vision LLMs to read X-rays, MRIs, ultrasounds, endoscopies, and more directly from PDF scans
  • Self-consistency voting: Runs multiple extractions and votes on the best result to improve reliability
  • Intelligent classification: Automatically categorizes exams (imaging, ultrasound, endoscopy, other) and standardizes naming
  • Clinical summarization: Preserves all findings, impressions, and recommendations while filtering noise
  • Markdown output with YAML frontmatter: Clean, structured files ready for Obsidian, static sites, or further processing
  • Smart caching: Persistent JSON caches avoid redundant API calls and allow manual overrides
  • Multi-era document handling: Frequency-based date voting correctly handles documents spanning multiple time periods

Quick Start

1. Install

pip install -e .

Requires Poppler for PDF processing:

  • macOS: brew install poppler
  • Ubuntu: apt-get install poppler-utils

2. Configure

cp .env.example .env

Edit .env with your settings:

OPENROUTER_API_KEY=your_api_key_here
INPUT_PATH=/path/to/your/exam/pdfs
OUTPUT_PATH=/path/to/output

3. Run

python main.py

How It Works

┌─────────────┐    ┌─────────────────┐    ┌────────────────┐    ┌──────────────┐    ┌────────────┐
│  PDF Input  │───▶│  Preprocessing  │───▶│ Vision LLM ×N  │───▶│ Standardize  │───▶│  Markdown  │
│             │    │  (grayscale,    │    │  + voting      │    │  + classify  │    │   Output   │
│             │    │   resize)       │    │                │    │              │    │            │
└─────────────┘    └─────────────────┘    └────────────────┘    └──────────────┘    └────────────┘
  1. PDF → Images: Converts each page to grayscale, resizes, and enhances contrast
  2. Document classification: Determines if the document is a medical exam before processing
  3. Vision LLM transcription: Transcribes each page verbatim using function calling (runs N times for reliability)
  4. Self-consistency voting: If transcriptions differ, an LLM votes on the best result
  5. Standardization: Classifies exam type and standardizes the name via LLM with caching
  6. Summarization: Generates document-level clinical summaries preserving all findings
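
Steps 3 and 4 can be pictured with a small sketch. This is an illustration under stated assumptions, not the project's implementation: `pick_transcription`, `majority_judge`, and the `judge` callback are hypothetical names, and the judge stands in for the LLM call that chooses among differing transcriptions. The agreement score (share of runs matching the chosen text) is only one plausible way to derive a confidence value.

```python
from collections import Counter
from typing import Callable, Sequence


def pick_transcription(
    candidates: Sequence[str],
    judge: Callable[[Sequence[str]], str],
) -> tuple[str, float]:
    """Return (chosen_text, agreement) for N transcription runs of one page."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # Compare on whitespace-normalized text so trivial spacing differences
    # do not count as disagreement (an assumption of this sketch).
    normalized = [" ".join(c.split()) for c in candidates]
    counts = Counter(normalized)
    if len(counts) == 1:
        # All runs agree: no vote needed.
        return candidates[0], 1.0
    # Runs differ: delegate the choice to the judge (an LLM in the real tool).
    chosen = judge(candidates)
    agreement = counts[" ".join(chosen.split())] / len(candidates)
    return chosen, agreement


def majority_judge(candidates: Sequence[str]) -> str:
    """Hypothetical stand-in for the LLM judge: pick the most frequent text."""
    return Counter(candidates).most_common(1)[0][0]


text, score = pick_transcription(
    ["No fracture.", "No fracture.", "No fracture"], majority_judge
)
```

Skipping the judge when all runs agree keeps the common case cheap, since the extra call is only needed on disagreement.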

Configuration

Environment Variables

| Variable | Description | Default |
|---|---|---|
| OPENROUTER_API_KEY | Your OpenRouter API key | required |
| INPUT_PATH | Directory containing exam PDFs | required |
| OUTPUT_PATH | Where to write output files | required |
| EXTRACT_MODEL_ID | Vision model for extraction | google/gemini-2.5-flash |
| SUMMARIZE_MODEL_ID | Model for summarization | google/gemini-2.5-flash |
| SELF_CONSISTENCY_MODEL_ID | Model for voting | google/gemini-2.5-flash |
| N_EXTRACTIONS | Number of extraction runs for voting | 3 |
| MAX_WORKERS | Parallel workers for PDF processing | 1 |
| INPUT_FILE_REGEX | Regex pattern for input files | .*\.pdf |
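
As a rough sketch of how these variables and their defaults fit together, the snippet below reads them from a mapping such as `os.environ`. The `Settings` dataclass and `load_settings` function are hypothetical names introduced for illustration; they are not the project's `ExtractionConfig`.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_MODEL = "google/gemini-2.5-flash"
REQUIRED = ("OPENROUTER_API_KEY", "INPUT_PATH", "OUTPUT_PATH")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the documented variables."""
    api_key: str
    input_path: str
    output_path: str
    extract_model_id: str
    summarize_model_id: str
    self_consistency_model_id: str
    n_extractions: int
    max_workers: int
    input_file_regex: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from env, applying the defaults from the table above."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    return Settings(
        api_key=env["OPENROUTER_API_KEY"],
        input_path=env["INPUT_PATH"],
        output_path=env["OUTPUT_PATH"],
        extract_model_id=env.get("EXTRACT_MODEL_ID", DEFAULT_MODEL),
        summarize_model_id=env.get("SUMMARIZE_MODEL_ID", DEFAULT_MODEL),
        self_consistency_model_id=env.get("SELF_CONSISTENCY_MODEL_ID", DEFAULT_MODEL),
        n_extractions=int(env.get("N_EXTRACTIONS", "3")),
        max_workers=int(env.get("MAX_WORKERS", "1")),
        input_file_regex=env.get("INPUT_FILE_REGEX", r".*\.pdf"),
    )


example = load_settings(
    {"OPENROUTER_API_KEY": "key", "INPUT_PATH": "/in", "OUTPUT_PATH": "/out"}
)
```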

Using Profiles

Profiles let you save different input/output configurations for different use cases:

# Create a profile from template
cp profiles/_template.yaml profiles/myprofile.yaml

# Run with profile
python main.py --profile myprofile

# List available profiles
python main.py --list-profiles

Profile files (YAML or JSON) support path overrides and model configuration:

name: myprofile
input_path: /path/to/input
output_path: /path/to/output
input_file_regex: ".*\\.pdf"
model: google/gemini-2.5-flash  # Optional override
workers: 1                       # Optional override
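
One way to picture how a profile's optional overrides could combine with base settings is a simple dictionary overlay. This is a sketch under assumptions: `apply_profile` is a hypothetical helper, the base values are made up for the example, and how the tool actually maps `model` and `workers` onto its internal settings is not documented here. A JSON profile is used so the example needs only the standard library.

```python
import json


def apply_profile(base: dict, profile: dict) -> dict:
    """Overlay profile values on base settings.

    Keys that are absent or None in the profile keep their base value;
    the profile's own `name` is metadata and is not copied over.
    """
    merged = dict(base)
    for key, value in profile.items():
        if key == "name" or value is None:
            continue
        merged[key] = value
    return merged


# Base settings as they might come from .env (illustrative values).
base = {
    "input_path": "/env/input",
    "output_path": "/env/output",
    "input_file_regex": r".*\.pdf",
    "model": "google/gemini-2.5-flash",
    "workers": 1,
}

profile = json.loads(
    '{"name": "myprofile", "input_path": "/path/to/input", "workers": 4}'
)
merged = apply_profile(base, profile)
```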

CLI Options

| Option | Description |
|---|---|
| --profile, -p | Profile name to use |
| --list-profiles | List available profiles |
| --regenerate | Regenerate markdown files from existing JSON data |
| --reprocess-all | Force reprocess all documents |
| --document, -d | Process only this document (filename or stem) |
| --page | Process only this page number (requires --document) |
| --model, -m | Override model ID |
| --workers, -w | Override worker count |
| --pattern | Override input file regex |

Examples:

# Process all new PDFs
python main.py --profile tsilva

# Regenerate summaries from existing transcription files
python main.py --profile tsilva --regenerate

# Force reprocess all documents
python main.py --profile tsilva --reprocess-all

# Reprocess a specific document
python main.py -p tsilva -d exam_2024.pdf

# Reprocess a specific page within a document
python main.py -p tsilva -d exam_2024.pdf --page 2

Output Format

The parser generates structured markdown files with YAML frontmatter:

output/
├── {document}/
│   ├── {document}.pdf            # Source PDF copy
│   ├── {document}.001.jpg        # Page 1 image
│   ├── {document}.001.md         # Page 1 transcription + metadata
│   ├── {document}.002.jpg        # Page 2 image
│   ├── {document}.002.md         # Page 2 transcription + metadata
│   └── {document}.summary.md     # Document-level summary
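
The naming scheme in the tree above can be expressed with `pathlib`. The helpers `page_paths` and `summary_path` are hypothetical, and two details are inferred from the example rather than confirmed: that `{document}` is the PDF filename without its extension, and that page numbers are zero-padded to three digits.

```python
from pathlib import Path


def page_paths(output_root: Path, pdf_name: str, page: int) -> tuple[Path, Path]:
    """Return (image_path, markdown_path) for a 1-based page number."""
    stem = Path(pdf_name).stem          # "exam_2024.pdf" -> "exam_2024"
    doc_dir = output_root / stem        # one directory per document
    base = f"{stem}.{page:03d}"         # zero-padded page suffix, e.g. ".001"
    return doc_dir / f"{base}.jpg", doc_dir / f"{base}.md"


def summary_path(output_root: Path, pdf_name: str) -> Path:
    """Return the path of the document-level summary file."""
    stem = Path(pdf_name).stem
    return output_root / stem / f"{stem}.summary.md"


image, markdown = page_paths(Path("output"), "exam_2024.pdf", 1)
summary = summary_path(Path("output"), "exam_2024.pdf")
```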

Transcription File Structure

Each .md file contains YAML frontmatter with metadata followed by the verbatim transcription:

---
date: 2024-01-15
title: "Chest X-Ray PA and Lateral"
category: imaging
exam_name_raw: "RX TORAX PA Y LAT"
doctor: "Dr. Smith"
facility: "Hospital Central"
confidence: 0.95
page: 1
source: exam_2024.pdf
---

[Full verbatim transcription text here...]
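
For downstream processing, a file with this layout can be split into metadata and body with a few lines of standard-library Python. This is a minimal sketch for flat `key: value` frontmatter like the example above; `split_frontmatter` is a hypothetical helper, all values are kept as strings, and a real YAML parser would be the safer choice for anything more complex.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a transcription file into (metadata, body).

    Only handles flat `key: value` pairs; values stay strings and
    surrounding double quotes are stripped.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing '---' delimiter after the opening one.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


sample = """---
date: 2024-01-15
title: "Chest X-Ray PA and Lateral"
category: imaging
confidence: 0.95
page: 1
source: exam_2024.pdf
---

Findings: clear lungs.
"""
meta, body = split_frontmatter(sample)
```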

Metadata Fields

| Field | Description |
|---|---|
| date | Exam date (YYYY-MM-DD) |
| title | Standardized exam name (English) |
| category | Exam type: imaging, ultrasound, endoscopy, other |
| exam_name_raw | Exam name exactly as written in document |
| doctor | Physician name (if found) |
| facility | Healthcare facility name |
| department | Department within facility |
| confidence | Self-consistency confidence score (0.0-1.0) |
| page | Page number in source PDF |
| source | Source PDF filename |

Architecture

parsemedicalexams/
├── main.py              # Pipeline orchestration, CLI handling
├── extraction.py        # Pydantic models, Vision LLM extraction, voting
├── standardization.py   # Exam type classification with JSON cache
├── summarization.py     # Document-level clinical summarization
├── config.py            # ExtractionConfig (.env) + ProfileConfig (profiles/)
├── utils.py             # Image preprocessing, logging, JSON utilities
├── prompts/             # Externalized LLM prompts as markdown
├── profiles/            # User-specific path configurations
└── config/cache/        # Persistent LLM response caches (user-editable)
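
The user-editable cache in `config/cache/` can be pictured as a JSON file mapping raw exam names to standardized ones. The sketch below is an assumption about the general pattern, not the project's code: `cached_standardize` is a hypothetical helper, the lowercase key normalization is a choice made for this example, and the `standardize` callback stands in for the LLM call.

```python
import json
from pathlib import Path
from typing import Callable


def cached_standardize(
    raw_name: str,
    cache_file: Path,
    standardize: Callable[[str], str],
) -> str:
    """Return the standardized name, calling `standardize` only on a cache miss."""
    cache: dict[str, str] = {}
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    key = raw_name.strip().lower()  # normalization is an assumption of this sketch
    if key not in cache:
        cache[key] = standardize(raw_name)
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        # Pretty-printed and sorted so the file stays easy to edit by hand.
        cache_file.write_text(
            json.dumps(cache, indent=2, ensure_ascii=False, sort_keys=True),
            encoding="utf-8",
        )
    return cache[key]
```

Because the file is re-read on every call, a manual edit to a cached value takes effect on the next lookup without any further model call.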

Key Design Patterns

  • Two-phase processing: Classify document first, then transcribe all pages
  • Two-column naming: *_raw (exact from document) + *_standardized (LLM-mapped)
  • Persistent caching: LLM standardization results cached in config/cache/*.json
  • Editable caches: Manually override cached values to fix misclassifications
  • Profile inheritance: Profiles can inherit from .env with overrides
  • Frequency-based date voting: Handles multi-era documents (e.g., 2024 cover letter + 1997 records)
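
Frequency-based date voting can be illustrated with a short sketch. It assumes one extracted date per page in ISO format (YYYY-MM-DD) and picks the most frequent one as the document date; `vote_date` is a hypothetical name, and the tie-break (earliest date wins) is an arbitrary choice for this example rather than documented behavior.

```python
from collections import Counter
from typing import Iterable, Optional


def vote_date(page_dates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most frequent ISO date across pages, or None if none found."""
    counts = Counter(d for d in page_dates if d)  # ignore pages without a date
    if not counts:
        return None
    top = max(counts.values())
    # Tie-break on the earliest date; ISO strings sort chronologically.
    return min(d for d, n in counts.items() if n == top)


# A 2024 cover letter followed by three pages of 1997 records.
document_date = vote_date(["2024-03-01", "1997-05-12", "1997-05-12", "1997-05-12"])
```

With this approach a single recent cover page does not override the date that dominates the actual records.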

Requirements

License

MIT
