# parsemedicalexams

Extract and summarize medical exam reports (X-rays, MRIs, ultrasounds, and more) from PDFs using Vision AI.

Features · Quick Start · Configuration · Output Format
## Features

- **Vision-powered extraction**: uses Vision LLMs to read X-rays, MRIs, ultrasounds, endoscopies, and more directly from PDF scans
- **Self-consistency voting**: runs multiple extractions and votes on the best result for maximum reliability
- **Intelligent classification**: automatically categorizes exams (imaging, ultrasound, endoscopy, other) and standardizes naming
- **Clinical summarization**: preserves all findings, impressions, and recommendations while filtering noise
- **Markdown output with YAML frontmatter**: clean, structured files ready for Obsidian, static sites, or further processing
- **Smart caching**: persistent JSON caches avoid redundant API calls and allow manual overrides
- **Multi-era document handling**: frequency-based date voting correctly handles documents spanning multiple time periods
## Quick Start

### 1. Install

```bash
pip install -e .
```

Requires Poppler for PDF processing:

- macOS: `brew install poppler`
- Ubuntu: `apt-get install poppler-utils`

### 2. Configure

```bash
cp .env.example .env
```

Edit `.env` with your settings:

```bash
OPENROUTER_API_KEY=your_api_key_here
INPUT_PATH=/path/to/your/exam/pdfs
OUTPUT_PATH=/path/to/output
```

### 3. Run

```bash
python main.py
```
## How It Works

```
┌───────────┐    ┌───────────────┐    ┌────────────────┐    ┌──────────────┐    ┌──────────┐
│ PDF Input │───▶│ Preprocessing │───▶│ Vision LLM ×N  │───▶│ Standardize  │───▶│ Markdown │
│           │    │ (grayscale,   │    │ + voting       │    │ + classify   │    │ Output   │
│           │    │  resize)      │    │                │    │              │    │          │
└───────────┘    └───────────────┘    └────────────────┘    └──────────────┘    └──────────┘
```
1. **PDF → Images**: converts each page to grayscale, resizes it, and enhances contrast
2. **Document classification**: determines whether the document is a medical exam before processing
3. **Vision LLM transcription**: transcribes each page verbatim using function calling (runs N times for reliability)
4. **Self-consistency voting**: if transcriptions differ, an LLM votes on the best result
5. **Standardization**: classifies the exam type and standardizes the name via LLM, with caching
6. **Summarization**: generates document-level clinical summaries preserving all findings
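The self-consistency step can be sketched as a majority vote over the N transcriptions of a page. This is a minimal illustration, not the project's actual code: the function name is hypothetical, and the real pipeline falls back to an LLM arbiter when no transcription wins a clear majority.

```python
from collections import Counter


def vote_on_transcriptions(transcriptions):
    """Pick the most frequent transcription among N extraction runs.

    `transcriptions` holds the N outputs of the vision model for one page.
    In the real pipeline, a tie or disagreement would be resolved by an
    LLM vote rather than by simply taking the most common string.
    """
    counts = Counter(t.strip() for t in transcriptions)
    best, votes = counts.most_common(1)[0]
    if votes > len(transcriptions) // 2:
        return best  # clear majority: no extra LLM call needed
    # No majority: the real pipeline would ask an LLM to arbitrate here.
    return best
```

With `N_EXTRACTIONS=3`, two identical runs out of three are enough to accept a transcription without any additional API call.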
## Configuration

### Environment Variables

| Variable | Description | Default |
|---|---|---|
| `OPENROUTER_API_KEY` | Your OpenRouter API key (get one here) | required |
| `INPUT_PATH` | Directory containing exam PDFs | required |
| `OUTPUT_PATH` | Where to write output files | required |
| `EXTRACT_MODEL_ID` | Vision model for extraction | `google/gemini-2.5-flash` |
| `SUMMARIZE_MODEL_ID` | Model for summarization | `google/gemini-2.5-flash` |
| `SELF_CONSISTENCY_MODEL_ID` | Model for voting | `google/gemini-2.5-flash` |
| `N_EXTRACTIONS` | Number of extraction runs for voting | `3` |
| `MAX_WORKERS` | Parallel workers for PDF processing | `1` |
| `INPUT_FILE_REGEX` | Regex pattern for input files | `.*\.pdf` |
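The table above maps onto a config object roughly as follows. This is a hedged sketch (the real field names live in `config.py`): required variables fail fast, everything else falls back to the documented default.

```python
import os

# Default model shared by the three *_MODEL_ID variables, per the table above.
DEFAULT_MODEL = "google/gemini-2.5-flash"


def load_config():
    """Read the documented environment variables into a plain dict.

    Hypothetical helper: the project actually builds an ExtractionConfig,
    but the required/optional split and defaults are the same.
    """
    required = ["OPENROUTER_API_KEY", "INPUT_PATH", "OUTPUT_PATH"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise SystemExit(f"Missing required variables: {', '.join(missing)}")
    return {
        "api_key": os.environ["OPENROUTER_API_KEY"],
        "input_path": os.environ["INPUT_PATH"],
        "output_path": os.environ["OUTPUT_PATH"],
        "extract_model": os.environ.get("EXTRACT_MODEL_ID", DEFAULT_MODEL),
        "summarize_model": os.environ.get("SUMMARIZE_MODEL_ID", DEFAULT_MODEL),
        "n_extractions": int(os.environ.get("N_EXTRACTIONS", "3")),
        "max_workers": int(os.environ.get("MAX_WORKERS", "1")),
        "input_file_regex": os.environ.get("INPUT_FILE_REGEX", r".*\.pdf"),
    }
```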
### Using Profiles

Profiles let you save different input/output configurations for different use cases:

```bash
# Create a profile from the template
cp profiles/_template.yaml profiles/myprofile.yaml

# Run with a profile
python main.py --profile myprofile

# List available profiles
python main.py --list-profiles
```

Profile files (YAML or JSON) support path overrides and model configuration:

```yaml
name: myprofile
input_path: /path/to/input
output_path: /path/to/output
input_file_regex: ".*\\.pdf"
model: google/gemini-2.5-flash  # Optional override
workers: 1                      # Optional override
```
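Conceptually, a profile is just an overlay on the `.env`-derived base config: keys present in the profile win, everything else falls through. A minimal sketch of that merge, using a JSON profile to stay stdlib-only (the helper name and the exact merge logic are assumptions; the real merge lives in `config.py`):

```python
import json
from pathlib import Path

# Keys a profile may override, per the YAML example above.
PROFILE_KEYS = ("input_path", "output_path", "input_file_regex", "model", "workers")


def apply_profile(base, profile_path):
    """Overlay a profile's keys on top of the base config dict.

    Only keys actually present in the profile file override the base,
    so a profile can set just `input_path` and inherit everything else.
    """
    profile = json.loads(Path(profile_path).read_text())
    merged = dict(base)
    for key in PROFILE_KEYS:
        if key in profile:
            merged[key] = profile[key]
    return merged
```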
### CLI Options

| Option | Description |
|---|---|
| `--profile`, `-p` | Profile name to use |
| `--list-profiles` | List available profiles |
| `--regenerate` | Regenerate markdown files from existing JSON data |
| `--reprocess-all` | Force reprocessing of all documents |
| `--document`, `-d` | Process only this document (filename or stem) |
| `--page` | Process only this page number (requires `--document`) |
| `--model`, `-m` | Override the model ID |
| `--workers`, `-w` | Override the worker count |
| `--pattern` | Override the input file regex |
Examples:

```bash
# Process all new PDFs
python main.py --profile tsilva

# Regenerate summaries from existing transcription files
python main.py --profile tsilva --regenerate

# Force reprocessing of all documents
python main.py --profile tsilva --reprocess-all

# Reprocess a specific document
python main.py -p tsilva -d exam_2024.pdf

# Reprocess a specific page within a document
python main.py -p tsilva -d exam_2024.pdf --page 2
```
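The options above map naturally onto a standard `argparse` parser. This is a hypothetical mirror of the CLI table (the real parser lives in `main.py` and may differ in details):

```python
import argparse


def build_parser():
    """Build an argparse parser matching the documented CLI options."""
    p = argparse.ArgumentParser(prog="parsemedicalexams")
    p.add_argument("--profile", "-p", help="Profile name to use")
    p.add_argument("--list-profiles", action="store_true", help="List available profiles")
    p.add_argument("--regenerate", action="store_true", help="Regenerate markdown from existing JSON")
    p.add_argument("--reprocess-all", action="store_true", help="Force reprocessing of all documents")
    p.add_argument("--document", "-d", help="Process only this document (filename or stem)")
    p.add_argument("--page", type=int, help="Process only this page (requires --document)")
    p.add_argument("--model", "-m", help="Override the model ID")
    p.add_argument("--workers", "-w", type=int, help="Override the worker count")
    p.add_argument("--pattern", help="Override the input file regex")
    return p
```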
## Output Format

The parser generates structured markdown files with YAML frontmatter:

```
output/
└── {document}/
    ├── {document}.pdf         # Source PDF copy
    ├── {document}.001.jpg     # Page 1 image
    ├── {document}.001.md      # Page 1 transcription + metadata
    ├── {document}.002.jpg     # Page 2 image
    ├── {document}.002.md      # Page 2 transcription + metadata
    └── {document}.summary.md  # Document-level summary
```
### Transcription File Structure

Each `.md` file contains YAML frontmatter with metadata, followed by the verbatim transcription:

```markdown
---
date: 2024-01-15
title: "Chest X-Ray PA and Lateral"
category: imaging
exam_name_raw: "RX TORAX PA Y LAT"
doctor: "Dr. Smith"
facility: "Hospital Central"
confidence: 0.95
page: 1
source: exam_2024.pdf
---

[Full verbatim transcription text here...]
```
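Because the frontmatter in these files is flat `key: value` pairs, downstream tooling can split it off without a YAML library. A minimal sketch (the function is hypothetical; a real consumer would likely use `python-frontmatter` or PyYAML):

```python
def split_frontmatter(text):
    """Split '---\\n<header>\\n---\\n<body>' into (metadata dict, body).

    Only handles the flat key: value frontmatter shown above; nested
    YAML would need a real parser.
    """
    _, header, body = text.split("---\n", 2)
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        if key.strip():
            meta[key.strip()] = value.strip().strip('"')
    return meta, body.strip()
```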
### Metadata Fields

| Field | Description |
|---|---|
| `date` | Exam date (`YYYY-MM-DD`) |
| `title` | Standardized exam name (English) |
| `category` | Exam type: `imaging`, `ultrasound`, `endoscopy`, `other` |
| `exam_name_raw` | Exam name exactly as written in the document |
| `doctor` | Physician name (if found) |
| `facility` | Healthcare facility name |
| `department` | Department within the facility |
| `confidence` | Self-consistency confidence score (0.0–1.0) |
| `page` | Page number in the source PDF |
| `source` | Source PDF filename |
## Architecture

```
parsemedicalexams/
├── main.py             # Pipeline orchestration, CLI handling
├── extraction.py       # Pydantic models, Vision LLM extraction, voting
├── standardization.py  # Exam type classification with JSON cache
├── summarization.py    # Document-level clinical summarization
├── config.py           # ExtractionConfig (.env) + ProfileConfig (profiles/)
├── utils.py            # Image preprocessing, logging, JSON utilities
├── prompts/            # Externalized LLM prompts as markdown
├── profiles/           # User-specific path configurations
└── config/cache/       # Persistent LLM response caches (user-editable)
```
### Key Design Patterns

- **Two-phase processing**: classify the document first, then transcribe all pages
- **Two-column naming**: `*_raw` (exact from document) + `*_standardized` (LLM-mapped)
- **Persistent caching**: LLM standardization results are cached in `config/cache/*.json`
- **Editable caches**: manually override cached values to fix misclassifications
- **Profile inheritance**: profiles can inherit from `.env` with overrides
- **Frequency-based date voting**: handles multi-era documents (e.g., a 2024 cover letter + 1997 records)
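Frequency-based date voting can be illustrated with a tally over the per-page dates: the era that most pages agree on wins, so a single recent cover page cannot drag old records into the wrong year. A minimal sketch (the function name is hypothetical):

```python
from collections import Counter


def vote_on_year(dates):
    """Return the dominant year among per-page ISO dates (YYYY-MM-DD).

    A document that is mostly 1997 records with one 2024 cover letter
    should resolve to 1997, since most pages vote for that era.
    """
    years = Counter(d[:4] for d in dates if d)
    return years.most_common(1)[0][0]
```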
## Requirements

- Python 3.8+
- Poppler for PDF processing
- OpenRouter API key for Vision LLM access

## License

MIT