Extract structured lab test results from medical documents with AI precision
parselabs
🔬 Extract lab results from medical PDFs using AI vision with self-consistency 📊
Overview
parselabs uses AI vision models to extract laboratory test results from PDF documents and images, converting unstructured medical reports into clean, standardized CSV/Excel data. It automatically normalizes test names, converts units, and validates results for accuracy.
Features
- AI-Powered Extraction — Vision models extract lab names, values, units, and reference ranges directly from PDF pages
- Smart Validation — Detects extraction errors across 5 categories: biological plausibility, inter-lab relationships, temporal consistency, format artifacts, and reference range deviations
- Cost-Optimized — Text-first extraction uses cheaper LLM calls when PDF text is parseable, falling back to vision only when needed
- Profile-Based Workflow — Configure multiple profiles for different users or data sources with simple YAML files
- Gradio Review UI — Side-by-side comparison of source documents and extracted data with keyboard shortcuts
- 335+ Standardized Labs — Comprehensive lab specifications with unit conversions and reference ranges
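The text-first strategy behind the Cost-Optimized feature can be pictured as a routing decision made per page. The sketch below is illustrative only: the function name `choose_extraction_mode`, the `min_chars` threshold, and the alphanumeric-ratio heuristic are assumptions, not parselabs internals.

```python
def choose_extraction_mode(page_text: str, min_chars: int = 200,
                           min_alnum_ratio: float = 0.5) -> str:
    """Return "text" when the embedded PDF text looks usable, else "vision".

    Hypothetical heuristic: both thresholds are illustrative values.
    """
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        return "vision"  # scanned page or almost no embedded text
    alnum = sum(ch.isalnum() for ch in stripped)
    if alnum / len(stripped) < min_alnum_ratio:
        return "vision"  # mostly symbols: likely a garbled text layer
    return "text"
```

A page with a healthy text layer would go to the cheaper text call, while a scanned page with no embedded text would fall through to the vision model.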
Quick Start
# Install dependencies
uv sync
# Create your profile
cp profiles/_template.yaml profiles/myname.yaml
# Edit profiles/myname.yaml with your input/output paths
# Configure environment (copy .env.example and edit)
cp .env.example .env
# Edit .env with your API key and model settings
# Extract lab results
python main.py --profile myname
# Review results
python review.py --profile myname
Installation
Prerequisites
Setup
git clone https://github.com/tsilva/parselabs.git
cd parselabs
uv sync
macOS (Poppler)
brew install poppler
Environment Variables
Create a .env file:
# Required
OPENROUTER_API_KEY=your_key_here
EXTRACT_MODEL_ID=google/gemini-3-flash-preview # Vision model for extraction
SELF_CONSISTENCY_MODEL_ID=google/gemini-3-flash-preview # Model for self-consistency
# Optional
N_EXTRACTIONS=1 # Self-consistency extractions
MAX_WORKERS=4 # Parallel workers
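As a rough illustration of how these variables could be consumed, the sketch below reads them from a mapping and fails fast when a required one is missing. The `Settings` class and `load_settings` function are hypothetical names; the sketch assumes the variables are already in the process environment and does not parse the .env file itself.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    extract_model_id: str
    self_consistency_model_id: str
    n_extractions: int = 1   # default taken from the example above
    max_workers: int = 4     # default taken from the example above


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = ("OPENROUTER_API_KEY", "EXTRACT_MODEL_ID", "SELF_CONSISTENCY_MODEL_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing required variables: " + ", ".join(missing))
    return Settings(
        api_key=env["OPENROUTER_API_KEY"],
        extract_model_id=env["EXTRACT_MODEL_ID"],
        self_consistency_model_id=env["SELF_CONSISTENCY_MODEL_ID"],
        n_extractions=int(env.get("N_EXTRACTIONS", "1")),
        max_workers=int(env.get("MAX_WORKERS", "4")),
    )
```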
Configuration
Profiles
Profiles define input/output paths and optional settings. Create one per user or data source:
# profiles/john.yaml
name: "John Doe"
input_path: "/path/to/lab/pdfs"
output_path: "/path/to/output"
input_file_regex: "*.pdf" # Optional filter
# Optional demographics for personalized ranges
demographics:
  gender: "male"
  date_of_birth: "1990-01-15"
List available profiles:
python main.py --list-profiles
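Profile discovery along these lines could be as simple as scanning the profiles directory. This is a minimal sketch, not the tool's actual code: `list_profiles` is a hypothetical helper, and skipping files that start with an underscore (such as `_template.yaml`) is an assumption.

```python
from pathlib import Path


def list_profiles(profiles_dir: str = "profiles") -> list[str]:
    """Return profile names (file stems) found in profiles_dir, sorted.

    Assumption: files starting with "_" are templates, not real profiles.
    """
    directory = Path(profiles_dir)
    return sorted(
        path.stem
        for path in directory.glob("*.yaml")
        if not path.name.startswith("_")
    )
```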
Lab Specifications
The config/lab_specs.json file contains 335+ standardized lab tests with:
- Primary units and conversion factors
- Reference ranges
- Biological limits for validation
- Inter-lab relationships (e.g., LDL Friedewald formula)
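To make the inter-lab relationship idea concrete, here is a small sketch of a Friedewald check. The formula itself (LDL = total cholesterol - HDL - triglycerides / 5, all in mg/dL) is standard; the function names and the 15% tolerance are assumptions for illustration and are not read from config/lab_specs.json.

```python
def friedewald_ldl(total_chol: float, hdl: float, triglycerides: float) -> float:
    """Estimate LDL cholesterol in mg/dL with the Friedewald formula."""
    return total_chol - hdl - triglycerides / 5.0


def ldl_matches(reported_ldl: float, total_chol: float, hdl: float,
                triglycerides: float, tolerance: float = 0.15) -> bool:
    """True when the reported LDL is within a relative tolerance of the estimate.

    The tolerance is a hypothetical value chosen for this example.
    """
    expected = friedewald_ldl(total_chol, hdl, triglycerides)
    if expected <= 0:
        return False  # estimate is unusable, so it cannot confirm the value
    return abs(reported_ldl - expected) / expected <= tolerance
```

With total cholesterol 200, HDL 50, and triglycerides 150, the estimate is 120 mg/dL, so a reported LDL of 12 (a dropped digit) would fail the check. The formula is generally considered unreliable at triglycerides of 400 mg/dL or more, and some labs measure LDL directly, so a mismatch is a prompt for review rather than proof of an error.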
Usage
Extract Lab Results
# Run all profiles (default)
python main.py
# Run specific profile
python main.py --profile myname
# Override model
python main.py --profile myname --model google/gemini-2.5-pro
# Filter files
python main.py --profile myname --pattern "2024-*.pdf"
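The pattern in the last command looks like a shell-style glob. Assuming that is how it is interpreted, file filtering could be sketched with the standard library as follows; `filter_files` is a hypothetical helper, not part of the parselabs CLI.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def filter_files(paths, pattern="*.pdf"):
    """Keep paths whose file name matches a shell-style glob (case-sensitive)."""
    return [p for p in paths if fnmatchcase(Path(p).name, pattern)]
```

Matching is done on the file name only, so directory components in the input path do not affect the result.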
Review Extracted Data
python review.py --profile myname
The Gradio-based review UI provides:
- Side-by-side view — Source document image alongside extracted data
- Keyboard shortcuts — Y=Accept, N=Reject, S=Skip, Arrow keys=Navigate
- Smart filters — Unreviewed, Low Confidence, Needs Review, Accepted, Rejected
- Progress tracking — Review counts and completion status
Validate Data Integrity
python test.py
Checks for duplicate rows, missing dates, outliers, and naming conventions.
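Two of these checks, duplicate rows and missing dates, are easy to picture on the output CSV. The sketch below is not the contents of test.py: `find_issues` is a hypothetical function, and treating (date, lab_name, value) as the duplicate key is an assumption.

```python
import csv
import io


def find_issues(rows):
    """Return (row_number, issue) pairs for missing dates and duplicate rows."""
    issues = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        if not (row.get("date") or "").strip():
            issues.append((number, "missing date"))
        # Assumed duplicate key: same date, lab name, and value.
        key = (row.get("date"), row.get("lab_name"), row.get("value"))
        if key in seen:
            issues.append((number, "duplicate row"))
        seen.add(key)
    return issues


sample = io.StringIO(
    "date,lab_name,value\n"
    "2024-01-10,Blood - Glucose,92\n"
    "2024-01-10,Blood - Glucose,92\n"
    ",Blood - Glucose,88\n"
)
print(find_issues(list(csv.DictReader(sample))))
```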
Output
For each PDF, the tool generates:
| File | Description |
|---|---|
| {doc}/ | Directory with page images and JSON extractions |
| {doc}.csv | Combined results for the document |
| all.csv | Merged results from all documents |
| all.xlsx | Excel workbook with formatted data |
Output Schema
| Column | Description |
|---|---|
| date | Report/collection date |
| lab_name | Standardized name (e.g., "Blood - Glucose") |
| value | Numeric value in primary unit |
| unit | Primary unit (e.g., "mg/dL") |
| reference_min/max | Reference range from report |
| raw_lab_name, raw_value, raw_unit | Original values for audit |
| review_needed | Boolean flag for items needing review |
| review_reason | Validation reason codes |
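Given this schema, pulling out the rows that need attention takes only the standard library. The sketch below is a usage illustration under two assumptions: `rows_needing_review` is a hypothetical helper, and the boolean column is serialized as "True", "true", or "1".

```python
import csv


def rows_needing_review(csv_path):
    """Yield (date, lab_name, value, unit, reasons) for rows flagged for review."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumption: the flag is written as "True"/"true"/"1".
            flagged = row.get("review_needed", "").strip().lower() in {"true", "1"}
            if flagged:
                yield (
                    row["date"],
                    row["lab_name"],
                    row["value"],
                    row["unit"],
                    row.get("review_reason", ""),
                )
```

Calling `list(rows_needing_review("all.csv"))` on the merged output would give a quick worklist before opening the review UI.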
Architecture
The extraction pipeline has 5 stages:
- PDF Processing: Text extraction or page-to-image conversion
- Extraction: Vision/text LLM extracts structured LabResult objects
- Standardization: Maps to standardized names and units
- Normalization: Converts values to primary units
- Validation: Flags suspicious values for review
For detailed documentation, see docs/pipeline.md.
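The standardization and normalization stages can be sketched in a few lines. Everything below is an assumption made for illustration: the fields on `LabResult`, the shape of the `LAB_SPECS` dictionary, and the alias list are invented here and may not match the real models or config/lab_specs.json. The glucose factor (1 mmol/L is roughly 18.016 mg/dL) follows from its molar mass.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical miniature lab spec; the real file layout may differ.
LAB_SPECS = {
    "Blood - Glucose": {
        "aliases": {"glucose", "fasting glucose"},
        "primary_unit": "mg/dL",
        "factors": {"mg/dL": 1.0, "mmol/L": 18.016},  # multiply to get mg/dL
    }
}


@dataclass
class LabResult:
    raw_lab_name: str
    raw_value: float
    raw_unit: str
    lab_name: Optional[str] = None
    value: Optional[float] = None
    unit: Optional[str] = None


def standardize(result: LabResult) -> LabResult:
    """Map the raw name to a standardized lab name, if one is known."""
    for name, spec in LAB_SPECS.items():
        if result.raw_lab_name.strip().lower() in spec["aliases"]:
            result.lab_name = name
    return result


def normalize(result: LabResult) -> LabResult:
    """Convert the raw value to the lab's primary unit."""
    spec = LAB_SPECS.get(result.lab_name)
    if spec and result.raw_unit in spec["factors"]:
        result.value = round(result.raw_value * spec["factors"][result.raw_unit], 1)
        result.unit = spec["primary_unit"]
    return result


result = normalize(standardize(LabResult("Glucose", 5.1, "mmol/L")))
print(result.lab_name, result.value, result.unit)  # Blood - Glucose 91.9 mg/dL
```

Keeping the raw fields alongside the converted ones mirrors the raw_* audit columns in the output schema.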
Validation Categories
| Category | Reason Codes | Description |
|---|---|---|
| Biological Plausibility | NEGATIVE_VALUE, IMPOSSIBLE_VALUE, PERCENTAGE_BOUNDS | Values outside biological limits |
| Inter-Lab Relationships | RELATIONSHIP_MISMATCH | Calculated values don't match formulas |
| Temporal Consistency | TEMPORAL_ANOMALY | Implausible change rate between tests |
| Format Artifacts | FORMAT_ARTIFACT | OCR/extraction concatenation errors |
| Reference Ranges | RANGE_INCONSISTENCY, EXTREME_DEVIATION | Reference range issues |
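As an illustration of how the Biological Plausibility codes might be assigned, consider the sketch below. The reason codes come from the table; the function name, its parameters, and the exact rules are assumptions, and the 0 to 100 percentage bound is a simplification that would not suit every lab.

```python
def biological_plausibility(value, unit, min_possible=None, max_possible=None):
    """Return reason codes for one value.

    min_possible and max_possible are hypothetical per-lab biological limits.
    """
    reasons = []
    if value < 0:
        reasons.append("NEGATIVE_VALUE")
    if unit == "%" and not 0 <= value <= 100:
        reasons.append("PERCENTAGE_BOUNDS")
    if min_possible is not None and value < min_possible:
        reasons.append("IMPOSSIBLE_VALUE")
    elif max_possible is not None and value > max_possible:
        reasons.append("IMPOSSIBLE_VALUE")
    return reasons
```

An empty list means no flag from this category; a non-empty list could feed the review_needed and review_reason columns.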
License