Extract structured lab test results from medical documents with AI precision

parselabs

🔬 Extract lab results from medical PDFs using AI vision with self-consistency 📊

Overview

parselabs uses AI vision models to extract laboratory test results from PDF documents and images, converting unstructured medical reports into clean, standardized CSV/Excel data. It automatically normalizes test names, converts units, and validates results for accuracy.

Features

  • AI-Powered Extraction — Vision models extract lab names, values, units, and reference ranges directly from PDF pages
  • Smart Validation — Detects extraction errors across 5 categories: biological plausibility, inter-lab relationships, temporal consistency, format artifacts, and reference range deviations
  • Cost-Optimized — Text-first extraction uses cheaper LLM calls when PDF text is parseable, falling back to vision only when needed
  • Profile-Based Workflow — Configure multiple profiles for different users or data sources with simple YAML files
  • Gradio Review UI — Side-by-side comparison of source documents and extracted data with keyboard shortcuts
  • 335+ Standardized Labs — Comprehensive lab specifications with unit conversions and reference ranges
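
The text-first strategy in the Cost-Optimized bullet can be pictured with a small sketch. This is not the project's code: the function names, the 200-character threshold, and the printable-character heuristic are all assumptions chosen for illustration, and the two extractors are passed in as plain callables.

```python
from typing import Callable, List, Optional, Tuple

MIN_TEXT_CHARS = 200  # hypothetical threshold for "parseable" text


def looks_parseable(page_text: Optional[str], min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Heuristic: enough text, and almost all of it printable."""
    if not page_text:
        return False
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
    return printable / len(stripped) > 0.95


def extract_page(
    page_text: Optional[str],
    text_extractor: Callable[[str], List],
    vision_extractor: Callable[[], List],
) -> Tuple[str, List]:
    """Try the cheaper text path first; fall back to vision when it fails."""
    if looks_parseable(page_text):
        results = text_extractor(page_text)
        if results:  # the text path produced something usable
            return "text", results
    return "vision", vision_extractor()
```

The returned tag ("text" or "vision") makes it easy to log which path handled each page and to estimate cost savings.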

Quick Start

# Install dependencies
uv sync

# Create your profile
cp profiles/_template.yaml profiles/myname.yaml
# Edit profiles/myname.yaml with your input/output paths

# Configure environment (copy .env.example and edit)
cp .env.example .env
# Edit .env with your API key and model settings

# Extract lab results
python main.py --profile myname

# Review results
python review.py --profile myname

Installation

Prerequisites

  • Python 3.8+
  • uv package manager
  • Poppler for PDF processing

Setup

git clone https://github.com/tsilva/parselabs.git
cd parselabs
uv sync

macOS (Poppler)

brew install poppler

Environment Variables

Create a .env file:

# Required
OPENROUTER_API_KEY=your_key_here
EXTRACT_MODEL_ID=google/gemini-3-flash-preview       # Vision model for extraction
SELF_CONSISTENCY_MODEL_ID=google/gemini-3-flash-preview  # Model for self-consistency

# Optional
N_EXTRACTIONS=1    # Number of self-consistency extraction runs
MAX_WORKERS=4      # Parallel workers
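
How N_EXTRACTIONS might drive self-consistency can be sketched as follows. The README does not say how runs are reconciled (the separate SELF_CONSISTENCY_MODEL_ID suggests a model may be involved), so the majority vote below is a common stand-in technique, not the project's method. All function names are hypothetical.

```python
import os
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def n_extractions(default: int = 1) -> int:
    """Read N_EXTRACTIONS from the environment, never going below 1."""
    return max(1, int(os.environ.get("N_EXTRACTIONS", default)))


def majority_vote(candidates: List[Hashable]) -> Tuple[Hashable, float]:
    """Return the most common candidate and its share of the votes."""
    if not candidates:
        raise ValueError("need at least one candidate")
    winner, count = Counter(candidates).most_common(1)[0]
    return winner, count / len(candidates)


def extract_with_self_consistency(
    extract_once: Callable[[], Hashable], n: int
) -> Tuple[Hashable, float]:
    """Run the extractor n times and keep the value most runs agree on."""
    runs = [extract_once() for _ in range(n)]
    return majority_vote(runs)
```

With n equal to 1 the vote is trivial and the single run is returned with an agreement of 1.0. A low agreement share is a natural signal for flagging a value for manual review.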

Configuration

Profiles

Profiles define input/output paths and optional settings. Create one per user or data source:

# profiles/john.yaml
name: "John Doe"
input_path: "/path/to/lab/pdfs"
output_path: "/path/to/output"
input_file_regex: "*.pdf"  # Optional filter

# Optional demographics for personalized ranges
demographics:
  gender: "male"
  date_of_birth: "1990-01-15"
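
A profile like the one above could be validated in Python roughly as follows. This is a sketch under assumptions: the Profile class and its methods are hypothetical, input_path and output_path are treated as required, and the dictionary is assumed to come from whatever YAML loader the project uses.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Profile:
    name: str
    input_path: str
    output_path: str
    input_file_regex: Optional[str] = None
    gender: Optional[str] = None
    date_of_birth: Optional[date] = None

    @classmethod
    def from_dict(cls, data: dict) -> "Profile":
        """Build a profile from an already-parsed YAML mapping."""
        missing = [k for k in ("input_path", "output_path") if not data.get(k)]
        if missing:
            raise ValueError(f"profile is missing required keys: {missing}")
        demo = data.get("demographics") or {}
        dob = demo.get("date_of_birth")
        if isinstance(dob, str):  # quoted YAML dates arrive as strings
            dob = date.fromisoformat(dob)
        return cls(
            name=data.get("name", ""),
            input_path=data["input_path"],
            output_path=data["output_path"],
            input_file_regex=data.get("input_file_regex"),
            gender=demo.get("gender"),
            date_of_birth=dob,
        )

    def age_on(self, day: date) -> Optional[int]:
        """Age in whole years on a given day, e.g. a report date."""
        if self.date_of_birth is None:
            return None
        dob = self.date_of_birth
        return day.year - dob.year - ((day.month, day.day) < (dob.month, dob.day))
```

Computing the age relative to the report date, rather than today, matters when age-dependent reference ranges are applied to older reports.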

List available profiles:

python main.py --list-profiles

Lab Specifications

The config/lab_specs.json file contains 335+ standardized lab tests with:

  • Primary units and conversion factors
  • Reference ranges
  • Biological limits for validation
  • Inter-lab relationships (e.g., LDL Friedewald formula)
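
As an illustration of an inter-lab relationship, the Friedewald formula estimates LDL cholesterol as total cholesterol minus HDL minus triglycerides divided by 5, with all values in mg/dL. It is generally not applied when triglycerides reach 400 mg/dL or more. The sketch below checks a reported LDL against that estimate. The function names and the 15% tolerance are assumptions, not values taken from lab_specs.json.

```python
from typing import Optional


def friedewald_ldl(total_chol: float, hdl: float, triglycerides: float) -> Optional[float]:
    """Estimate LDL in mg/dL; returns None where the formula is not applicable."""
    if triglycerides >= 400:
        return None
    return total_chol - hdl - triglycerides / 5.0


def ldl_mismatch(
    reported_ldl: float,
    total_chol: float,
    hdl: float,
    triglycerides: float,
    rel_tol: float = 0.15,  # hypothetical tolerance
) -> bool:
    """True when the reported LDL deviates from the estimate by more than rel_tol."""
    expected = friedewald_ldl(total_chol, hdl, triglycerides)
    if expected is None or expected <= 0:
        return False  # nothing reliable to compare against
    return abs(reported_ldl - expected) / expected > rel_tol
```

A mismatch does not prove an extraction error, since some labs measure LDL directly, so it is better treated as a review flag than as a rejection.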

Usage

Extract Lab Results

# Run all profiles (default)
python main.py

# Run specific profile
python main.py --profile myname

# Override model
python main.py --profile myname --model google/gemini-2.5-pro

# Filter files
python main.py --profile myname --pattern "2024-*.pdf"
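
The --pattern value shown above looks like a shell-style glob. Assuming it is matched against file names that way (the README does not confirm the matching rules), the effect can be previewed with the standard library. The filter_files helper is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import Path
from typing import Iterable, List


def filter_files(paths: Iterable[str], pattern: str) -> List[str]:
    """Keep paths whose file name matches a glob pattern, case-sensitively."""
    return sorted(p for p in paths if fnmatchcase(Path(p).name, pattern))


candidates = [
    "/labs/2024-01-10.pdf",
    "/labs/2023-12-01.pdf",
    "/labs/2024-notes.txt",
]
print(filter_files(candidates, "2024-*.pdf"))
```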

Review Extracted Data

python review.py --profile myname

The Gradio-based review UI provides:

  • Side-by-side view — Source document image alongside extracted data
  • Keyboard shortcuts — Y=Accept, N=Reject, S=Skip, Arrow keys=Navigate
  • Smart filters — Unreviewed, Low Confidence, Needs Review, Accepted, Rejected
  • Progress tracking — Review counts and completion status

Validate Data Integrity

python test.py

Checks for duplicate rows, missing dates, outliers, and naming-convention violations.
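
Three of these checks can be sketched on plain dictionaries. This does not reproduce test.py: the duplicate key (date plus lab name) and the "<Prefix> - <Name>" naming pattern are assumptions inferred from the example name "Blood - Glucose", and outlier detection is left out.

```python
import re
from collections import Counter
from typing import Dict, List

# Assumed naming convention, inferred from the example "Blood - Glucose".
NAME_PATTERN = re.compile(r"^[A-Z][A-Za-z]* - \S.*$")


def check_rows(rows: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems found in extracted rows."""
    problems = []
    # Duplicates: the same lab reported more than once on the same date.
    seen = Counter((r.get("date", ""), r.get("lab_name", "")) for r in rows)
    for (day, lab), count in seen.items():
        if count > 1:
            problems.append(f"duplicate: {lab} on {day} appears {count} times")
    for i, row in enumerate(rows):
        if not row.get("date"):
            problems.append(f"row {i}: missing date")
        if not NAME_PATTERN.match(row.get("lab_name", "")):
            problems.append(f"row {i}: unexpected lab name {row.get('lab_name')!r}")
    return problems
```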

Output

The tool generates per-document outputs for each PDF, plus merged files across all documents:

File        Description
{doc}/      Directory with page images and JSON extractions
{doc}.csv   Combined results for the document
all.csv     Merged results from all documents
all.xlsx    Excel workbook with formatted data

Output Schema

Column                              Description
date                                Report/collection date
lab_name                            Standardized name (e.g., "Blood - Glucose")
value                               Numeric value in primary unit
unit                                Primary unit (e.g., "mg/dL")
reference_min/max                   Reference range from report
raw_lab_name, raw_value, raw_unit   Original values for audit
review_needed                       Boolean flag for items needing review
review_reason                       Validation reason codes
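
A merged CSV with these columns can be filtered for flagged rows using only the standard library. The sketch below is illustrative: the sample rows are made up, and the way the boolean flag is spelled in the real file (True, 1, yes) is an assumption.

```python
import csv
import io
from typing import Dict, Iterable, List

TRUTHY = {"true", "1", "yes"}  # assumed spellings of the boolean flag


def flagged_rows(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return rows whose review_needed column is set."""
    reader = csv.DictReader(lines)
    return [
        row for row in reader
        if (row.get("review_needed") or "").strip().lower() in TRUTHY
    ]


# Made-up sample rows; with real output, pass open("all.csv", newline="") instead.
sample = io.StringIO(
    "date,lab_name,value,unit,review_needed,review_reason\n"
    "2024-01-10,Blood - Glucose,92,mg/dL,False,\n"
    "2024-01-10,Blood - Glucose,-5,mg/dL,True,NEGATIVE_VALUE\n"
)
for row in flagged_rows(sample):
    print(row["date"], row["lab_name"], row["value"], row["review_reason"])
```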

Architecture

The extraction pipeline has 5 stages:

  1. PDF Processing — Text extraction or page-to-image conversion
  2. Extraction — Vision/text LLM extracts structured LabResult objects
  3. Standardization — Maps to standardized names and units
  4. Normalization — Converts values to primary units
  5. Validation — Flags suspicious values for review

For detailed documentation, see docs/pipeline.md.
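
Stage 4 (Normalization) amounts to multiplying a value by a per-lab, per-unit conversion factor. The sketch below assumes a simple lookup table. The real layout of config/lab_specs.json is not shown in this README, so the FACTORS structure and the to_primary function are hypothetical. The glucose factor follows from its molar mass of about 180.16 g/mol.

```python
from typing import Dict, Tuple

# (lab_name, source_unit) -> multiplicative factor to the primary unit.
# Hypothetical structure; glucose's primary unit is taken to be mg/dL.
FACTORS: Dict[Tuple[str, str], float] = {
    ("Blood - Glucose", "mg/dL"): 1.0,
    ("Blood - Glucose", "mmol/L"): 18.016,  # 1 mmol/L of glucose is about 18.016 mg/dL
}


def to_primary(lab_name: str, value: float, unit: str) -> float:
    """Convert a value to the lab's primary unit, or fail loudly."""
    try:
        factor = FACTORS[(lab_name, unit)]
    except KeyError:
        raise ValueError(f"no conversion for {lab_name!r} from {unit!r}") from None
    return value * factor


print(round(to_primary("Blood - Glucose", 5.0, "mmol/L"), 2))  # 90.08
```

Raising on an unknown unit, instead of passing the value through, keeps unconverted numbers from silently mixing with converted ones.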

Validation Categories

Category                  Reason Codes                                          Description
Biological Plausibility   NEGATIVE_VALUE, IMPOSSIBLE_VALUE, PERCENTAGE_BOUNDS   Values outside biological limits
Inter-Lab Relationships   RELATIONSHIP_MISMATCH                                 Calculated values don't match formulas
Temporal Consistency      TEMPORAL_ANOMALY                                      Implausible change rate between tests
Format Artifacts          FORMAT_ARTIFACT                                       OCR/extraction concatenation errors
Reference Ranges          RANGE_INCONSISTENCY, EXTREME_DEVIATION                Reference range issues
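
The biological plausibility codes could be assigned along these lines. Only the reason code names come from the table above. The triggering conditions, the function name, and the bio_min/bio_max parameters are assumptions, and a real implementation would need exceptions for labs that can legitimately be negative.

```python
from typing import List, Optional


def plausibility_codes(
    value: float,
    unit: str,
    bio_min: Optional[float] = None,
    bio_max: Optional[float] = None,
) -> List[str]:
    """Return the plausibility reason codes that apply to one result."""
    codes = []
    if value < 0:
        codes.append("NEGATIVE_VALUE")
    if unit == "%" and not 0 <= value <= 100:
        codes.append("PERCENTAGE_BOUNDS")
    below = bio_min is not None and value < bio_min
    above = bio_max is not None and value > bio_max
    if below or above:
        codes.append("IMPOSSIBLE_VALUE")
    return codes
```

Returning a list rather than a single code lets one result carry several reasons, which matches the plural "reason codes" in the review_reason column.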

License

MIT

Download files

Source Distribution

parselabs-0.1.2.tar.gz (408.9 kB)

Uploaded Source

Built Distribution

parselabs-0.1.2-py3-none-any.whl (40.7 kB)

Uploaded Python 3

File details

Details for the file parselabs-0.1.2.tar.gz.

File metadata

  • Download URL: parselabs-0.1.2.tar.gz
  • Upload date:
  • Size: 408.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parselabs-0.1.2.tar.gz
Algorithm     Hash digest
SHA256        fd5bd0414c4e9d00a35de5719dcdb668a7b0fbe147c4659c3ba740513ce9c1f0
MD5           103dc0cb3343d5967fac5156fc40e08f
BLAKE2b-256   bd1b3eda50581825ab2f28d230f905f0106472272fd290474f28990c46dd96ac

Provenance

The following attestation bundles were made for parselabs-0.1.2.tar.gz:

Publisher: release.yml on tsilva/parselabs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file parselabs-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: parselabs-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 40.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parselabs-0.1.2-py3-none-any.whl
Algorithm     Hash digest
SHA256        0a8ceecce3d0ee8ac25e49d10f1b0b91669488a5d49effd1906169dcaeda6f24
MD5           82f4d6f2042311846d21476d525247d9
BLAKE2b-256   f05559f85f7c6a32a58293c89dc5adb3e1da9bde02fc43502329db236d05fb76

Provenance

The following attestation bundles were made for parselabs-0.1.2-py3-none-any.whl:

Publisher: release.yml on tsilva/parselabs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
