Extract structured lab test results from medical documents with AI precision

Maintainers

engtiagosilva

parselabs

🔬 Extract lab results from medical PDFs using AI vision with self-consistency 📊


Overview

parselabs uses AI vision models to extract laboratory test results from PDF documents and images, converting unstructured medical reports into clean, standardized CSV/Excel data. It automatically normalizes test names, converts units, and validates results for accuracy.

Features

AI-Powered Extraction — Vision models extract lab names, values, units, and reference ranges directly from PDF pages
Smart Validation — Detects extraction errors across 5 categories: biological plausibility, inter-lab relationships, temporal consistency, format artifacts, and reference range deviations
Cost-Optimized — Text-first extraction uses cheaper LLM calls when PDF text is parseable, falling back to vision only when needed
Profile-Based Workflow — Configure multiple profiles for different users or data sources with simple YAML files
Gradio Review UI — Side-by-side comparison of source documents and extracted data with keyboard shortcuts
335+ Standardized Labs — Comprehensive lab specifications with unit conversions and reference ranges
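The text-first strategy in the Cost-Optimized item can be pictured as a small routing decision per page. The sketch below is only an illustration: `looks_parseable`, `choose_extraction_mode`, and both thresholds are hypothetical names and values, not parselabs' actual heuristic.

```python
def looks_parseable(page_text: str, min_chars: int = 200,
                    min_readable_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: the page has enough text and most of it is readable."""
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        return False  # scanned pages usually yield little or no embedded text
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) >= min_readable_ratio


def choose_extraction_mode(page_text: str) -> str:
    """Return "text" for the cheaper text call, "vision" for the image fallback."""
    return "text" if looks_parseable(page_text) else "vision"
```

A page with no embedded text layer would be routed to the vision model, while a page with a clean text layer would use the cheaper text call.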

Quick Start

# Install dependencies
uv sync

# Create your profile
cp profiles/_template.yaml profiles/myname.yaml
# Edit profiles/myname.yaml with your input/output paths

# Configure environment (copy .env.example and edit)
cp .env.example .env
# Edit .env with your API key and model settings

# Extract lab results
python main.py --profile myname

# Review results
python review.py --profile myname

Installation

Prerequisites

Python 3.8+
uv package manager
Poppler for PDF processing

Setup

git clone https://github.com/tsilva/parselabs.git
cd parselabs
uv sync

macOS (Poppler)

brew install poppler

Environment Variables

Create a .env file:

# Required
OPENROUTER_API_KEY=your_key_here
EXTRACT_MODEL_ID=google/gemini-3-flash-preview       # Vision model for extraction
SELF_CONSISTENCY_MODEL_ID=google/gemini-3-flash-preview  # Model for self-consistency

# Optional
N_EXTRACTIONS=1    # Number of extraction passes used for self-consistency
MAX_WORKERS=4      # Parallel workers
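
As a rough illustration of how these variables might be consumed, the sketch below reads them into a typed settings object and fails early when a required one is missing. The `Settings` class and `load_settings` function are hypothetical, and treating `1` and `4` as defaults is an assumption taken from the example values above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    extract_model_id: str
    self_consistency_model_id: str
    n_extractions: int = 1   # assumed default, mirrors the example above
    max_workers: int = 4     # assumed default, mirrors the example above


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = ("OPENROUTER_API_KEY", "EXTRACT_MODEL_ID", "SELF_CONSISTENCY_MODEL_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))
    return Settings(
        api_key=env["OPENROUTER_API_KEY"],
        extract_model_id=env["EXTRACT_MODEL_ID"],
        self_consistency_model_id=env["SELF_CONSISTENCY_MODEL_ID"],
        n_extractions=int(env.get("N_EXTRACTIONS", "1")),
        max_workers=int(env.get("MAX_WORKERS", "4")),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test without touching the real environment.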

Configuration

Profiles

Profiles define input/output paths and optional settings. Create one per user or data source:

# profiles/john.yaml
name: "John Doe"
input_path: "/path/to/lab/pdfs"
output_path: "/path/to/output"
input_file_regex: "*.pdf"  # Optional filter

# Optional demographics for personalized ranges
demographics:
  gender: "male"
  date_of_birth: "1990-01-15"
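
To make the profile structure concrete, the following sketch models the fields shown above after the YAML has already been parsed into a dictionary. `Profile`, `profile_from_dict`, and `age_on` are hypothetical names for illustration only. Using age to select personalized ranges is an assumption about how demographics might be applied.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Profile:
    name: str
    input_path: str
    output_path: str
    input_file_regex: Optional[str] = None
    gender: Optional[str] = None
    date_of_birth: Optional[date] = None

    def age_on(self, when: date) -> Optional[int]:
        """Age in whole years on a given date, or None without a birth date."""
        if self.date_of_birth is None:
            return None
        dob = self.date_of_birth
        had_birthday = (when.month, when.day) >= (dob.month, dob.day)
        return when.year - dob.year - (0 if had_birthday else 1)


def profile_from_dict(data: dict) -> Profile:
    """Build a Profile from an already-parsed YAML mapping."""
    demographics = data.get("demographics") or {}
    dob = demographics.get("date_of_birth")
    if isinstance(dob, str):  # quoted YAML dates arrive as strings
        dob = date.fromisoformat(dob)
    return Profile(
        name=data["name"],
        input_path=data["input_path"],
        output_path=data["output_path"],
        input_file_regex=data.get("input_file_regex"),
        gender=demographics.get("gender"),
        date_of_birth=dob,
    )
```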

List available profiles:

python main.py --list-profiles

Lab Specifications

The config/lab_specs.json file contains 335+ standardized lab tests with:

Primary units and conversion factors
Reference ranges
Biological limits for validation
Inter-lab relationships (e.g., LDL Friedewald formula)
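
As an example of an inter-lab relationship, the Friedewald formula estimates LDL cholesterol as total cholesterol minus HDL minus triglycerides divided by 5, with all values in mg/dL. The sketch below shows how such a relationship could be checked. The function names and the 15% tolerance are assumptions for illustration, not values taken from `config/lab_specs.json`.

```python
def friedewald_ldl(total_chol: float, hdl: float, triglycerides: float) -> float:
    """Estimated LDL in mg/dL (all inputs in mg/dL)."""
    return total_chol - hdl - triglycerides / 5.0


def ldl_mismatch(reported_ldl: float, total_chol: float, hdl: float,
                 triglycerides: float, tolerance: float = 0.15) -> bool:
    """True when the reported LDL deviates from the estimate by more than `tolerance`.

    The tolerance is a hypothetical value; labs that measure LDL directly
    can legitimately differ from the calculated estimate.
    """
    expected = friedewald_ldl(total_chol, hdl, triglycerides)
    if expected <= 0:
        return False  # the estimate is not meaningful, so do not flag
    return abs(reported_ldl - expected) / expected > tolerance
```

For a panel with total cholesterol 200, HDL 50, and triglycerides 150, the estimate is 120 mg/dL, so an extracted LDL of 12 (for instance from a dropped digit) would be flagged.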

Usage

Extract Lab Results

# Run all profiles (default)
python main.py

# Run specific profile
python main.py --profile myname

# Override model
python main.py --profile myname --model google/gemini-2.5-pro

# Filter files
python main.py --profile myname --pattern "2024-*.pdf"
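
The `--pattern` value above looks like a shell-style glob. Assuming that is how it is interpreted (the README does not say), its effect on a list of file names can be sketched with Python's standard `fnmatch` module. `filter_files` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def filter_files(names, pattern):
    """Keep only the file names matching a shell-style glob (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]
```

With the pattern `2024-*.pdf`, a file named `2024-01-10.pdf` would be kept, while `2023-12-01.pdf` and `2024-notes.txt` would be skipped.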

Review Extracted Data

python review.py --profile myname

The Gradio-based review UI provides:

Side-by-side view — Source document image alongside extracted data
Keyboard shortcuts — Y=Accept, N=Reject, S=Skip, Arrow keys=Navigate
Smart filters — Unreviewed, Low Confidence, Needs Review, Accepted, Rejected
Progress tracking — Review counts and completion status

Validate Data Integrity

python test.py

Checks the extracted data for duplicate rows, missing dates, outliers, and lab names that violate the naming conventions.
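
Two of these checks, duplicate rows and missing dates, can be sketched over rows read with `csv.DictReader`. This is only an illustration of the idea. `find_integrity_issues` and the `DUPLICATE_ROW` / `MISSING_DATE` labels are hypothetical and are not taken from `test.py`.

```python
import csv
import io
from collections import Counter


def find_integrity_issues(rows):
    """Return (label, detail) tuples for duplicate rows and rows without a date."""
    issues = []
    # A row counts as a duplicate when date, lab name, and value all repeat.
    keys = Counter(
        (row.get("date", ""), row.get("lab_name", ""), row.get("value", ""))
        for row in rows
    )
    for key, count in keys.items():
        if count > 1:
            issues.append(("DUPLICATE_ROW", key))
    for index, row in enumerate(rows):
        if not (row.get("date") or "").strip():
            issues.append(("MISSING_DATE", index))
    return issues


sample = """date,lab_name,value
2024-01-10,Blood - Glucose,92
2024-01-10,Blood - Glucose,92
,Blood - Glucose,101
"""
issues = find_integrity_issues(list(csv.DictReader(io.StringIO(sample))))
```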

Output

For each PDF, the tool generates:

File	Description
`{doc}/`	Directory with page images and JSON extractions
`{doc}.csv`	Combined results for the document
`all.csv`	Merged results from all documents
`all.xlsx`	Excel workbook with formatted data

Output Schema

Column	Description
`date`	Report/collection date
`lab_name`	Standardized name (e.g., "Blood - Glucose")
`value`	Numeric value in primary unit
`unit`	Primary unit (e.g., "mg/dL")
`reference_min`, `reference_max`	Reference range bounds from the report
`raw_lab_name`, `raw_value`, `raw_unit`	Original values for audit
`review_needed`	Boolean flag for items needing review
`review_reason`	Validation reason codes
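
A downstream script might use this schema to pull out only the rows flagged for review. The sketch below works on CSV text with the standard `csv` module. How `review_needed` is serialized (for example `True`, `true`, or `1`) is an assumption, so the check accepts several spellings. `rows_needing_review` is a hypothetical helper.

```python
import csv
import io

TRUTHY = {"true", "1", "yes"}  # assumed spellings of a set boolean flag


def rows_needing_review(csv_text: str):
    """Return the rows whose review_needed column is set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if (row.get("review_needed") or "").strip().lower() in TRUTHY
    ]


sample = """date,lab_name,value,unit,review_needed,review_reason
2024-01-10,Blood - Glucose,92,mg/dL,False,
2024-01-10,Blood - Glucose,920,mg/dL,True,IMPOSSIBLE_VALUE
"""
flagged = rows_needing_review(sample)
```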

Architecture

The extraction pipeline has 5 stages:

PDF Processing — Text extraction or page-to-image conversion
Extraction — Vision/text LLM extracts structured LabResult objects
Standardization — Maps to standardized names and units
Normalization — Converts values to primary units
Validation — Flags suspicious values for review
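
The Normalization stage can be illustrated with a single conversion. For glucose, 1 mmol/L corresponds to about 18.016 mg/dL (from a molar mass of roughly 180.16 g/mol). The lookup tables and the `normalize` function below are hypothetical and only mimic the kind of data the lab specifications are described as holding.

```python
# Hypothetical excerpts of what a lab specification might provide.
PRIMARY_UNITS = {"Blood - Glucose": "mg/dL"}
CONVERSION_FACTORS = {("Blood - Glucose", "mmol/L"): 18.016}  # multiply to get mg/dL


def normalize(lab_name: str, value: float, unit: str):
    """Convert a value to the lab's primary unit; raises KeyError if unknown."""
    primary = PRIMARY_UNITS[lab_name]
    if unit == primary:
        return value, primary
    factor = CONVERSION_FACTORS[(lab_name, unit)]
    return value * factor, primary
```

Raising on an unknown lab or unit, rather than passing the value through unchanged, keeps unconverted values from silently mixing with converted ones.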

For detailed documentation, see docs/pipeline.md.

Validation Categories

Category	Reason Codes	Description
Biological Plausibility	`NEGATIVE_VALUE`, `IMPOSSIBLE_VALUE`, `PERCENTAGE_BOUNDS`	Values outside biological limits
Inter-Lab Relationships	`RELATIONSHIP_MISMATCH`	Calculated values don't match formulas
Temporal Consistency	`TEMPORAL_ANOMALY`	Implausible change rate between tests
Format Artifacts	`FORMAT_ARTIFACT`	OCR/extraction concatenation errors
Reference Ranges	`RANGE_INCONSISTENCY`, `EXTREME_DEVIATION`	Reference range issues
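
To show how reason codes from the table might be produced, here is a minimal sketch of the Biological Plausibility category. The reason code names come from the table above. The triggering conditions, the `plausibility_reasons` function, and the `bio_min` / `bio_max` parameters are assumptions for illustration.

```python
from typing import List, Optional


def plausibility_reasons(value: float, unit: str,
                         bio_min: Optional[float] = None,
                         bio_max: Optional[float] = None) -> List[str]:
    """Return the reason codes triggered by one normalized value."""
    reasons = []
    if value < 0:
        reasons.append("NEGATIVE_VALUE")
    if unit == "%" and not 0 <= value <= 100:
        reasons.append("PERCENTAGE_BOUNDS")
    below = bio_min is not None and value < bio_min
    above = bio_max is not None and value > bio_max
    if below or above:
        reasons.append("IMPOSSIBLE_VALUE")
    return reasons
```

An empty list means nothing was flagged. A non-empty list would correspond to `review_needed` being set, with the codes recorded in `review_reason`.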

License

MIT


Release history

This version

0.1.2

Feb 28, 2026

Download files


Source Distribution

parselabs-0.1.2.tar.gz (408.9 kB)

Uploaded Feb 28, 2026 Source

Built Distribution


parselabs-0.1.2-py3-none-any.whl (40.7 kB)

Uploaded Feb 28, 2026 Python 3

File details

Details for the file parselabs-0.1.2.tar.gz.

File metadata

Download URL: parselabs-0.1.2.tar.gz
Upload date: Feb 28, 2026
Size: 408.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parselabs-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`fd5bd0414c4e9d00a35de5719dcdb668a7b0fbe147c4659c3ba740513ce9c1f0`
MD5	`103dc0cb3343d5967fac5156fc40e08f`
BLAKE2b-256	`bd1b3eda50581825ab2f28d230f905f0106472272fd290474f28990c46dd96ac`
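
A downloaded file can be checked against the SHA256 digest listed above using Python's standard `hashlib`. The helper names below are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, `matches_published_hash("parselabs-0.1.2.tar.gz", "fd5bd0414c4e9d00a35de5719dcdb668a7b0fbe147c4659c3ba740513ce9c1f0")` should return `True` for an intact copy of the source distribution.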


Provenance

The following attestation bundles were made for parselabs-0.1.2.tar.gz:

Publisher: release.yml on tsilva/parselabs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parselabs-0.1.2.tar.gz
- Subject digest: fd5bd0414c4e9d00a35de5719dcdb668a7b0fbe147c4659c3ba740513ce9c1f0
- Sigstore transparency entry: 1005485984
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: tsilva/parselabs@af453d01862db404ae5e7972fdf898de307de35b
- Branch / Tag: refs/heads/main
- Owner: https://github.com/tsilva
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@af453d01862db404ae5e7972fdf898de307de35b
- Trigger Event: push

File details

Details for the file parselabs-0.1.2-py3-none-any.whl.

File metadata

Download URL: parselabs-0.1.2-py3-none-any.whl
Upload date: Feb 28, 2026
Size: 40.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parselabs-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0a8ceecce3d0ee8ac25e49d10f1b0b91669488a5d49effd1906169dcaeda6f24`
MD5	`82f4d6f2042311846d21476d525247d9`
BLAKE2b-256	`f05559f85f7c6a32a58293c89dc5adb3e1da9bde02fc43502329db236d05fb76`


Provenance

The following attestation bundles were made for parselabs-0.1.2-py3-none-any.whl:

Publisher: release.yml on tsilva/parselabs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parselabs-0.1.2-py3-none-any.whl
- Subject digest: 0a8ceecce3d0ee8ac25e49d10f1b0b91669488a5d49effd1906169dcaeda6f24
- Sigstore transparency entry: 1005485987
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: tsilva/parselabs@af453d01862db404ae5e7972fdf898de307de35b
- Branch / Tag: refs/heads/main
- Owner: https://github.com/tsilva
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@af453d01862db404ae5e7972fdf898de307de35b
- Trigger Event: push
