vcf-pg-loader

High-performance VCF to PostgreSQL loader with clinical-grade compliance.

Features

  • Streaming VCF parsing with cyvcf2 for memory-efficient processing
  • Variant normalization using the vt algorithm (left-align and trim)
  • Number=A/R/G field handling - proper per-ALT extraction during multi-allelic decomposition
  • Binary COPY protocol via asyncpg for maximum insert performance
  • Chromosome-partitioned tables for efficient region queries
  • Human and non-human genome support - chromosome enum for human, TEXT for others
  • Audit trail with load batch tracking and validation
  • Command-line interface built with Typer for easy operation
  • TOML configuration - file-based configuration with CLI overrides
  • Progress reporting - real-time progress bar with rich
  • Structured logging - configurable verbosity levels
  • Retry logic - exponential backoff for transient database failures
  • Docker support - multi-stage Dockerfile and docker-compose for development
  • Zero-config database - auto-managed PostgreSQL via Docker, no setup required
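
To make the "exponential backoff" item above concrete, here is a minimal sketch of the general pattern. It is not the loader's actual code: the function name `with_retry`, its parameters, and the choice of which exceptions count as transient are all assumptions made for illustration.

```python
import asyncio
import random


async def with_retry(operation, *, attempts=5, base_delay=0.5, max_delay=30.0,
                     retry_on=(ConnectionError, TimeoutError),
                     sleep=asyncio.sleep):
    """Await operation() and retry transient failures with exponential backoff.

    Hypothetical helper: names and defaults are illustrative only.
    """
    for attempt in range(attempts):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on every attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel workers from retrying in lockstep.
            await sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `asyncio.sleep` applies.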

Installation

Bioconda (Recommended)

conda install -c conda-forge -c bioconda vcf-pg-loader

PyPI

pip install vcf-pg-loader

Quick Install Script

curl -fsSL https://raw.githubusercontent.com/Zacharyr41/vcf-pg-loader/main/install.sh | bash

This installs vcf-pg-loader and all dependencies (Python, Docker) automatically.

From Source

git clone https://github.com/Zacharyr41/vcf-pg-loader.git
cd vcf-pg-loader
uv pip install -e ".[dev]"

Nextflow Module

For use in Nextflow pipelines, the vcfpgloader/load module is available:

include { VCFPGLOADER } from './modules/local/vcfpgloader/load/main'

workflow {
    VCFPGLOADER(
        ch_vcf,           // tuple val(meta), path(vcf), path(tbi)
        params.db_host,
        params.db_port,
        params.db_name,
        params.db_user,
        params.db_schema
    )
}

The module reads the database password from PGPASSWORD, which is supplied as a Nextflow secret. See nf-core modules for integration into nf-core pipelines.

Verify Installation

vcf-pg-loader doctor

Quick Start

Zero-Config Mode (Easiest)

No PostgreSQL setup required - vcf-pg-loader manages a local database automatically:

# Load a VCF file (auto-starts PostgreSQL in Docker)
vcf-pg-loader load sample.vcf.gz

# Check database status
vcf-pg-loader db status

# Open psql shell to query data
vcf-pg-loader db shell

With Your Own PostgreSQL

# Initialize database schema
vcf-pg-loader init-db --db postgresql://user:pass@localhost/variants

# Load a VCF file
vcf-pg-loader load sample.vcf.gz --db postgresql://user:pass@localhost/variants

# Validate a completed load
vcf-pg-loader validate <load-batch-id> --db postgresql://user:pass@localhost/variants

Additional Options

# Load without normalization
vcf-pg-loader load sample.vcf.gz --no-normalize

# Load non-human VCF (e.g., SARS-CoV-2)
vcf-pg-loader load sarscov2.vcf.gz --no-human-genome

# Initialize for non-human genomes
vcf-pg-loader init-db --db postgresql://... --no-human-genome

CLI Commands

load

Load a VCF file into PostgreSQL.

vcf-pg-loader load <vcf_path> [OPTIONS]

Options:
  --db, -d                        PostgreSQL connection URL (omit for auto-managed DB)
  --batch, -b                     Records per batch [default: 50000]
  --workers, -w                   Parallel workers [default: 8]
  --normalize/--no-normalize      Normalize variants using vt algorithm [default: normalize]
  --drop-indexes/--keep-indexes   Drop indexes during load [default: drop-indexes]
  --human-genome/--no-human-genome  Use human chromosome enum type [default: human-genome]
  --config, -c                    TOML configuration file
  --verbose, -v                   Enable verbose logging (DEBUG level)
  --quiet, -q                     Suppress non-error output
  --progress/--no-progress        Show progress bar [default: progress]
  --force, -f                     Force reload even if file was already loaded

When --db is omitted, vcf-pg-loader automatically uses a managed PostgreSQL container.

Normalization: When enabled (default), variants are left-aligned and trimmed following the vt algorithm. This ensures consistent representation across different variant callers.
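
As an illustration of what "left-aligned and trimmed" means, the following is a minimal sketch of the published left-align-and-trim procedure for a single biallelic variant. It is not the loader's implementation: the function name `normalize_variant` is hypothetical, the reference is assumed to be an in-memory string for one chromosome, and alleles are assumed to be plain, non-empty base strings (no symbolic alleles such as `<DEL>` or `*`).

```python
def normalize_variant(pos, ref, alt, chrom_seq):
    """Left-align and trim one biallelic variant.

    pos is the 1-based VCF position; chrom_seq is the reference sequence of
    the chromosome as a string (an assumption made for this sketch).
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return pos, ref, alt  # not a variant; nothing to align

    # Right-trim: drop the shared last base. If that would empty an allele,
    # first extend both alleles one base to the left from the reference.
    while ref[-1] == alt[-1]:
        if min(len(ref), len(alt)) == 1:
            if pos == 1:
                break  # cannot extend past the start of the sequence
            pos -= 1
            left_base = chrom_seq[pos - 1].upper()
            ref, alt = left_base + ref, left_base + alt
        ref, alt = ref[:-1], alt[:-1]

    # Left-trim: drop shared leading bases while both alleles keep one base.
    while min(len(ref), len(alt)) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt
```

For example, in the sequence `GCACACAT` a `CA` deletion written as position 5, `ACA` to `A`, shifts left through the repeat: `normalize_variant(5, "ACA", "A", "GCACACAT")` returns `(1, "GCA", "G")`. A padded SNV such as position 2, `CAC` to `CGC`, is trimmed to `(3, "A", "G")`.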

Genome Type: Human genome mode uses a PostgreSQL enum for chromosomes (chr1-22, X, Y, M) which provides validation and efficient storage. Non-human mode uses TEXT to support arbitrary chromosome/contig names.

validate

Validate a completed load by checking record counts and duplicates.

vcf-pg-loader validate <load_batch_id> [OPTIONS]

Options:
  --db, -d    PostgreSQL connection URL

init-db

Initialize the database schema (tables, indexes, extensions).

vcf-pg-loader init-db [OPTIONS]

Options:
  --db, -d                          PostgreSQL connection URL
  --human-genome/--no-human-genome  Use human chromosome enum type [default: human-genome]

Important: The genome type must match between init-db and load commands. Use --no-human-genome for both when loading non-human VCFs.

benchmark

Run performance benchmarks on VCF parsing and loading.

vcf-pg-loader benchmark [OPTIONS]

Options:
  --vcf, -f        Path to VCF file (uses built-in fixture if omitted)
  --synthetic, -s  Generate synthetic VCF with N variants
  --db, -d         PostgreSQL URL (omit for parsing-only benchmark)
  --batch, -b      Batch size [default: 50000]
  --normalize/--no-normalize  Test with/without normalization
  --json           Output results as JSON (for CI integration)
  --quiet, -q      Minimal output

Examples:

# Quick benchmark with built-in fixture (~2.6K variants)
vcf-pg-loader benchmark

# Generate and benchmark 100K synthetic variants
vcf-pg-loader benchmark --synthetic 100000

# Benchmark a specific VCF file
vcf-pg-loader benchmark --vcf /path/to/sample.vcf.gz

# Full benchmark including database loading
vcf-pg-loader benchmark --synthetic 50000 --db postgresql://localhost/variants

# JSON output for CI/scripting
vcf-pg-loader benchmark --synthetic 10000 --json

Sample output:

Benchmark Results (synthetic)
  Variants: 100,000
  Batch size: 50,000
  Normalized: True

Parsing: 100,000 variants in 0.94s (106,000/sec)

doctor

Check system dependencies and diagnose issues.

vcf-pg-loader doctor

# Example output:
Dependency Check
  Python         3.12.4   OK
  cyvcf2         0.30.22  OK
  asyncpg        0.29.0   OK
  Docker         24.0.5   OK
  Docker daemon  running  OK

db

Manage the local PostgreSQL database (Docker-based).

vcf-pg-loader db start   # Start PostgreSQL container
vcf-pg-loader db stop    # Stop the container
vcf-pg-loader db status  # Show running status and connection URL
vcf-pg-loader db url     # Print connection URL (for scripts)
vcf-pg-loader db shell   # Open psql shell
vcf-pg-loader db reset   # Remove container and all data

Architecture

Components

  1. VCFHeaderParser - Parses VCF headers via cyvcf2's native API to extract INFO/FORMAT field definitions
  2. VCFStreamingParser - Memory-efficient streaming iterator that yields batches of VariantRecord objects
  3. VariantParser - Handles per-variant parsing with Number=A/R/G field extraction for multi-allelic decomposition
  4. VCFLoader - Orchestrates loading with asyncpg binary COPY protocol
  5. SchemaManager - Manages PostgreSQL schema creation and index management
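
The batching idea behind the streaming parser can be sketched in a few lines. This is not the project's `VCFStreamingParser`; `iter_batches` is a hypothetical helper that groups any iterable into fixed-size lists so that only one batch is held in memory at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(records: Iterable[T], batch_size: int = 50_000) -> Iterator[list[T]]:
    """Yield lists of at most batch_size records without reading everything first."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    # islice pulls at most batch_size items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

A `cyvcf2.VCF` object is itself iterable over variants, so a sketch like this could wrap it directly; each yielded batch would then be handed to the database writer.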

Data Flow

VCF File → VCFStreamingParser → Batch Buffer → asyncpg COPY → PostgreSQL
                ↓
         VCFHeaderParser (field metadata)
                ↓
         VariantParser (Number=A/R/G extraction)
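
The Number=A/R/G extraction step can be illustrated with a small sketch based on the VCF specification's field-count rules. It is not the project's `VariantParser`: the function name `extract_for_alt` is hypothetical, and the Number=G branch assumes diploid genotypes, where the value for genotype j/k sits at index k(k+1)/2 + j.

```python
def extract_for_alt(values, number, alt_index):
    """Pick the values that belong to one ALT allele after decomposition.

    values:    the field's values from the multi-allelic record
    number:    the header's Number code ("A", "R", "G", or anything else)
    alt_index: 1-based index of the ALT allele being split out
    """
    if number == "A":
        # One value per ALT allele.
        return [values[alt_index - 1]]
    if number == "R":
        # One value per allele, REF first: keep REF plus this ALT.
        return [values[0], values[alt_index]]
    if number == "G":
        # One value per genotype (diploid assumed): keep 0/0, 0/a and a/a.
        het = alt_index * (alt_index + 1) // 2
        return [values[0], values[het], values[het + alt_index]]
    # Fixed-count or variable-count fields are copied unchanged.
    return list(values)
```

With two ALT alleles, a Number=R field such as `[10, 4, 6]` becomes `[10, 4]` for the first ALT and `[10, 6]` for the second, and a Number=G field `[0, 10, 20, 30, 40, 50]` becomes `[0, 30, 50]` for the second ALT.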

Citations and Acknowledgments

This project was inspired by and builds upon several foundational tools in the genomics community:

Primary References

Slivar - Rapid variant filtering:

Pedersen, B.S., Brown, J.M., Dashnow, H. et al. Effective variant filtering and expected candidate variant yield in studies of rare human disease. npj Genom. Med. 6, 60 (2021). https://doi.org/10.1038/s41525-021-00227-3

GEMINI - Original SQL-based VCF database:

Paila, U., Chapman, B.A., Kirchner, R., & Quinlan, A.R. GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations. PLoS Comput Biol 9(7): e1003153 (2013). https://doi.org/10.1371/journal.pcbi.1003153

cyvcf2 - Python VCF parsing:

Pedersen, B.S. & Quinlan, A.R. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics 33(12), 1867–1869 (2017). https://doi.org/10.1093/bioinformatics/btx057

Configuration

vcf-pg-loader supports TOML configuration files for persistent settings:

# vcf-pg-loader.toml
[vcf_pg_loader]
batch_size = 25000
workers = 16
normalize = true
drop_indexes = true
human_genome = true
log_level = "INFO"

Use with the --config flag:

vcf-pg-loader load sample.vcf.gz --config vcf-pg-loader.toml

CLI arguments override config file values.

Docker

Using Docker Compose (recommended for development)

# Start PostgreSQL and run a load
docker-compose up -d postgres
docker-compose run vcf-pg-loader load /data/sample.vcf.gz --db postgresql://vcfloader:vcfloader@postgres:5432/variants

# Or build and run standalone
docker build -t vcf-pg-loader .
docker run vcf-pg-loader --help

Docker Compose Services

  • postgres: PostgreSQL 16 with health checks
  • vcf-pg-loader: The loader application

Mount your VCF files to /data in the container.

Development

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=vcf_pg_loader

# Run only unit tests (skip integration)
uv run pytest -m "not integration"

Code Quality

# Lint
uv run ruff check src tests

# Type check
uv run mypy src

License

MIT - See LICENSE for details.
