High-performance VCF to PostgreSQL loader with clinical-grade compliance

Maintainers

zacharyr41

Project description

vcf-pg-loader

High-performance VCF to PostgreSQL loader with clinical-grade compliance.

Features

Streaming VCF parsing with cyvcf2 for memory-efficient processing
Variant normalization using the vt algorithm (left-align and trim)
Number=A/R/G field handling - proper per-ALT extraction during multi-allelic decomposition
Binary COPY protocol via asyncpg for maximum insert performance
Chromosome-partitioned tables for efficient region queries
Human and non-human genome support - chromosome enum for human, TEXT for others
Audit trail with load batch tracking and validation
Command-line interface built with Typer
TOML configuration - file-based configuration with CLI overrides
Progress reporting - real-time progress bar with rich
Structured logging - configurable verbosity levels
Retry logic - exponential backoff for transient database failures
Docker support - multi-stage Dockerfile and docker-compose for development
Zero-config database - auto-managed PostgreSQL via Docker, no setup required
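
The retry behaviour listed above can be illustrated with a short sketch. This is not the loader's actual code: `TransientError`, `with_retry`, and all parameter names and defaults are hypothetical, chosen only to show the general shape of exponential backoff around an async database call.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical stand-in for a recoverable database failure."""


async def with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Await operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles on each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter scales the delay to 50-100% so concurrent workers do not retry in lockstep.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```

A caller would wrap each batch insert, for example `await with_retry(lambda: insert_batch(batch))`, where `insert_batch` is whatever coroutine performs the write.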

Installation

Bioconda (Recommended)

conda install -c conda-forge -c bioconda vcf-pg-loader

PyPI

pip install vcf-pg-loader

Quick Install Script

curl -fsSL https://raw.githubusercontent.com/Zacharyr41/vcf-pg-loader/main/install.sh | bash

This installs vcf-pg-loader and all dependencies (Python, Docker) automatically.

From Source

git clone https://github.com/Zacharyr41/vcf-pg-loader.git
cd vcf-pg-loader
uv pip install -e ".[dev]"

Nextflow Module

For use in Nextflow pipelines, the vcfpgloader/load module is available:

include { VCFPGLOADER } from './modules/local/vcfpgloader/load/main'

workflow {
    VCFPGLOADER(
        ch_vcf,           // tuple val(meta), path(vcf), path(tbi)
        params.db_host,
        params.db_port,
        params.db_name,
        params.db_user,
        params.db_schema
    )
}

The module reads the database password from a Nextflow secret named PGPASSWORD. See nf-core modules for integration into nf-core pipelines.

Verify Installation

vcf-pg-loader doctor

Quick Start

Zero-Config Mode (Easiest)

No PostgreSQL setup required - vcf-pg-loader manages a local database automatically:

# Load a VCF file (auto-starts PostgreSQL in Docker)
vcf-pg-loader load sample.vcf.gz

# Check database status
vcf-pg-loader db status

# Open psql shell to query data
vcf-pg-loader db shell

With Your Own PostgreSQL

# Initialize database schema
vcf-pg-loader init-db --db postgresql://user:pass@localhost/variants

# Load a VCF file
vcf-pg-loader load sample.vcf.gz --db postgresql://user:pass@localhost/variants

# Validate a completed load
vcf-pg-loader validate <load-batch-id> --db postgresql://user:pass@localhost/variants

Additional Options

# Load without normalization
vcf-pg-loader load sample.vcf.gz --no-normalize

# Load non-human VCF (e.g., SARS-CoV-2)
vcf-pg-loader load sarscov2.vcf.gz --no-human-genome

# Initialize for non-human genomes
vcf-pg-loader init-db --db postgresql://... --no-human-genome

CLI Commands

`load`

Load a VCF file into PostgreSQL.

vcf-pg-loader load <vcf_path> [OPTIONS]

Options:
  --db, -d                        PostgreSQL connection URL (omit for auto-managed DB)
  --batch, -b                     Records per batch [default: 50000]
  --workers, -w                   Parallel workers [default: 8]
  --normalize/--no-normalize      Normalize variants using vt algorithm [default: normalize]
  --drop-indexes/--keep-indexes   Drop indexes during load [default: drop-indexes]
  --human-genome/--no-human-genome  Use human chromosome enum type [default: human-genome]
  --config, -c                    TOML configuration file
  --verbose, -v                   Enable verbose logging (DEBUG level)
  --quiet, -q                     Suppress non-error output
  --progress/--no-progress        Show progress bar [default: progress]
  --force, -f                     Force reload even if file was already loaded

When --db is omitted, vcf-pg-loader automatically uses a managed PostgreSQL container.

Normalization: When enabled (default), variants are left-aligned and trimmed following the vt algorithm. This ensures consistent representation across different variant callers.
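
To make "left-aligned and trimmed" concrete, here is a minimal sketch of the normalization procedure described in the vt paper. It is an illustration rather than the loader's implementation: the function name `normalize` is hypothetical, the reference is assumed to be a plain in-memory string, and symbolic alleles such as `<DEL>` or `*` are not handled.

```python
def normalize(pos, ref, alts, reference):
    """Left-align and trim one variant.

    pos is 1-based; reference is the chromosome sequence as a string
    (an assumption for this sketch; real code would read an indexed FASTA).
    """
    alleles = [ref, *alts]
    if len(set(alleles)) == 1:
        return pos, ref, list(alts)  # no actual variation, nothing to normalize

    changed = True
    while changed:
        changed = False
        # Drop a shared rightmost base, but only if we could still extend left afterwards.
        can_extend = pos > 1 or min(len(a) for a in alleles) > 1
        if can_extend and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, prepend the preceding reference base to all alleles.
        if any(not a for a in alleles):
            pos -= 1
            alleles = [reference[pos - 1] + a for a in alleles]
            changed = True

    # Drop shared leftmost bases while every allele keeps at least one base.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1:]
```

For the reference `GGGCACACACAGGG`, a caller that reports the CA deletion as `CAC > C` at position 6 and one that reports it as `GCA > G` at position 3 both end up at the second form, which is what makes the stored representation caller-independent.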

Genome Type: Human genome mode uses a PostgreSQL enum for chromosomes (chr1-22, X, Y, M), which provides validation and efficient storage. Non-human mode uses TEXT to support arbitrary chromosome/contig names.

`validate`

Validate a completed load by checking record counts and duplicates.

vcf-pg-loader validate <load_batch_id> [OPTIONS]

Options:
  --db, -d    PostgreSQL connection URL
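
The two checks named above (record counts and duplicates) can be sketched in plain Python. The real command runs against PostgreSQL and its queries are not shown here; `validate_batch` and the choice of `(chrom, pos, ref, alt)` as the duplicate key are assumptions made for illustration.

```python
from collections import Counter


def validate_batch(expected_count, loaded_keys):
    """Check one load batch.

    loaded_keys: iterable of (chrom, pos, ref, alt) tuples read back
    for the batch (hypothetical key; the real schema may differ).
    """
    counts = Counter(loaded_keys)
    total = sum(counts.values())
    # Any key seen more than once is reported as a duplicate.
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    return {
        "loaded": total,
        "count_matches": total == expected_count,
        "duplicates": duplicates,
        "ok": total == expected_count and not duplicates,
    }
```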

`init-db`

Initialize the database schema (tables, indexes, extensions).

vcf-pg-loader init-db [OPTIONS]

Options:
  --db, -d                          PostgreSQL connection URL
  --human-genome/--no-human-genome  Use human chromosome enum type [default: human-genome]

Important: The genome type must match between init-db and load commands. Use --no-human-genome for both when loading non-human VCFs.
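
The reason the two commands must agree can be shown with a small sketch: an enum-typed column rejects any name outside its fixed set, while a TEXT column accepts anything. The label list and the `check_chromosome` helper below are assumptions for illustration; the exact enum labels and any prefix handling in the real schema may differ.

```python
# Assumed enum labels, following the "chr1-22, X, Y, M" description.
HUMAN_CHROMOSOMES = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY", "chrM"]


def check_chromosome(name, human_genome=True):
    """Return the value that would be stored, or raise if the enum would reject it."""
    if not human_genome:
        return name  # TEXT column: any contig name is accepted
    # Assumption: bare names such as "X" are mapped to "chrX" before the check.
    label = name if name.startswith("chr") else f"chr{name}"
    if label not in HUMAN_CHROMOSOMES:
        raise ValueError(f"{name!r} is not a valid human chromosome")
    return label
```

A SARS-CoV-2 contig such as `MN908947.3` passes with `human_genome=False` but raises in human mode, which mirrors what happens when a non-human VCF is loaded into a schema initialized with the enum.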

`benchmark`

Run performance benchmarks on VCF parsing and loading.

vcf-pg-loader benchmark [OPTIONS]

Options:
  --vcf, -f                   Path to VCF file (uses built-in fixture if omitted)
  --synthetic, -s             Generate synthetic VCF with N variants
  --db, -d                    PostgreSQL URL (omit for parsing-only benchmark)
  --batch, -b                 Batch size [default: 50000]
  --normalize/--no-normalize  Test with/without normalization
  --json                      Output results as JSON (for CI integration)
  --quiet, -q                 Minimal output

Examples:

# Quick benchmark with built-in fixture (~2.6K variants)
vcf-pg-loader benchmark

# Generate and benchmark 100K synthetic variants
vcf-pg-loader benchmark --synthetic 100000

# Benchmark a specific VCF file
vcf-pg-loader benchmark --vcf /path/to/sample.vcf.gz

# Full benchmark including database loading
vcf-pg-loader benchmark --synthetic 50000 --db postgresql://localhost/variants

# JSON output for CI/scripting
vcf-pg-loader benchmark --synthetic 10000 --json

Sample output:

Benchmark Results (synthetic)
  Variants: 100,000
  Batch size: 50,000
  Normalized: True

Parsing: 100,000 variants in 0.94s (106,000/sec)

`doctor`

Check system dependencies and diagnose issues.

vcf-pg-loader doctor

# Example output:
Dependency Check
  Python         3.12.4   OK
  cyvcf2         0.30.22  OK
  asyncpg        0.29.0   OK
  Docker         24.0.5   OK
  Docker daemon  running  OK

`db`

Manage the local PostgreSQL database (Docker-based).

vcf-pg-loader db start   # Start PostgreSQL container
vcf-pg-loader db stop    # Stop the container
vcf-pg-loader db status  # Show running status and connection URL
vcf-pg-loader db url     # Print connection URL (for scripts)
vcf-pg-loader db shell   # Open psql shell
vcf-pg-loader db reset   # Remove container and all data

Architecture

Components

VCFHeaderParser - Parses VCF headers via cyvcf2's native API to extract INFO/FORMAT field definitions
VCFStreamingParser - Memory-efficient streaming iterator that yields batches of VariantRecord objects
VariantParser - Handles per-variant parsing with Number=A/R/G field extraction for multi-allelic decomposition
VCFLoader - Orchestrates loading with asyncpg binary COPY protocol
SchemaManager - Manages PostgreSQL schema creation and index management
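
The Number=A/R/G extraction that VariantParser performs can be sketched as follows. This is an illustration based on the VCF specification, not the project's code: `split_field` is a hypothetical name, and the Number=G branch assumes diploid genotypes, where the value for genotype j/k (j <= k) sits at index k*(k+1)/2 + j.

```python
def split_field(values, number, alt_index):
    """Pick the values belonging to one ALT allele when splitting a multi-allelic record.

    values:    the per-record list as it appears in the VCF
    number:    the header's Number for the field ("A", "R", "G", or anything else)
    alt_index: 0-based index of the ALT allele being extracted
    """
    if number == "A":
        # One value per ALT allele.
        return [values[alt_index]]
    if number == "R":
        # One value per allele, reference first: keep REF and the chosen ALT.
        return [values[0], values[alt_index + 1]]
    if number == "G":
        # One value per diploid genotype: keep 0/0, 0/a and a/a for allele a.
        a = alt_index + 1
        het = a * (a + 1) // 2   # index of genotype 0/a
        hom_alt = het + a        # index of genotype a/a
        return [values[0], values[het], values[hom_alt]]
    # Fixed-count or unbounded fields are copied unchanged.
    return list(values)
```

For a record with two ALT alleles, the six Number=G values are ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, so extracting the second ALT keeps indices 0, 3 and 5.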

Data Flow

VCF File → VCFStreamingParser → Batch Buffer → asyncpg COPY → PostgreSQL
                ↓
         VCFHeaderParser (field metadata)
                ↓
         VariantParser (Number=A/R/G extraction)
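
The batch buffer stage in the flow above might be sketched like this. `iter_batches` is a hypothetical helper, not the VCFStreamingParser API; it only shows how a stream of records can be grouped into fixed-size batches without holding the whole file in memory.

```python
from itertools import islice


def iter_batches(records, batch_size=50_000):
    """Yield lists of up to batch_size items from any iterable of records."""
    iterator = iter(records)
    # islice pulls at most batch_size items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Each yielded batch would then go to the COPY step, for instance through asyncpg's `Connection.copy_records_to_table`, so that memory use is bounded by the batch size rather than by the size of the VCF.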

Citations and Acknowledgments

This project was inspired by and builds upon several foundational tools in the genomics community:

Primary References

Slivar - Rapid variant filtering:

Pedersen, B.S., Brown, J.M., Dashnow, H. et al. Effective variant filtering and expected candidate variant yield in studies of rare human disease. npj Genom. Med. 6, 60 (2021). https://doi.org/10.1038/s41525-021-00227-3

GEMINI - Original SQL-based VCF database:

Paila, U., Chapman, B.A., Kirchner, R., & Quinlan, A.R. GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations. PLoS Comput Biol 9(7): e1003153 (2013). https://doi.org/10.1371/journal.pcbi.1003153

cyvcf2 - Python VCF parsing:

Pedersen, B.S. & Quinlan, A.R. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics 33(12), 1867–1869 (2017). https://doi.org/10.1093/bioinformatics/btx057

Supporting Tools

vcf2db: https://github.com/quinlan-lab/vcf2db
VCF Format: Danecek et al. (2011) https://doi.org/10.1093/bioinformatics/btr330
bcftools/HTSlib: Danecek et al. (2021) https://doi.org/10.1093/gigascience/giab008
GIAB Benchmarks: Zook et al. (2019) https://doi.org/10.1038/s41587-019-0074-6

Configuration

vcf-pg-loader supports TOML configuration files for persistent settings:

# vcf-pg-loader.toml
[vcf_pg_loader]
batch_size = 25000
workers = 16
normalize = true
drop_indexes = true
human_genome = true
log_level = "INFO"

Use with the --config flag:

vcf-pg-loader load sample.vcf.gz --config vcf-pg-loader.toml

CLI arguments override config file values.

Docker

Using Docker Compose (recommended for development)

# Start PostgreSQL and run a load
docker-compose up -d postgres
docker-compose run vcf-pg-loader load /data/sample.vcf.gz --db postgresql://vcfloader:vcfloader@postgres:5432/variants

# Or build and run standalone
docker build -t vcf-pg-loader .
docker run vcf-pg-loader --help

Docker Compose Services

postgres: PostgreSQL 16 with health checks
vcf-pg-loader: The loader application

Mount your VCF files to /data in the container.

Development

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=vcf_pg_loader

# Run only unit tests (skip integration)
uv run pytest -m "not integration"

Code Quality

# Lint
uv run ruff check src tests

# Type check
uv run mypy src

Documentation

CLI Reference - Complete command-line documentation
Genomics Concepts - Understanding VCF data for non-geneticists
Glossary of Terms - Technical terminology reference
Architecture - Detailed system design and implementation

License

MIT - See LICENSE for details.

Release history

This version

0.5.4

Dec 25, 2025

0.5.3

Dec 17, 2025

0.5.1

Dec 17, 2025

0.4.0

Dec 17, 2025

Download files

Source Distribution

vcf_pg_loader-0.5.4.tar.gz (790.2 kB)

Uploaded Dec 25, 2025 Source

Built Distribution

vcf_pg_loader-0.5.4-py3-none-any.whl (53.9 kB)

Uploaded Dec 25, 2025 Python 3

File details

Details for the file vcf_pg_loader-0.5.4.tar.gz.

File metadata

Download URL: vcf_pg_loader-0.5.4.tar.gz
Upload date: Dec 25, 2025
Size: 790.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vcf_pg_loader-0.5.4.tar.gz
Algorithm	Hash digest
SHA256	`65e61fa286e57d32f9e6de031cc62bd810a3bca4f0d255313faaedd3c7d5253b`
MD5	`98e076f585b216a8eae87c51aa79adab`
BLAKE2b-256	`3e8ffc9c9ddcc2b30a32ca83443b5631964a801d2ea41e420042524b31de89ac`
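
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch, not part of vcf-pg-loader; `sha256_of` and `verify` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's digest matches the published one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `verify("vcf_pg_loader-0.5.4.tar.gz", "65e61fa2...")` with the full digest from the table should return `True` for an intact download.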

Provenance

The following attestation bundles were made for vcf_pg_loader-0.5.4.tar.gz:

Publisher: release.yml on Zacharyr41/vcf-pg-loader

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vcf_pg_loader-0.5.4.tar.gz
- Subject digest: 65e61fa286e57d32f9e6de031cc62bd810a3bca4f0d255313faaedd3c7d5253b
- Sigstore transparency entry: 779614156
- Sigstore integration time: Dec 25, 2025
Source repository:
- Permalink: Zacharyr41/vcf-pg-loader@bf4332012ac9d35c5432f85aa862d80975eaa9e1
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Zacharyr41
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bf4332012ac9d35c5432f85aa862d80975eaa9e1
- Trigger Event: workflow_dispatch

File details

Details for the file vcf_pg_loader-0.5.4-py3-none-any.whl.

File metadata

Download URL: vcf_pg_loader-0.5.4-py3-none-any.whl
Upload date: Dec 25, 2025
Size: 53.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vcf_pg_loader-0.5.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b8adf5a4cc5fe926394d2a197f6bb988049f105ff8f6de083a832187f2a5bbe5`
MD5	`0f7218e980c2da56eeaad212743df346`
BLAKE2b-256	`c0bb2d1e74c4e2705e4278d4b416fb710e81b44e05b0a39921a7969202c12def`

Provenance

The following attestation bundles were made for vcf_pg_loader-0.5.4-py3-none-any.whl:

Publisher: release.yml on Zacharyr41/vcf-pg-loader

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vcf_pg_loader-0.5.4-py3-none-any.whl
- Subject digest: b8adf5a4cc5fe926394d2a197f6bb988049f105ff8f6de083a832187f2a5bbe5
- Sigstore transparency entry: 779614157
- Sigstore integration time: Dec 25, 2025
Source repository:
- Permalink: Zacharyr41/vcf-pg-loader@bf4332012ac9d35c5432f85aa862d80975eaa9e1
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Zacharyr41
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bf4332012ac9d35c5432f85aa862d80975eaa9e1
- Trigger Event: workflow_dispatch
