Structural-variant phasing from HP-tagged long-read BAMs

SvPhaser

Haplotype-aware structural-variant (SV) phasing and genotyping from long-read data


SvPhaser assigns haplotype-aware genotypes to pre-called structural variants (SVs) using HP-tagged long-read alignments (PacBio HiFi, ONT Q20+, etc.).

It fills a critical gap in long-read SV analysis:

  • SV callers (e.g. Sniffles2) discover variants
  • SvPhaser phases and genotypes them (1|0, 0|1, 1|1, or ./.)
  • with explicit read-level evidence and a quantitative genotype quality (GQ) score

SvPhaser is:

  • Caller-agnostic — works with any SV VCF format
  • Deterministic — no random sampling or HMMs; reproducible results
  • Designed for large-scale benchmarking and biological interpretation — CSV-first output for transparent analysis

Key features

  • Post-hoc SV phasing from HP-tagged BAM/CRAM — no re-calling needed
  • Per-chromosome parallelization — efficiently scales on HPC and multi-core systems
  • SV-type-aware evidence detection — specialized logic for DEL / INS / INV / BND / DUP
  • Deterministic Δ-based decision logic — haplotype imbalance thresholds, no sampling
  • Strict size consistency controls — optional size-matching for DEL/INS variants
  • Explicit confidence scoring — Phred-scaled GQ capped at 99, with derivable binning
  • CSV-first design — transparent per-SV metrics for benchmarking and debugging
  • VCF-compliant output — rich SVP_* INFO annotations for downstream analysis
  • Read-level evidence tracking — counts by haplotype (HP1, HP2, untagged) with reason codes
  • Hybrid support counting — combines HP-tagged + untagged reads with configurable thresholds

Installation

From PyPI (recommended)

# Requires Python >= 3.9
pip install svphaser

Optional extras:

pip install "svphaser[plots]"   # plotting utilities
pip install "svphaser[bench]"   # benchmarking helpers
pip install "svphaser[dev]"     # development + linting

From source

git clone https://github.com/SFGLab/SvPhaser.git
cd SvPhaser
pip install -e .

Inputs & requirements

SvPhaser requires only two inputs:

  1. Unphased SV VCF (.vcf / .vcf.gz)
    • Produced by an SV caller (e.g. Sniffles2)
    • May contain an RNAMES INFO field for precise read support
  2. HP-tagged BAM/CRAM
    • Long-read alignments with haplotype tags (HP=1/2)
    • Generated by an upstream phasing pipeline (e.g. WhatsHap)

⚠️ If the BAM does not contain HP tags, SvPhaser cannot assign haplotypes.
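
A quick pre-flight check can confirm that the alignments carry HP tags before a full run. The sketch below is not part of SvPhaser: `hp_tag_fraction` is a hypothetical helper that works on any iterable of read objects exposing `has_tag` and `get_tag`, as pysam's `AlignedSegment` does. `FakeRead` exists only so the example runs without a BAM file.

```python
from itertools import islice


def hp_tag_fraction(reads, limit=10_000):
    """Return the fraction of the first `limit` reads that carry HP=1 or HP=2."""
    seen = tagged = 0
    for read in islice(reads, limit):
        seen += 1
        if read.has_tag("HP") and read.get_tag("HP") in (1, 2):
            tagged += 1
    return tagged / seen if seen else 0.0


class FakeRead:
    """Stand-in for a pysam AlignedSegment (illustration only)."""

    def __init__(self, tags):
        self._tags = dict(tags)

    def has_tag(self, name):
        return name in self._tags

    def get_tag(self, name):
        return self._tags[name]


reads = [FakeRead({"HP": 1}), FakeRead({"HP": 2}), FakeRead({}), FakeRead({"HP": 1})]
print(hp_tag_fraction(reads))  # 0.75
```

With pysam installed, the same function should accept an open `pysam.AlignmentFile` in place of the list. A result of 0.0 means there is nothing for SvPhaser to phase.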


Quick start (CLI)

svphaser phase \
  sample_unphased.vcf.gz \
  sample.sorted_phased.bam \
  --out-dir results/ \
  --min-support 10 \
  --min-tagged-support 3 \
  --major-delta 0.60 \
  --equal-delta 0.10 \
  --support-mode hybrid \
  --dynamic-window \
  --tie-to-hom-alt \
  --gq-bins "30:High,10:Moderate" \
  --threads 32

Key parameters

Parameter               Default                Meaning
--min-support           10                     Minimum total supporting reads (HP1+HP2+NOHP) to keep an SV; SVs below this threshold are dropped to ./.
--min-tagged-support    3                      Minimum HP-tagged reads (HP1+HP2) needed for directional phasing (1|0 or 0|1)
--major-delta           0.60                   Haplotype imbalance threshold (max HP count / tagged total) for strong consensus
--equal-delta           0.10                   Tie threshold (|HP1-HP2| / tagged total); at or below this, both haplotypes are treated as supporting the SV (→ 1|1)
--tie-to-hom-alt        True                   When a tie is detected and both haplotypes carry reads, emit 1|1 (else ./.)
--support-mode          hybrid                 Count method: hybrid (HP-tagged preferred), tagged-only, or all
--gq-bins               "30:High,10:Moderate"  Confidence cutoffs for soft binning into labels (e.g., High≥30, Moderate≥10)
--threads               1                      Number of parallel workers (one per chromosome)
--no-svp-info           -                      Disable writing SVP_* INFO annotations to the output VCF
--size-match-required   True                   For DEL/INS: enforce size consistency between VCF record and read evidence
--size-tol-abs          10                     Absolute size tolerance (bp) for DEL/INS matching
--size-tol-frac         0.0                    Fractional size tolerance for DEL/INS matching
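
To illustrate how a `--gq-bins` string such as "30:High,10:Moderate" maps a GQ value to a label, here is a minimal sketch. `parse_gq_bins` and `gq_label` are hypothetical helpers, not SvPhaser functions, and returning an empty label for values below every cutoff is an assumption.

```python
def parse_gq_bins(spec):
    """Parse "30:High,10:Moderate" into [(30, "High"), (10, "Moderate")]."""
    bins = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        cutoff, label = item.split(":", 1)
        bins.append((int(cutoff), label.strip()))
    # Highest cutoff first, so the first match is the strongest label.
    return sorted(bins, key=lambda pair: pair[0], reverse=True)


def gq_label(gq, bins, default=""):
    """Return the label of the first bin whose cutoff is <= gq."""
    for cutoff, label in bins:
        if gq >= cutoff:
            return label
    return default  # assumption: no label below the lowest cutoff


bins = parse_gq_bins("30:High,10:Moderate")
for gq in (45, 12, 3):
    print(gq, repr(gq_label(gq, bins)))
```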

Outputs

For an input sample.vcf.gz, SvPhaser produces:

Primary: sample_phased.csv

A tabular summary with per-SV analysis, including:

  • Metadata: chrom, pos, id, end, svtype (DEL/INS/INV/BND/DUP)
  • Evidence counts: hp1, hp2, nohp (haplotype-tagged and untagged supporting reads)
  • Totals: tagged_total (HP1+HP2), support_total (HP1+HP2+NOHP)
  • Decision metrics:
    • delta — haplotype imbalance (max/tagged_total)
    • equal_delta — absolute difference (|HP1-HP2|/tagged_total)
    • tag_frac — fraction of support that is HP-tagged
  • Final calls:
    • gt — phased genotype (1|0, 0|1, 1|1, or ./.)
    • gq — Phred-scaled genotype quality (0–99)
    • gq_label — optional binned confidence level (e.g., "High", "Moderate")
    • reason — explanation code (e.g., "MinSupport", "Tie", "LowTagged")
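
The decision metrics listed above follow directly from the three read counts. The sketch below recomputes them from the definitions given in the list; `decision_metrics` is a hypothetical name, and returning 0.0 when a denominator is zero is an assumption rather than documented behavior.

```python
def decision_metrics(hp1, hp2, nohp):
    """Recompute the per-SV metrics from HP1, HP2, and untagged read counts."""
    tagged_total = hp1 + hp2
    support_total = tagged_total + nohp
    return {
        "tagged_total": tagged_total,
        "support_total": support_total,
        # Share of tagged reads on the dominant haplotype.
        "delta": max(hp1, hp2) / tagged_total if tagged_total else 0.0,
        # Normalized absolute difference between the two haplotypes.
        "equal_delta": abs(hp1 - hp2) / tagged_total if tagged_total else 0.0,
        # Fraction of all supporting reads that carry an HP tag.
        "tag_frac": tagged_total / support_total if support_total else 0.0,
    }


print(decision_metrics(hp1=9, hp2=1, nohp=2))
```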

Secondary: sample_phased.vcf

Interoperability output with:

  • FORMAT fields: GT (phased), GQ (quality)
  • INFO annotations (omitted when --no-svp-info is passed):
    • SVP_HP1, SVP_HP2, SVP_NOHP — read counts
    • SVP_TAGFRAC — fraction tagged
    • SVP_DELTA — haplotype imbalance
    • SVP_GQBIN — confidence level label

The CSV is the primary artifact for analysis; the VCF is for compatibility and downstream tools.
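
Because the CSV is plain tabular text, a first look at the results needs only the standard library. The sketch below tallies genotypes above a GQ cutoff. The three rows are invented for illustration, and the exact header spelling (`gt`, `gq`) is assumed from the column list above.

```python
import csv
import io
from collections import Counter

# Invented rows, only to make the example self-contained.
SAMPLE_CSV = """chrom,pos,id,svtype,hp1,hp2,nohp,gt,gq
chr1,10500,sv1,DEL,12,0,3,1|0,36
chr1,88200,sv2,INS,6,7,1,1|1,3
chr2,4100,sv3,DEL,1,0,4,./.,0
"""


def tally_genotypes(handle, min_gq=0):
    """Count genotype calls whose GQ is at least `min_gq`."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if int(row["gq"]) >= min_gq:
            counts[row["gt"]] += 1
    return counts


print(tally_genotypes(io.StringIO(SAMPLE_CSV)))
print(tally_genotypes(io.StringIO(SAMPLE_CSV), min_gq=30))
```

For a real run, pass an open file handle for the summary CSV (for example `results/sample_phased.csv`, opened with `newline=""`) instead of the `io.StringIO` object.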


Phasing decision logic (quick reference)

For each SV, SvPhaser counts reads by haplotype tag (HP=1, HP=2, or missing) and applies a deterministic decision tree:

  1. Minimum support gate: If support_total (HP1+HP2+NOHP) < min_support → emit ./. and drop SV
  2. Tagged support gate: If tagged_total (HP1+HP2) < min_tagged_support → emit ./.
  3. Tie detection: If |HP1 - HP2| / tagged_total ≤ equal_delta
    • If tie_to_hom_alt=True and both HP1 > 0 and HP2 > 0 → emit 1|1 (both haplotypes carry)
    • Else → emit ./. (ambiguous)
  4. Strong majority: If max(HP1, HP2) / tagged_total ≥ major_delta
    • If HP1 > HP2 → emit 1|0 (ALT on haplotype 1)
    • If HP2 > HP1 → emit 0|1 (ALT on haplotype 2)
  5. Else: → emit ./. (weak or no signal)
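
The five steps above can be sketched as a single function. This is a reading of the quick reference, not SvPhaser's `classify_haplotype`; `sketch_genotype` is a hypothetical name, the defaults mirror the CLI defaults, and the guard against zero tagged reads is an added assumption.

```python
def sketch_genotype(hp1, hp2, nohp=0, *, min_support=10, min_tagged_support=3,
                    major_delta=0.60, equal_delta=0.10, tie_to_hom_alt=True):
    """Return "1|0", "0|1", "1|1", or "./." following the decision tree."""
    tagged_total = hp1 + hp2
    support_total = tagged_total + nohp

    # 1. Minimum support gate.
    if support_total < min_support:
        return "./."
    # 2. Tagged support gate (also avoids dividing by zero below).
    if tagged_total == 0 or tagged_total < min_tagged_support:
        return "./."
    # 3. Tie detection.
    if abs(hp1 - hp2) / tagged_total <= equal_delta:
        if tie_to_hom_alt and hp1 > 0 and hp2 > 0:
            return "1|1"
        return "./."
    # 4. Strong majority.
    if max(hp1, hp2) / tagged_total >= major_delta:
        return "1|0" if hp1 > hp2 else "0|1"
    # 5. Weak or no signal.
    return "./."


for counts in [(12, 0, 3), (0, 12, 3), (6, 6, 0), (57, 43, 0), (2, 1, 1)]:
    print(counts, sketch_genotype(*counts))
```

With the default thresholds, a dominant-haplotype share between 0.55 and 0.60 is neither a tie nor a strong majority, which is why the (57, 43, 0) case stays unphased.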

Genotype Quality (GQ) is calculated from a Phred-scaled binomial tail probability:

  • For shallow coverage (N ≤ 200): exact binomial test
  • For deep coverage (N > 200): continuity-corrected normal approximation (avoids overflow)
  • Capped at 99 (Phred scale)
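
The GQ description above can be made concrete with a short sketch. This is not SvPhaser's `phasing_gq`: it assumes that N = HP1 + HP2, that the null hypothesis is an even 50/50 split between haplotypes, and that the tail is the one-sided probability of seeing at least max(HP1, HP2) reads on one haplotype. `sketch_gq` and its helpers are hypothetical names.

```python
import math

GQ_CAP = 99
EXACT_LIMIT = 200  # README: exact test for N <= 200, normal approximation above


def upper_tail_exact(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), using exact integer arithmetic."""
    numerator = sum(math.comb(n, i) for i in range(k, n + 1))
    return numerator / (1 << n)


def upper_tail_normal(k, n):
    """Continuity-corrected normal approximation of the same tail."""
    mean = n / 2
    sd = math.sqrt(n) / 2
    z = (k - 0.5 - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


def sketch_gq(n1, n2):
    """Phred-scale the tail probability and cap it at 99."""
    n = n1 + n2
    if n == 0:
        return 0
    k = max(n1, n2)
    p = upper_tail_exact(k, n) if n <= EXACT_LIMIT else upper_tail_normal(k, n)
    if p <= 0.0:  # underflow in the far tail: treat as maximally confident
        return GQ_CAP
    return min(GQ_CAP, max(0, round(-10 * math.log10(p))))


print(sketch_gq(10, 0))   # 30
print(sketch_gq(300, 0))  # 99
```

Ten reads on one haplotype and none on the other give p = 1/1024, which is about 30 on the Phred scale; the cap only takes effect with much deeper or more one-sided evidence.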

A full, implementation-faithful description of the algorithm—including:

  • evidence collection
  • haplotype decision logic
  • pseudoalgorithm
  • workflow diagram

is provided in:

➡️ docs/Methodology.md

This document is the authoritative reference for reviewers and users seeking algorithmic clarity.


Python API

from pathlib import Path
from svphaser import phase

# Simple usage
out_vcf, out_csv = phase(
    "sample.vcf.gz",
    "sample.sorted_phased.bam",
    out_dir="results",
)

# Full control
out_vcf, out_csv = phase(
    "sample.vcf.gz",
    "sample.sorted_phased.bam",
    out_dir="results",
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    support_mode="hybrid",
    bp_window=100,
    dynamic_window=True,
    tie_to_hom_alt=True,
    gq_bins="30:High,10:Moderate",
    threads=8,
    size_match_required=True,
    size_tol_abs=10,
    size_tol_frac=0.0,
)

print(f"Phased VCF: {out_vcf}")
print(f"Summary CSV: {out_csv}")

Returns a tuple: (phased_vcf_path, summary_csv_path)

Alternatively, use the lower-level API directly:

from pathlib import Path

from svphaser.phasing.io import phase_vcf
from svphaser.phasing.types import WorkerOpts

opts = WorkerOpts(
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    tie_to_hom_alt=True,
    support_mode="hybrid",
    bp_window=100,
    dynamic_window=True,
    size_match_required=True,
    size_tol_abs=10,
    size_tol_frac=0.0,
    gq_bins=[(30, "High"), (10, "Moderate")],
)

phase_vcf(
    Path("sample.vcf.gz"),
    Path("sample.sorted_phased.bam"),
    out_dir=Path("results"),
    worker_opts=opts,
    threads=8,
)

Repository structure

SvPhaser/
├─ src/svphaser/            # main package
│  ├─ cli.py               # CLI interface (Typer app)
│  ├─ __init__.py          # public API (phase() function)
│  ├─ logging.py           # logging configuration
│  ├─ phasing/             # core algorithms & I/O
│  │  ├─ algorithms.py     # haplotype classification, GQ calculation (pure math)
│  │  ├─ io.py            # orchestration, CSV/VCF writing (per-chromosome workers)
│  │  ├─ _workers.py      # internal: per-chromosome worker, read evidence counting
│  │  ├─ types.py         # WorkerOpts, CallTuple, type aliases
│  │  └─ __init__.py      # public API exports
│  └─ py.typed            # PEP 561 marker for type information
│
├─ tests/                   # unit & regression tests
│  ├─ test_algorithms.py   # GQ, classification logic
│  ├─ test_cli_smoke.py    # CLI smoke tests
│  ├─ test_io.py          # CSV/VCF output validation
│  ├─ test_workers.py     # BAM parsing, read counting
│  └─ data/               # minimal test fixtures
│
├─ docs/                    # documentation
│  ├─ Methodology.md       # algorithmic deep-dive (implementation-faithful)
│  └─ Presentation/        # slide decks & figures
│
├─ Benchmarking_Analysis/   # perf analysis & results
├─ pyproject.toml          # PEP 621 metadata, build config
├─ requirements.txt        # runtime dependencies (mirror of pyproject)
├─ requirements-dev.txt    # dev/test dependencies
├─ README.md              # this file
├─ CONTRIBUTING.md        # contributor guidelines
├─ CODE_OF_CONDUCT.md     # community standards
├─ LICENSE                # MIT
└─ CHANGELOG.md           # version history

Core modules

algorithms.py — Pure mathematics (no I/O)

  • phasing_gq(n1, n2) — Phred-scaled genotype quality (binomial tail + normal approx)
  • classify_haplotype(n1, n2, ...) — GT decision tree (returns ("1|0"|"0|1"|"1|1"|"./.", gq))
  • Threshold logic: major_delta, equal_delta, min_support, tie_to_hom_alt

_workers.py — Per-chromosome logic

  • Read BAM for each chromosome, count HP tags
  • Apply size-consistency filters (DEL/INS)
  • Call classify_haplotype() for each SV
  • Return formatted results (gt, gq, reason)
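
As an illustration of the size-consistency filter for DEL/INS, the sketch below compares the SV length in the VCF record with the length observed in a read. How SvPhaser combines the absolute and fractional tolerances is not stated here; taking the larger of the two is an assumption, and `size_matches` is a hypothetical helper.

```python
def size_matches(vcf_len, read_len, *, size_tol_abs=10, size_tol_frac=0.0):
    """Return True if the read-level event size is close enough to the VCF size."""
    expected = abs(vcf_len)  # SVLEN is often negative for deletions
    observed = abs(read_len)
    # Assumption: the effective tolerance is the larger of the two settings.
    tolerance = max(size_tol_abs, size_tol_frac * expected)
    return abs(observed - expected) <= tolerance


print(size_matches(-500, 495))                      # within 10 bp
print(size_matches(500, 520))                       # 20 bp off, rejected
print(size_matches(500, 520, size_tol_frac=0.05))   # 5% of 500 = 25 bp, accepted
```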

io.py — Orchestration & I/O

  • Parse VCF header, spawn workers (one per chromosome)
  • Merge per-chromosome results, apply global filters
  • Write phased VCF + CSV summary
  • Backfill optional columns (gq_label, tag_frac, etc.)
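
The orchestration pattern described for io.py (group SVs by chromosome, run one worker per chromosome, merge in a fixed order) can be sketched with the standard library. This is a simplified stand-in, not SvPhaser's code: `phase_chromosome` is a placeholder worker, and a thread pool is used for brevity without implying which pool type the package uses.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def phase_chromosome(chrom, positions):
    """Placeholder worker: a real one would read the BAM and classify each SV."""
    return [(chrom, pos, "./.") for pos in sorted(positions)]


def run_per_chromosome(sv_records, threads=1):
    """Dispatch one task per chromosome and merge results deterministically."""
    by_chrom = defaultdict(list)
    for chrom, pos in sv_records:
        by_chrom[chrom].append(pos)
    chroms = list(by_chrom)  # first-seen order, so output order is reproducible

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in submission order, whatever finishes first.
        per_chrom = pool.map(phase_chromosome, chroms, [by_chrom[c] for c in chroms])
        return [row for rows in per_chrom for row in rows]


records = [("chr2", 5), ("chr1", 9), ("chr1", 3)]
print(run_per_chromosome(records, threads=2))
```

Collecting results in submission order rather than completion order is what keeps the merged output identical from run to run, regardless of the thread count.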

Citing SvPhaser

If SvPhaser contributes to your research, please cite:

@software{svphaser2026,
  author  = {Pranjul Mishra and Sachin Gadakh},
  title   = {SvPhaser: Haplotype-aware phasing of structural variants from long-read data},
  version = {2.1.x},
  year    = {2026},
  url     = {https://github.com/SFGLab/SvPhaser},
  note    = {PyPI: https://pypi.org/project/svphaser/}
}

For maximum reproducibility, include the exact git commit hash used.


License

SvPhaser is released under the MIT License — see LICENSE.


Contact

Developed at SFG Lab (BioAI).

Bug reports and feature requests: please open a GitHub issue.
