Structural-variant phasing from HP-tagged long-read BAMs

SvPhaser

Haplotype-aware structural-variant (SV) phasing and genotyping from long-read data

SvPhaser assigns haplotype-aware genotypes to pre-called structural variants (SVs) using HP-tagged long-read alignments (PacBio HiFi, ONT Q20+, etc.).

It fills a critical gap in long-read SV analysis:

SV callers (e.g. Sniffles2) discover variants
SvPhaser phases and genotypes them (1|0, 0|1, 1|1, or ./.), with explicit read-level evidence and a quantitative genotype quality (GQ) score

SvPhaser is:

Caller-agnostic — works with any SV VCF format
Deterministic — no random sampling or HMMs; reproducible results
Designed for large-scale benchmarking and biological interpretation — CSV-first output for transparent analysis

Key features

Post-hoc SV phasing from HP-tagged BAM/CRAM — no re-calling needed
Per-chromosome parallelization — efficiently scales on HPC and multi-core systems
SV-type-aware evidence detection — specialized logic for DEL / INS / INV / BND / DUP
Deterministic Δ-based decision logic — haplotype imbalance thresholds, no sampling
Strict size consistency controls — optional size-matching for DEL/INS variants
Explicit confidence scoring — Phred-scaled GQ capped at 99, with derivable binning
CSV-first design — transparent per-SV metrics for benchmarking and debugging
VCF-compliant output — rich SVP_* INFO annotations for downstream analysis
Read-level evidence tracking — counts by haplotype (HP1, HP2, untagged) with reason codes
Hybrid support counting — combines HP-tagged + untagged reads with configurable thresholds

Installation

From PyPI (recommended)

# Requires Python >= 3.9
pip install svphaser

Optional extras:

pip install "svphaser[plots]"   # plotting utilities
pip install "svphaser[bench]"   # benchmarking helpers
pip install "svphaser[dev]"     # development + linting

From source

git clone https://github.com/SFGLab/SvPhaser.git
cd SvPhaser
pip install -e .

Inputs & requirements

SvPhaser requires only two inputs:

Unphased SV VCF (.vcf / .vcf.gz)
- Produced by an SV caller (e.g. Sniffles2)
- May optionally contain RNAMES INFO for precise read support
HP-tagged BAM/CRAM
- Long-read alignments with haplotype tags (HP=1/2)
- Generated by an upstream phasing pipeline (e.g. WhatsHap)

⚠️ If the BAM does not contain HP tags, SvPhaser cannot assign haplotypes.
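Before running SvPhaser, it can help to confirm that the alignments actually carry HP tags. The following is a minimal sketch and not part of SvPhaser: `hp_tag_summary` is a hypothetical helper that accepts any iterable of read objects exposing pysam-style `has_tag()`/`get_tag()` methods, and `FakeRead` is a stand-in so the example runs without a BAM file.

```python
from collections import Counter
from itertools import islice


def hp_tag_summary(reads, limit=10_000):
    """Count HP=1, HP=2 and untagged reads among the first `limit` reads.

    `reads` is any iterable of objects with pysam-style has_tag()/get_tag().
    """
    counts = Counter()
    for read in islice(reads, limit):
        if read.has_tag("HP"):
            counts[f"HP{read.get_tag('HP')}"] += 1
        else:
            counts["NOHP"] += 1
    return dict(counts)


class FakeRead:
    """Stand-in for an aligned read (hypothetical, for illustration only)."""

    def __init__(self, hp=None):
        self._hp = hp

    def has_tag(self, tag):
        return tag == "HP" and self._hp is not None

    def get_tag(self, tag):
        if not self.has_tag(tag):
            raise KeyError(tag)
        return self._hp


summary = hp_tag_summary([FakeRead(1), FakeRead(2), FakeRead(1), FakeRead()])
print(summary)
```

With pysam installed, the same function could be given a `pysam.AlignmentFile` opened on the BAM, since iterating it yields reads with these methods. A summary with no HP1 or HP2 counts suggests the file has not been haplotagged.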

Quick start (CLI)

svphaser phase \
  sample_unphased.vcf.gz \
  sample.sorted_phased.bam \
  --out-dir results/ \
  --min-support 10 \
  --min-tagged-support 3 \
  --major-delta 0.60 \
  --equal-delta 0.10 \
  --support-mode hybrid \
  --dynamic-window \
  --tie-to-hom-alt \
  --gq-bins "30:High,10:Moderate" \
  --threads 32

Key parameters

| Parameter | Default | Meaning |
| --- | --- | --- |
| `--min-support` | 10 | Minimum total supporting reads (HP1+HP2+NOHP) to keep an SV; SVs below this are emitted as `./.` |
| `--min-tagged-support` | 3 | Minimum HP-tagged reads (HP1+HP2) needed for directional phasing (`1\|0` or `0\|1`) |
| `--major-delta` | 0.60 | Haplotype imbalance threshold (max HP count / tagged total) for strong consensus |
| `--equal-delta` | 0.10 | Tie threshold (\|HP1-HP2\| / tagged total); at or below this, the SV is treated as supported by both haplotypes (→ `1\|1`) |
| `--tie-to-hom-alt` | True | When a tie is detected and both haplotypes carry reads, emit `1\|1` (else `./.`) |
| `--support-mode` | hybrid | Count method: `hybrid` (HP-tagged preferred), `tagged-only`, or `all` |
| `--gq-bins` | "30:High,10:Moderate" | Confidence cutoffs for soft binning into labels (e.g., High≥30, Moderate≥10) |
| `--threads` | 1 | Number of parallel workers (one per chromosome) |
| `--no-svp-info` | - | Disable writing `SVP_*` INFO annotations to the output VCF |
| `--size-match-required` | True | For DEL/INS: enforce size consistency between the VCF record and read evidence |
| `--size-tol-abs` | 10 | Absolute size tolerance (bp) for DEL/INS matching |
| `--size-tol-frac` | 0.0 | Fractional size tolerance for DEL/INS matching |
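The `--gq-bins` string maps GQ cutoffs to labels. As a rough illustration of how such a specification could be interpreted, the sketch below parses it into the list-of-tuples form used by the lower-level API. `parse_gq_bins` and `gq_label` are hypothetical helpers, and returning `None` for a GQ below every cutoff is an assumption; SvPhaser's own handling may differ.

```python
def parse_gq_bins(spec):
    """Turn "30:High,10:Moderate" into [(30, "High"), (10, "Moderate")]."""
    bins = []
    for item in spec.split(","):
        if not item.strip():
            continue  # tolerate empty entries such as a trailing comma
        cutoff, label = item.strip().split(":", 1)
        bins.append((int(cutoff), label.strip()))
    # Highest cutoff first, so the first match is the strongest label.
    return sorted(bins, key=lambda b: b[0], reverse=True)


def gq_label(gq, bins):
    """Return the label of the first bin whose cutoff is <= gq, else None.

    Returning None below all cutoffs is an assumption of this sketch.
    """
    for cutoff, label in bins:
        if gq >= cutoff:
            return label
    return None


bins = parse_gq_bins("30:High,10:Moderate")
print(bins, gq_label(45, bins), gq_label(12, bins), gq_label(3, bins))
```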

Outputs

For an input sample.vcf.gz, SvPhaser produces:

Primary: `sample_phased.csv`

A tabular summary with per-SV analysis, including:

Metadata: chrom, pos, id, end, svtype (DEL/INS/INV/BND/DUP)
Evidence counts: hp1, hp2, nohp (haplotype-tagged and untagged supporting reads)
Totals: tagged_total (HP1+HP2), support_total (HP1+HP2+NOHP)
Decision metrics:
- delta — haplotype imbalance (max/tagged_total)
- equal_delta — absolute difference (|HP1-HP2|/tagged_total)
- tag_frac — fraction of support that is HP-tagged
Final calls:
- gt — phased genotype (1|0, 0|1, 1|1, or ./.)
- gq — Phred-scaled genotype quality (0–99)
- gq_label — optional binned confidence level (e.g., "High", "Moderate")
- reason — explanation code (e.g., "MinSupport", "Tie", "LowTagged")
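The totals and decision metrics in the CSV follow directly from the three read counts. A minimal sketch of that arithmetic is shown below; `decision_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption made here so the function never divides by zero.

```python
def decision_metrics(hp1, hp2, nohp):
    """Derive CSV-style totals and ratios from per-haplotype read counts."""
    tagged_total = hp1 + hp2
    support_total = tagged_total + nohp
    # Zero denominators map to 0.0 (an assumption of this sketch).
    delta = max(hp1, hp2) / tagged_total if tagged_total else 0.0
    equal_delta = abs(hp1 - hp2) / tagged_total if tagged_total else 0.0
    tag_frac = tagged_total / support_total if support_total else 0.0
    return {
        "tagged_total": tagged_total,
        "support_total": support_total,
        "delta": delta,
        "equal_delta": equal_delta,
        "tag_frac": tag_frac,
    }


metrics = decision_metrics(hp1=8, hp2=2, nohp=2)
print(metrics)
```

For 8 HP1 reads, 2 HP2 reads, and 2 untagged reads, this gives `tagged_total` 10, `support_total` 12, `delta` 0.8, and `equal_delta` 0.6.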

Secondary: `sample_phased.vcf`

Interoperability output with:

FORMAT fields: GT (phased), GQ (quality)
INFO annotations (omitted when --no-svp-info is set):
- SVP_HP1, SVP_HP2, SVP_NOHP — read counts
- SVP_TAGFRAC — fraction tagged
- SVP_DELTA — haplotype imbalance
- SVP_GQBIN — confidence level label

The CSV is the primary artifact for analysis; the VCF is for compatibility and downstream tools.
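For downstream tools that consume the VCF instead of the CSV, the `SVP_*` annotations can be pulled from the INFO column with plain string handling. The sketch below uses only the standard library; `svp_info` is a hypothetical helper, the record is made up for illustration, and values are kept as strings because the declared types of these fields are not documented here.

```python
def svp_info(vcf_line):
    """Extract SVP_* key/value pairs from the INFO column of one VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th tab-separated column in VCF
    out = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key.startswith("SVP_"):
            out[key] = value if sep else True  # flag-style keys carry no value
    return out


# Made-up record for illustration; the values are not real SvPhaser output.
line = (
    "chr1\t10500\tsv1\tN\t<DEL>\t.\tPASS\t"
    "SVTYPE=DEL;SVP_HP1=9;SVP_HP2=1;SVP_NOHP=2;SVP_GQBIN=Moderate\t"
    "GT:GQ\t1|0:20"
)
print(svp_info(line))
```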

Phasing decision logic (quick reference)

For each SV, SvPhaser counts reads by haplotype tag (HP=1, HP=2, or missing) and applies a deterministic decision tree:

1. Minimum support gate: If support_total (HP1+HP2+NOHP) < min_support → emit ./. and drop SV
2. Tagged support gate: If tagged_total (HP1+HP2) < min_tagged_support → emit ./.
3. Tie detection: If |HP1 - HP2| / tagged_total ≤ equal_delta
   - If tie_to_hom_alt=True and both HP1 > 0 and HP2 > 0 → emit 1|1 (both haplotypes carry)
   - Else → emit ./. (ambiguous)
4. Strong majority: If max(HP1, HP2) / tagged_total ≥ major_delta
   - If HP1 > HP2 → emit 1|0 (ALT on haplotype 1)
   - If HP2 > HP1 → emit 0|1 (ALT on haplotype 2)
5. Else: → emit ./. (weak or no signal)
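The decision tree above can be sketched in a few lines of Python. This is an illustration written from the steps listed here, not SvPhaser's own `classify_haplotype()`; the function name `classify_sketch` is hypothetical, and the reason codes "Majority" and "WeakSignal" are invented for this example (only "MinSupport", "Tie", and "LowTagged" appear in this README).

```python
def classify_sketch(hp1, hp2, nohp=0, *, min_support=10, min_tagged_support=3,
                    major_delta=0.60, equal_delta=0.10, tie_to_hom_alt=True):
    """Return (gt, reason) following the documented gates in order."""
    tagged_total = hp1 + hp2
    support_total = tagged_total + nohp

    # 1. Minimum support gate
    if support_total < min_support:
        return "./.", "MinSupport"
    # 2. Tagged support gate (also guards the divisions below)
    if tagged_total < min_tagged_support or tagged_total == 0:
        return "./.", "LowTagged"
    # 3. Tie detection
    if abs(hp1 - hp2) / tagged_total <= equal_delta:
        if tie_to_hom_alt and hp1 > 0 and hp2 > 0:
            return "1|1", "Tie"
        return "./.", "Tie"
    # 4. Strong majority
    if max(hp1, hp2) / tagged_total >= major_delta:
        return ("1|0" if hp1 > hp2 else "0|1"), "Majority"
    # 5. Weak or no signal
    return "./.", "WeakSignal"


for counts in [(9, 1, 2), (1, 9, 2), (5, 5, 2), (2, 1, 1), (1, 1, 10), (23, 17, 0)]:
    print(counts, classify_sketch(*counts))
```

With the default thresholds, the last case (23 vs 17 tagged reads) is neither a tie (0.15 > 0.10) nor a strong majority (0.575 < 0.60), so it falls through to `./.`.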

Genotype Quality (GQ) is calculated from a Phred-scaled binomial tail probability:

For shallow coverage (N ≤ 200): exact binomial test
For deep coverage (N > 200): continuity-corrected normal approximation (avoids overflow)
Capped at 99 (Phred scale)
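One way to read the GQ description is sketched below. It assumes a one-sided upper tail of a Binomial(N, 0.5) null at max(n1, n2), rounds to the nearest integer, and returns 0 when there are no tagged reads; none of these details are stated in this README, so treat `phasing_gq_sketch` as a hypothetical illustration, not the packaged `phasing_gq()`.

```python
import math


def binom_upper_tail(k, n):
    """Exact P(X >= k) for X ~ Binomial(n, 0.5), using integer arithmetic."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n


def normal_upper_tail(k, n):
    """Continuity-corrected normal approximation to P(X >= k)."""
    mean = n / 2
    sd = math.sqrt(n) / 2
    z = (k - 0.5 - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


def phasing_gq_sketch(n1, n2, exact_limit=200, cap=99):
    """Phred-scale the tail probability and cap the result at `cap`."""
    n = n1 + n2
    if n == 0:
        return 0  # no tagged evidence (assumption of this sketch)
    k = max(n1, n2)
    p = binom_upper_tail(k, n) if n <= exact_limit else normal_upper_tail(k, n)
    if p <= 0.0:
        return cap  # probability underflowed to zero
    return min(cap, max(0, round(-10 * math.log10(p))))


print(phasing_gq_sketch(10, 0), phasing_gq_sketch(5, 5), phasing_gq_sketch(300, 0))
```

Under these assumptions, 10 reads all on one haplotype give a tail probability of 1/1024 and a GQ of 30, while a 5 vs 5 split gives a GQ of 2.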

A full, implementation-faithful description of the algorithm, covering evidence collection, haplotype decision logic, a pseudoalgorithm, and a workflow diagram, is provided in docs/Methodology.md. That document is the authoritative reference for reviewers and users seeking algorithmic clarity.

Python API

from pathlib import Path
from svphaser import phase

# Simple usage
out_vcf, out_csv = phase(
    "sample.vcf.gz",
    "sample.sorted_phased.bam",
    out_dir="results",
)

# Full control
out_vcf, out_csv = phase(
    "sample.vcf.gz",
    "sample.sorted_phased.bam",
    out_dir="results",
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    support_mode="hybrid",
    bp_window=100,
    dynamic_window=True,
    tie_to_hom_alt=True,
    gq_bins="30:High,10:Moderate",
    threads=8,
    size_match_required=True,
    size_tol_abs=10,
    size_tol_frac=0.0,
)

print(f"Phased VCF: {out_vcf}")
print(f"Summary CSV: {out_csv}")

Returns a tuple: (phased_vcf_path, summary_csv_path)

Alternatively, use the lower-level API directly:

from pathlib import Path

from svphaser.phasing.io import phase_vcf
from svphaser.phasing.types import WorkerOpts

opts = WorkerOpts(
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    tie_to_hom_alt=True,
    support_mode="hybrid",
    bp_window=100,
    dynamic_window=True,
    size_match_required=True,
    size_tol_abs=10,
    size_tol_frac=0.0,
    gq_bins=[(30, "High"), (10, "Moderate")],
)

phase_vcf(
    Path("sample.vcf.gz"),
    Path("sample.bam"),
    out_dir=Path("results"),
    worker_opts=opts,
    threads=8,
)

Repository structure

SvPhaser/
├─ src/svphaser/            # main package
│  ├─ cli.py               # CLI interface (Typer app)
│  ├─ __init__.py          # public API (phase() function)
│  ├─ logging.py           # logging configuration
│  ├─ phasing/             # core algorithms & I/O
│  │  ├─ algorithms.py     # haplotype classification, GQ calculation (pure math)
│  │  ├─ io.py            # orchestration, CSV/VCF writing (per-chromosome workers)
│  │  ├─ _workers.py      # internal: per-chromosome worker, read evidence counting
│  │  ├─ types.py         # WorkerOpts, CallTuple, type aliases
│  │  └─ __init__.py      # public API exports
│  └─ py.typed            # PEP 561 marker for type information
│
├─ tests/                   # unit & regression tests
│  ├─ test_algorithms.py   # GQ, classification logic
│  ├─ test_cli_smoke.py    # CLI smoke tests
│  ├─ test_io.py          # CSV/VCF output validation
│  ├─ test_workers.py     # BAM parsing, read counting
│  └─ data/               # minimal test fixtures
│
├─ docs/                    # documentation
│  ├─ Methodology.md       # algorithmic deep-dive (implementation-faithful)
│  └─ Presentation/        # slide decks & figures
│
├─ Benchmarking_Analysis/   # perf analysis & results
├─ pyproject.toml          # PEP 621 metadata, build config
├─ requirements.txt        # runtime dependencies (mirror of pyproject)
├─ requirements-dev.txt    # dev/test dependencies
├─ README.md              # this file
├─ CONTRIBUTING.md        # contributor guidelines
├─ CODE_OF_CONDUCT.md     # community standards
├─ LICENSE                # MIT
└─ CHANGELOG.md           # version history

Core modules

algorithms.py — Pure mathematics (no I/O)

phasing_gq(n1, n2) — Phred-scaled genotype quality (binomial tail + normal approx)
classify_haplotype(n1, n2, ...) — GT decision tree (returns ("1|0"|"0|1"|"1|1"|"./.", gq))
Threshold logic: major_delta, equal_delta, min_support, tie_to_hom_alt

_workers.py — Per-chromosome logic

Read BAM for each chromosome, count HP tags
Apply size-consistency filters (DEL/INS)
Call classify_haplotype() for each SV
Return formatted results (gt, gq, reason)
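The size-consistency filter for DEL/INS can be pictured as a tolerance check between the SV length in the VCF and the length observed in a read. This README does not say how the absolute and fractional tolerances combine, so the sketch below assumes the larger of the two applies; `size_matches` is a hypothetical helper, not a function from `_workers.py`.

```python
def size_matches(vcf_len, read_len, size_tol_abs=10, size_tol_frac=0.0):
    """Return True if a read's event length is close enough to the VCF length.

    Assumption: the effective tolerance is the larger of the absolute
    tolerance (bp) and the fractional tolerance times the VCF length.
    """
    expected = abs(vcf_len)  # deletion lengths are often written as negative
    tolerance = max(size_tol_abs, size_tol_frac * expected)
    return abs(abs(read_len) - expected) <= tolerance


print(size_matches(-500, 505))                        # within 10 bp
print(size_matches(500, 520))                         # 20 bp off, rejected
print(size_matches(1000, 1080, size_tol_frac=0.1))    # within 10% of 1000 bp
```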

io.py — Orchestration & I/O

Parse VCF header, spawn workers (one per chromosome)
Merge per-chromosome results, apply global filters
Write phased VCF + CSV summary
Backfill optional columns (gq_label, tag_frac, etc.)
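The orchestration pattern described above (split by chromosome, run one worker each, merge in a stable order) can be sketched with the standard library. All names here are hypothetical, the worker is a placeholder that does no real phasing, and threads are used only to keep the example short; how SvPhaser actually launches its workers is not described in this README.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_chrom(records):
    """Bucket SV records by chromosome, keeping first-seen chromosome order."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["chrom"]].append(rec)
    return dict(groups)


def phase_chromosome(chrom, records):
    """Placeholder worker: a real one would read the BAM and classify each SV."""
    ordered = sorted(records, key=lambda r: r["pos"])
    return [(chrom, rec["pos"], "./.") for rec in ordered]


def run_per_chromosome(records, threads=1):
    """Run one worker per chromosome and merge results deterministically."""
    groups = group_by_chrom(records)
    chroms = list(groups)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in input order, regardless of completion order.
        results = pool.map(lambda c: phase_chromosome(c, groups[c]), chroms)
        return [row for rows in results for row in rows]


records = [
    {"chrom": "chr2", "pos": 50},
    {"chrom": "chr1", "pos": 300},
    {"chrom": "chr1", "pos": 100},
]
merged = run_per_chromosome(records, threads=2)
print(merged)
```

Because `map()` preserves input order, the merged output is the same for any thread count, which is consistent with the deterministic behavior the README describes.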

Citing SvPhaser

If SvPhaser contributes to your research, please cite:

@software{svphaser2026,
  author  = {Pranjul Mishra and Sachin Gadakh},
  title   = {SvPhaser: Haplotype-aware phasing of structural variants from long-read data},
  version = {2.1.x},
  year    = {2026},
  url     = {https://github.com/SFGLab/SvPhaser},
  note    = {PyPI: https://pypi.org/project/svphaser/}
}

For maximum reproducibility, include the exact git commit hash used.

License

SvPhaser is released under the MIT License — see LICENSE.

Contact

Developed at SFG Lab (BioAI).

Pranjul Mishra — pranjul.mishra@proton.me
Sachin Gadakh — s.gadakh@cent.uw.edu.pl

Bug reports and feature requests: please open a GitHub issue.

Release history

2.2.2 (this version): Apr 6, 2026
2.2.0: Mar 15, 2026
2.1.7: Mar 14, 2026
2.1.6.post1.dev1 (pre-release): Mar 14, 2026
2.1.6: Feb 10, 2026
2.1.3: Feb 8, 2026
2.1.2: Feb 6, 2026
2.1.0: Jan 7, 2026
2.0.6: Nov 20, 2025
2.0.4: Nov 20, 2025
2.0.2: Nov 14, 2025

