CLI haplotype viewer with C++ backend and Python plotting
haplokit
CLI-first haplotype viewer with bcftools-like selectors, C++ backend acceleration, and Python plotting.
Installation
Requirements
- Linux or WSL (release lane is Linux-first)
- Python 3.10+
- C++17 toolchain and CMake 3.22+ for source builds
Install from source
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .[test]
Quick Start
haplokit view data/var.sorted.vcf.gz -r scaffold_1:4300-5000 --output-file out
This writes:
- `out/hapresult.tsv`
- `out/hap_summary.tsv`
Figures
Haplotype Summary Table
The haplotype summary table displays all identified haplotypes across variant positions in the selected genomic region. Each visual component conveys specific information:
- Title: Shows the genomic region (`CHR:start-end`) and, when `--gff` is provided, the overlapping gene name.
- Functional category strip (`--gff` only): A thin colored bar above the POS row. Each variant position is classified into a SnpEff-style functional category (CDS, UTR, exon, intron, upstream/downstream, intergenic) based on GFF3 annotation. Provides an at-a-glance view of which functional regions the variants fall in.
- POS row: Physical positions of each variant site.
- ALLELE row: The alternate allele at each position, color-coded by allele identity.
- Haplotype rows (H001, H002, ...): Each row represents one distinct haplotype pattern. Cells show the allele carried at each position. Empty cells indicate the reference allele.
- Population columns (`--population` only): Columns for each population (e.g., wild, landrace, cultivar) showing sample counts per haplotype.
- n/N column: Frequency of each haplotype as count/total.
- SnpEff-style legend (`--gff` only): Legend at the bottom showing functional category colors (CDS, UTR, exon, intron, intergenic).
- Indel footnotes: Multi-allele indels (e.g., `T/A,GG`) are annotated with superscript markers; full sequences are explained in the footnote area.
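To make the relationship between haplotype rows and the n/N column concrete, here is a minimal sketch of exact grouping. It is not haplokit's implementation (grouping runs in the C++ backend). The function name `group_haplotypes`, the input layout, and the frequency-based ordering of labels are all assumptions made for illustration.

```python
from collections import defaultdict


def group_haplotypes(genotypes):
    """Group samples that carry identical allele patterns.

    genotypes: dict mapping sample ID -> tuple of alleles, one per site
    (a hypothetical layout, not haplokit's internal format).
    Returns a list of (label, pattern, samples), most frequent first.
    """
    groups = defaultdict(list)
    for sample, alleles in genotypes.items():
        groups[tuple(alleles)].append(sample)
    # Assumed ordering: descending sample count, then pattern for stability.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        (f"H{i:03d}", pattern, samples)
        for i, (pattern, samples) in enumerate(ordered, start=1)
    ]


# Toy input: four samples genotyped at three sites.
genotypes = {
    "s1": ("A", "T", "G"),
    "s2": ("A", "T", "G"),
    "s3": ("C", "T", "G"),
    "s4": ("A", "T", "G"),
}
total = len(genotypes)
for label, pattern, samples in group_haplotypes(genotypes):
    # Prints the label, the allele pattern, and the n/N frequency.
    print(label, "".join(pattern), f"{len(samples)}/{total}")
```

Each distinct allele tuple becomes one row, and the count of samples sharing it over the total gives the n/N value.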
Haplotype Geographic Distribution
The geographic distribution map overlays haplotype composition pie charts onto a base map, revealing spatial patterns of haplotype variation across sampling locations.
- Base map: Province-level boundary polygons from GeoJSON (China via Aliyun DataV API).
- Pie charts: At each sampling location, a pie chart shows haplotype composition. Each slice represents one haplotype.
- √ frequency scaling: Symbol size is proportional to √(total sample count), matching R's `symbol.lim` logic.
- Count labels: The total sample count at each location is displayed at the center of each pie chart.
- Coordinate axes: Longitude (x) and Latitude (y) with degree tick marks in muted grey.
- Legend: Haplotype color legend in the upper-left corner identifies each haplotype.
- Title: Optional figure title.
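The square-root scaling can be illustrated with a short sketch. It assumes that "symbol size" means the pie radius and that the largest location is drawn at a chosen maximum radius. The helper `pie_radius` and its parameters are hypothetical, and it does not model any lower or upper size limits the real plotting code may apply.

```python
import math


def pie_radius(count, max_count, max_radius=1.0):
    """Radius proportional to sqrt(count); the largest site gets max_radius."""
    if count < 0 or max_count <= 0:
        raise ValueError("count must be >= 0 and max_count must be > 0")
    return max_radius * math.sqrt(count / max_count)


# Toy sample counts per location (made-up values).
site_counts = {"site_a": 4, "site_b": 16, "site_c": 64}
largest = max(site_counts.values())
radii = {name: pie_radius(n, largest) for name, n in site_counts.items()}
print(radii)
```

With radius scaled by the square root, pie area grows linearly with sample count, so a location with four times the samples covers four times the area rather than sixteen.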
bcftools-like selector semantics
haplokit view uses selector options modeled on those used in bcftools workflows.
- `-r/--region chr:start-end` selects a range
- `-r/--region chr:pos` selects a single site
- `-R/--regions-file regions.bed` processes BED rows independently
- `-S/--samples-file samples.list` keeps only listed samples
Validation rules:
- Exactly one of `-r` or `-R` is required
- `-r` and `-R` are mutually exclusive
- `--by site` is only valid with `-r chr:pos`
- `--by region` conflicts with `-r chr:pos`
- `--by site` conflicts with `-r chr:start-end`
C++ backend acceleration
The Python CLI delegates the heavy haplotype grouping work to the haplokit_cpp binary.
Vendored libraries
- htslib: C library for reading/writing high-throughput sequencing data. Provides native support for VCF and BCF formats, with indexed random access and efficient genotype decoding. Linked as a static library at build time.
- gffsub: Lightweight GFF3/GTF parser and filter. Parses gene annotation files, supports feature-type filtering (longest transcript selection), and provides overlap/nearest-gene queries for haplotype annotation.
Backend discovery order:
- `HAPLOKIT_CPP_BIN`
- packaged binary: `haplokit/_bin/haplokit_cpp`
- repo builds: `build-wsl/haplokit_cpp`, then `build/haplokit_cpp`
- fallback local build: `cmake -S . -B build-wsl` and `cmake --build build-wsl --clean-first -j1`
If no backend is found after discovery/build, the CLI exits with an explicit error.
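The lookup part of the discovery order can be sketched as follows. The function `find_backend` and its parameters are hypothetical and do not reflect haplokit's internal API. The sketch omits the fallback CMake build, and it assumes that an unusable `HAPLOKIT_CPP_BIN` value falls through to the next candidate, which the real CLI may handle differently.

```python
import os
from pathlib import Path


def find_backend(repo_root, package_dir, env=None):
    """Return the first executable backend in the documented order, or None."""
    env = os.environ if env is None else env
    candidates = []
    override = env.get("HAPLOKIT_CPP_BIN")
    if override:
        candidates.append(Path(override))  # 1. explicit override
    candidates.append(Path(package_dir) / "_bin" / "haplokit_cpp")  # 2. packaged
    candidates.append(Path(repo_root) / "build-wsl" / "haplokit_cpp")  # 3. repo builds
    candidates.append(Path(repo_root) / "build" / "haplokit_cpp")
    for path in candidates:
        if path.is_file() and os.access(path, os.X_OK):
            return path
    return None  # the caller would attempt the fallback build or exit with an error


# Example call from a repository checkout; the result depends on the machine.
print(find_backend(repo_root=".", package_dir="haplokit"))
```

Setting `HAPLOKIT_CPP_BIN` is the simplest way to pin a specific build, as the test command under Contributing does.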
Command
haplokit view <input_vcf> (-r <region> | -R <regions.bed>) [options]
<input_vcf> should be an indexed VCF/BCF file (.vcf.gz + .tbi, or BCF index).
Options and Defaults
| Option | Type | Default | Behavior |
|---|---|---|---|
| `input_vcf` | positional path | None in parser (required in practice) | Input indexed VCF/BCF |
| `-r, --region` | selector string | None | `chr:start-end` or `chr:pos` |
| `-R, --regions-file` | BED path | None | BED rows need at least 3 tab-separated columns |
| `-S, --samples-file` | sample list path | None | One sample ID per line |
| `--by` | auto \| region \| site | auto | Auto-resolves from selector shape; parser enforces consistency |
| `--impute` | flag | False | Impute missing genotypes as reference before grouping |
| `-g, --gff3, --gff` | GFF/GFF3 path | None | Enable gene overlap/nearest annotation |
| `-p, --population` | population file path | None | Tab-separated file mapping sample → population group |
| `--output` | summary \| detail | summary | Used in JSONL mode; TSV mode always writes both tables |
| `--output-format` | tsv \| jsonl | tsv | tsv is default contract |
| `--output-file` | path | None | Directory, prefix file, or explicit JSONL file |
| `--plot` | flag | False | Render one haplotype table figure per selector |
| `--plot-format` | png \| pdf \| svg \| tiff | png | Output figure format |
| `--max-diff` | float in [0,1] | None | Enables approximate grouping with threshold |
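As an illustration of the `-p/--population` input, the sketch below parses a two-column, tab-separated mapping and counts samples per population. The helper names, the assumption that the file has no header row, the skipping of `#` comment lines, and the `unassigned` bucket are not taken from haplokit and are only for illustration.

```python
import csv
import io
from collections import Counter


def read_population_map(handle):
    """Parse tab-separated 'sample<TAB>population' rows into a dict."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, short, or comment rows (assumed convention)
        mapping[row[0].strip()] = row[1].strip()
    return mapping


def count_by_population(samples, mapping):
    """Count how many of the given samples fall in each population."""
    return Counter(mapping.get(sample, "unassigned") for sample in samples)


# In-memory stand-in for a population file (made-up sample IDs).
pop_file = io.StringIO("s1\twild\ns2\tlandrace\ns3\twild\n")
mapping = read_population_map(pop_file)
counts = count_by_population(["s1", "s3", "s2"], mapping)
print(dict(counts))
```

Applying `count_by_population` to the samples of one haplotype gives the kind of per-population counts shown in the population columns of the summary table.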
Output Behavior (Current Implementation)
--output-format tsv (default)
- Always writes both:
  - `hapresult*.tsv`
  - `hap_summary*.tsv`
- `--output` does not change this TSV pair behavior.
- With `--plot`, also writes one figure file per selector (format set by `--plot-format`).
- With `--gff`/`--gff3`, also writes `gff_ann_summary.tsv`.
File naming rules:
- Single selector, no `--output-file`: write to current directory as `hapresult.tsv` and `hap_summary.tsv`
- `--output-file <dir>` (no suffix): write into that directory
- `--output-file <path/custom.tsv>`: use prefix naming:
  - `custom.hapresult.tsv`
  - `custom.hap_summary.tsv`
- BED multi-selector mode: append slug per selector:
  - `hapresult_<chrom>_<start>_<end>.tsv`
  - `hap_summary_<chrom>_<start>_<end>.tsv`
  - if collisions occur, append `_<NNN>`
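The naming rules above can be sketched in a few lines. The helpers `tsv_pair` and `bed_slug` are hypothetical. The sketch assumes that prefix-named files are written next to the given path, and that collision suffixes are zero-padded to three digits starting at `001`; the actual numbering may differ.

```python
from pathlib import Path


def tsv_pair(output_file=None):
    """Return (hapresult_path, hap_summary_path) for a single selector."""
    if output_file is None:
        # No --output-file: write into the current directory.
        return Path("hapresult.tsv"), Path("hap_summary.tsv")
    target = Path(output_file)
    if target.suffix == "":
        # No suffix: treat the value as a directory.
        return target / "hapresult.tsv", target / "hap_summary.tsv"
    # File-like value: use its stem as a prefix.
    prefix = target.stem
    return (
        target.with_name(f"{prefix}.hapresult.tsv"),
        target.with_name(f"{prefix}.hap_summary.tsv"),
    )


def bed_slug(chrom, start, end, used):
    """Return a per-selector slug, appending _<NNN> when it collides."""
    base = f"{chrom}_{start}_{end}"
    slug, n = base, 0
    while slug in used:
        n += 1
        slug = f"{base}_{n:03d}"
    used.add(slug)
    return slug


print(tsv_pair("out"))
print(tsv_pair("results/custom.tsv"))
used = set()
print(f"hapresult_{bed_slug('chr1', 100, 200, used)}.tsv")
print(f"hapresult_{bed_slug('chr1', 100, 200, used)}.tsv")
```

Tracking used slugs in a set keeps duplicate BED rows from overwriting each other's output files.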
--output-format jsonl (compatibility mode)
- If `--output-file` is omitted, write JSONL to stdout.
- If `--output-file` is a directory path, write `<dir>/result.jsonl`.
- If `--output-file` is a file path, write to that exact file.
- `--output summary|detail` is respected in JSONL mode.
Examples
Region mode (strict exact grouping)
haplokit view in.vcf.gz -r chr1:1000-2000 --output-file out
Site mode
haplokit view in.vcf.gz -r chr1:1450 --output-file out_site
BED batch + sample subset + plot + annotation
haplokit view in.vcf.gz -R regions.bed -S samples.list --plot --gff genes.gff3 --output-file out_bed
With population groups and PNG figure
haplokit view in.vcf.gz -r chr1:1000-2000 -p popgroup.txt --plot --plot-format png --output-file out
JSONL detail mode with approximate grouping
haplokit view in.vcf.gz -r chr1:1000-2000 --max-diff 0.2 --output-format jsonl --output detail --output-file result.jsonl
Upgrading Notes
- Default output is now TSV (`hapresult`/`hap_summary` pair).
- Keep `--output-format jsonl` only for compatibility workflows.
- Default figure format is now PNG (was triple SVG+PDF+TIFF). Use `--plot-format` to select.
Contributing
Linux/WSL validation path:
cmake -S . -B build-wsl
cmake --build build-wsl -j12
HAPLOKIT_CPP_BIN=$PWD/build-wsl/haplokit_cpp python -m pytest -q tests/python
ctest --test-dir build-wsl --output-on-failure
See:
- `docs/specs/haplokit-view-cli.md`
- `docs/specs/haplokit-result-schema.md`
- `docs/development/haplokit-linux-workflow.md`
- `docs/release/pypi-release.md`
Acknowledgements
haplokit is inspired by geneHapR:
Zhang, R., Jia, G. & Diao, X. geneHapR: an R package for gene haplotypic statistics and visualization. BMC Bioinformatics 24, 199 (2023). https://doi.org/10.1186/s12859-023-05318-9
License
GPL-3.0-or-later.