CLI haplotype viewer with C++ backend and Python plotting
haplokit
CLI haplotype viewer with bcftools-like selectors, C++ backend, and Python plotting.
Installation
pip install haplokit
Source build requires Linux/WSL, Python 3.10+, C++17 toolchain, CMake 3.22+ — see Contributing.
Quick Start
haplokit view data/var.sorted.vcf.gz -r scaffold_1:4300-5000 --output-file out
Output:
- out/hapresult.tsv: per-sample haplotype detail
- out/hap_summary.tsv: haplotype count summary
Usage Scenarios
1. Region query — strict haplotype grouping
Identify all distinct haplotypes in a genomic region.
haplokit view in.vcf.gz -r chr1:1000-2000 --output-file out
Produces hapresult.tsv + hap_summary.tsv in out/. Each haplotype row shows the exact allele pattern; samples with any heterozygous or missing call are excluded.
2. Single-site query
Analyze haplotypes at a single variant position.
haplokit view in.vcf.gz -r chr1:1450 --output-file out_site
With the default --by auto, a chr:pos selector resolves to site mode.
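The selector-shape rule can be illustrated with a minimal sketch. The `infer_mode` helper and its regular expressions are hypothetical and are not part of haplokit; they only assume the two selector shapes documented here (chr:pos and chr:start-end) and contig names that contain no colon.

```python
import re

# Hypothetical patterns for the two documented selector shapes.
_SITE = re.compile(r"^[^:\s]+:\d+$")          # chr:pos
_REGION = re.compile(r"^[^:\s]+:\d+-\d+$")    # chr:start-end


def infer_mode(selector: str) -> str:
    """Return 'site' for chr:pos and 'region' for chr:start-end."""
    if _SITE.match(selector):
        return "site"
    if _REGION.match(selector):
        return "region"
    raise ValueError(f"unrecognized selector: {selector!r}")


print(infer_mode("chr1:1450"))
print(infer_mode("chr1:1000-2000"))
```

How the real CLI treats contig names that contain colons is not covered by this sketch.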
3. Gene annotation + figure
Overlay gene structure on the haplotype table.
haplokit view in.vcf.gz -r chr1:1000-2000 --gff genes.gff3 --plot --output-file out
genes.gff3 format (standard GFF3):
chr1 . gene 1000 3000 . + . ID=gene1;Name=GeneA
chr1 . CDS 1200 1500 . + 0 ID=cds1;Parent=gene1
Adds a SnpEff-style functional category strip (CDS, UTR, exon, intron, intergenic) above the variant positions. Writes a figure (out/*.png) and gff_ann_summary.tsv.
Figure components:
- Title: region + overlapping gene name (when --gff provided)
- Function strip (--gff only): colored bar classifying each variant by functional category
- POS / ALLELE rows: variant positions and alternate alleles
- Haplotype rows (H001, H002, ...): allele per position; empty = reference
- Population columns (--population): sample counts per haplotype per group
- n/N: haplotype frequency
- Legend (--gff only): functional category colors
- Indel footnotes: multi-allele indels annotated with superscript markers
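To make the function strip concrete, here is a simplified sketch of classifying a variant position against GFF3 features. It is not haplokit's implementation (the README attributes GFF parsing to the vendored gffsub library); `Feature`, `parse_gff3_line`, `classify`, and the priority order are assumptions for illustration. It only distinguishes CDS, other genic positions, and intergenic, whereas the real strip also reports UTR, exon, and intron.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqid: str
    type: str
    start: int  # GFF3 coordinates are 1-based and inclusive
    end: int


def parse_gff3_line(line: str) -> Feature:
    """Parse the coordinate columns of one tab-separated GFF3 record."""
    cols = line.rstrip("\n").split("\t")
    return Feature(cols[0], cols[2], int(cols[3]), int(cols[4]))


def classify(chrom, pos, features, priority=("CDS", "exon", "gene")):
    """Return the highest-priority feature type overlapping pos."""
    hits = {f.type for f in features
            if f.seqid == chrom and f.start <= pos <= f.end}
    for ftype in priority:
        if ftype in hits:
            return ftype
    return "intergenic"


# The two records from the genes.gff3 example above.
GFF_LINES = [
    "chr1\t.\tgene\t1000\t3000\t.\t+\t.\tID=gene1;Name=GeneA",
    "chr1\t.\tCDS\t1200\t1500\t.\t+\t0\tID=cds1;Parent=gene1",
]
FEATURES = [parse_gff3_line(line) for line in GFF_LINES]

for pos in (1300, 2500, 3500):
    print(pos, classify("chr1", pos, FEATURES))
```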
4. Population grouping
Compare haplotype distributions across populations.
haplokit view in.vcf.gz -r chr1:1000-2000 -p popgroup.txt --plot --output-file out
popgroup.txt (tab-separated: sample<TAB>population):
C1 wild
C2 wild
C13 landrace
Adds population columns to the table and figure.
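As a rough illustration of what the population columns contain, the sketch below reads a sample-to-population map and counts samples per haplotype per group. The helper names and the haplotype assignments in `hap_to_samples` are made up for the example; only the popgroup.txt layout comes from this README.

```python
from collections import Counter, defaultdict


def read_popmap(lines):
    """Parse tab-separated 'sample<TAB>population' lines into a dict."""
    popmap = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sample, population = line.split("\t")
        popmap[sample] = population
    return popmap


def counts_by_population(hap_to_samples, popmap):
    """Count samples per haplotype per population."""
    table = defaultdict(Counter)
    for hap, samples in hap_to_samples.items():
        for sample in samples:
            # Samples missing from the map go to a hypothetical catch-all group.
            table[hap][popmap.get(sample, "unassigned")] += 1
    return table


popmap = read_popmap(["C1\twild", "C2\twild", "C13\tlandrace"])
# Hypothetical haplotype assignments, for illustration only.
hap_to_samples = {"H001": ["C1", "C13"], "H002": ["C2"]}
table = counts_by_population(hap_to_samples, popmap)
for hap, counts in table.items():
    print(hap, dict(counts))
```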
5. Geographic distribution map
Map haplotype composition at sampling locations.
haplokit view in.vcf.gz -r chr1:1000-2000 -p popgroup.txt --geo sample_geo.txt --plot --output-file out
sample_geo.txt (tab-separated: ID<TAB>longitude<TAB>latitude):
ID longitude latitude
C1 116.40 39.90
C4 121.47 31.23
Figure components:
- Pie charts: haplotype composition per location; size ∝ √(sample count)
- Color legend (top-left): haplotype color key
- Bubble-size legend (top-right): ggplot2-style graduated circles, showing the sample-count scale
- Base map: GeoJSON province boundaries (China)
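The square-root scaling of pie size can be written as a one-line function. This sketch assumes that "size" means the drawn radius; the README does not say which dimension is scaled, and `pie_radius` and `base_radius` are hypothetical names.

```python
import math


def pie_radius(sample_count: int, base_radius: float = 1.0) -> float:
    """Radius proportional to the square root of the sample count (assumed)."""
    return base_radius * math.sqrt(sample_count)


# Quadrupling the sample count doubles the radius under this assumption.
for n in (1, 4, 16):
    print(n, pie_radius(n))
```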
6. Haplotype network — popart-style
Build a TCS network (Templeton et al. 1992) and visualize it in the conventions of popart (Leigh & Bryant 2015).
haplokit view in.vcf.gz -r chr1:1000-2000 -p popgroup.txt --network --plot --output-file out
Figure components:
- Nodes: one circle per haplotype; area ∝ √(sample count)
- Pie slices (with -p): population composition per haplotype
- Edges: ideal length proportional to mutation distance (force-directed layout)
- Hatch marks across edges: one tick per mutation (popart convention)
- Small black dots: inferred median (intermediate) vertices, where TCS infers ancestors
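The mutation distance behind edge lengths and hatch marks can be approximated as a pairwise Hamming distance between haplotype allele vectors. The sketch below is not the TCS algorithm and does not infer median vertices; it only counts differing positions, and treating a multi-base indel as a single difference is an assumption. The allele rows are the three shown in the hap_summary.tsv example in this README.

```python
from itertools import combinations


def mutation_distance(hap_a, hap_b):
    """Number of positions at which two haplotypes carry different alleles."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same variant positions")
    return sum(a != b for a, b in zip(hap_a, hap_b))


HAPLOTYPES = {
    "H001": ["G", "T", "T", "GCCTA", "T"],
    "H002": ["G", "T", "T", "A", "T"],
    "H003": ["C", "T", "T", "A", "T"],
}

for (name_a, a), (name_b, b) in combinations(HAPLOTYPES.items(), 2):
    print(name_a, name_b, mutation_distance(a, b))
```

Under this counting, H001 and H002 would be joined by an edge with one hatch mark, as would H002 and H003.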
7. BED batch processing
Process multiple regions in one run.
haplokit view in.vcf.gz -R regions.bed --output-file out_batch
regions.bed (≥3 tab-separated columns):
chr1 1000 2000
chr2 5000 6000
Each BED row is processed independently. Output files are suffixed by region slug (_chr1_1000_2000).
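A minimal sketch of how a BED row might map to the region slug shown above. The `region_slug` helper is hypothetical; it copies the BED coordinates into the slug unchanged, and whether haplokit shifts BED's 0-based start before naming files is not stated in this README.

```python
def region_slug(bed_line: str) -> str:
    """Build a slug such as '_chr1_1000_2000' from the first three BED columns."""
    chrom, start, end = bed_line.rstrip("\n").split("\t")[:3]
    return f"_{chrom}_{start}_{end}"


BED_ROWS = ["chr1\t1000\t2000", "chr2\t5000\t6000"]
for row in BED_ROWS:
    print(region_slug(row))
```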
8. Approximate grouping
Cluster similar haplotypes within a tolerance.
haplokit view in.vcf.gz -r chr1:1000-2000 --max-diff 0.2 --output-file out
--max-diff takes a value between 0 and 1. In this example (0.2), haplotypes that differ at ≤ 20% of positions merge into one group. The grouping mode changes from strict-region to approx-region.
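One way to picture threshold-based merging is a greedy pass that compares each haplotype with the representative of every existing group. This is only a sketch under that assumption; the README does not describe the clustering procedure the C++ backend uses, and `diff_fraction` and `approx_group` are hypothetical names. The allele rows are taken from the hap_summary.tsv example in this README.

```python
def diff_fraction(hap_a, hap_b):
    """Fraction of positions at which two haplotypes differ."""
    return sum(a != b for a, b in zip(hap_a, hap_b)) / len(hap_a)


def approx_group(haplotypes, max_diff):
    """Greedily assign each haplotype to the first group within max_diff."""
    groups = []  # list of (representative, members)
    for hap in haplotypes:
        for representative, members in groups:
            if diff_fraction(representative, hap) <= max_diff:
                members.append(hap)
                break
        else:
            groups.append((hap, [hap]))
    return groups


HAPS = [
    ("G", "T", "T", "GCCTA", "T"),
    ("G", "T", "T", "A", "T"),
    ("C", "T", "T", "A", "T"),
]
print(len(approx_group(HAPS, max_diff=0.0)))
print(len(approx_group(HAPS, max_diff=0.2)))
```

With five positions, the first two rows differ at one position (a fraction of 0.2) and merge at a threshold of 0.2, while the third differs from the first at two positions and stays separate.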
9. Sample subset + imputation
Restrict analysis to specific samples; fill missing calls as reference.
haplokit view in.vcf.gz -r chr1:1000-2000 -S samples.list --impute --output-file out
samples.list (one sample ID per line):
C1
C5
C16
--impute treats missing GT as 0/0, increasing sample retention.
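The effect of --impute on sample retention can be sketched as follows. `normalize_gt` and `keep_sample` are hypothetical helpers, and treating a partially missing call such as 0/. the same as ./. is an assumption; the strict exclusion of heterozygous or missing calls follows the region-query description earlier in this README.

```python
def normalize_gt(gt: str, impute: bool):
    """Return the allele index of a homozygous call, or None if excluded."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        if not impute:
            return None  # missing call: sample is excluded
        alleles = ["0"] * len(alleles)  # treat missing as homozygous reference
    if len(set(alleles)) != 1:
        return None  # heterozygous call: sample is excluded
    return int(alleles[0])


def keep_sample(genotypes, impute=False):
    """A sample is retained only if every call is usable."""
    return all(normalize_gt(gt, impute) is not None for gt in genotypes)


calls = ["0/0", "./.", "1/1"]
print(keep_sample(calls, impute=False))
print(keep_sample(calls, impute=True))
```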
Output Files
hapresult.tsv — per-sample haplotype detail
CHR scaffold_1 scaffold_1 ... Haplotypes: 8
POS 4300 4345 ... Individuals: 37
INFO . . ... Variants: 5
ALLELE G/C T/A,GG ... Accession
H001 G T ... C8;C9;C11;C14;C18;C25;C26;C28;C31;C35
- Header rows (CHR/POS/INFO/ALLELE): variant metadata across columns
- Haplotype rows (H001–HNNN): allele at each position; empty = reference; list of samples carrying this haplotype
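A haplotype row could be split into its parts along these lines. This sketch assumes tab-separated columns with the semicolon-delimited accession list in the last column; `parse_hap_row` is a hypothetical helper, and the example row is abbreviated to the two allele columns shown above.

```python
def parse_hap_row(row: str):
    """Split a haplotype row into (id, alleles, sample list)."""
    fields = row.rstrip("\n").split("\t")
    return fields[0], fields[1:-1], fields[-1].split(";")


ROW = "H001\tG\tT\tC8;C9;C11;C14;C18;C25;C26;C28;C31;C35"
hap_id, alleles, samples = parse_hap_row(ROW)
print(hap_id, alleles, len(samples))
```

The sample count obtained this way for H001 matches the count shown for H001 in the hap_summary.tsv example.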
hap_summary.tsv — haplotype count summary
Same header as hapresult.tsv, plus a freq column (count/total):
H001 G T T GCCTA T 10
H002 G T T A T 8
H003 C T T A T 8
gff_ann_summary.tsv — gene annotation (--gff only)
chr start end ann
scaffold_1 4300 5000 test1G0387
Figure files (--plot)
Format set by --plot-format (default png). Named per region slug: <prefix>.<chr>_<start>_<end>.png.
Full Parameters
haplokit view <input_vcf> (-r <region> | -R <regions.bed>) [options]
<input_vcf> must be an indexed VCF/BCF (.vcf.gz + .tbi, or BCF index).
| Option | Type | Default | Description |
|---|---|---|---|
| `-r, --region` | string | - | chr:start-end or chr:pos |
| `-R, --regions-file` | path | - | BED file (≥3 tab-separated columns) |
| `-S, --samples-file` | path | - | One sample ID per line |
| `--by` | `auto\|region\|site` | `auto` | Grouping mode; auto infers from selector shape |
| `--impute` | flag | off | Impute missing GT as reference |
| `-g, --gff` | path | - | GFF3/GTF for gene annotation |
| `-p, --population` | path | - | Tab-separated sample → population map |
| `--output` | `summary\|detail` | `summary` | JSONL mode only; TSV always writes both |
| `--output-format` | `tsv\|jsonl` | `tsv` | Output format |
| `--output-file` | path | - | Output directory, prefix, or JSONL file |
| `--plot` | flag | off | Generate haplotype table figure |
| `--plot-format` | `png\|pdf\|svg\|tiff` | `png` | Figure format |
| `--max-diff` | float [0,1] | - | Approximate grouping threshold |
| `--geo` | path | - | Sample geographic coordinates for map |
| `--network` | flag | off | Render haplotype network (popart-style TCS) |
Selector rules: -r and -R are mutually exclusive, and exactly one of them is required. --by site is only valid with -r chr:pos.
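These selector rules map naturally onto argparse's mutually exclusive groups. The sketch below is not haplokit's actual argument parser; the program name and the `parse` helper are made up, and only the three options relevant to the rules are declared.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="haplokit-view-sketch")
    parser.add_argument("input_vcf")
    # Exactly one of -r / -R must be given.
    selector = parser.add_mutually_exclusive_group(required=True)
    selector.add_argument("-r", "--region")
    selector.add_argument("-R", "--regions-file")
    parser.add_argument("--by", choices=["auto", "region", "site"], default="auto")
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # A chr:pos selector has no '-' after the last colon.
    is_site = args.region is not None and "-" not in args.region.split(":")[-1]
    if args.by == "site" and not is_site:
        parser.error("--by site requires -r chr:pos")
    return args


args = parse(["in.vcf.gz", "-r", "chr1:1450", "--by", "site"])
print(args.region, args.by)
```

Invalid combinations make argparse print a usage message and exit with a non-zero status.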
Backend
The C++ backend (haplokit_cpp) handles VCF reading and haplotype grouping. Discovery order:
1. HAPLOKIT_CPP_BIN env var
2. Packaged binary: haplokit/_bin/haplokit_cpp
3. Repo build: build-wsl/haplokit_cpp → build/haplokit_cpp
4. Fallback: auto-run cmake build
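The discovery order above could be expressed as a simple candidate search. This is a sketch, not the package's actual lookup code: `find_backend` and its parameters are hypothetical, and skipping an environment variable that points to a missing file is an assumption.

```python
import os
from pathlib import Path


def find_backend(package_dir: Path, repo_root: Path, env=os.environ):
    """Return the first existing backend binary, or None if none is found."""
    candidates = []
    env_bin = env.get("HAPLOKIT_CPP_BIN")
    if env_bin:
        candidates.append(Path(env_bin))                      # 1. env var
    candidates.append(package_dir / "_bin" / "haplokit_cpp")  # 2. packaged binary
    candidates.append(repo_root / "build-wsl" / "haplokit_cpp")  # 3. repo builds
    candidates.append(repo_root / "build" / "haplokit_cpp")
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # 4. the caller would fall back to running a cmake build


print(find_backend(Path("haplokit"), Path(".")))
```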
Vendored libraries:
- htslib — VCF/BCF reading with indexed random access
- gffsub — GFF3/GTF parsing with overlap/nearest-gene queries
Contributing
cmake -S . -B build-wsl && cmake --build build-wsl -j12
HAPLOKIT_CPP_BIN=$PWD/build-wsl/haplokit_cpp python -m pytest -q tests/python
ctest --test-dir build-wsl --output-on-failure
Acknowledgements
Inspired by geneHapR:
Zhang, R., Jia, G. & Diao, X. geneHapR: an R package for gene haplotypic statistics and visualization. BMC Bioinformatics 24, 199 (2023). https://doi.org/10.1186/s12859-023-05318-9
Network visualization follows the conventions of popart:
Leigh, J. W. & Bryant, D. popart: full‐feature software for haplotype network construction. Methods in Ecology and Evolution 6, 1110–1116 (2015). https://doi.org/10.1111/2041-210X.12410
License
GPL-3.0-or-later