CLI haplotype viewer with C++ backend and Python plotting

haplokit

CLI haplotype viewer with bcftools-like selectors, C++ backend, and Python plotting.

Installation

pip install haplokit

Building from source requires Linux or WSL, Python 3.10+, a C++17 toolchain, and CMake 3.22+; see Contributing.

Quick Start

haplokit view data/var.sorted.vcf.gz -r scaffold_1:4300-5000 --output-file out

Output:

  • out/hapresult.tsv — per-sample haplotype detail
  • out/hap_summary.tsv — haplotype count summary

Usage Scenarios

1. Region query — strict haplotype grouping

Identify all distinct haplotypes in a genomic region.

haplokit view in.vcf.gz -r chr1:1000-2000 --output-file out

Produces hapresult.tsv + hap_summary.tsv in out/. Each haplotype row shows the exact allele pattern; samples with any heterozygous or missing call are excluded.
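The strict grouping rule can be illustrated with a short Python sketch. This is not haplokit's C++ implementation: the function names and the genotype-dictionary input are hypothetical, and real genotypes would be read from the VCF.

```python
from collections import defaultdict


def is_hom_call(gt: str) -> bool:
    """True if a GT string such as '0/0' or '1|1' is homozygous and non-missing."""
    alleles = gt.replace("|", "/").split("/")
    return "." not in alleles and len(set(alleles)) == 1


def strict_groups(genotypes: dict[str, list[str]]) -> dict[tuple[str, ...], list[str]]:
    """Group samples by their exact allele pattern across all sites.

    `genotypes` maps sample ID -> list of GT strings, one per variant site.
    Samples with any heterozygous or missing call are dropped.
    """
    groups = defaultdict(list)
    for sample, gts in genotypes.items():
        if all(is_hom_call(gt) for gt in gts):
            # Every call is homozygous, so the first allele index describes it.
            pattern = tuple(gt.replace("|", "/").split("/")[0] for gt in gts)
            groups[pattern].append(sample)
    return dict(groups)


# Hypothetical genotypes for five samples at three sites.
example = {
    "C1": ["0/0", "1/1", "0/0"],
    "C2": ["0/0", "1/1", "0/0"],
    "C3": ["0/1", "1/1", "0/0"],  # heterozygous call: excluded
    "C4": ["1/1", "./.", "0/0"],  # missing call: excluded
    "C5": ["1/1", "1/1", "0/0"],
}
groups = strict_groups(example)
```

Here `C1` and `C2` share one haplotype, `C5` forms a second, and `C3` and `C4` are dropped.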

2. Single-site query

Analyze haplotypes at a single variant position.

haplokit view in.vcf.gz -r chr1:1450 --output-file out_site

With the default --by auto, a chr:pos selector resolves to site mode.
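One way to infer the mode from the selector shape is a regular expression over the two documented forms, chr:pos and chr:start-end. The sketch below is an assumption about how such inference could work; `infer_mode` is a hypothetical name, not part of haplokit's API.

```python
import re

# chrom ':' start, optionally followed by '-' end
_SELECTOR = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)(?:-(?P<end>\d+))?$")


def infer_mode(selector: str) -> str:
    """Return 'site' for chr:pos and 'region' for chr:start-end."""
    match = _SELECTOR.match(selector)
    if match is None:
        raise ValueError(f"unrecognized selector: {selector!r}")
    return "site" if match.group("end") is None else "region"
```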

3. Gene annotation + figure

Overlay gene structure on the haplotype table.

haplokit view in.vcf.gz -r chr1:1000-2000 --gff genes.gff3 --plot --output-file out

genes.gff3 format (standard GFF3):

chr1	.	gene	1000	3000	.	+	.	ID=gene1;Name=GeneA
chr1	.	CDS	1200	1500	.	+	0	ID=cds1;Parent=gene1

Adds a SnpEff-style functional-category strip (CDS, UTR, exon, intron, intergenic) above the variant positions. Writes the figure (out/*.png) and gff_ann_summary.tsv.
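As a rough illustration of position-based classification, the sketch below parses the two GFF3 lines above and labels a position as CDS, genic but outside the CDS, or intergenic. It is deliberately simplified: it does not separate UTR, exon, and intron, and the function names and the "genic (non-CDS)" label are hypothetical rather than haplokit's own.

```python
GFF_LINES = [
    "chr1\t.\tgene\t1000\t3000\t.\t+\t.\tID=gene1;Name=GeneA",
    "chr1\t.\tCDS\t1200\t1500\t.\t+\t0\tID=cds1;Parent=gene1",
]


def parse_gff3(lines):
    """Return (seqid, type, start, end) tuples; GFF3 coordinates are 1-based, inclusive."""
    features = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        features.append((cols[0], cols[2], int(cols[3]), int(cols[4])))
    return features


def classify(chrom, pos, features):
    """Pick the most specific feature type overlapping a position."""
    hits = {ftype for seqid, ftype, start, end in features
            if seqid == chrom and start <= pos <= end}
    if "CDS" in hits:
        return "CDS"
    if "gene" in hits:
        return "genic (non-CDS)"
    return "intergenic"


features = parse_gff3(GFF_LINES)
```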

[Figure: haplotype summary table]

Figure components:

  • Title: region + overlapping gene name (when --gff provided)
  • Function strip (--gff only): colored bar classifying each variant by functional category
  • POS / ALLELE rows: variant positions and alternate alleles
  • Haplotype rows (H001, H002, ...): allele per position; empty = reference
  • Population columns (--population): sample counts per haplotype per group
  • n/N: haplotype frequency
  • Legend (--gff only): functional category colors
  • Indel footnotes: multi-allele indels annotated with superscript markers

4. Population grouping

Compare haplotype distributions across populations.

haplokit view in.vcf.gz -r chr1:1000-2000 -p popgroup.txt --plot --output-file out

popgroup.txt (tab-separated: sample<TAB>population):

C1	wild
C2	wild
C13	landrace

Adds population columns to the table and figure.
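The population columns amount to a per-haplotype tally of samples by group. A minimal sketch, assuming a hypothetical haplotype-to-samples mapping and an "unknown" label for samples missing from the map (haplokit's handling of unmapped samples is not documented here):

```python
from collections import Counter

POPGROUP_TEXT = "C1\twild\nC2\twild\nC13\tlandrace\n"


def parse_popgroup(text):
    """Parse tab-separated sample<TAB>population lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        if line.strip():
            sample, population = line.split("\t")[:2]
            mapping[sample] = population
    return mapping


def population_counts(hap_samples, sample_pop):
    """Count samples per population for each haplotype."""
    return {
        hap: Counter(sample_pop.get(sample, "unknown") for sample in samples)
        for hap, samples in hap_samples.items()
    }


sample_pop = parse_popgroup(POPGROUP_TEXT)
# Hypothetical haplotype assignments, for illustration only.
counts = population_counts({"H001": ["C1", "C13"], "H002": ["C2"]}, sample_pop)
```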

5. Geographic distribution map

Map haplotype composition at sampling locations.

haplokit view in.vcf.gz -r chr1:1000-2000 -p popgroup.txt --geo sample_geo.txt --plot --output-file out

sample_geo.txt (tab-separated: ID<TAB>longitude<TAB>latitude):

ID	longitude	latitude
C1	116.40	39.90
C4	121.47	31.23

[Figure: haplotype geographic distribution]

Figure components:

  • Pie charts: haplotype composition per location; size ∝ √(sample count)
  • Color legend (top-left): haplotype color key
  • Bubble-size legend (top-right): ggplot2-style graduated circles, showing the sample-count scale
  • Base map: GeoJSON province boundaries (China)
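The square-root scaling of the pie charts can be written out directly. The sketch below assumes that "size" means radius, which makes the pie area grow linearly with sample count; `pie_radius` and its `scale` parameter are hypothetical.

```python
import math


def pie_radius(sample_count: int, scale: float = 1.0) -> float:
    """Radius proportional to the square root of the sample count."""
    if sample_count < 0:
        raise ValueError("sample_count must be non-negative")
    return scale * math.sqrt(sample_count)


# Quadrupling the sample count doubles the radius.
radii = {n: pie_radius(n) for n in (1, 4, 16)}
```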

6. Haplotype network — popart-style

Build a TCS network (Templeton et al. 1992) and visualize it in the conventions of popart (Leigh & Bryant 2015).

haplokit view in.vcf.gz -r chr1:1000-2000 -p popgroup.txt --network --plot --output-file out

Figure components:

  • Nodes: one circle per haplotype; area ∝ √(sample count)
  • Pie slices (with -p): population composition per haplotype
  • Edges: ideal length proportional to mutation distance (force-directed layout)
  • Hatch marks across edges: one tick per mutation (popart convention)
  • Small black dots: inferred median (intermediate) vertices, where TCS infers ancestors
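The mutation distance behind edge lengths and hatch marks can be approximated as the number of positions at which two haplotypes differ. The sketch below computes only these pairwise distances; it is not a TCS implementation and infers no intermediate vertices. Counting a multi-base allele such as GCCTA as a single difference is an assumption, and the haplotypes are the first three rows of the hap_summary.tsv example in this README.

```python
from itertools import combinations


def mutation_distance(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same positions")
    return sum(x != y for x, y in zip(a, b))


haps = {
    "H001": ("G", "T", "T", "GCCTA", "T"),
    "H002": ("G", "T", "T", "A", "T"),
    "H003": ("C", "T", "T", "A", "T"),
}
# One entry per unordered pair of haplotypes.
dist = {
    (p, q): mutation_distance(haps[p], haps[q])
    for p, q in combinations(haps, 2)
}
```

With these distances, H002 sits one mutation from both H001 and H003, so a network would plausibly connect H001 and H003 through H002.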

7. BED batch processing

Process multiple regions in one run.

haplokit view in.vcf.gz -R regions.bed --output-file out_batch

regions.bed (≥3 tab-separated columns):

chr1	1000	2000
chr2	5000	6000

Each BED row is processed independently. Output files are suffixed by region slug (_chr1_1000_2000).
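A small sketch of reading the BED rows and deriving the region slug. The coordinates are copied verbatim into the slug, which matches the `_chr1_1000_2000` example; whether haplokit shifts BED's 0-based start before querying is not stated here, so the sketch makes no conversion. Both function names are hypothetical.

```python
BED_TEXT = "chr1\t1000\t2000\nchr2\t5000\t6000\n"


def parse_bed(text):
    """Return (chrom, start, end) for each data row; extra columns are ignored."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            raise ValueError(f"BED row needs at least 3 columns: {line!r}")
        regions.append((cols[0], int(cols[1]), int(cols[2])))
    return regions


def region_slug(chrom, start, end):
    """Suffix appended to output file names for one region."""
    return f"_{chrom}_{start}_{end}"


slugs = [region_slug(*region) for region in parse_bed(BED_TEXT)]
```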

8. Approximate grouping

Cluster similar haplotypes within a tolerance.

haplokit view in.vcf.gz -r chr1:1000-2000 --max-diff 0.2 --output-file out

--max-diff takes a fraction in [0, 1]. With 0.2, as above, haplotypes that differ at ≤ 20% of positions merge into one group. The grouping mode changes from strict-region to approx-region.
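To make the threshold concrete, the sketch below uses a simple greedy scheme: each haplotype joins the first group whose representative differs from it at no more than `max_diff` of the positions. This is one possible clustering under that threshold, chosen for illustration; haplokit's actual algorithm is not described here and may differ.

```python
def diff_fraction(a, b):
    """Fraction of positions at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same positions")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def approx_groups(haplotypes, max_diff):
    """Greedily merge haplotypes within max_diff of a group's first member."""
    groups = []  # list of (representative, members)
    for hap in haplotypes:
        for representative, members in groups:
            if diff_fraction(representative, hap) <= max_diff:
                members.append(hap)
                break
        else:
            groups.append((hap, [hap]))
    return [members for _, members in groups]


# Hypothetical haplotypes over five positions.
h1 = ("G", "T", "T", "A", "T")
h2 = ("C", "T", "T", "A", "T")  # differs from h1 at 1 of 5 positions
h3 = ("C", "A", "A", "G", "T")  # differs from h1 at 4 of 5 positions
merged = approx_groups([h1, h2, h3], max_diff=0.2)
```

With `max_diff=0.2`, `h1` and `h2` merge and `h3` stays separate; with `max_diff=0.0` the result reduces to strict grouping.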

9. Sample subset + imputation

Restrict analysis to specific samples; fill missing calls as reference.

haplokit view in.vcf.gz -r chr1:1000-2000 -S samples.list --impute --output-file out

samples.list (one sample ID per line):

C1
C5
C16

--impute treats missing GT as 0/0, increasing sample retention.
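The effect of subsetting plus imputation can be sketched as follows. Treating any call that contains a missing allele (including partial calls such as `./1`) as missing is an assumption, and both function names are hypothetical.

```python
def impute_reference(gt: str) -> str:
    """Replace a missing GT call with homozygous reference; keep other calls."""
    alleles = gt.replace("|", "/").split("/")
    return "0/0" if "." in alleles else gt


def select_and_impute(genotypes, keep):
    """Restrict to the samples in `keep` and impute their missing calls."""
    return {
        sample: [impute_reference(gt) for gt in gts]
        for sample, gts in genotypes.items()
        if sample in keep
    }


# Hypothetical genotypes; C2 is not in the sample list and is dropped.
genotypes = {
    "C1": ["0/0", "./."],
    "C2": ["1/1", "1/1"],
    "C5": ["1/1", "0|1"],
}
subset = select_and_impute(genotypes, keep={"C1", "C5", "C16"})
```

After imputation `C1` has no missing call, so it would survive the strict filter that excludes samples with missing genotypes.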

Output Files

hapresult.tsv — per-sample haplotype detail

CHR     scaffold_1  scaffold_1  ...  Haplotypes:  8
POS     4300        4345        ...  Individuals: 37
INFO    .           .           ...  Variants:    5
ALLELE  G/C         T/A,GG      ...  Accession
H001    G           T           ...  C8;C9;C11;C14;C18;C25;C26;C28;C31;C35

  • Header rows (CHR/POS/INFO/ALLELE): variant metadata, one column per variant
  • Haplotype rows (H001–HNNN): allele at each position (empty = reference), followed by the list of samples carrying the haplotype

hap_summary.tsv — haplotype count summary

Same header as hapresult.tsv, plus a freq column (count/total):

H001  G   T   T   GCCTA  T   10
H002  G   T   T   A      T   8
H003  C   T   T   A      T   8

gff_ann_summary.tsv — gene annotation (--gff only)

chr           start  end   ann
scaffold_1    4300   5000  test1G0387

Figure files (--plot)

Format set by --plot-format (default png). Named per region slug: <prefix>.<chr>_<start>_<end>.png.

Full Parameters

haplokit view <input_vcf> (-r <region> | -R <regions.bed>) [options]

<input_vcf> must be an indexed VCF/BCF (.vcf.gz + .tbi, or BCF index).

Option               Type               Default   Description
-r, --region         string                       chr:start-end or chr:pos
-R, --regions-file   path                         BED file (≥3 tab-separated columns)
-S, --samples-file   path                         One sample ID per line
--by                 auto|region|site   auto      Grouping mode; auto infers from selector shape
--impute             flag               off       Impute missing GT as reference
-g, --gff            path                         GFF3/GTF for gene annotation
-p, --population     path                         Tab-separated sample → population map
--output             summary|detail     summary   JSONL mode only; TSV always writes both
--output-format      tsv|jsonl          tsv       Output format
--output-file        path                         Output directory, prefix, or JSONL file
--plot               flag               off       Generate haplotype table figure
--plot-format        png|pdf|svg|tiff   png       Figure format
--max-diff           float [0,1]                  Approximate grouping threshold
--geo                path                         Sample geographic coordinates for map
--network            flag               off       Render haplotype network (popart-style TCS)

Selector rules: -r and -R are mutually exclusive, and exactly one of them is required. --by site is only valid with -r chr:pos.
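These selector rules map naturally onto Python's argparse. The sketch below is an illustration of the rules, not haplokit's actual parser: it covers only the selector options, and `build_parser` and `validate` are hypothetical names.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="haplokit view")
    parser.add_argument("input_vcf")
    # Exactly one of -r / -R must be given.
    selector = parser.add_mutually_exclusive_group(required=True)
    selector.add_argument("-r", "--region")
    selector.add_argument("-R", "--regions-file")
    parser.add_argument("--by", choices=["auto", "region", "site"], default="auto")
    return parser


def validate(args):
    """Reject --by site unless the selector is -r chr:pos."""
    if args.by == "site":
        coords = "" if args.region is None else args.region.split(":", 1)[-1]
        if args.region is None or "-" in coords:
            raise ValueError("--by site requires -r chr:pos")


args = build_parser().parse_args(["in.vcf.gz", "-r", "chr1:1450", "--by", "site"])
validate(args)  # passes: single-position selector
```

Passing both `-r` and `-R`, or neither, makes argparse exit with a usage error.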

Backend

C++ backend (haplokit_cpp) handles VCF reading and haplotype grouping. Discovery order:

  1. HAPLOKIT_CPP_BIN env var
  2. Packaged binary: haplokit/_bin/haplokit_cpp
  3. Repo build: build-wsl/haplokit_cpp or build/haplokit_cpp
  4. Fallback: auto-run cmake build
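The discovery order can be sketched as a lookup that returns the first usable candidate. This is an assumption-laden illustration, not the packaged logic: `find_backend` and its parameters are hypothetical, the sketch trusts the environment override without checking that it exists, and it returns `None` where the real tool would fall back to a CMake build.

```python
import os
from pathlib import Path


def find_backend(package_dir: Path, repo_root: Path, env=None):
    """Return the first backend binary found, or None if a build is needed."""
    env = os.environ if env is None else env

    # 1. Explicit override via environment variable.
    override = env.get("HAPLOKIT_CPP_BIN")
    if override:
        return Path(override)

    # 2. Packaged binary, then 3. a repo build directory.
    candidates = [
        package_dir / "_bin" / "haplokit_cpp",
        repo_root / "build-wsl" / "haplokit_cpp",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate

    # 4. Nothing found: the caller would trigger a CMake build here.
    return None
```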

Vendored libraries:

  • htslib — VCF/BCF reading with indexed random access
  • gffsub — GFF3/GTF parsing with overlap/nearest-gene queries

Contributing

cmake -S . -B build-wsl && cmake --build build-wsl -j12
HAPLOKIT_CPP_BIN=$PWD/build-wsl/haplokit_cpp python -m pytest -q tests/python
ctest --test-dir build-wsl --output-on-failure

Acknowledgements

Inspired by geneHapR:

Zhang, R., Jia, G. & Diao, X. geneHapR: an R package for gene haplotypic statistics and visualization. BMC Bioinformatics 24, 199 (2023). https://doi.org/10.1186/s12859-023-05318-9

Network visualization follows the conventions of popart:

Leigh, J. W. & Bryant, D. popart: full‐feature software for haplotype network construction. Methods in Ecology and Evolution 6, 1110–1116 (2015). https://doi.org/10.1111/2041-210X.12410

License

GPL-3.0-or-later
