Framework for haplotype clustering in phased genotype data

Development Status
- 4 - Beta
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

hapla (v0.40.0)

hapla is a framework for performing window-based haplotype clustering in phased genotype data. The inferred haplotype cluster alleles can be used to infer fine-scale population structure, perform polygenic prediction and haplotype cluster based association studies.

Citation

Please cite our papers.

hapla cluster
Paper in Nature Communications
Preprint available on medRxiv

hapla admix
Preprint available on bioRxiv

Installation

# Option 1: Build and install via PyPI
pip install hapla

# Option 2: Download source and install via pip
git clone https://github.com/Rosemeis/hapla.git
cd hapla
pip install .

# Option 3: Download source and install in a new Conda environment
git clone https://github.com/Rosemeis/hapla.git
conda env create -f hapla/environment.yml
conda activate hapla

You can now run the hapla software and the subcommands.

If you run into installation issues on an HPC system, the cause may be a mismatch of CPU architectures between login and compute nodes (typically reported as an "illegal instruction" error). One workaround is to remove every instance of the march=native compiler flag in the setup.py file; this flag optimizes hapla for the hardware it is built on. Alternatively, use the uv package manager, which can run hapla in a temporary, isolated environment when you add uvx in front of the hapla command.

Quick start

hapla currently provides the following subcommands:

hapla cluster
hapla predict
hapla struct
hapla admix
hapla fatash

Haplotype clustering

hapla cluster
Window-based haplotype clustering in a phased VCF/BCF file (including index).

# Cluster haplotypes in a chromosome with fixed window size (16 SNPs)
hapla cluster --bcf data.chr1.bcf --size 16 --threads 8 --out hapla.chr1
# Saves inferred haplotype cluster assignments in binary hapla format
#	- hapla.chr1.bca
#	- hapla.chr1.ids
#	- hapla.chr1.win

hapla cluster outputs three files: a .bca-file (binary cluster assignments), which stores the cluster assignments as unsigned chars; a .ids-file with sample names; and a .win-file with information about the genomic windows.

# Cluster haplotypes in a chromosome with fixed size and overlapping windows (step size 8 SNPs)
hapla cluster --bcf data.chr1.bcf --size 16 --step 8 --threads 8 --out hapla.chr1

# Cluster haplotypes in all chromosomes and save output path in a filelist
for c in {1..22}
do
	hapla cluster --bcf data.chr${c}.bcf --size 16 --threads 8 --out hapla.chr${c}
	echo "hapla.chr${c}" >> hapla.filelist
done
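The fixed-size and overlapping window options above can be pictured with a small sketch. It is only an illustration of how --size and --step could partition SNP indices, not hapla's actual implementation: the function name window_bounds is hypothetical, and the handling of trailing SNPs (here, one shorter final window) is an assumption.

```python
def window_bounds(n_snps, size, step=None):
    """Return (start, end) SNP index pairs with an exclusive end.

    Assumption: a shorter final window covers any trailing SNPs.
    hapla may treat the chromosome end differently.
    """
    if step is None:
        step = size  # no --step given: windows do not overlap
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    bounds = []
    start = 0
    while start < n_snps:
        end = min(start + size, n_snps)
        bounds.append((start, end))
        if end == n_snps:
            break  # the chromosome end has been reached
        start += step
    return bounds


# 40 SNPs, --size 16: three windows, the last one shorter
print(window_bounds(40, 16))
# 40 SNPs, --size 16 --step 8: consecutive windows share 8 SNPs
print(window_bounds(40, 16, step=8))
```

With a step smaller than the window size, each SNP falls into more than one window, so overlapping windows produce more haplotype cluster alleles per chromosome.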

Optionally, the haplotype cluster alleles can be saved in binary PLINK format (.bed, .bim, .fam) for ease of use with other software. Note that in this case, window information needs to be inferred from the .bim-file for downstream analyses.

hapla cluster --bcf data.chr1.bcf --threads 8 --out hapla.chr1 --plink
# Saves inferred haplotype cluster alleles in a binary PLINK format
#	- hapla.chr1.bed
#	- hapla.chr1.bim
#	- hapla.chr1.fam

The number of inferred haplotype clusters depends on the chosen window size (--size), the number of allowed clusters per window (--max-clusters), $\lambda$ (--lmbda) and the minimum haplotype cluster size. $\lambda$ sets the Hamming distance required to create a new cluster, expressed as a fraction of the window size in SNPs; the default is --lmbda 0.125. The minimum haplotype cluster size can be adjusted using either --min-freq or --min-mac. By default, a cluster must have a frequency of at least 0.005 to be retained (--min-freq 0.005). Setting --min-mac overrides any setting for --min-freq. Smaller clusters are iteratively removed.
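As a rough worked example of these settings, the sketch below turns --size, --lmbda and --min-freq into absolute numbers. The helper names (hamming, cluster_thresholds) are hypothetical, and the details are assumptions: hapla's exact rounding, whether the distance comparison is strict, and the interpretation of --min-mac as a minimum haplotype count are not stated here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def cluster_thresholds(size, n_haplotypes, lmbda=0.125, min_freq=0.005,
                       min_mac=None):
    """Return (distance threshold in SNPs, minimum cluster size in haplotypes).

    Assumption: thresholds are simple products; rounding is left to the reader.
    """
    distance = lmbda * size
    if min_mac is not None:
        min_count = min_mac  # --min-mac overrides --min-freq
    else:
        min_count = min_freq * n_haplotypes
    return distance, min_count


# Defaults with --size 16 and 1,000 diploid samples (2,000 haplotypes)
distance, min_count = cluster_thresholds(size=16, n_haplotypes=2000)
print(distance)   # 0.125 * 16 = 2.0 SNPs
print(min_count)  # roughly 10 haplotypes

# Two 16-SNP haplotypes that differ at three sites
h1 = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
h2 = [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]
print(hamming(h1, h2))
```

Under these assumptions, a larger window with the same $\lambda$ tolerates more mismatches before a new cluster is created, which is why the number of clusters depends jointly on --size and --lmbda.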

Predict haplotype cluster assignments

hapla predict
Predict haplotype cluster assignments using pre-computed cluster medians in a new set of haplotypes (VCF/BCF format). SNP sets must be overlapping.

# Cluster haplotypes in a chromosome with 'hapla cluster' and save cluster medians (--medians)
hapla cluster --bcf ref.chr1.bcf --size 8 --threads 64 --out ref.chr1 --medians
# Saves haplotype cluster medians (besides standard binary hapla format)
#	- ref.chr1.bcm
#	- ref.chr1.blk
#	- ref.chr1.wix

# Predict assignments in a set of new haplotypes using haplotype cluster medians
hapla predict --bcf new.chr1.bcf  --ref ref.chr1 --threads 64 --out new.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
#	- new.chr1.bca
#	- new.chr1.ids
#	- new.chr1.win

Using --medians in hapla cluster outputs three extra files: a .bcm-file (binary cluster medians), which stores the cluster medians as unsigned chars; a .blk-file, which stores pairwise log-likelihoods between the cluster medians; and a .wix-file with window index information. These files are needed to predict haplotype clusters in a new set of haplotypes.
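To give an intuition for prediction from cluster medians, the sketch below assigns a new phased haplotype to the median with the smallest Hamming distance within one window. This is an assumption made for illustration only: the function assign_to_median is hypothetical, and hapla predict may use a different rule (for example one involving the log-likelihoods in the .blk-file).

```python
def assign_to_median(haplotype, medians):
    """Return (index, distance) of the closest median by Hamming distance.

    Ties are resolved in favor of the lowest cluster index (an assumption).
    """
    if not medians:
        raise ValueError("at least one cluster median is required")
    best_index, best_distance = None, None
    for index, median in enumerate(medians):
        if len(median) != len(haplotype):
            raise ValueError("median and haplotype lengths differ")
        distance = sum(h != m for h, m in zip(haplotype, median))
        if best_distance is None or distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


# Three made-up cluster medians for one 8-SNP window
medians = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
]
new_haplotype = [1, 1, 0, 1, 0, 0, 0, 0]
print(assign_to_median(new_haplotype, medians))
```

The sketch also shows why the SNP sets must overlap: a median and a new haplotype can only be compared site by site when both cover the same SNPs in the window.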

(Prototype) Predict haplotype cluster assignments using pre-computed cluster medians in an unphased genotype dataset (VCF/BCF or binary PLINK format). SNP sets must be overlapping. NOT suitable for local ancestry inference.

# Predict assignments in an unphased genotype dataset in VCF/BCF format (same command as above)
hapla predict --bcf new.chr1.bcf  --ref ref.chr1 --threads 64 --out new.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
#	- new.chr1.bca
#	- new.chr1.ids
#	- new.chr1.win

# Predict assignments in an unphased genotype dataset in binary PLINK format (provide the file prefix)
hapla predict --bfile new.chr1 --ref ref.chr1 --threads 64 --out new.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
#	- new.chr1.bca
#	- new.chr1.ids
#	- new.chr1.win

Population structure inference and GRM estimation

hapla struct
Infer population structure and estimate genome-wide relationship matrix (GRM) using haplotype cluster alleles.

# Perform PCA on a single chromosome and extract top 20 eigenvectors
hapla struct --clusters hapla.chr1 --threads 64 --pca 20 --out hapla.chr1
# Saves eigenvalues and eigenvectors in text-format
#	- hapla.chr1.eigenvecs
#	- hapla.chr1.eigenvals

# Perform PCA on all chromosomes (genome-wide) using filelist and extract top 20 eigenvectors. Save loadings and haplotype cluster frequencies.
hapla struct --filelist hapla.filelist --threads 64 --pca 20 --out hapla --loadings
# Saves eigenvalues, eigenvectors, loadings and haplotype cluster frequencies in text-format
#	- hapla.eigenvecs
#	- hapla.eigenvals
#	- hapla.loadings
#	- hapla.freqs

# Construct genome-wide relationship matrix (GRM)
hapla struct --filelist hapla.filelist --threads 64 --grm --out hapla
# Saves the GRM in binary GCTA format (float)
#	- hapla.grm.bin
#	- hapla.grm.N.bin
#	- hapla.grm.id

# Project samples onto an existing PC space and extract eigenvectors
hapla struct --filelist new.filelist --threads 64 --out new --projection hapla
# Saves projected eigenvectors in text-format
#	- new.project.eigenvecs
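The GRM files written with --grm can be loaded outside of GCTA. The sketch below reads the standard binary GCTA layout (lower triangle including the diagonal, stored row by row as 4-byte floats) using only the Python standard library. The function names are hypothetical, and the code assumes the file was written in the byte order of the machine reading it.

```python
import array
import os
import tempfile


def read_grm_ids(path):
    """Read a .grm.id file: one sample per line, family ID then individual ID."""
    with open(path) as handle:
        return [tuple(line.split()[:2]) for line in handle if line.strip()]


def read_grm_bin(path, n_samples):
    """Read a .grm.bin file into a symmetric list-of-lists matrix."""
    n_values = n_samples * (n_samples + 1) // 2
    values = array.array("f")  # 4-byte floats, native byte order (assumption)
    with open(path, "rb") as handle:
        values.fromfile(handle, n_values)  # raises EOFError if the file is short
    grm = [[0.0] * n_samples for _ in range(n_samples)]
    k = 0
    for i in range(n_samples):
        for j in range(i + 1):
            grm[i][j] = grm[j][i] = values[k]
            k += 1
    return grm


# Round trip on a tiny made-up 3x3 GRM instead of real hapla output
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.grm.bin")
    with open(demo_path, "wb") as handle:
        array.array("f", [1.0, 0.5, 1.0, 0.25, 0.5, 1.0]).tofile(handle)
    demo_grm = read_grm_bin(demo_path, n_samples=3)

for row in demo_grm:
    print(row)
```

For real output, the number of samples would come from the length of read_grm_ids("hapla.grm.id"), and the matrix from read_grm_bin("hapla.grm.bin", n_samples).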

Ancestry estimation

hapla admix
Estimate ancestry proportions and ancestral haplotype cluster frequencies with a pre-specified number of sources (K). The model is a modification of the fastmixture and HaploNet models, adapted for use with our haplotype clusters. Projection and supervised modes are also available.

# Estimate ancestry proportions assuming K=3 ancestral sources for a single chromosome
hapla admix --clusters hapla.chr1 --K 3 --seed 1 --threads 64 --out hapla.chr1
# Saves Q and P matrices in text-format
#	- hapla.chr1.K3.s1.Q
#	- hapla.chr1.K3.s1.P

# Estimate ancestry proportions assuming K=3 ancestral sources using filelist with all chromosomes
hapla admix --filelist hapla.filelist --K 3 --seed 1 --threads 64 --out hapla
# Saves Q matrix and file-specific P matrices in text-format, including a filelist of the P matrices
#	- hapla.K3.s1.Q
#	- hapla.K3.s1.chr{1..22}.P
#	- hapla.K3.s1.pfilelist

# Estimate ancestry proportions in projection mode assuming K=3 ancestral sources using filelist with all chromosomes. Provide previously estimated ancestral haplotype cluster frequencies.
hapla admix --filelist new.filelist --K 3 --seed 1 --threads 64 --projection hapla.K3.s1.pfilelist --out new
# Saves Q matrix in text-format
#	- new.project.K3.s1.Q

# Estimate ancestry proportions in supervised mode assuming K=3 ancestral sources using filelist with all chromosomes. Provide a single column text-file with population labels of the samples as integers, where 0 indicates no label.
hapla admix --filelist hapla.filelist --K 3 --seed 1 --threads 64 --supervised hapla.labels --out hapla.super
# Saves Q and P matrices in text-format, including a filelist of the P matrices
#	- hapla.super.K3.s1.Q
#	- hapla.super.K3.s1.chr{1..22}.P
#	- hapla.super.K3.s1.pfilelist
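The label file for supervised mode is a single column of integers with 0 for unlabeled samples. A minimal sketch for generating its content is shown below. The function name format_supervised_labels is hypothetical, and two details are assumptions: that labels run from 1 to K, and that the rows follow the same sample order as the clustered data (for example the .ids-file).

```python
def format_supervised_labels(sample_ids, known_labels, k):
    """Return the label file content: one integer per sample, 0 = no label.

    Assumption: valid population labels are the integers 1..k.
    """
    lines = []
    for sample in sample_ids:
        label = known_labels.get(sample, 0)  # unlabeled samples get 0
        if not 0 <= label <= k:
            raise ValueError(f"label {label} for {sample} is outside 0..{k}")
        lines.append(str(label))
    return "\n".join(lines) + "\n"


# Made-up sample names; in practice these would be read from the .ids-file
samples = ["sampleA", "sampleB", "sampleC", "sampleD"]
labels = {"sampleA": 1, "sampleC": 3}
content = format_supervised_labels(samples, labels, k=3)
print(content, end="")
```

The returned string can be written to a file such as hapla.labels and passed to --supervised.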

Local ancestry inference (Prototype)

hapla fatash
Infer local ancestry tracts using the admixture estimates from hapla admix in a hidden Markov model. The model is based on a modified fastPHASE model for use with our haplotype clusters.

# Infer local ancestry tracts for a single chromosome (posterior decoding)
hapla fatash --clusters hapla.chr1 --qfile hapla.chr1.K3.s1.Q --pfile hapla.chr1.K3.s1.P --threads 16 --out hapla.chr1
# Saves posterior decoding path in text-format
#	- hapla.chr1.path

# Infer local ancestry tracts using filelist with all chromosomes (Viterbi decoding)
hapla fatash --filelist hapla.filelist --qfile hapla.K3.s1.Q --pfilelist hapla.K3.s1.pfilelist --threads 16 --out hapla --viterbi
# Saves Viterbi decoding paths in text-files
#	- hapla.chr{1..22}.path
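For readers unfamiliar with the --viterbi option, the sketch below shows generic textbook Viterbi decoding in log space: it finds the single most probable sequence of hidden states (here, ancestries per window). It is not hapla's modified fastPHASE model, and all probabilities in the example are made up for illustration.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Most probable hidden state path.

    log_init[k]: log probability of starting in state k
    log_trans[j][k]: log probability of moving from state j to state k
    log_emit[t][k]: log probability of the observation at step t in state k
    """
    n_states = len(log_init)
    score = [log_init[k] + log_emit[0][k] for k in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for k in range(n_states):
            best_j = max(range(n_states),
                         key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + log_trans[best_j][k]
                             + log_emit[t][k])
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def logs(rows):
    """Convert a nested list of probabilities to log probabilities."""
    return [[math.log(p) for p in row] for row in rows]


# Two ancestries, five windows, "sticky" transitions (all values made up)
init = [math.log(0.5), math.log(0.5)]
trans = logs([[0.9, 0.1], [0.1, 0.9]])
emit = logs([[0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]])
print(viterbi(init, trans, emit))
```

Posterior decoding, the default in the first command above, instead reports a per-window result based on posterior state probabilities, so the two outputs can differ near ancestry switch points.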

Release history

0.62.0

Feb 25, 2026

0.61.0

Feb 4, 2026

0.60.0

Jan 28, 2026

0.52.0

Dec 18, 2025

0.51.0

Nov 12, 2025

0.50.0

Nov 4, 2025

0.41.0

Oct 31, 2025

This version

0.40.0

Sep 26, 2025

0.33.0

Sep 19, 2025

0.32.2

Sep 1, 2025

0.32.1

Aug 7, 2025

0.32.0

Aug 1, 2025

0.31.1

Jul 31, 2025

0.31.0

Jul 31, 2025

0.30.0

Jul 30, 2025

0.25.0

May 14, 2025

0.24.1

Apr 22, 2025

0.24.0

Apr 15, 2025

0.23.0

Apr 11, 2025

0.22.1

Apr 7, 2025

0.22.0

Mar 29, 2025

0.21.1

Mar 21, 2025

0.21.0

Mar 10, 2025

0.20.1

Feb 24, 2025

0.20.0

Feb 18, 2025

0.14.7

Feb 10, 2025

0.14.6

Feb 4, 2025

0.14.5

Jan 27, 2025

0.14.4

Jan 25, 2025

0.14.3

Jan 22, 2025

0.14.2

Jan 14, 2025

0.14.0

Jan 6, 2025

0.13

Nov 16, 2024

0.12

Oct 24, 2024

Download files

Source Distribution

hapla-0.40.0.tar.gz (1.1 MB)

Uploaded Sep 26, 2025 Source

File details

Details for the file hapla-0.40.0.tar.gz.

File metadata

Download URL: hapla-0.40.0.tar.gz
Upload date: Sep 26, 2025
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for hapla-0.40.0.tar.gz
Algorithm	Hash digest
SHA256	`cc8b8a276014b100035c5a6bcfaf97349bcbaf6d6093c3f69f589154cef079e3`
MD5	`72bfd64a2626450c7c9f75436ea830c8`
BLAKE2b-256	`562056772652961af507b0c77159193811696852d5bf880d1042988e395d82ba`
