Framework for haplotype clustering in phased genotype data
hapla (v0.40.0)
hapla is a framework for performing window-based haplotype clustering in phased genotype data. The inferred haplotype cluster alleles can be used to infer fine-scale population structure, perform polygenic prediction and haplotype cluster based association studies.
Citation
Please cite our papers.
hapla cluster
Paper in Nature Communications
Preprint available on medRxiv
hapla admix
Preprint available on bioRxiv
Installation
# Option 1: Build and install via PyPI
pip install hapla
# Option 2: Download source and install via pip
git clone https://github.com/Rosemeis/hapla.git
cd hapla
pip install .
# Option 3: Download source and install in a new Conda environment
git clone https://github.com/Rosemeis/hapla.git
conda env create -f hapla/environment.yml
conda activate hapla
You can now run the hapla software and the subcommands.
If you run into issues with your installation on an HPC system, a likely cause is a mismatch in CPU architecture between login and compute nodes, which typically shows up as an "illegal instruction" error. In that case, try removing every instance of the march=native compiler flag from the setup.py file. This flag optimizes hapla for the specific hardware it is built on. Alternatively, use the uv package manager to run hapla in a temporary, isolated environment by adding uvx in front of the hapla command.
Quick start
hapla contains the following subcommands at this moment:
- hapla cluster
- hapla predict
- hapla struct
- hapla admix
- hapla fatash
Haplotype clustering
hapla cluster
Window-based haplotype clustering in a phased VCF/BCF file (including index).
# Cluster haplotypes in a chromosome with fixed window size (16 SNPs)
hapla cluster --bcf data.chr1.bcf --size 16 --threads 8 --out hapla.chr1
# Saves inferred haplotype cluster assignments in binary hapla format
# - hapla.chr1.bca
# - hapla.chr1.ids
# - hapla.chr1.win
hapla cluster outputs three files: a .bca-file (binary cluster assignments), which stores the cluster assignments as unsigned chars; a .ids-file with sample names; and a .win-file with information about the genomic windows.
# Cluster haplotypes in a chromosome with fixed size and overlapping windows (step size 8 SNPs)
hapla cluster --bcf data.chr1.bcf --size 16 --step 8 --threads 8 --out hapla.chr1
# Cluster haplotypes in all chromosomes and save output path in a filelist
for c in {1..22}
do
hapla cluster --bcf data.chr${c}.bcf --size 16 --threads 8 --out hapla.chr${c}
echo "hapla.chr${c}" >> hapla.filelist
done
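Because the loop above appends with `>>`, rerunning it adds duplicate entries to the filelist. The following is a minimal sketch of one way to rebuild the filelist after clustering, assuming the `hapla.chr${c}` prefixes used above. The `make_filelist` helper is hypothetical and not part of hapla; it only records prefixes for which a .bca-file exists.

```shell
# make_filelist OUT PREFIX...: write each PREFIX that has a .bca file to OUT.
# Hypothetical helper for illustration, not part of hapla.
make_filelist() {
    out="$1"; shift
    : > "$out"    # truncate first, so reruns do not duplicate entries
    for prefix in "$@"; do
        if [ -f "${prefix}.bca" ]; then
            echo "$prefix" >> "$out"
        else
            echo "skipping ${prefix}: no .bca file found" >&2
        fi
    done
}

# Rebuild the genome-wide filelist from the per-chromosome prefixes.
make_filelist hapla.filelist $(for c in $(seq 1 22); do echo "hapla.chr${c}"; done)
```

Chromosomes whose clustering run failed are reported on stderr and left out of the filelist, so downstream commands do not stop on a missing file.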
Optionally, the haplotype cluster alleles can be saved in binary PLINK format (.bed, .bim, .fam) for ease of use with other software. Note that in this case the window information needs to be inferred from the .bim-file for downstream analyses.
hapla cluster --bcf data.chr1.bcf --threads 8 --out hapla.chr1 --plink
# Saves inferred haplotype cluster alleles in a binary PLINK format
# - hapla.chr1.bed
# - hapla.chr1.bim
# - hapla.chr1.fam
The number of inferred haplotype clusters depends on the chosen window size (--size), the number of allowed clusters per window (--max-clusters), $\lambda$ (--lmbda) and the minimum haplotype cluster size. $\lambda$ is the fraction of the specified window size in SNPs that is required, in terms of Hamming distance, to create a new cluster. The default setting is --lmbda 0.125. The minimum haplotype cluster size can be adjusted using either --min-freq or --min-mac. By default, a cluster must have a frequency of at least 0.005 to be retained (--min-freq 0.005). Using --min-mac overrides any setting for --min-freq. Smaller clusters are removed iteratively.
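As a rough worked example of how these settings interact, the sketch below converts $\lambda$ and the minimum frequency into absolute numbers. The sample count of 1000 is hypothetical, and the calculation assumes diploid samples and that the frequency is taken over all haplotypes. How hapla rounds these values or treats ties at the threshold is not covered here.

```shell
# Hypothetical settings: --size 16, --lmbda 0.125, --min-freq 0.005,
# and 1000 diploid samples (2000 haplotypes).
size=16
lmbda=0.125
min_freq=0.005
n_samples=1000

# Hamming distance (in SNPs) implied by lambda for this window size.
threshold=$(awk -v s="$size" -v l="$lmbda" 'BEGIN { printf "%g", s * l }')

# Haplotype count implied by the minimum cluster frequency,
# assuming the frequency is taken over all 2N haplotypes.
min_count=$(awk -v f="$min_freq" -v n="$n_samples" 'BEGIN { printf "%g", f * 2 * n }')

echo "Hamming distance threshold: ${threshold} SNPs"
echo "Minimum cluster size: ${min_count} haplotypes"
```

With these values the threshold is 2 SNPs and the minimum cluster size is 10 haplotypes. Doubling --size to 32 at the same --lmbda doubles the threshold to 4 SNPs.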
Predict haplotype cluster assignments
hapla predict
Predict haplotype cluster assignments using pre-computed cluster medians in a new set of haplotypes (VCF/BCF format). SNP sets must be overlapping.
# Cluster haplotypes in a chromosome with 'hapla cluster' and save cluster medians (--medians)
hapla cluster --bcf ref.chr1.bcf --size 8 --threads 64 --out ref.chr1 --medians
# Saves haplotype cluster medians (besides standard binary hapla format)
# - ref.chr1.bcm
# - ref.chr1.blk
# - ref.chr1.wix
# Predict assignments in a set of new haplotypes using haplotype cluster medians
hapla predict --bcf new.chr1.bcf --ref ref.chr1 --threads 64 --out new.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
# - new.chr1.bca
# - new.chr1.ids
# - new.chr1.win
Using --medians in hapla cluster outputs three extra files: a .bcm-file (binary cluster medians), which stores the cluster medians as unsigned chars; a .blk-file, which stores pairwise log-likelihoods between the cluster medians; and a .wix-file with window index information. These files are needed to predict haplotype clusters in a new set of haplotypes.
(Prototype) Predict haplotype cluster assignments using pre-computed cluster medians in an unphased genotype dataset (VCF/BCF or binary PLINK format). SNP sets must be overlapping. NOT suitable for local ancestry inference.
# Predict assignments in an unphased genotype dataset in VCF/BCF format (same command as above)
hapla predict --bcf new.chr1.bcf --ref ref.chr1 --threads 64 --out new.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
# - new.chr1.bca
# - new.chr1.ids
# - new.chr1.win
# Predict assignments in an unphased genotype dataset in binary PLINK format (provide with file-prefix)
hapla predict --bfile new.chr1 --ref ref.chr1 --threads 64 --out new.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
# - new.chr1.bca
# - new.chr1.ids
# - new.chr1.win
Population structure inference and GRM estimation
hapla struct
Infer population structure and estimate genome-wide relationship matrix (GRM) using haplotype cluster alleles.
# Perform PCA on a single chromosome and extract top 20 eigenvectors
hapla struct --clusters hapla.chr1 --threads 64 --pca 20 --out hapla.chr1
# Saves eigenvalues and eigenvectors in text-format
# - hapla.chr1.eigenvecs
# - hapla.chr1.eigenvals
# Perform PCA on all chromosomes (genome-wide) using filelist and extract top 20 eigenvectors. Save loadings and haplotype cluster frequencies.
hapla struct --filelist hapla.filelist --threads 64 --pca 20 --out hapla --loadings
# Saves eigenvalues and eigenvectors in text-format
# - hapla.eigenvecs
# - hapla.eigenvals
# - hapla.loadings
# - hapla.freqs
# Construct genome-wide relationship matrix (GRM)
hapla struct --filelist hapla.filelist --threads 64 --grm --out hapla
# Saves the GRM in binary GCTA format (float)
# - hapla.grm.bin
# - hapla.grm.N.bin
# - hapla.grm.id
# Project samples on to existing PC space and extract eigenvectors
hapla struct --filelist new.filelist --threads 64 --out new --projection hapla
# Saves projected eigenvectors in text-format
# - new.project.eigenvecs
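For the GRM output above, a quick size check can catch truncated files before they are passed to other software. In GCTA's binary GRM format, the .grm.bin file stores the lower triangle of the matrix (including the diagonal) as 4-byte floats, and the .grm.id file lists one sample per line. The sketch below assumes hapla follows that layout exactly. The `check_grm` helper is hypothetical and not part of hapla.

```shell
# check_grm PREFIX: compare the size of PREFIX.grm.bin against the size
# expected for the number of samples listed in PREFIX.grm.id.
# Hypothetical helper for illustration, not part of hapla.
check_grm() {
    n=$(( $(wc -l < "$1.grm.id") ))
    # Lower triangle with diagonal: n * (n + 1) / 2 entries of 4 bytes each.
    expected=$(( n * (n + 1) / 2 * 4 ))
    actual=$(( $(wc -c < "$1.grm.bin") ))
    if [ "$actual" -eq "$expected" ]; then
        echo "$1: $n samples, $actual bytes (ok)"
    else
        echo "$1: expected $expected bytes for $n samples, found $actual" >&2
        return 1
    fi
}
```

After the GRM command above, this would be called as `check_grm hapla`.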
Ancestry estimation
hapla admix
Estimate ancestry proportions and ancestral haplotype cluster frequencies with a pre-specified number of sources (K). The model is a modified version of the fastmixture and HaploNet models, adapted for use with our haplotype clusters. Projection and supervised modes are also available.
# Estimate ancestry proportions assuming K=3 ancestral sources for a single chromosome
hapla admix --clusters hapla.chr1 --K 3 --seed 1 --threads 64 --out hapla.chr1
# Saves Q and P matrices in text-format
# - hapla.chr1.K3.s1.Q
# - hapla.chr1.K3.s1.P
# Estimate ancestry proportions assuming K=3 ancestral sources using filelist with all chromosomes
hapla admix --filelist hapla.filelist --K 3 --seed 1 --threads 64 --out hapla
# Saves Q matrix and file-specific P matrices in text-format, including a filelist of the P matrices
# - hapla.K3.s1.Q
# - hapla.K3.s1.chr{1..22}.P
# - hapla.K3.s1.pfilelist
# Estimate ancestry proportions in projection mode assuming K=3 ancestral sources using filelist with all chromosomes. Provide previously estimated ancestral haplotype cluster frequencies.
hapla admix --filelist new.filelist --K 3 --seed 1 --threads 64 --projection hapla.K3.s1.pfilelist --out new
# Saves Q matrix in text-format
# - new.project.K3.s1.Q
# Estimate ancestry proportions in supervised mode assuming K=3 ancestral sources using filelist with all chromosomes. Provide a single column text-file with population labels of the samples as integers, where 0 indicates no label.
hapla admix --filelist hapla.filelist --K 3 --seed 1 --threads 64 --supervised hapla.labels --out hapla.super
# Saves Q and P matrices in text-format, including a filelist of the P matrices
# - hapla.super.K3.s1.Q
# - hapla.super.K3.s1.chr{1..22}.P
# - hapla.super.K3.s1.pfilelist
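The label file for supervised mode can be assembled with standard tools. The sketch below assumes that labels must follow the sample order of the clustering output, that the .ids-file holds one sample name per line, and that a two-column table of sample names and integer labels is available. The `write_labels` helper and the table layout are hypothetical and not part of hapla.

```shell
# write_labels IDS POPMAP: print one integer label per sample, in IDS order.
# POPMAP is a hypothetical two-column file: sample name, integer label.
# Samples missing from POPMAP get 0, which indicates no label.
write_labels() {
    awk 'NR == FNR { label[$1] = $2; next }
         { print (($1 in label) ? label[$1] : 0) }' "$2" "$1"
}
```

For example, `write_labels hapla.chr1.ids ref.popmap > hapla.labels` would produce the single-column file passed to --supervised, where `ref.popmap` is a placeholder name for the label table.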
Local ancestry inference (Prototype)
hapla fatash
Infer local ancestry tracts in a hidden Markov model, using the admixture estimates from hapla admix. The model is a modified version of the fastPHASE model, adapted for use with our haplotype clusters.
# Infer local ancestry tracts for a single chromosome (posterior decoding)
hapla fatash --clusters hapla.chr1 --qfile hapla.chr1.K3.s1.Q --pfile hapla.chr1.K3.s1.P --threads 16 --out hapla.chr1
# Saves posterior decoding path in text-format
# - hapla.chr1.path
# Infer local ancestry tracts using filelist with all chromosomes (Viterbi decoding)
hapla fatash --filelist hapla.filelist --qfile hapla.K3.s1.Q --pfilelist hapla.K3.s1.pfilelist --threads 16 --out hapla --viterbi
# Saves Viterbi decoding paths in text-files
# - hapla.chr{1..22}.path