Skip to main content

Framework for haplotype clustering in phased genotype data

Project description

hapla (v0.61.0)

hapla is a framework for performing window-based haplotype clustering in phased genotype data. The inferred haplotype cluster alleles can be used to infer fine-scale population structure, perform polygenic prediction and haplotype cluster based association studies.

Citation

Please cite our papers.

hapla cluster
Paper in Nature Communications
Preprint available on medRxiv

hapla admix
Paper in HGG Advances
Preprint available on bioRxiv

Installation

# Option 1: Build and install via PyPI
pip install hapla

# Option 2: Download source and install via pip
git clone https://github.com/Rosemeis/hapla.git
cd hapla
pip install .

# Option 3: Download source and install in a new Conda environment
git clone https://github.com/Rosemeis/hapla.git
conda env create -f hapla/environment.yml
conda activate hapla

You can now run the hapla software and the subcommands.

If you run into issues with your installation on a HPC system, it could be due to a mismatch of CPU architectures between login and compute nodes (illegal instruction). You can try and remove every instance of the march=native compiler flag in the setup.py file which optimizes hapla to your specific hardware setup. Another alternative is to use the uv package manager, where you can run hapla in a temporary and isolated environment by simply adding uvx in front of the hapla command.

Quick start

hapla contains the following subcommands at this moment:

  • hapla cluster
  • hapla predict
  • hapla struct
  • hapla admix
  • hapla fatash

Haplotype clustering

hapla cluster
Window-based haplotype clustering in a phased VCF/BCF file (including index).

# Cluster haplotypes in a chromosome with fixed window size (16 SNPs)
hapla cluster --bcf data.chr1.bcf --size 16 --threads 8 --out hapla.chr1
# Saves inferred haplotype cluster assignments in binary hapla format
#	- hapla.chr1.bca
#	- hapla.chr1.ids
#	- hapla.chr1.win

hapla cluster outputs three files. A .bca-file (binary cluster assignments), which stores the cluster assignments as unsigned chars, a .ids-file with sample names and a .win-file with information about the genomic windows.

# Cluster haplotypes in a chromosome with fixed size and overlapping windows (step size 8 SNPs)
hapla cluster --bcf data.chr1.bcf --size 16 --step 8 --threads 8 --out hapla.chr1

# Cluster haplotypes in all chromosomes and save output path in a filelist
for c in {1..22}
do
    hapla cluster --bcf data.chr${c}.bcf --size 16 --threads 8 --out hapla.chr${c}
    echo "hapla.chr${c}" >> hapla.filelist
done

Optionally, the haplotype cluster alleles can be saved in binary PLINK format (.bed, .bim, .fam) for ease of use with other software. Note that window information needs to be inferred from .bim-file for downstream analyses in this case.

hapla cluster --bcf data.chr1.bcf --threads 8 --out hapla.chr1 --plink
# Saves inferred haplotype cluster alleles in a binary PLINK format
#	- hapla.chr1.bed
#	- hapla.chr1.bim
#	- hapla.chr1.fam

The number of inferred haplotype clusters will depend on the chosen window size (--size), the number of allowed clusters per window (--max-clusters), as well as $\lambda$ (--lmbda) and the minimum haplotype cluster size. $\lambda$ represents the fraction of the specified window size in SNPs, which is required to create a new cluster based on Hamming distance, with a default setting of --lmbda 0.1. The minimum haplotype cluster size can be adjusted using either --min-freq or --min-mac. The default setting is a minimum haplotype cluster frequency of at least 0.005 for the cluster to be retained (--min-freq 0.005), using --min-mac will override any setting for --min-freq. Smaller clusters will be iteratively removed until all inferred clusters meet the frequency/count criterion.

For larger sample sizes (N > 1000), prune outlier clusters (singletons and doubletons) before re-clustering step, using the --prune option, to speed up the haplotype clustering while producing nearly identical clustering results.

Predict haplotype cluster assignments

hapla predict
Predict haplotype cluster assignments using pre-computed cluster medians in a new set of haplotypes (VCF/BCF format). SNP sets must be overlapping.

# Cluster haplotypes in a chromosome with 'hapla cluster' and save cluster medians (--medians)
hapla cluster --bcf ref.chr1.bcf --size 8 --threads 8 --out ref.chr1 --medians
# Saves haplotype cluster medians (besides standard binary hapla format)
#	- ref.chr1.bcm
#	- ref.chr1.blk
#	- ref.chr1.wix

# Predict assignments in a set of new haplotypes using haplotype cluster medians
hapla predict --bcf new.chr1.bcf  --ref ref.chr1 --threads 8 --out new.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
#	- new.chr1.bca
#	- new.chr1.ids
#	- new.chr1.win

Using --medians in hapla cluster outputs three extra files. A .bcm-file (binary cluster medians), which stores the cluster medians as unsigned chars, a .blk-file, which stores pairwise log-likelihoods between the cluster medians, a .wix-file with window index information. The files are needed to predict haplotype clusters in a new set of haplotypes.

(Prototype) Predict haplotype cluster assignments using pre-computed cluster medians in an unphased genotype dataset (VCF/BCF or binary PLINK format). SNP sets must be overlapping. NOT suitable for local ancestry inference.

# Predict assignments in an unphased genotype dataset in VCF/BCF format (same command as above)
hapla predict --bcf new.chr1.bcf  --ref ref.chr1 --threads 8 --out new.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
#	- new.chr1.bca
#	- new.chr1.ids
#	- new.chr1.win

# Predict assignments in an unphased genotype dataset in binary PLINK format (provide with file-prefix)
hapla predict --bfile new.chr1 --ref ref.chr1 --threads 8 --out new.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
#	- new.chr1.bca
#	- new.chr1.ids
#	- new.chr1.win

Population structure inference and GRM estimation

hapla struct
Infer population structure and estimate genome-wide relationship matrix (GRM) using haplotype cluster alleles.

# Perform PCA on a single chromosome and extract top 20 eigenvectors
hapla struct --clusters hapla.chr1 --threads 32 --pca 20 --out hapla.chr1
# Saves eigenvalues and eigenvectors in text-format
#	- hapla.chr1.eigenvecs
#	- hapla.chr1.eigenvals

# Perform PCA on all chromosomes (genome-wide) using filelist and extract top 20 eigenvectors. Save loadings and haplotype cluster frequencies.
hapla struct --filelist hapla.filelist --threads 32 --pca 20 --out hapla --loadings
# Saves eigenvalues and eigenvectors in text-format
#	- hapla.eigenvecs
#	- hapla.eigenvals
#	- hapla.loadings
#	- hapla.freqs

# Construct genome-wide relationship matrix (GRM)
hapla struct --filelist hapla.filelist --threads 32 --grm --out hapla
# Saves the GRM in binary GCTA format (float)
#	- hapla.grm.bin
#	- hapla.grm.N.bin
#	- hapla.grm.id

# Project samples on to existing PC space and extract eigenvectors
hapla struct --filelist new.filelist --threads 32 --out new --projection hapla
# Saves eigenvalues and eigenvectors in text-format
#	- new.project.eigenvecs

Ancestry estimation

hapla admix
Estimate ancestry proportions and ancestral haplotype cluster frequencies with a pre-specified number of sources (K). Using a modified fastmixture and HaploNet model for use with our haplotype clusters. Projection and supervised modes are also available.

# Estimate ancestry proportions assuming K=3 ancestral sources for a single chromosome
hapla admix --clusters hapla.chr1 --K 3 --seed 1 --threads 32 --out hapla.chr1
# Saves Q and P matrices in text-format
#	- hapla.chr1.K3.s1.Q
#	- hapla.chr1.K3.s1.P

# Estimate ancestry proportions assuming K=3 ancestral sources using filelist with all chromosomes
hapla admix --filelist hapla.filelist --K 3 --seed 1 --threads 32 --out hapla
# Saves Q matrix and file-specific P matrices in text-format, including a filelist of the P matrices
#	- hapla.K3.s1.Q
#	- hapla.K3.s1.chr{1..22}.P
#	- hapla.K3.s1.pfilelist

# Estimate ancestry proportions in projection mode assuming K=3 ancestral sources using filelist with all chromosomes. Provide previously estimated ancestral haplotype cluster frequencies.
hapla admix --filelist new.filelist --K 3 --seed 1 --threads 32 --projection hapla.K3.s1.pfilelist --out new
# Saves Q matrix in text-format
#	- new.project.K3.s1.Q

# Estimate ancestry proportions in supervised mode assuming K=3 ancestral sources using filelist with all chromosomes. Provide a single column text-file with population labels of the samples as integers, where 0 indicates no label.
hapla admix --filelist hapla.filelist --K 3 --seed 1 --threads 32 --supervised hapla.labels --out hapla.super
# Saves Q and P matrices in text-format, including a filelist of the P matrices
#	- hapla.super.K3.s1.Q
#	- hapla.super.K3.s1.chr{1..22}.P
#	- hapla.super.K3.s1.pfilelist

Local ancestry inference (Prototype)

hapla fatash
Infer local ancestry tracts using the admixture estimation from hapla admix in a hidden markov model. Based on a modified fastPHASE model for use with our haplotype clusters.

# Infer local ancestry tracts for a single chromosome (posterior decoding)
hapla fatash --clusters hapla.chr1 --qfile hapla.chr1.K3.s1.Q --pfile hapla.chr1.K3.s1.P --threads 32 --out hapla.chr1
# Saves posterior decoding path in text-format
#	- hapla.chr1.path

# Infer local ancestry tracts using filelist with all chromosomes (Viterbi decoding)
hapla fatash --filelist hapla.filelist --qfile hapla.K3.s1.Q --pfilelist hapla.K3.s1.pfilelist --threads 32 --out hapla --viterbi
# Saves Viterbi decoding paths in text-files
#	- hapla.chr{1..22}.path

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hapla-0.61.0.tar.gz (1.1 MB view details)

Uploaded Source

File details

Details for the file hapla-0.61.0.tar.gz.

File metadata

  • Download URL: hapla-0.61.0.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for hapla-0.61.0.tar.gz
Algorithm Hash digest
SHA256 18079ccda2eaf31c821a2ce6f5a43d7253d648e7086ea5e6e8972c373241a97a
MD5 aabe86f8b1df7c63edfcd5b2534e5213
BLAKE2b-256 8ced89bdb97a4abedfc958bb0e36278b7cc298aecea91289fe23918fb764b3eb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page