Framework for haplotype clustering in phased genotype data

hapla (v0.13)

hapla is a framework for window-based haplotype clustering in phased genotype data. The inferred haplotype cluster alleles can be used to infer fine-scale population structure and to perform polygenic prediction and haplotype-cluster-based association studies.

Citation

medRxiv preprint

Installation

# Option 1: Build and install via PyPI
pip install hapla

# Option 2: Download source and install via pip
git clone https://github.com/Rosemeis/hapla.git
cd hapla
pip install .

# Option 3: Download source and install in a new Conda environment
git clone https://github.com/Rosemeis/hapla.git
conda env create -f hapla/environment.yml
conda activate hapla

You can now run hapla and its subcommands.

Quick start

hapla currently provides the following subcommands:

  • hapla cluster
  • hapla struct
  • hapla predict
  • hapla admix
  • hapla fatash

Haplotype clustering

hapla cluster
Window-based haplotype clustering in a phased VCF/BCF.

# Cluster haplotypes in a chromosome with fixed window size (8 SNPs)
hapla cluster --bcf data.chr1.bcf --size 8 --threads 16 --out hapla.chr1
# Saves inferred haplotype cluster assignments in binary hapla format
#	- hapla.chr1.bca
#	- hapla.chr1.ids
#	- hapla.chr1.win

hapla cluster outputs three files: a .bca-file (binary cluster assignments), which stores the cluster assignments as unsigned chars; a .ids-file with sample names; and a .win-file with information about the genomic windows.
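To make the windowing concrete, the sketch below lists the SNP index ranges covered by fixed-size windows, with an optional step for overlapping windows (the roles played by --size and --step in this section). The helper name list_windows is hypothetical and not part of hapla, and the sketch assumes that only full windows are kept; the source does not say how hapla treats a trailing partial window, so the .win-file remains the authoritative record.

```shell
# Hypothetical helper (not part of hapla): print the 1-based first and last
# SNP index of every full window.
# Arguments: number of SNPs, window size, step (defaults to the window size).
# Assumption: a trailing partial window is dropped.
list_windows() (
	n_snps=$1
	size=$2
	step=${3:-$2}
	start=1
	while [ $((start + size - 1)) -le "$n_snps" ]; do
		echo "$start $((start + size - 1))"
		start=$((start + step))
	done
)

# 20 SNPs, 8 SNPs per window, step of 4 SNPs between window starts
list_windows 20 8 4
```

With 20 SNPs this prints four overlapping windows (1 8, 5 12, 9 16, 13 20); omitting the step gives two non-overlapping windows.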

# Cluster haplotypes in a chromosome with fixed size and overlapping windows (step size 4)
hapla cluster --bcf data.chr1.bcf --size 8 --step 4 --threads 16 --out hapla.chr1

# Cluster haplotypes in all chromosomes and save output path in a filelist
for c in {1..22}
do
	hapla cluster --bcf data.chr${c}.bcf --size 8 --threads 16 --out hapla.chr${c}
	realpath hapla.chr${c} >> hapla.filelist
done
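Before passing the filelist to downstream subcommands, it can be useful to confirm that every listed prefix has the three files written by hapla cluster. The following is a minimal sketch; check_filelist is a hypothetical helper, not a hapla command, and it only tests that the files exist and are non-empty.

```shell
# Hypothetical helper (not part of hapla): check that each prefix in a
# filelist has non-empty .bca, .ids and .win files.
# Returns a non-zero status if any file is missing or empty.
check_filelist() (
	status=0
	while IFS= read -r prefix; do
		# Skip blank lines in the filelist
		[ -n "$prefix" ] || continue
		for ext in bca ids win; do
			if [ ! -s "${prefix}.${ext}" ]; then
				echo "missing or empty: ${prefix}.${ext}" >&2
				status=1
			fi
		done
	done < "$1"
	exit "$status"
)

# Example usage:
#	check_filelist hapla.filelist && echo "all cluster files found"
```

Note that the loop above appends to hapla.filelist with >>, so rerunning it adds duplicate entries unless the filelist is deleted first.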

Optionally, the haplotype cluster alleles can be saved in binary PLINK format (.bed, .bim, .fam) for ease of use with other software. Note that in this case the window information needs to be inferred from the .bim-file for downstream analyses.

hapla cluster --bcf data.chr1.bcf --threads 16 --out hapla.chr1 --plink
# Saves inferred haplotype cluster alleles in a binary PLINK format
#	- hapla.chr1.bed
#	- hapla.chr1.bim
#	- hapla.chr1.fam

GRM estimation and population structure inference

hapla struct
Estimate genome-wide relationship matrix (GRM) and infer population structure using the haplotype cluster alleles.

# Construct genome-wide relationship matrix (GRM)
hapla struct --filelist hapla.filelist --threads 16 --grm --out hapla
# Saves the GRM in binary GCTA format (float)
#	- hapla.grm.bin
#	- hapla.grm.N.bin
#	- hapla.grm.id

# Perform PCA on all chromosomes (genome-wide) using filelist and extract top 20 eigenvectors
hapla struct --filelist hapla.filelist --threads 16 --pca 20 --out hapla
# Saves eigenvalues and eigenvectors in text-format
#	- hapla.eigenvecs
#	- hapla.eigenvals

# Or perform PCA on a single chromosome and extract top 20 eigenvectors
hapla struct --clusters hapla.chr1 --threads 16 --pca 20 --out hapla.chr1
# Saves eigenvalues and eigenvectors in text-format
#	- hapla.chr1.eigenvecs
#	- hapla.chr1.eigenvals

# A faster randomized SVD approach can also be utilized for large datasets (>5,000 individuals)
hapla struct --filelist hapla.filelist --threads 16 --pca 20 --randomized --out hapla
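As a quick sanity check on the GRM output, the size of the .grm.bin file can be compared with the number of samples in the .grm.id file. The sketch below assumes the standard GCTA binary layout, in which the lower triangle of the matrix including the diagonal is stored as 4-byte floats, giving n(n+1)/2 values for n samples. check_grm_size is a hypothetical helper, not a hapla command.

```shell
# Hypothetical helper (not part of hapla): compare the size of
# <prefix>.grm.bin with the size implied by <prefix>.grm.id.
# Assumption: GCTA layout, lower triangle incl. diagonal, 4 bytes per value.
check_grm_size() (
	prefix=$1
	n=$(wc -l < "${prefix}.grm.id") || exit 2
	n=$((n))	# normalise whitespace padding from wc
	expected=$((n * (n + 1) / 2 * 4))
	actual=$(wc -c < "${prefix}.grm.bin") || exit 2
	actual=$((actual))
	echo "samples=${n} expected_bytes=${expected} actual_bytes=${actual}"
	[ "$expected" -eq "$actual" ]
)

# Example usage:
#	check_grm_size hapla || echo "unexpected GRM size" >&2
```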

Predict haplotype cluster assignments

hapla predict
Predict haplotype cluster assignments in a new set of haplotypes using pre-computed cluster medians. The SNP sets of the reference and the new haplotypes must overlap.

# Cluster haplotypes in a chromosome with 'hapla cluster' and save cluster medians (--medians)
hapla cluster --bcf ref.chr1.bcf --size 8 --threads 16 --out ref.chr1 --medians
# Saves haplotype cluster medians (besides standard binary hapla format)
#	- ref.chr1.bcm
#	- ref.chr1.wix
#	- ref.chr1.hcc

# Predict assignments in a set of new haplotypes using haplotype cluster medians
hapla predict --bcf new.chr1.bcf --threads 16 --out new.chr1 --ref ref.chr1
# Saves predicted haplotype cluster assignments in binary hapla format
#	- new.chr1.bca
#	- new.chr1.ids
#	- new.chr1.win

Using --medians in hapla cluster outputs three extra files: a .bcm-file (binary cluster medians), which stores the cluster medians as unsigned chars; a .wix-file with window index information; and a .hcc-file with haplotype cluster counts. All files are needed to predict haplotype clusters in a new set of haplotypes.

Admixture estimation

hapla admix
Estimate ancestry proportions and ancestral haplotype cluster frequencies for a pre-specified number of sources (K), using a modified ADMIXTURE model for haplotype clusters.

# Estimate ancestry proportions assuming K=3 ancestral sources for a single chromosome
hapla admix --clusters hapla.chr1 --K 3 --seed 1 --threads 16 --out hapla.chr1
# Saves Q matrix and P matrix in a text-file format
#	- hapla.chr1.K3.s1.Q
#	- hapla.chr1.K3.s1.P

# Estimate ancestry proportions assuming K=3 ancestral sources using filelist with all chromosomes
hapla admix --filelist hapla.filelist --K 3 --seed 1 --threads 16 --out hapla
# Saves Q matrix in a text-file and separate text-files of P matrices for each file
#	- hapla.K3.s1.Q
#	- hapla.K3.s1.file{1..22}.P
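Since each row of a Q matrix holds ancestry proportions, its entries should sum to approximately 1. The sketch below checks this with awk. It assumes the .Q file is whitespace-delimited with one row per sample, K columns and no header, which the source does not spell out; check_q_rows and its tolerance argument are hypothetical and not part of hapla.

```shell
# Hypothetical helper (not part of hapla): report rows of a Q file whose
# entries do not sum to 1 within a tolerance (default 1e-4).
# Assumption: whitespace-delimited text, one row per sample, no header.
check_q_rows() {
	awk -v tol="${2:-1e-4}" '
		{
			sum = 0
			for (i = 1; i <= NF; i++) sum += $i
			diff = sum - 1
			if (diff < 0) diff = -diff
			if (diff > tol + 0) {
				printf "row %d sums to %.6f\n", NR, sum
				bad = 1
			}
		}
		END { exit (bad ? 1 : 0) }
	' "$1"
}

# Example usage:
#	check_q_rows hapla.K3.s1.Q || echo "Q rows do not sum to 1" >&2
```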

Local ancestry inference

hapla fatash
Infer local ancestry tracts using the admixture estimates in a hidden Markov model, based on a modified fastPHASE model for haplotype clusters.

# Infer local ancestry tracts for a single chromosome (posterior decoding)
hapla fatash --clusters hapla.chr1 --qfile hapla.chr1.K3.s1.Q --pfile hapla.chr1.K3.s1.P --threads 16 --out hapla.chr1
# Saves posterior decoding path in text-format
#	- hapla.chr1.path

# Infer local ancestry tracts using filelist with all chromosomes (Viterbi decoding)
for c in {1..22}; do realpath hapla.K3.s1.file${c}.P >> hapla.K3.s1.pfilelist; done
hapla fatash --filelist hapla.filelist --qfile hapla.K3.s1.Q --pfilelist hapla.K3.s1.pfilelist --threads 16 --out hapla --viterbi
# Saves Viterbi decoding paths in text-files
#	- hapla.file{1..22}.path
