ExP Selection
ExP heatmap example - LCT gene
This is the ExP heatmap of the human lactase (LCT) gene on chromosome 2 and its surrounding genomic region, displaying differences among the 26 populations of the 1000 Genomes Project, phase 3. The displayed values are the adjusted rank p-values for the cross-population extended haplotype homozygosity (XPEHH) selection test.
Requirements
- python >= 3.8
- vcftools (repository)
- space on disk (.vcf files are usually quite large)
Install
pip install exp-selection
Workflow
Usage
Get the data
- VCF files (e.g. 1000 Genomes Project, phase 3, chr22)
- Panel file (e.g. 1000 Genomes Project)
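The panel file maps each sample to its population. As a minimal sketch, assuming the tab-separated 1000 Genomes layout with `sample`, `pop`, `super_pop` and `gender` columns, the snippet below groups samples by population and lists the unordered population pairs. The sample IDs are made up for illustration, and the helper name `samples_by_population` is hypothetical; it is not part of exp-selection.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Tiny stand-in for a panel file (sample IDs are invented for illustration).
PANEL_TEXT = (
    "sample\tpop\tsuper_pop\tgender\n"
    "SAMPLE_A\tGBR\tEUR\tmale\n"
    "SAMPLE_B\tGBR\tEUR\tfemale\n"
    "SAMPLE_C\tCHB\tEAS\tmale\n"
    "SAMPLE_D\tYRI\tAFR\tfemale\n"
)


def samples_by_population(handle):
    """Group sample IDs by the value in the 'pop' column."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["pop"]].append(row["sample"])
    return dict(groups)


groups = samples_by_population(io.StringIO(PANEL_TEXT))
# Unordered pairs of population codes, in alphabetical order.
pairs = list(combinations(sorted(groups), 2))
print(groups)
print(pairs)
```

To inspect a real panel file, pass an open file handle (for example `open("genotypes.panel", newline="")`) instead of the `io.StringIO` wrapper.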
Prepare the data
Extract only SNPs
You can provide either a .vcf or a .vcf.gz file
# we will use SNPs only, so we remove insertion/deletion polymorphisms
# another option would be to use only biallelic SNPs (--min-alleles 2 --max-alleles 2),
# probably with minor allele frequency above 5% (--maf 0.05)
# output VCF will be named DATA.recode.vcf
# Gzipped VCF
vcftools --gzvcf DATA.vcf.gz --remove-indels --recode --recode-INFO-all --out DATA
# Plain VCF
vcftools --vcf DATA.vcf --remove-indels --recode --recode-INFO-all --out DATA
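To make the filtering rule concrete, here is a simplified Python sketch of what removing indels means: a site is dropped when any ALT allele has a different length than REF. This illustrates the idea only and is not how vcftools is implemented; the function names `has_indel` and `drop_indels` and the example records are hypothetical.

```python
def has_indel(ref, alt):
    """Return True if any ALT allele changes the length of the REF allele."""
    alleles = [a for a in alt.split(",") if a != "."]  # "." means no ALT allele
    return any(len(a) != len(ref) for a in alleles)


def drop_indels(lines):
    """Yield header lines and records whose ALT alleles keep the REF length."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]  # VCF columns: CHROM POS ID REF ALT ...
        if not has_indel(ref, alt):
            yield line


# Invented records for illustration: one SNP, one deletion, one multiallelic SNP.
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "22\t100\t.\tA\tG\t.\tPASS\t.\n",
    "22\t200\t.\tAT\tA\t.\tPASS\t.\n",
    "22\t300\t.\tC\tT,G\t.\tPASS\t.\n",
]
kept = list(drop_indels(vcf_lines))
print(len(kept))
```

For real data, prefer the vcftools commands above; this sketch skips details such as symbolic alleles and compressed input.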
Prepare data for computing
# DATA.recode.vcf is the VCF from the previous step
# DATA.zarr is the path (folder) where the zarr representation of the VCF input will be saved
exp-selection prepare DATA.recode.vcf DATA.zarr
Compute pairwise values
# DATA.zarr is the zarr data from the previous step
# genotypes.panel is the panel file (see "Get the data")
# DATA.output is the path (folder) where the results will be saved
# in this step, by default, the cross-population extended haplotype homozygosity (XPEHH) score is computed for all provided positions, together with their -log10 rank p-values
exp-selection compute DATA.zarr genotypes.panel DATA.output
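For intuition about the -log10 rank p-values, the sketch below shows a generic empirical rank p-value: each score's rank (1 for the largest) divided by the number of scores, then transformed with -log10. This is an assumption for illustration only. The exact definition used by exp-selection, including any adjustment, is not described here, and the function name `neg_log10_rank_pvalues` is hypothetical.

```python
import math


def neg_log10_rank_pvalues(scores):
    """Return -log10(rank / n) per score, where rank 1 is the largest score.

    Ties are broken by input order in this simplified version.
    """
    n = len(scores)
    # Indices of scores sorted from largest to smallest.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    result = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        result[idx] = -math.log10(rank / n)
    return result


scores = [0.1, 2.5, -1.0, 0.7]  # invented test statistics
values = neg_log10_rank_pvalues(scores)
print(values)
```

With this transformation the most extreme score gets the largest value, so strong signals stand out on a heatmap color scale.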
Display ExP heatmap
- --begin, --end (required): plot boundaries
- --title (optional): name of the image
- --cmap (optional): color scheme; more information is available in the seaborn package documentation
- --output (optional): PNG output path
exp-selection plot DATA.output --begin BEGIN --end END --title TITLE --output NAME
Example
This example analyzes chromosome 22 data from the 1000 Genomes Project, phase 3, chosen for its small size and thus reasonably fast computation. It focuses on the ADM2 gene (link), which is active especially in the reproductive system, and in angiogenesis and the cardiovascular system in general.
# Download datasets
wget "ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr22_GRCh38.genotypes.20170504.vcf.gz" -O chr22.genotypes.vcf.gz
wget "ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel" -O genotypes.panel
# If the 1000 Genomes Project FTP server is not working, you can get the VCF files (GRCh37 version) from this mirror
wget "https://ddbj.nig.ac.jp/public/mirror_database/1000genomes/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" -O chr22.genotypes.vcf.gz
wget "https://ddbj.nig.ac.jp/public/mirror_database/1000genomes/release/20130502/integrated_call_samples_v3.20130502.ALL.panel" -O genotypes.panel
# Compute files for graph
vcftools --gzvcf chr22.genotypes.vcf.gz \
--remove-indels \
--recode \
--recode-INFO-all \
--out chr22.genotypes
exp-selection prepare chr22.genotypes.recode.vcf chr22.genotypes.recode.zarr
exp-selection compute chr22.genotypes.recode.zarr genotypes.panel chr22.genotypes.output
# Plot heatmap
exp-selection plot chr22.genotypes.output --begin 50481556 --end 50486440 --title ADM2 --output adm2_GRCh38
exp-selection plot chr22.genotypes.output --begin 50910000 --end 50950000 --title ADM2 --output adm2_GRCh37 # use this command instead if your VCF input files are the GRCh37 version
# The heatmap is saved as adm2_GRCh38.png or adm2_GRCh37.png, depending on which plot command you ran.
Contributors
Acknowledgement