Genetic diversity metrics from population genomic datasets.
Pypgen provides utilities for estimating standard genetic diversity measures, including Gst, G’st, G’’st, and Jost’s D, from large genomic datasets (Hedrick, 2005; Jost, 2008; Nei, 1973; Nei & Chesser, 1983). Pypgen operates on individual SNPs and on user-defined regions (e.g., five-kilobase windows tiled across each chromosome). For the windowed analyses, pypgen estimates the multi-locus version of each estimator.
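As an illustration of what "windows tiled across each chromosome" means, here is a minimal sketch. The function name `tile_windows` and the 1-based, inclusive coordinates are assumptions made for this example; they are not taken from pypgen's code.

```python
def tile_windows(chrom_length, window_size=5000):
    """Yield (start, stop) windows tiled end to end across one chromosome.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    The last window is truncated at the end of the chromosome.
    """
    start = 1
    while start <= chrom_length:
        stop = min(start + window_size - 1, chrom_length)
        yield start, stop
        start = stop + 1


# A hypothetical 12 kb chromosome split into 5 kb windows.
print(list(tile_windows(12000)))
# [(1, 5000), (5001, 10000), (10001, 12000)]
```

Each window would then be summarized on its own, so the window size trades resolution against the number of SNPs available per estimate.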
- Handles multiallelic SNP calls
- Allows a single VCF file to contain multiple populations
- Operates on standard VCF (Variant Call Format) formatted SNP calls
- Uses bgzipped input for fast random access
- Takes advantage of multiple processor cores
- Calculates additional metrics:
  - SNP count per window
  - mean read depth (+/- STDEV) per window
  - populations with fixed alleles per SNP
  - more as I think of them
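The per-window metrics above are simple summaries. The sketch below shows one way to compute a SNP count and the mean and standard deviation of read depth for a single window. The helper `summarize_window` is hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption; pypgen may use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def summarize_window(depths):
    """Summarize one window from a list of read depths, one value per SNP."""
    return {
        "snp_count": len(depths),
        "total_depth_mean": mean(depths),
        # Assumption: population standard deviation; swap in
        # statistics.stdev for the sample standard deviation.
        "total_depth_stdev": pstdev(depths),
    }


# Three hypothetical SNPs with read depths 10, 12, and 14.
summary = summarize_window([10, 12, 14])
print(summary["snp_count"], summary["total_depth_mean"])  # 3 12
```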
PYPGEN IS STILL IN ACTIVE DEVELOPMENT AND ALMOST CERTAINLY CONTAINS BUGS. If you find a bug, please file a report in the issues section of the GitHub repository and I’ll address it as soon as I can.
- Sliding window analysis (vcf_sliding_window.py)
- Per SNP analysis (vcf_snpwise_fstats.py)
First, install samtools. On OS X I recommend using Homebrew to do this. Once samtools is installed and available in your terminal, you can use either pip or setuptools to install the current release of pypgen:
pip install pypgen
Alternatively, if you like to live on the edge, you can clone and install the current development version from GitHub:
pip install -e git+https://github.com/ngcrawford/pypgen.git#egg=pypgen
More detailed documentation will be forthcoming, but in the meantime information about each script can be obtained by running:
python [script name].py --help
Note: this will probably change.
- chrm = Name of chromosome
- start = Starting position of window
- stop = Ending position of window
- snp_count = Total number of SNPs in window
- total_depth_mean = Mean read depth across window
- total_depth_stdev = Standard deviation of read depth across window
- Pop1.sample_count.mean = Mean number of samples per SNP for ‘Pop1’
- Pop1.sample_count.stdev = Standard deviation of samples per SNP for ‘Pop1’
- Pop2.sample_count.mean = Mean number of samples per SNP for ‘Pop2’
- Pop2.sample_count.stdev = Standard deviation of samples per SNP for ‘Pop2’
- Pop2.Pop1.D_est = Multilocus Dest (Jost 2008)
- Pop2.Pop1.G_double_prime_st_est = G’’st (Meirmans & Hedrick 2011)
- Pop2.Pop1.G_prime_st_est = Standardized Gst (Hedrick 2005)
- Pop2.Pop1.Gst_est = Fst corrected for sample size and allowing for multiallelic loci (Nei & Chesser 1983)
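To make the four estimator columns concrete, here is a simplified sketch of the underlying statistics computed from allele frequencies. It is not pypgen's implementation. It assumes equal population weights, omits the sample-size corrections of Nei & Chesser (1983) that the `_est` columns imply, and combines loci by averaging Hs and Ht across the window before forming each ratio, which is one common multi-locus approach. All function names are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def hs_ht(pop_freqs):
    """Return (Hs, Ht) for one locus.

    pop_freqs holds one list of allele frequencies per population.
    Hs is the mean within-population heterozygosity; Ht is the
    heterozygosity of the unweighted mean allele frequencies.
    """
    k = len(pop_freqs)
    hs = sum(heterozygosity(f) for f in pop_freqs) / k
    n_alleles = len(pop_freqs[0])
    mean_freqs = [sum(f[i] for f in pop_freqs) / k for i in range(n_alleles)]
    return hs, heterozygosity(mean_freqs)


def estimators(hs, ht, k):
    """Uncorrected Gst, G'st, G''st, and Jost's D for k populations."""
    if ht == 0:
        return None  # monomorphic: the statistics are undefined
    gst = (ht - hs) / ht
    return {
        "Gst": gst,                                                    # Nei 1973
        "G_prime_st": gst * (k - 1 + hs) / ((k - 1) * (1 - hs)),       # Hedrick 2005
        "G_double_prime_st": k * (ht - hs) / ((k * ht - hs) * (1 - hs)),  # Meirmans & Hedrick 2011
        "D": ((ht - hs) / (1 - hs)) * (k / (k - 1)),                   # Jost 2008
    }


def multilocus(loci):
    """Average Hs and Ht over the loci in a window, then form the ratios."""
    k = len(loci[0])
    pairs = [hs_ht(locus) for locus in loci]
    mean_hs = sum(hs for hs, _ in pairs) / len(pairs)
    mean_ht = sum(ht for _, ht in pairs) / len(pairs)
    return estimators(mean_hs, mean_ht, k)


# Two populations fixed for different alleles: every statistic equals 1.
print(estimators(*hs_ht([[1.0, 0.0], [0.0, 1.0]]), k=2))
```

With two populations fixed for alternative alleles, Hs is 0 and Ht is 0.5, so all four statistics reach their maximum of 1. Identical populations give Hs equal to Ht, and all four are 0.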
- chrm = Name of chromosome
- pos = Position of SNP
- outgroups = Number of samples
- Pop1 = Population ID
- Pop1.Pop2.D_est = Multilocus Dest (Jost 2008)
- Pop1.Pop2.G_double_prime_st_est = G’’st (Meirmans & Hedrick 2011)
- Pop1.Pop2.G_prime_st_est = Standardized Gst (Hedrick 2005)
- Pop1.Pop2.Gst_est = Fst corrected for sample size and allowing for multiallelic loci (Nei & Chesser 1983)
- Pop1_fixed = Set to 1 (true) if ‘Pop1’ is fixed for a single allele at this SNP.
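The fixed-allele flag can be pictured as a check on the genotype calls of one population at one SNP. The sketch below assumes standard VCF `GT` strings (alleles separated by `/` or `|`, with `.` for a missing call). The function `fixed_flag` is hypothetical and only approximates whatever rule pypgen actually applies, for example in how it treats missing calls.

```python
import re


def fixed_flag(genotypes):
    """Return 1 if every called allele in the population is the same, else 0.

    genotypes is a list of VCF GT strings such as "0/0", "0|1", or "./.".
    Missing alleles (".") are ignored; a population with no calls returns 0.
    """
    alleles = set()
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele != ".":
                alleles.add(allele)
    return 1 if len(alleles) == 1 else 0


print(fixed_flag(["0/0", "0|0", "./."]))  # 1
print(fixed_flag(["0/1", "1/1"]))         # 0
```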