# Methods for selective sweep inference
CAUTION: I did some re-organizing. I am testing and debugging in late November.
## Overview
isweep is an acronym: incomplete Selective sweep With Extended haplotypes Estimation Procedure.
This software implements statistical methods to study very recent, very strong positive selection.
- By very recent, we mean within the last 500 generations.
- By very strong, we mean a selection coefficient s >= 0.015 (1.5%).
The methods relate the lengths of IBD tracts to a coalescent model under selection.
We assume 1 selected allele at a biallelic locus.
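To see why tract lengths carry this signal: under a simple crossover model, two haplotypes whose common ancestor lived t generations ago share an IBD tract around a focal point whose two arms are independent Exponential(2t) lengths in Morgans, so recent shared ancestry (as under a sweep) yields long tracts. A toy simulation of this relationship (illustrative only, not the package's estimator under selection):

```python
import random

def simulate_ibd_length(tmrca, rng=random):
    """Length (cM) of an IBD tract around a focal point for two
    haplotypes whose common ancestor lived `tmrca` generations ago.

    Each side of the focal point ends at the first crossover on either
    lineage, i.e., an Exponential length with rate 2 * tmrca per Morgan.
    """
    rate = 2.0 * tmrca
    left = rng.expovariate(rate)   # Morgans
    right = rng.expovariate(rate)  # Morgans
    return 100.0 * (left + right)  # convert Morgans to cM

random.seed(1)
# Mean tract length is 100 / tmrca cM: recent ancestry -> long tracts.
recent = sum(simulate_ibd_length(10) for _ in range(10_000)) / 10_000
old = sum(simulate_ibd_length(200) for _ in range(10_000)) / 10_000
```

With tmrca = 10 the mean is about 10 cM versus about 0.5 cM at tmrca = 200, which is why anomalously long, frequent IBD tracts flag recent strong selection.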
There are seven major ideas:
- A genome-wide selection scan that looks for anomalously large IBD rates
- Inferring anomalously large IBD clusters
- Ranking alleles based on evidence for selection
- Computing a measure of cluster agglomeration (an IBD information entropy)
- Estimating allele frequency of unknown sweeping allele
- Estimating selection coefficients (w/ nice statistical properties)
- Estimating a confidence interval (w/ nice statistical properties)
These steps are implemented automatically in a Snakemake pipeline.
The input data are:
- Whole-genome sequences
  - Probably at least 500 diploid samples
  - Phased VCF data (`0|1`)
  - No apparent population structure
  - No apparent close relatedness
- A genetic map (bp ---> cM)
- Recombining diploid chromosomes
  - Not extended to the human X chromosome (yet?)
- Access to cluster computing
  - You should have at least 25 GB of RAM and 6 CPUs on a node, more for larger datasets
  - Not extended to cloud computing (yet?)
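The genetic map is what converts physical positions (bp) to genetic distances (cM). A minimal sketch of that conversion by linear interpolation (illustrative only; the pipeline handles this internally, and the toy map values here are made up):

```python
from bisect import bisect_left

def bp_to_cm(pos_bp, map_bp, map_cm):
    """Linearly interpolate a physical position (bp) to a genetic
    position (cM) given sorted genetic-map coordinates."""
    if pos_bp <= map_bp[0]:
        return map_cm[0]
    if pos_bp >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_left(map_bp, pos_bp)
    # interpolate between map points i-1 and i
    frac = (pos_bp - map_bp[i - 1]) / (map_bp[i] - map_bp[i - 1])
    return map_cm[i - 1] + frac * (map_cm[i] - map_cm[i - 1])

# toy map: three points on one chromosome
bp = [1_000_000, 2_000_000, 4_000_000]
cm = [0.0, 1.2, 3.0]
mid = bp_to_cm(3_000_000, bp, cm)  # halfway between the last two map points
```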
## Installation
```shell
git clone https://github.com/sdtemple/isweep.git
conda env create -f isweep-environment.yml
conda activate isweep
python -c 'import site; print(site.getsitepackages())'
```
- Download the other software (see below)
  - Remember where you put these; put them all in the same place
  - You need to cite these software tools
## Computer overview
This repository contains a Python package and some Snakemake bioinformatics pipelines.
- The package ---> `src/`
- The pipelines ---> `workflow/`
You should run all Snakemake pipelines from their `workflow/some-pipeline/` directories.
You should be in the activated `isweep` conda environment for analyses.
You should run the analyses using cluster jobs.
We have made README.md files in most subfolders.
## Running the isweep procedure
This is the overall procedure. You will find more details for each step in the `workflow/some-pipeline/README.md` files.
Phase your data w/ Beagle or SHAPEIT beforehand.
- Make pointers to the large (phased) VCF files
- Edit the YAML files in the different workflow directories
- Run the selection scan (`workflow/scan`):
  ```shell
  nohup snakemake -s Snakefile-scan.smk -c1 --cluster "[options]" --jobs X --configfile *.yaml &
  ```
- Check the `*.log` files from ibd-ends. If it recommends an estimated `err`, change the error rate in the YAML file.
- Estimate recent effective sizes (`workflow/scan`):
  ```shell
  workflow/scan/terminalscripts/run-ibdne.sh
  ```
- Make the Manhattan plot (`workflow/scan`):
  ```shell
  workflow/scan/terminalscripts/manhattan.py
  ```
- Check out the `roi.tsv` file:
  ```shell
  cat roi.tsv
  ```
  - Edit it with locus names if you want
- Run the region-of-interest analysis (`workflow/roi`):
  ```shell
  nohup snakemake -s Snakefile-roi.smk -c1 --cluster "[options]" --jobs X --configfile *.yaml &
  ```
Tip: define variables for file and folder names, e.g.:
```shell
variable1=1224
echo $variable1
```
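As a concrete illustration of the `--cluster "[options]"` placeholder, a SLURM-style invocation might look like the following. The resource values, job count, and the `scan.yaml` filename are hypothetical; match them to your scheduler, node limits, and actual config file:

```shell
# Hypothetical SLURM example; adjust memory, CPUs, time, and YAML name.
nohup snakemake -s Snakefile-scan.smk -c1 \
  --cluster "sbatch --mem=25G --cpus-per-task=6 --time=24:00:00" \
  --jobs 50 \
  --configfile scan.yaml &
```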
## Other software
Put these in a common folder!
- Beagle (https://faculty.washington.edu/browning/beagle/beagle.html)
- hap-ibd.jar (https://github.com/browning-lab/hap-ibd)
- ibd-ends.jar (https://github.com/browning-lab/ibd-ends)
- filter-lines.jar (https://faculty.washington.edu/browning/beagle_utilities/utilities.html)
- ibdne.jar (https://faculty.washington.edu/browning/ibdne.html)
- IBDkin (https://github.com/YingZhou001/IBDkin)
- bcftools
- tabix
- gzip
bcftools, tabix, and gzip are already in the isweep conda environment.
If you perform a simulation study, you will require more software:
- SLiM (https://messerlab.org/slim/)
- add-uniform-err.jar (https://faculty.washington.edu/browning/beagle_utilities/utilities.html)
If you want to compare against other methods (using our pipelines), you will require more software; see the `workflow/other-methods/` folder.
## Citation
Please cite if you use this package.
Temple, S.D., Waples, R.K., Browning, S.R. (2023) "Modeling recent positive selection in Americans of European ancestry" https://www.biorxiv.org/content/10.1101/2023.11.13.566947v1
## Do these methods apply to my sample size?
It'd be nice if you have:
- Ancestral Ne >= 3000
- No population bottleneck with Ne < 3000 in last 500 generations
- A sample from a single generation
  - I.e., use other methods for flies, mosquitoes, worms
You could test if there is enough IBD:
- In Python, use the `simulate_ibd` and `read_Ne` functions in `isweep`
- The Ne file is tab-separated w/ two columns: generation, Ne
  - You should provide at least 500 generations
  - It is okay to use a guess of Ne before inferring it with IBDNe
- Do you have >= 200 IBD tracts? Then, probably yes.
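The Ne file format above can be sketched as follows. This is an illustrative parser and toy file, not the package's `read_Ne` (whose exact signature may differ):

```python
import os
import tempfile

def read_ne(path):
    """Read a tab-separated Ne file with two columns: generation, Ne.
    Returns a dict mapping generation -> effective size."""
    ne = {}
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            gen, size = line.split("\t")
            ne[int(gen)] = float(size)
    return ne

# Write a toy constant-size demography covering 500 generations.
tmp = tempfile.NamedTemporaryFile("w", suffix=".ne", delete=False)
for g in range(501):
    tmp.write(f"{g}\t10000\n")
tmp.close()

ne = read_ne(tmp.name)
assert max(ne) >= 500  # at least 500 generations, as recommended
os.unlink(tmp.name)
```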
Some species from stdpopsim that may apply:
- Great apes
  - Pongo abelii
  - Pan troglodytes
  - Papio anubis
- Anas platyrhynchos (duck)
- Apis mellifera (bee)
- Canis familiaris (dog)
- Address population structure and admixture first
## Development things to do
- Provide some scripts to summarize analyses
- Give example commands, scripts for pre-processing steps
- Change all snakescripts/ to terminalscripts/
- Severely simplify the YAML files
  - So many parameters for methods development; not user-friendly
- Test performance in array data, less dense sequence data