
Methods for selective sweep inference


Statistical inference using long IBD segments

There is a major methodological update for multiple-testing corrections.

Please read misc/multiple-testing.md, and see the citations below for more details.

See workflow/scan-case-control if you are here for IBD mapping, not selection.

See misc/usage.md to evaluate whether this methodology fits your study.

See misc/cluster-options.md for some suggested cluster options to use in pipelines.

See the closed Issues on GitHub for comments about the pipeline.

Contact sethtem@umich.edu or open a GitHub issue for troubleshooting.

Citation


Please cite the relevant papers below if you use this package.

Methods to model selection:

Temple, S.D., Waples, R.K., Browning, S.R. (2024). Modeling recent positive selection using identity-by-descent segments. The American Journal of Human Genetics. https://doi.org/10.1016/j.ajhg.2024.08.023.

IBD central limit theorems:

Temple, S.D., Thompson, E.A. (2024). Identity-by-descent segments in large samples. Preprint at bioRxiv, 2024.06.05.597656. https://www.biorxiv.org/content/10.1101/2024.06.05.597656v2.

Genome-wide significance thresholds in the selection scan:

Temple, S.D., Browning, S.B. (2024). Multiple testing corrections in selection studies using identity-by-descent segments. Draft in progress.

Genome-wide significance thresholds in the IBD case-control mapping:

Temple, S.D., ..., Wijsman, E., and Blue, E. (2024-25). Multiple testing corrections in case-control studies using identity-by-descent segments. Draft in progress.

Simulating IBD around a locus:

Temple, S.D., Browning, S.B., and Thompson, E.A. (2024). Fast simulation of identity-by-descent segments. Preprint at bioRxiv, 2024.12.13.628449. https://www.biorxiv.org/content/10.1101/2024.12.13.628449v2.

Unifying framework of the selection scan and sweep modeling:

Temple, S.D. (2024). Statistical Inference using Identity-by-Descent Segments: Perspectives on Recent Positive Selection. PhD thesis (University of Washington). https://www.proquest.com/docview/3105584569?sourcetype=Dissertations%20&%20Theses.

Methodology


Acronym: iSWEEP, incomplete Selective sweep With Extended haplotypes Estimation Procedure.

This software provides methods to study recent, strong positive selection.

  • By recent, we mean within the last 500 generations.
  • By strong, we mean a selection coefficient s >= 0.015 (1.5%).
  • The scan may have moderate power for s >= 0.01 (1%).

In modeling a sweep, we assume one selected allele at a locus.

Automated analysis pipeline(s):


  1. A genome-wide selection scan for anomalously large IBD rates
  • With multiple testing correction
  2. Inferring anomalously large IBD clusters
  3. Ranking alleles based on evidence for selection
  4. Computing a measure of cluster agglomeration (Gini impurity index; see the sketch below)
  5. Estimating the frequency and location of the unknown sweeping allele
  6. Estimating a selection coefficient
  7. Estimating a confidence interval
Step 1 can stand alone, depending on the analysis; you may not need to model putative sweeps (Steps 2-7).
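For step 4, here is a minimal sketch of the Gini impurity index, 1 - Σ p_i², applied to IBD cluster membership proportions. The function name and example inputs are illustrative assumptions, not the package's API.

def gini_impurity(cluster_sizes):
    # Gini impurity: 1 minus the sum of squared membership proportions.
    # Near 0: haplotypes concentrated in one cluster (agglomerated).
    # Near 1: haplotypes spread across many small clusters.
    total = sum(cluster_sizes)
    return 1.0 - sum((size / total) ** 2 for size in cluster_sizes)
print(gini_impurity([90, 5, 5]))        # ~0.185, agglomerated
print(gini_impurity([25, 25, 25, 25]))  # 0.75, dispersed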

The input data are:


See misc/usage.md.

  • Whole genome sequences
    • Ideally at least 500 diploids
    • Phased VCF data (0|1 genotypes)
    • No apparent population structure
    • No apparent close relatedness
    • Tab-separated genetic map (bp ---> cM)
      • Without headers!
      • Columns are chromosome, rsID, cM, bp (example below)
    • Recombining diploid autosomes
      • For haploids, see GitHub issue 5, "Not designed for ploidy != 2"
  • Access to cluster computing
    • Not extended to cloud computing

Chromosome numbers in genetic maps should match chromosome numbers in VCFs.
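For illustration, here are a few hypothetical rows of a genetic map in the expected layout (shown with spaces for readability; the real file must be tab-separated, with no header):

1   rs11111   0.05   50000
1   rs22222   0.17   170000
1   rs33333   0.31   310000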

Repository overview


This repository contains a Python package and some Snakemake bioinformatics pipelines.

  • The package ---> src/
  • The pipelines ---> workflow/

Run each Snakemake pipeline from within its workflow/some-pipeline/ directory.

Activate the environment (mamba activate isweep) for all analyses.

Run the analyses as cluster jobs.

Installation


See misc/installing-mamba.md to get a Python package manager.

  1. Clone the repository:
git clone https://github.com/sdtemple/isweep.git
  2. Create and activate the Python environment:
mamba env create -f isweep-environment.yml
mamba activate isweep
python -c 'import site; print(site.getsitepackages())'
  3. Download other software:
bash get-software.sh
  • Requires wget.
  • Cite these software tools as well if you use them.

Pre-processing


Phase data with Beagle or SHAPEIT beforehand. Subset the data in light of global ancestry and close relatedness. Example scripts are in scripts/pre-processing/.

  • Here is a pipeline we built for these purposes: https://github.com/sdtemple/flare-pipeline
  • You could use IBDkin to detect close relatedness: https://github.com/YingZhou001/IBDkin
  • You could use PCA, ADMIXTURE, or FLARE to determine global ancestry.

Main analysis


More details for each step are in the workflow/some-pipeline/README.md files.

For all workflows


  1. Make pointers to large (phased) VCF files.
  2. Edit YAML files in the different workflow directories.

Detecting recent selection


Run the selection scan (workflow/scan-selection).

nohup snakemake -s Snakefile-scan.smk -c1 --cluster "[options]" --jobs X --configfile *.yaml & 
  • See the file misc/cluster-options.md for support.
  • Recommendation: do a test run with your two smallest chromosomes.
  • Check the *.log files from ibd-ends. If it recommends an estimated err, change the error rate in the YAML file.
  • Then, run with all of your chromosomes.

Customize the IBD rates plot if you want: workflow/scan-selection/scripts/plotting/plot-scan.py.

Outputs:

  • scan.modified.ibd.tsv should have all the data for the scanning statistics and thresholds.
    • 'Z' variables are standardized/normalized.
    • 'RAW' are counts.
    • p-values assume that IBD rates are (asymptotically) normally distributed (see the sketch below).
  • roi.tsv are your significant regions.
  • autocovariance.png is autocovariance by cM distance. The black line is a fitted exponential curve.
  • zhistogram.png is a default histogram of the standardized IBD rates. It should "look Gaussian".
  • scan.png is a default plot for the selection scan.
  • fwer.analytical.txt gives parameters and estimates for the multiple-testing correction in the selection scan.
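As a minimal sketch of the normal approximation behind those p-values (the value of z is hypothetical; this is not the pipeline's internal code):

from scipy.stats import norm
z = 4.2               # a hypothetical standardized IBD rate ('Z' column)
p_value = norm.sf(z)  # one-sided upper-tail P(Z >= z) under normality
print(p_value)        # ~1.3e-05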

Modeling putative sweeps


  1. Estimate recent effective sizes: workflow/scan-selection/scripts/run-ibdne.sh.
  2. Check out the roi.tsv file.
  • Edit with locus names if you want.
  • Edit to change the defaults: additive model and 95% confidence intervals.
  3. Run the region of interest analysis (workflow/model-selection).
nohup snakemake -s Snakefile-roi.smk -c1 --cluster "[options]" --jobs X --configfile *.yaml & 

The script to estimate recent Ne can be replaced with any method that estimates recent Ne, since this step happens before the snakemake command. HapNe is one such option.

Outputs:

  • summary.hap.norm.tsv contains estimated selection coefficients, and other estimates, for the regions of interest.
    • Read Temple, Waples, and Browning (AJHG, 2024) to learn about the estimates.
    • Confidence intervals assume IBD rates are (asymptotically) normally distributed.
    • The frequency estimate is based on the best-differentiated SNP subset.
    • Models are 'a' additive, 'm' multiplicative, 'd' dominance, and 'r' recessive.
  • Other output files give other types of confidence intervals.
    • The 'perc' wildcard means percentile-based confidence intervals (see the sketch below).
    • The 'snp' wildcard means that the frequency estimate is based on the single best-differentiated SNP.
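For intuition, here is a minimal sketch contrasting the two interval types over hypothetical bootstrap replicates of the selection coefficient s (not the package's internal code):

import numpy as np
rng = np.random.default_rng(7)
boot_s = rng.normal(loc=0.03, scale=0.004, size=2000)  # hypothetical bootstrap replicates of s
mean, se = boot_s.mean(), boot_s.std(ddof=1)
normal_ci = (mean - 1.96 * se, mean + 1.96 * se)       # normal-approximation 95% interval
perc_ci = tuple(np.percentile(boot_s, [2.5, 97.5]))    # percentile-based 95% interval ('perc')
print(normal_ci, perc_ci)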

Other considerations


These Markdown files are in the folder misc/.

See telomeres-centromeres.md for cautionary comments on interpreting results near these genomic regions.

See small-chromosomes.md for comments on modified analyses when some chromosomes measure <= 10 cM.

See different-chromosome-rates.md for comments on modified analyses when chromosome subsets have vastly different mean/median IBD rates.

For species-specific conversions very different from 1.0 cM ≈ 1 Mb (humans), see recombination-rates.md.
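As a toy illustration of why the conversion matters, here is the 2.0 cM detection threshold converted to physical distance under an assumed species-specific rate (the rate value is hypothetical):

cm_per_mb = 3.0                  # hypothetical rate; humans average about 1.0 cM/Mb
threshold_cm = 2.0               # the 'scan' detection threshold in cM
print(threshold_cm / cm_per_mb)  # about 0.67 Mb at this rate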

Picture of selection scan workflow


The flow chart below shows the steps ("rules") in the selection scan pipeline.

Diverging paths "mle" versus "scan" refer to different detection thresholds (3.0 and 2.0 cM).

See dag-roi.png for the steps in the sweep modeling pipeline.
