A tool for detecting positive selective sweeps using time-series genomcis data.
Project description
TimeSweeper
Timesweeper is a package for detecting positive selective sweeps from time-series genomic sampling using convolutional neural networks.
The associated manuscript can be found here: https://www.biorxiv.org/content/10.1101/2022.07.06.499052v1 Experiments and figures for the Timesweeper manuscript can be found here: https://github.com/SchriderLab/timesweeper-experiments
Table of Contents
- TimeSweeper
Overview
Timesweeper is built as a series of modules that are chained together to build a workflow for detecting signatures of selective sweeps using simulated demographic models to train a 1D convolutional neural network (1DCNN). Some modules can be swapped out or modified for different use cases (such as simulating from a custom SLiM script), but the general workflow order is assumed to be consistent.
Workflow Overview
- Create a SLiM script
- Either based on the
example_demo_model.slim
example - Or by using stdpopsim to generate a SLiM script
- Either based on the
- Simulate demographic model with time-series sampling
sim_custom
if using custom SLiM scriptsim_stdpopsim
if using a SLiM script output by stdpopsim
- Note: If available, we suggest using a job submission platform such as SLURM to parallelize simulations. This is the most resource and time-intensive part of the module by far.
- Optional: Preprocess VCFs simulated without timesweepers simulation modules by merging with
process_vcfs
- Create features for the neural network with
condense
- Train networks with
train
- Run
detect
on VCF of interest using trained models and input data
Installation
If you have GNU Make installed the Timesweeper conda environment and SLiM can be downloaded and setup using
git clone git@github.com:SchriderLab/timeSeriesSweeps.git
cd timeSeriesSweeps
make
Otherwise you can install dependencies the long way with:
git clone git@github.com:SchriderLab/timeSeriesSweeps.git
conda env create -f blinx.yml
conda activate blinx
pip install .
Timesweeper Configuration
For any given experiment run you will need a YAML configuration file (see example_timesweeper_config.yaml
for template) or the patience to write out all of your configuration parameters as arguments to each step of the workflow. The requirements for configuration are pretty different between using stdpopsim and custom SLiM script simulations, so I'll lay them out separately here. Each config argument will have the easy to read name and then will be followed by the YAML identifier to use in the config file.
Configs required for both types of simulation:
- Working Directory (
work dir
) - this serves as the "home base" of an experiment. This is where all intermediate and final data will be written to. - SLiM Script (
slimfile
) - either generated by stdpopsim (see "Using stdpopsim to generate a SLiM template" below) or from a custom SLiM script (see "Using a custom SLiM script"). - Sample sizes of each timepoint (
sample sizes
) - sampled in your data (i.e. how many individuals are included in timepoint 1, timepoint 2, etc) - Ploidy of your samples (yes, believe it or not we support not just diploids!)
- Simulation Replicates (
reps
) - for each scenario: neutral, selection on de novo mutation, selection on standing variation. We like 2500, but you could probably get away with fewer! - Path to SLiM Executable (
slim path
) - if you use the Makefile will be/<path>/<to>/<timesweeper>/SLiM/build/slim
Additional configs needed for stdpopsim simulation:
- Target Population ID (
pop
) - will be something like "p1" or "p2", will be identified in the stdpopsim model catalog. - Years Sampled (
years sampled
) - typically in years Before Present (BP - which is years prior to 1950 apparently). This is only needed if you are injecting sampling into a stdpopsim model or if you'd like to calculate FIT values along with the AFT predictions with Timesweeper. If you don't have any idea what these values should be you can still run Timesweeper and get AFT predictions no problem, just use a custom SLiM script and remove this line from the config file. - Selection Generation Before Sampling (
selection gen
) - Number of generations before the first sample timepoint. We've found that you can be relatively flexible (see the manuscript), but 50 is a good default value. - Selection Coefficient Bounds (
selection coeff bounds
) - to improve robustness we use a log-normal distribution to draw selection coefficients with lower and upper bounds specified here. If you want a non-stochastic selection coefficient simply use the same number twice. - Mutation Rate (
mut rate
) - just overwrites the stdpopsim mutation rate in case you'd like to fiddle with it. - Generation Time (
gen time
) - allows conversions between generations and continuous time.
Example config file for a stdpopsim simulation run:
#General
work dir: example_experiment
slimfile: example_model.slim
pop: p1
sample sizes: [5, 4, 6, 4, 5]
years sampled: [500, 400, 300, 200, 100]
ploidy: 2
#Simulation
reps: 2500
selection gen: 200 # Gens before first sampling
selection coeff bounds: [0.005, 0.5]
mut rate: 1.29e-8
gen time: 25
Timesweeper Modules Details
Custom Simulation (simulate_custom
)
A flexible wrapper for a SLiM script that assumes you have a demographic model already defined in SLiM and would like to use it for Timesweeper. This module is meant to be modified with utility functions and other constants to feed into simulation replicates using the string variable d_block
. This set of constants will be fed to SLiM at runtime but can be modified in the module as needed prior. The module on GitHub is the version used to generate simulations for the manuscript, and is meant to provide ideas/context for how it's possible to modify the d_block
and write utility functions for it.
There are some basic requirements for using a custom SLiM script:
- Each timepoint must be output using a the outputVCFSample function like so:
<pop_id>.outputVCFSample(<num_individuals>, replace=T, filePath=outFile, append=T);
- The constants
sweep
,outFile
, anddumpFile
must be used in your simulation scriptsweep
: one of "neut"/"sdn"/"soft", equivalent to neutral, selection on de novo mutation, and selection on standing variation respectively. This identifier is used both in the SLiM script to condition on scenarios but also in the output file naming.outFile
: is the VCF file that will be used as output for samples for a given replicate. Will be set as<work_dir>/vcfs/<sweep>/<rep>.multivcf
dumpFile
: similarly to outFile this is where the intermediate simulation state is saved to in case of mutation loss or other problems with a replicate.
$ timesweeper sim_custom -h
usage: timesweeper sim_custom [-h] [--threads THREADS]
[--rep-range REP_RANGE REP_RANGE]
{yaml,cli} ...
Simulates selection for training Timesweeper using a pre-made SLiM script.
positional arguments:
{yaml,cli}
optional arguments:
-h, --help show this help message and exit
--threads THREADS Number of processes to parallelize across. Defaults to all.
--rep-range START STOP
<start, stop>. If used, only range(start, stop) will
be simulated for reps. This is to allow for easy SLURM
parallel simulations.
$ timesweeper sim_custom cli -h
usage: timesweeper sim_custom cli [-h] [-w WORK_DIR] -i SLIM_FILE
[--slim-path SLIM_PATH] [--reps REPS]
optional arguments:
-h, --help show this help message and exit
-w WORK_DIR, --work-dir WORK_DIR
Directory used as work dir for simulate modules.
Should contain simulated vcfs processed using
process_vcf.py.
-i SLIM_FILE, --slim-file SLIM_FILE
SLiM Script to simulate with. Must output to a single
VCF file.
--slim-path SLIM_PATH
Path to SLiM executable.
--reps REPS Number of replicate simulations to run if not using rep-range.
timesweeper sim_custom yaml -h
usage: timesweeper sim_custom yaml [-h] YAML_CONFIG
positional arguments:
YAML_CONFIG YAML config file with all cli options defined.
optional arguments:
-h, --help show this help message and exit
stpopsim Simulation (simulate_stdpopsim
)
For use with SLiM scripts that have been generated using stdpopsim's --slim-script
option to output the model. This allows for out of the box demographic models downloaded straight from the catalog stdpopsim adds to regularly. Some information needs to be gotten from the model definition so that the wrapper knows which population to sample from, how to scale values if rescaling the simulation, and more. These are described in detail both in the help message of the module and in the above doc section "Configs required for both types of simulation".
usage: timesweeper sim_stdpopsim [-h] [-v] [--threads THREADS]
[--rep-range REP_RANGE REP_RANGE]
{yaml,cli} ...
Injects time-series sampling into stdpopsim SLiM output script.
positional arguments:
{yaml,cli}
optional arguments:
-h, --help show this help message and exit
-v, --verbose Print verbose logging during process.
--threads THREADS Number of processes to parallelize across.
--rep-range REP_RANGE REP_RANGE
<start, stop>. If used, only range(start, stop) will
be simulated for reps. This is to allow for easy SLURM
parallel simulations.
timesweeper sim_stdpopsim cli -h
usage: timesweeper sim_stdpopsim cli [-h] -i SLIM_FILE --reps REPS [--pop POP]
--sample_sizes SAMPLE_SIZES
[SAMPLE_SIZES ...] --years-sampled
YEARS_SAMPLED [YEARS_SAMPLED ...]
[--selection-generation SEL_GEN]
[--selection-coeff-bounds SEL_COEFF_BOUNDS SEL_COEFF_BOUNDS]
[--mut-rate MUT_RATE] [--work-dir WORK_DIR]
[--slim-path SLIM_PATH]
optional arguments:
-h, --help show this help message and exit
-i SLIM_FILE, --slim-file SLIM_FILE
SLiM Script output by stdpopsim to add time-series
sampling to.
--reps REPS Number of replicate simulations to run. If using
rep_range can just fill with random int.
--pop POP Label of population to sample from, will be defined in
SLiM Script. Defaults to p2.
--sample_sizes SAMPLE_SIZES [SAMPLE_SIZES ...]
Number of individuals to sample without replacement at
each sampling point. Will be multiplied by ploidy to
sample chroms from slim. Must match the number of
entries in the -y flag.
--years-sampled YEARS_SAMPLED [YEARS_SAMPLED ...]
Years BP (before 1950) that samples are estimated to
be from. Must match the number of entries in the --sample-sizes
flag.
--selection-generation SEL_GEN
Number of gens before first sampling to introduce
selection in population. Defaults to 200.
--selection-coeff-bounds SEL_COEFF_BOUNDS SEL_COEFF_BOUNDS
Bounds of log-uniform distribution for pulling
selection coefficient of mutation being introduced.
Defaults to [0.005, 0.5]
--mut-rate MUT_RATE Mutation rate for mutations not being tracked for
sweep detection. Defaults to 1.29e-8 as defined in
stdpopsim for OoA model.
--work-dir WORK_DIR Directory to start workflow in, subdirs will be
created to write simulation files to. Will be used in
downstream processing as well.
--slim-path SLIM_PATH
Path to SLiM executable.
$ timesweeper sim_stdpopsim yaml -h
usage: timesweeper sim_stdpopsim yaml [-h] YAML CONFIG
positional arguments:
YAML CONFIG YAML config file with all cli options defined.
optional arguments:
-h, --help show this help message and exit
Process VCF Files (process
)
This module splits the multivcf files (which are just multiple concatenated VCF entries) generated by SLiM and then merges them in order from most ancient to most current timepoints. This replicates the preparation process needed for Timesweeper input data but at scale and automatically parameterized for data simulated using Timesweeper's simulation modules.
$ timesweeper process -h
usage: timesweeper process [-h] [--vcf-header VCF_HEADER] [--threads THREADS]
{yaml,cli} ...
Splits and re-merges VCF files to prepare for fast feature creation.
positional arguments:
{yaml,cli}
optional arguments:
-h, --help show this help message and exit
--vcf-header VCF_HEADER
String that tops VCF header, used to split entries to
new files.
--threads THREADS Number of processes to parallelize across.
$ timesweeper process cli -h
usage: timesweeper process cli [-h] [-w WORK_DIR] --sample_sizes SAMPLE_SIZES
[SAMPLE_SIZES ...]
optional arguments:
-h, --help show this help message and exit
-w WORK_DIR, --work-dir WORK_DIR
Directory used as work dir for simulate modules.
Should contain simulated vcfs processed using
process_vcf.py.
--sample_sizes SAMPLE_SIZES [SAMPLE_SIZES ...]
Number of individuals to sample without replacement at
each sampling point. Will be multiplied by ploidy to
sample chroms from slim. Must match the number of
entries in the -y flag.
$ timesweeper process yaml -h
usage: timesweeper process yaml [-h] YAML CONFIG
positional arguments:
YAML CONFIG YAML config file with all cli options defined.
optional arguments:
-h, --help show this help message and exit
Make Training Data (condense
)
VCFs merged using timesweeper process
are read in as allele frequencies using scikit-allel, and depending on the scenario (neut/sdn/soft) the central or locus under selection is pulled out and aggregated for all replicates. This labeled ground-truth data from simulations is then saved as a dictionary in a pickle file for easy access and low disk usage.
This module also allows for adding missingness to the training data in the case of missingness in the real data Timesweeper is going to be used on. To do this add the -m <val>
flag where val
is in [0,1] and is used as the parameter of a binomial draw for each allele per timestep to set as present/missing. We show in the manuscript that some missingness is viable (e.g. val=0.2
), however high missingness (e.g. val=0.5
) will result in terrible performance and should be avoided. Optimally this value should reflect the missingness present in the real data input to Timesweeper so as to parameterize the network to be better prepared for it.
Note: the process of retrieving known-selection sites is based on the mutation type labels contained in VCF INFO fields output by SLiM. It currently assumes the mutation type where selection is being introduced is identified as "m2", but if you use a custom SLiM model and happen to change mutation type this module should be modified to properly scan for that.
$ timesweeper condense -h
usage: timesweeper condense [-h] [--threads THREADS] [-m MISSINGNESS]
{yaml,cli} ...
Creates training data from simulated merged vcfs after timesweeper process has
been run.
positional arguments:
{yaml,cli}
optional arguments:
-h, --help show this help message and exit
--threads THREADS Number of processes to parallelize across.
-m MISSINGNESS, --missingness MISSINGNESS
Missingness rate in range of [0,1], used as the
parameter of a binomial distribution for randomly
removing known values.
$ timesweeper condense cli -h
usage: timesweeper condense cli [-h] [-w WORK_DIR] -s SAMP_SIZES
[SAMP_SIZES ...]
optional arguments:
-h, --help show this help message and exit
-w WORK_DIR, --work-dir WORK_DIR
Directory used as work dir for simulate modules.
Should contain simulated vcfs processed using
process_vcf.py.
-s SAMP_SIZES [SAMP_SIZES ...], --sample-sizes SAMP_SIZES [SAMP_SIZES ...]
Number of individuals from each timepoint sampled.
Used to index VCF data from earliest to latest
sampling points.
$ timesweeper condense yaml -h
usage: timesweeper condense yaml [-h] YAML CONFIG
positional arguments:
YAML CONFIG YAML config file with all cli options defined.
optional arguments:
-h, --help show this help message and exit
Neural Networks (train
)
Timesweeper's neural network architecture is a shallow 1DCNN implemented in Keras2 with a Tensorflow backend that trains extremely fast on CPUs with very little RAM needed. Assuming all previous steps were run it can be trained and evaluated on hold-out test data with a single line invocation.
$ timesweeper train -h
usage: timesweeper train [-h] [-n EXPERIMENT_NAME] {yaml,cli} ...
Handler script for neural network training and prediction for TimeSweeper
Package. Will train two models: one for the series of timepoints generated
using the hfs vectors over a timepoint and one
positional arguments:
{yaml,cli}
optional arguments:
-h, --help show this help message and exit
-n EXPERIMENT_NAME, --experiment-name EXPERIMENT_NAME
Identifier for the experiment used to generate the
data. Optional, but helpful in differentiating runs.
$ timesweeper train cli -h
usage: timesweeper train cli [-h] [-w WORK_DIR]
optional arguments:
-h, --help show this help message and exit
-w WORK_DIR, --work-dir WORK_DIR
Directory used as work dir for simulate modules.
Should contain pickled training data from simulated vcfs processed using
process_vcf.py.
$ timesweeper train yaml -h
usage: timesweeper train yaml [-h] YAML CONFIG
positional arguments:
YAML CONFIG YAML config file with all cli options defined.
optional arguments:
-h, --help show this help message and exit
Detect Sweeps (detect
)
Finally, the main module of the package is for detecting sweeps in a given VCF. This loads in the prepared VCF (see "Preparing Input Data for Timesweeper" below) in chunks, converts allele data to time-series allele velocity data, and predicts using the 1DCNN trained on simulated data. Each prediction represents a 51-SNP window with the focal allele being the actual target.
Timesweeper outputs predictions as both a csv file with headers Chrom BP Class Neut Score Hard Score Soft Score Win_Start Win_End
and a BED file with the headers Chrom Win_Start Win_End BP
. The BED file allows for easy intersections using bedtools and can be cross-referenced back to the CSV for score filtering.
Here are the details on the headers:
- Chrom: Chromosome/contig, identical to VCF file name of it
- BP: location of central allele in the window being predicted on
- Class: Neut/Hard/Soft, class identified by the maximum score in 3-class softmax output from the model
- Neut/Hard/Soft Score: raw score from softmax final layer of 1DCNN
- Win_Start/End: left and right-most locations of the SNPs on each side of the window being predicted on
Note: By default Timesweeper only outputs sites with a minimum sweep (sdn+soft) score of 0.66 to prevent massive amounts of neutral outputs. This value could easily be modified in the module but we find it better to filter after the fact for more flexibility.
Timesweeper will optionally run frequency increment test if the generation time and years sampled are provided. In some cases this may not be available/desired, in which case if either of those values are left out it will not be run.
Timesweeper also has a --benchmark
flag that will allow for testing accuracy on simulated data if wanted. This will search the input data for the mutation type identifier flags allowing a benchmark of detection accuracy on data that has a ground truth.
$ timesweeper detect -h
usage: timesweeper detect [-h] -i INPUT_VCF [--benchmark] --aft-model AFT_MODEL
{yaml,cli} ...
Module for iterating across windows in a time-series vcf file and predicting
whether a sweep is present at each snp-centralized window.
positional arguments:
{yaml,cli}
optional arguments:
-h, --help show this help message and exit
-i INPUT_VCF, --input-vcf INPUT_VCF
Merged VCF to scan for sweeps. Must be merged VCF
where files are merged in order from earliest to
latest sampling time, -0 flag must be used.
--benchmark If testing on simulated data and would like to report
the mutation type stored by SLiM during
outputVCFSample as well as neutral predictions, use
this flag. Otherwise the mutation type will not be
looked for in the VCF entry nor reported with results.
--aft-model AFT_MODEL
Path to Keras2-style saved model to load for aft
prediction.
$ timesweeper detect cli -h
usage: timesweeper detect cli [-h] -s SAMP_SIZES [SAMP_SIZES ...] [-w WORKING_DIR]
[--years-sampled YEARS_SAMPLED [YEARS_SAMPLED ...]]
[--gen-time GEN_TIME]
optional arguments:
-h, --help show this help message and exit
-s SAMP_SIZES [SAMP_SIZES ...], --sample-sizes SAMP_SIZES [SAMP_SIZES ...]
Number of individuals from each timepoint sampled.
Used to index VCF data from earliest to latest
sampling points.
-w WORKING_DIR, --work-dir WORKING_DIR
Working directory for workflow, should be identical to
previous steps.
--years-sampled YEARS_SAMPLED [YEARS_SAMPLED ...]
Years BP (before 1950) that samples are estimated to
be from. Only used for FIT calculations, and is
optional if you don't care about those.
--gen-time GEN_TIME Generation time to multiply years_sampled by.
Similarly to years_sampled, only used for FIT
calculation and is optional.
$ timesweeper detect yaml -h
usage: timesweeper detect yaml [-h] YAML CONFIG
positional arguments:
YAML CONFIG YAML config file with all cli options defined.
optional arguments:
-h, --help show this help message and exit
Preparing Input Data for Timesweeper
Timesweeper needs a specific format of input data to work, namely a VCF file merged to the superset of all samples and polymorphisms that is merged in order of oldest to most recent. That last point is extremely important and easily missed
VCFs of all samples will need to be merged using the bcftools merge -Oz --force-samples <inputs_earliest.vcf ... inputs_latest.vcf> > merged.vcf.gz
options.
Example Usage
Below is an example run of Timesweeper using the example config and simulation script provided in the repo. This will go through the entire process of simulating, training the model, processing input and calling predicted sweeps on a VCF of interest called foo.vcf
.
conda activate blinx
cd timesweeper
#Simulate training data
timesweeper sim_custom yaml example_config.yaml
#Process VCFs
timesweeper process yaml example_config.yaml
#Assume foo.vcf has a missingness of 0.05 and create pickle file
timesweeper condense -m 0.05 yaml example_config.yaml
#Train network
timesweeper train -n example_ts_run yaml example_config.yaml
#Predict on input VCF
timesweeper detect -i foo.vcf --aft-model ts_experiment/trained_models/example_ts_run_Timesweeper_aft
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file Timesweeper-1.1.1.tar.gz
.
File metadata
- Download URL: Timesweeper-1.1.1.tar.gz
- Upload date:
- Size: 41.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.15
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6322a11999251a2a76254750ec430e1ec3945b8d44aea35788cca149d37515e2 |
|
MD5 | 714376bd8486f5372a8341bfbdd25e58 |
|
BLAKE2b-256 | c26630f34118da7b4161f200539e3d216845bb9ae4ba58ae6a674285a4b65553 |