Skip to main content

A tool for detecting positive selective sweeps using time-series genomcis data.

Project description

TimeSweeper

Timesweeper is a package for detecting positive selective sweeps from time-series genomic sampling using convolutional neural networks.

The associated manuscript can be found here: https://www.biorxiv.org/content/10.1101/2022.07.06.499052v1 Experiments and figures for the Timesweeper manuscript can be found here: https://github.com/SchriderLab/timesweeper-experiments

Table of Contents

Overview

Timesweeper is built as a series of modules that are chained together to build a workflow for detecting signatures of selective sweeps using simulated demographic models to train a 1D convolutional neural network (1DCNN). Some modules can be swapped out or modified for different use cases (such as simulating from a custom SLiM script), but the general workflow order is assumed to be consistent.

Workflow Overview

  1. Create a SLiM script
    1. Either based on the example_demo_model.slim example
    2. Or by using stdpopsim to generate a SLiM script
  2. Simulate demographic model with time-series sampling
    1. sim_custom if using custom SLiM script
    2. sim_stdpopsim if using a SLiM script output by stdpopsim
    • Note: If available, we suggest using a job submission platform such as SLURM to parallelize simulations. This is the most resource and time-intensive part of the module by far.
    • Optional: Preprocess VCFs simulated without timesweepers simulation modules by merging with process_vcfs
  3. Create features for the neural network with condense
  4. Train networks with train
  5. Run detect on VCF of interest using trained models and input data

Installation

If you have GNU Make installed the Timesweeper conda environment and SLiM can be downloaded and setup using

git clone git@github.com:SchriderLab/timeSeriesSweeps.git

cd timeSeriesSweeps
make

Otherwise you can install dependencies the long way with:

git clone git@github.com:SchriderLab/timeSeriesSweeps.git

conda env create -f blinx.yml

conda activate blinx

pip install .

Timesweeper Configuration

For any given experiment run you will need a YAML configuration file (see example_timesweeper_config.yaml for template) or the patience to write out all of your configuration parameters as arguments to each step of the workflow. The requirements for configuration are pretty different between using stdpopsim and custom SLiM script simulations, so I'll lay them out separately here. Each config argument will have the easy to read name and then will be followed by the YAML identifier to use in the config file.

Configs required for both types of simulation:

  • Working Directory (work dir) - this serves as the "home base" of an experiment. This is where all intermediate and final data will be written to.
  • SLiM Script (slimfile) - either generated by stdpopsim (see "Using stdpopsim to generate a SLiM template" below) or from a custom SLiM script (see "Using a custom SLiM script").
  • Sample sizes of each timepoint (sample sizes) - sampled in your data (i.e. how many individuals are included in timepoint 1, timepoint 2, etc)
  • Ploidy of your samples (yes, believe it or not we support not just diploids!)
  • Simulation Replicates (reps) - for each scenario: neutral, selection on de novo mutation, selection on standing variation. We like 2500, but you could probably get away with fewer!
  • Path to SLiM Executable (slim path) - if you use the Makefile will be /<path>/<to>/<timesweeper>/SLiM/build/slim

Additional configs needed for stdpopsim simulation:

  • Target Population ID (pop) - will be something like "p1" or "p2", will be identified in the stdpopsim model catalog.
  • Years Sampled (years sampled) - typically in years Before Present (BP - which is years prior to 1950 apparently). This is only needed if you are injecting sampling into a stdpopsim model or if you'd like to calculate FIT values along with the AFT predictions with Timesweeper. If you don't have any idea what these values should be you can still run Timesweeper and get AFT predictions no problem, just use a custom SLiM script and remove this line from the config file.
  • Selection Generation Before Sampling (selection gen) - Number of generations before the first sample timepoint. We've found that you can be relatively flexible (see the manuscript), but 50 is a good default value.
  • Selection Coefficient Bounds (selection coeff bounds) - to improve robustness we use a log-normal distribution to draw selection coefficients with lower and upper bounds specified here. If you want a non-stochastic selection coefficient simply use the same number twice.
  • Mutation Rate (mut rate) - just overwrites the stdpopsim mutation rate in case you'd like to fiddle with it.
  • Generation Time (gen time) - allows conversions between generations and continuous time.

Example config file for a stdpopsim simulation run:

#General
work dir: example_experiment
slimfile: example_model.slim
pop: p1
sample sizes: [5, 4, 6, 4, 5]
years sampled: [500, 400, 300, 200, 100]
ploidy: 2

#Simulation
reps: 2500
selection gen: 200 # Gens before first sampling
selection coeff bounds: [0.005, 0.5]
mut rate: 1.29e-8
gen time: 25

Timesweeper Modules Details

Custom Simulation (simulate_custom)

A flexible wrapper for a SLiM script that assumes you have a demographic model already defined in SLiM and would like to use it for Timesweeper. This module is meant to be modified with utility functions and other constants to feed into simulation replicates using the string variable d_block. This set of constants will be fed to SLiM at runtime but can be modified in the module as needed prior. The module on GitHub is the version used to generate simulations for the manuscript, and is meant to provide ideas/context for how it's possible to modify the d_block and write utility functions for it.

There are some basic requirements for using a custom SLiM script:

  1. Each timepoint must be output using a the outputVCFSample function like so: <pop_id>.outputVCFSample(<num_individuals>, replace=T, filePath=outFile, append=T);
  2. The constants sweep, outFile, and dumpFile must be used in your simulation script
    • sweep: one of "neut"/"sdn"/"soft", equivalent to neutral, selection on de novo mutation, and selection on standing variation respectively. This identifier is used both in the SLiM script to condition on scenarios but also in the output file naming.
    • outFile: is the VCF file that will be used as output for samples for a given replicate. Will be set as <work_dir>/vcfs/<sweep>/<rep>.multivcf
    • dumpFile: similarly to outFile this is where the intermediate simulation state is saved to in case of mutation loss or other problems with a replicate.
$ timesweeper sim_custom -h
usage: timesweeper sim_custom [-h] [--threads THREADS]
                          [--rep-range REP_RANGE REP_RANGE]
                          {yaml,cli} ...

Simulates selection for training Timesweeper using a pre-made SLiM script.

positional arguments:
  {yaml,cli}

optional arguments:
  -h, --help            show this help message and exit
  --threads THREADS     Number of processes to parallelize across. Defaults to all.
  --rep-range START STOP
                        <start, stop>. If used, only range(start, stop) will
                        be simulated for reps. This is to allow for easy SLURM
                        parallel simulations.

$ timesweeper sim_custom cli -h
usage: timesweeper sim_custom cli [-h] [-w WORK_DIR] -i SLIM_FILE
                              [--slim-path SLIM_PATH] [--reps REPS]

optional arguments:
  -h, --help            show this help message and exit
  -w WORK_DIR, --work-dir WORK_DIR
                        Directory used as work dir for simulate modules.
                        Should contain simulated vcfs processed using
                        process_vcf.py.
  -i SLIM_FILE, --slim-file SLIM_FILE
                        SLiM Script to simulate with. Must output to a single
                        VCF file.
  --slim-path SLIM_PATH
                        Path to SLiM executable.
  --reps REPS           Number of replicate simulations to run if not using rep-range.

timesweeper sim_custom yaml -h
usage: timesweeper sim_custom yaml [-h] YAML_CONFIG

positional arguments:
  YAML_CONFIG  YAML config file with all cli options defined.

optional arguments:
  -h, --help   show this help message and exit

stpopsim Simulation (simulate_stdpopsim)

For use with SLiM scripts that have been generated using stdpopsim's --slim-script option to output the model. This allows for out of the box demographic models downloaded straight from the catalog stdpopsim adds to regularly. Some information needs to be gotten from the model definition so that the wrapper knows which population to sample from, how to scale values if rescaling the simulation, and more. These are described in detail both in the help message of the module and in the above doc section "Configs required for both types of simulation".

usage: timesweeper sim_stdpopsim [-h] [-v] [--threads THREADS]
                             [--rep-range REP_RANGE REP_RANGE]
                             {yaml,cli} ...

Injects time-series sampling into stdpopsim SLiM output script.

positional arguments:
  {yaml,cli}

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Print verbose logging during process.
  --threads THREADS     Number of processes to parallelize across.
  --rep-range REP_RANGE REP_RANGE
                        <start, stop>. If used, only range(start, stop) will
                        be simulated for reps. This is to allow for easy SLURM
                        parallel simulations.

timesweeper sim_stdpopsim cli -h
usage: timesweeper sim_stdpopsim cli [-h] -i SLIM_FILE --reps REPS [--pop POP]
                                 --sample_sizes SAMPLE_SIZES
                                 [SAMPLE_SIZES ...] --years-sampled
                                 YEARS_SAMPLED [YEARS_SAMPLED ...]
                                 [--selection-generation SEL_GEN]
                                 [--selection-coeff-bounds SEL_COEFF_BOUNDS SEL_COEFF_BOUNDS]
                                 [--mut-rate MUT_RATE] [--work-dir WORK_DIR]
                                 [--slim-path SLIM_PATH]

optional arguments:
  -h, --help            show this help message and exit
  -i SLIM_FILE, --slim-file SLIM_FILE
                        SLiM Script output by stdpopsim to add time-series
                        sampling to.
  --reps REPS           Number of replicate simulations to run. If using
                        rep_range can just fill with random int.
  --pop POP             Label of population to sample from, will be defined in
                        SLiM Script. Defaults to p2.
  --sample_sizes SAMPLE_SIZES [SAMPLE_SIZES ...]
                        Number of individuals to sample without replacement at
                        each sampling point. Will be multiplied by ploidy to
                        sample chroms from slim. Must match the number of
                        entries in the -y flag.
  --years-sampled YEARS_SAMPLED [YEARS_SAMPLED ...]
                        Years BP (before 1950) that samples are estimated to
                        be from. Must match the number of entries in the --sample-sizes
                        flag.
  --selection-generation SEL_GEN
                        Number of gens before first sampling to introduce
                        selection in population. Defaults to 200.
  --selection-coeff-bounds SEL_COEFF_BOUNDS SEL_COEFF_BOUNDS
                        Bounds of log-uniform distribution for pulling
                        selection coefficient of mutation being introduced.
                        Defaults to [0.005, 0.5]
  --mut-rate MUT_RATE   Mutation rate for mutations not being tracked for
                        sweep detection. Defaults to 1.29e-8 as defined in
                        stdpopsim for OoA model.
  --work-dir WORK_DIR   Directory to start workflow in, subdirs will be
                        created to write simulation files to. Will be used in
                        downstream processing as well.
  --slim-path SLIM_PATH
                        Path to SLiM executable.

$ timesweeper sim_stdpopsim yaml -h
usage: timesweeper sim_stdpopsim yaml [-h] YAML CONFIG

positional arguments:
  YAML CONFIG  YAML config file with all cli options defined.

optional arguments:
  -h, --help   show this help message and exit

Process VCF Files (process)

This module splits the multivcf files (which are just multiple concatenated VCF entries) generated by SLiM and then merges them in order from most ancient to most current timepoints. This replicates the preparation process needed for Timesweeper input data but at scale and automatically parameterized for data simulated using Timesweeper's simulation modules.

$ timesweeper process -h
usage: timesweeper process [-h] [--vcf-header VCF_HEADER] [--threads THREADS]
                       {yaml,cli} ...

Splits and re-merges VCF files to prepare for fast feature creation.

positional arguments:
  {yaml,cli}

optional arguments:
  -h, --help            show this help message and exit
  --vcf-header VCF_HEADER
                        String that tops VCF header, used to split entries to
                        new files.
  --threads THREADS     Number of processes to parallelize across.

$ timesweeper process cli -h
usage: timesweeper process cli [-h] [-w WORK_DIR] --sample_sizes SAMPLE_SIZES
                           [SAMPLE_SIZES ...]

optional arguments:
  -h, --help            show this help message and exit
  -w WORK_DIR, --work-dir WORK_DIR
                        Directory used as work dir for simulate modules.
                        Should contain simulated vcfs processed using
                        process_vcf.py.
  --sample_sizes SAMPLE_SIZES [SAMPLE_SIZES ...]
                        Number of individuals to sample without replacement at
                        each sampling point. Will be multiplied by ploidy to
                        sample chroms from slim. Must match the number of
                        entries in the -y flag.

$ timesweeper process yaml -h
usage: timesweeper process yaml [-h] YAML CONFIG

positional arguments:
  YAML CONFIG  YAML config file with all cli options defined.

optional arguments:
  -h, --help   show this help message and exit

Make Training Data (condense)

VCFs merged using timesweeper process are read in as allele frequencies using scikit-allel, and depending on the scenario (neut/sdn/soft) the central or locus under selection is pulled out and aggregated for all replicates. This labeled ground-truth data from simulations is then saved as a dictionary in a pickle file for easy access and low disk usage.

This module also allows for adding missingness to the training data in the case of missingness in the real data Timesweeper is going to be used on. To do this add the -m <val> flag where val is in [0,1] and is used as the parameter of a binomial draw for each allele per timestep to set as present/missing. We show in the manuscript that some missingness is viable (e.g. val=0.2), however high missingness (e.g. val=0.5) will result in terrible performance and should be avoided. Optimally this value should reflect the missingness present in the real data input to Timesweeper so as to parameterize the network to be better prepared for it.

Note: the process of retrieving known-selection sites is based on the mutation type labels contained in VCF INFO fields output by SLiM. It currently assumes the mutation type where selection is being introduced is identified as "m2", but if you use a custom SLiM model and happen to change mutation type this module should be modified to properly scan for that.

$ timesweeper condense -h
usage: timesweeper condense [-h] [--threads THREADS] [-m MISSINGNESS]
                                 {yaml,cli} ...

Creates training data from simulated merged vcfs after timesweeper process has
been run.

positional arguments:
  {yaml,cli}

optional arguments:
  -h, --help            show this help message and exit
  --threads THREADS     Number of processes to parallelize across.
  -m MISSINGNESS, --missingness MISSINGNESS
                        Missingness rate in range of [0,1], used as the
                        parameter of a binomial distribution for randomly
                        removing known values.

$ timesweeper condense cli -h
usage: timesweeper condense cli [-h] [-w WORK_DIR] -s SAMP_SIZES
                                     [SAMP_SIZES ...]

optional arguments:
  -h, --help            show this help message and exit
  -w WORK_DIR, --work-dir WORK_DIR
                        Directory used as work dir for simulate modules.
                        Should contain simulated vcfs processed using
                        process_vcf.py.
  -s SAMP_SIZES [SAMP_SIZES ...], --sample-sizes SAMP_SIZES [SAMP_SIZES ...]
                        Number of individuals from each timepoint sampled.
                        Used to index VCF data from earliest to latest
                        sampling points.

$ timesweeper condense yaml -h
usage: timesweeper condense yaml [-h] YAML CONFIG

positional arguments:
  YAML CONFIG  YAML config file with all cli options defined.

optional arguments:
  -h, --help   show this help message and exit

Neural Networks (train)

Timesweeper's neural network architecture is a shallow 1DCNN implemented in Keras2 with a Tensorflow backend that trains extremely fast on CPUs with very little RAM needed. Assuming all previous steps were run it can be trained and evaluated on hold-out test data with a single line invocation.

$ timesweeper train -h
usage: timesweeper train [-h] [-n EXPERIMENT_NAME] {yaml,cli} ...

Handler script for neural network training and prediction for TimeSweeper
Package. Will train two models: one for the series of timepoints generated
using the hfs vectors over a timepoint and one

positional arguments:
  {yaml,cli}

optional arguments:
  -h, --help            show this help message and exit
  -n EXPERIMENT_NAME, --experiment-name EXPERIMENT_NAME
                        Identifier for the experiment used to generate the
                        data. Optional, but helpful in differentiating runs.

$ timesweeper train cli -h
usage: timesweeper train cli [-h] [-w WORK_DIR]

optional arguments:
  -h, --help            show this help message and exit
  -w WORK_DIR, --work-dir WORK_DIR
                        Directory used as work dir for simulate modules.
                        Should contain pickled training data from simulated vcfs processed using
                        process_vcf.py.

$ timesweeper train yaml -h
usage: timesweeper train yaml [-h] YAML CONFIG

positional arguments:
  YAML CONFIG  YAML config file with all cli options defined.

optional arguments:
  -h, --help   show this help message and exit

Detect Sweeps (detect)

Finally, the main module of the package is for detecting sweeps in a given VCF. This loads in the prepared VCF (see "Preparing Input Data for Timesweeper" below) in chunks, converts allele data to time-series allele velocity data, and predicts using the 1DCNN trained on simulated data. Each prediction represents a 51-SNP window with the focal allele being the actual target.

Timesweeper outputs predictions as both a csv file with headers Chrom BP Class Neut Score Hard Score Soft Score Win_Start Win_End and a BED file with the headers Chrom Win_Start Win_End BP. The BED file allows for easy intersections using bedtools and can be cross-referenced back to the CSV for score filtering.

Here are the details on the headers:

  • Chrom: Chromosome/contig, identical to VCF file name of it
  • BP: location of central allele in the window being predicted on
  • Class: Neut/Hard/Soft, class identified by the maximum score in 3-class softmax output from the model
  • Neut/Hard/Soft Score: raw score from softmax final layer of 1DCNN
  • Win_Start/End: left and right-most locations of the SNPs on each side of the window being predicted on

Note: By default Timesweeper only outputs sites with a minimum sweep (sdn+soft) score of 0.66 to prevent massive amounts of neutral outputs. This value could easily be modified in the module but we find it better to filter after the fact for more flexibility.

Timesweeper will optionally run frequency increment test if the generation time and years sampled are provided. In some cases this may not be available/desired, in which case if either of those values are left out it will not be run.

Timesweeper also has a --benchmark flag that will allow for testing accuracy on simulated data if wanted. This will search the input data for the mutation type identifier flags allowing a benchmark of detection accuracy on data that has a ground truth.

$ timesweeper detect -h
usage: timesweeper detect [-h] -i INPUT_VCF [--benchmark] --aft-model AFT_MODEL
                      {yaml,cli} ...

Module for iterating across windows in a time-series vcf file and predicting
whether a sweep is present at each snp-centralized window.

positional arguments:
  {yaml,cli}

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_VCF, --input-vcf INPUT_VCF
                        Merged VCF to scan for sweeps. Must be merged VCF
                        where files are merged in order from earliest to
                        latest sampling time, -0 flag must be used.
  --benchmark           If testing on simulated data and would like to report
                        the mutation type stored by SLiM during
                        outputVCFSample as well as neutral predictions, use
                        this flag. Otherwise the mutation type will not be
                        looked for in the VCF entry nor reported with results.
  --aft-model AFT_MODEL
                        Path to Keras2-style saved model to load for aft
                        prediction.

$ timesweeper detect cli -h
usage: timesweeper detect cli [-h] -s SAMP_SIZES [SAMP_SIZES ...] [-w WORKING_DIR]
                          [--years-sampled YEARS_SAMPLED [YEARS_SAMPLED ...]]
                          [--gen-time GEN_TIME]

optional arguments:
  -h, --help            show this help message and exit
  -s SAMP_SIZES [SAMP_SIZES ...], --sample-sizes SAMP_SIZES [SAMP_SIZES ...]
                        Number of individuals from each timepoint sampled.
                        Used to index VCF data from earliest to latest
                        sampling points.
  -w WORKING_DIR, --work-dir WORKING_DIR
                        Working directory for workflow, should be identical to
                        previous steps.
  --years-sampled YEARS_SAMPLED [YEARS_SAMPLED ...]
                        Years BP (before 1950) that samples are estimated to
                        be from. Only used for FIT calculations, and is
                        optional if you don't care about those.
  --gen-time GEN_TIME   Generation time to multiply years_sampled by.
                        Similarly to years_sampled, only used for FIT
                        calculation and is optional.

$ timesweeper detect yaml -h
usage: timesweeper detect yaml [-h] YAML CONFIG

positional arguments:
  YAML CONFIG  YAML config file with all cli options defined.

optional arguments:
  -h, --help   show this help message and exit

Preparing Input Data for Timesweeper

Timesweeper needs a specific format of input data to work, namely a VCF file merged to the superset of all samples and polymorphisms that is merged in order of oldest to most recent. That last point is extremely important and easily missed

VCFs of all samples will need to be merged using the bcftools merge -Oz --force-samples <inputs_earliest.vcf ... inputs_latest.vcf> > merged.vcf.gz options.


Example Usage

Below is an example run of Timesweeper using the example config and simulation script provided in the repo. This will go through the entire process of simulating, training the model, processing input and calling predicted sweeps on a VCF of interest called foo.vcf.

conda activate blinx
cd timesweeper

#Simulate training data
timesweeper sim_custom yaml example_config.yaml

#Process VCFs 
timesweeper process yaml example_config.yaml

#Assume foo.vcf has a missingness of 0.05 and create pickle file
timesweeper condense -m 0.05 yaml example_config.yaml

#Train network
timesweeper train -n example_ts_run yaml example_config.yaml

#Predict on input VCF
timesweeper detect -i foo.vcf --aft-model ts_experiment/trained_models/example_ts_run_Timesweeper_aft

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Timesweeper-1.1.1.tar.gz (41.1 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page