Skip to main content
Join the official 2019 Python Developers SurveyStart the survey!

A package for the simulation of RADseq data.

Project description

RADinitio

RADinitio is a pipeline for the assessment of RADseq experiments via prospective and retrospective data simulation. Sequencing data is generated de novo for multiple individuals via a coalescent simulation under a user-defined demographic model using msprime. The genetic variants in each sample are simulated in a genomic context that can impact downstream generation of RAD loci and sequencing reads. The per-individual sequences then treated to an in silico RADseq library preparation. The components of the library are defined by the user, allowing for the exploration of parameters including restriction enzyme selection, insert size, sequencing coverage, and PCR duplicate distribution. RADinitio simulations can also be applied retrospectively by comparing and modelling sources of error in empirical datasets. The purpose of RADinitio is for researchers to fully explore possible variables in their data generation process to ensure that their protocol selection and library preparation is performed optimally, within the limitations of technical and experimental error.

Installation

RADinitio can be installed from the Python Package Index (PyPI) using:

python3 -m pip install radinitio

More information regarding this installation can be obtained at the RADinitio PyPI Project page.

Users can install RADinitio in their local Python directories using the --user flag:

python3 -m pip install radinitio --user

Alternatively, the software can be downloaded from the Catchen Lab website and installed using the following commands:

python3 -m pip install radinitio-version.tar.gz

Or for a local installation:

python3 -m pip install radinitio-version.tar.gz --user

For more information regarding the pip installation, please visit the pip user-guide.

Running the pip installation, from both PyPI or the the direct installation will take care of all dependencies, including msprime (Kelleher, et al. 2016), scipy, numpy.

Pipeline structure

RADinitio is designed as a series of independent functions that simulate different aspects of the RADseq library preparation process. The pipeline can be broken down into three main steps:

Variant generation and processing

Variants are generated with msprime from an user-defined demographic. Independent simulations are run for different chromosomes present in an user-provided reference genome. The simulated variants are then projected against the reference genome to obtain the reference alleles, which are then converted into alternative alleles using an user-defined model.

Extraction of RAD alleles

The reference genome is in silico digested to obtain a series of refence RAD loci. This can be done using a single or double enzyme digestion. The positions of the refernce loci in the genome then used to filter the simulated variants for all samples to include only variants present in RAD loci, improving downstream performance. For each sample, we extract the RAD alleles-the reference RAD loci are modified to include the corresponding genetic variants for each sample. This process can alter a cutsite's sequence, resulting in a dropped allele for that sample.

Simulate library enrichment and sequencing

Once extracted, the alleles for each sample are randomly sampled to obtain paired-end sequences for each allele template. The alleles are sampled with replacement, proportional to the desired sequencing coverage of the library. Each iteration of the sampling is treated as an independent sequence template that is truncated to a random lenth sampled from a simulated insert size distribution. A PCR duplicate distribution, generated from an user defined model, can be applied to the sampling process, resulting in the duplicate sampling of unique template sequences. This process can also introduce random error to the sequence, simulating the generation of PCR errors. Finally, paired end sequence are generated from the each of each template, each with its own unique sequencing error.

The corresponding functions for each stage can be run independently. We do provide a wrapper script, radinitio, that calls the top-level radinitio.simulate() function which runs the pipeline from start to finish. Advanced users can run the pipeline through the Python API, which allows for the generation of more complex demographic models, define finer details of the library preparation process, and running componments of the pipeline independently.

Command line interface

The simplest way to run RADinitio is to execute the module via the radinitio command line wrapper, which then calls the top-level radinitio.simulate() function, which runs all the stages of the pipeline.

Usage & options

The program options are the following:

radinitio --genome path --chromosomes path --out-dir dir [pipeline-stage] [(demographic model options)] [(library options)]

Pipeline stages (these options are mutually exclusive):

--simulate-all : Run all the RADinitio stages (simulate a population, make a library, and sequence) (Default)

--make-population : Simulate and process variants. Produces genome-wide VCF.

--make-library-seq : Simulate and sequence a RAD library. Requires exising make-population run.

--tally-rad-loci : Calculate the number of kept RAD loci in the genome.

Input/Output files

-g, --genome : Path to reference genome (fasta file, may be gzipped). Required

-l, --chromosomes : File containing the list of chromosomes (one per line) to simulate. Required

-o, --out-dir : Path to an output directory where all files will be written. Required

Demographic model (simple island model)

-p, --n-pops: (int) Number of populations (demes) in the island model. (default = 2)

-n, --pop-eff-size : (float) Effective population size of each simulated deme. (default = 5000)

-s, --n-seq-indv : (int) Number of individuals sampled from each population. (default = 10)

Library preparation/sequencing

-b, --library-type : Library type (sdRAD or ddRAD). (default = 'sdRAD')

-e, --enz : Restriction enzyme (SbfI, PstI, EcoRI, BamHI, etc.). (default = 'SbfI')

-d, --enz2 : Second restriction enzyme for double digest (MspI, MseI, AluI, etc.). (default = 'MspI')

-m, --insert-mean : (int) Insert size mean in bp. (default = 350)

-t, --insert-stdev : (int) Insert size standard deviation in bp. (default = 37)

-c, --pcr-cycles : (int) Number of PCR cycles. (default = 0)

-v, --coverage : (int) Sequencing coverage. (default = 20)

r, --read-length : (int) Sequence read length. (default = 150)

make-library-seq()-specific options:

--make-pop-sim-dir : Directory containing a previous radinitio.make_population run. Cannot be the same as out-dir.

Additional options

-V, --version : Print program version.

-h, --help : Display this help message.

Examples

# Simulating a sdRAD library (whole pipeline):
radinitio --simulate-all \
    --genome ./genome/reference.fa.gz \
    --chromosomes ./genome/chrom.list \
    --out-dir ./simulations_sdRAD/ \
    --n-pops 4 --pop-eff-size 2500 --n-seq-indv 10 \
    --library-type sdRAD --enz SbfI --insert-mean 350 --insert-stdev 35 \
    --pcr-cycles 9 --coverage 20 --read-length 150

# Simulating a ddRAD library (whole pipeline):
radinitio --simulate-all \
    --genome ./genome/reference.fa.gz \
    --chromosomes ./genome/chrom.list \
    --out-dir ./simulations_ddRAD/ \
    --n-pops 4 --pop-eff-size 2500 --n-seq-indv 10 \
    --library-type ddRAD --enz PstI --enz2 MspI \
    --insert-mean 350 --insert-stdev 35 \
    --pcr-cycles 9 --coverage 20 --read-length 150

# Make a tally of sdRAD loci
radinitio --tally-rad-loci \
    --genome ./genome/reference.fa.gz \
    --chromosomes ./genome/chrom.list \
    --out-dir ./count_rad_loci/ \
    --library-type sdRAD --enz SbfI

# Make a tally of ddRAD loci
radinitio --tally-rad-loci \
    --genome ./genome/reference.fa.gz \
    --chromosomes ./genome/chrom.list \
    --out-dir ./count_ddrad_loci/ \
    --library-type ddRAD --enz NlaIII --enz2 MluCI \
    --insert-mean 320 --insert-stdev 25

# Simulate a population only
radinitio --make-population \
    --genome ./genome/reference.fa.gz \
    --chromosomes ./genome/chrom.list \
    --out-dir ./simulated_population/ \
    --n-pops 4 --pop-eff-size 2500 --n-seq-indv 10

# Simulate library and sequencing
# Use population from previous simulation
radinitio --make-library-seq \
    --genome ./genome/reference.fa.gz \
    --chromosomes ./genome/chrom.list \
    --out-dir ./SbfI_library/ \
    --make-pop-sim-dir ./simulated_population/ \
    --library-type sdRAD --enz SbfI \
    --insert-mean 350 --insert-stdev 35 \
    --pcr-cycles 9 --coverage 20 --read-length 150

Explanation of options

Different pipeline stages of RADinitio can be run independently:

  • --simulate-all runs all the RADinitio stages. Start with a reference genome and specify chromosomes, simulate a population, prepare a library and sequence. This is the default in the wrapper.

  • --make-population just generates a genome-wide VCF with the variants for a simulated population. A single simulated population can be later sequencing using different library parameters.

  • --make-library-seq simulates library preparatation and sequencing, resulting in paired-end reads for the simulated inidivuals. It requires a population of individuals, likely the output of a previous make-population run.

  • --tally-rad-loci for a given library configuration - library type and enzyme(s) - calculate the number of RAD loci in the genome.

reference.fa.gz is a genome fasta file. RADinitio can simulate using both compressed and uncompressed fasta files.

The --out-dir is the output directory for the simulations. Inside it series of subdirectories and files will be generated (described bellow).

chrom.list is contains a list with all the chromosome ids to simulate. Only the chromosomes on the list will be used as input for the simulations. This is important when working with highly fragmented assemblies or those with many small unplaced scaffolds. The structure of chrom.list is the following:

chromosome_1
chromosome_2
chromosome_3
...

--n-pops, --n-seq-indv, and --pop-eff-size contains the parameters of a simple msprime island demographic model. In this example, we are simulating 4 populations with an effective population size of 2,500 individuals each, from which we sample 10 individuals each. More complex demographic parameters can be generated using additional RADinitio functions. For more information regarding the demographic model parameters, please check the msprime documentation.

For the sdRAD example above (--library-type sdRAD), --enz is the main restriction enzyme used for generating a single-digest RADseq library, as described by the protocols by Baird, et al. 2008 and Etter, et al. 2011. In this example, the restriction enzyme SbfI is used, but other enzymes such as EcoRI, PstI, BamHI, among others, are available. The simulated library will have a mean insert size (--insert-mean) of 350bp, with a standard deviation (--insert-stdev) of +-35bp, and 2x150bp paired end reads (--read-length).

The ddRAD example (--library-type ddRAD) uses a double restriction enzyme combination, as described in the protocol by Peterson, et al. 2012, where --enz is the rare (main) cutter, PstI in this case, and --enz2 is the common (or double) cutter enzyme (MspI). The simulated library will have a mean insert size (--insert-mean) of 350bp, with a standard deviation (--insert-stdev) of 35bp. This size distribution will create a range of insert sizes between 280bp and 420bp (insert mean +- 2 x insert st.deviation). The simulation will produce 2x150bp paired end reads (--read-length).

--pcr-cycles defines a RAD library enriched using 9 cycles of PCR. The library has a 2:1 template molecules to sequenced reads ratio. More complex PCR parameters can be generated using additional RADinitio functions.

--coverage defines the per-locus sequencing depth of the library, in this case 20X.

--make-pop-sim-dir is the directory containing a previous radinitio.make_population simulation. This option is only used in radinitio.make_library_seq and defines the population sampled to simulate library preparation and sequencing. It cannot be the same as the output directory for the run.

RADinitio manual

A complete version of the RADinitio manual including instructions for installation, documentation of command line variables, description of output files, and a tutorial to run the pipeline can be found at the RADinitio website.

Authors

Angel G. Rivera-Colon angelgr2@illinois.edu

Nicolas Rochette rochette@illinois.edu

Julian Catchen jcatchen@illinois.edu

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for radinitio, version 1.1.1
Filename, size File type Python version Upload date Hashes
Filename, size radinitio-1.1.1-py3-none-any.whl (46.1 kB) File type Wheel Python version py3 Upload date Hashes View hashes
Filename, size radinitio-1.1.1.tar.gz (35.0 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page