pywgsim
pywgsim is a python wrapper around the wgsim short read simulator.
Usage
pywgsim -h
Installation
pip install pywgsim
Changes
The original code for wgsim has been expanded a little bit. The main changes are:
- The information on the mutations introduced by wgsim is now generated in GFF format.
- There is a new flag called --fixed that generates the same number of reads, N, for each chromosome.
- The separator character in the read name has been changed from _ to |. This follows a more widely accepted standard (i.e. NCBI) and allows identifying the contig name from the read name.
In the default operation of wgsim, the N reads are distributed so as to create uniform coverage across all chromosomes (longer chromosomes get a larger fraction of N).
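The difference between the default mode and --fixed can be illustrated with a small sketch. This is not the code used by wgsim or pywgsim; the function name `allocate_reads` and the round-to-nearest rule are assumptions made only for illustration.

```python
def allocate_reads(lengths, n, fixed=False):
    """Return a dict mapping chromosome name to a number of read pairs.

    lengths: dict of chromosome name -> length in bases
    n:       the N parameter (number of read pairs)
    fixed:   if True, mimic --fixed (every chromosome gets n pairs)
    """
    if fixed:
        return {name: n for name in lengths}
    total = sum(lengths.values())
    # Share proportional to length, rounded to nearest (assumed rounding rule).
    return {name: int(n * length / total + 0.5)
            for name, length in lengths.items()}


# Hypothetical chromosome lengths, chosen only for the example.
lengths = {"chr1": 3000, "chr2": 1000}
print(allocate_reads(lengths, 1000))              # proportional to length
print(allocate_reads(lengths, 1000, fixed=True))  # N pairs per chromosome
```

With proportional allocation the longer chromosome receives three times as many read pairs, so coverage stays roughly even; with the fixed mode the shorter chromosome ends up with higher coverage.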
Mutation output
The output generated by pywgsim looks like this:
##gff-version 3
#
# N=1000 err_rate=0.02 mut_rate=0.001 indel_frac=0.15000001 indel_ext=0.25 size=500 std=50 len1=100 len2=100 seed=1606965870
#
NC_001416.1 wgsim snp 1047 1047 . + . Name=A/C;Ref=A;Alt=C;Type=hom
NC_001416.1 wgsim snp 1308 1308 . + . Name=C/Y;Ref=C;Alt=Y;Type=het
NC_001416.1 wgsim snp 1533 1533 . + . Name=G/T;Ref=G;Alt=T;Type=hom
NC_001416.1 wgsim snp 2472 2472 . + . Name=C/M;Ref=C;Alt=M;Type=het
NC_001416.1 wgsim snp 2964 2964 . + . Name=A/M;Ref=A;Alt=M;Type=het
NC_001416.1 wgsim snp 5375 5375 . + . Name=G/R;Ref=G;Alt=R;Type=het
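A minimal sketch of reading these records back in Python, assuming the nine-column layout shown above. The function name `parse_mutations` and the dictionary keys `seqid`, `feature`, `start` and `end` are illustrative choices, not part of pywgsim.

```python
def parse_mutations(lines):
    """Yield one dict per mutation record, skipping comments and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Nine columns; splitting on any whitespace accepts both
        # tab-separated and space-separated copies of the output.
        fields = line.split(None, 8)
        if len(fields) != 9:
            continue
        seqid, _source, feature, start, end, _score, _strand, _phase, attrs = fields
        record = {"seqid": seqid, "feature": feature,
                  "start": int(start), "end": int(end)}
        # Attributes look like: Name=A/C;Ref=A;Alt=C;Type=hom
        for pair in attrs.split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                record[key] = value
        yield record


sample = [
    "##gff-version 3",
    "NC_001416.1 wgsim snp 1047 1047 . + . Name=A/C;Ref=A;Alt=C;Type=hom",
    "NC_001416.1 wgsim snp 1308 1308 . + . Name=C/Y;Ref=C;Alt=Y;Type=het",
]
for rec in parse_mutations(sample):
    print(rec["seqid"], rec["start"], rec["Ref"], rec["Alt"], rec["Type"])
```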
New read names
The read names are now of the form:
@NC_002945.4|1768156|1768694|0:0:0|4:0:0|4
Where:
- NC_002945.4 is the contig name that the fragment was generated from.
- 1768156 is the left-most position of the fragment.
- 1768694 is the right-most position of the fragment.
- 0:0:0 are the numbers of errors, substitutions and indels in the left-most read of the pair.
- 4:0:0 are the numbers of errors, substitutions and indels in the right-most read of the pair.
- 4 is the read pair number, unique per contig.
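The fields described above can be recovered by splitting the name on the | separator. The sketch below is illustrative only: `ReadName` and `parse_read_name` are hypothetical names, and the pair number is kept as a string because its encoding is not specified here.

```python
from typing import NamedTuple, Tuple


class ReadName(NamedTuple):
    contig: str
    left: int                            # left-most position of the fragment
    right: int                           # right-most position of the fragment
    counts_left: Tuple[int, int, int]    # errors, substitutions, indels (left read)
    counts_right: Tuple[int, int, int]   # errors, substitutions, indels (right read)
    pair_number: str                     # kept as text (encoding assumed unknown)


def parse_read_name(name):
    """Split a read name of the form contig|left|right|e:s:i|e:s:i|pair."""
    name = name.strip()
    if name.startswith("@"):
        name = name[1:]
    # Split from the right so the last five fields are always found,
    # whatever characters the contig name itself contains.
    contig, left, right, c1, c2, pair = name.rsplit("|", 5)
    counts_left = tuple(int(x) for x in c1.split(":"))
    counts_right = tuple(int(x) for x in c2.split(":"))
    return ReadName(contig, int(left), int(right), counts_left, counts_right, pair)


read = parse_read_name("@NC_002945.4|1768156|1768694|0:0:0|4:0:0|4")
print(read.contig, read.right - read.left, read.counts_right)
```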
Help
$ pywgsim -h
prints:
usage: pywgsim [-h] [-a 1.fq] [-b 2.fq] [-N 1000] [-f] [-e 0.02] [-r 0.001]
[-R 0.15] [-X 0.25] [-D 500] [-s 50] [-S 0]
genome
positional arguments:
genome the FASTA reference sequence
optional arguments:
-h, --help show this help message and exit
-a 1.fq, --r1 1.fq name for first in pair
-b 2.fq, --r2 2.fq name for second in pair
-N 1000, --num 1000 number of read pairs
-f, --fixed each chromosome gets N sequences
-e 0.02, --err 0.02 the base error rate
-r 0.001, --mut 0.001
rate of mutations
-R 0.15, --frac 0.15 fraction of indels
-X 0.25, --ext 0.25 probability an indel is extended
-D 500, --dist 500 outer distance between the two ends
-s 50, --stdev 50 standard deviation
-S 0, --seed 0 seed for the random generator
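When driving the command line tool from a script, the argument list can be assembled from the flags listed in the help text. The helper below is a hypothetical sketch (`build_command` is not part of pywgsim); it only builds the argument list and does not run anything.

```python
import shlex


def build_command(genome, r1=None, r2=None, num=None, fixed=False, seed=None):
    """Assemble a pywgsim argument list using flags from the help text."""
    cmd = ["pywgsim"]
    if r1 is not None:
        cmd += ["-a", str(r1)]      # name for first in pair
    if r2 is not None:
        cmd += ["-b", str(r2)]      # name for second in pair
    if num is not None:
        cmd += ["-N", str(num)]     # number of read pairs
    if fixed:
        cmd.append("-f")            # each chromosome gets N sequences
    if seed is not None:
        cmd += ["-S", str(seed)]    # seed for the random generator
    cmd.append(str(genome))         # positional FASTA reference
    return cmd


cmd = build_command("genome.fa", r1="r1.fq", r2="r2.fq",
                    num=5000, fixed=True, seed=42)
print(shlex.join(cmd))
# To execute it, one option is: subprocess.run(cmd, check=True)
```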
API
wgsim can also be run from Python with a single function call:
from pywgsim import wgsim
wgsim.core(r1="r1.fq", r2="r2.fq", ref="genome.fa", err_rate=0.02, mut_rate=0.001, indel_frac=0.15, indel_ext=0.25, max_n=0.05, is_hap=0, N=100000, dist=500, stdev=50, size_l=100, size_r=100, is_fixed=0, seed=0)