pywgsim
pywgsim is a modified version of the wgsim short read simulator. The code for wgsim has been modified to allow visualizing the simulated mutations as a GFF file.
The package provides both a Python wrapper and standalone compiled executables for Linux and macOS.
Installation
Using pip
pip install pywgsim
PyPI page: https://pypi.org/project/pywgsim/
Using conda
conda install -c conda-forge -c bioconda pywgsim
Usage
$ pywgsim -h
prints:
usage: pywgsim [-h] [-e 0.02] [-D 500] [-s 50] [-N 1000] [-1 70] [-2 70] [-r 0.001] [-R 0.15]
[-X 0.25] [-S 0] [-A 0.05] [-f] [-v] [-g None]
genome [read1] [read2]
Short read simulator for paired end reads based on wgsim.
positional arguments:
genome FASTA reference sequence
read1 FASTQ file for first in pair (read1.fq)
read2 FASTQ file for second in pair (read2.fq)
options:
-h, --help show this help message and exit
-e 0.02, --err 0.02 the base error rate
-D 500, --dist 500 outer distance between the two ends
-s 50, --stdev 50 standard deviation
-N 1000, --num 1000 number of read pairs
-1 70, --L1 70 length of the first read
-2 70, --L2 70 length of the second read
-r 0.001, --mut 0.001
rate of mutations
-R 0.15, --frac 0.15 fraction of indels
-X 0.25, --ext 0.25 probability an indel is extended
-S 0, --seed 0 seed for the random generator
-A 0.05, --amb 0.05 disregard if the fraction of ambiguous bases higher than FLOAT
-f, --fixed each chromosome gets N sequences
-v, --version print version number
-g None, --gff None GFF output file (default: stdout)
Changes compared to wgsim
The original code for wgsim has been modified as follows:
- The output describing the mutations introduced by wgsim is generated in GFF format.
- The separator character in the read name has been changed from _ to |.
- There is a new flag called --fixed that generates the same number of reads, N, for each chromosome.
Read naming
The read naming now follows a more widely accepted convention (e.g. NCBI) and allows for contig names that contain underscores. In addition, visual inspection of the read names is easier:
@NC_002945.4|1768156|1768694|0:0:0|4:0:0|4
Fixed mode
In the default operation of wgsim the N reads are distributed so as to create uniform coverage across all chromosomes (longer chromosomes get a larger fraction of N).
When the --fixed mode is enabled, N reads will be generated for each chromosome. The --fixed mode was introduced to simplify the evaluation of classifiers. Since the same number of reads is generated from each input sequence, it becomes simpler to assess the quality of classifications (i.e. how many out of N were classified correctly).
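The difference between the two modes can be illustrated with a minimal sketch. It is not the wgsim implementation: the function name, the chromosome lengths, and the rounding rule are assumptions made for illustration, and the real tool may round differently.

```python
def reads_per_chromosome(lengths: dict, n: int, fixed: bool = False) -> dict:
    """Return how many reads each chromosome would receive (illustrative only)."""
    if fixed:
        # --fixed: every chromosome gets the same N reads.
        return {name: n for name in lengths}
    total = sum(lengths.values())
    # Default mode: share N in proportion to chromosome length.
    # Assumption: simple rounding; wgsim's own rounding may differ.
    return {name: round(n * length / total) for name, length in lengths.items()}


lengths = {"chr1": 3000, "chr2": 1000}  # hypothetical chromosome lengths
print(reads_per_chromosome(lengths, 1000))              # {'chr1': 750, 'chr2': 250}
print(reads_per_chromosome(lengths, 1000, fixed=True))  # {'chr1': 1000, 'chr2': 1000}
```

With --fixed, a classifier's accuracy per input sequence is simply the number of correctly classified reads divided by N.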
Mutation output
The tool simulates mutations assuming a diploid genome. The output generated by pywgsim will look like this:
##gff-version 3
#
# N=10000 err_rate=0 mut_rate=0.001 indel_frac=0.15000001 indel_ext=0.25 size=500 std=50 len1=70 len2=70 seed=1607013056
#
NC_001416.1 wgsim snp 89 89 . + . Name=A/R;Ref=A;Alt=R;Type=het
NC_001416.1 wgsim snp 2825 2825 . + . Name=-/A;Ref=-;Alt=A;Type=het
NC_001416.1 wgsim snp 3712 3712 . + . Name=G/A;Ref=G;Alt=A;Type=hom
NC_001416.1 wgsim snp 4622 4622 . + . Name=G/-;Ref=G;Alt=-;Type=hom
Interpretation:
- A/R means heterozygous mutations with A/A and A/G alleles.
- -/A means an insertion of an A relative to the reference; the type field indicates a heterozygous mutation.
- G/A means homozygous mutations with G/A alleles in both copies.
- G/- means a deletion of a G from the reference; the type field indicates a homozygous mutation.
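The records above can be read programmatically. The following is a minimal sketch, not part of pywgsim: the function name and the returned keys are hypothetical, and it assumes that no field contains embedded whitespace so that tab- and space-separated columns are handled alike.

```python
def parse_mutation(line: str) -> dict:
    """Parse one mutation record of the GFF output into a small dict."""
    # Assumption: fields contain no embedded whitespace.
    fields = line.split()
    seqid, _source, _ftype, start, _end, _score, _strand, _phase, attrs = fields[:9]
    # Attributes look like "Name=A/R;Ref=A;Alt=R;Type=het".
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    ref, alt = attributes["Ref"], attributes["Alt"]
    if ref == "-":
        kind = "insertion"      # base(s) added relative to the reference
    elif alt == "-":
        kind = "deletion"       # base(s) removed from the reference
    else:
        kind = "substitution"
    return {
        "contig": seqid,
        "pos": int(start),
        "ref": ref,
        "alt": alt,
        "kind": kind,
        "zygosity": attributes["Type"],  # "het" or "hom"
    }


example = "NC_001416.1 wgsim snp 2825 2825 . + . Name=-/A;Ref=-;Alt=A;Type=het"
print(parse_mutation(example))
```

When reading a whole file, lines that start with # are headers or comments and should be skipped before calling the parser.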
Ambiguity codes
The table shows base, alternate and complement of the ambiguity code:
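Base | Alternate | Complement
---|---|---
A | A | T
C | C | G
G | G | C
T/U | T | A
M | A or C | K
R | A or G | Y
W | A or T | W
S | C or G | S
Y | C or T | R
K | G or T | M
V | A or C or G | B
H | A or C or T | D
D | A or G or T | H
B | C or G or T | V
N | G or A or T or C | N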
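The table can be encoded as two lookup dictionaries, for example to expand an ambiguity code found in the mutation output or to reverse-complement a sequence that contains such codes. This is a sketch for illustration only: the names AMBIGUITY, COMPLEMENT and reverse_complement are not part of pywgsim, and the T/U row is represented by T alone.

```python
# Bases represented by each code (the "Alternate" column of the table).
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Complement of each code (the "Complement" column of the table).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V", "N": "N",
}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a sequence that may contain ambiguity codes."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(AMBIGUITY["R"])             # AG
print(reverse_complement("ARG"))  # CYT
```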
Read name conventions
The read names are now of the form:
@NC_002945.4|1768156|1768694|0:0:0|4:0:0|4
Where:
- NC_002945.4 is the name of the contig that the fragment was generated from.
- 1768156 is the left-most position of the fragment.
- 1768694 is the right-most position of the fragment.
- 0:0:0 is the number of errors, substitutions and indels in the left-most read of the pair.
- 4:0:0 is the number of errors, substitutions and indels in the right-most read of the pair.
- 4 is the read pair number, unique per contig.
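Because the separator is | rather than _, a read name can be split without ambiguity even when the contig name contains underscores. The sketch below is not part of pywgsim; the ReadName type and parse_read_name function are hypothetical. The pair number is kept as a string because its numeric base is not stated here, and a trailing mate suffix such as /1 is stripped on the assumption that one may be present.

```python
from typing import NamedTuple, Tuple


class ReadName(NamedTuple):
    contig: str
    start: int                  # left-most position of the fragment
    end: int                    # right-most position of the fragment
    left: Tuple[int, int, int]  # errors, substitutions, indels in the left-most read
    right: Tuple[int, int, int] # same counts for the right-most read
    pair: str                   # read pair number, unique per contig


def _counts(field: str) -> Tuple[int, int, int]:
    """Convert a field such as '4:0:0' into a tuple of three integers."""
    errors, subs, indels = (int(x) for x in field.split(":"))
    return errors, subs, indels


def parse_read_name(name: str) -> ReadName:
    """Split a read name into its six '|'-separated fields."""
    # Drop the FASTQ '@' prefix and anything after the first whitespace.
    core = name.lstrip("@").split()[0]
    # Split from the right so the contig name is taken as-is.
    contig, start, end, left, right, pair = core.rsplit("|", 5)
    # Assumption: a mate suffix such as '/1' may follow the pair number.
    pair = pair.split("/")[0]
    return ReadName(contig, int(start), int(end), _counts(left), _counts(right), pair)


parsed = parse_read_name("@NC_002945.4|1768156|1768694|0:0:0|4:0:0|4")
print(parsed.contig, parsed.start, parsed.end, parsed.right)
```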
API
The C interface to wgsim is accessible as a single function call:
from pywgsim import wgsim
wgsim.core(r1="read1.fq", r2="read2.fq", ref="genome.fa", err_rate=0.02, mut_rate=0.001, indel_frac=0.15, indel_ext=0.25, max_n=0.05, is_hap=0, N=100000, dist=500, stdev=50, size_l=100, size_r=100, is_fixed=0, seed=0, gff=None)
The function creates the files r1 and r2.
New in v0.6.1:
- The gff argument allows you to specify a filename for the GFF mutation output. If gff is not provided (or is None), the GFF output is written to stdout (the previous behavior). This makes it possible to use pywgsim safely in multithreaded code without stdout redirection.
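One way to use the gff argument is to give every simulation run its own set of output files. The sketch below is an illustration, not part of pywgsim: the helper names wgsim_kwargs and simulate and the output file naming scheme are hypothetical. Every parameter is passed explicitly with the values from the call shown above, since it is not assumed here that wgsim.core defines defaults, and the gff argument requires v0.6.1 or later.

```python
from pathlib import Path


def wgsim_kwargs(ref: str, outdir: str, tag: str, **overrides) -> dict:
    """Build keyword arguments for wgsim.core with per-run output paths."""
    out = Path(outdir)
    kwargs = {
        "ref": ref,
        "r1": str(out / f"{tag}_read1.fq"),
        "r2": str(out / f"{tag}_read2.fq"),
        "gff": str(out / f"{tag}_mutations.gff"),  # hypothetical naming scheme
        # Values copied from the example call above.
        "err_rate": 0.02, "mut_rate": 0.001, "indel_frac": 0.15,
        "indel_ext": 0.25, "max_n": 0.05, "is_hap": 0, "N": 100000,
        "dist": 500, "stdev": 50, "size_l": 100, "size_r": 100,
        "is_fixed": 0, "seed": 0,
    }
    kwargs.update(overrides)
    return kwargs


def simulate(ref: str, outdir: str, tag: str, **overrides) -> None:
    """Run one simulation, writing reads and mutations to per-run files."""
    # Imported lazily so wgsim_kwargs can be used without pywgsim installed.
    from pywgsim import wgsim
    wgsim.core(**wgsim_kwargs(ref, outdir, tag, **overrides))


print(wgsim_kwargs("genome.fa", "out", "sample1", N=1000)["gff"])
```

Calling simulate("genome.fa", "out", "sample1", N=1000) would then write the reads and the GFF file under out/, provided that directory exists.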