Bioinformatics Test Data Generator

biotdg: Bioinformatics Test Data Generator

biotdg can generate mutations based on vcf files for genomes where the chromosomes have different ploidy. It was made to create test genomes for pipelines that correctly handle the ploidy of sex chromosomes. It can also be used to create test data for pipelines that handle triploid species, such as banana, or for pipelines that discover chromosome imbalances, such as trisomy-21 (Down syndrome) and XXY males (Klinefelter syndrome).

biotdg uses a reference genome, a ploidy table and a vcf file to create a “true genome” for a sample. For example, if the ploidy table states that chr21 has a ploidy of 3 then the “true genome” will have three copies of chr21. Each chr21 copy will have its own mutations based on the vcf file.

After creating the “true genome” fasta file, biotdg uses the dwgsim program to generate fastq reads from it.
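
The way a ploidy table maps onto chromosome copies can be sketched in a few lines of Python. This is only an illustration, not biotdg's own code: the function names are hypothetical, and the `<chromosome>_<index>` naming is taken from the example output shown in this README.

```python
import io

DEFAULT_PLOIDY = 2  # the help text states that chromosomes default to a ploidy of 2


def read_ploidy_table(handle):
    """Parse a two-column, tab-delimited table: chromosome name, ploidy."""
    ploidies = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        name, ploidy = line.split("\t")
        ploidies[name] = int(ploidy)
    return ploidies


def copy_names(chromosome, ploidies):
    """Name every copy of a chromosome; unlisted chromosomes get the default."""
    ploidy = ploidies.get(chromosome, DEFAULT_PLOIDY)
    return [f"{chromosome}_{index}" for index in range(ploidy)]


# Hypothetical table: three copies of chr21, one copy of chrY.
ploidies = read_ploidy_table(io.StringIO("chr21\t3\nchrY\t1\n"))
print(copy_names("chr21", ploidies))  # ['chr21_0', 'chr21_1', 'chr21_2']
print(copy_names("chr1", ploidies))   # ['chr1_0', 'chr1_1'] (not listed, so diploid)
```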

Usage

usage: biotdg [-h] [--version] -r REFERENCE --vcf VCF -p PLOIDY_TABLE -s
              SAMPLE_NAME [-z RANDOM_SEED] [-l READ_LENGTH] [-C COVERAGE]
              [-e READ1_ERROR_RATE] [-E READ2_ERROR_RATE]
              [-n MAXIMUM_N_NUMBER] [-o OUTPUT_DIR]

Bioinformatics Test Data Generator

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -r REFERENCE, --reference REFERENCE
                        Reference genome for the sample.
  --vcf VCF             VCF file with mutations.
  -p PLOIDY_TABLE, --ploidy-table PLOIDY_TABLE
                        Tab-delimited file with two columns specifying the
                        chromosome name and its ploidy. By default all
                        chromosomes have a ploidy of 2.
  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
                        Name of the sample to generate. The sample must be in
                        the VCF file.
  -z RANDOM_SEED, --random-seed RANDOM_SEED
                        Random seed for dwgsim (default: 1).
  -l READ_LENGTH, --read-length READ_LENGTH
                        Read length to be used by dwgsim.
  -C COVERAGE, --coverage COVERAGE
                        Average coverage for the generated reads. NOTE: This
                        is multiplied by the ploidy of the chromosome.
  -e READ1_ERROR_RATE, --read1-error-rate READ1_ERROR_RATE
                        Same as -e flag in dwgsim. per base/color/flow error
                        rate of the first read.
  -E READ2_ERROR_RATE, --read2-error-rate READ2_ERROR_RATE
                        Same as -E flag in dwgsim. per base/color/flow error
                        rate of the second read.
  -n MAXIMUM_N_NUMBER, --maximum-n-number MAXIMUM_N_NUMBER
                        Maximum number of Ns allowed in a given read.
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
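
When biotdg is called from a test harness, the command line can be assembled in Python. The sketch below only builds and prints the command; it uses the flags listed in the help text above, while the helper names and the file names are hypothetical. The `total_coverage` helper reads the note on `-C` literally (requested coverage times ploidy), which is an assumption about how the resulting depth adds up.

```python
import shlex


def biotdg_command(reference, vcf, ploidy_table, sample_name,
                   coverage=None, random_seed=None, output_dir=None):
    """Build an argument list for biotdg from the flags in its help text."""
    command = ["biotdg", "-r", reference, "--vcf", vcf,
               "-p", ploidy_table, "-s", sample_name]
    if coverage is not None:
        command += ["-C", str(coverage)]
    if random_seed is not None:
        command += ["-z", str(random_seed)]
    if output_dir is not None:
        command += ["-o", output_dir]
    return command


def total_coverage(coverage, ploidy):
    """Assumed reading of the -C note: coverage is multiplied by the ploidy."""
    return coverage * ploidy


command = biotdg_command("reference.fasta", "mutations.vcf", "ploidy_table.tsv",
                         "sample1", coverage=10, random_seed=1, output_dir="out")
print(shlex.join(command))
print(total_coverage(10, 3))  # a triploid chromosome at -C 10
```

To actually run the tool, the list can be passed to `subprocess.run(command, check=True)`, provided biotdg and dwgsim are installed.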

Example

Given the following reference.fasta file

>chr1
GATTACA
GATTACA
GATTACA
>chrX
AGTCAGTCAGTC
>chrY
AGAATC

the following ploidy table.tsv

chr1        3
chrX        2
chrY        1

and the following vcf:

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=21>
##contig=<ID=chrX,length=12>
##contig=<ID=chrY,length=6>
#CHROM      POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1
chr1        4       .       T       A,C,G   .       .       .       GT      1/2/3
chr1        7       .       A       T       .       .       .       GT      0/1/0
chrX        1       .       A       T       .       .       .       GT      0/1
chrX        2       .       G       T       .       .       .       GT      0/0
chrY        4       .       A       C       .       .       .       GT      1

A “true genome” for sample1 looks like this:

>chr1_0
GATAACAGATTACAGATTACA
>chr1_1
GATCACTGATTACAGATTACA
>chr1_2
GATGACAGATTACAGATTACA
>chrX_0
AGTCAGTCAGTC
>chrX_1
TGTCAGTCAGTC
>chrY_0
AGACTC
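
The example above can be reproduced with a short Python sketch. It is not biotdg's implementation: it assumes SNPs only, applies the n-th allele of each genotype to the n-th chromosome copy (the phased behaviour described under the known limitations), and splits VCF fields on any whitespace so that the example can be pasted as shown. All function names are hypothetical.

```python
def parse_fasta(text):
    """Return {name: sequence}, joining multi-line sequences."""
    sequences, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            sequences[name] = ""
        else:
            sequences[name] += line
    return sequences


def true_genome(reference, ploidies, vcf_text, sample, default_ploidy=2):
    """Apply each genotype allele to the chromosome copy with the same index."""
    copies = {chrom: [list(seq) for _ in range(ploidies.get(chrom, default_ploidy))]
              for chrom, seq in reference.items()}
    sample_column = None
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue
        fields = line.split()
        if line.startswith("#CHROM"):
            sample_column = fields.index(sample)
            continue
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alleles = [ref] + alt.split(",")
        gt_index = fields[8].split(":").index("GT")
        genotype = fields[sample_column].split(":")[gt_index]
        calls = genotype.replace("|", "/").split("/")
        for copy, call in zip(copies[chrom], calls):
            if call not in ("0", "."):
                copy[pos - 1] = alleles[int(call)]  # VCF positions are 1-based
    return {f"{chrom}_{index}": "".join(copy)
            for chrom, chrom_copies in copies.items()
            for index, copy in enumerate(chrom_copies)}


reference = parse_fasta(
    ">chr1\nGATTACA\nGATTACA\nGATTACA\n>chrX\nAGTCAGTCAGTC\n>chrY\nAGAATC\n")
vcf = """#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
chr1 4 . T A,C,G . . . GT 1/2/3
chr1 7 . A T . . . GT 0/1/0
chrX 1 . A T . . . GT 0/1
chrX 2 . G T . . . GT 0/0
chrY 4 . A C . . . GT 1
"""
genome = true_genome(reference, {"chr1": 3, "chrX": 2, "chrY": 1}, vcf, "sample1")
for name, sequence in genome.items():
    print(f">{name}\n{sequence}")
```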

Why biotdg and not dwgsim?

dwgsim has excellent capabilities for generating reads that are close to real data. Therefore dwgsim is used by biotdg in this capacity.

dwgsim can also generate mutations randomly and output these in VCF format. It also has the capability to use a VCF to generate mutations. This VCF-based method was not deemed sufficient for the following reasons:

  • Very poorly documented.

  • Only allows ploidy of 1 or 2. There is an option ‘3’ but that does something different.

  • How exactly mutations are generated is unknown. Is it aware of phasing? If so, how does it handle it?

biotdg handles the creation of the “true genome” transparently and then uses dwgsim to generate reads. It also handles genomes with mixed ploidies well, which is the case for most species with sex chromosomes.

Known limitations

  • Overlapping mutations are not handled properly. (Probably not a concern for generating test data.)

  • Mutations are always generated in a phased manner. This was easier to implement than an unphased manner. It is also more transparent. Some extra work will be required to handle unphased generation of mutations.

  • biotdg has only been tested with SNPs. Indels and other variant types have not been tested.
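
Given these limitations, it may be worth screening a VCF before handing it to biotdg. The following Python sketch is a hypothetical pre-check and not part of biotdg: it flags records that are not plain SNPs and records whose position falls within an earlier record on the same chromosome. It assumes the VCF is sorted by position and splits fields on any whitespace.

```python
BASES = {"A", "C", "G", "T"}


def check_vcf_records(vcf_text):
    """Return warnings for non-SNP records and for overlapping records."""
    warnings = []
    previous_end = {}  # chromosome -> last reference position covered so far
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split()[:5]
        pos = int(pos)
        alleles = [ref] + alt.split(",")
        if not all(allele.upper() in BASES for allele in alleles):
            warnings.append(f"{chrom}:{pos} is not a SNP")
        if pos <= previous_end.get(chrom, 0):
            warnings.append(f"{chrom}:{pos} overlaps an earlier record")
        end = pos + len(ref) - 1
        previous_end[chrom] = max(previous_end.get(chrom, 0), end)
    return warnings


records = """chr1 4 . T A . . . GT 0/1
chr1 4 . T G . . . GT 1/0
chr1 9 . AT A . . . GT 0/1
"""
for warning in check_vcf_records(records):
    print(warning)
```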
