Amplicon read simulator

Project description

Bygul: Amplicon & Metagenomics Read Simulator

Bygul is a Python 3 tool designed for simulating sequencing reads in wastewater surveillance and other metagenomic applications. It allows users to simulate complex multi-sample datasets with customizable proportions using industry-standard backends like wgsim and mason.


🏗 Installation

Bygul requires Python 3. Because it relies on external simulators (wgsim and mason), we recommend using Conda to manage dependencies. For more information on the wgsim and mason simulators, see their documentation.

Option 1: Via Conda (Recommended)

conda create -n bygul bioconda::bygul

Option 2: Via PyPI

pip install bygul

Note: Some binary dependencies (wgsim/mason) may need to be installed manually or built from source if using this method.

Option 3: Local Build from Source

git clone https://github.com/andersen-lab/Bygul
cd Bygul
pip install -e .

🧬 Usage: Amplicon Sequencing Mode

Use this mode when simulating specific genomic regions defined by a primer set.

Basic Command

bygul simulate-proportions [SAMPLE1.fasta,SAMPLE2.fasta] --primers [primer.bed] --reference [reference.fasta] --proportions [0.8,0.2] --outdir [output_dir]
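Before launching a long run, it can help to confirm that the sample list and the proportion list line up. The sketch below is a pre-flight check only: `check_proportions` is a hypothetical helper, not part of Bygul, and the requirement that proportions sum to 1 is an assumption based on the example values (0.8,0.2), not documented behavior.

```shell
# Illustrative helper (not part of Bygul).
# $1: comma-separated sample list, $2: comma-separated proportions.
# Assumption: one proportion per sample, and proportions sum to 1.
check_proportions() {
    awk -v s="$1" -v p="$2" 'BEGIN {
        ns = split(s, samples, ",")
        np = split(p, props, ",")
        if (ns != np) {
            print "count mismatch: " ns " samples, " np " proportions"
            exit 1
        }
        total = 0
        for (i = 1; i <= np; i++) { total += props[i] }
        diff = total - 1
        if (diff < 0) { diff = -diff }
        if (diff > 1e-6) {
            print "proportions sum to " total ", expected 1"
            exit 1
        }
        print "ok"
    }'
}

# Example use:
#   check_proportions "SAMPLE1.fasta,SAMPLE2.fasta" "0.8,0.2" && echo "safe to run bygul"
```

The function exits non-zero on a mismatch, so it can gate the bygul call with `&&`.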

Advanced Examples

  • Random Proportions & Mismatches: Simulate with random proportions and allow up to 2 SNPs in primer regions.
    bygul simulate-proportions sample1.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --outdir results/ --maxmismatch 2
    
  • Switching Simulators: Use mason instead of the default wgsim.
    bygul simulate-proportions sample1.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --simulator mason
    
  • Custom Error Rates & Lengths: Pass simulator-specific parameters (e.g. indel fraction -R) directly.
    bygul simulate-proportions sample1.fasta,sample2.fasta --primers primer.bed --reference reference.fasta -R 0.01
    

🌍 Usage: Metagenomics Mode

Simulate reads from entire samples without requiring a primer BED file or a reference sequence.

Basic Metagenomics Simulation

bygul simulate-proportions sample1.fasta,sample2.fasta --outdir results/ --simulation_mode metagenomics

Metagenomics with Specific Parameters

bygul simulate-proportions sample1.fasta,sample2.fasta --proportions 0.5,0.5 --outdir results/ --simulation_mode metagenomics --simulator mason --illumina-read-length 200

📝 Technical Notes

Parameter Handling

Bygul acts as a wrapper. Most flags are passed directly to the underlying simulators, but the following are managed by Bygul itself for more realistic simulations (amplicon simulation mode only):

  • --readcnt: Number of reads per amplicon.
  • --wgsim_insert_size: Insert size for wgsim.
  • --wgsim_read_length / --wgsim_error_rate.

To see all available backend flags, run:

wgsim --help
mason_simulator --help

Note that some dependencies are not available through PyPI. Install them with Conda or build them from source.

Amplicon sequencing mode

Example commands

Run the tool using the following command.

If you only want to check your primers without running a simulation, provide a multi-FASTA file containing all the sequences, along with the primer file and the reference used to generate the primer file. This creates the amplicon_stats.csv file described below.

bygul check-primers sequences.fasta primer.bed reference.fasta

Example simulation command

bygul simulate-proportions [SAMPLE1.fasta,SAMPLE2.fasta,..] --primers [primer.bed] --reference [reference.fasta] --proportions [0.8,0.2,..] --outdir [output_directory]

Simulate reads from different samples without defining proportions, allowing up to 2 SNP mismatches in the primer regions. Proportions are assigned randomly and written to results/sample_proportions.txt.

bygul simulate-proportions sample.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --outdir results/ --maxmismatch 2

Simulate reads with user-defined proportions and a specific read simulator. Bygul uses wgsim by default, but you can switch to mason.

bygul simulate-proportions sample.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --proportions 0.2,0.8 --simulator mason

Simulate reads with user-defined proportions and number of reads per amplicon.

bygul simulate-proportions sample.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --proportions 0.2,0.8 --readcnt 1000

Simulate reads with additional parameters such as base error rate, read length, and indel fraction.

bygul simulate-proportions sample.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --proportions 0.2,0.8 --readcnt 1000 -e 0.001 -1 400 -2 400 -R 0.01

Notes

Number of reads per amplicon

It is recommended to define the number of reads per amplicon to be greater than the number of contigs in your amplicon file. This is particularly important when your primers are designed for whole genome sequencing, where each amplicon may contain a substantial number of contigs. Setting too few reads per amplicon may result in empty read files for certain amplicons, leading to incomplete simulated reads.
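As a rough way to apply this advice, the sketch below counts the records in a FASTA file (header lines starting with `>`) so that `--readcnt` can be set above that number. `count_contigs` and the file name `amplicons.fasta` are hypothetical. This note does not say which file Bygul treats as the amplicon file, so point the helper at whichever FASTA you want to check.

```shell
# Illustrative helper (not part of Bygul): count FASTA records by counting
# header lines that begin with '>'. "|| true" keeps the exit status at 0
# when grep finds no headers and prints 0.
count_contigs() {
    grep -c '^>' "$1" || true
}

# Example use (hypothetical file name):
#   n=$(count_contigs amplicons.fasta)
#   echo "choose a --readcnt value greater than $n"
```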

Primer bed file

The primer file must include a column containing the primer sequence. By default, at most 1 SNP mismatch is allowed per primer sequence. To change this limit, use the --maxmismatch flag.
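A quick way to catch a primer file that lacks sequences is to scan each line for a field that looks like one. The sketch below is a heuristic under stated assumptions: the file is tab-separated, and a primer sequence is any field of 10 or more IUPAC nucleotide letters. The position of the sequence column is not specified here, so every field is checked. `lines_missing_primer_seq` is a hypothetical name.

```shell
# Illustrative helper (not part of Bygul): print the line number and content
# of every BED line with no field that looks like a primer sequence.
# Heuristic: a sequence is a field of 10+ IUPAC nucleotide letters.
lines_missing_primer_seq() {
    awk -F'\t' '
        {
            found = 0
            for (i = 1; i <= NF; i++) {
                if (length($i) >= 10 && $i ~ /^[ACGTURYKMSWBDHVNacgturykmswbdhvn]+$/) {
                    found = 1
                }
            }
            if (!found) { print NR ": " $0 }
        }
    ' "$1"
}

# Example use:
#   lines_missing_primer_seq primer.bed
```

No output means every line carried a sequence-like field. Any printed line is worth inspecting by hand.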

Complete set of available parameters

To learn how to adjust other simulator parameters, see the documentation for wgsim and mason. Any simulator parameter can be passed directly on the bygul command line. The only parameters set through bygul are --readcnt and --wgsim_insert_size for amplicon sequencing mode.

Simulated reads output

Simulated reads from all samples are located in provided_output_path/reads.fastq
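To sanity-check the output, you can count the reads in that file. This is a minimal sketch assuming standard four-line FASTQ records (no wrapped sequence lines). `count_fastq_reads` is a hypothetical helper name.

```shell
# Illustrative helper (not part of Bygul): count reads in a FASTQ file,
# assuming every record spans exactly four lines.
count_fastq_reads() {
    # "wc -l < file" prints only the line count, without the file name.
    echo $(( $(wc -l < "$1") / 4 ))
}

# Example use, with results/ standing in for your output directory:
#   count_fastq_reads results/reads.fastq
```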

Information about amplicon dropouts

To learn more about amplicon dropouts, see the provided_output_path/sample_name/amplicon_stats.csv file. The primer_seq_x and primer_seq_y columns give the left and right primer sequences, whereas left_match and right_match show the sequence actually found in the sample, which makes mismatching bases in the primer regions easier to spot. If the matching sequence contains any ambiguous bases, the ambiguous_bases value is true.
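To pull out the amplicons flagged for ambiguous bases, the sketch below filters the stats file on the ambiguous_bases column. It assumes a plain comma-separated file with a header row, no quoted fields containing commas, and a value spelled "true" in any letter case. `flag_ambiguous` is a hypothetical helper name.

```shell
# Illustrative helper (not part of Bygul): print rows of amplicon_stats.csv
# whose ambiguous_bases value is true. The column is located by its header
# name, so its position in the file does not matter.
flag_ambiguous() {
    awk -F',' '
        NR == 1 {
            for (i = 1; i <= NF; i++) { col[$i] = i }
            if (!("ambiguous_bases" in col)) {
                print "no ambiguous_bases column found" > "/dev/stderr"
                exit 1
            }
            next
        }
        tolower($(col["ambiguous_bases"])) == "true" { print }
    ' "$1"
}

# Example use, with results/sample1 standing in for your own output path:
#   flag_ambiguous results/sample1/amplicon_stats.csv
```

The same header lookup can be reused to compare primer_seq_x against left_match if you want to list mismatching primers instead.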

Metagenomics mode

Users can now simulate reads from different samples in a metagenomics setting without specifying a primer bed file. Providing a reference sequence is not required for this setting.

Example commands

Simulate reads from different samples without defining proportions (will be assigned randomly, proportions can be found in results/sample_proportions.txt).

bygul simulate-proportions sample.fasta,sample2.fasta --outdir results/ --simulation_mode metagenomics
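With many samples, typing the comma-separated list by hand is error-prone. The sketch below builds it from a directory of FASTA files. `join_fasta_list` and the directory name `samples` are hypothetical, and the helper assumes the files end in `.fasta` and that no path contains a comma.

```shell
# Illustrative helper (not part of Bygul): join every *.fasta path in a
# directory into one comma-separated string.
join_fasta_list() {
    # With IFS set to a comma, "$*" joins the positional parameters with commas.
    local IFS=,
    set -- "$1"/*.fasta
    echo "$*"
}

# Example use, with the flags from the command above:
#   bygul simulate-proportions "$(join_fasta_list samples)" --outdir results/ --simulation_mode metagenomics
```

If the directory holds no `.fasta` files, the unexpanded pattern is returned, so check the output before passing it on.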

Specify proportions for each sample and add other simulator-specific parameters. For the available simulator parameters, see the wgsim and mason documentation.

bygul simulate-proportions sample.fasta,sample2.fasta --proportions 0.5,0.5 --outdir results/ --simulation_mode metagenomics --simulator mason --illumina-read-length 200

