Skip to main content

Identify and calculate find variant rate between a bacterial genome sequence and either paired-end reads or another genome sequence

Project description

Al2var

al2var identifies variants and caclulates the variant rate between a bacterial genome sequence and either paired-end reads or another genome sequence. al2var can also estimate the number of errors in the assembly.

Introduction

The primary purpose of al2var is to align either paired-end reads or a genome assembly to a given sequence. Al2var aligns the input using either Bowtie2 (paired-end reads) or minimap2 (genome sequence) then interprets the mappings to determine the number of variants and the variant rate.

Al2var can also be used to estimate the number of errors in an assembly. It is recommended that high quality genome assemblies have fewer than 1 error per 100,000 bases (Chain, 2009).

Reference-based error estimation

In reference-based error analysis, a sample is compared to a trusted representative of the same species to identify variants. Either genome could be used as the reference or the query for al2var, but the error rate will be normalized to the length of the reference. This application is not recommended for species that are diverse or that lack a sufficiently close and trusted representative sequence because true variants may be misinterpreted as errors.

de novo error estimation

Errors could also be estimated without a species representative using the al2var bowtie2 mode, where paired-end reads are aligned to a given genome assembly and the variants are interpreted as errors. The paired-end approach to estimating errors requires that the genome and paired-end reads originated from the same sample. In addition, this method assumes that the consensus of multiple Illumina reads at a given location are correct. Though highly accurate, Illumina reads are not impervious to errors so this method is meant to provide an error estimate. To reduce the number of false positives, trim low quality ends off of Illumina reads and use higher read coverage.

Getting Started

Requirements

  • Python 3.7+
  • bcftools v1.9+
  • samtools v1.6+
  • bowtie2 v2.2.5+
  • minimap2 v2.21+

al2var was specifically tested with

  1. python v3.7.16, bcftools v1.9, samtools v1.6, bowtie2 v2.2.5, and minimap2 v2.21
  2. python v3.10.2, bcftools v1.14, samtools v1.14, bowtie2 v2.5.1, and minimap2 v2.24

Installation

It is recomended by not required to install al2var in a conda environment.

Creating a possible conda environment for al2var:

conda create -n al2var python=3.10.2 bcftools=1.14 samtools=1.14 bowtie2=2.5.1 minimap2=2.24

al2var is available on PyPI and can be installed using pip.

pip install al2var

OR clone from GitHub repository.

Output

al2var will create a directory to organize all of the generated output in various subdirectories.

  • Subdirectory vcf/ harbors two files. The file ending in 000.vcf is a sorted vcf that contains a record for each position in the reference sequence regardless of whether the query sequence had the same or a different genotype. If using the minimap2 mode, the genotype column will reflect the genotype of the query sequence at the mapped position. If using the bowtie2 mode, the genotype will reflect the consensus of the reads that mapped to that given position. The second file ending in var.vcf contains only the variants between the reference and the query sequence or input reads.
  • report.txt file reports on three stats: the alignment rate of the query to the reference, the number of variants between the input, and the variant rate. The variant rate is calculated from the reference length and normalized to 100kb.
  • Subdirectory bam/ contains the alignment information in bam format. These files can be viewed in an interactive genome browser such as IGV. These files can be automatically deleted during runtime using the --cleanup flag.
  • The subdirectories indexes/, mpileup/, sam/, and unconc/ contain various intermediary files that may be of interest to the user. Some of these files can be large (i.e. .sam files) and can be automatically deleted during runtime using the --cleanup flag.

Usage

Regardless of mode, al2var requires a reference sequence in fasta format (any extension is acceptable).

Paired-end Mode

When using paired-end reads, al2var aligns the reads to the reference using bowtie2. The paired-end reads must be in fastq format and can be zipped or not.

Basic usage:

al2var bowtie2 -r reference.fasta -1 illumina_raeds_pair1.fastq.gz -2 illumina_reads_pair2.fastq.gz

By default, the bowtie2 alignment step will use a seed sequence of 22bp and will not allow any mismatches in the seed sequence. The lenth of the seed sequence can be modified with the -l or --length_seed flag and the number of permitted mismatches can be increased to 1 using the -n or --num_mismatch flag. Increasing the number of possible mismatches in the seed sequence increases sensitivity but at the expense of runntime and it is not recommended to allow for more than 1 mismatch.

To use a seed sequence of 30bp and allow 1 mismatch, use the following command:

al2var bowtie2 -r reference.fasta -1 illumina_raeds_pair1.fastq.gz -2 illumina_reads_pair2.fastq.gz --length_seed 30 --num_mismatch 1

Query sequence Mode

When using a second genome, al2var aligns the entire sequence to the reference using minimap2. This second genome sequence must be in fasta format (with any extension) and can represent a genome assembly with any number of contigs or an excerpt of a genome.

Basic usage:

al2var minimap2 -r reference.fasta -q query.fasta

General usage

By default al2var creates an output file named out_al2var/ in the current working directory to organize the output files. The name of this file can be changed using the -o or --output_directory flag and the location can be changed using the -p or --output_path flag. The prefix of the report file can be specified using the -s or --savename flag.

Since the intermediary files can be quite large and may not be of use to the user, al2var provides the user with two options for removing intermediary files during runtime. Use of the -c or --cleanup flag will remove all data within the directories bam/, indexes/, mpileup/, sam/, and unconc/. This action will leave the al2var.log and report.txt files along with the vcf/ subdirectory. Additional use of the -b or --keep_bam flags will keep the bam/ subdirectory rather than deleting it.

References

Chain, P. S. G., et al. "Genome project standards in a new era of sequencing." Science 326.5950 (2009): 236-237.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

al2var-1.0.0.tar.gz (9.9 kB view details)

Uploaded Source

Built Distribution

al2var-1.0.0-py3-none-any.whl (9.8 kB view details)

Uploaded Python 3

File details

Details for the file al2var-1.0.0.tar.gz.

File metadata

  • Download URL: al2var-1.0.0.tar.gz
  • Upload date:
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.5

File hashes

Hashes for al2var-1.0.0.tar.gz
Algorithm Hash digest
SHA256 91fb310400a4859abae95d9fc04556b4c99f208a060a6f219d56fbc4be28b271
MD5 d21f7f0931a39a638a1c529462163045
BLAKE2b-256 5b59593d6a22fb0507d0a07c3607813cfa2cda5a5834ed48f6752ea81e648d5f

See more details on using hashes here.

File details

Details for the file al2var-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: al2var-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 9.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.5

File hashes

Hashes for al2var-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 272a11d2102b29d4ea9ae8f7619193da305d7d354396b0645b0a0db1b30f8514
MD5 c1e34027252d614ed34c78faa61728ed
BLAKE2b-256 1b6605660049ca336fbbfda3e11253a8e6db00b16393f6025e0e45569d8ff322

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page