Reorients assembled microbial sequences

dnaapler

Quick Start

# creates conda environment with dnaapler
conda create -n dnaapler_env dnaapler

# activates conda environment
conda activate dnaapler_env

# runs dnaapler chromosome
dnaapler chromosome -i input.fasta -o output_directory_path -p my_bacteria_name -t 8

Description

dnaapler is a simple Python program that takes a single nucleotide input sequence (in FASTA format), finds the desired start gene using blastx against an amino acid sequence database, and checks that the start codon of this gene is found. If it is, dnaapler reorients the chromosome to begin with this gene on the forward strand.
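The reorientation step itself can be sketched in a few lines of Python. This is a minimal illustration, not dnaapler's actual code: the function names, the 0-based `gene_start` coordinate and the `strand` convention (1 or -1) are assumptions made for the example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def reorient(seq: str, gene_start: int, strand: int) -> str:
    """Rotate a circular sequence so that it begins with a gene's first base.

    gene_start is the 0-based position (in input coordinates) of the first
    base of the gene's start codon; strand is 1 (forward) or -1 (reverse).
    Both conventions are assumptions of this sketch.
    """
    if strand == 1:
        # Forward-strand gene: a simple rotation is enough.
        return seq[gene_start:] + seq[:gene_start]
    # Reverse-strand gene: flip the whole sequence so the gene reads forward,
    # then rotate to the gene's position in the flipped coordinates.
    flipped = reverse_complement(seq)
    new_start = len(seq) - 1 - gene_start
    return flipped[new_start:] + flipped[:new_start]
```

For example, `reorient("GGATGAAACC", 2, 1)` returns `"ATGAAACCGG"`, so the output starts with the ATG that sat at position 2 of the input.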

It was originally designed to replicate the reorientation functionality of Unicycler with dnaA, but for long-read-first assembled chromosomes. I have extended it to work with plasmids (dnaapler plasmid) and phages (dnaapler phage), or with any input FASTA using dnaapler custom, dnaapler mystery or dnaapler nearest.

For bacterial chromosomes, dnaapler chromosome should ensure the chromosome breakpoint never interrupts genes or mobile genetic elements like prophages. It is intended to be used with good-quality completed bacterial genomes, generated with methods such as Trycycler, Dragonflye or my own pipeline hybracter.

You can also reorient multiple bacterial chromosomes/plasmids/phages at once using the dnaapler bulk subcommand.

Documentation

The full documentation for dnaapler can be found here.

Commands

  • dnaapler chromosome: Reorients your sequence to begin with the dnaA chromosomal replication initiator gene
  • dnaapler plasmid: Reorients your sequence to begin with the repA plasmid replication initiation gene
  • dnaapler phage: Reorients your sequence to begin with the terL large terminase subunit gene
  • dnaapler custom: Reorients your sequence to begin with a custom amino acid FASTA format gene that you specify
  • dnaapler mystery: Reorients your sequence to begin with a random CDS
  • dnaapler nearest: Reorients your sequence to begin with the first CDS (nearest to the start). Designed for fixing sequences where a CDS spans the breakpoint.
  • dnaapler bulk: Reorients multiple contigs to begin with the desired start gene - either dnaA, terL, repA or a custom gene.
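As a rough illustration of what dnaapler nearest selects, the sketch below picks the CDS whose leftmost coordinate is closest to the beginning of the sequence. The `(start, end, strand)` tuples and the function name are hypothetical; dnaapler's own gene prediction and data structures are not shown here.

```python
from typing import List, Tuple

# Hypothetical CDS record: (start, end, strand), with a 0-based leftmost start.
CDS = Tuple[int, int, int]


def pick_nearest_cds(cds_list: List[CDS]) -> CDS:
    """Return the CDS with the smallest start coordinate."""
    if not cds_list:
        raise ValueError("no CDS predicted in the input sequence")
    return min(cds_list, key=lambda cds: cds[0])
```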

Installation

dnaapler requires only BLAST v2.9 or higher as an external dependency.

Installation from conda is recommended as this will install BLAST automatically.

Conda

dnaapler is available on bioconda.

conda install -c bioconda dnaapler

Pip

You can also install dnaapler with pip.

pip install dnaapler

You will need to install BLAST separately.

e.g.

conda install -c bioconda "blast>=2.9"

Usage

Usage: dnaapler [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  bulk        Reorients multiple genomes to begin with the same gene
  chromosome  Reorients your genome to begin with the dnaA chromosomal...
  citation    Print the citation(s) for this tool
  custom      Reorients your genome with a custom database
  mystery     Reorients your genome with a random CDS
  nearest     Reorients your genome the begin with the first CDS as...
  phage       Reorients your genome to begin with the terL large...
  plasmid     Reorients your genome to begin with the repA replication...

Usage: dnaapler chromosome [OPTIONS]

Reorients your genome to begin with the dnaA chromosomal replication
initiation gene

Options:
-h, --help               Show this message and exit.
-V, --version            Show the version and exit.
-i, --input PATH         Path to input file in FASTA format  [required]
-o, --output PATH        Output directory   [default: output.dnaapler]
-t, --threads INTEGER    Number of threads to use with BLAST  [default: 1]
-p, --prefix TEXT        Prefix for output files  [default: dnaapler]
-f, --force              Force overwrites the output directory
-e, --evalue TEXT        e value for blastx  [default: 1e-10]
-a, --autocomplete TEXT  Choose an option to autocomplete reorientation if
                         BLAST based approach fails. Must be one of: none,
                         mystery or nearest [default: none]
--seed_value INTEGER     Random seed to ensure reproducibility.  [default:
                         13]

The reoriented output FASTA will be {prefix}_reoriented.fasta in the specified output directory.
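If you call dnaapler from a Python script, the expected output location can be built from the same two options. This is a small convenience sketch: the helper name is made up, and the defaults simply mirror the help text above.

```python
from pathlib import Path


def reoriented_fasta_path(output_dir: str = "output.dnaapler",
                          prefix: str = "dnaapler") -> Path:
    """Build the path {output_dir}/{prefix}_reoriented.fasta."""
    return Path(output_dir) / f"{prefix}_reoriented.fasta"
```

For the Quick Start command, `reoriented_fasta_path("output_directory_path", "my_bacteria_name")` points at `output_directory_path/my_bacteria_name_reoriented.fasta`.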

Example Usage

dnaapler chromosome -i input.fasta -o output_directory_path -p my_bacteria_name -t 8
dnaapler phage -i input.fasta -o output_directory_path -p my_phage_name -t 8
dnaapler plasmid -i input.fasta -o output_directory_path -p my_plasmid_name -t 8
dnaapler custom -i input.fasta -o output_directory_path -p my_genome_name -t 8 -c my_custom_database_file
dnaapler mystery -i input.fasta -o output_directory_path -p my_genome_name
dnaapler nearest -i input.fasta -o output_directory_path -p my_genome_name
# to reorient multiple bacterial chromosomes
dnaapler bulk -i input_file_with_multiple_chromosomes.fasta -m chromosome -o output_directory_path -p my_genome_name 

Databases

dnaapler chromosome uses 584 proteins downloaded from Swissprot with the query "Chromosomal replication initiator protein DnaA" on 24 May 2023 as its database for dnaA. All hits from the query were also filtered to ensure "GN=dnaA" was included in the header of the FASTA entry.
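A header filter of this kind could look like the sketch below. It is an assumption about how such a filter might be written, not the script used to build the database; note that a plain substring test would also keep headers such as "GN=dnaA2".

```python
def filter_fasta_by_header(fasta_text: str, required: str = "GN=dnaA") -> str:
    """Keep only FASTA records whose header line contains `required`."""
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Decide once per record, based on its header line.
            keep = required in line
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```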

dnaapler plasmid uses the repA database curated by Ryan Wick in Unicycler.

dnaapler phage uses a terL database curated using PHROGs. I downloaded all the amino acid sequences of the 55 PHROGs annotated as 'large terminase subunit', combined them, and deduplicated them using seqkit: seqkit rmdup -s -o terL.faa phrog_terL.faa.
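Deduplicating by sequence keeps the first record for each distinct sequence and drops later exact copies. The Python sketch below shows the idea on (header, sequence) pairs; it is an illustration only, not a replacement for seqkit, and the function name is made up.

```python
from typing import Iterable, List, Tuple


def dedupe_by_sequence(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record seen for each exact (case-sensitive) sequence."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique
```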

dnaapler custom uses a custom amino acid FASTA format file that you specify using -c.

The matching is strict: it requires a strong BLASTx match (default e-value 1E-10), and the first amino acid of the BLASTx hit gene must be Methionine, Valine or Leucine, the amino acids encoded by the three most commonly used start codons in bacteria/phages (ATG, GTG and TTG).
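The two acceptance criteria can be expressed as a short check. This sketch assumes you already have the hit's e-value and the translated sequence of the hit gene; the function name and the inclusive threshold are assumptions, and dnaapler's internal logic may differ in detail.

```python
# Amino acids encoded by the start codons ATG (M), GTG (V) and TTG (L).
VALID_FIRST_RESIDUES = {"M", "V", "L"}


def hit_is_acceptable(evalue: float, hit_protein: str,
                      max_evalue: float = 1e-10) -> bool:
    """Return True if the hit is strong and begins with M, V or L."""
    strong_match = evalue <= max_evalue
    # Slicing with [:1] avoids an IndexError on an empty string.
    starts_with_start_codon = hit_protein[:1].upper() in VALID_FIRST_RESIDUES
    return strong_match and starts_with_start_codon
```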

For the most commonly studied microbes (ESKAPE pathogens, etc), the dnaA database should suffice.

If you try dnaapler on a more novel or under-studied microbe with a dnaA gene that has little sequence similarity to the database, you may need to provide your own dnaA gene(s) in amino acid FASTA format using dnaapler custom.

dnaapler mystery was added in response to an issue raised about this problem. It predicts all ORFs in the input using pyrodigal, then picks a random gene to reorient your sequence with.
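The random choice can be made reproducible with a fixed seed, which is what the --seed_value option (default 13) is for. The sketch below shows the idea with Python's standard library on a hypothetical list of predicted genes; it does not call pyrodigal and is not dnaapler's own implementation.

```python
import random
from typing import List, Tuple

# Hypothetical gene record: (start, end, strand).
Gene = Tuple[int, int, int]


def pick_random_gene(genes: List[Gene], seed: int = 13) -> Gene:
    """Pick one gene at random; the same seed and input give the same gene."""
    if not genes:
        raise ValueError("no genes predicted in the input sequence")
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.choice(genes)
```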

Motivation

  1. I couldn't get Circlator to work and it is no longer supported.
  2. berokka doesn't orient chromosomes to begin with dnaA.
  3. After reading Ryan Wick's masterful bacterial genome assembly tutorial, I realised that it is probably optimal to run two polishing steps, one before and one after rotating the chromosome, to ensure the breakpoint is polished. Further, for some "complete" long-read bacterial assemblies that didn't circularise properly, I figured that as long as you have a complete assembly (even if it is not marked as "circular" in Flye), polishing after a reorientation would be likely to circularise the chromosome. A bit like Ryan's rotate_circular_gfa.py script, without the requirement of strict circularity.
  4. While researching MGEs in S. aureus whole genome sequences, I repeatedly found instances where MGEs were interrupted by the chromosome breakpoint. So I thought I'd add a tool to automate reorientation in my pipeline.
  5. It's probably good to have all your sequences start at the same location for synteny analyses.

Acknowledgements

Thanks to Torsten Seemann, Ryan Wick and the Circlator team for their existing work in the space. Thanks also to Michael Hall, from whose repository tbpore I adapted a lot of scaffolding code, because he writes really nice code.
