Reorients assembled microbial sequences

dnaapler

Description

dnaapler is a simple Python program that takes a single nucleotide input sequence (in FASTA format), finds the desired start gene using blastx against an amino acid database, checks that the start of the gene is found and, if so, reorients the chromosome to begin with this gene on the forward strand.
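
The reorientation step can be illustrated with a minimal sketch. This is not dnaapler's actual code: the function names and the coordinate convention (a 0-based index on the input sequence) are assumptions made for illustration, and the sequence is assumed to be circular and to contain only A, C, G, T or N.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def reorient(seq: str, gene_start: int, strand: str) -> str:
    """Rotate a circular sequence so that it begins with a gene's first base.

    gene_start is the 0-based index, on the input sequence as given, of the
    position holding the gene's first coding base (hypothetical convention).
    strand is "+" if the gene is on the forward strand, "-" otherwise.
    """
    if strand == "+":
        # Gene already on the forward strand: a plain rotation is enough.
        return seq[gene_start:] + seq[:gene_start]
    if strand == "-":
        # Flip the whole sequence so the gene ends up on the forward strand,
        # then rotate to the gene's first base in the flipped coordinates.
        rc = reverse_complement(seq)
        rc_start = len(seq) - 1 - gene_start
        return rc[rc_start:] + rc[:rc_start]
    raise ValueError("strand must be '+' or '-'")
```

For example, reorient("TTATGAAACC", 2, "+") returns "ATGAAACCTT", and a gene on the reverse strand ends up on the forward strand of the output.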

It was designed to replicate the reorientation functionality of Unicycler with dnaA, but for FASTA input and for long-read-first assembled chromosomes. I have extended it to work with plasmids and phages, or for any input FASTA desired with dnaapler custom or dnaapler mystery.

For bacterial chromosomes, dnaapler chromosome should ensure the chromosome breakpoint never interrupts genes or mobile genetic elements like prophages. It is intended to be used with good-quality completed bacterial genomes, generated with methods such as Trycycler, Dragonflye or my own pipeline hybracter.

Installation

dnaapler requires only BLAST as an external dependency.

Installation with conda is recommended, as this will install BLAST automatically once the dnaapler conda package becomes available.

Conda

conda install -c bioconda dnaapler

Pip

pip install dnaapler

You will need to install BLAST separately.

e.g. conda install -c bioconda blast

Usage

Usage: dnaapler [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  chromosome  Reorients your sequence to begin with the dnaA chromosomal...
  citation    Print the citation(s) for this tool
  custom      Reorients your sequence with a custom database
  mystery     Reorients your sequence with a random gene
  phage       Reorients your sequence to begin with the terL large...
  plasmid     Reorients your sequence to begin with the repA replication...

Usage: dnaapler chromosome [OPTIONS]

  Reorients your sequence to begin with the dnaA chromosomal replication
  initiation gene

Options:
  -h, --help             Show this message and exit.
  -V, --version          Show the version and exit.
  -i, --input PATH       Path to input file in FASTA format  [required]
  -o, --output PATH      Output directory  [default: output.dnaapler]
  -t, --threads INTEGER  Number of threads to use with BLAST.  [default: 1]
  -p, --prefix TEXT      Prefix for output files.  [default: dnaapler]
  -f, --force            Force overwrites the output directory
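
If you want to call dnaapler chromosome from a Python script, a small helper along these lines may be useful. It is only a sketch: the helper name chromosome_command and the file names are hypothetical, and it uses only the options listed in the help text above.

```python
def chromosome_command(input_fasta, outdir="output.dnaapler", threads=1,
                       prefix="dnaapler", force=False):
    """Build the argument list for `dnaapler chromosome` (hypothetical helper)."""
    cmd = [
        "dnaapler", "chromosome",
        "-i", str(input_fasta),   # input FASTA (required)
        "-o", str(outdir),        # output directory
        "-t", str(threads),       # threads passed to BLAST
        "-p", str(prefix),        # prefix for output files
    ]
    if force:
        cmd.append("-f")          # overwrite the output directory
    return cmd


# "assembly.fasta" and "reoriented" are placeholder names.
cmd = chromosome_command("assembly.fasta", outdir="reoriented", threads=8, force=True)
print(" ".join(cmd))
```

With dnaapler and BLAST installed, the list can be passed to subprocess.run(cmd, check=True) to run the command.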

Databases

dnaapler chromosome uses 733 proteins downloaded from Swissprot with the query "Chromosomal replication initiator protein DnaA" on 24 May 2023 as its database for dnaA.

dnaapler plasmid uses the repA database curated by Ryan Wick in Unicycler.

dnaapler phage uses a terL database curated using PHROGs. I downloaded all the AA sequences of the 55 phrogs annotated as 'large terminase subunit', combined them, and deduplicated them using seqkit: seqkit rmdup -s -o terL.faa phrog_terL.faa.

dnaapler custom uses a custom amino acid FASTA format gene(s) that you specify using -c.

The matching is strict: it requires a strong BLAST match (e-value threshold of 1E-10) and the first amino acid of the BLAST hit gene to be Methionine, Valine or Leucine, the amino acids encoded by the 3 most used start codons in bacteria/phages.
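
As a rough illustration of the two conditions, the check might look like the sketch below. The function and constant names are hypothetical, and treating the threshold as inclusive is an assumption; dnaapler's own implementation may differ in detail.

```python
# Amino acids encoded by the start codons ATG, GTG and TTG.
START_AMINO_ACIDS = {"M", "V", "L"}
# E-value cutoff quoted above (inclusive comparison is an assumption).
EVALUE_THRESHOLD = 1e-10


def passes_strict_match(evalue: float, hit_protein: str) -> bool:
    """Return True if a BLAST hit meets both strict-matching conditions.

    hit_protein is the translated sequence of the hit gene; only its first
    residue is inspected. An empty sequence never passes.
    """
    strong_hit = evalue <= EVALUE_THRESHOLD
    starts_with_start_codon = hit_protein[:1].upper() in START_AMINO_ACIDS
    return strong_hit and starts_with_start_codon
```

A hit with a good e-value whose alignment begins partway into the gene (for example, at an Alanine) would be rejected, since the true start of the gene has not been found.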

For the most commonly studied microbes (ESKAPE pathogens, etc), the dnaA database should suffice.

If you try dnaapler on a more novel or under-studied microbe with a dnaA gene that has little sequence similarity to the database, you may need to provide your own dnaA gene(s) in amino acid FASTA format using dnaapler custom.

After this issue, dnaapler mystery was added. It predicts all ORFs in the input, then picks a random one to re-orient your sequence with.
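
The idea behind mystery can be sketched as follows. This README does not say how dnaapler predicts ORFs, so the naive ATG-to-stop scan below (forward strand only) is a stand-in chosen purely for illustration, and all names are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def forward_orfs(seq, min_codons=3):
    """Return (start, end) pairs of ATG..stop ORFs on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    This is a naive scan, not a real gene predictor.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def pick_mystery_start(seq, rng=None):
    """Pick the start position of one randomly chosen ORF, or None if no ORF."""
    rng = rng or random.Random()
    orfs = forward_orfs(seq)
    if not orfs:
        return None
    return rng.choice(orfs)[0]
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which can be handy when comparing runs.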

Motivation

  1. I couldn't get Circlator to work and it is no longer supported.
  2. berokka doesn't orient chromosomes to begin with dnaA.
  3. After reading Ryan Wick's masterful bacterial genome assembly tutorial, I realised that it is probably optimal to run 2 polishing steps, one before and one after rotating the chromosome, to ensure the breakpoint is polished. Further, for some "complete" assemblies that didn't circularise properly, I figured that as long as you have a complete assembly (even if it is not marked as "circular" by Flye), polishing after a re-orientation would be likely to circularise the chromosome. A bit like Ryan's rotate_circular_gfa.py script, without the requirement of strict circularity.
  4. While researching MGEs in S. aureus whole genome sequences, I repeatedly found instances where MGEs were interrupted by the chromosome breakpoint. So I thought I'd add a tool to automate reorientation in my pipeline.
  5. It's probably good to have all your sequences start at the same location for synteny analyses.

Polishing Afterwards

I recommend that you undertake 2 rounds of polishing: the first before running dnaapler, and the second after. I'd highly recommend a conservative polisher like Polypolish if you have short reads, otherwise 2 rounds of medaka.

Acknowledgements

Thanks to Torsten Seemann, Ryan Wick and the Circlator team for their existing work in the space. Also to Michael Hall, whose repository tbpore I took and adapted a lot of scaffolding code from because he writes really nice code, Rob Edwards, because everything always comes back to phages, and especially Vijini Mallawaarachchi who taught me how to actually do something resembling legitimate software development.
