Reorients assembled microbial sequences

Project description

dnaapler

Description

dnaapler is a simple Python program that takes a single nucleotide input sequence (in FASTA format), finds the desired start gene using blastx against an amino acid database, checks that the start of the gene is found, and if so, reorients the chromosome to begin with this gene on the forward strand.
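The reorientation step described above can be sketched roughly as follows. This is a minimal illustration and not dnaapler's actual code: the function names, the 0-based `gene_start` coordinate and the `"+"`/`"-"` strand labels are assumptions made for this example.

```python
# Illustrative sketch only; names and coordinate conventions are assumptions.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def reorient(seq: str, gene_start: int, strand: str) -> str:
    """Rotate a circular sequence so it begins with the gene on the forward strand.

    gene_start is the 0-based forward-strand index of the base that pairs with
    the first base of the gene's start codon. For a reverse-strand gene this is
    the highest forward coordinate covered by the gene.
    """
    if strand == "+":
        # Cut at the gene start and move the preceding bases to the end.
        return seq[gene_start:] + seq[:gene_start]
    if strand == "-":
        # Rotate so the gene's first base is the last forward base, then flip.
        rotated = seq[gene_start + 1:] + seq[:gene_start + 1]
        return reverse_complement(rotated)
    raise ValueError("strand must be '+' or '-'")


print(reorient("GGGATGAAATTT", 3, "+"))
print(reorient("AATTTCATCCCA", 7, "-"))
```

Both calls describe the same circular molecule, once with the gene on the forward strand and once on the reverse strand, and both return a sequence that starts with the ATG.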

It was designed to replicate the reorientation functionality of Unicycler with dnaA, but for FASTA input and for long-read first assembled chromosomes. I have extended it to work with plasmids and phages, or with any gene you choose for any input FASTA, using dnaapler custom or dnaapler mystery.

For bacterial chromosomes, dnaapler chromosome should ensure the chromosome breakpoint never interrupts genes or mobile genetic elements like prophages. It is intended to be used with good-quality completed bacterial genomes, generated with methods such as Trycycler, Dragonflye or my own pipeline hybracter.

Installation

dnaapler requires only BLAST as an external dependency.

Installation from conda is recommended once the package becomes available there, as conda will install BLAST automatically.

Conda

conda install -c bioconda dnaapler

Pip

pip install dnaapler

You will need to install BLAST separately.

e.g. conda install -c bioconda blast

Usage

Usage: dnaapler [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  chromosome  Reorients your sequence to begin with the dnaA chromosomal...
  citation    Print the citation(s) for this tool
  custom      Reorients your sequence with a custom database
  mystery     Reorients your sequence with a random gene
  phage       Reorients your sequence to begin with the terL large...
  plasmid     Reorients your sequence to begin with the repA replication...

Usage: dnaapler chromosome [OPTIONS]

Reorients your sequence to begin with the dnaA chromosomal replication
initiation gene

Options:
-h, --help             Show this message and exit.
-V, --version          Show the version and exit.
-i, --input PATH       Path to input file in FASTA format  [required]
-o, --output PATH      Output directory   [default: output.dnaapler]
-t, --threads INTEGER  Number of threads to use with BLAST.  [default: 1]
-p, --prefix TEXT      Prefix for output files.  [default: dnaapler]
-f, --force            Force overwrites the output directory
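
If you want to call dnaapler chromosome from a Python script, the options listed above can be assembled into a command as sketched below. The helper name `build_chromosome_command` and the example file names are hypothetical; only the flags themselves come from the help text.

```python
# Hypothetical helper; only the flags (-i, -o, -t, -p, -f) come from the help text.
def build_chromosome_command(input_fasta, output_dir="output.dnaapler",
                             threads=1, prefix="dnaapler", force=False):
    """Build the argument list for a `dnaapler chromosome` run."""
    cmd = [
        "dnaapler", "chromosome",
        "-i", str(input_fasta),   # input FASTA (required)
        "-o", str(output_dir),    # output directory
        "-t", str(threads),       # threads passed to BLAST
        "-p", str(prefix),        # prefix for output files
    ]
    if force:
        cmd.append("-f")          # overwrite an existing output directory
    return cmd


cmd = build_chromosome_command("assembly.fasta", threads=8, force=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where dnaapler and BLAST are installed.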

Databases

dnaapler chromosome uses 733 proteins downloaded from Swissprot with the query "Chromosomal replication initiator protein DnaA" on 24 May 2023 as its database for dnaA.

dnaapler plasmid uses the repA database curated by Ryan Wick in Unicycler.

dnaapler phage uses a terL database curated using PHROGs. I downloaded all the AA sequences of the 55 phrogs annotated as 'large terminase subunit', combined them, and deduplicated them with seqkit: seqkit rmdup -s -o terL.faa phrog_terL.faa.
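
For readers unfamiliar with sequence-level deduplication, the idea behind the seqkit step above can be sketched in plain Python. This is only an illustration of removing records with identical sequences (keeping the first one seen); it is not how the database was built, and the function names are made up for this example.

```python
# Illustrative only: the terL database itself was deduplicated with seqkit.
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedup_by_sequence(records):
    """Keep the first record for each distinct sequence, preserving order."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique


example = ">a\nMKT\n>b\nMKV\n>c\nMK\nT\n"
print(dedup_by_sequence(parse_fasta(example)))
```

Record `c` is dropped because, once its wrapped lines are joined, its sequence is identical to that of record `a`.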

dnaapler custom uses a custom amino acid FASTA format gene(s) that you specify using -c.

The matching is strict: it requires a strong BLAST match (e-value 1E-10), and the first amino acid of the BLAST hit gene must be Methionine, Valine or Leucine, the amino acids encoded by the 3 most used start codons in bacteria/phages.
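
The strict matching rule above could be expressed as a filter like the one below. This is a sketch under stated assumptions, not dnaapler's implementation: the `Hit` record and its field names are hypothetical, treating the e-value as an upper bound (`<=`) is an assumption, and requiring the alignment to begin at residue 1 of the database protein is one plausible reading of "the start of a gene is found".

```python
from dataclasses import dataclass

EVALUE_CUTOFF = 1e-10              # threshold quoted in the text
START_RESIDUES = {"M", "V", "L"}   # Methionine, Valine, Leucine


@dataclass
class Hit:
    """Hypothetical record holding a few fields of one blastx hit."""
    subject_id: str
    evalue: float
    subject_start: int   # 1-based alignment start in the database protein
    query_aligned: str   # translated query residues in the alignment


def passes_strict_match(hit: Hit) -> bool:
    """Return True if the hit is strong and begins at a plausible gene start."""
    return (
        hit.evalue <= EVALUE_CUTOFF                     # assumption: inclusive cutoff
        and hit.subject_start == 1                      # assumption: hit covers the gene start
        and hit.query_aligned[:1] in START_RESIDUES     # first residue is M, V or L
    )


print(passes_strict_match(Hit("dnaA_example", 1e-50, 1, "MSLSL")))
```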

For the most commonly studied microbes (ESKAPE pathogens, etc), the dnaA database should suffice.

If you try dnaapler on a more novel or under-studied microbe with a dnaA gene that has little sequence similarity to the database, you may need to provide your own dnaA gene(s) in amino acid FASTA format using dnaapler custom.

After this issue was raised, dnaapler mystery was added. It predicts all ORFs in the input, then picks one at random to re-orient your sequence with.
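
To make the dnaapler mystery idea concrete, here is a deliberately naive sketch: it scans only the forward strand for simple start-to-stop ORFs, picks one at random, and rotates the sequence to begin there. The ORF finder, the function names, the `min_codons` and `seed` parameters are all assumptions for illustration; the real tool's gene prediction is not described in this README and is likely more sophisticated.

```python
import random

START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_forward_orfs(seq, min_codons=3):
    """Return 0-based half-open (start, end) spans of naive forward-strand ORFs.

    An ORF here runs from the first start codon in a frame to the next
    in-frame stop codon (stop included); min_codons counts the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def reorient_at_random_orf(seq, seed=None):
    """Rotate seq to begin at a randomly chosen ORF; return it unchanged if none."""
    orfs = find_forward_orfs(seq)
    if not orfs:
        return seq
    start, _ = random.Random(seed).choice(orfs)
    return seq[start:] + seq[:start]


print(reorient_at_random_orf("CCATGAAATAGCC", seed=0))
```

Passing a fixed `seed` makes the random choice reproducible, which is convenient when testing.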

Motivation

  1. I couldn't get Circlator to work and it is no longer supported.
  2. berokka doesn't orient chromosomes to begin with dnaA.
  3. After reading Ryan Wick's masterful bacterial genome assembly tutorial, I realised that it is probably optimal to run 2 polishing steps, once before and once after rotating the chromosome, to ensure the breakpoint is polished. Further, for some "complete" assemblies that didn't circularise properly, I figured that as long as you have a complete assembly (even if it is not marked as "circular" in Flye), polishing after a re-orientation would be likely to circularise the chromosome. A bit like Ryan's rotate_circular_gfa.py script, without the requirement of strict circularity.
  4. While researching MGEs in S. aureus whole genome sequences, I repeatedly found instances where MGEs were interrupted by the chromosome breakpoint. So I thought I'd add a tool to automate it in my pipeline.
  5. It's probably good to have all your sequences start at the same location for synteny analyses.

Polishing Afterwards

I recommend that you undertake 2 rounds of polishing: the first before running dnaapler, and the second after. I'd highly recommend a conservative polisher like Polypolish if you have short reads, otherwise 2 rounds of medaka.

Acknowledgements

Thanks to Torsten Seemann, Ryan Wick and the Circlator team for their existing work in the space. Also to Michael Hall, whose repository tbpore I took and adapted a lot of scaffolding code from because he writes really nice code, Rob Edwards, because everything always comes back to phages, and especially Vijini Mallawaarachchi who taught me how to actually do something resembling legitimate software development.
