Annotation of genomes and contigs

metaerg.py, version 2.2.X

Metaerg.py annotates genomes or sets of MAGs/bins from microbial ecosystems (bacteria, archaea, viruses). The input is a nucleotide FASTA file with one or more contigs. The output is written in common formats such as .gff, .gbk, .fasta and .html, and contains the predicted genes, their functions and their taxonomic classifications.
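
Because the .gff output is a plain tab-separated format, it can be summarized with a few lines of Python. The sketch below counts features by type and assumes only the standard nine-column GFF3 layout. The example records are invented for illustration and do not show the feature names or attributes that metaerg actually writes.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF records per feature type (column 3 of the 9-column layout)."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines, comments and directives such as "##gff-version 3".
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # Lines without nine columns (e.g. an embedded FASTA section) are ignored.
        if len(columns) >= 9:
            counts[columns[2]] += 1
    return counts


# Invented example records, NOT real metaerg output.
example = [
    "##gff-version 3",
    "contig1\texample\tCDS\t1\t300\t.\t+\t0\tID=gene1",
    "contig1\texample\ttRNA\t400\t475\t.\t-\t.\tID=gene2",
    "contig1\texample\tCDS\t500\t950\t.\t+\t0\tID=gene3",
]
print(count_feature_types(example))
```

To use it on a real file, pass an open file handle instead of the list, for example `count_feature_types(open("annotations.gff"))`, where the file name is a placeholder.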

You can interact with a sample visualization here and here. These visualizations show the annotation of a cyanobacterial genome, Candidatus Phormidium alkaliphilum.

Metaerg was originally developed in Perl. It was relatively challenging to install and came with complex database dependencies. This new Python version 2.2 overcomes some of those issues. The annotation pipeline has also evolved further and become more refined.

By using gtdbtk for taxonomic classification of genes and transferring functional annotations from the NCBI, metaerg.py uses a controlled vocabulary for taxonomy and a relatively clean vocabulary for functions. This makes its annotations much more concise than those of the original version of metaerg and of many other annotation tools. In addition, metaerg uses NCBI's Conserved Domain Database (CDD) and RPSBlast to assign genes to subsystems for effective data exploration. Subsystems are a work in progress and can be expanded and customized as needed.

The Metaerg 2.2 pipeline consists of the following steps:

  • (optional) identifies CRISPR regions using Minced.
  • (optional) identifies tRNAs using Aragorn.
  • (required) identifies RNA genes and other non-coding features using Infernal (cmscan) and RFAM.
  • (optional) identifies retrotransposons with LTRHarvest.
  • (optional) identifies tandem repeats with Tandem Repeats Finder.
  • (optional) identifies other repeat regions with Repeatscout and Repeatmasker.
  • (required) identifies coding genes with Prodigal.
  • (required) annotates taxonomy and functions of RNA and protein genes using Diamond, NCBI blastn and a database of 23,145 bacterial, 11,508 viral and 150 eukaryotic genomes.
  • (required) annotates gene functions using RPSBlast and NCBI's Conserved Domain Database (CDD).
  • (optional) annotates genes involved in production of secondary metabolites using Antismash.
  • (optional) annotates membrane and translocated proteins using TMHMM and SignalP.
  • (built-in) assigns genes to a built-in database of physiological subsystems.
  • (built-in) presents annotations in intuitive, searchable, colorful HTML based on datatables/jQuery, which can be explored in a web browser and copied and pasted into Excel.

Usage:

metaerg --contig_file contig-file.fna --database_dir /path/to/metaerg-databases/

To annotate a set of genomes in a given directory (each file should contain the contigs of a single genome):

metaerg --contig_file dir-with-contig-files --database_dir /path/to/metaerg-databases/

Metaerg needs 20-30 min to annotate a 4 Mb genome on a desktop computer.
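
When metaerg is called from a larger workflow, it can help to assemble the command line in one place. The following is a minimal sketch that uses only the two options shown above (`--contig_file` and `--database_dir`). The helper name `build_metaerg_command` is hypothetical and is not part of metaerg.

```python
import os
import shlex


def build_metaerg_command(contigs, database_dir):
    """Return the argument list for annotating one FASTA file or a directory of them.

    Hypothetical helper: it only arranges the two options documented above.
    """
    return [
        "metaerg",
        "--contig_file", os.fspath(contigs),
        "--database_dir", os.fspath(database_dir),
    ]


cmd = build_metaerg_command("contig-file.fna", "/path/to/metaerg-databases/")
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually run it (requires metaerg and its databases to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than as a single string, avoids quoting problems when paths contain spaces.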

Installation

For help with installing pipeline programs, have a look at this script for step by step installation instructions/commands.

Running that script installs all required programs as well as the optional ones. To install only the required programs, proceed as follows:

#(infernal) cmsearch 1.1.4 http://eddylab.org/infernal/  
wget http://eddylab.org/infernal/infernal-1.1.4-linux-intel-gcc.tar.gz  
tar -xf infernal-1.1.4-linux-intel-gcc.tar.gz  
mv infernal-1.1.4-linux-intel-gcc infernal  
rm infernal-1.1.4-linux-intel-gcc.tar.gz  

#(prodigal) prodigal 2.6.3 https://github.com/hyattpd/Prodigal  
wget https://github.com/hyattpd/Prodigal/releases/download/v2.6.3/prodigal.linux  
chmod a+x prodigal.linux  
ln -sf prodigal.linux prodigal  

#(ncbi-blast) blastn 2.13.0 https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/  
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.13.0+-x64-linux.tar.gz  
tar -xf ncbi-blast-2.13.0+-x64-linux.tar.gz  
mv ncbi-blast-2.13.0+ ncbi-blast  
rm ncbi-blast-2.13.0+-x64-linux.tar.gz  

#(diamond) diamond 2.0.15 https://github.com/bbuchfink/diamond  
wget https://github.com/bbuchfink/diamond/releases/download/v2.0.15/diamond-linux64.tar.gz  
tar -xf diamond-linux64.tar.gz  
rm diamond-linux64.tar.gz  

Then, make sure they are in your system's $PATH.
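
One way to confirm that the programs are on the `$PATH` is a short check with `shutil.which`. This is a sketch only: the executable names below are assumptions derived from the tools listed above, and your installation may expose them under different names (for example `prodigal.linux` if the symlink was not created).

```python
import shutil

# Assumed executable names for the required tools; adjust to your installation.
REQUIRED_PROGRAMS = ("cmscan", "prodigal", "blastn", "rpsblast", "diamond")


def missing_programs(programs=REQUIRED_PROGRAMS):
    """Return the names in `programs` that cannot be found on the current PATH."""
    return [name for name in programs if shutil.which(name) is None]


missing = missing_programs()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required programs were found on PATH.")
```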

To install metaerg itself, run the following, preferably in a virtual environment:

python -m virtualenv python-env  
source python-env/bin/activate  
pip install metaerg  
deactivate  

Databases

The metaerg annotation databases can be downloaded here and are created from the following sources:

  • gtdbtk is used for its taxonomy
  • NCBI annotations of >40K representative archaeal and bacterial genomes present in gtdb are sourced directly from the NCBI FTP server.
  • NCBI (RefSeq) annotations of viral genes are obtained from viral RefSeq.
  • For eukaryotes, one genome is added to the database, using ncbi-datasets, for each taxon within Amoebozoa, Ancyromonadida, Apusozoa, Breviatea, CRuMs, Cryptophyceae, Discoba, Glaucocystophyceae, Haptista, Hemimastigophora, Malawimonadida, Metamonada, Rhodelphea, Rhodophyta, Sar, Aphelida, Choanoflagellata, Filasterea, Fungi, Ichthyosporea and Rotosphaerida.
  • RFAM and CDD databases are also used.
  • Specialized function databases - Cant-Hyd and MetaScan.

If for some reason you need to build this database yourself (this is usually not needed, as the metaerg database can be downloaded from the link above):

metaerg-build-databases --target_dir /path/to/metaerg-databases/ --gtdbtk_dir /path/to/gtdbtk-database/ [--tasks [PVEBRC]]

with tasks:

  • P - build prokaryotes
  • V - build viruses
  • E - build eukaryotes
  • B - build PVE blast databases
  • R - build RFAM
  • C - build CDD
  • S - build specialized functional databases
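
The task codes are single letters that can be combined into one string. As an illustration of how such a string maps onto the list above, here is a small sketch. The helper `describe_tasks` is hypothetical, and whether metaerg-build-databases validates its `--tasks` argument in this way is an assumption.

```python
# Task codes and descriptions copied from the list above.
TASKS = {
    "P": "build prokaryotes",
    "V": "build viruses",
    "E": "build eukaryotes",
    "B": "build PVE blast databases",
    "R": "build RFAM",
    "C": "build CDD",
    "S": "build specialized functional databases",
}


def describe_tasks(task_string):
    """Translate a string such as 'PVB' into task descriptions, in the order given."""
    unknown = [code for code in task_string if code not in TASKS]
    if unknown:
        raise ValueError(f"Unknown task code(s): {''.join(unknown)}")
    return [TASKS[code] for code in task_string]


for description in describe_tasks("RC"):
    print(description)
```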
