Annotation of genomes and contigs

Project description

metaerg.py, version 2.2.X

Metaerg.py annotates genomes or sets of MAGs/bins from microbial ecosystems (bacteria, archaea, viruses). Input data consists of nucleotide FASTA files, one per genome or MAG, each with one or more contigs. Output files with annotations are written in common formats such as .gff, .gbk, .fasta and .html, and contain predicted genes, their functions and taxonomic classifications.
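
Before annotating, it can help to confirm that each input file really holds the contigs of one genome. The sketch below is not part of metaerg; it is a minimal, standard-library-only illustration that counts contigs and bases in a nucleotide FASTA file. The function name `summarize_fasta` and the demo file name are hypothetical.

```python
from pathlib import Path
import tempfile


def summarize_fasta(path):
    """Return (number_of_contigs, total_bases) for a nucleotide FASTA file."""
    contigs = 0
    bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                contigs += 1  # each header line starts a new contig
            else:
                bases += len(line)  # sequence lines add to the total length
    return contigs, bases


# Demo on a tiny, made-up FASTA file with two contigs.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "genome-1.fna"
    demo.write_text(">contig_1\nACGTACGT\nACGT\n>contig_2\nGGCC\n")
    print(summarize_fasta(demo))  # (2, 16)
```

A file that reports zero contigs is probably not a FASTA file and is worth checking before a long annotation run.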

You can interact with a sample visualization here and here. These visualizations show the annotation of a cyanobacterial genome, Candidatus Phormidium alkaliphilum. Unfortunately, the interactive search box does not work with the GitHub HTML visualization, so to try out the interactive part you need to download the HTML files to your computer (e.g. using "git clone ...").

Metaerg was originally developed in Perl. It was relatively challenging to install and came with complex database dependencies. This new Python version 2.2 overcomes some of those issues. The annotation pipeline has also evolved further and become more refined.

By using gtdbtk for taxonomic classification of genes and transferring functional annotations from the NCBI, metaerg.py uses a controlled vocabulary for taxonomy and a relatively clean vocabulary for functions. This makes annotations much more concise than the original version of metaerg and many other annotation tools. In addition, metaerg uses NCBI's conserved domain database and RPSBlast to assign genes to subsystems for effective data exploration. Subsystems are a work in progress, and can be expanded and customized as needed.

The Metaerg 2.2 pipeline ...

  • predicts CRISPR regions using Minced.
  • predicts tRNAs using Aragorn.
  • predicts RNA genes and other non-coding features using Infernal - cmscan and RFAM.
  • predicts retrotransposons with LTRHarvest.
  • predicts tandem repeats with Tandem Repeats Finder.
  • predicts other repeat regions with Repeatscout and Repeatmasker.
  • predicts coding genes with Prodigal.
  • annotates taxonomy and functions of RNA and protein genes using Diamond, NCBI blastn and a database of 62,296 bacterial, 3,406 archaeal, 11,569 viral and 139 eukaryotic genomes.
  • annotates gene functions using RPSBlast and NCBI's Conserved Domain Database (CDD).
  • annotates genes involved in production of secondary metabolites using Antismash.
  • annotates membrane and translocated proteins using TMHMM and SignalP.
  • assigns genes to a built-in set of functions using HMMER and HMM profiles from MetaScan, HydDB and CANT-HYD.
  • presents annotations in datatables/jQuery-based intuitive, searchable, colorful HTML that can be explored in a web browser and copy/pasted into Excel.
  • saves annotations in Apache Feather format for effective exploration, statistics and visualization with Jupyter or R.
  • enables the user to add custom HMMs and expand the set of functional genes as needed.
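
Once the pipeline has run, a quick way to see what it predicted is to tally feature types in the .gff output. The sketch below is not part of metaerg and assumes only the standard GFF3 layout (nine tab-separated columns, feature type in the third). The feature types and IDs in the demo text are made up, and the types metaerg actually writes may differ.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count feature types (column 3) in GFF3-formatted text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) < 9:
            continue  # not a feature line (e.g. embedded sequence data)
        counts[columns[2]] += 1
    return counts


# Made-up GFF3 snippet, for illustration only.
demo = (
    "##gff-version 3\n"
    "contig_1\tdemo\tCDS\t1\t300\t.\t+\t0\tID=gene_1\n"
    "contig_1\tdemo\tCDS\t400\t900\t.\t-\t0\tID=gene_2\n"
    "contig_1\tdemo\ttRNA\t950\t1020\t.\t+\t.\tID=gene_3\n"
)
print(dict(count_feature_types(demo)))  # {'CDS': 2, 'tRNA': 1}
```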

Usage:

metaerg --contig_file contig-file.fna --database_dir /path/to/metaerg-databases/

To annotate a set of genomes in a given dir (each file should contain the contigs of a single genome):

metaerg --contig_file dir-with-contig-files --database_dir /path/to/metaerg-databases/ --file_extension .fa
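
When metaerg is driven from a script or notebook, the two invocations above can be assembled programmatically. This is a minimal sketch, not an official API: it uses only the `--contig_file`, `--database_dir` and `--file_extension` arguments shown above, and the helper names `build_metaerg_command` and `run_metaerg` are hypothetical.

```python
import subprocess


def build_metaerg_command(contigs, database_dir, file_extension=None):
    """Assemble the metaerg command line as an argument list.

    `contigs` is either a single FASTA file or a directory of FASTA files.
    `file_extension` is only added when given (the directory use case).
    """
    command = [
        "metaerg",
        "--contig_file", str(contigs),
        "--database_dir", str(database_dir),
    ]
    if file_extension is not None:
        command += ["--file_extension", file_extension]
    return command


def run_metaerg(contigs, database_dir, file_extension=None):
    """Run metaerg and raise CalledProcessError if it exits with an error."""
    command = build_metaerg_command(contigs, database_dir, file_extension)
    return subprocess.run(command, check=True)


# Only print the command here; call run_metaerg(...) to actually execute it.
print(" ".join(build_metaerg_command(
    "dir-with-contig-files", "/path/to/metaerg-databases/", ".fa")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.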

Metaerg needs ~40 min to annotate a 4 Mb genome on a desktop computer. There are a few more optional arguments; for a complete list, run:

metaerg -h
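
For planning a batch of genomes, the ~40 min per 4 Mb figure can be turned into a rough estimate. The sketch below assumes that run time scales linearly with genome size and that genomes are processed one after another. Both are assumptions made here for illustration, not statements about metaerg, and actual run times will depend on hardware and genome content.

```python
# ~40 min for a 4 Mb genome, i.e. roughly 10 min per Mb (desktop computer).
MINUTES_PER_MB = 40 / 4


def estimate_minutes(genome_sizes_bp):
    """Rough total run time in minutes, assuming linear scaling with size."""
    return sum(size / 1_000_000 * MINUTES_PER_MB for size in genome_sizes_bp)


# Two hypothetical genomes of 4 Mb and 2.5 Mb.
print(estimate_minutes([4_000_000, 2_500_000]))  # 65.0
```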

Installation

To install metaerg, its 18 helper programs (diamond, prodigal, etc.) and databases, run the commands below. First, you need to manually download the signalp and tmhmm programs from here. Then:

python -m virtualenv metaerg-env
source metaerg-env/bin/activate
pip install --upgrade metaerg
metaerg --install_deps /path/to/bin_dir --database_dir /path/to/database_dir --path_to_signalp path/to/signalp.tar.gz \
  --path_to_tmhmm path/to/tmhmm.tar.gz
source /path/to/bin_dir/profile
metaerg --download_database --database_dir /path/to/metaerg-databases/
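
After sourcing the profile, a quick check that the helper programs are reachable on the PATH can save a failed run. The sketch below is not part of metaerg. Only diamond and prodigal are named in the text above, and the assumption that their executables carry exactly those names is labeled in the code; extend the list to match your installation.

```python
import shutil


def missing_programs(names):
    """Return the program names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names; adjust to what your bin_dir actually contains.
helpers = ["diamond", "prodigal"]
missing = missing_programs(helpers)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed helper programs were found.")
```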

The database was created from the following sources:

  • gtdbtk is used for its taxonomy
  • NCBI annotations of >40K representative archaeal and bacterial genomes present in GTDB are sourced directly from the NCBI FTP server.
  • NCBI (refseq) annotations of viral genes are obtained from viral refseq.
  • For eukaryotes, one genome is added to the database using ncbi-datasets for each taxon within Amoebozoa, Ancyromonadida, Apusozoa, Breviatea, CRuMs, Cryptophyceae, Discoba, Glaucocystophyceae, Haptista, Hemimastigophora, Malawimonadida, Metamonada, Rhodelphea, Rhodophyta, Sar, Aphelida, Choanoflagellata, Filasterea, Fungi, Ichthyosporea and Rotosphaerida.
  • RFAM and CDD databases are also used.
  • Specialized function databases: CANT-HYD and MetaScan.

If for some reason you need to build this database yourself (this is usually not needed, as the metaerg database can be downloaded as shown above):

metaerg --create_database --database_dir /path/to/metaerg-databases/ --gtdbtk_dir /path/to/gtdbtk-database/ [--tasks [PVEBRC]]

with tasks:

  • P - build prokaryotes
  • V - build viruses
  • E - build eukaryotes
  • B - build PVE blast databases
  • R - build RFAM
  • C - build CDD
  • S - build specialized functional databases
  • A - build antismash database
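
The task letters listed above can be validated before starting a long database build. The sketch below is a hypothetical helper, not metaerg's own argument parsing; it maps each letter to its description from the list and rejects letters that are not in it.

```python
# Task letters and descriptions, copied from the list above.
TASKS = {
    "P": "build prokaryotes",
    "V": "build viruses",
    "E": "build eukaryotes",
    "B": "build PVE blast databases",
    "R": "build RFAM",
    "C": "build CDD",
    "S": "build specialized functional databases",
    "A": "build antismash database",
}


def describe_tasks(task_string):
    """Return the description of each task letter, in the order given."""
    unknown = [letter for letter in task_string if letter not in TASKS]
    if unknown:
        raise ValueError("Unknown task letter(s): " + ", ".join(unknown))
    return [TASKS[letter] for letter in task_string]


for description in describe_tasks("PVEBRC"):
    print(description)
```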

Download files

Source Distribution

metaerg-2.2.35.tar.gz (64.1 kB)

Uploaded Source

Built Distribution

metaerg-2.2.35-py3-none-any.whl (74.8 kB)

Uploaded Python 3
