Skip to main content

Annotation of genomes and contigs

Project description

metaerg.py, version 2.5.9

Metaerg.py annotates genomes or sets of mags/bins from microbial ecosystems (bacteria, archaea, viruses). Input data consists of nucleotide fasta files, one per genome or mag, each with one or more contigs. Output files with annotations are in common formats such as .gff, .gbk, .fasta and .html with predicted genes, their functions and taxonomic classifications.

Metaerg 2.5 can annotate a group of related genomes and apply a suite of comparative genomics analyses. To do this, put contig fasta nucleotide files, one for each genome in a folder and run metaerg with the option "--mode comparative_genomics". Annotations become much stronger, richer and easier to interpret when you follow this approach. This analysis proceeds as follows:

  • Homologous proteins and RNA genes are clustered using mmseqs version 15-6f452 with --min-seq-id 0.5.
  • Orthologues and paralogues are called based on the distance to the center-protein.
  • Each cluster of homologous proteins is aligned with famsa version 2.2.2.
  • Median codon usage bias for each cluster of homologous proteins is calculated as a measure of expected expression level.
  • Median omega = ka/ks is calculated for each cluster to estimate whether the gene is under purifying or diversifying selection.
  • Based on the co-location of orthologous genes in different genomes, gene clusters are predicted.
  • These results are written in various formats (see below) and visualized for each gene in the interactive gene table of each genome.
  • A table with all the properties of each cluster of homologous genes is written (including representation for each genome).

By building its blast database off gtdbtk and transferring functional annotations from the NCBI, metaerg.py uses a controlled vocabulary for taxonomy and a relatively clean vocabulary for functions. This makes annotations much more concise than the original version of metaerg and many other annotation tools. In addition, metaerg uses NCBI's conserved domain database and RPSBlast to assign genes to subsystems for effective data exploration. Subsystems are a work in progress, and can be expanded and customized as needed.

The Metaerg 2.5 pipeline ...

  • predicts CRISPR regions using CRISPRDetect, version 2.4.
  • predicts tRNAs using Aragorn, version 1.2.41.
  • predicts RNA genes and other non-coding features using Infernal - cmscan and RFAM, version 1.1.4.
  • predicts retrotransposons with LTR Harvest - LTRHarvest, genometools version 1.6.2.
  • predicts tandem repeats with Tandem Repeats Finder, version 4.0.9.
  • predicts other repeat regions with Repeatscout, version 1.0.5, and Repeatmasker, version 4.1.5.
  • predicts coding genes with Prodigal, version 2.6.3.
  • annotates taxonomy and functions of RNA and protein genes using Diamond, version 2.0.15, NCBI blastn, version 2.14.0 and a database of >50,000 prokaryotes, based on gtdb version 214, 11,569 viral and 139 eukaryotic genomes.
  • annotates gene functions using RPSBlast, version 2.14.0 and NCBI's Conserved Domain Database (CDD).
  • annotates genes involved in production of secondary met abolites using Antismash, version 7.0.
  • annotates membrane and translocated proteins using PureseqTM and deepsig.
  • assigns genes to a built-in set of 1,001 functions using HMMER, version 3.3.2 and commmunity contributed HMM profiles (see below).
  • estimates doubling times of a genome's host based on codon usage bias
  • presents annotations in datatables/jQuery-based intuititve, searchable, colorful HTML that can be explored in a web browser and copy/pasted into excel.
  • saves annotations as a fasta-amino-acid file, a genbank file, and as a sqlite database for effective exploration, statistics and visualization with python or R.
  • saves an overview of all annotated genomes' properties and functions as an excel file.
  • enables the user to add custom HMMs and expand the set of functional genes as needed.

When using metaerg, please cite Xiaoli Dong and Marc Strous (2019) Frontiers in Genetics

Important changes in version 2.5

  • Comparative genomics mode was added. This depends on MMSeqs and Famsa. These programs are automatically installed by metaerg.

Important changes in version 2.4

  • Minced, TMHMM and SignalP are no longer used as helper programs.
  • CRISPRDetect, padloc, PureseqTM and deepsig are used instead. Installation is straightforward. Either use the --install_deps option, run the commands in "installation.py" for a manual install or use Docker.
  • These changes reduce overall runtime.
  • Various small bugs and inconveniences were fixed .
  • If you have data previously annotated with metaerg, use --update_annotations to update.

Usage:

>metaerg --contig_file contig-file.fna --database_dir /path/to/metaerg-databases/

To annotate a set of genomes in a given dir (each file should contain the contigs of a single genome):

>metaerg --contig_file dir-with-contig-files --database_dir /path/to/metaerg-databases/ \
--file_extension .fa

Metaerg needs ~40 min to annotate a 4 Mb genome on a desktop computer. The comparative genomics analysis requires about 20 min extra (in total) for a ls set of ~20 genomes.

You can use the following arguments when running metaerg:

--contig_file           A nucleotide fasta files with contigs to be annotated OR a dir containing
                        nucleotide fasta files.
--output_dir            The path where the results will be saved. Default: the dir that contains the
                        contig file(s)                      
--database_dir          The path to the metaerg database.
--file_extension        If a dir was provided to --contig_file, the file extension of the 
                        nucleotide fasta files. Default: .fna
--rename_contigs        Assemblers may create very long names for contigs, which is suboptimal for
                        presentation of results. This argument will make metaerg rename the 
                        contigs more concisely.
--rename_genomes        Binners may create very long names for bins/MAGs, which is suboptimal for
                        presentation of results. This argument will make metaerg rename the 
                        bins/MAGs more concisely. When using this option, contigs will also be
                        renamed.
--delimiter             Metaerg will create "locus tags" (unique IDs) for genes with according to
                        the following scheme: "[MAG file name].[CONTIG name].[Gene number]". 
                        By default, '.' is used as the delimiter separating the three parts of the
                        ID, as shown in the example. If you want to use a different character to
                        separate the parts, use this argument. If your custom character is 
                        detected in filenames or contig names, and you are not renaming contigs 
                        or MAGs, metaerg will terminate with an error message.
--prefix                When renaming genomes, by default genomes will be named "gXXXX", where 'g'
                        is the prefix and XXXX is a number. If you would like a different prefix, 
                        use this argument.
--min_contig_length     Only annotate contigs that are longer than the specified length. 
                        Default: 0.
--cpus                  Number of threads used for annotation. Default: threads available 
                        on the system / 2.
--mode                  Use "--mode contig" to annotate contigs individually instead of assuming they
                        are part of a genome, MAG or bin. When using this option, metaerg will not
                        run repeatscout and will run prodigal in metagenome mode. Use the option
                        --translation_table below to override using metagenome mode.
                        Use "--mode comparative_genomics" to annotate a clade of related genomes/MAGs.
                        Use "--mode genome" to annotate genomes/MAGs, one per file.
                        Default: "genome".
--force                 Overwrite previous results. By default, results of previous steps will be
                        kept. You need to specify which steps will be forced (see --skip for a list
                        of steps Use --force all to overwrite all previous results.
--update_annotations    Do not rerun any helper programs (keep previous results) but redo all the
                        data processing. Use this option for example after you updated metaerg.
--skip                  Use this argument to skip one or more annotation steps. Use the following
                        names for the steps, with names to be skipped separated by a comma (,):
                        
                        antismash           Annotate genes involved in production of secondary 
                                            metabolites and other aspects of the "interactome".
                        aragorn             Call transfer RNAs.
                        cdd                 Annotate gene functions according to NCBI's conserved
                                            domain database.
                        cmscan              Call RNA genes (such as rRNA genes).
                        deepsig             Detect protein signal peptides for translocation across
                                            the membrane.
                        diamond_and_blastn  Annotate gene functions and classify genes
                                            taxonomically based on homology to genes of other
                                            organisms. 
                        hmm                 Annotate gene functions according to metaergs built-in 
                                            scheme.
                        ltr_harvest         Call retrotransposons.
                        crispr_detect       Call CRSIPR repeats.
                        padloc              Annotate genes involved in cellular defense
                        prodigal            Call open reading frames (genes encoding proteins). If
                                            you skip this step, no proteins will be annotated.
                        pureseqtm           Annotate transmembrane helixes (membrane proteins and
                                            anchors).
                        repeat_masker       Call any repeat sequences.
                        trf                 Call tandem repeats.
                          

--translation_table     Translation table(s) to be used when calling open reading frames with 
                        prodigal. Default: 11,25. If the mean ORF length is less than 200 amino-
                        acids, metaerg will rerun prodigal with the next translation table in the
                        list.
--download_database     Use this argument to download and install the prebuilt metaerg database.
                        CAUTION: The metaerg databases are big, requiring approximately 165 Gb of 
                        disk space.
--create_database       Use this argument to build the metaerg database from scratch. The metaerg
                        database consists of several components. By default, this argument will
                        build all. If you wish to build specific database components, use one or
                        more of the following letters:
                        
                        P - build prokaryotes
                        V - build viruses
                        E - build eukaryotes
                        B - format all (PVE) blast databases
                        R - build RFAM
                        C - build CDD
                        S - build/update community contributed HMM databases
                        A - build antismash database
                        D - build padloc database

--checkm_dir            If you have previously used checkm or checkm2 to determine the quality of
                        the MAGs/bins, you can specify the dir with the checkm or checkm2 results 
                        here, and metaerg will integrate the estimated completeness and
                        contamination into its output.
--gtdbtk_dir            Use this argument with --create_database to point metaerg to the gtdbtk
                        database. It needs this to build its prokaryote blast database.
--install_deps          Use this argument to install all helper programs on your system. You need
                        to follow this argument with an installation dir, where you want to have
                        the programs installed.
--padloc_database       Use optionally with --install_deps if you want this database to be in a
                        non-default place/filesystem. Afterward, use --create_database D to
                        actually download and install the padloc database.
--antismash_database    Use optionally with --install_deps if you want this database to be in a
                        non-default place/filesystem. Afterward, use --create_database A to
                        actually download and install the antismash database.

Using the Docker Image

Metaerg depends on many helper programs and may require some time and troubleshooting to install. To avoid these issues, use the docker image. Alternatively, use singularity or apptainer to run the docker image on a HPC, as explained by jkzorz and [lianchun yi]:

>singularity pull docker://kinestetika/metaerg
>singularity build --sandbox /path/where/top/create/metaerg_latest.sif docker://kinestetika/metaerg:latest
>singularity exec --bind /path/to/metaerg_database:/databases --bind /path/to/fasta/files:/data --writable /path/to/the/sandbox/dir
metaerg --database_dir /databases --contig_file /data --file_extension .fna

[!NOTE] The --bind instruction binds the dir "/path/to/metaerg_database" (which is on your real filesystem) to the dir "/databases" (which is inside your container). Same for the contig dir.

Or, using apptainer:

>apptainer build metaerg.sif docker://kinestetika/metaerg:latest

Installation

To install metaerg, its 19 helper programs (diamond, prodigal, etc.) and databases run the following commands:

>python -m virtualenv metaerg-env
>source metaerg-env/bin/activate
>pip install --upgrade metaerg
>metaerg --install_deps /path/to/bin_dir
>source /path/to/bin_dir/profile
>metaerg --download_database --database_dir /path/to/metaerg-databases/

If you install metaerg's helper programs this way, before running metaerg you need to run the following, to prepend the helper programs to your path:

>source /path/to/bin_dir/profile

Database

The database was created from the following sources:

  • gtdbtk is used for its taxonomy
  • NCBI annotations of >40K representative archael and bacterial genomes present in gtdb are sourced directly from the ncbi ftp server.
  • NCBI (refseq) annotations of viral genes are obtained from viral refseq.
  • For Eukaryotes, for each taxon within Amoebozoa, Ancyromonadida, Apusozoa, Breviatea, CRuMs, Cryptophyceae, Discoba, Glaucocystophyceae, Haptista, Hemimastigophora, Malawimonadida, Metamonada, Rhodelphea, Rhodophyta, Sar, Aphelida, Choanoflagellata, Filasterea, Fungi, Ichthyosporea, Rotosphaeridagenomes, one genome is added to the database using ncbi-datasets.
  • RFAM and CDD databases are also used.
  • Specialized function databases - Cant-Hyd and MetaScan.

Community contributed HMM profiles are sourced from:

If you for some reason need to build the database yourself (this is usually not needed as the metaerg database can be downloaded as shown above):

>metaerg --create_database --database_dir /path/to/metaerg/db --gtdbtk_dir /path/to/gtdbtkdir

Accessing the .feather and .sqlite files

Apache Feather format is a binary file format for tables. Sqlite is a database format. You can for example load these data as a pandas dataframe. In R, use the arrow package. Each table/database row contains a single gene or feature, defined by the following columns:

id                  the feature's unique identifier
genome              the identifier of the genome the feature belongs to
contig              the identifier of the contig the feature belongs to
start               the start position of the feature (inclusive)
end                 the start position of the feature (exclusive)
strand              the strand (0 or 1 for + or - respectively)
type                the type of feature (for example CDS, rRNA, tRNA, ncRNA, retrotransposon)
inference           the program used to infer the feature (for example prodigal for CDS)
parent              the feature's parent (for example a CRISPR repeat has a repeat_region as a parent)
subsystems          the subsystems (functional genes) the feauture is part of 
                    (for example "[ATP synthase|ATP synthase, subunit F0 B]")  
descr               a succint description of the annotated function
taxon               the taxon of the top blast hit
notes               any other info (rarely used)
seq                 the sequence of the feature (AA for CDS, otherwise NT)
antismash           the function assigned by antismash, if any
signal_peptide      the type of signal peptide found, if any.
tmh                 the number of transmembrane helixes found
tmh_topology        the locations of predicted transmembrane helixes in the protein sequence
blast               the top ten blast hits
cdd                 the top ten cdd hits
hmm                 the top ten hits to the functional gene hmm database 

You can for example use python and pandas to inspect annotations:

from pathlib import Path
import pandas as pd

data_dir = Path('/path/to/my/data')
feather_file = data_dir / 'my-genome.annotations.feather'
contig_file =  data_dir / 'my-genome.fna'

contig_dict = load_contigs('my-genome', contig_file, delimiter='xxxx')
feature_data = pd.read_feather(feather_file)
feature_data.set_index('id', inplace=True)

for f in feature_data.itertuples():
    for k, v in f._asdict().items():
        print(f'{k:20}:{v}')
    break  # comment out to iterate through all the genes...

Using the .sqlite database is even easier:

from pathlib import Path
from metaerg.datatypes import sqlite

data_dir = Path('/path/to/my/data')
sqlite_file = data_dir / 'my-genome.annotations.sqlite'

db_connection = sqlite.connect_to_db(sqlite_file)
for feature in sqlite.read_all_features(db_connection): 
    print(feature)
    break  # comment out to iterate through all the genes...

How Metaerg's functional gene calling works

Metaerg contains a data file that defines metabolic modules, each consisting of multiple genes. These genes are identified using NCBI's Conserved Domain Database (cdd) and a custom set of Hidden Markov Models contributed by the community/science literature. Here's what an example module looks like in the data file:

>Respiration, Hydrogenases Ni-Fe
COG3259|TIGR03295|NF033181|COG4042 NiFe hydrogenase catch-all
WP_015407385 NiFe hydrogenase group Ia
WP_012675774 NiFe hydrogenase group Ib
WP_011688559 NiFe hydrogenase group Ic
WP_023457248 NiFe hydrogenase group Id
WP_008030254 NiFe hydrogenase group Ie
WP_015898101 NiFe hydrogenase group If
WP_011822511 NiFe hydrogenase group Ig
WP_014267363 NiFe hydrogenase group Ih
WP_022740096 NiFe hydrogenase group Ii
WP_010878877 NiFe hydrogenase group Ij
WP_012036281 NiFe hydrogenase group Ik
WP_011603008 NiFe hydrogenase group IIa
WP_011714112 NiFe hydrogenase group IIb
WP_011383527.1 NiFe hydrogenase group IIc
WP_010880487 NiFe hydrogenase group IId
WP_012020882.1 NiFe hydrogenase group IIe
WP_022846160 NiFe hydrogenase group IIIa
WP_011719855 NiFe hydrogenase group IIIb
WP_013637988 NiFe hydrogenase group IIIc
WP_011686721 NiFe hydrogenase group IIId
WP_022739022 NiFe hydrogenase group IVa
WP_011388069 NiFe hydrogenase group IVb
WP_026783288 NiFe hydrogenase group IVc
WP_014122750 NiFe hydrogenase group IVd
WP_022739883 NiFe hydrogenase group IVe
WP_018133527 NiFe hydrogenase group IVf
WP_013129492 NiFe hydrogenase group IVg
WP_011018833 NiFe hydrogenase group IVh
WP_004029221 NiFe hydrogenase group IVi

A metabolic module starts with a greater-than (>) sign followed by the name of the module. Each line after lists one gene. The gene is defined by a one or more cdd or custom-hmm identifiers separated by or (|). After the identifiers follows a space and finally the name of the gene (which can contain spaces). The example shows the Ni-Fe hydrogenase module. This module contains "catch-all" hydrogenase domains from the cdd database (as the first gene on the first line) as well as custom hmm's targeting specific subtypes of Ni-Fe hydrogenase. These hmm's were created by the Greening Lab.

For cdd domains, metaerg will identify a protein as the gene when one of the cdd domains is among the top 5 cdd hits. For custom hmm's, metaerg will identify a protein as the gene if the hmm profile is the top hit and al least 70% of the profile hmm aligns to the gene. If an hmm profile defines a trusted cutoff score, metaerg will use that score as the cutoff.

The advantage of using cdd compared to custom hmm databases is that cdd contains profiles for most known genes. So if, for example, a protein resembles a hydrogenase, such as subunit nuoD of the respiratory complex I, cdd has no difficulty identifying nuoD as nuoD. However, the custom hmm database does not contain a nuoD profile, so nuoD genes, present in many bacterial genomes, will be erroneously identified as a NiFe hydrogenases if you only use custom profiles. This problem cannot be overcome by using a fixed e-value or score cutoff, as each hmm profile produces a unique range of evalues and scores. In theory, this problem could be overcome by determining a trusted cutoff score for each profile. However, that is a lot of work and not commonly done when scientists create hmm databases.

That being said, cdd is not good enough by itself, because it misses out on many important microbiome functions and
genes. Some genes are completely absent, and some are poorly defined. Expert input is needed for correct identification of these genes. That is why we need custom hmms.

Metaerg's solution to this problem is to combine cdd with custom hmms. In the example, if a gene is not captured by the catch-all cdd hydrogenase modules, it is likely not a NiFe hydrogenase. If it is, the custom hmm's will show the subtype of the NiFe hydrogenase.

How to add your own custom functional gene database and HMMs

  1. Locate your metaerg database dir. Inside the metaerg database dir, is a dir "hmm".
  2. Within "hmm", create a new dir named "user_config", if it does not already exist.
  3. Inside "user_config", you can create one or more files with the extension ".config.txt".
  4. These files should be formatted like in the example above.
  5. Each module should start with >.
  6. Each gene starts with one or more cdd or custom hmm names, followed by a space and the gene name/description.
  7. To query the cdd database for relevant cdd names, you use grep with the file "/path/to/metaerg/database/cdd/cddid.tbl."
  8. Within "hmm", create a new dir named "user_hmm", if it does not already exist.
  9. Add your custom hmm files to this dir.
  10. Update the metaerg database using the following command:
>metaerg --create_database S --database_dir /path/to/metaerg/db 
  1. Only those hmm profiles listed in your config file will actually be built into to the metaerg database

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metaerg-2.5.9.tar.gz (117.6 kB view details)

Uploaded Source

Built Distribution

metaerg-2.5.9-py3-none-any.whl (127.4 kB view details)

Uploaded Python 3

File details

Details for the file metaerg-2.5.9.tar.gz.

File metadata

  • Download URL: metaerg-2.5.9.tar.gz
  • Upload date:
  • Size: 117.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.3

File hashes

Hashes for metaerg-2.5.9.tar.gz
Algorithm Hash digest
SHA256 d3d6be1e5126ffbf5b6e3462e7a06295c952027c85f8a475ca2be6d99c01e974
MD5 3ef414f5e9814a705bd9c4019eeadbf7
BLAKE2b-256 dac096f1cce06193ca6094592133cb7f96e7955424f9843d7484c336e6d01682

See more details on using hashes here.

File details

Details for the file metaerg-2.5.9-py3-none-any.whl.

File metadata

  • Download URL: metaerg-2.5.9-py3-none-any.whl
  • Upload date:
  • Size: 127.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.3

File hashes

Hashes for metaerg-2.5.9-py3-none-any.whl
Algorithm Hash digest
SHA256 a2da07923089eede54d123e708d315f3b91ecc216847bd1570afa84c46f58f55
MD5 b3880e6147a3d771e3c943030db78d71
BLAKE2b-256 6b832d39b14fc57007c90cab467d3275fb34eed5aeacc84216f9c25d10e74e36

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page