Skip to main content

Build .gbk files starting from eggnog annotation files and genomes (fasta)

Project description

emapper2gbk: creation of genbank files from Eggnog-mapper annotation outputs

Starting from fasta and Eggnog-mapper annotation files, build a gbk file that is suitable for metabolic network reconstruction with Pathway Tools. Adds the GO terms and EC numbers annotations in the genbank file. The program can be run with or without a .gff file. There are two main modes:

  • genomic mode: usually when focusing on a single organism, with a .gff file (but can be without). The creation of genbanks can be performed in parallel by providing directories (with matching names for genomes, proteomes and annotation files) as inputs

  • metagenomic mode: suitable when a list of isolated genes/proteins have been annotated with Eggnog-mapper, typically the gene catalogue of a metagenome. In that case, there is one annotation file and multiple genomes/proteomes that will contain a subset of the genes present in the annotated gene catalogue. The purpose is to create a genbank file with the genes and annotations that matches each genome (Metagenomic species core genome, Metagenomic-assembled genome…) content. There will not be any .gff associated to the genomes, and the creation of genbanks can be performed in parallel by providing directories (with matching names for genome and proteome files) as inputs. In that case, you can give as organism name “metagenome” or “bacteria”.

Main inputs

For each annotated genome, inputs consist of (but are not limited to):

  • a nucleotide fasta file: containing the CDS sequence of each genes

  • the translated sequences in fasta

  • the annotation file obtained after Eggnog-mapper annotation (usually xxx.emapper.annotation)

  • the name of the considered organism (can be “bacteria” or “metagenome”) or a file with organisms names (matching the genomes names).

In addition, as optional files:

  • a gff file: containing the position of each gene on the genome

  • the number of available cores for multiprocessing (when working on multiple genomes)

  • a go-basic file of GO ontology (if not given, emapper2gbk will download a copy and use it)

Dependencies and installation

Dependencies

All are described in requirements.txt and can be installed with pip install -r requirements.txt.

  • biopython

  • gffutils

  • numpy

  • pandas

  • pronto

  • requests

Install

  • From this cloned repository

pip install -r requirements.txt
pip install .
  • From Pypi

pip install emapper2gbk

Usage

Convert GFF, fastas, annotation table and species name into Genbank. usage: emapper2gbk [-h] [-v] {genomic,metagenomic} ...

Two modes: genomic (one genome/proteome/gff/annot file –> one gbk) or metagenomic with the annotation of the full gene catalogue and fasta files (proteome/genomes) corresponding to list of genes.

Genomic mode can be used with or without gff files making it suitable to build a gbk from a list of genes and their annotation.

You can give the GO ontology as an input to the program, it will be otherwise downloaded during the run. You can download it here: http://purl.obolibrary.org/obo/go/go-basic.obo . The program requests the NCBI database to retrieve taxonomic information of the organism. However, if the organism is “bacteria” or “metagenome”, the taxonomic information will not have to be retrieved online. Hence, if you need to run the program from a cluster with no internet access, it is possible for a “bacteria” or “metagenome” organism, and by providing the GO-basic.obo file (that you can download before using emapper2gbk).

For specific help on each subcommand use: emapper2gbk {cmd} --help

optional arguments:
-h, --help

show this help message and exit

-v, --version

show program’s version number and exit

subcommands:

valid subcommands:

{genomic,metagenomic}

genomic genomic mode : 1 annot, 1 faa, 1 fna, [1 gff] –> 1 gbk

metagenomic metagenomic mode : 1 annot, n faa, n fna –> n gb

  • Genomic mode

    • Usage

      usage: emapper2gbk genomic [-h] -fg FASTAGENOME -fp FASTAPROT [-g GFF] -o
                                  OUPUT_DIR -a ANNOTATION [-c CPU] [-n NAME]
                                  [-nf NAMEFILE] [-go GOBASIC] [-q]
      
      Build a gbk file for each genome/set of genes with an annotation file for each
      
      optional arguments:
        -h, --help            show this help message and exit
        -fg FASTAGENOME, --fastagenome FASTAGENOME
                              fna file or directory
        -fp FASTAPROT, --fastaprot FASTAPROT
                              faa file or directory
        -g GFF, --gff GFF     gff file or directory
        -o OUPUT_DIR, --out OUPUT_DIR
                              output directory/file path
        -a ANNOTATION, --annotation ANNOTATION
                              eggnog annotation file or directory
        -c CPU, --cpu CPU     cpu number for metagenomic mode or genome mode using
                              input directories
        -n NAME, --name NAME  organism/genome name in quotes
        -nf NAMEFILE, --namefile NAMEFILE
                              organism/genome name (col 2) associated to genome file
                              basenames (col 1). Default = 'metagenome' for
                              metagenomic and 'cellular organisms' for genomic
        -go GOBASIC, --gobasic GOBASIC
                              go ontology, GOBASIC is either the name of an existing
                              file containing the GO Ontology or the name of the
                              file that will be created by emapper2gbk containing
                              the GO Ontology
        -q, --quiet           quiet mode, only warning, errors logged into console
    • Examples

      • Genomic - single mode

      emapper2gbk genomic -fg genome.fna -fp proteome.faa [-gff genome.gff] -n "Escherichia coli" -o coli.gbk -a eggnog_annotation.tsv [-go go-basic.obo]
      • Genomic - multiple mode, “bacteria” as default name

      emapper2gbk genomic -fg genome_dir/ -fp proteome_dir/ [-gff gff_dir/] -n bacteria -o gbk_dir/ -a eggnog_annotation_dir/ [-go go-basic.obo]
      • Genomic - multiple mode, tsv file for organism names

      emapper2gbk genomic -fg genome_dir/ -fp proteome_dir/ [-gff gff_dir/] -nf matching_genome_orgnames.tsv -o gbk_dir/ -a eggnog_annotation_dir/ [-go go-basic.obo]
  • Metagenomic mode

    • Usage

      usage: emapper2gbk metagenomic [-h] -fg FASTAGENOME -fp FASTAPROT [-g GFF] -o
                                      OUPUT_DIR [-nf NAMEFILE] [-n NAME] -a
                                      ANNOTATION [-c CPU] [-go GOBASIC] [-q]
      
      Use the annotation of a complete gene catalogue and build gbk files for each
      set of genes (fna) and proteins (faa) from input directories
      
      optional arguments:
        -h, --help            show this help message and exit
        -fg FASTAGENOME, --fastagenome FASTAGENOME
                              fna file or directory
        -fp FASTAPROT, --fastaprot FASTAPROT
                              faa file or directory
        -g GFF, --gff GFF     gff file or directory
        -o OUPUT_DIR, --out OUPUT_DIR
                              output directory/file path
        -nf NAMEFILE, --namefile NAMEFILE
                              organism/genome name (col 2) associated to genome file
                              basenames (col 1). Default = 'metagenome' for
                              metagenomic and 'cellular organisms' for genomic
        -n NAME, --name NAME  organism/genome name in quotes
        -a ANNOTATION, --annotation ANNOTATION
                              eggnog annotation file or directory
        -c CPU, --cpu CPU     cpu number for metagenomic mode or genome mode using
                              input directories
        -go GOBASIC, --gobasic GOBASIC
                              go ontology, GOBASIC is either the name of an existing
                              file containing the GO Ontology or the name of the
                              file that will be created by emapper2gbk containing
                              the GO Ontology
        -q, --quiet           quiet mode, only warning, errors logged into console
    • Example

      emapper2gbk metagenomic -fg genome_dir/ -fp proteome_dir/ -o gbk_dir/ -a gene_cat_ggnog_annotation.tsv [-go go-basic.obo]

History

0.0.6 (2020-06-24)

  • First release on PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

emapper2gbk-0.0.6.tar.gz (4.3 MB view hashes)

Uploaded Source

Built Distribution

emapper2gbk-0.0.6-py2.py3-none-any.whl (19.7 kB view hashes)

Uploaded Python 2 Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page