Summary and visualization tool for MAG gene annotation workflows

annoreport

annoreport is a command-line tool for summarizing and visualizing gene annotation results from metagenome-assembled genome (MAG) workflows. It reads output from Prokka and Bakta, optionally enriches the results via the UniProt REST API, and writes an interactive HTML report alongside a TSV summary table.


Features

  • Auto-detects Prokka or Bakta output from directory contents
  • Counts and ranks the top N most common annotated gene products across all bins
  • Separates hypothetical proteins from annotated CDS and reports them independently
  • UniProt enrichment — looks up gene names and one-line function descriptions for each top gene product
  • Functional clustering — groups genes into biological categories (DNA Metabolism, Translation, Energy & Metabolism, Stress & Chaperones, etc.) based on UniProt keywords
  • Interactive HTML report with:
    • Summary stat cards (bins, contigs, assembly size, CDS counts, annotation rate, RNA features)
    • Feature type and RNA feature tables
    • Functional cluster cards with per-gene CDS counts
    • Functional category distribution bar chart
    • Searchable top-N gene product table
  • TSV output for downstream analysis in R, Python, or Excel
  • --no_uniprot flag for offline/fast runs — skips UniProt lookup and omits clustering sections
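
The counting and ranking described above can be sketched in a few lines of Python. This is a minimal illustration, not the code annoreport ships: the function name `rank_products` is hypothetical, and the sketch assumes the product strings have already been read from the annotation files and that unannotated CDS carry the product label "hypothetical protein".

```python
from collections import Counter

# Assumed product label for CDS without a known function.
HYPOTHETICAL = "hypothetical protein"


def rank_products(cds_products, top_n=100):
    """Return (top_n ranked (product, count) pairs, number of hypothetical CDS)."""
    counts = Counter()
    n_hypothetical = 0
    for product in cds_products:
        name = product.strip()
        if name.lower() == HYPOTHETICAL:
            # Hypothetical proteins are tallied separately, not ranked.
            n_hypothetical += 1
        else:
            counts[name] += 1
    return counts.most_common(top_n), n_hypothetical


# Usage: pass the product column pooled from every bin.
# ranked, n_hyp = rank_products(all_products, top_n=50)
```

Keeping hypothetical proteins out of the counter stops them from crowding the top of the ranking, where they would otherwise dominate most MAG annotations.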

Requirements

  • Python 3.9+
  • No external dependencies — uses Python standard library only
  • Internet access required for UniProt enrichment (unless --no_uniprot is used)

Installation

Bioconda (recommended)

conda install -c bioconda annoreport

From source

git clone https://github.com/keplerridge/annoreport.git
cd annoreport

Or copy annotation_report.py directly into your project's scripts/ directory.


Usage

Basic (auto-detect tool)

python3 annotation_report.py \
    --annotation_dir results/prokka \
    --outdir results/annotation_summary

Bakta output

python3 annotation_report.py \
    --annotation_dir results/bakta \
    --outdir results/annotation_summary

Force tool type

python3 annotation_report.py \
    --annotation_dir results/bakta \
    --outdir results/annotation_summary \
    --tool bakta

Skip UniProt lookup (offline / fast mode)

python3 annotation_report.py \
    --annotation_dir results/prokka \
    --outdir results/annotation_summary \
    --no_uniprot

Change number of top genes reported

python3 annotation_report.py \
    --annotation_dir results/bakta \
    --outdir results/annotation_summary \
    --top_n 50

Arguments

| Argument | Default | Description |
| --- | --- | --- |
| --annotation_dir | (required) | Path to Prokka or Bakta output directory |
| --outdir | annotation_summary | Directory for output files |
| --top_n | 100 | Number of top gene products to report |
| --tool | auto-detect | Force tool type: prokka or bakta |
| --no_uniprot | off | Skip UniProt lookup; omits clustering and gene/function columns |

Output

Two files are written to --outdir:

annotation_gene_summary.html

An interactive HTML report containing:

  • Summary cards — bins/MAGs, contigs, assembly size, total CDS, annotation rate, RNA features
  • Hypothetical protein callout — count and percentage of CDS with no known function
  • Feature type summary — counts of CDS, tRNA, rRNA, tmRNA, and other features
  • RNA features table — breakdown of non-coding RNA annotations
  • Functional clusters (UniProt mode only) — top genes grouped by biological function with CDS counts per gene
  • Functional category distribution (UniProt mode only) — bar chart of CDS counts per category
  • Top N gene products table — searchable table with gene product name, gene name, UniProt function description, CDS count, and percentage

annotation_gene_summary.tsv

Tab-separated summary with columns:

rank  count  percent_of_total_cds  product  gene_name  function  keywords

Plus a feature type summary appended at the bottom.
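
For downstream analysis in Python, the ranked rows can be separated from the appended feature type summary while reading. The sketch below is a hedged example: the function name `read_gene_rows` is hypothetical, and it assumes the ranked block starts with a header line whose first field is `rank` and ends at the first non-blank line that does not begin with an integer. The sample text is made up for illustration only.

```python
import csv
import io


def read_gene_rows(handle):
    """Return ranked gene-product rows as dicts keyed by the TSV header."""
    header = None
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue  # blank line
        if header is None:
            if fields[0] == "rank":
                header = fields  # column names come from the file itself
            continue
        if not fields[0].isdigit():
            break  # assumed start of the appended feature type summary
        rows.append(dict(zip(header, fields)))
    return rows


# Illustrative, made-up content in the documented column order.
sample = (
    "rank\tcount\tpercent_of_total_cds\tproduct\tgene_name\tfunction\tkeywords\n"
    "1\t42\t1.5\tElongation factor Tu\ttuf\tExample description\tElongation factor\n"
    "\n"
    "feature_type\tcount\n"
)
rows = read_gene_rows(io.StringIO(sample))
print(rows[0]["product"], rows[0]["count"])
```

Taking the column names from the header line, rather than hard-coding them, keeps the reader usable if a run with --no_uniprot writes fewer columns.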


Supported Annotation Tools

| Tool | File Types Used | Notes |
| --- | --- | --- |
| Prokka | .tsv, .gff | Reads EC numbers and COG categories if present |
| Bakta | .tsv, .gff3, .json | Reads database cross-references; auto-skips hypotheticals.tsv and inference.tsv |

Auto-detection treats a directory that contains .gff3 or .json files as Bakta output, and a directory that contains only .gff or .tsv files as Prokka output.
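
The detection rule can be expressed as a short extension check. This is a sketch under stated assumptions rather than the tool's actual implementation: `detect_tool` is a hypothetical name, and the recursive search through subdirectories is an assumption about how per-bin output is laid out.

```python
from pathlib import Path


def detect_tool(annotation_dir):
    """Guess the annotation tool from the file extensions in a directory."""
    # Assumption: per-bin output may sit in subdirectories, so search recursively.
    suffixes = {
        path.suffix.lower()
        for path in Path(annotation_dir).rglob("*")
        if path.is_file()
    }
    if suffixes & {".gff3", ".json"}:
        return "bakta"  # Bakta-specific file types are present
    if suffixes & {".gff", ".tsv"}:
        return "prokka"  # only the file types Prokka writes
    raise ValueError(f"no Prokka or Bakta files found in {annotation_dir}")
```

Checking the Bakta extensions first matters because Bakta also writes .tsv files, so a .tsv on its own cannot distinguish the two tools. Passing --tool avoids the guess entirely.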


Functional Categories

When UniProt lookup is enabled, genes are assigned to one of the following categories based on UniProt keyword matching (first match wins):

| Category | Example keywords |
| --- | --- |
| DNA Metabolism | DNA replication, DNA repair, DNA-binding |
| Transcription | Transcription, RNA-binding, Sigma factor |
| Translation & Ribosomes | Protein biosynthesis, Ribosomal protein, Elongation factor |
| Energy & Metabolism | ATP synthesis, Oxidoreductase, TCA cycle, Glycolysis |
| Transport & Membrane | Transport, Membrane, ABC transporter, Porin |
| Stress & Chaperones | Chaperone, Heat shock, Oxidative stress, Protease |
| Cell Division & Structure | Cell division, Peptidoglycan, Cell wall |
| Signaling & Regulation | Kinase, Two-component regulatory system, Signal transduction |
| Nucleotide Binding | ATP-binding, GTP-binding, Isomerase, Hydrolase |
| Other / Unclassified | No matching keywords found |

Each gene is assigned to exactly one category. Priority follows the order above.
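
A first-match-wins assignment over an ordered category list might look like the sketch below. The trigger lists contain only the example keywords from the table, so they are incomplete; the name `assign_category` and the case-insensitive substring matching are assumptions, not a description of annoreport's internals.

```python
# Ordered by priority: earlier categories win when several match.
CATEGORIES = [
    ("DNA Metabolism", ["DNA replication", "DNA repair", "DNA-binding"]),
    ("Transcription", ["Transcription", "RNA-binding", "Sigma factor"]),
    ("Translation & Ribosomes", ["Protein biosynthesis", "Ribosomal protein", "Elongation factor"]),
    ("Energy & Metabolism", ["ATP synthesis", "Oxidoreductase", "TCA cycle", "Glycolysis"]),
    ("Transport & Membrane", ["Transport", "Membrane", "ABC transporter", "Porin"]),
    ("Stress & Chaperones", ["Chaperone", "Heat shock", "Oxidative stress", "Protease"]),
    ("Cell Division & Structure", ["Cell division", "Peptidoglycan", "Cell wall"]),
    ("Signaling & Regulation", ["Kinase", "Two-component regulatory system", "Signal transduction"]),
    ("Nucleotide Binding", ["ATP-binding", "GTP-binding", "Isomerase", "Hydrolase"]),
]
FALLBACK = "Other / Unclassified"


def assign_category(uniprot_keywords):
    """Return the first category whose trigger occurs in any UniProt keyword."""
    lowered = [keyword.lower() for keyword in uniprot_keywords]
    for category, triggers in CATEGORIES:
        # Assumption: a trigger matches if it is a substring of a keyword.
        if any(trigger.lower() in keyword for trigger in triggers for keyword in lowered):
            return category
    return FALLBACK


# A gene tagged with both ATP-binding and DNA repair lands in the earlier category.
print(assign_category(["ATP-binding", "DNA repair"]))
```

Because the list is scanned in priority order, broad keywords such as ATP-binding only decide the category when nothing more specific matched first.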


Example Workflow

This tool is designed to run after a MAG annotation step in a Snakemake workflow:

rule annotation_report:
    input:
        annotation_dir = 'results/bakta'
    output:
        html = 'results/annotation_summary/annotation_gene_summary.html',
        tsv  = 'results/annotation_summary/annotation_gene_summary.tsv'
    params:
        outdir = 'results/annotation_summary'
    conda:
        'envs/annotation_report.yaml'
    threads: 1
    resources:
        mem_mb=4000,
        runtime=30
    shell:
        """
        python3 scripts/annotation_report.py \
            --annotation_dir {input.annotation_dir} \
            --outdir {params.outdir} \
            --top_n 100
        """

Notes

  • UniProt queries are restricted to reviewed (Swiss-Prot) entries, which favors high-quality annotations
  • A 0.2-second delay is applied between UniProt API calls to respect rate limits
  • For 100 gene products, the UniProt lookup phase takes approximately 30–40 seconds, of which about 20 seconds is the fixed delay
  • Use the --no_uniprot flag for quick runs or in environments without internet access
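
A throttled lookup against reviewed entries can be sketched with the standard library alone. The exact query annoreport sends is not documented here, so the query string, the requested fields, and the names `build_url`, `fetch_json`, and `lookup_all` are assumptions made for illustration; only the UniProt search endpoint, the reviewed filter, and the 0.2 second delay come from the notes above.

```python
import json
import time
import urllib.parse
import urllib.request

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_url(product):
    """Build a search URL limited to reviewed (Swiss-Prot) entries."""
    params = {
        "query": f'protein_name:"{product}" AND reviewed:true',
        "fields": "gene_names,cc_function,keyword",  # assumed field selection
        "format": "json",
        "size": 1,  # only the best hit is needed
    }
    return UNIPROT_SEARCH + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download and decode one JSON response."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def lookup_all(products, fetch=fetch_json, delay=0.2, sleep=time.sleep):
    """Look up each product in turn, pausing between consecutive calls."""
    results = {}
    for index, product in enumerate(products):
        if index:
            sleep(delay)  # rate limiting between requests
        try:
            hits = fetch(build_url(product)).get("results", [])
        except OSError:
            hits = []  # network failure: leave this product unenriched
        results[product] = hits[0] if hits else None
    return results


# Usage (requires internet access):
# hits = lookup_all(["Elongation factor Tu", "DNA gyrase subunit A"])
```

Passing `fetch` and `sleep` as parameters lets the loop be exercised offline with stand-ins, which is useful in the same restricted environments where --no_uniprot would otherwise be needed.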

License

MIT License — see LICENSE for details.


Citation

If you use this tool in your research, please cite it as:

annoreport: a summary and visualization tool for MAG annotation workflows. https://github.com/keplerridge/annoreport
