Summary and visualization tool for MAG gene annotation workflows

annoreport

A command-line tool for summarizing and visualizing gene annotation results from metagenome-assembled genome (MAG) workflows. Supports output from Prokka and Bakta, enriches results via the UniProt REST API, and produces a polished interactive HTML report alongside a TSV summary table.

Features

Auto-detects Prokka or Bakta output from directory contents
Counts and ranks the top N most common annotated gene products across all bins
Separates hypothetical proteins from annotated CDS and reports them independently
UniProt enrichment — looks up gene names and one-line function descriptions for each top gene product
Functional clustering — groups genes into biological categories (DNA Metabolism, Translation, Energy & Metabolism, Stress & Chaperones, etc.) based on UniProt keywords
Interactive HTML report with:
- Summary stat cards (bins, contigs, assembly size, CDS counts, annotation rate, RNA features)
- Feature type and RNA feature tables
- Functional cluster cards with per-gene CDS counts
- Functional category distribution bar chart
- Searchable top-N gene product table
TSV output for downstream analysis in R, Python, or Excel
--no_uniprot flag for offline/fast runs — skips UniProt lookup and omits clustering sections
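The counting and ranking steps in the list above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `rank_products`, the exact match on the string "hypothetical protein", and the sample product names are all assumptions made for the example.

```python
from collections import Counter


def rank_products(products, top_n=100):
    """Rank annotated CDS products and count hypothetical proteins separately.

    `products` is an iterable of product names, one per CDS, pooled across bins.
    Returns (ranked, hypothetical_count), where each ranked entry is
    (rank, product, count, percent_of_total_cds).
    """
    annotated = Counter()
    hypothetical = 0
    for name in products:
        cleaned = name.strip()
        # Assumption: hypothetical proteins carry exactly this product label.
        if cleaned.lower() == "hypothetical protein":
            hypothetical += 1
        else:
            annotated[cleaned] += 1

    total_cds = hypothetical + sum(annotated.values())
    ranked = [
        (rank, product, count, round(100 * count / total_cds, 2))
        for rank, (product, count) in enumerate(annotated.most_common(top_n), start=1)
    ]
    return ranked, hypothetical


# Illustrative input only; real names come from the Prokka or Bakta tables.
cds_products = [
    "DNA gyrase subunit A", "hypothetical protein", "Elongation factor Tu",
    "DNA gyrase subunit A", "hypothetical protein", "hypothetical protein",
]
ranked, n_hypothetical = rank_products(cds_products, top_n=2)
for row in ranked:
    print(row)
print("hypothetical:", n_hypothetical)
```

Percentages here are computed against all CDS, hypothetical ones included, which matches the `percent_of_total_cds` column name in the TSV output; whether the tool uses the same denominator is not stated in this README.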

Requirements

Python 3.9+
No external dependencies — uses Python standard library only
Internet access required for UniProt enrichment (unless --no_uniprot is used)

Installation

Bioconda (recommended)

conda install -c bioconda annoreport

PyPI

pip install annoreport

From source

git clone https://github.com/keplerridge/annoreport.git
cd annoreport

Or copy annotation_report.py directly into your project's scripts/ directory.

Usage

Basic (auto-detect tool)

python3 annotation_report.py \
    --annotation_dir results/prokka \
    --outdir results/annotation_summary

Bakta output

python3 annotation_report.py \
    --annotation_dir results/bakta \
    --outdir results/annotation_summary

Force tool type

python3 annotation_report.py \
    --annotation_dir results/bakta \
    --outdir results/annotation_summary \
    --tool bakta

Skip UniProt lookup (offline / fast mode)

python3 annotation_report.py \
    --annotation_dir results/prokka \
    --outdir results/annotation_summary \
    --no_uniprot

Change number of top genes reported

python3 annotation_report.py \
    --annotation_dir results/bakta \
    --outdir results/annotation_summary \
    --top_n 50

Arguments

Argument	Default	Description
`--annotation_dir`	(required)	Path to Prokka or Bakta output directory
`--outdir`	`annotation_summary`	Directory for output files
`--top_n`	`100`	Number of top gene products to report
`--tool`	auto-detect	Force tool type: `prokka` or `bakta`
`--no_uniprot`	off	Skip UniProt lookup; omits clustering and gene/function columns
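For readers who want to wrap or reimplement the command line, the table above maps onto a standard-library `argparse` definition along these lines. This is a sketch built only from the documented flags and defaults; the help strings and the `build_parser` name are assumptions, and the tool's real parser may differ.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(prog="annotation_report.py")
    parser.add_argument("--annotation_dir", required=True,
                        help="Path to Prokka or Bakta output directory")
    parser.add_argument("--outdir", default="annotation_summary",
                        help="Directory for output files")
    parser.add_argument("--top_n", type=int, default=100,
                        help="Number of top gene products to report")
    # None stands in for "auto-detect" when the flag is not given.
    parser.add_argument("--tool", choices=["prokka", "bakta"], default=None,
                        help="Force tool type instead of auto-detecting")
    parser.add_argument("--no_uniprot", action="store_true",
                        help="Skip UniProt lookup")
    return parser


args = build_parser().parse_args(["--annotation_dir", "results/prokka", "--no_uniprot"])
print(args.outdir, args.top_n, args.tool, args.no_uniprot)
# annotation_summary 100 None True
```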

Output

Two files are written to --outdir:

`annotation_gene_summary.html`

An interactive HTML report containing:

Summary cards — bins/MAGs, contigs, assembly size, total CDS, annotation rate, RNA features
Hypothetical protein callout — count and percentage of CDS with no known function
Feature type summary — counts of CDS, tRNA, rRNA, tmRNA, and other features
RNA features table — breakdown of non-coding RNA annotations
Functional clusters (UniProt mode only) — top genes grouped by biological function with CDS counts per gene
Functional category distribution (UniProt mode only) — bar chart of CDS counts per category
Top N gene products table — searchable table with gene product name, gene name, UniProt function description, CDS count, and percentage

`annotation_gene_summary.tsv`

Tab-separated summary with columns:

rank  count  percent_of_total_cds  product  gene_name  function  keywords

A feature type summary is appended below the gene product rows.
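Because the file holds two blocks, a plain table reader may trip over the trailing feature summary. The sketch below shows one way to load only the gene product rows in Python. The README does not specify how the appended block is delimited, so the filter used here (seven columns and an integer rank in the first column) is an assumption, as are the sample values and the `read_gene_rows` name.

```python
import csv
import io

COLUMNS = ["rank", "count", "percent_of_total_cds", "product",
           "gene_name", "function", "keywords"]


def read_gene_rows(handle):
    """Return gene product rows as dicts, skipping the header and any other lines."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Keep only lines that look like ranked gene rows (assumed layout).
        if len(fields) == len(COLUMNS) and fields[0].isdigit():
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Illustrative content only; a real file comes from --outdir.
sample = (
    "rank\tcount\tpercent_of_total_cds\tproduct\tgene_name\tfunction\tkeywords\n"
    "1\t12\t0.80\tDNA gyrase subunit A\tgyrA\tExample function text\tDNA-binding\n"
    "2\t9\t0.60\tElongation factor Tu\ttuf\tExample function text\tElongation factor\n"
    "\n"
    "feature_type\tcount\n"
    "CDS\t1500\n"
)
rows = read_gene_rows(io.StringIO(sample))
print(len(rows), rows[0]["product"])
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to `read_gene_rows`.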

Supported Annotation Tools

Tool	File Types Used	Notes
Prokka	`.tsv`, `.gff`	Reads EC numbers and COG categories if present
Bakta	`.tsv`, `.gff3`, `.json`	Reads database cross-references; auto-skips `hypotheticals.tsv` and `inference.tsv`

Auto-detection treats a directory that contains .gff3 or .json files as Bakta output, and a directory that contains only .gff or .tsv files as Prokka output.
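The detection rule can be sketched with `pathlib` as shown below. This is an illustration under assumptions: the function name `detect_tool` is hypothetical, and searching subdirectories recursively is a guess, since this README does not say how deep the tool looks.

```python
import tempfile
from pathlib import Path

BAKTA_SUFFIXES = {".gff3", ".json"}
PROKKA_SUFFIXES = {".gff", ".tsv"}


def detect_tool(annotation_dir):
    """Guess 'bakta' or 'prokka' from the file extensions found in a directory."""
    suffixes = {
        path.suffix.lower()
        for path in Path(annotation_dir).rglob("*")  # recursive search is an assumption
        if path.is_file()
    }
    # Bakta-specific extensions take precedence, since Bakta also writes .tsv files.
    if suffixes & BAKTA_SUFFIXES:
        return "bakta"
    if suffixes & PROKKA_SUFFIXES:
        return "prokka"
    raise ValueError(f"no Prokka or Bakta files found in {annotation_dir}")


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bin_01.tsv").touch()
    (Path(tmp) / "bin_01.gff").touch()
    print(detect_tool(tmp))  # prokka
    (Path(tmp) / "bin_01.gff3").touch()
    print(detect_tool(tmp))  # bakta
```

When a directory mixes outputs from both tools, a rule like this always reports Bakta, which is one reason to pass `--tool` explicitly in that situation.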

Functional Categories

When UniProt lookup is enabled, genes are assigned to one of the following categories based on UniProt keyword matching (first match wins):

Category	Example keywords
DNA Metabolism	DNA replication, DNA repair, DNA-binding
Transcription	Transcription, RNA-binding, Sigma factor
Translation & Ribosomes	Protein biosynthesis, Ribosomal protein, Elongation factor
Energy & Metabolism	ATP synthesis, Oxidoreductase, TCA cycle, Glycolysis
Transport & Membrane	Transport, Membrane, ABC transporter, Porin
Stress & Chaperones	Chaperone, Heat shock, Oxidative stress, Protease
Cell Division & Structure	Cell division, Peptidoglycan, Cell wall
Signaling & Regulation	Kinase, Two-component regulatory system, Signal transduction
Nucleotide Binding	ATP-binding, GTP-binding, Isomerase, Hydrolase
Other / Unclassified	No matching keywords found

Each gene is assigned to exactly one category. Priority follows the order above.
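A first-match-wins assignment over an ordered category list can be sketched like this. The keyword lists are the examples from the table above, not the tool's full lists, and the case-insensitive exact match and the `assign_category` name are assumptions.

```python
# Ordered list: earlier categories take priority over later ones.
CATEGORIES = [
    ("DNA Metabolism", ("DNA replication", "DNA repair", "DNA-binding")),
    ("Transcription", ("Transcription", "RNA-binding", "Sigma factor")),
    ("Translation & Ribosomes", ("Protein biosynthesis", "Ribosomal protein", "Elongation factor")),
    ("Energy & Metabolism", ("ATP synthesis", "Oxidoreductase", "TCA cycle", "Glycolysis")),
    ("Transport & Membrane", ("Transport", "Membrane", "ABC transporter", "Porin")),
    ("Stress & Chaperones", ("Chaperone", "Heat shock", "Oxidative stress", "Protease")),
    ("Cell Division & Structure", ("Cell division", "Peptidoglycan", "Cell wall")),
    ("Signaling & Regulation", ("Kinase", "Two-component regulatory system", "Signal transduction")),
    ("Nucleotide Binding", ("ATP-binding", "GTP-binding", "Isomerase", "Hydrolase")),
]
FALLBACK = "Other / Unclassified"


def assign_category(uniprot_keywords):
    """Return the first category whose example keywords match, else the fallback."""
    present = {keyword.lower() for keyword in uniprot_keywords}
    for category, terms in CATEGORIES:
        if any(term.lower() in present for term in terms):
            return category
    return FALLBACK


print(assign_category(["ATP-binding", "Chaperone"]))      # Stress & Chaperones
print(assign_category(["DNA-binding", "Transcription"]))  # DNA Metabolism
print(assign_category(["Zinc"]))                          # Other / Unclassified
```

The first call shows why order matters: a chaperone that also binds ATP lands in Stress & Chaperones because that category is checked before Nucleotide Binding.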

Example Workflow

This tool is designed to run after a MAG annotation step in a Snakemake workflow:

rule annotation_report:
    input:
        annotation_dir = 'results/bakta'
    output:
        html = 'results/annotation_summary/annotation_gene_summary.html',
        tsv  = 'results/annotation_summary/annotation_gene_summary.tsv'
    params:
        outdir = 'results/annotation_summary'
    conda:
        'envs/annotation_report.yaml'
    threads: 1
    resources:
        mem_mb=4000,
        runtime=30
    shell:
        """
        python3 scripts/annotation_report.py \
            --annotation_dir {input.annotation_dir} \
            --outdir {params.outdir} \
            --top_n 100
        """

Notes

UniProt queries are restricted to reviewed (Swiss-Prot) entries, so enrichment draws only on manually curated annotations
A 0.2 second delay is applied between UniProt API calls to respect rate limits
For 100 gene products, the UniProt lookup phase takes approximately 30–40 seconds
The --no_uniprot flag is recommended for quick runs or environments without internet access
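The timing figures above follow from the delay: 100 lookups with a 0.2 second pause between them spend roughly 20 seconds waiting, and request latency accounts for the rest. A throttled lookup loop against the UniProt REST search endpoint might look like the sketch below. The query string, the requested fields, and every function name here are assumptions chosen for illustration; the queries annoreport actually sends are not documented in this README.

```python
import json
import time
import urllib.parse
import urllib.request

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_url(product):
    """Build a search URL for one product name, limited to reviewed entries."""
    params = {
        "query": f'reviewed:true AND protein_name:"{product}"',  # assumed query form
        "fields": "gene_names,cc_function,keyword",              # assumed field selection
        "format": "json",
        "size": "1",
    }
    return UNIPROT_SEARCH + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode the JSON body (needs internet access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def lookup_all(products, fetch=fetch_json, delay=0.2, sleep=time.sleep):
    """Query each product in turn, pausing `delay` seconds between calls."""
    results = {}
    for index, product in enumerate(products):
        if index:  # no pause before the first request
            sleep(delay)
        results[product] = fetch(build_url(product))
    return results


# Offline demonstration: a stub fetcher and a recorder stand in for the network and the clock.
pauses = []
demo = lookup_all(
    ["DNA gyrase subunit A", "Elongation factor Tu"],
    fetch=lambda url: {"requested": url},
    sleep=pauses.append,
)
print(pauses)  # [0.2]
```

Calling `lookup_all(products)` with the defaults performs real requests; passing `fetch` and `sleep` explicitly, as in the demonstration, keeps the loop testable without a connection.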

License

MIT License — see LICENSE for details.

Citation

If you use this tool in your research, please cite it as:

annoreport: a summary and visualization tool for MAG annotation workflows. https://github.com/keplerridge/annoreport

Release history

0.1.2 (this version): May 26, 2026
0.1.1: May 21, 2026

