Summary and visualization tool for MAG gene annotation workflows
annoreport
A command-line tool for summarizing and visualizing gene annotation results from metagenome-assembled genome (MAG) workflows. Supports output from Prokka and Bakta, enriches results via the UniProt REST API, and produces a polished interactive HTML report alongside a TSV summary table.
Features
- Auto-detects Prokka or Bakta output from directory contents
- Counts and ranks the top N most common annotated gene products across all bins
- Separates hypothetical proteins from annotated CDS and reports them independently
- UniProt enrichment — looks up gene names and one-line function descriptions for each top gene product
- Functional clustering — groups genes into biological categories (DNA Metabolism, Translation, Energy & Metabolism, Stress & Chaperones, etc.) based on UniProt keywords
- Interactive HTML report with:
  - Summary stat cards (bins, contigs, assembly size, CDS counts, annotation rate, RNA features)
  - Feature type and RNA feature tables
  - Functional cluster cards with per-gene CDS counts
  - Functional category distribution bar chart
  - Searchable top-N gene product table
- TSV output for downstream analysis in R, Python, or Excel
- --no_uniprot flag for offline/fast runs: skips UniProt lookup and omits clustering sections
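The counting and ranking described in the feature list can be illustrated with a minimal sketch. This is not the tool's actual code: the function name `summarize_products` and the assumption that hypothetical CDS carry the exact product label "hypothetical protein" are illustrative only.

```python
from collections import Counter

def summarize_products(products, top_n=100):
    """Split hypothetical proteins from annotated CDS and rank the rest."""
    hypothetical = 0
    counts = Counter()
    for product in products:
        name = product.strip()
        # Assumption: hypothetical CDS are labelled "hypothetical protein".
        if name.lower() == "hypothetical protein":
            hypothetical += 1
        else:
            counts[name] += 1
    total_cds = hypothetical + sum(counts.values())
    # Each entry: (rank, product, count, percent of total CDS).
    ranked = [
        (rank, name, n, round(100 * n / total_cds, 2))
        for rank, (name, n) in enumerate(counts.most_common(top_n), start=1)
    ]
    return ranked, hypothetical, total_cds

# Made-up product names pooled across all bins, for illustration only.
products = [
    "DNA gyrase subunit A",
    "hypothetical protein",
    "Chaperone protein DnaK",
    "DNA gyrase subunit A",
]
ranked, hypothetical, total_cds = summarize_products(products, top_n=2)
```

Here the percentage is computed against all CDS, hypothetical ones included, which matches the percent_of_total_cds column name but is otherwise a guess about the tool's behaviour.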
Requirements
- Python 3.9+
- No external dependencies — uses Python standard library only
- Internet access required for UniProt enrichment (unless --no_uniprot is used)
Installation
Bioconda (recommended)
conda install -c bioconda annoreport
From source
git clone https://github.com/keplerridge/annoreport.git
cd annoreport
Or copy annotation_report.py directly into your project's scripts/ directory.
Usage
Basic (auto-detect tool)
python3 annotation_report.py \
--annotation_dir results/prokka \
--outdir results/annotation_summary
Bakta output
python3 annotation_report.py \
--annotation_dir results/bakta \
--outdir results/annotation_summary
Force tool type
python3 annotation_report.py \
--annotation_dir results/bakta \
--outdir results/annotation_summary \
--tool bakta
Skip UniProt lookup (offline / fast mode)
python3 annotation_report.py \
--annotation_dir results/prokka \
--outdir results/annotation_summary \
--no_uniprot
Change number of top genes reported
python3 annotation_report.py \
--annotation_dir results/bakta \
--outdir results/annotation_summary \
--top_n 50
Arguments
| Argument | Default | Description |
|---|---|---|
| --annotation_dir | (required) | Path to Prokka or Bakta output directory |
| --outdir | annotation_summary | Directory for output files |
| --top_n | 100 | Number of top gene products to report |
| --tool | auto-detect | Force tool type: prokka or bakta |
| --no_uniprot | off | Skip UniProt lookup; omits clustering and gene/function columns |
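For readers who want to wrap or mirror the command-line interface, the argument table maps naturally onto an argparse parser. The sketch below is reconstructed from the table alone; the help strings, the `build_parser` name, and the use of `None` to mean auto-detect are assumptions, not the tool's source.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(prog="annotation_report.py")
    parser.add_argument("--annotation_dir", required=True,
                        help="Path to Prokka or Bakta output directory")
    parser.add_argument("--outdir", default="annotation_summary",
                        help="Directory for output files")
    parser.add_argument("--top_n", type=int, default=100,
                        help="Number of top gene products to report")
    # Assumption: None stands for "auto-detect".
    parser.add_argument("--tool", choices=["prokka", "bakta"], default=None,
                        help="Force tool type")
    parser.add_argument("--no_uniprot", action="store_true",
                        help="Skip UniProt lookup")
    return parser

# Parse a minimal command line; everything else falls back to defaults.
args = build_parser().parse_args(["--annotation_dir", "results/prokka"])
```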
Output
Two files are written to --outdir:
annotation_gene_summary.html
An interactive HTML report containing:
- Summary cards — bins/MAGs, contigs, assembly size, total CDS, annotation rate, RNA features
- Hypothetical protein callout — count and percentage of CDS with no known function
- Feature type summary — counts of CDS, tRNA, rRNA, tmRNA, and other features
- RNA features table — breakdown of non-coding RNA annotations
- Functional clusters (UniProt mode only) — top genes grouped by biological function with CDS counts per gene
- Functional category distribution (UniProt mode only) — bar chart of CDS counts per category
- Top N gene products table — searchable table with gene product name, gene name, UniProt function description, CDS count, and percentage
annotation_gene_summary.tsv
Tab-separated summary with columns:
rank count percent_of_total_cds product gene_name function keywords
Plus a feature type summary appended at the bottom.
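Because the feature type summary is appended to the same file, a downstream reader has to stop at the end of the ranked gene rows. The sketch below shows one way to do that in Python. It assumes the file starts with a header row carrying the column names above and that the appended summary begins with a line whose first field is not an integer; the exact layout of that trailing block is not documented here, and the sample content is made up.

```python
import csv
import io

def read_gene_rows(handle):
    """Read ranked gene rows, stopping at the appended feature summary."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rank = row.get("rank") or ""
        # Assumption: the first non-integer rank marks the trailing summary.
        if not rank.isdigit():
            break
        row["rank"] = int(rank)
        row["count"] = int(row["count"])
        row["percent_of_total_cds"] = float(row["percent_of_total_cds"])
        rows.append(row)
    return rows

# Made-up file content for illustration only.
sample = (
    "rank\tcount\tpercent_of_total_cds\tproduct\tgene_name\tfunction\tkeywords\n"
    "1\t42\t1.5\tDNA gyrase subunit A\tgyrA\tExample function text\tDNA-binding\n"
    "\n"
    "feature_type\tcount\n"
    "CDS\t2800\n"
)
rows = read_gene_rows(io.StringIO(sample))
```

Only the rank, count, and percent_of_total_cds columns are converted, so the same reader should also cope with files written in --no_uniprot mode, where the gene and function columns are omitted.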
Supported Annotation Tools
| Tool | File Types Used | Notes |
|---|---|---|
| Prokka | .tsv, .gff | Reads EC numbers and COG categories if present |
| Bakta | .tsv, .gff3, .json | Reads database cross-references; auto-skips hypotheticals.tsv and inference.tsv |
Tool auto-detection checks for .gff3 or .json files (Bakta) versus .gff or .tsv only (Prokka).
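The detection rule can be sketched as a check on file extensions. The function name `detect_tool`, the recursive directory scan, and the error raised when nothing matches are assumptions made for illustration; the tool's real logic may differ in detail.

```python
import tempfile
from pathlib import Path

def detect_tool(annotation_dir):
    """Guess the annotation tool from the file extensions present."""
    suffixes = {
        path.suffix.lower()
        for path in Path(annotation_dir).rglob("*")
        if path.is_file()
    }
    # Bakta writes .gff3 and .json files; Prokka writes neither.
    if suffixes & {".gff3", ".json"}:
        return "bakta"
    if suffixes & {".gff", ".tsv"}:
        return "prokka"
    raise ValueError(f"No Prokka or Bakta output found in {annotation_dir}")

# Try the rule on a throwaway directory holding one empty Bakta-style file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bin1.gff3").touch()
    detected = detect_tool(tmp)
```

Checking for the Bakta-specific extensions first matters because Bakta also writes .tsv files, so a .tsv on its own cannot distinguish the two tools.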
Functional Categories
When UniProt lookup is enabled, genes are assigned to one of the following categories based on UniProt keyword matching (first match wins):
| Category | Example keywords |
|---|---|
| DNA Metabolism | DNA replication, DNA repair, DNA-binding |
| Transcription | Transcription, RNA-binding, Sigma factor |
| Translation & Ribosomes | Protein biosynthesis, Ribosomal protein, Elongation factor |
| Energy & Metabolism | ATP synthesis, Oxidoreductase, TCA cycle, Glycolysis |
| Transport & Membrane | Transport, Membrane, ABC transporter, Porin |
| Stress & Chaperones | Chaperone, Heat shock, Oxidative stress, Protease |
| Cell Division & Structure | Cell division, Peptidoglycan, Cell wall |
| Signaling & Regulation | Kinase, Two-component regulatory system, Signal transduction |
| Nucleotide Binding | ATP-binding, GTP-binding, Isomerase, Hydrolase |
| Other / Unclassified | No matching keywords found |
Each gene is assigned to exactly one category. Priority follows the order above.
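A first-match-wins assignment can be sketched as an ordered list of rules. The rules below contain only the example keywords from the table, so the tool's real keyword lists are presumably longer, and the case-insensitive exact matching is an assumption.

```python
# Ordered rules: earlier categories take priority over later ones.
CATEGORY_RULES = [
    ("DNA Metabolism", ["DNA replication", "DNA repair", "DNA-binding"]),
    ("Transcription", ["Transcription", "RNA-binding", "Sigma factor"]),
    ("Translation & Ribosomes",
     ["Protein biosynthesis", "Ribosomal protein", "Elongation factor"]),
    ("Energy & Metabolism",
     ["ATP synthesis", "Oxidoreductase", "TCA cycle", "Glycolysis"]),
    ("Transport & Membrane", ["Transport", "Membrane", "ABC transporter", "Porin"]),
    ("Stress & Chaperones",
     ["Chaperone", "Heat shock", "Oxidative stress", "Protease"]),
    ("Cell Division & Structure", ["Cell division", "Peptidoglycan", "Cell wall"]),
    ("Signaling & Regulation",
     ["Kinase", "Two-component regulatory system", "Signal transduction"]),
    ("Nucleotide Binding", ["ATP-binding", "GTP-binding", "Isomerase", "Hydrolase"]),
]

def assign_category(keywords):
    """Return the first category whose trigger keywords overlap the input."""
    present = {keyword.strip().lower() for keyword in keywords}
    for category, triggers in CATEGORY_RULES:
        if any(trigger.lower() in present for trigger in triggers):
            return category
    return "Other / Unclassified"

# A chaperone that also binds ATP lands in the earlier category.
category = assign_category(["ATP-binding", "Chaperone"])
```

In this sketch a protein tagged with both Chaperone and ATP-binding is placed in Stress & Chaperones, because that rule appears before Nucleotide Binding in the list.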
Example Workflow
This tool is designed to run after a MAG annotation step in a Snakemake workflow:
rule annotation_report:
    input:
        annotation_dir = 'results/bakta'
    output:
        html = 'results/annotation_summary/annotation_gene_summary.html',
        tsv = 'results/annotation_summary/annotation_gene_summary.tsv'
    params:
        outdir = 'results/annotation_summary'
    conda:
        'envs/annotation_report.yaml'
    threads: 1
    resources:
        mem_mb=4000,
        runtime=30
    shell:
        """
        python3 scripts/annotation_report.py \
            --annotation_dir {input.annotation_dir} \
            --outdir {params.outdir} \
            --top_n 100
        """
Notes
- UniProt queries are restricted to reviewed (Swiss-Prot) entries to keep annotations high quality
- A 0.2 second delay is applied between UniProt API calls to respect rate limits
- For 100 gene products, the UniProt lookup phase takes approximately 30–40 seconds
- The --no_uniprot flag is recommended for quick runs or environments without internet access
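The throttling described in these notes can be sketched with the standard library alone. With 100 gene products, the 0.2 second pauses add up to roughly 20 seconds, so the remainder of the quoted 30 to 40 seconds is presumably request time. The search endpoint below is UniProt's public REST search URL, but the query string, the `protein_name` field, and the returned fields are assumptions about how a lookup might be phrased, not the tool's actual request.

```python
import json
import time
import urllib.parse
import urllib.request

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"

def build_query_url(product):
    """Build a search URL limited to reviewed (Swiss-Prot) entries."""
    params = {
        # Assumption: match on protein name; the tool may query differently.
        "query": f'protein_name:"{product}" AND reviewed:true',
        "fields": "gene_names,cc_function,keyword",
        "format": "json",
        "size": "1",
    }
    return UNIPROT_SEARCH + "?" + urllib.parse.urlencode(params)

def fetch_json(url):
    """Fetch one URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)

def lookup_all(products, fetch=fetch_json, delay=0.2):
    """Query each product in turn, pausing between consecutive calls."""
    results = {}
    for index, product in enumerate(products):
        if index:
            time.sleep(delay)  # throttle to respect rate limits
        results[product] = fetch(build_query_url(product))
    return results
```

Passing the `fetch` function as a parameter keeps the throttling loop testable without network access, for example `lookup_all(names, fetch=lambda url: {}, delay=0)`.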
License
MIT License — see LICENSE for details.
Citation
If you use this tool in your research, please cite it as:
annoreport: a summary and visualization tool for MAG annotation workflows. https://github.com/keplerridge/annoreport