Functional enrichment analysis and visualization for Python

These details have not been verified by PyPI

Project links

Homepage

Project description

richAnn

Rich Annotation: Functional Enrichment Analysis and Visualization for Python

Overview

richAnn is a comprehensive Python package for gene set enrichment analysis, providing powerful statistical methods and publication-quality visualizations. Inspired by the R package richR, richAnn brings advanced enrichment analysis capabilities to the Python ecosystem.

Key Features

Multiple Enrichment Methods
- Gene Ontology (GO) enrichment analysis
- KEGG pathway analysis
- Gene Set Enrichment Analysis (GSEA) with permutation testing
Advanced Analysis
- Kappa-based term clustering to group similar results
- Cross-sample comparison for differential enrichment
- Multiple testing correction (Benjamini-Hochberg, Bonferroni, etc.)
Publication-Quality Visualizations
- Bar plots and dot plots for enrichment results
- Network graphs (term-gene, term-term, multi-database networks)
- Heatmaps for cross-sample comparison
- GSEA enrichment score plots
Flexible Data Loading
- GMT file support for custom gene sets
- GAF file parser for GO annotations
- KEGG mapping file support
- Custom annotation builder

Installation

From Source

git clone https://github.com/guokai8/richAnn.git
cd richAnn
pip install -e .

Development Installation

To install with development dependencies (pytest, black, flake8):

pip install -e ".[dev]"

Dependencies

Python ≥ 3.8
numpy ≥ 1.21.0
pandas ≥ 1.3.0
scipy ≥ 1.7.0
matplotlib ≥ 3.4.0
seaborn ≥ 0.11.0
networkx ≥ 2.6.0
statsmodels ≥ 0.13.0

Quick Start

Basic GO Enrichment Analysis

import richAnn as ra
import pandas as pd

# Your gene list of interest
genes = ["BRCA1", "BRCA2", "TP53", "ATM", "CHEK2", "RAD51", "PALB2"]

# Load GO annotation (or prepare your own DataFrame)
go_data = pd.DataFrame({
    'GeneID': ['BRCA1', 'BRCA2', 'TP53', 'ATM', ...],
    'GOterm': ['GO:0006281', 'GO:0006281', ...],
    'GOname': ['DNA repair', 'DNA repair', ...],
    'Ontology': ['BP', 'BP', ...]
})

# Perform GO enrichment
result = ra.richGO(
    genes=genes,
    godata=go_data,
    ontology="BP",  # Biological Process
    pvalue=0.05,
    padj=0.05,
    minSize=5,
    maxSize=500
)

# View results
print(result)
result.summary()

# Export results
result.to_csv("go_enrichment_results.csv")

Visualize Results

# Bar plot
fig = ra.ggbar(result, top=10, filename="barplot.png")

# Dot plot
fig = ra.ggdot(result, top=10, filename="dotplot.png")

# Network visualization
fig = ra.ggnetwork(result, top=20, filename="network.png")

KEGG Pathway Enrichment

# Load KEGG annotation
kegg_data = pd.DataFrame({
    'GeneID': ['BRCA1', 'BRCA2', ...],
    'Pathway': ['hsa03440', 'hsa03440', ...],
    'PathwayName': ['Homologous recombination', ...]
})

# Perform KEGG enrichment
kegg_result = ra.richKEGG(
    genes=genes,
    kodata=kegg_data,
    pvalue=0.05,
    padj=0.05
)

Gene Set Enrichment Analysis (GSEA)

# Gene scores (e.g., from differential expression analysis)
gene_scores = {
    "BRCA1": 2.5,
    "BRCA2": 2.3,
    "TP53": 2.1,
    "CDK1": -1.8,
    "CDK2": -1.5,
    # ... more genes
}

# Load gene set database
geneset_db = pd.DataFrame({
    'GeneSet': ['DNA_REPAIR', 'DNA_REPAIR', ...],
    'GeneSetName': ['DNA repair pathway', ...],
    'GeneID': ['BRCA1', 'BRCA2', ...]
})

# Perform GSEA
gsea_result = ra.richGSEA(
    gene_scores=gene_scores,
    geneset_db=geneset_db,
    nperm=1000,
    min_size=15,
    max_size=500
)

# Visualize specific pathway
fig = ra.ggGSEA(gsea_result, gene_scores, term_id="DNA_REPAIR")

Advanced Features

Term Clustering

Cluster similar enrichment terms based on gene overlap:

# Cluster enrichment results
clustered = ra.richCluster(
    result,
    cutoff=0.5,      # Kappa score threshold
    minSize=5,       # Minimum cluster size
    escore=3         # Enrichment score cutoff (-log10(padj))
)

# Visualize clustered network
fig = ra.ggnetwork(clustered, top=30)

Cross-Sample Comparison

Compare enrichment across multiple conditions:

# Perform enrichment for multiple samples
sample1_result = ra.richGO(sample1_genes, go_data, ontology="BP")
sample2_result = ra.richGO(sample2_genes, go_data, ontology="BP")
sample3_result = ra.richGO(sample3_genes, go_data, ontology="BP")

# Compare results
comparison = ra.compareResult({
    'Control': sample1_result,
    'Treatment1': sample2_result,
    'Treatment2': sample3_result
})

# Visualize comparison
fig = ra.ggheatmap(comparison, top=30)
fig = ra.comparedot(comparison, top=10)

Load Annotations from Standard Formats

# Load from GMT file (e.g., MSigDB)
geneset_db = ra.load_gmt("c2.cp.kegg.v7.4.symbols.gmt")

# Load GO annotations from GAF file
go_data = ra.load_go_gaf(
    "goa_human.gaf",
    ontology="BP",
    evidence_codes=["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"]
)

# Load KEGG pathway mapping
kegg_data = ra.load_kegg_mapping("kegg_human.txt", organism="hsa")

# Create custom annotation
custom_sets = {
    "DNA_REPAIR": ["BRCA1", "BRCA2", "TP53", "ATM"],
    "CELL_CYCLE": ["CDK1", "CDK2", "CCNA1", "CCNB1"]
}
custom_annot = ra.create_custom_annotation(custom_sets)

Validate Annotation Quality

# Validate annotation format and get statistics
stats = ra.validate_annotation_format(
    go_data,
    required_cols=['GeneID', 'GOterm', 'GOname', 'Ontology']
)

print(f"Valid: {stats['valid']}")
print(f"Unique genes: {stats['n_genes']}")
print(f"Unique annotations: {stats['n_annotations']}")
print(f"Warnings: {stats['warnings']}")

Integration with pathwaydb

richAnn integrates seamlessly with pathwaydb, a lightweight Python library for downloading and querying biological pathway annotations. This is the recommended approach for production use.

Why Use pathwaydb?

Easy setup: Download annotations with one command
Offline access: Query locally without internet
Always up-to-date: Fresh data from official sources (GO, KEGG)
12 model species: Human, mouse, rat, zebrafish, fly, worm, yeast, and more
Fast queries: Millisecond lookups on SQLite databases

Installation

# Install pathwaydb
pip install pathwaydb

# Or install from source
git clone https://github.com/guokai8/pathwaydb.git
cd pathwaydb && pip install -e .

Quick Start with pathwaydb

GO Enrichment Analysis

import richAnn as ra
from pathwaydb import GO

# Step 1: Download GO annotations (only needed once)
go = GO(storage_path='go_human.db')
go.download_annotations(species='human')
# Output: Downloaded 500,000+ annotations with term names

# Step 2: Convert to richAnn format
go_data = ra.from_pathwaydb_go(go.storage, ontology="BP")
print(f"Loaded {go_data['GOterm'].nunique()} GO terms for {go_data['GeneID'].nunique()} genes")

# Step 3: Run enrichment analysis
genes = ["BRCA1", "BRCA2", "TP53", "ATM", "CHEK2", "RAD51", "PALB2", "NBN"]
result = ra.richGO(genes, go_data, ontology="BP", pvalue=0.05, padj=0.1)

# Step 4: View and visualize results
print(result)
result.summary()
fig = ra.ggdot(result, top=15)

KEGG Pathway Enrichment

import richAnn as ra
from pathwaydb import KEGG

# Step 1: Download KEGG annotations (includes pathway hierarchy!)
kegg = KEGG(species='hsa', storage_path='kegg_human.db')
kegg.download_annotations()  # Automatically downloads pathway hierarchy
kegg.convert_ids_to_symbols()  # Convert Entrez IDs to gene symbols

# Step 2: Convert to richAnn format
kegg_data = ra.from_pathwaydb_kegg(kegg.storage)
print(f"Loaded {kegg_data['Pathway'].nunique()} pathways")

# Step 3: Run enrichment analysis
result = ra.richKEGG(genes, kegg_data, pvalue=0.05)

# Step 4: Results now include pathway hierarchy!
print(result.result[['Annot', 'Term', 'Level1', 'Level2', 'Padj']].head())
#         Annot                    Term           Level1                  Level2      Padj
# 0   hsa03440  Homologous recombination  Genetic Info...  Replication and repair  0.0001
# 1   hsa04110              Cell cycle    Cellular Processes        Cell growth...  0.0005

# Step 5: Visualize
fig = ra.ggbar(result, top=10)
fig = ra.ggnetwork(result, top=20)

KEGG Pathway Hierarchy Levels:

Level1: Top-level category (e.g., "Metabolism", "Human Diseases", "Cellular Processes")
Level2: Sub-category (e.g., "Carbohydrate metabolism", "Cancer", "Cell growth and death")
Level3: Pathway name (same as PathwayName)

Complete Workflow Example

import richAnn as ra
from pathwaydb import GO, KEGG

# =============================================================================
# Setup: Download annotations (run once, data is cached locally)
# =============================================================================

# GO annotations
go = GO(storage_path='go_human.db')
go.download_annotations(species='human')

# KEGG annotations
kegg = KEGG(species='hsa', storage_path='kegg_human.db')
kegg.download_annotations()
kegg.convert_ids_to_symbols()

# =============================================================================
# Analysis: Your gene list (e.g., from differential expression)
# =============================================================================

# Example: DNA damage response genes
my_genes = [
    "TP53", "BRCA1", "BRCA2", "ATM", "ATR", "CHEK1", "CHEK2",
    "RAD51", "PALB2", "NBN", "MRE11", "RAD50", "XRCC1", "PARP1"
]

# =============================================================================
# GO Enrichment (Biological Process)
# =============================================================================

go_data = ra.from_pathwaydb_go(go.storage, ontology="BP")
go_result = ra.richGO(
    genes=my_genes,
    godata=go_data,
    ontology="BP",
    pvalue=0.05,
    padj=0.1,
    minSize=5,
    maxSize=500
)

print("=== GO Enrichment Results ===")
go_result.summary()

# Top enriched terms
print("\nTop 10 GO Terms:")
for _, row in go_result.top(10).result.iterrows():
    print(f"  {row['Annot']}: {row['Term']} (Padj={row['Padj']:.2e})")

# =============================================================================
# KEGG Enrichment
# =============================================================================

kegg_data = ra.from_pathwaydb_kegg(kegg.storage)
kegg_result = ra.richKEGG(
    genes=my_genes,
    kodata=kegg_data,
    pvalue=0.05,
    padj=0.1
)

print("\n=== KEGG Enrichment Results ===")
kegg_result.summary()

# =============================================================================
# Visualizations
# =============================================================================

# GO dot plot
fig_go = ra.ggdot(go_result, top=15, usePadj=True)
fig_go.savefig("go_dotplot.png", dpi=300, bbox_inches='tight')

# KEGG bar plot
fig_kegg = ra.ggbar(kegg_result, top=10, colorBy='Padj')
fig_kegg.savefig("kegg_barplot.png", dpi=300, bbox_inches='tight')

# Combined network
fig_net = ra.ggnetmap(
    {'GO': go_result, 'KEGG': kegg_result},
    top=10
)
fig_net.savefig("combined_network.png", dpi=300, bbox_inches='tight')

# =============================================================================
# Export Results
# =============================================================================

go_result.to_csv("go_enrichment_results.csv")
kegg_result.to_csv("kegg_enrichment_results.csv")

Multi-Species Analysis

pathwaydb supports 12 model organisms. Use the same workflow for any species:

from pathwaydb import GO, get_supported_species

# List all supported species
print(get_supported_species())
# ['arabidopsis', 'chicken', 'cow', 'dog', 'fly', 'human', 'mouse', 'pig', 'rat', 'worm', 'yeast', 'zebrafish']

# Mouse analysis
go_mouse = GO(storage_path='go_mouse.db')
go_mouse.download_annotations(species='mouse')
go_data_mouse = ra.from_pathwaydb_go(go_mouse.storage, ontology="BP")

# Zebrafish analysis
go_zebrafish = GO(storage_path='go_zebrafish.db')
go_zebrafish.download_annotations(species='zebrafish')
go_data_zebrafish = ra.from_pathwaydb_go(go_zebrafish.storage, ontology="BP")

# Fly (Drosophila) analysis
go_fly = GO(storage_path='go_fly.db')
go_fly.download_annotations(species='fly')
go_data_fly = ra.from_pathwaydb_go(go_fly.storage, ontology="BP")

Using Cached Annotations

For repeated analyses, use pathwaydb's centralized cache:

from pathwaydb import GO

# Load from cache (auto-downloads if not cached)
go = GO.from_cache(species='human')

# Convert and analyze
go_data = ra.from_pathwaydb_go(go.storage, ontology="BP")
result = ra.richGO(genes, go_data, ontology="BP")

Filter GO by Evidence Codes

# Download with all evidence codes
go = GO(storage_path='go_human.db')
go.download_annotations(species='human')

# Filter for experimental evidence only during conversion
go_data = ra.from_pathwaydb_go(go.storage, ontology="BP")

# Or filter using pathwaydb before conversion
experimental = go.storage.filter(evidence_codes=['EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP'])
# Then manually build DataFrame if needed

pathwaydb Data Format Reference

GO columns from pathwaydb:

pathwaydb	richAnn	Description
`GeneID`	`GeneID`	Gene symbol
`TERM`	`GOterm`	GO term ID (GO:0006281)
`Aspect`	`Ontology`	P→BP, F→MF, C→CC
`term_name`	`GOname`	Term description

KEGG columns from pathwaydb:

pathwaydb	richAnn	Description
`GeneID`	`GeneID`	Gene symbol
`PATH`	`Pathway`	Pathway ID (hsa04110)
`Annot`	`PathwayName`	Pathway name
`Level1`	`Level1`	Top-level category (e.g., "Metabolism")
`Level2`	`Level2`	Sub-category (e.g., "Carbohydrate metabolism")
`Level3`	`Level3`	Pathway name (same as PathwayName)

Working with EnrichResult Objects

EnrichResult is the core data structure returned by all enrichment functions:

# Filter results
filtered = result.filter(
    pvalue=0.01,           # P-value cutoff
    padj=0.05,             # Adjusted p-value cutoff
    min_count=5,           # Minimum gene count
    min_richfactor=1.5     # Minimum enrichment factor
)

# Get top N results
top10 = result.top(n=10, orderby='Padj')

# Get gene-term details
detail_df = result.detail()  # Returns DataFrame with one row per gene-term pair

# Print summary statistics
result.summary()

# Export to various formats
result.to_csv("results.csv")
result.to_excel("results.xlsx", sheet_name="GO_BP")

# Access underlying DataFrame
df = result.result
print(df.columns)  # View available columns

Visualization Gallery

Bar Plot

ra.ggbar(result, top=15, colorBy='Padj', horiz=True, filename="barplot.pdf", dpi=300)

Dot Plot

ra.ggdot(result, top=15, usePadj=True, size_range=(50, 500), filename="dotplot.pdf")

Term-Gene Network

ra.ggnetplot(result, top=20, layout='spring', filename="term_gene_network.pdf")

Term-Term Similarity Network

ra.ggnetwork(result, top=30, weightcut=0.2, layout='spring', filename="term_network.pdf")

Multi-Database Network

ra.ggnetmap(
    {'GO': go_result, 'KEGG': kegg_result},
    top=15,
    filename="combined_network.pdf"
)

Comparison Heatmap

ra.ggheatmap(
    comparison_df,
    top=30,
    cluster_rows=True,
    cluster_cols=False,
    filename="heatmap.pdf"
)

Understanding the Results

Key Columns in Results

Annot: Annotation ID (e.g., GO:0006281, hsa03440)
Term: Human-readable term name
Pvalue: Raw p-value from hypergeometric test
Padj: Adjusted p-value (Benjamini-Hochberg FDR by default)
Count: Number of query genes in this term
GeneID: Semicolon-separated list of genes
RichFactor: Enrichment ratio (observed/expected)
GeneRatio: Query genes in term / total query genes (k/n)
BgRatio: Background genes in term / total background (M/N)
OddsRatio: Odds ratio from Fisher's exact test

Statistical Methods

Hypergeometric Test (richGO, richKEGG)

Tests for over-representation using the hypergeometric distribution:

P(X ≥ k) where k = number of overlapping genes
Accounts for gene set size and background universe

GSEA (richGSEA)

Ranks genes by score and calculates enrichment score:

Weighted running sum based on gene ranks
Permutation testing for significance
Normalized enrichment score (NES) accounts for gene set size

Tips and Best Practices

1. Choose Appropriate Background

# Option 1: Use all genes in annotation (default)
result = ra.richGO(genes, go_data, ontology="BP")

# Option 2: Provide custom background (recommended for RNA-seq)
background = list(expressed_genes)  # All genes detected in experiment
result = ra.richGO(genes, go_data, ontology="BP", universe=background)

2. Handle Gene ID Formatting

# Case-insensitive matching (default)
result = ra.richGO(genes, go_data, case_sensitive=False)

# Case-sensitive matching
result = ra.richGO(genes, go_data, case_sensitive=True)

3. Adjust Gene Set Size Limits

# Exclude very small and very large gene sets
result = ra.richGO(
    genes, go_data,
    minSize=10,   # Exclude terms with < 10 genes
    maxSize=300   # Exclude terms with > 300 genes
)

4. Multiple Testing Correction

# Benjamini-Hochberg FDR (default)
result = ra.richGO(genes, go_data, padj_method="BH")

# Bonferroni correction (more conservative)
result = ra.richGO(genes, go_data, padj_method="bonferroni")

# No correction (use with caution)
result = ra.richGO(genes, go_data, padj_method="none")

Common Issues and Solutions

Issue: "No query genes found in background universe"

Solution: Check gene ID format matches annotation. Try case_sensitive=False.

Issue: "No enriched terms found"

Solutions:

Relax p-value cutoff: pvalue=0.1
Adjust gene set size limits: minSize=3, maxSize=1000
Check that your gene list is biologically relevant
Verify annotation database matches your organism

Issue: GSEA is slow

Solution: Reduce permutations (but keep ≥ 1000): nperm=1000

Issue: Visualizations have overlapping labels

Solutions:

Reduce top parameter: top=10
Increase figure size: figsize=(14, 10)
Decrease font size: fontsize=8

Testing

Run the test suite:

# Run all tests
pytest

# Run with coverage
pytest --cov=richAnn

# Run specific test file
pytest tests/test_enrichment.py

# Run with verbose output
pytest -v

Development

Code Formatting

black richAnn/

Linting

flake8 richAnn/

Citation

If you use richAnn in your research, please cite:

[Citation information will be added upon publication]

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Inspired by the R package richR
Built on the excellent scientific Python ecosystem (NumPy, pandas, SciPy, matplotlib)

Contact

For questions and feedback:

Open an issue on GitHub
Email: support@richann.org

Changelog

Version 1.0.0 (Current)

New Features

GO, KEGG, and GSEA enrichment analysis
Kappa-based term clustering
8 visualization types
GMT, GAF, and KEGG file loaders
Custom annotation builder
Cross-sample comparison tools
Comprehensive test suite

Bug Fixes

Fixed module import error
Improved GSEA performance (28x faster)
Added edge case handling for visualizations
Enhanced error messages with actionable suggestions

Improvements

Added input validation for all functions
Comprehensive documentation with examples
Test coverage for core functionality
Better error handling throughout

Happy Enriching! 🧬

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

1.0.0

Feb 1, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

richann-1.0.0.tar.gz (45.6 kB view details)

Uploaded Feb 1, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

richann-1.0.0-py3-none-any.whl (42.9 kB view details)

Uploaded Feb 1, 2026 Python 3

File details

Details for the file richann-1.0.0.tar.gz.

File metadata

Download URL: richann-1.0.0.tar.gz
Upload date: Feb 1, 2026
Size: 45.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for richann-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`02e8941b77d741a700e83c6557c1e349f105405b57975a22f3cd07d7b541dcdc`
MD5	`36be3dbb62122cc011de597426b2b3ac`
BLAKE2b-256	`fac0d3374f296b6f86b642b16c5e7904c372c721d03d3910c0fb7648e70643fb`

See more details on using hashes here.

File details

Details for the file richann-1.0.0-py3-none-any.whl.

File metadata

Download URL: richann-1.0.0-py3-none-any.whl
Upload date: Feb 1, 2026
Size: 42.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for richann-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ca0e95d9c5d025ea6ef9f87e62be84d1b5596c6d631b9e991a2df9f0df0622f8`
MD5	`7ccac974779e621298988ca3e1dfe423`
BLAKE2b-256	`b5130f2de4460bcf6d2742f7ed9e153ecfaaadefdb25acf45e6f17b860f8e948`

See more details on using hashes here.

richAnn 1.0.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

richAnn

Overview

Key Features

Installation

From Source

Development Installation

Dependencies

Quick Start

Basic GO Enrichment Analysis

Visualize Results

KEGG Pathway Enrichment

Gene Set Enrichment Analysis (GSEA)

Advanced Features

Term Clustering

Cross-Sample Comparison

Load Annotations from Standard Formats

Validate Annotation Quality

Integration with pathwaydb

Why Use pathwaydb?

Installation

Quick Start with pathwaydb

GO Enrichment Analysis

KEGG Pathway Enrichment

Complete Workflow Example

Multi-Species Analysis

Using Cached Annotations

Filter GO by Evidence Codes

pathwaydb Data Format Reference

Working with EnrichResult Objects

Visualization Gallery

Bar Plot

Dot Plot

Term-Gene Network

Term-Term Similarity Network

Multi-Database Network

Comparison Heatmap

Understanding the Results

Key Columns in Results

Statistical Methods

Hypergeometric Test (richGO, richKEGG)

GSEA (richGSEA)

Tips and Best Practices

1. Choose Appropriate Background

2. Handle Gene ID Formatting

3. Adjust Gene Set Size Limits

4. Multiple Testing Correction

Common Issues and Solutions

Issue: "No query genes found in background universe"

Issue: "No enriched terms found"

Issue: GSEA is slow

Issue: Visualizations have overlapping labels

Testing

Development

Code Formatting

Linting

Citation

Contributing

License

Acknowledgments

Contact

Changelog

Version 1.0.0 (Current)

New Features

Bug Fixes

Improvements

Project details

Verified details

Maintainers

Unverified details

Project links

Meta