A comprehensive tool for running gene enrichment analysis workflows using multiple enrichment tools and AI-powered summarization

AI Gene Enrichment

A comprehensive Python package for performing gene enrichment analysis using multiple bioinformatics tools and AI-powered synthesis. This tool integrates Enrichr, ToppFun, gProfiler, literature search, and AI summarization to provide detailed biological insights from gene lists.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC 4.0). This means you are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

Under the following terms:

  • Attribution — You must give appropriate credit and indicate if changes were made
  • NonCommercial — You may not use the material for commercial purposes

For commercial licensing inquiries, please contact: cboyce3@mgh.harvard.edu

Installation

From PyPI (Recommended)

pip install ai-gene-enrichment

From Source

  1. Clone this repository:

    git clone https://github.com/calvinrboyce/ai-gene-enrichment.git
    cd ai-gene-enrichment
    
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install in development mode:

    pip install -e .
    
  4. Set up your OpenAI API key:

    # Create a .env file
    echo "OPENAI_API_KEY=your_openai_api_key_here" > .env
    

Usage

AI synthesis requires an OpenAI API key. PIs or other researchers can obtain one by setting up an organization at platform.openai.com and creating a key at platform.openai.com/settings/organization/api-keys. Note that models more sophisticated than the default may require organization verification and a higher usage tier with OpenAI. With the default model, you can expect to spend less than $0.03 per gene list.
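It can help to confirm that the key is actually present before starting an analysis. The following is a minimal sketch, assuming the key is stored in the OPENAI_API_KEY environment variable as in the installation step. `require_api_key` is a hypothetical helper written for illustration and is not part of the package.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of ai-gene-enrichment.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    # Treat the placeholder from the installation step as "not set".
    if not key or key == "your_openai_api_key_here":
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return key
```

Call it after `dotenv.load_dotenv()` and pass the returned key to `AIGeneEnrichment`.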

Basic Usage

import os
import dotenv
from aige import AIGeneEnrichment

# Load environment variables
dotenv.load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize the agent
aige = AIGeneEnrichment(openai_api_key)

# Define your gene list
genes = ["ANLN", "CENPF", "NUSAP1", "TOP2A", "CCNB1", "PRC1", "TPX2", "UBE2C", "BIRC5"]

# Run analysis
results = aige.run_analysis(
    genes=genes,
    email="your.email@example.com",  # Required for NCBI literature search
    background_genes=[],  # Optional background gene set for enrichment analysis
    ranked=True,  # Set to True if genes are ranked by differential expression
    search_terms=["cancer", "metastasis", "cell cycle"],  # Optional literature search terms
    context="This gene list characterizes a cluster of cells in a brain metastasis"
)

# Access results
print(results['summary'])
for theme in results['themes']:
    print(f"Theme: {theme['theme']}")
    print(f"Description: {theme['description']}")

Advanced Configuration

# Customize enrichment sources and parameters
aige = AIGeneEnrichment(
    open_ai_api_key=openai_api_key,
    open_ai_model="gpt-4o",
    results_dir="custom_results",
    enrichr_sources={
        "CellMarker_2024": "CellMarker",
        "MSigDB_Hallmark_2020": "MSigDB-H",
        "ChEA_2022": "ChEA"
    },
    gprofiler_sources=['HPA', 'MIRNA'],
    toppfun_sources=['TFBS', 'MicroRNA'],
    terms_per_source=20,  # Number of terms to retrieve per source
    papers_per_gene=3,    # Number of papers to retrieve per gene
    max_papers=15         # Number of papers to retrieve for full gene list
)

# Example with background gene set
background_genes = ["GENE1", "GENE2", "GENE3", ...]  # Your background gene set
results = aige.run_analysis(
    genes=genes,
    email="your.email@example.com",
    background_genes=background_genes,  # Use custom background for enrichment
    ranked=True,
    search_terms=["cancer", "metastasis"],
    context="Analysis with custom background gene set"
)

Background Gene Sets

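The agent supports custom background gene sets for enrichment analysis. Testing against the genes that could actually have been detected in your experiment, rather than a generic reference set, can improve the statistical validity and biological relevance of your results. When a background gene set is provided: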

  • Enrichr: Uses the background genes as the reference set for statistical testing
  • gProfiler: Uses the background genes as the custom background for enrichment calculations
  • ToppFun: Currently uses its default background (custom background genes are not yet supported)

Note: When no background genes are provided (empty list), the tools use their default reference sets.
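To see why the choice of background matters, consider a generic over-representation test. The sketch below uses a one-sided hypergeometric test built only on the standard library. It illustrates the general technique and is an assumption about the kind of statistic involved, not the exact calculation performed by Enrichr or gProfiler. The gene counts are made up for illustration.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, term_size, background_size):
    """One-sided hypergeometric p-value: P(X >= overlap).

    X counts how many of `list_size` genes drawn from a background of
    `background_size` genes fall in a term annotating `term_size` of them.
    """
    if not (0 <= term_size <= background_size and 0 <= list_size <= background_size):
        raise ValueError("term and list sizes must not exceed the background size")
    upper = min(list_size, term_size)
    total = comb(background_size, list_size)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 5 of 9 query genes hit a 200-gene term.
p_genome = hypergeom_pvalue(5, 9, 200, 20000)  # genome-scale background
p_custom = hypergeom_pvalue(5, 9, 200, 2000)   # smaller custom background
print(f"genome-scale background: {p_genome:.2e}")
print(f"custom background:       {p_custom:.2e}")
```

With the smaller background the term covers a larger share of the possible genes, so the same overlap is less surprising and the p-value is larger.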

Working with Results

The analysis returns a structured dictionary with the following components:

results = {
    'themes': [
        {
            'theme': 'Cell Cycle Regulation',
            'description': 'Genes involved in cell cycle control and progression...',
            'terms': ['GO:0007049', 'KEGG:04110', ...]
        }
    ],
    'summary': 'Comprehensive analysis summary...'
}
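The structure above is straightforward to flatten for quick inspection. The following is a minimal sketch that works on a mock dictionary shaped like the documented return value. `summarize_themes` is a hypothetical helper, and the second theme and its term ID are made up. The `confidence` key is listed in the API reference but not in the example above, so the sketch treats it as optional.

```python
def summarize_themes(results):
    """Return (theme, confidence, n_terms) tuples, highest confidence first.

    `confidence` is read with .get() because it may be absent from a theme.
    """
    rows = []
    for theme in results.get("themes", []):
        rows.append(
            (theme["theme"], theme.get("confidence"), len(theme.get("terms", [])))
        )
    # Themes without a confidence score sort last.
    rows.sort(key=lambda row: (row[1] is None, -(row[1] or 0.0)))
    return rows


# Mock data shaped like the documented return value; values are made up.
example = {
    "themes": [
        {"theme": "Mitotic Spindle", "description": "...", "terms": ["TERM_A"]},
        {
            "theme": "Cell Cycle Regulation",
            "description": "...",
            "confidence": 0.9,
            "terms": ["GO:0007049", "KEGG:04110"],
        },
    ],
    "summary": "...",
}

for name, confidence, n_terms in summarize_themes(example):
    print(name, confidence, n_terms)
```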

Output Files

When save_results=True (the default), the agent also saves the results as an Excel spreadsheet in results_dir (default: "aige_results").

Dependencies

  • openai>=1.0.0: OpenAI API integration
  • requests>=2.31.0: HTTP requests
  • python-dotenv>=1.0.0: Environment variable management
  • biopython>=1.82: PubMed literature search
  • gprofiler-official>=1.0.0: gProfiler API client
  • openpyxl>=3.1.5: Excel file generation

API Reference

AIGeneEnrichment

Main class for running gene enrichment analysis workflows.

Constructor Parameters

  • open_ai_api_key (str): OpenAI API key (required)
  • open_ai_model (str): OpenAI model to use (default: "gpt-4.1-mini")
  • results_dir (str): Directory to save results (default: "aige_results")
  • enrichr_sources (dict): Enrichr sources to use. Dictionary should map the name of the source (found in Enrichr's documentation) to an abbreviated version to be used in the analysis (default: {})
  • gprofiler_sources (list): gProfiler sources to use. List should include terms from ['TF', 'MIRNA', 'HPA', 'CORUM'] to be used in the analysis (default: [])
  • toppfun_sources (list): ToppFun sources to use. List should include terms from ['MousePheno', 'Domain', 'Pubmed', 'Cytoband', 'TFBS', 'GeneFamily', 'Coexpression', 'CoexpressionAtlas', 'ToppCell', 'Computational', 'MicroRNA', 'Drug', 'Disease'] to be used in the analysis (default: ['ToppCell'])
  • terms_per_source (int): Number of terms per source (default: 20)
  • num_papers (int): Number of papers to fetch from lit search (default: 20)
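Because the gProfiler and ToppFun source names must come from the fixed lists above, a typo is easy to catch before constructing the object. The following is a minimal sketch. `check_sources` is a hypothetical helper, the allowed names are copied from the parameter descriptions above, and matching is assumed to be case-sensitive. Whether the package performs a similar check itself is not stated here.

```python
# Allowed names copied from the constructor parameter descriptions.
GPROFILER_SOURCES = {"TF", "MIRNA", "HPA", "CORUM"}
TOPPFUN_SOURCES = {
    "MousePheno", "Domain", "Pubmed", "Cytoband", "TFBS", "GeneFamily",
    "Coexpression", "CoexpressionAtlas", "ToppCell", "Computational",
    "MicroRNA", "Drug", "Disease",
}


def check_sources(requested, allowed, tool):
    """Return `requested` as a list, or raise ValueError naming unknown entries.

    Hypothetical helper for illustration; not part of ai-gene-enrichment.
    """
    unknown = sorted(set(requested) - allowed)
    if unknown:
        raise ValueError(
            f"Unknown {tool} sources: {unknown}; choose from {sorted(allowed)}"
        )
    return list(requested)


gprofiler_sources = check_sources(["HPA", "MIRNA"], GPROFILER_SOURCES, "gProfiler")
toppfun_sources = check_sources(["TFBS", "MicroRNA"], TOPPFUN_SOURCES, "ToppFun")
```

The validated lists can then be passed as `gprofiler_sources` and `toppfun_sources` to the constructor.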

Methods

run_analysis(genes, email, background_genes=[], ranked=True, search_terms=[], context="None", save_results=True, analysis_name=None)

Run the complete gene enrichment analysis workflow.

Parameters:

  • genes (List[str]): List of gene symbols to analyze (required)
  • email (str): Email address to identify yourself to NCBI (required for literature search)
  • background_genes (List[str]): List of background genes to use for enrichment analysis (default: [])
  • ranked (bool): Whether the genes are ranked by differential expression (default: True)
  • search_terms (List[str]): List of search terms to use in literature search (default: [])
  • context (str): Context of where the genes came from and what you're studying (default: "None")
  • save_results (bool): Whether to save results to files (default: True)
  • analysis_name (str): Name for the analysis run, used in file naming (default: None)

Returns:

  • Dict[str, Any]: Themed results dictionary containing:
    • themes: List of identified biological themes, each with:
      • theme: Theme name
      • description: Description of the theme's function and identification rationale
      • confidence: Confidence score (0-1)
      • terms: List of enrichment terms that contributed to theme identification
    • summary: Comprehensive analysis summary
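When several gene lists need to be analyzed, for example one per cell cluster, `run_analysis` can be called in a loop with a distinct `analysis_name` for each run. The following is a minimal sketch under the signature documented above. `run_for_clusters` is a hypothetical helper, and `analyzer` is assumed to be an `AIGeneEnrichment` instance or any object exposing the same method.

```python
def run_for_clusters(analyzer, clusters, email, context="None"):
    """Run one analysis per named gene list and collect the summaries.

    `clusters` maps a name to a list of gene symbols. Hypothetical helper
    for illustration; not part of ai-gene-enrichment.
    """
    summaries = {}
    for name, genes in clusters.items():
        results = analyzer.run_analysis(
            genes=genes,
            email=email,
            context=context,
            analysis_name=name,  # documented as being used in file naming
        )
        summaries[name] = results["summary"]
    return summaries
```

Each run makes its own calls to the OpenAI API, so the cost scales with the number of gene lists.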
