A comprehensive tool for running gene enrichment analysis workflows using multiple enrichment tools and AI-powered summarization
AI Gene Enrichment
A comprehensive Python package for performing gene enrichment analysis using multiple bioinformatics tools and AI-powered synthesis. This tool integrates Enrichr, ToppFun, gProfiler, literature search, and AI summarization to provide detailed biological insights from gene lists.
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC 4.0). This means you are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material
Under the following terms:
- Attribution — You must give appropriate credit and indicate if changes were made
- NonCommercial — You may not use the material for commercial purposes
For commercial licensing inquiries, please contact: cboyce3@mgh.harvard.edu
Installation
From PyPI (Recommended)
pip install ai-gene-enrichment
From Source
- Clone this repository:
git clone https://github.com/calvinrboyce/ai-gene-enrichment.git
cd ai-gene-enrichment
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install in development mode:
pip install -e .
- Set up your OpenAI API key:
# Create a .env file
echo "OPENAI_API_KEY=your_openai_api_key_here" > .env
Usage
AI synthesis requires an API key to access OpenAI's models. PIs or other researchers can obtain a key by setting up an organization at platform.openai.com and adding an API key at platform.openai.com/settings/organization/api-keys. Note that models more sophisticated than the default may require organization verification, and you may need to reach a higher usage tier with OpenAI. With the default model, you can expect to spend less than $0.03 per gene list.
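For budgeting a batch of analyses, the per-list figure above can be turned into a rough upper bound. The sketch below is only arithmetic on that number; `estimate_max_cost` is a hypothetical helper, not part of the package, and the bound applies to the default model only.

```python
# Upper bound quoted above: under $0.03 per gene list with the default model.
MAX_COST_PER_GENE_LIST_USD = 0.03


def estimate_max_cost(num_gene_lists: int,
                      cost_per_list: float = MAX_COST_PER_GENE_LIST_USD) -> float:
    """Return a rough upper bound (USD) on OpenAI spend for a batch of gene lists."""
    if num_gene_lists < 0:
        raise ValueError("num_gene_lists must be non-negative")
    return num_gene_lists * cost_per_list


# Example: one gene list per cluster for 40 clusters.
print(f"At most about ${estimate_max_cost(40):.2f}")
```

Actual spend depends on the model chosen and on how much enrichment and literature text is sent for synthesis, so treat the result as an estimate.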
Basic Usage
import os
import dotenv
from aige import AIGeneEnrichment
# Load environment variables
dotenv.load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
# Initialize the agent
aige = AIGeneEnrichment(openai_api_key)
# Define your gene list
genes = ["ANLN", "CENPF", "NUSAP1", "TOP2A", "CCNB1", "PRC1", "TPX2", "UBE2C", "BIRC5"]
# Run analysis
results = aige.run_analysis(
    genes=genes,
    email="your.email@example.com", # Required for NCBI literature search
    background_genes=[], # Optional background gene set for enrichment analysis
    ranked=True, # Set to True if genes are ranked by differential expression
    search_terms=["cancer", "metastasis", "cell cycle"], # Optional literature search terms
    context="This gene list characterizes a cluster of cells in a brain metastasis"
)
# Access results
print(results['summary'])
for theme in results['themes']:
    print(f"Theme: {theme['theme']}")
    print(f"Description: {theme['description']}")
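Because `ranked=True` treats the order of `genes` as meaningful, it can help to tidy a gene list without reordering it before calling `run_analysis`. The following is a minimal sketch; `clean_gene_list` is a hypothetical helper, not part of the package, and it leaves the case of each symbol untouched.

```python
from typing import Iterable, List


def clean_gene_list(genes: Iterable[str]) -> List[str]:
    """Strip whitespace, drop empty entries, and remove duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for gene in genes:
        symbol = gene.strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            cleaned.append(symbol)
    return cleaned


raw_genes = ["TOP2A", " CCNB1", "TOP2A", "", "BIRC5 "]
genes = clean_gene_list(raw_genes)
print(genes)
```

The cleaned list can then be passed as the `genes` argument shown above.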
Advanced Configuration
# Customize enrichment sources and parameters
aige = AIGeneEnrichment(
    open_ai_api_key=openai_api_key,
    open_ai_model="gpt-4o",
    results_dir="custom_results",
    enrichr_sources={
        "CellMarker_2024": "CellMarker",
        "MSigDB_Hallmark_2020": "MSigDB-H",
        "ChEA_2022": "ChEA"
    },
    gprofiler_sources=['HPA', 'MIRNA'],
    toppfun_sources=['TFBS', 'MicroRNA'],
    terms_per_source=20, # Number of terms to retrieve per source
    papers_per_gene=3, # Number of papers to retrieve per gene
    max_papers=15 # Number of papers to retrieve for full gene list
)
# Example with background gene set
background_genes = ["GENE1", "GENE2", "GENE3", ...] # Your background gene set
results = aige.run_analysis(
    genes=genes,
    email="your.email@example.com",
    background_genes=background_genes, # Use custom background for enrichment
    ranked=True,
    search_terms=["cancer", "metastasis"],
    context="Analysis with custom background gene set"
)
Background Gene Sets
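The agent supports custom background gene sets for enrichment analysis. Testing against the genes that could actually have been detected in your experiment, rather than a generic reference set, can make the enrichment statistics more accurate and the results more biologically relevant. When a background gene set is provided: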
- Enrichr: Uses the background genes as the reference set for statistical testing
- gProfiler: Uses the background genes as the custom background for enrichment calculations
- ToppFun: Currently uses default background (background genes not yet supported)
Note: When no background genes are provided (empty list), the tools use their default reference sets.
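Enrichment tests generally assume that every query gene is also in the background. The sketch below checks this before an analysis is run; `find_genes_missing_from_background` is a hypothetical helper, and whether the package performs a similar check internally is not stated here.

```python
from typing import List, Sequence


def find_genes_missing_from_background(genes: Sequence[str],
                                       background_genes: Sequence[str]) -> List[str]:
    """Return query genes that are absent from the background, in their original order."""
    background = set(background_genes)
    return [gene for gene in genes if gene not in background]


# Illustrative data only: a tiny query list and background.
genes = ["TOP2A", "CCNB1", "BIRC5"]
background_genes = ["TOP2A", "CCNB1", "ACTB", "GAPDH"]

missing = find_genes_missing_from_background(genes, background_genes)
if missing:
    print(f"Not in background: {missing}")
```

How to resolve a mismatch (adding the missing genes to the background or dropping them from the query) depends on how the background was defined for your experiment.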
Working with Results
The analysis returns a structured dictionary with the following components:
results = {
    'themes': [
        {
            'theme': 'Cell Cycle Regulation',
            'description': 'Genes involved in cell cycle control and progression...',
            'terms': ['GO:0007049', 'KEGG:04110', ...]
        }
    ],
    'summary': 'Comprehensive analysis summary...'
}
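A dictionary of this shape can be flattened into one row per theme and term pair, which is convenient for tables or downstream filtering. This is a sketch using only the standard library; the `results` literal is placeholder data that follows the structure above, and `themes_to_rows` is a hypothetical helper.

```python
import csv
import io

# Placeholder data shaped like the structure documented above.
results = {
    "themes": [
        {
            "theme": "Cell Cycle Regulation",
            "description": "Genes involved in cell cycle control and progression",
            "terms": ["GO:0007049", "KEGG:04110"],
        }
    ],
    "summary": "Placeholder summary",
}


def themes_to_rows(results):
    """Return one {'theme', 'term'} row per enrichment term of each theme."""
    rows = []
    for theme in results.get("themes", []):
        for term in theme.get("terms", []):
            rows.append({"theme": theme["theme"], "term": term})
    return rows


rows = themes_to_rows(results)

# Write the rows as CSV to an in-memory buffer; swap in open("themes.csv", "w", newline="") to save a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["theme", "term"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```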
Output Files
When save_results=True (the default), the agent writes the results to an Excel spreadsheet in results_dir.
Dependencies
- openai>=1.0.0: OpenAI API integration
- requests>=2.31.0: HTTP requests
- python-dotenv>=1.0.0: Environment variable management
- biopython>=1.82: PubMed literature search
- gprofiler-official>=1.0.0: gProfiler API client
- openpyxl>=3.1.5: Excel file generation
API Reference
AIGeneEnrichment
Main class for running gene enrichment analysis workflows.
Constructor Parameters
- open_ai_api_key (str): OpenAI API key (required)
- open_ai_model (str): OpenAI model to use (default: "gpt-4.1-mini")
- results_dir (str): Directory to save results (default: "aige_results")
- enrichr_sources (dict): Enrichr sources to use. The dictionary should map the name of each source (found in Enrichr's documentation) to an abbreviated version to be used in the analysis (default: {})
- gprofiler_sources (list): gProfiler sources to use. The list should include terms from ['TF', 'MIRNA', 'HPA', 'CORUM'] (default: [])
- toppfun_sources (list): ToppFun sources to use. The list should include terms from ['MousePheno', 'Domain', 'Pubmed', 'Cytoband', 'TFBS', 'GeneFamily', 'Coexpression', 'CoexpressionAtlas', 'ToppCell', 'Computational', 'MicroRNA', 'Drug', 'Disease'] (default: ['ToppCell'])
- terms_per_source (int): Number of terms per source (default: 20)
- num_papers (int): Number of papers to fetch from the literature search (default: 20)
Methods
run_analysis(genes, email, background_genes=[], ranked=True, search_terms=[], context="None", save_results=True, analysis_name=None)
Run the complete gene enrichment analysis workflow.
Parameters:
- genes (List[str]): List of gene symbols to analyze (required)
- email (str): Email address for NCBI's reference (required for literature search)
- background_genes (List[str]): List of background genes to use for enrichment analysis (default: [])
- ranked (bool): Whether the genes are ranked by differential expression (default: True)
- search_terms (List[str]): List of search terms to use in the literature search (default: [])
- context (str): Context of where the genes came from and what you're studying (default: "None")
- save_results (bool): Whether to save results to files (default: True)
- analysis_name (str): Name for the analysis run, used in file naming (default: None)
Returns:
Dict[str, Any]: Themed results dictionary containing:
- themes: List of identified biological themes, each with:
  - theme: Theme name
  - description: Description of the theme's function and identification rationale
  - confidence: Confidence score (0-1)
  - terms: List of enrichment terms that contributed to theme identification
- summary: Comprehensive analysis summary
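Since each theme carries a confidence score between 0 and 1, the returned dictionary can be filtered and ranked after the analysis. The sketch below uses placeholder data in the documented shape; `confident_themes` and the 0.7 threshold are illustrative assumptions, not part of the package.

```python
def confident_themes(results, min_confidence=0.7):
    """Return themes at or above min_confidence, highest confidence first.

    Themes without a 'confidence' key are treated as 0.0 and filtered out.
    """
    kept = [
        theme for theme in results.get("themes", [])
        if theme.get("confidence", 0.0) >= min_confidence
    ]
    return sorted(kept, key=lambda theme: theme["confidence"], reverse=True)


# Placeholder data shaped like the return value described above.
example_results = {
    "themes": [
        {"theme": "Theme A", "description": "...", "confidence": 0.75, "terms": []},
        {"theme": "Theme B", "description": "...", "confidence": 0.40, "terms": []},
        {"theme": "Theme C", "description": "...", "confidence": 0.90, "terms": []},
    ],
    "summary": "Placeholder summary",
}

for theme in confident_themes(example_results):
    print(f"{theme['theme']}: {theme['confidence']:.2f}")
```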