GeneInfo

A comprehensive Python package for retrieving detailed gene information from multiple public databases with robust error handling, batch processing capabilities, and modular architecture.

Features

GeneInfo provides access to comprehensive gene annotation data through a unified interface:

Core Gene Information

Basic gene data - Gene symbols, Ensembl IDs, descriptions, genomic coordinates, biotypes
Transcripts - All transcript variants with protein coding information and alternative splicing
Genomic location - Chromosome coordinates, strand information, gene boundaries

Functional Annotation

Protein domains - Domain architecture from UniProt with evidence codes
Gene Ontology - GO terms and annotations (Biological Process, Molecular Function, Cellular Component)
Pathways - Reactome pathway associations and pathway hierarchies
Protein interactions - Dual-source protein-protein interaction networks:
- BioGRID - Experimental evidence with PubMed references (requires API key)
- STRING-db - Computational predictions + experimental evidence (no API key required)

Evolutionary Information

Homologs - Paralogs and orthologs across species with similarity metrics
Cross-species mapping - Gene orthology relationships and conservation scores

Clinical & Disease Data

Clinical variants - ClinVar pathogenic and benign variants with clinical significance
GWAS associations - Genome-wide association study data from EBI GWAS Catalog
Disease phenotypes - OMIM disease associations and phenotypic descriptions

Advanced Features

Batch processing - Concurrent processing of large gene lists (1000+ genes)
API key management - Secure handling of NCBI Entrez, OMIM, and BioGRID API keys via environment variables or CLI arguments
Graceful degradation - Works without API keys with limited functionality (no ClinVar, OMIM, or BioGRID data)
Rate limiting - Built-in API courtesy delays and error handling
Rich CLI - Beautiful command-line interface with progress bars and tables
Export formats - JSON, CSV output with detailed and summary views
Real data only - No mock data fallbacks, returns null when data is inaccessible

Installation

Using uv (Recommended)

# Install from source
uv add git+https://github.com/chunjie-sam-liu/geneinfo.git

# Or clone and install locally
git clone https://github.com/chunjie-sam-liu/geneinfo.git
cd geneinfo
uv pip install -e .

Using pip

# Install from source
pip install git+https://github.com/chunjie-sam-liu/geneinfo.git

# Or clone and install locally
git clone https://github.com/chunjie-sam-liu/geneinfo.git
cd geneinfo
pip install -e .

Requirements

Python 3.11+
Internet connection for API access (offline mode available)

Quick Start

API Key Configuration

For accessing ClinVar (clinical variants), OMIM (phenotype data), and BioGRID (protein interactions), you'll need API keys:

Create a .env file in your project directory:

# API Keys for external services
OMIM_API_KEY="your_omim_api_key_here"
ENTREZ_API_KEY="your_entrez_api_key_here"
ENTREZ_EMAIL="your.email@example.com"
BIOGRID_API_KEY="your_biogrid_api_key_here"

Get API keys:
- OMIM API Key: Register at OMIM API
- Entrez API Key: Register at NCBI API
- BioGRID API Key: Register at BioGRID API

API key priority:
- CLI arguments (highest priority)
- Environment variables from .env file
- None (graceful degradation - returns null data)
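The priority order above can be sketched in a few lines. This is a minimal illustration, not the package's own loader: the helper name `resolve_api_key` and the choice to treat blank strings as missing are assumptions made for the example.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str],
    env_name: str,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first usable key: CLI value, then environment, then None."""
    for candidate in (cli_value, environ.get(env_name)):
        # Treat empty or whitespace-only strings as "not provided" (assumption).
        if candidate and candidate.strip():
            return candidate.strip()
    return None


# Example with an explicit mapping instead of the real environment.
fake_env = {"OMIM_API_KEY": "env-omim-key"}
print(resolve_api_key("cli-omim-key", "OMIM_API_KEY", fake_env))  # cli-omim-key
print(resolve_api_key(None, "OMIM_API_KEY", fake_env))            # env-omim-key
print(resolve_api_key(None, "ENTREZ_API_KEY", fake_env))          # None
```

Passing the environment as a mapping keeps the lookup easy to test without touching real credentials.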

Python API

from geneinfo import GeneInfo

# Option 1: Use environment variables (recommended)
# Create .env file with API keys (see above)
gene_info = GeneInfo()

# Option 2: Provide API keys explicitly
gene_info = GeneInfo(
    email="your.email@example.com",
    entrez_api_key="your_entrez_key",
    omim_api_key="your_omim_key",
    biogrid_api_key="your_biogrid_key"
)

# Option 3: Work without API keys (limited functionality)
gene_info = GeneInfo(
    email=None,
    entrez_api_key=None,
    omim_api_key=None,
    biogrid_api_key=None
)

# Get comprehensive information for a single gene
result = gene_info.get_gene_info("TP53")
print(f"Gene: {result['basic_info']['display_name']}")
print(f"Description: {result['basic_info']['description']}")
print(f"Chromosome: {result['basic_info']['seq_region_name']}")
print(f"Transcripts: {len(result['transcripts'])}")
print(f"GO terms: {len(result['gene_ontology'])}")
print(f"Pathways: {len(result['pathways'])}")
print(f"Protein interactions: {len(result['protein_interactions'])} (BioGRID + STRING-db)")
print(f"Clinical variants: {len(result['clinvar'])} (requires API key)")

# Batch process multiple genes with concurrent workers
genes = ["TP53", "BRCA1", "EGFR", "MYC", "KRAS"]
df = gene_info.get_batch_info(genes, max_workers=5)
print(df[['gene_symbol', 'chromosome', 'transcript_count', 'go_term_count']].head())

# Export detailed information to JSON
gene_info.export_detailed_info(genes, "detailed_results.json")

# Export to organized directory structure
gene_info.export_batch_to_directory(genes, "gene_data/", max_workers=5)

Advanced Usage

from geneinfo import GeneInfo

# Process large gene lists efficiently (one identifier per line, blank lines skipped)
with open("large_gene_list.txt") as f:
    gene_list = [line.strip() for line in f if line.strip()]

# Initialize with API keys for full functionality
gene_info = GeneInfo(
    email="researcher@university.edu",
    entrez_api_key="your_entrez_key",
    omim_api_key="your_omim_key",
    biogrid_api_key="your_biogrid_key"
)

# Batch processing with progress tracking
df = gene_info.get_batch_info(gene_list, max_workers=10)

# Filter successful results
successful = df[df['error'].isna()]
print(f"Successfully processed {len(successful)}/{len(gene_list)} genes")

# Access specific data types
for _, gene in successful.iterrows():
    detailed = gene_info.get_gene_info(gene['query'])

    # Protein domains
    if detailed['protein_domains']:
        print(f"\n{gene['gene_symbol']} protein domains:")
        for domain in detailed['protein_domains'][:3]:
            print(f"  - {domain['name']}: {domain['start']}-{domain['end']}")

    # Protein interactions (dual sources)
    if detailed['protein_interactions']:
        biogrid_interactions = [i for i in detailed['protein_interactions']
                              if i.get('source_database') == 'BioGRID']
        stringdb_interactions = [i for i in detailed['protein_interactions']
                               if i.get('source_database') == 'STRING-db']
        print(f"  - {len(biogrid_interactions)} BioGRID interactions (experimental)")
        print(f"  - {len(stringdb_interactions)} STRING-db interactions (computational)")

    # Clinical variants (requires Entrez API key)
    if detailed['clinvar']:
        pathogenic = [v for v in detailed['clinvar']
                     if 'pathogenic' in v.get('clinical_significance', '').lower()]
        print(f"  - {len(pathogenic)} pathogenic variants found")

# Working without API keys (limited functionality)
gene_info_limited = GeneInfo(
    entrez_api_key=None,
    omim_api_key=None,
    biogrid_api_key=None
)

# This will still work but return empty for clinical/phenotype data
result = gene_info_limited.get_gene_info("TP53")
print(f"Basic info available: {bool(result['basic_info'])}")
print(f"Protein interactions: {len(result['protein_interactions'])} (STRING-db only)")
print(f"Clinical variants: {len(result['clinvar'])} (empty without API key)")
print(f"OMIM phenotypes: {bool(result['phenotypes'])} (empty without API key)")
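One caveat about the pathogenic-variant filter in the example above: a plain substring test for 'pathogenic' also matches labels such as "Conflicting interpretations of pathogenicity". A stricter check might look like the sketch below. It assumes the `clinical_significance` field holds ClinVar-style labels, possibly joined by "/" or ","; the helper name `is_pathogenic` and that separator convention are assumptions, not documented behaviour of the package.

```python
# Labels counted as pathogenic in this sketch (lower-cased ClinVar terms).
PATHOGENIC_LABELS = {"pathogenic", "likely pathogenic"}


def is_pathogenic(clinical_significance):
    """True if any slash- or comma-separated label is (likely) pathogenic."""
    text = (clinical_significance or "").lower()
    labels = [part.strip() for part in text.replace(",", "/").split("/")]
    return any(label in PATHOGENIC_LABELS for label in labels)


samples = [
    "Pathogenic",
    "Pathogenic/Likely pathogenic",
    "Conflicting interpretations of pathogenicity",
    "Benign",
    None,  # missing value, e.g. when the source returned null
]
for value in samples:
    print(value, "->", is_pathogenic(value))
```

Matching whole labels rather than substrings keeps conflicting or uncertain classifications out of the pathogenic count.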

Command Line Interface

# Single gene information with rich output
geneinfo --gene TP53 --output tp53_info.json

# Using API keys via CLI arguments
geneinfo --gene TP53 --entrez-api-key YOUR_ENTREZ_KEY --omim-api-key YOUR_OMIM_KEY --biogrid-api-key YOUR_BIOGRID_KEY --output tp53_info.json

# Using environment variables (recommended - create .env file)
geneinfo --gene TP53 --output tp53_info.json

# Process multiple genes from file
geneinfo --file genes.txt --output results.csv

# Detailed information in JSON format
geneinfo --gene BRCA1 --detailed --output brca1_detailed.json

# Batch processing with custom workers and API keys
geneinfo --file large_gene_list.txt --workers 10 \
  --entrez-api-key YOUR_ENTREZ_KEY \
  --omim-api-key YOUR_OMIM_KEY \
  --biogrid-api-key YOUR_BIOGRID_KEY \
  --email your.email@example.com \
  --output batch_results.csv

# Export to organized directory structure
geneinfo --file genes.txt --output-dir gene_analysis/ --workers 8

# Verbose output for debugging
geneinfo --gene TP53 --verbose --detailed --output tp53_debug.json

# Process Ensembl IDs
geneinfo --gene ENSG00000141510 --output tp53_ensembl.json

# Species-specific queries (when supported)
geneinfo --gene TP53 --species human --output tp53_human.json

# Check CLI help for all options
geneinfo --help

CLI Output Examples

The CLI provides beautiful, formatted output with:

📊 Progress bars for batch processing
🎨 Colored tables for gene information display
⚡ Real-time processing statistics
📝 Summary reports with success/failure counts
🔍 Verbose logging for troubleshooting

Input Formats & Output

Supported Input Formats

The package accepts multiple gene identifier formats:

Gene symbols: TP53, BRCA1, EGFR (case-insensitive)
Ensembl Gene IDs: ENSG00000141510, ENSG00000012048
Mixed lists: Can process files containing both symbols and IDs
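To pre-check a mixed input file before submitting it, the two identifier types can be told apart with a regular expression. The sketch below is only an illustration of that idea: the function name `classify_identifier` is hypothetical, and the pattern assumes human Ensembl gene IDs ("ENSG" plus 11 digits, optionally followed by a ".version" suffix).

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optional ".version" suffix.
ENSEMBL_GENE_RE = re.compile(r"^ENSG\d{11}(\.\d+)?$", re.IGNORECASE)


def classify_identifier(raw):
    """Return (kind, value) where kind is "ensembl_id" or "symbol"."""
    token = raw.strip()
    if ENSEMBL_GENE_RE.match(token):
        # Normalize case and drop any version suffix.
        return "ensembl_id", token.upper().split(".")[0]
    # Symbols are returned as typed; some symbols legitimately mix case.
    return "symbol", token


mixed = ["TP53", "ensg00000141510", " BRCA1 ", "ENSG00000012048.5"]
for item in mixed:
    print(repr(item), "->", classify_identifier(item))
```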

Output Formats

Summary CSV Output

query,gene_symbol,ensembl_id,chromosome,start_pos,end_pos,strand,transcript_count,go_term_count,pathway_count,interaction_count,clinvar_count,error
TP53,TP53,ENSG00000141510,17,7668421,7687490,-1,12,87,23,71,1043,
BRCA1,BRCA1,ENSG00000012048,17,43044295,43170245,-1,27,34,15,45,892,
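The summary file is plain CSV, so it can be read with the standard library alone. The sketch below parses the two sample rows shown above, keeps rows whose `error` column is empty, and derives a gene length from the coordinates; the length calculation is an illustration, not a column the package writes.

```python
import csv
import io

# The sample rows from this README, embedded so the sketch runs on its own.
summary_csv = """\
query,gene_symbol,ensembl_id,chromosome,start_pos,end_pos,strand,transcript_count,go_term_count,pathway_count,interaction_count,clinvar_count,error
TP53,TP53,ENSG00000141510,17,7668421,7687490,-1,12,87,23,71,1043,
BRCA1,BRCA1,ENSG00000012048,17,43044295,43170245,-1,27,34,15,45,892,
"""

rows = list(csv.DictReader(io.StringIO(summary_csv)))

# An empty "error" field means the gene was processed successfully.
ok_rows = [row for row in rows if not row["error"]]

for row in ok_rows:
    # Coordinates are inclusive, hence the +1.
    length = int(row["end_pos"]) - int(row["start_pos"]) + 1
    print(row["gene_symbol"], f"chr{row['chromosome']}", length, int(row["clinvar_count"]))
```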

Detailed JSON Output

{
  "query": "TP53",
  "basic_info": {
    "id": "ENSG00000141510",
    "display_name": "TP53",
    "description": "tumor protein p53",
    "seq_region_name": "17",
    "start": 7668421,
    "end": 7687490,
    "strand": -1,
    "biotype": "protein_coding"
  },
  "transcripts": [...],
  "protein_domains": [...],
  "gene_ontology": [...],
  "pathways": [...],
  "protein_interactions": [...],
  "paralogs": [...],
  "orthologs": [...],
  "clinvar": [...],
  "gwas": {...}
}
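Because sections can come back as null when a source is inaccessible, code that consumes the detailed JSON should not assume every key holds a list. The sketch below works on a trimmed copy of the record above; the `count_items` helper is hypothetical, and the null `clinvar` value is set by hand to illustrate the case.

```python
import json

# Trimmed record in the shape shown above; "clinvar" is null on purpose.
record = json.loads("""
{
  "query": "TP53",
  "basic_info": {
    "id": "ENSG00000141510",
    "display_name": "TP53",
    "seq_region_name": "17",
    "start": 7668421,
    "end": 7687490,
    "strand": -1
  },
  "transcripts": [],
  "clinvar": null
}
""")


def count_items(record, key):
    """Length of a section, treating a missing or null section as zero."""
    return len(record.get(key) or [])


info = record["basic_info"]
strand = "+" if info["strand"] == 1 else "-"
locus = f"chr{info['seq_region_name']}:{info['start']}-{info['end']} ({strand})"
print(locus)
print(count_items(record, "clinvar"), count_items(record, "pathways"))
```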

Directory Export Structure

gene_data/
├── summary.csv              # Overview of all processed genes
├── TP53_ENSG00000141510.json
├── BRCA1_ENSG00000012048.json
└── EGFR_ENSG00000146648.json
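The per-gene files above appear to follow a `SYMBOL_ENSEMBLID.json` pattern. Assuming that pattern holds, file names can be built and parsed as sketched below; both helper names are hypothetical and the package may name files differently in edge cases.

```python
from pathlib import Path


def export_filename(gene_symbol, ensembl_id):
    """Build a per-gene file name in the SYMBOL_ENSEMBLID.json pattern."""
    return f"{gene_symbol}_{ensembl_id}.json"


def parse_export_filename(path):
    """Split 'SYMBOL_ENSEMBLID.json' back into (symbol, ensembl_id)."""
    # rpartition splits on the last underscore, so symbols containing
    # underscores are kept intact.
    symbol, _, ensembl_id = Path(path).stem.rpartition("_")
    return symbol, ensembl_id


name = export_filename("TP53", "ENSG00000141510")
print(name)
print(parse_export_filename(Path("gene_data") / name))
```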

Data Sources & Architecture

Primary Data Sources

🧬 Ensembl - Gene annotation, transcripts, genomic coordinates, homologs
🔬 UniProt - Protein domains, functional annotations, protein features
🎯 Gene Ontology - GO term annotations and functional classifications
🛤️ Reactome - Biological pathways and pathway hierarchies
🏥 ClinVar - Clinical variant classifications and disease associations
🧪 EBI GWAS Catalog - Genome-wide association study results
💊 OMIM - Mendelian disorders and phenotype-genotype relationships
📚 MyGene.info - Enhanced gene annotation aggregation
🔗 BioGRID - Experimental protein-protein interactions with evidence
🌐 STRING-db - Computational + experimental protein interaction networks

Modular Fetcher Architecture

The package uses a modular design with specialized fetchers:

# Genomic data fetchers
from geneinfo.fetchers.genomic import EnsemblFetcher, MyGeneFetcher

# Protein data fetchers
from geneinfo.fetchers.protein import UniProtFetcher, StringDBFetcher, BioGRIDFetcher

# Functional annotation fetchers
from geneinfo.fetchers.functional import GOFetcher, ReactomeFetcher

# Clinical data fetchers
from geneinfo.fetchers.clinical import ClinVarFetcher, GwasFetcher, OMIMFetcher
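To make the modular idea concrete, here is a minimal sketch of how specialized fetchers could share one interface and be combined into a single record. None of these classes exist in the package: `Fetcher`, the two stub fetchers, and `assemble_record` are hypothetical names, and the stubs return canned values instead of calling a remote API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Fetcher(ABC):
    """Hypothetical common interface: one section name, one fetch method."""

    section: str

    @abstractmethod
    def fetch(self, gene: str) -> Any:
        """Return this source's data for one gene."""


class StubBasicInfoFetcher(Fetcher):
    section = "basic_info"

    def fetch(self, gene: str) -> Any:
        # A real fetcher would query a remote database here.
        return {"display_name": gene}


class StubPathwayFetcher(Fetcher):
    section = "pathways"

    def fetch(self, gene: str) -> Any:
        return []


def assemble_record(gene: str, fetchers: List[Fetcher]) -> Dict[str, Any]:
    """Merge each fetcher's output into one record keyed by section name."""
    record: Dict[str, Any] = {"query": gene}
    for fetcher in fetchers:
        record[fetcher.section] = fetcher.fetch(gene)
    return record


record = assemble_record("TP53", [StubBasicInfoFetcher(), StubPathwayFetcher()])
print(record)
```

With this shape, supporting a new data source means adding one class rather than touching the aggregation code.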

Robust Error Handling

🔄 Graceful degradation - Returns null data when APIs are unavailable or API keys missing
⏱️ Rate limiting with respectful API usage
🛡️ SSL/TLS handling for various certificate configurations
📝 Comprehensive logging with different verbosity levels
🔍 Input validation for gene symbols and Ensembl IDs
🔑 API key management - Secure environment variable handling
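Graceful degradation of this kind is commonly implemented by wrapping each source call so that a missing key or a failed request yields None instead of an exception. The sketch below shows that pattern under stated assumptions: `fetch_or_none`, its parameters, and the logger name are all hypothetical and do not describe the package's internals.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("geneinfo_sketch")


def fetch_or_none(
    label: str,
    fetch: Callable[[str], Any],
    gene: str,
    api_key: Optional[str] = None,
    needs_key: bool = False,
) -> Optional[Any]:
    """Run one fetch; return None if the key is missing or the call fails."""
    if needs_key and not api_key:
        logger.info("%s skipped for %s: no API key", label, gene)
        return None
    try:
        return fetch(gene)
    except Exception as exc:  # broad on purpose: any source failure degrades
        logger.warning("%s failed for %s: %s", label, gene, exc)
        return None


def broken_source(gene):
    raise TimeoutError("simulated network timeout")


print(fetch_or_none("ClinVar", lambda g: [], "TP53", needs_key=True))  # None
print(fetch_or_none("Reactome", broken_source, "TP53"))                 # None
print(fetch_or_none("Ensembl", lambda g: {"id": g}, "TP53"))            # {'id': 'TP53'}
```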

Performance & Usage Examples

Performance Characteristics

Throughput: ~100-500 genes/minute (network dependent)
Concurrency: Configurable worker threads (default: 5, max recommended: 10)
Memory: Efficient streaming processing for large gene lists
Rate limiting: Built-in delays to respect API usage policies
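The interplay between worker threads and courtesy delays can be sketched with the standard library. This is an illustration of the general technique (a thread pool sharing one minimum-interval limiter), not the package's implementation; `MinIntervalLimiter`, `process_genes`, and the 0.05 second interval are assumptions chosen for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class MinIntervalLimiter:
    """Let at most one call start every `min_interval` seconds, across threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free start slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            start_at = max(now, self._next_allowed)
            self._next_allowed = start_at + self.min_interval
        delay = start_at - now
        if delay > 0:
            time.sleep(delay)


def process_genes(genes, fetch, max_workers=5, min_interval=0.05):
    """Fetch all genes concurrently while spacing out request starts."""
    limiter = MinIntervalLimiter(min_interval)

    def task(gene):
        limiter.wait()
        return fetch(gene)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order.
        return list(pool.map(task, genes))


results = process_genes(["TP53", "BRCA1", "EGFR"], lambda g: {"query": g})
print(results)
```

Because the limiter is shared, adding workers beyond a certain point stops increasing throughput, which is consistent with the advice to keep the worker count modest.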

Real-world Usage Examples

Cancer Gene Panel Analysis

from geneinfo import GeneInfo

# Process a cancer gene panel with API keys for clinical data
cancer_genes = ["TP53", "BRCA1", "BRCA2", "EGFR", "KRAS", "PIK3CA", "AKT1"]
gene_info = GeneInfo(
    email="researcher@university.edu",
    entrez_api_key="your_entrez_key",
    omim_api_key="your_omim_key",
    biogrid_api_key="your_biogrid_key"
)

results = gene_info.get_batch_info(cancer_genes)
# Filter for genes with clinical variants (requires Entrez API key)
cancer_variants = results[results['clinvar_count'] > 0]
print(f"Found clinical variants in {len(cancer_variants)} cancer genes")

# Analyze protein interaction networks
for gene in cancer_genes:
    detailed = gene_info.get_gene_info(gene)
    interactions = detailed['protein_interactions']
    if interactions:
        biogrid_count = sum(1 for i in interactions if i.get('source_database') == 'BioGRID')
        stringdb_count = sum(1 for i in interactions if i.get('source_database') == 'STRING-db')
        print(f"{gene}: {biogrid_count} experimental + {stringdb_count} predicted interactions")

Pathway Enrichment Preprocessing

# Prepare data for pathway analysis (reuses the gene_info instance created earlier)
gene_list = ["TP53", "MDM2", "CDKN1A", "BAX", "BBC3"]  # p53 pathway genes
detailed_results = [gene_info.get_gene_info(gene) for gene in gene_list]

# Extract GO terms for enrichment analysis
all_go_terms = []
for result in detailed_results:
    for go_term in result['gene_ontology']:
        all_go_terms.append({
            'gene': result['query'],
            'go_id': go_term['go_id'],
            'go_name': go_term['go_name'],
            'namespace': go_term['namespace']
        })
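Once the GO rows are flattened like this, simple summaries are often useful before running an enrichment tool. The sketch below counts annotations per namespace and lists terms shared by more than one gene. The rows are illustrative stand-ins in the same shape, not real query output, and the namespace strings are an assumption about how the package labels them.

```python
from collections import Counter, defaultdict

# Illustrative rows in the shape built above; not real query output.
all_go_terms = [
    {"gene": "TP53", "go_id": "GO:0006915", "go_name": "apoptotic process", "namespace": "biological_process"},
    {"gene": "BAX", "go_id": "GO:0006915", "go_name": "apoptotic process", "namespace": "biological_process"},
    {"gene": "TP53", "go_id": "GO:0003677", "go_name": "DNA binding", "namespace": "molecular_function"},
    {"gene": "MDM2", "go_id": "GO:0005634", "go_name": "nucleus", "namespace": "cellular_component"},
    {"gene": "TP53", "go_id": "GO:0005634", "go_name": "nucleus", "namespace": "cellular_component"},
]

# Annotation rows per GO namespace.
namespace_counts = Counter(row["namespace"] for row in all_go_terms)

# Genes annotated with each term.
genes_by_term = defaultdict(set)
for row in all_go_terms:
    genes_by_term[row["go_id"]].add(row["gene"])

# Terms shared by at least two genes in the list.
shared_terms = {
    go_id: sorted(genes)
    for go_id, genes in genes_by_term.items()
    if len(genes) > 1
}

print(dict(namespace_counts))
print(shared_terms)
```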

Large-scale Genomics Project

# Process GWAS significant genes (thousands of genes)
with open("gwas_significant_genes.txt") as f:
    gwas_genes = [line.strip() for line in f if line.strip()]  # 5000+ genes, blank lines skipped

# Process in batches with progress tracking
gene_info.export_batch_to_directory(
    gwas_genes,
    "gwas_gene_annotation/",
    max_workers=8
)
# Creates organized directory with individual files + summary

Development & Testing

Running Tests

# Install development dependencies
uv add --dev pytest pytest-cov pytest-asyncio

# Run test suite
uv run pytest

# Run with coverage
uv run pytest --cov=geneinfo --cov-report=html

Project Structure

geneinfo/
├── geneinfo/
│   ├── __init__.py          # Main package exports
│   ├── core.py              # GeneInfo main class
│   ├── cli.py               # Command-line interface
│   ├── mock_data.py         # Fallback data for offline mode
│   └── fetchers/            # Modular data fetchers
│       ├── base.py          # Base fetcher with common functionality
│       ├── genomic.py       # Ensembl, MyGene fetchers
│       ├── protein.py       # UniProt, STRING-db, BioGRID fetchers
│       ├── functional.py    # GO, Reactome fetchers
│       └── clinical.py      # ClinVar, GWAS, OMIM fetchers
├── tests/                   # Comprehensive test suite
├── examples/                # Usage examples and demos
├── docs/                    # Documentation (you are here!)
└── pyproject.toml          # Modern Python packaging

Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Follow the coding standards in .github/copilot-instructions.md
Add tests for new functionality
Run the test suite: uv run pytest
Submit a pull request

Dependencies & Requirements

Core Dependencies

Python 3.11+ - Modern Python features and type hints
requests - HTTP client for API calls
pandas - Data manipulation and analysis
numpy - Numerical computing
typer - CLI framework with rich features
rich - Beautiful terminal output and progress bars
biopython - Bioinformatics tools (for Entrez/ClinVar)
mygene - Enhanced gene annotation client
python-dotenv - Environment variable management for API keys

System Requirements

Internet connection for API access
API keys for full functionality (NCBI Entrez, OMIM, BioGRID)
Sufficient memory for large gene lists (typically <1GB for 10,000 genes)
Email address for ClinVar/NCBI Entrez access (required when using API keys)

Troubleshooting

Common Issues

API Access Problems

# Test API connectivity
geneinfo --gene TP53 --verbose

# Working without API keys (limited functionality): omit the key options
# and make sure no .env file in the working directory supplies them
geneinfo --gene TP53 --output results.json

API Key Configuration

# Check if API keys are being loaded correctly
geneinfo --gene TP53 --verbose

# Set API keys via environment variables (recommended)
echo 'ENTREZ_API_KEY="your_key_here"' > .env
echo 'OMIM_API_KEY="your_key_here"' >> .env
echo 'BIOGRID_API_KEY="your_key_here"' >> .env
echo 'ENTREZ_EMAIL="your.email@example.com"' >> .env

# Or pass via CLI
geneinfo --gene TP53 --entrez-api-key YOUR_ENTREZ_KEY --omim-api-key YOUR_OMIM_KEY --biogrid-api-key YOUR_BIOGRID_KEY --email your@email.com

Large Gene List Processing

# For very large lists, reduce concurrent workers
geneinfo --file huge_gene_list.txt --workers 3 --output results.csv

# Process in smaller batches if memory is limited
split -l 1000 huge_gene_list.txt batch_

Getting Help

📖 Check the examples/ directory for usage patterns
🐛 Report issues on GitHub with verbose output logs
💬 Include gene lists and error messages in bug reports
📧 Use --verbose flag for detailed debugging information

License & Citation

License

MIT License - see LICENSE file for details.

Citation

If you use GeneInfo in your research, please cite:

@software{geneinfo2025,
  author = {Liu, Chunjie},
  title = {GeneInfo: Comprehensive Gene Information Retrieval},
  url = {https://github.com/chunjie-sam-liu/geneinfo},
  version = {0.1.0},
  year = {2025}
}

Acknowledgments

This package aggregates data from multiple public biological databases. Please also cite the original data sources in your publications:

Ensembl: Cunningham et al. (2022) Nucleic Acids Research
UniProt: The UniProt Consortium (2023) Nucleic Acids Research
Gene Ontology: Aleksander et al. (2023) Genetics
Reactome: Gillespie et al. (2022) Nucleic Acids Research
ClinVar: Landrum et al. (2020) Nucleic Acids Research
BioGRID: Oughtred et al. (2021) Nucleic Acids Research
STRING: Szklarczyk et al. (2023) Nucleic Acids Research
GWAS Catalog: Sollis et al. (2023) Nucleic Acids Research

Author: Chunjie Liu
Contact: chunjie.sam.liu.at.gmail.com
Version: 0.1.0
Date: 2025-08-06

Release history

0.3.2 - Aug 8, 2025 (this version)
0.3.1 - Aug 8, 2025
0.3.0 - Aug 7, 2025
0.1.0 - Aug 7, 2025

File details

Details for the file genesummary-0.3.2.tar.gz.

File metadata

Download URL: genesummary-0.3.2.tar.gz
Upload date: Aug 8, 2025
Size: 148.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for genesummary-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`0bac7fdb2ecfb466205fefd57130e087c6bb198b7b784fb57f5eedfded5757c0`
MD5	`c45404944271cb1d82988c363088afbb`
BLAKE2b-256	`507fe50e556d650da5ffc65fc44a7b3a65fddb205307609163b6d98e3ffca92b`

Provenance

The following attestation bundles were made for genesummary-0.3.2.tar.gz:

Publisher: python-package.yml on chunjie-sam-liu/geneinfo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: genesummary-0.3.2.tar.gz
- Subject digest: 0bac7fdb2ecfb466205fefd57130e087c6bb198b7b784fb57f5eedfded5757c0
- Sigstore transparency entry: 368133912
- Sigstore integration time: Aug 8, 2025
Source repository:
- Permalink: chunjie-sam-liu/geneinfo@60e9ee019bee9693c211de70fa0cdd746882b40d
- Branch / Tag: refs/tags/v0.3.2
- Owner: https://github.com/chunjie-sam-liu
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@60e9ee019bee9693c211de70fa0cdd746882b40d
- Trigger Event: release

File details

Details for the file genesummary-0.3.2-py3-none-any.whl.

File metadata

Download URL: genesummary-0.3.2-py3-none-any.whl
Upload date: Aug 8, 2025
Size: 33.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for genesummary-0.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ed5c84b820db47d147b832f667957ba0cb78184b6e749d08c904924daf32e712`
MD5	`357303b890f22834b8af271e900fc5aa`
BLAKE2b-256	`fc9aed35f3ff69417c85c7fd436c430271c5769b411630b360ccaf6a3f88a872`

Provenance

The following attestation bundles were made for genesummary-0.3.2-py3-none-any.whl:

Publisher: python-package.yml on chunjie-sam-liu/geneinfo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: genesummary-0.3.2-py3-none-any.whl
- Subject digest: ed5c84b820db47d147b832f667957ba0cb78184b6e749d08c904924daf32e712
- Sigstore transparency entry: 368133965
- Sigstore integration time: Aug 8, 2025
Source repository:
- Permalink: chunjie-sam-liu/geneinfo@60e9ee019bee9693c211de70fa0cdd746882b40d
- Branch / Tag: refs/tags/v0.3.2
- Owner: https://github.com/chunjie-sam-liu
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@60e9ee019bee9693c211de70fa0cdd746882b40d
- Trigger Event: release
