GeneInfo

A comprehensive Python package for retrieving detailed gene information from multiple public databases with robust error handling, batch processing capabilities, and modular architecture.

Features

GeneInfo provides access to comprehensive gene annotation data through a unified interface:

Core Gene Information

  • Basic gene data - Gene symbols, Ensembl IDs, descriptions, genomic coordinates, biotypes
  • Transcripts - All transcript variants with protein coding information and alternative splicing
  • Genomic location - Chromosome coordinates, strand information, gene boundaries

Functional Annotation

  • Protein domains - Domain architecture from UniProt with evidence codes
  • Gene Ontology - GO terms and annotations (Biological Process, Molecular Function, Cellular Component)
  • Pathways - Reactome pathway associations and pathway hierarchies
  • Protein interactions - Protein-protein interaction networks from STRING-db

Evolutionary Information

  • Homologs - Paralogs and orthologs across species with similarity metrics
  • Cross-species mapping - Gene orthology relationships and conservation scores

Clinical & Disease Data

  • Clinical variants - ClinVar pathogenic and benign variants with clinical significance
  • GWAS associations - Genome-wide association study data from EBI GWAS Catalog
  • Disease phenotypes - OMIM disease associations and phenotypic descriptions

Advanced Features

  • Batch processing - Concurrent processing of large gene lists (1000+ genes)
  • Offline mode - Mock data fallback when external APIs are unavailable
  • Rate limiting - Built-in API courtesy delays and error handling
  • Rich CLI - Beautiful command-line interface with progress bars and tables
  • Export formats - JSON, CSV output with detailed and summary views

Installation

Using uv (Recommended)

# Install from source
uv add git+https://github.com/chunjie-sam-liu/geneinfo.git

# Or clone and install locally
git clone https://github.com/chunjie-sam-liu/geneinfo.git
cd geneinfo
uv pip install -e .

Using pip

# Install from source
pip install git+https://github.com/chunjie-sam-liu/geneinfo.git

# Or clone and install locally
git clone https://github.com/chunjie-sam-liu/geneinfo.git
cd geneinfo
pip install -e .

Requirements

  • Python 3.11+
  • Internet connection for API access (offline mode available)

Quick Start

Python API

from geneinfo import GeneInfo

# Initialize with species and email for clinical data
gene_info = GeneInfo(species="human", email="your.email@example.com")

# Get comprehensive information for a single gene
result = gene_info.get_gene_info("TP53")
print(f"Gene: {result['basic_info']['display_name']}")
print(f"Description: {result['basic_info']['description']}")
print(f"Chromosome: {result['basic_info']['seq_region_name']}")
print(f"Transcripts: {len(result['transcripts'])}")
print(f"GO terms: {len(result['gene_ontology'])}")
print(f"Pathways: {len(result['pathways'])}")
print(f"Clinical variants: {len(result['clinvar'])}")

# Batch process multiple genes with concurrent workers
genes = ["TP53", "BRCA1", "EGFR", "MYC", "KRAS"]
df = gene_info.get_batch_info(genes, max_workers=5)
print(df[['gene_symbol', 'chromosome', 'transcript_count', 'go_term_count']].head())

# Export detailed information to JSON
gene_info.export_detailed_info(genes, "detailed_results.json")

# Export to organized directory structure
gene_info.export_batch_to_directory(genes, "gene_data/", max_workers=5)

Advanced Usage

# Process large gene lists efficiently
with open("large_gene_list.txt") as f:
    gene_list = [line.strip() for line in f if line.strip()]

# Batch processing with progress tracking
df = gene_info.get_batch_info(gene_list, max_workers=10)

# Filter successful results
successful = df[df['error'].isna()]
print(f"Successfully processed {len(successful)}/{len(gene_list)} genes")

# Access specific data types
for _, gene in successful.iterrows():
    detailed = gene_info.get_gene_info(gene['query'])

    # Protein domains
    if detailed['protein_domains']:
        print(f"\n{gene['gene_symbol']} protein domains:")
        for domain in detailed['protein_domains'][:3]:
            print(f"  - {domain['name']}: {domain['start']}-{domain['end']}")

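    # Clinical variants
    if detailed['clinvar']:
        # Keep "Pathogenic" / "Likely pathogenic", but skip labels such as
        # "Conflicting ... of pathogenicity", which also contain "pathogenic"
        pathogenic = [
            v for v in detailed['clinvar']
            if 'pathogenic' in (sig := v.get('clinical_significance', '').lower())
            and 'conflicting' not in sig
        ]
        print(f"  - {len(pathogenic)} pathogenic variants found")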

Command Line Interface

# Single gene information with rich output
geneinfo --gene TP53 --output tp53_info.json

# Process multiple genes from file
geneinfo --file genes.txt --output results.csv

# Detailed information in JSON format
geneinfo --gene BRCA1 --detailed --output brca1_detailed.json

# Batch processing with custom workers and email for clinical data
geneinfo --file large_gene_list.txt --workers 10 --email your.email@example.com --output batch_results.csv

# Export to organized directory structure
geneinfo --file genes.txt --output-dir gene_analysis/ --workers 8

# Verbose output for debugging
geneinfo --gene TP53 --verbose --detailed --output tp53_debug.json

# Process Ensembl IDs
geneinfo --gene ENSG00000141510 --output tp53_ensembl.json

# Species-specific queries (when supported)
geneinfo --gene TP53 --species human --output tp53_human.json

CLI Output Examples

The CLI provides beautiful, formatted output with:

  • Progress bars for batch processing
  • Colored tables for gene information display
  • Real-time processing statistics
  • Summary reports with success/failure counts
  • Verbose logging for troubleshooting

Input Formats & Output

Supported Input Formats

The package accepts multiple gene identifier formats:

  • Gene symbols: TP53, BRCA1, EGFR (case-insensitive)
  • Ensembl Gene IDs: ENSG00000141510, ENSG00000012048
  • Mixed lists: Can process files containing both symbols and IDs
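
When preparing a mixed input file, it can help to check which tokens look like Ensembl gene IDs and which will be treated as symbols. The following is a minimal sketch, not GeneInfo's own validator: the `classify_identifier` helper and the regular expression are assumptions for illustration, based on the human gene ID pattern (`ENSG` followed by 11 digits, with an optional version suffix).

```python
import re

# Assumption: human Ensembl gene IDs are "ENSG" + 11 digits, optionally
# followed by a ".N" version suffix. This is not the package's validator.
ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}(\.\d+)?$", re.IGNORECASE)


def classify_identifier(identifier: str) -> str:
    """Return 'ensembl_id', 'symbol', or 'empty' for a raw input token."""
    token = identifier.strip()
    if not token:
        return "empty"
    if ENSEMBL_GENE_ID.match(token):
        return "ensembl_id"
    return "symbol"


mixed = ["tp53", "ENSG00000141510", " BRCA1 ", ""]
for raw in mixed:
    print(repr(raw), "->", classify_identifier(raw))
```

Blank lines are reported as `empty` so they can be dropped before a batch run.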

Output Formats

Summary CSV Output

query,gene_symbol,ensembl_id,chromosome,start_pos,end_pos,strand,transcript_count,go_term_count,pathway_count,interaction_count,clinvar_count,error
TP53,TP53,ENSG00000141510,17,7668421,7687490,-1,12,87,23,156,1043,
BRCA1,BRCA1,ENSG00000012048,17,43044295,43170245,-1,27,34,15,89,892,
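
The summary file can be post-processed with the standard library alone. The sketch below embeds the two rows shown above so that it runs without any files; the `load_summary` helper and the choice of which columns to convert to integers are assumptions for illustration, not part of the package.

```python
import csv
import io

# The header and the two data rows shown above, embedded for a self-contained run.
SUMMARY_CSV = """\
query,gene_symbol,ensembl_id,chromosome,start_pos,end_pos,strand,transcript_count,go_term_count,pathway_count,interaction_count,clinvar_count,error
TP53,TP53,ENSG00000141510,17,7668421,7687490,-1,12,87,23,156,1043,
BRCA1,BRCA1,ENSG00000012048,17,43044295,43170245,-1,27,34,15,89,892,
"""

# Chromosome is deliberately left as text, since names such as "X" or "MT" are not numeric.
INT_COLUMNS = [
    "start_pos", "end_pos", "strand", "transcript_count", "go_term_count",
    "pathway_count", "interaction_count", "clinvar_count",
]


def load_summary(text):
    """Parse summary CSV text into a list of dicts with numeric columns converted."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in INT_COLUMNS:
            row[col] = int(row[col]) if row[col] else None
        rows.append(row)
    return rows


rows = load_summary(SUMMARY_CSV)
failed = [r["query"] for r in rows if r["error"]]  # non-empty error column
most_variants = max(rows, key=lambda r: r["clinvar_count"] or 0)
print(f"{len(rows)} genes, {len(failed)} failed")
print("Most ClinVar entries:", most_variants["gene_symbol"])
```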

Detailed JSON Output

{
  "query": "TP53",
  "basic_info": {
    "id": "ENSG00000141510",
    "display_name": "TP53",
    "description": "tumor protein p53",
    "seq_region_name": "17",
    "start": 7668421,
    "end": 7687490,
    "strand": -1,
    "biotype": "protein_coding"
  },
  "transcripts": [...],
  "protein_domains": [...],
  "gene_ontology": [...],
  "pathways": [...],
  "protein_interactions": [...],
  "paralogs": [...],
  "orthologs": [...],
  "clinvar": [...],
  "gwas": {...}
}
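
A detailed record can be flattened into a few commonly needed fields. This is a sketch under stated assumptions: the `summarize` helper is hypothetical, the record is a trimmed copy of the one above with empty lists standing in for the elided sections, and coordinates are treated as inclusive.

```python
import json

# Trimmed copy of the record above; empty lists stand in for the elided sections.
RAW = """
{
  "query": "TP53",
  "basic_info": {
    "id": "ENSG00000141510",
    "display_name": "TP53",
    "description": "tumor protein p53",
    "seq_region_name": "17",
    "start": 7668421,
    "end": 7687490,
    "strand": -1,
    "biotype": "protein_coding"
  },
  "transcripts": [],
  "clinvar": []
}
"""


def summarize(record):
    """Flatten a detailed record into a small dict of headline fields."""
    info = record["basic_info"]
    strand = "+" if info["strand"] == 1 else "-"
    return {
        "symbol": info["display_name"],
        "ensembl_id": info["id"],
        "location": f'{info["seq_region_name"]}:{info["start"]}-{info["end"]} ({strand})',
        "length_bp": info["end"] - info["start"] + 1,  # inclusive coordinates assumed
        # .get() tolerates sections that are missing from a given record
        "n_transcripts": len(record.get("transcripts", [])),
        "n_clinvar": len(record.get("clinvar", [])),
    }


summary = summarize(json.loads(RAW))
print(summary)
```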

Directory Export Structure

gene_data/
├── summary.csv              # Overview of all processed genes
├── TP53_ENSG00000141510.json
├── BRCA1_ENSG00000012048.json
└── EGFR_ENSG00000146648.json
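
Because each per-gene file is named SYMBOL_ENSEMBLID.json, an exported directory can be indexed from file names alone. The sketch below builds a throwaway directory that mimics this layout and then indexes it; `index_export_dir` is a hypothetical helper, and the file contents written here are placeholders rather than real GeneInfo output.

```python
import json
import tempfile
from pathlib import Path


def index_export_dir(directory):
    """Map gene symbol -> (Ensembl ID, path) for every SYMBOL_ENSEMBLID.json file."""
    index = {}
    for path in sorted(Path(directory).glob("*_*.json")):
        # Split on the last underscore so symbols containing "_" stay intact.
        symbol, _, ensembl_id = path.stem.rpartition("_")
        index[symbol] = (ensembl_id, path)
    return index


# Build a temporary directory that mimics the export layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("TP53_ENSG00000141510", "BRCA1_ENSG00000012048"):
        placeholder = {"query": name.split("_")[0]}  # placeholder content only
        Path(tmp, f"{name}.json").write_text(json.dumps(placeholder))
    Path(tmp, "summary.csv").write_text("query,gene_symbol\n")

    index = index_export_dir(tmp)
    for symbol, (ensembl_id, path) in index.items():
        record = json.loads(path.read_text())
        print(symbol, ensembl_id, record["query"])
```

The glob pattern skips summary.csv, so only per-gene files end up in the index.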

Data Sources & Architecture

Primary Data Sources

  • Ensembl - Gene annotation, transcripts, genomic coordinates, homologs
  • UniProt - Protein domains, functional annotations, protein features
  • Gene Ontology - GO term annotations and functional classifications
  • Reactome - Biological pathways and pathway hierarchies
  • ClinVar - Clinical variant classifications and disease associations
  • STRING-db - Protein-protein interaction networks and evidence
  • EBI GWAS Catalog - Genome-wide association study results
  • OMIM - Mendelian disorders and phenotype-genotype relationships
  • MyGene.info - Enhanced gene annotation aggregation

Modular Fetcher Architecture

The package uses a modular design with specialized fetchers:

# Genomic data fetchers
from geneinfo.fetchers.genomic import EnsemblFetcher, MyGeneFetcher

# Protein data fetchers
from geneinfo.fetchers.protein import UniProtFetcher, StringDBFetcher

# Functional annotation fetchers
from geneinfo.fetchers.functional import GOFetcher, ReactomeFetcher

# Clinical data fetchers
from geneinfo.fetchers.clinical import ClinVarFetcher, GwasFetcher, OMIMFetcher

Robust Error Handling

  • Automatic fallback to mock data when APIs are unavailable
  • Rate limiting with respectful API usage
  • SSL/TLS handling for various certificate configurations
  • Comprehensive logging with different verbosity levels
  • Input validation for gene symbols and Ensembl IDs
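
The fallback behaviour listed above can be pictured as a retry loop that gives up and returns canned data. This is only an illustrative sketch of the general pattern, not the package's implementation: `fetch_with_fallback`, `MOCK_DATA`, and `always_down` are hypothetical names, and the retry count and backoff are assumptions.

```python
import time

# Hypothetical stand-in for bundled fallback records.
MOCK_DATA = {"TP53": {"display_name": "TP53", "source": "mock"}}


def fetch_with_fallback(symbol, fetch, retries=2, delay=0.0):
    """Call fetch(symbol); after repeated network failures return mock data or None."""
    for attempt in range(retries + 1):
        try:
            return fetch(symbol)
        except (ConnectionError, TimeoutError):
            if attempt < retries:
                time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    return MOCK_DATA.get(symbol)


def always_down(symbol):
    """Simulate an unreachable API."""
    raise ConnectionError("API unavailable")


result = fetch_with_fallback("TP53", always_down)
print(result)
```

Only network-style exceptions trigger the fallback here, so programming errors in a fetcher would still surface instead of being masked by mock data.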

Performance & Usage Examples

Performance Characteristics

  • Throughput: ~100-500 genes/minute (network dependent)
  • Concurrency: Configurable worker threads (default: 5, max recommended: 10)
  • Memory: Efficient streaming processing for large gene lists
  • Rate limiting: Built-in delays to respect API usage policies
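
To illustrate how worker threads and courtesy delays interact, the sketch below shares one limiter across a thread pool so that requests are spaced out regardless of the worker count. It is a generic pattern under stated assumptions, not GeneInfo's internals: `MinIntervalLimiter` and `annotate` are hypothetical, and the lookup is a placeholder that performs no network access.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class MinIntervalLimiter:
    """Space calls at least `interval` seconds apart across all threads."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            sleep_for = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.interval
        if sleep_for:
            time.sleep(sleep_for)


def annotate(symbol, limiter):
    """Placeholder lookup: waits for a slot, then returns a dummy record."""
    limiter.wait()
    return {"query": symbol, "length": len(symbol)}


genes = ["TP53", "BRCA1", "EGFR", "MYC", "KRAS"]
limiter = MinIntervalLimiter(interval=0.01)
with ThreadPoolExecutor(max_workers=5) as pool:
    # pool.map returns results in input order, whatever order the threads finish in.
    results = list(pool.map(lambda g: annotate(g, limiter), genes))
print([r["query"] for r in results])
```

With a shared limiter, adding workers beyond a certain point no longer raises throughput, which is one reason a modest worker count is usually sufficient.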

Real-world Usage Examples

Cancer Gene Panel Analysis

from geneinfo import GeneInfo

# Process a cancer gene panel
cancer_genes = ["TP53", "BRCA1", "BRCA2", "EGFR", "KRAS", "PIK3CA", "AKT1"]
gene_info = GeneInfo(email="researcher@university.edu")

results = gene_info.get_batch_info(cancer_genes)
# Filter for genes with clinical variants
cancer_variants = results[results['clinvar_count'] > 0]
print(f"Found clinical variants in {len(cancer_variants)} cancer genes")

Pathway Enrichment Preprocessing

# Prepare data for pathway analysis
gene_list = ["TP53", "MDM2", "CDKN1A", "BAX", "BBC3"]  # p53 pathway genes
detailed_results = [gene_info.get_gene_info(gene) for gene in gene_list]

# Extract GO terms for enrichment analysis
all_go_terms = []
for result in detailed_results:
    for go_term in result['gene_ontology']:
        all_go_terms.append({
            'gene': result['query'],
            'go_id': go_term['go_id'],
            'go_name': go_term['go_name'],
            'namespace': go_term['namespace']
        })
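
Once the GO terms are in this flat form, counting how many genes share each term takes only a few lines. The rows below are illustrative sample data in the same shape as the list built above, not real GeneInfo output, and `sample_go_terms` is a hypothetical name.

```python
from collections import Counter, defaultdict

# Illustrative rows only, shaped like the entries appended above.
sample_go_terms = [
    {"gene": "TP53", "go_id": "GO:0006915", "go_name": "apoptotic process",
     "namespace": "biological_process"},
    {"gene": "TP53", "go_id": "GO:0003677", "go_name": "DNA binding",
     "namespace": "molecular_function"},
    {"gene": "BAX", "go_id": "GO:0006915", "go_name": "apoptotic process",
     "namespace": "biological_process"},
    {"gene": "MDM2", "go_id": "GO:0005634", "go_name": "nucleus",
     "namespace": "cellular_component"},
]

# Collect the set of genes per term so duplicate annotations are counted once.
genes_by_term = defaultdict(set)
for row in sample_go_terms:
    genes_by_term[(row["go_id"], row["go_name"])].add(row["gene"])

term_counts = Counter({term: len(genes) for term, genes in genes_by_term.items()})
for (go_id, go_name), n_genes in term_counts.most_common(3):
    print(f"{go_id}\t{go_name}\t{n_genes}")
```

These per-term gene counts are the usual input to an enrichment test, which also needs a background gene set that this sketch does not provide.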

Large-scale Genomics Project

# Process GWAS significant genes (thousands of genes)
with open("gwas_significant_genes.txt") as f:
    gwas_genes = [line.strip() for line in f if line.strip()]  # 5000+ genes, blank lines skipped

# Process in batches with progress tracking
gene_info.export_batch_to_directory(
    gwas_genes,
    "gwas_gene_annotation/",
    max_workers=8
)
# Creates organized directory with individual files + summary

Development & Testing

Running Tests

# Install development dependencies
uv add --dev pytest pytest-cov pytest-asyncio

# Run test suite
uv run pytest

# Run with coverage
uv run pytest --cov=geneinfo --cov-report=html

Project Structure

geneinfo/
├── geneinfo/
│   ├── __init__.py          # Main package exports
│   ├── core.py              # GeneInfo main class
│   ├── cli.py               # Command-line interface
│   ├── mock_data.py         # Fallback data for offline mode
│   └── fetchers/            # Modular data fetchers
│       ├── base.py          # Base fetcher with common functionality
│       ├── genomic.py       # Ensembl, MyGene fetchers
│       ├── protein.py       # UniProt, STRING-db fetchers
│       ├── functional.py    # GO, Reactome fetchers
│       └── clinical.py      # ClinVar, GWAS, OMIM fetchers
├── tests/                   # Comprehensive test suite
├── examples/                # Usage examples and demos
├── docs/                    # Documentation (you are here!)
└── pyproject.toml           # Modern Python packaging

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Follow the coding standards in .github/copilot-instructions.md
  4. Add tests for new functionality
  5. Run the test suite: uv run pytest
  6. Submit a pull request

Dependencies & Requirements

Core Dependencies

  • Python 3.11+ - Modern Python features and type hints
  • requests - HTTP client for API calls
  • pandas - Data manipulation and analysis
  • numpy - Numerical computing
  • typer - CLI framework with rich features
  • rich - Beautiful terminal output and progress bars
  • biopython - Bioinformatics tools (for Entrez/ClinVar)
  • mygene - Enhanced gene annotation client

System Requirements

  • Internet connection for API access (offline mode available)
  • Sufficient memory for large gene lists (typically <1GB for 10,000 genes)
  • Email address for ClinVar/NCBI Entrez access (optional but recommended)

Troubleshooting

Common Issues

API Access Problems

# Test API connectivity
geneinfo --gene TP53 --verbose

# Use offline mode when APIs are unavailable
# The package automatically falls back to mock data

Large Gene List Processing

# For very large lists, reduce concurrent workers
geneinfo --file huge_gene_list.txt --workers 3 --output results.csv

# Process in smaller batches if memory is limited
split -l 1000 huge_gene_list.txt batch_
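
The same batching can be done from Python when the gene list is already in memory. This is a minimal sketch: `chunked` is a hypothetical helper and the gene names are generated placeholders; each batch could then be passed to a batch call one at a time.

```python
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder identifiers standing in for a large gene list.
genes = [f"GENE{i}" for i in range(2500)]
batches = list(chunked(genes, 1000))
print([len(batch) for batch in batches])
```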

Email Configuration for ClinVar

# Provide a valid email for NCBI Entrez access
gene_info = GeneInfo(email="your.email@institution.edu")

Getting Help

  • Check the examples/ directory for usage patterns
  • Report issues on GitHub with verbose output logs
  • Include gene lists and error messages in bug reports
  • Use --verbose flag for detailed debugging information

License & Citation

License

MIT License - see LICENSE file for details.

Citation

If you use GeneInfo in your research, please cite:

@software{geneinfo2025,
  author = {Liu, Chunjie},
  title = {GeneInfo: Comprehensive Gene Information Retrieval},
  url = {https://github.com/chunjie-sam-liu/geneinfo},
  version = {0.1.0},
  year = {2025}
}

Acknowledgments

This package aggregates data from multiple public biological databases. Please also cite the original data sources in your publications:

  • Ensembl: Cunningham et al. (2022) Nucleic Acids Research
  • UniProt: The UniProt Consortium (2023) Nucleic Acids Research
  • Gene Ontology: Aleksander et al. (2023) Genetics
  • Reactome: Gillespie et al. (2022) Nucleic Acids Research
  • ClinVar: Landrum et al. (2020) Nucleic Acids Research
  • STRING: Szklarczyk et al. (2023) Nucleic Acids Research
  • GWAS Catalog: Sollis et al. (2023) Nucleic Acids Research

Author: Chunjie Liu
Contact: chunjie.sam.liu.at.gmail.com
Version: 0.1.0
Date: 2025-08-06
