
A lightweight Python package for mapping GEO platform probes to their corresponding gene identifiers in seconds!


🧬 PyProbeMapper

PyProbeMapper is a Python SDK and command-line tool designed to map GEO Platform (GPL) probe IDs to gene symbols for differential gene expression analysis within seconds. It leverages data from the HuggingFace Hub, processes mappings with multiple strategies (accession, coordinate, and direct lookup), and saves results locally for downstream bioinformatics pipelines.

It uses GPT-based inference to intelligently select relevant columns for mapping, saving time and handling variability across GEO platforms.

This tool is ideal for researchers and bioinformaticians working with GPL datasets who need accurate and efficient probe-to-gene mappings.

🔍 Features

  • ✅ Accurate probe-to-gene mapping for GEO GPL platforms within seconds
  • 🤖 GPT-based column inference to automatically select relevant columns, saving time and reducing complexity across diverse GEO platforms
  • ⚡ Fast retrieval of existing mappings from a HuggingFace-hosted Zarr dataset
  • Community-driven mapping: once a platform is mapped, results are pushed to a central HuggingFace Hub repository (Tinfloz/probe-gene-map), enabling global reuse and collaboration (over 1,000 platforms already mapped!)
  • 🧠 Multiple mapping strategies: accession lookup, coordinate lookup, and direct lookup
  • 🖥️ Interactive CLI for ease of use
  • 💾 Local storage of mappings as JSON files
  • Push to HuggingFace Hub for sharing and versioning
  • 🧩 Easy integration into bioinformatics pipelines or custom scripts
  • 📊 Includes a built-in human gene reference dataset (Home_sapiens.GRCh38.genes.tsv)

📦 Installation

Install py_probe_mapper from PyPI using your preferred package manager:

uv pip install py_probe_mapper

Or clone the repository and install locally:

git clone https://github.com/Tinfloz/Probe2GeneMapper
cd Probe2GeneMapper
uv pip install .

🧪 Example (Python SDK)

Use the map_probes function to map probe IDs to gene symbols for one or more GPL platforms:

from py_probe_mapper.sdk import map_probes

# Map probes for GPL570 and GPL96
results = map_probes(
    gpl_ids=["GPL570", "GPL96"],
    output_dir="./mappings",
    force_rebuild=False
)

# Print results
for gpl_id, mappings in results.items():
    if isinstance(mappings, dict):
        print(f"{gpl_id}: Found {len(mappings)} mappings")
    else:
        print(f"{gpl_id}: {mappings}")

Output (example):

GPL570: Found 54675 mappings
GPL96: Found 22283 mappings

The mappings are saved as JSON files (e.g., GPL570_mappings.json) in the specified output_dir.
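Once the JSON files exist, a downstream script can read them back with the standard library alone. The following is a minimal sketch, assuming each file is a flat JSON object that maps a probe ID to a gene symbol; this README does not document the exact schema, so inspect a generated file first. `load_mapping` and `lookup` are hypothetical helpers for illustration, not part of the SDK.

```python
import json
from pathlib import Path


def load_mapping(output_dir, gpl_id):
    """Read <output_dir>/<gpl_id>_mappings.json into a dict.

    Assumption: the file holds a flat {probe_id: gene_symbol} object.
    """
    path = Path(output_dir) / f"{gpl_id}_mappings.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def lookup(mapping, probe_ids):
    """Return {probe_id: gene_symbol}, with None for probes that are absent."""
    return {probe_id: mapping.get(probe_id) for probe_id in probe_ids}
```

With the example above, `load_mapping("./mappings", "GPL570")` would read GPL570_mappings.json, and `lookup` then resolves a batch of probe IDs in one call while keeping unmapped probes visible as `None`.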

💻 Example (CLI)

Launch the interactive CLI to map probes with a user-friendly interface:

probe-mapper

The CLI will prompt you to:

  • Enter up to 5 GPL IDs (e.g., GPL570,GPL96)
  • Specify the output directory
  • Provide optional API URL and key for inference services
  • Choose whether to force rebuild existing mappings
  • Select a logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

Results are saved as JSON files in the specified directory.

Sample Interaction:

🌟 Welcome to the GPL Probe Mapper CLI! 🌟

🧬 Enter up to 5 GPL platform identifiers (comma-separated, e.g., GPL570,GPL96): GPL570
📂 Enter output directory (default: .): ./mappings
🔗 Enter API URL for inference service (optional, press enter to skip): 
🔑 Enter API key for inference service (optional, press enter to skip): 
🔄 Force rebuild mappings even if they exist? (default: No): No
📋 Select logging level: INFO

🚀 Starting probe mapping... Please wait! ⏳
🎉 Mapping completed! 🎉
📊 Results:
✅ GPL570: Found 54675 mappings 🧬
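When scripting the SDK instead of using the prompts, it can help to apply the same input rules the CLI describes (comma-separated IDs, at most five). The sketch below is one way to do that. `parse_gpl_ids` is a hypothetical helper, and the `GPL` plus digits pattern is an assumption about what a valid platform accession looks like; it is not taken from the package's own validation code.

```python
import re

MAX_GPL_IDS = 5  # limit stated in the CLI prompt
GPL_PATTERN = re.compile(r"GPL\d+")  # assumed shape of a GEO platform accession


def parse_gpl_ids(raw):
    """Split a comma-separated string into a validated list of GPL IDs."""
    # Drop empty fragments, trim whitespace, and normalise case.
    ids = [part.strip().upper() for part in raw.split(",") if part.strip()]
    if not ids:
        raise ValueError("no GPL IDs given")
    if len(ids) > MAX_GPL_IDS:
        raise ValueError(f"at most {MAX_GPL_IDS} GPL IDs are allowed, got {len(ids)}")
    bad = [gpl_id for gpl_id in ids if not GPL_PATTERN.fullmatch(gpl_id)]
    if bad:
        raise ValueError(f"not valid GPL IDs: {', '.join(bad)}")
    return ids
```

The returned list can be passed directly as the `gpl_ids` argument of `map_probes`.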

🧠 Mapping Strategies

The tool supports three mapping strategies to ensure robust probe-to-gene mappings:

  1. Accession Lookup: Matches probes using accession numbers.
  2. Coordinate Lookup: Uses genomic coordinates for precise mapping.
  3. Direct Lookup: Directly maps probes to gene symbols when available.

Mappings are fetched from a HuggingFace dataset (Tinfloz/probe-gene-map) or built on-demand using metadata from GEO and the included Home_sapiens.GRCh38.genes.tsv reference.
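To make the idea behind coordinate lookup concrete, the sketch below assigns a probe to every gene whose genomic interval overlaps the probe's interval. It illustrates the general technique only and is not the package's implementation; the function names, the record layout `(chrom, start, end, symbol)`, and the toy coordinates are all assumptions made for the example.

```python
from collections import defaultdict


def build_gene_index(genes):
    """Group (chrom, start, end, symbol) records by chromosome."""
    index = defaultdict(list)
    for chrom, start, end, symbol in genes:
        index[chrom].append((start, end, symbol))
    return index


def genes_at(index, chrom, probe_start, probe_end):
    """Return symbols of genes overlapping the closed interval [probe_start, probe_end]."""
    return [
        symbol
        for start, end, symbol in index.get(chrom, [])
        # Two closed intervals overlap when each one starts before the other ends.
        if start <= probe_end and probe_start <= end
    ]


# Toy reference with made-up coordinates, for illustration only.
toy_genes = [
    ("chr1", 100, 200, "GENE_A"),
    ("chr1", 150, 300, "GENE_B"),
    ("chr2", 100, 200, "GENE_C"),
]
toy_index = build_gene_index(toy_genes)
print(genes_at(toy_index, "chr1", 190, 210))  # ['GENE_A', 'GENE_B']
```

A linear scan per chromosome keeps the sketch short; a real reference with tens of thousands of genes would usually use sorted intervals or an interval tree instead.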

📁 Project Structure

PyProbeMapper/
├── py_probe_mapper/
│   ├── genome_utils/
│   │   └── Home_sapiens.GRCh38.genes.tsv  # Human gene reference data
│   ├── coordinate_lookup/
│   │   ├── __init__.py
│   │   └── coordinate_lookup.py
│   ├── accession_lookup/
│   │   ├── __init__.py
│   │   └── accession_lookup.py
│   ├── direct_lookup/
│   │   ├── __init__.py
│   │   └── direct_lookup.py
│   ├── lookup_classifier/
│   │   ├── __init__.py
│   │   └── optimised_lookup_classifier.py
│   ├── metadata_builder/
│   │   ├── __init__.py
│   │   └── build_metadata.py
│   ├── __init__.py
│   ├── cli.py                            # Interactive CLI
│   └── sdk.py                            # Core SDK
├── pyproject.toml                        # Package configuration
├── README.md                             # This file

🛠️ Requirements

  • Python 3.12+
  • questionary>=2.0.0
  • fsspec>=2023.1.0
  • zarr>=2.14.0
  • pandas>=1.5.0
  • huggingface_hub>=0.17.0

Install dependencies automatically with:

pip install py_probe_mapper

📖 License

AGPL 3.0 License

This project is licensed under the AGPL 3.0 License.

See the LICENSE file for details.

📚 Usage Notes

Data Access: The included Home_sapiens.GRCh38.genes.tsv file is used for coordinate-based mapping.

HuggingFace Integration: Mappings are stored in a Zarr dataset on HuggingFace (Tinfloz/probe-gene-map). Set force_rebuild=True to regenerate mappings if needed.
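The effect of a flag like `force_rebuild` can be pictured as a reuse-unless-forced pattern. The sketch below shows that pattern with a local JSON file standing in for the stored mapping; it is an illustration under assumptions, not the package's code, and it does not touch the HuggingFace Hub. `get_mapping` and the `build` callback are hypothetical names.

```python
import json
from pathlib import Path


def get_mapping(gpl_id, output_dir, build, force_rebuild=False):
    """Return a stored mapping if one exists, otherwise build and store it.

    `build` is a hypothetical callable taking a GPL ID and returning a dict.
    """
    path = Path(output_dir) / f"{gpl_id}_mappings.json"
    if path.exists() and not force_rebuild:
        # Reuse the stored result and skip the expensive build step.
        return json.loads(path.read_text(encoding="utf-8"))
    mapping = build(gpl_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(mapping), encoding="utf-8")
    return mapping
```

With `force_rebuild=False` the build step runs at most once per platform; passing `True` always regenerates and overwrites the stored file.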

🚀 Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/your-feature).
  3. Commit your changes (git commit -m 'Add your feature').
  4. Push to the branch (git push origin feature/your-feature).
  5. Open a pull request.

Please include tests.

📧 Contact

For questions or support, open an issue on the GitHub repository.
