A lightweight Python package for mapping GEO platform probes to their corresponding gene identifiers in seconds!
PyProbeMapper
PyProbeMapper is a Python SDK and command-line tool designed to map GEO Platform (GPL) probe IDs to gene symbols for differential gene expression analysis within seconds. It leverages data from the HuggingFace Hub, processes mappings with multiple strategies (accession, coordinate, and direct lookup), and saves results locally for downstream bioinformatics pipelines.
It uses GPT-based inference to intelligently select relevant columns for mapping, saving time and handling variability across GEO platforms.
This tool is ideal for researchers and bioinformaticians working with GPL datasets who need accurate and efficient probe-to-gene mappings.
Features
- Accurate probe-to-gene mapping for GEO GPL platforms within seconds
- GPT-based column inference to automatically select relevant columns, saving time and reducing complexity across diverse GEO platforms
- Fast retrieval of existing mappings from a HuggingFace-hosted Zarr dataset
- Community-driven mapping: once a platform is mapped, results are pushed to a central HuggingFace Hub repository (Tinfloz/probe-gene-map), enabling global reuse and collaboration (over 1,000 platforms already mapped!)
- Multiple mapping strategies: accession lookup, coordinate lookup, and direct lookup
- Interactive CLI for ease of use
- Local storage of mappings as JSON files
- Push to HuggingFace Hub for sharing and versioning
- Easy integration into bioinformatics pipelines or custom scripts
- Includes a built-in human gene reference dataset (Home_sapiens.GRCh38.genes.tsv)
Installation
Install py_probe_mapper from PyPI using your preferred package manager:
uv pip install py_probe_mapper
Or clone the repository and install locally:
git clone https://github.com/Tinfloz/Probe2GeneMapper
cd Probe2GeneMapper
uv pip install .
Example (Python SDK)
Use the map_probes function to map probe IDs to gene symbols for one or more GPL platforms:
from py_probe_mapper.sdk import map_probes

# Map probes for GPL570 and GPL96
results = map_probes(
    gpl_ids=["GPL570", "GPL96"],
    output_dir="./mappings",
    force_rebuild=False,
)

# Print results; a non-dict value is printed as-is
for gpl_id, mappings in results.items():
    if isinstance(mappings, dict):
        print(f"{gpl_id}: Found {len(mappings)} mappings")
    else:
        print(f"{gpl_id}: {mappings}")
Output (example):
GPL570: Found 54675 mappings
GPL96: Found 22283 mappings
The mappings are saved as JSON files (e.g., GPL570_mappings.json) in the specified output_dir.
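Once a mapping file exists, it can be read back with the standard library. The sketch below assumes each file is a flat JSON object whose keys are probe IDs and whose values are gene symbols; that layout is an assumption, not something this README confirms, so inspect a generated file before relying on it. The helper names load_mapping and annotate are hypothetical and not part of the package.

```python
import json
from pathlib import Path


def load_mapping(output_dir: str, gpl_id: str) -> dict:
    """Read <output_dir>/<gpl_id>_mappings.json into a dict.

    Assumes a flat {probe_id: gene_symbol} layout (an assumption, see above).
    """
    path = Path(output_dir) / f"{gpl_id}_mappings.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def annotate(probe_ids, mapping, missing="NA"):
    """Return (probe_id, gene_symbol) pairs, using `missing` for unmapped probes."""
    return [(pid, mapping.get(pid, missing)) for pid in probe_ids]
```

For example, annotate(expression_table_probe_ids, load_mapping("./mappings", "GPL570")) would pair each probe in an expression table with its symbol, leaving unmapped probes marked as "NA".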
Example (CLI)
Launch the interactive CLI to map probes with a user-friendly interface:
probe-mapper
The CLI prompts you to:
- Enter up to 5 GPL IDs (e.g., GPL570,GPL96)
- Specify the output directory
- Provide optional API URL and key for inference services
- Choose whether to force rebuild existing mappings
- Select a logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
Results are saved as JSON files in the specified directory.
Sample Interaction:
Welcome to the GPL Probe Mapper CLI!
Enter up to 5 GPL platform identifiers (comma-separated, e.g., GPL570,GPL96): GPL570
Enter output directory (default: .): ./mappings
Enter API URL for inference service (optional, press enter to skip):
Enter API key for inference service (optional, press enter to skip):
Force rebuild mappings even if they exist? (default: No): No
Select logging level: INFO
Starting probe mapping... Please wait!
Mapping completed!
Results:
GPL570: Found 54675 mappings
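Scripts that collect platform IDs from users may want to apply the same constraints as the first prompt (comma-separated input, at most five IDs) before calling map_probes. The following is a minimal sketch of such a check; parse_gpl_ids is a hypothetical helper, and the assumption that a valid ID is "GPL" followed by digits comes from the examples above rather than from the CLI's own validation code.

```python
import re

MAX_IDS = 5  # limit stated by the CLI prompt
GPL_PATTERN = re.compile(r"^GPL\d+$")  # assumed ID shape, e.g. GPL570


def parse_gpl_ids(raw: str) -> list[str]:
    """Split a comma-separated string into validated, de-duplicated GPL IDs."""
    ids: list[str] = []
    for token in raw.split(","):
        gpl_id = token.strip().upper()
        if not gpl_id:
            continue  # tolerate stray commas and blanks
        if not GPL_PATTERN.match(gpl_id):
            raise ValueError(f"Not a GPL identifier: {token.strip()!r}")
        if gpl_id not in ids:
            ids.append(gpl_id)
    if not ids:
        raise ValueError("No GPL identifiers given")
    if len(ids) > MAX_IDS:
        raise ValueError(f"At most {MAX_IDS} GPL identifiers are allowed")
    return ids
```

The returned list can be passed directly as the gpl_ids argument shown in the SDK example.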
Mapping Strategies
The tool supports three mapping strategies to ensure robust probe-to-gene mappings:
- Accession Lookup: Matches probes using accession numbers.
- Coordinate Lookup: Uses genomic coordinates for precise mapping.
- Direct Lookup: Directly maps probes to gene symbols when available.
Mappings are fetched from a HuggingFace dataset (Tinfloz/probe-gene-map) or built on-demand using metadata from GEO and the included Home_sapiens.GRCh38.genes.tsv reference.
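To make the three strategies concrete, here is a toy sketch that tries them in turn for a single annotation row. Everything in it is illustrative: the column names, the reference data, and the direct, accession, coordinate ordering are assumptions, and the package's own lookup_classifier may select a strategy quite differently. The coordinate step also assumes non-overlapping gene intervals sorted by start position.

```python
from bisect import bisect_right

# Toy reference data; all values are invented for illustration.
ACCESSION_TO_GENE = {"ACC_0001": "GENE_A"}
# chromosome -> list of (start, end, symbol), sorted by start, non-overlapping
GENE_INTERVALS = {"chr1": [(100, 500, "GENE_A"), (800, 1200, "GENE_B")]}


def direct_lookup(row):
    """Use a gene symbol column when the platform table already has one."""
    return row.get("gene_symbol") or None


def accession_lookup(row):
    """Translate an accession number through a reference table."""
    return ACCESSION_TO_GENE.get(row.get("accession"))


def coordinate_lookup(row):
    """Find the gene interval that contains the probe position."""
    intervals = GENE_INTERVALS.get(row.get("chrom"), [])
    position = row.get("position")
    if position is None or not intervals:
        return None
    starts = [start for start, _, _ in intervals]
    index = bisect_right(starts, position) - 1  # last interval starting <= position
    if index >= 0 and intervals[index][1] >= position:
        return intervals[index][2]
    return None


def map_row(row):
    """Return the first symbol any strategy finds, or None."""
    for strategy in (direct_lookup, accession_lookup, coordinate_lookup):
        symbol = strategy(row)
        if symbol:
            return symbol
    return None
```

Trying the cheapest lookup first keeps the common case fast, while the coordinate search only runs for rows that carry neither a symbol nor a known accession.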
Project Structure
PyProbeMapper/
├── py_probe_mapper/
│   ├── genome_utils/
│   │   └── Home_sapiens.GRCh38.genes.tsv   # Human gene reference data
│   ├── coordinate_lookup/
│   │   ├── __init__.py
│   │   └── coordinate_lookup.py
│   ├── accession_lookup/
│   │   ├── __init__.py
│   │   └── accession_lookup.py
│   ├── direct_lookup/
│   │   ├── __init__.py
│   │   └── direct_lookup.py
│   ├── lookup_classifier/
│   │   ├── __init__.py
│   │   └── optimised_lookup_classifier.py
│   ├── metadata_builder/
│   │   ├── __init__.py
│   │   └── build_metadata.py
│   ├── __init__.py
│   ├── cli.py                              # Interactive CLI
│   └── sdk.py                              # Core SDK
├── pyproject.toml                          # Package configuration
└── README.md                               # This file
Requirements
- Python 3.12+
- questionary>=2.0.0
- fsspec>=2023.1.0
- zarr>=2.14.0
- pandas>=1.5.0
- huggingface_hub>=0.17.0
These dependencies are installed automatically along with the package:
pip install py_probe_mapper
License
This project is licensed under the AGPL 3.0 License.
See the LICENSE file for details.
Usage Notes
Data Access: The included Home_sapiens.GRCh38.genes.tsv file is used for coordinate-based mapping.
HuggingFace Integration: Mappings are stored in a Zarr dataset on HuggingFace (Tinfloz/probe-gene-map). Set force_rebuild=True to regenerate a mapping instead of reusing the existing one.
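The reuse-or-regenerate behaviour of a flag like force_rebuild can be pictured with a small caching sketch. This is not the package's implementation: the real tool consults the HuggingFace-hosted Zarr dataset, whereas the sketch uses a local JSON file as a stand-in cache, and load_or_build and its build callback are hypothetical names.

```python
import json
from pathlib import Path


def load_or_build(gpl_id, output_dir, build, force_rebuild=False):
    """Return a cached mapping unless it is missing or a rebuild is forced.

    `build` is a caller-supplied function (hypothetical) that takes a GPL ID
    and returns a JSON-serialisable mapping.
    """
    path = Path(output_dir) / f"{gpl_id}_mappings.json"
    if path.exists() and not force_rebuild:
        # Cache hit: skip the expensive build step entirely.
        return json.loads(path.read_text(encoding="utf-8"))
    mapping = build(gpl_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(mapping), encoding="utf-8")
    return mapping
```

With force_rebuild left at False, a second call for the same platform returns the stored result without invoking build again; passing True always rebuilds and overwrites the stored copy.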
Contributing
Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch (git checkout -b feature/your-feature).
- Commit your changes (git commit -m 'Add your feature').
- Push to the branch (git push origin feature/your-feature).
- Open a pull request.
Please include tests.
Contact
For questions or support, open an issue on the GitHub repository.