GEOtcha
Extract and harmonize RNA-seq metadata from NCBI GEO.
GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.
Installation
pip install geotcha
With optional extras:
pip install "geotcha[ml]" # ML harmonization (GLiNER + SapBERT + FAISS)
pip install "geotcha[parquet]" # Parquet export support
pip install "geotcha[llm]" # LLM harmonization
For development:
git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"
Quick Start
Search for datasets
geotcha search "inflammatory bowel disease"
Full pipeline with subset testing
geotcha run "IBD" --subset 5 --output ./results/ --harmonize
Extract from specific GSE IDs
geotcha extract GSE12345 GSE67890 --output ./results/
Output format selection
# CSV (default), TSV, or Parquet
geotcha run "IBD" --harmonize --format csv
geotcha extract GSE12345 -f tsv
geotcha run "IBD" -f parquet --output ./results/
Parquet requires pip install geotcha[parquet].
Include single-cell RNA-seq datasets (excluded by default)
geotcha run "IBD" --subset 5 --include-scrna
Disease packs
# List available packs
geotcha packs
# Use a pack for optimized search
geotcha run "IBD" --pack ibd --harmonize
geotcha run "breast cancer" --pack oncology --harmonize
Available packs: ibd, oncology, neurodegeneration, autoimmune, metabolic.
With ML harmonization (zero-shot NER + entity linking)
pip install "geotcha[ml]"
# Build FAISS ontology indices (one-time)
geotcha ml build-index
# Run with ML
geotcha run "IBD" --harmonize --ml-mode hybrid
ML fills in missing or low-confidence fields using GLiNER biomedical NER and SapBERT entity linking. Use --ml-mode full to let ML run on all fields.
With LLM harmonization
pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic
Combined: rules + ML + LLM
pip install "geotcha[ml,llm]"
geotcha run "IBD" --harmonize --ml-mode hybrid --llm
The harmonization chain runs in order: rules → ML → LLM. Each layer only upgrades fields that are still missing or low-confidence.
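The layering described above can be sketched in a few lines of Python. This is a minimal illustration, not GEOtcha's implementation: the `FieldValue` class, the `REVIEW_THRESHOLD` value of 0.8, and the three toy layer functions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical threshold below which a field counts as low-confidence.
REVIEW_THRESHOLD = 0.8


@dataclass
class FieldValue:
    value: Optional[str] = None
    source: Optional[str] = None  # "rule", "ml", or "llm"
    confidence: float = 0.0


# A layer takes the raw text and returns a candidate, or None if it has no answer.
Layer = Callable[[str], Optional[FieldValue]]


def harmonize_field(raw: str, layers: list[Layer]) -> FieldValue:
    """Run layers in order; a later layer only replaces a missing or low-confidence result."""
    current = FieldValue()
    for layer in layers:
        if current.value is not None and current.confidence >= REVIEW_THRESHOLD:
            break  # already good enough, so later layers are skipped
        candidate = layer(raw)
        if candidate is not None and candidate.confidence > current.confidence:
            current = candidate
    return current


# Toy layers standing in for the rules, ML, and LLM stages.
def rules(raw: str) -> Optional[FieldValue]:
    table = {"uc": "ulcerative colitis"}
    hit = table.get(raw.strip().lower())
    return FieldValue(hit, "rule", 1.0) if hit else None


def ml(raw: str) -> Optional[FieldValue]:
    if "colitis" in raw.lower():
        return FieldValue("colitis", "ml", 0.6)
    return None


def llm(raw: str) -> Optional[FieldValue]:
    if "colitis" in raw.lower():
        return FieldValue("ulcerative colitis", "llm", 0.9)
    return None


chain = [rules, ml, llm]
print(harmonize_field("UC", chain))
print(harmonize_field("active colitis, ulcerative", chain))
```

In the first call the rule layer answers with full confidence, so the ML and LLM stand-ins never run. In the second call the rule layer has no answer, the ML stand-in gives a low-confidence value, and the LLM stand-in upgrades it.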
Structured JSON logging
geotcha run "IBD" --log-json --output ./results/
geotcha extract GSE12345 --log-json
Emits structured JSON log lines to stderr — useful for log aggregation in production pipelines.
Benchmark harmonization quality
# Run against bundled fixtures (100 curated datasets)
geotcha benchmark
# Custom fixtures and output
geotcha benchmark --input ./my_fixtures/ --output ./report.json
# Benchmark with ML enabled
geotcha benchmark --ml-mode hybrid
Produces a JSON report with per-field exact match, completeness, ontology coverage, and confidence metrics.
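To make two of these metrics concrete, here is a small sketch of how per-field exact match and completeness could be computed. The definitions are assumptions for illustration (exact match over records that have a gold label, completeness over all records); GEOtcha's own benchmark may define them differently, and `field_metrics` is a hypothetical helper, not part of the package.

```python
def field_metrics(predicted: list[dict], gold: list[dict], field: str) -> dict:
    """Compute exact match and completeness for one field over paired records."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be the same length")
    labelled = 0  # records where the gold standard has a value
    correct = 0   # labelled records where the prediction matches exactly
    filled = 0    # records where a prediction exists at all
    for pred, ref in zip(predicted, gold):
        p = (pred.get(field) or "").strip().lower()
        g = (ref.get(field) or "").strip().lower()
        if p:
            filled += 1
        if g:
            labelled += 1
            if p == g:
                correct += 1
    return {
        "exact_match": correct / labelled if labelled else 0.0,
        "completeness": filled / len(predicted) if predicted else 0.0,
    }


predicted = [{"tissue": "colon"}, {"tissue": "Ileum"}, {"tissue": ""}, {"tissue": "blood"}]
gold = [{"tissue": "colon"}, {"tissue": "ileum"}, {"tissue": "rectum"}, {"tissue": "whole blood"}]
print(field_metrics(predicted, gold, "tissue"))
```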
Run report
# After a pipeline run completes, view a summary:
geotcha report <run_id>
# Write report.json to a custom directory:
geotcha report <run_id> --output ./reports/
Prints run metadata (query, ID counts, failures, stage timings) and writes a report.json file.
CI / non-interactive mode
geotcha run "IBD" --non-interactive --output ./results/
geotcha run "IBD" --yes --subset 10 --harmonize
Python SDK
from geotcha import GEOtchaClient
client = GEOtchaClient(ncbi_api_key="...")
ids = client.search("ulcerative colitis")
records = client.extract(ids[:5])
records = client.harmonize(records, ml_mode="hybrid")
client.export(records, output_dir="./results", fmt="parquet")
# Benchmark harmonization quality
report = client.benchmark()
print(report["summary"]["overall_exact_match"])
The SDK has no Typer/Rich dependency — safe for notebooks, scripts, and downstream pipelines. Failed GSE parses are silently skipped.
Configuration
# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"
# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"
# View current configuration
geotcha config show
# Validate configuration
geotcha config validate
Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
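The precedence order above can be pictured with a short sketch. It is not GEOtcha's loader: `resolve_setting`, the `DEFAULTS` table, and the assumption that a setting `name` maps to the environment variable `GEOTCHA_<NAME>` in upper case are all illustrative.

```python
import os

# Illustrative defaults; the real defaults are not listed in this README.
DEFAULTS = {"include_scrna": "false", "ncbi_email": ""}


def resolve_setting(name, cli_flags, config_file, env=None):
    """Return the first value found: CLI flag, then environment, then config file, then default."""
    env = os.environ if env is None else env
    if cli_flags.get(name) is not None:
        return cli_flags[name]
    env_key = "GEOTCHA_" + name.upper()  # assumed naming scheme
    if env_key in env:
        return env[env_key]
    if name in config_file:
        return config_file[name]
    return DEFAULTS.get(name)


# The environment overrides the config file when no CLI flag is given.
print(resolve_setting(
    "include_scrna",
    cli_flags={},
    config_file={"include_scrna": "false"},
    env={"GEOTCHA_INCLUDE_SCRNA": "true"},
))
```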
Output
GEOtcha produces:
- gse_summary.csv: One row per GSE with series-level metadata (or .tsv/.parquet with --format)
- gsm/<GSE_ID>_samples.csv: Per-GSE file with sample-level metadata
- manifest.json (in run state dir): Audit trail with run_id, query, timestamps, stage timings, counts, masked settings
- review_queue.csv: Low-confidence harmonized fields flagged for manual review (always CSV)
- With --harmonize: additional _harmonized, _source, _confidence, and _ontology_id columns
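As a rough illustration of working with the harmonized columns downstream, the sketch below scans a CSV for cells in any `*_confidence` column that fall under a cutoff. The concrete column names (`disease_harmonized`, `disease_confidence`, and so on), the inline sample data, and the 0.8 cutoff are assumptions for the example; only the column suffixes come from the list above.

```python
import csv
import io


def low_confidence_cells(rows, threshold=0.8):
    """Yield (row_index, column, confidence) for every *_confidence cell below threshold."""
    for i, row in enumerate(rows):
        for col, raw in row.items():
            if not col.endswith("_confidence") or not raw:
                continue
            conf = float(raw)
            if conf < threshold:
                yield i, col, conf


# Inline stand-in for a harmonized gse_summary.csv.
sample = io.StringIO(
    "gse_id,disease_harmonized,disease_source,disease_confidence\n"
    "GSE12345,ulcerative colitis,rule,1.0\n"
    "GSE67890,colitis,ml,0.62\n"
)
rows = list(csv.DictReader(sample))
flagged = list(low_confidence_cells(rows))
print(flagged)
```

To run this against a real export, replace the `io.StringIO` object with `open(path, newline="")`.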
Fields extracted
| Level | Fields |
|---|---|
| GSE | ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info |
| GSM | ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status |
Interactive Flow
$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.
Run on a subset first? [Y/n]: Y
Subset size [5]: 5
Processing 5/182 datasets...
[████████████████] 5/5 complete
Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)
Proceed with remaining 177 datasets? [Y/n]:
Use --yes or --non-interactive to skip all prompts (useful for CI and scripted workflows).
Resume
Interrupted runs can be resumed with geotcha resume <run_id>. Resume correctly merges previously extracted rows in gse_summary.csv with newly extracted records, deduplicating by gse_id.
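The merge step can be thought of as a keyed union. The sketch below shows one way to deduplicate by `gse_id`; the function name and the choice that a newly extracted row replaces an older row with the same ID are assumptions, since the README does not say which copy wins.

```python
def merge_by_gse_id(previous: list[dict], new: list[dict]) -> list[dict]:
    """Merge two lists of row dicts, keeping one row per gse_id (newer rows win)."""
    merged: dict[str, dict] = {}
    for row in previous:
        merged[row["gse_id"]] = row
    for row in new:
        merged[row["gse_id"]] = row  # overwrites an older row, keeps its position
    return list(merged.values())


previous = [{"gse_id": "GSE1", "title": "a"}, {"gse_id": "GSE2", "title": "old"}]
new = [{"gse_id": "GSE2", "title": "new"}, {"gse_id": "GSE3", "title": "c"}]
print(merge_by_gse_id(previous, new))
```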
Disease Expansion
GEOtcha automatically expands disease keywords to capture related terms using two sources:
- Hand-curated aliases for common abbreviations (IBD, SLE, RA, COPD, etc.)
- DOID ontology subtypes — any disease in the 12,000-term Disease Ontology is auto-expanded with its subtypes (e.g., "breast cancer" also searches "breast carcinoma", "triple-negative breast cancer", etc.)
Examples:
- IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease + DOID subtypes
- breast cancer → breast cancer, breast carcinoma, male/female breast cancer, luminal breast carcinoma, ...
- melanoma → melanoma, skin melanoma, uveal melanoma, mucosal melanoma, ...
Short ambiguous abbreviations (UC, CD) are excluded from Entrez search queries but used for post-search relevance filtering.
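The split between query terms and filter terms might look like the following sketch. The alias table, the `AMBIGUOUS` set, and the matching rules (case-sensitive whole-word match for abbreviations, case-insensitive substring match for full names) are assumptions chosen for illustration, and DOID subtype expansion is left out.

```python
import re

# Hypothetical hand-curated alias table; the real one covers many more keywords.
ALIASES = {
    "ibd": ["inflammatory bowel disease", "ulcerative colitis", "Crohn's disease", "UC", "CD"],
}
# Short abbreviations that are too ambiguous to send to Entrez.
AMBIGUOUS = {"UC", "CD"}


def expand(keyword: str) -> list[str]:
    """Return the keyword plus its aliases, de-duplicated case-insensitively."""
    seen, out = set(), []
    for term in [keyword, *ALIASES.get(keyword.lower(), [])]:
        if term.lower() not in seen:
            seen.add(term.lower())
            out.append(term)
    return out


def query_terms(terms: list[str]) -> list[str]:
    """Terms that are safe to include in the search query."""
    return [t for t in terms if t not in AMBIGUOUS]


def is_relevant(text: str, terms: list[str]) -> bool:
    """Post-search filter: abbreviations must match as whole words, full names as substrings."""
    for t in terms:
        if t in AMBIGUOUS:
            if re.search(rf"\b{re.escape(t)}\b", text):
                return True
        elif t.lower() in text.lower():
            return True
    return False


terms = expand("IBD")
print(query_terms(terms))
print(is_relevant("RNA-seq of UC patient biopsies", terms))
```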
Filtering
GEOtcha automatically filters search results to human RNA-seq datasets. By default, single-cell RNA-seq datasets (scRNA-seq, snRNA-seq, 10x Genomics, Drop-seq, etc.) are excluded, since most bulk RNA-seq meta-analyses don't want these mixed in.
Single-cell filtering happens at two levels:
- Search level: eSummary title/summary scanned for scRNA-seq keywords
- Sample level: GSM library_source checked for "single cell"
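A simplified picture of the two-level check is sketched below. The keyword list is drawn from the examples named above but is not GEOtcha's actual list, and the decision to drop a whole series when any sample reports a single-cell library source is an assumption.

```python
# Illustrative keywords; the tool's real list may be longer.
SC_KEYWORDS = ("single cell", "single-cell", "scrna", "snrna", "10x genomics", "drop-seq")


def looks_single_cell(text: str) -> bool:
    """Search-level check: scan free text for single-cell keywords."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SC_KEYWORDS)


def keep_series(title, summary, sample_library_sources, include_scrna=False):
    """Return True if the series should stay in the result set."""
    if include_scrna:
        return True
    if looks_single_cell(title) or looks_single_cell(summary):
        return False
    # Sample-level check on each GSM library_source value.
    return not any("single cell" in src.lower() for src in sample_library_sources)


print(keep_series("Bulk RNA-seq of colon biopsies", "IBD cohort", ["transcriptomic"]))
print(keep_series("Colon biopsies", "", ["transcriptomic single cell"]))
```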
To include single-cell datasets, use the --include-scrna flag:
geotcha run "IBD" --include-scrna
geotcha extract GSE12345 --include-scrna
Or set it in config:
geotcha config set include_scrna true
Harmonization
Three-tier harmonization pipeline (each layer is optional):
1. Rules (always on with --harmonize)
- Gender: male/M/man → "male"
- Age: "45 years", "45yo" → "45"
- Tissue: mapped to UBERON ontology (~4,000 terms from OBO Foundry)
- Disease: mapped to Disease Ontology (~12,000 DOID terms + common abbreviations like UC, IBD, COPD)
- Cell type: mapped to Cell Ontology (~3,000 CL terms)
- Treatment: ~300 drugs/stimuli with ChEBI IDs, brand name synonyms (e.g., Remicade → infliximab)
- Timepoint: "week 8", "W8" → "W8"
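The simpler rules in this list behave like small normalizers. The sketch below reproduces the gender, age, and timepoint examples given above; the regular expressions and the female mappings are assumptions, and the ontology-backed fields are not covered.

```python
import re

# Only the male mappings appear in this README; the female ones are assumed by symmetry.
GENDER = {"male": "male", "m": "male", "man": "male",
          "female": "female", "f": "female", "woman": "female"}


def normalize_gender(raw: str):
    return GENDER.get(raw.strip().lower())


def normalize_age(raw: str):
    """Extract a leading number, optionally followed by a year unit ("45 years", "45yo")."""
    m = re.match(r"\s*(\d+(?:\.\d+)?)\s*(?:years?|yrs?|yo|y)?\b", raw, re.IGNORECASE)
    return m.group(1) if m else None


def normalize_timepoint(raw: str):
    """Map "week 8" or "W8" to the compact form "W8"."""
    m = re.fullmatch(r"\s*(?:week|wk|w)\s*(\d+)\s*", raw, re.IGNORECASE)
    return f"W{m.group(1)}" if m else None


print(normalize_gender("M"), normalize_age("45yo"), normalize_timepoint("week 8"))
```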
Five confidence tiers: exact (1.0) → synonym (0.85) → normalized-exact (0.80) → token-set overlap (0.75) → substring (0.70). Fields below the review threshold are flagged for manual review.
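One way to read the tier ladder is as a sequence of progressively looser comparisons, each with a fixed score. The sketch below uses the five scores listed above, but the matching logic for each tier is an assumption: in particular, "token-set overlap" is simplified to equal token sets, and substring matching ignores labels shorter than four characters to avoid matching abbreviations inside other words.

```python
import re


def _norm(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_term(raw: str, ontology: dict[str, list[str]]):
    """Return (canonical_name, tier, confidence) for the first tier that matches, else None."""
    low = raw.strip().lower()
    norm = _norm(raw)
    tokens = set(norm.split())

    # Tier 1: exact match on a canonical name (case-insensitive).
    for name in ontology:
        if low == name.lower():
            return name, "exact", 1.0
    # Tier 2: exact match on a listed synonym.
    for name, synonyms in ontology.items():
        if any(low == s.lower() for s in synonyms):
            return name, "synonym", 0.85

    labels = [(name, _norm(label))
              for name, syns in ontology.items() for label in (name, *syns)]
    # Tier 3: equal after punctuation and whitespace normalization.
    for name, label in labels:
        if norm and norm == label:
            return name, "normalized-exact", 0.80
    # Tier 4: same tokens in any order.
    for name, label in labels:
        if tokens and tokens == set(label.split()):
            return name, "token-set overlap", 0.75
    # Tier 5: a label appears inside the raw value.
    for name, label in labels:
        if len(label) >= 4 and label in norm:
            return name, "substring", 0.70
    return None


ONTOLOGY = {
    "ulcerative colitis": ["UC", "colitis ulcerosa"],
    "Crohn's disease": ["Crohn disease"],
}
for value in ["UC", "Ulcerative-colitis", "colitis, ulcerative", "active ulcerative colitis (colon)"]:
    print(value, "->", match_term(value, ONTOLOGY))
```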
Extraction uses 40+ GEO characteristic keys (tissue, disease, cell type, treatment) and a source_name parser that splits concatenated metadata (e.g., "colon, ulcerative colitis, male, 45y") into structured fields.
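As an illustration of the source_name splitting idea, the sketch below breaks a comma-separated string into parts and assigns each part to a field. The tiny vocabularies and the `parse_source_name` function are hypothetical; the real parser presumably draws on the shipped ontology data rather than hard-coded sets.

```python
import re

# Toy vocabularies for the example only.
TISSUES = {"colon", "ileum", "blood"}
DISEASES = {"ulcerative colitis", "crohn's disease"}
GENDERS = {"male", "female"}


def parse_source_name(source_name: str) -> dict:
    """Split a comma-separated source_name and classify each part; unknown parts are ignored."""
    fields: dict[str, str] = {}
    for part in (p.strip() for p in source_name.split(",")):
        low = part.lower()
        if low in GENDERS:
            fields.setdefault("gender", low)
        elif re.fullmatch(r"\d+\s*(?:y|yo|yrs?|years?)", low):
            fields.setdefault("age", re.match(r"\d+", low).group())
        elif low in TISSUES:
            fields.setdefault("tissue", low)
        elif low in DISEASES:
            fields.setdefault("disease", low)
    return fields


print(parse_source_name("colon, ulcerative colitis, male, 45y"))
```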
Ontology mappings are shipped as JSON package data (src/geotcha/data/ontology/) with ~27,000 synonyms extracted from official OBO sources. To regenerate from upstream ontologies:
python scripts/build_ontologies.py
2. ML (--ml-mode hybrid or --ml-mode full)
- GLiNER-BioMed: zero-shot biomedical NER for disease, tissue, cell type, treatment, gender
- SapBERT + FAISS: entity linking to UBERON/DOID/CL/ChEBI ontology terms via pre-built FAISS indices
- Build indices once with geotcha ml build-index, check status with geotcha ml status
- In hybrid mode, ML only fills fields where rules produced low confidence or no value
- Low-confidence ML predictions flag records with needs_review=True
3. LLM (--llm)
- Optional LLM-assisted harmonization for ambiguous free-text values
- Supports OpenAI, Anthropic, and Ollama providers
Each field tracks provenance: _harmonized, _source (rule/ml/llm), _confidence, and _ontology_id.
Documentation
Full documentation: install the docs extra with pip install "geotcha[docs]" and run mkdocs serve, or see the docs/ directory:
- Getting Started
- CLI Reference
- Python SDK
- Harmonization Guide
- Disease Packs
- ML & LLM
- Extending Ontologies
License
MIT