Extract and harmonize RNA-seq metadata from NCBI GEO

GEOtcha

GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.

Installation

pip install geotcha

With optional extras:

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "geotcha[ml]"       # ML harmonization (GLiNER + SapBERT + FAISS)
pip install "geotcha[parquet]"  # Parquet export support
pip install "geotcha[llm]"      # LLM harmonization

For development:

git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"

Quick Start

Search for datasets

geotcha search "inflammatory bowel disease"

Full pipeline with subset testing

geotcha run "IBD" --subset 5 --output ./results/ --harmonize

Extract from specific GSE IDs

geotcha extract GSE12345 GSE67890 --output ./results/

Output format selection

# CSV (default), TSV, or Parquet
geotcha run "IBD" --harmonize --format csv
geotcha extract GSE12345 -f tsv
geotcha run "IBD" -f parquet --output ./results/

Parquet output requires the parquet extra: pip install "geotcha[parquet]".

Include single-cell RNA-seq datasets (excluded by default)

geotcha run "IBD" --subset 5 --include-scrna

Disease packs

# List available packs
geotcha packs

# Use a pack for optimized search
geotcha run "IBD" --pack ibd --harmonize
geotcha run "breast cancer" --pack oncology --harmonize

Available packs: ibd, oncology, neurodegeneration, autoimmune, metabolic.

With ML harmonization (zero-shot NER + entity linking)

pip install "geotcha[ml]"

# Build FAISS ontology indices (one-time)
geotcha ml build-index

# Run with ML
geotcha run "IBD" --harmonize --ml-mode hybrid

ML fills in missing or low-confidence fields using GLiNER biomedical NER and SapBERT entity linking. Use --ml-mode full to let ML run on all fields.

With LLM harmonization

pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic

Combined: rules + ML + LLM

pip install "geotcha[ml,llm]"
geotcha run "IBD" --harmonize --ml-mode hybrid --llm

The harmonization chain runs in order: rules → ML → LLM. Each layer only upgrades fields that are still missing or low-confidence.
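
The layering rule can be sketched in a few lines. This is an illustration only, not GEOtcha's internal code: the `Field` dataclass, the `REVIEW_THRESHOLD` value, and the three toy layer functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical cut-off; the real review threshold is not stated here.
REVIEW_THRESHOLD = 0.75


@dataclass
class Field:
    value: Optional[str] = None
    source: Optional[str] = None  # "rule", "ml", or "llm"
    confidence: float = 0.0


# A layer takes the raw text and returns (value, confidence) or None.
Layer = Callable[[str], Optional[tuple[str, float]]]


def harmonize_field(raw: str, layers: list[tuple[str, Layer]]) -> Field:
    """Run layers in order; a later layer only replaces a weak or missing result."""
    field = Field()
    for name, layer in layers:
        if field.value is not None and field.confidence >= REVIEW_THRESHOLD:
            break  # already good enough, so later layers are skipped
        result = layer(raw)
        if result is not None and result[1] > field.confidence:
            field = Field(value=result[0], source=name, confidence=result[1])
    return field


# Toy layers standing in for the rules, ML, and LLM stages.
def rules(raw):
    return ("male", 1.0) if raw.lower() in {"m", "male", "man"} else None


def ml(raw):
    return ("male", 0.9) if "boy" in raw.lower() else None


def llm(raw):
    return ("unknown", 0.5)


chain = [("rule", rules), ("ml", ml), ("llm", llm)]
print(harmonize_field("M", chain))
print(harmonize_field("young boy", chain))
```

In this sketch the rule layer settles "M" on its own, while "young boy" falls through to the ML stand-in and never reaches the LLM stand-in.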

Structured JSON logging

geotcha run "IBD" --log-json --output ./results/
geotcha extract GSE12345 --log-json

Emits structured JSON log lines to stderr — useful for log aggregation in production pipelines.
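
Because the log lines go to stderr, a downstream script can parse them separately from the normal output. A minimal sketch, assuming one JSON object per line; the `level` and `message` keys in the sample are made up for illustration, so check the real output for the actual field names.

```python
import json


def parse_json_log(lines):
    """Yield one dict per well-formed JSON line, skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output on stderr
        if isinstance(record, dict):
            yield record


# Sample lines with hypothetical keys, standing in for captured stderr.
sample = [
    '{"level": "info", "message": "search started"}',
    "not json",
    '{"level": "error", "message": "parse failed"}',
]
errors = [r for r in parse_json_log(sample) if r.get("level") == "error"]
print(errors)
```

To feed it real data, redirect stderr to a file (for example `2> run.log`) and pass the open file object to `parse_json_log`.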

Benchmark harmonization quality

# Run against bundled fixtures (100 curated datasets)
geotcha benchmark

# Custom fixtures and output
geotcha benchmark --input ./my_fixtures/ --output ./report.json

# Benchmark with ML enabled
geotcha benchmark --ml-mode hybrid

Produces a JSON report with per-field exact match, completeness, ontology coverage, and confidence metrics.
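
To make the first two metrics concrete, here is one plausible way to compute them for a single field. The definitions are assumptions for illustration (exact match over samples that have a gold label, completeness as the share of non-empty predictions); the report itself may define them differently.

```python
def field_metrics(predicted, gold):
    """Return (exact_match, completeness) for one field across samples.

    predicted, gold: lists of strings or None, aligned by sample.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    # Only samples with a gold label count towards exact match.
    labelled = [(p, g) for p, g in zip(predicted, gold) if g is not None]
    exact = sum(p == g for p, g in labelled) / len(labelled) if labelled else 0.0
    # Completeness ignores correctness: it only asks whether a value was produced.
    filled = sum(p is not None and p != "" for p in predicted)
    completeness = filled / len(predicted) if predicted else 0.0
    return exact, completeness


pred = ["colon", "ileum", None, "blood"]
gold = ["colon", "colon", "rectum", "blood"]
print(field_metrics(pred, gold))  # (0.5, 0.75)
```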

Run report

# After a pipeline run completes, view a summary:
geotcha report <run_id>

# Write report.json to a custom directory:
geotcha report <run_id> --output ./reports/

Prints run metadata (query, ID counts, failures, stage timings) and writes a report.json file.

CI / non-interactive mode

geotcha run "IBD" --non-interactive --output ./results/
geotcha run "IBD" --yes --subset 10 --harmonize

Python SDK

from geotcha import GEOtchaClient

client = GEOtchaClient(ncbi_api_key="...")
ids = client.search("ulcerative colitis")
records = client.extract(ids[:5])
records = client.harmonize(records, ml_mode="hybrid")
client.export(records, output_dir="./results", fmt="parquet")

# Benchmark harmonization quality
report = client.benchmark()
print(report["summary"]["overall_exact_match"])

The SDK has no Typer/Rich dependency — safe for notebooks, scripts, and downstream pipelines. Failed GSE parses are silently skipped.

Configuration

# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"

# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"

# View current configuration
geotcha config show

# Validate configuration
geotcha config validate

Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
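
The precedence order can be pictured as a layered lookup. The sketch below is not GEOtcha's loader; it assumes that an environment variable such as `GEOTCHA_NCBI_API_KEY` maps to the `ncbi_api_key` setting, and the `resolve_settings` helper and `DEFAULTS` table are hypothetical.

```python
from collections import ChainMap

# Hypothetical defaults, for illustration only.
DEFAULTS = {"include_scrna": "false", "ncbi_api_key": ""}


def resolve_settings(cli_flags, environ, file_values, defaults=DEFAULTS):
    """Merge four layers; earlier layers win (CLI > env > file > defaults)."""
    prefix = "GEOTCHA_"
    env_values = {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }
    # Drop unset CLI flags so they do not mask lower layers.
    cli_values = {k: v for k, v in cli_flags.items() if v is not None}
    return dict(ChainMap(cli_values, env_values, file_values, defaults))


settings = resolve_settings(
    cli_flags={"include_scrna": None, "ncbi_email": None},
    environ={"GEOTCHA_NCBI_API_KEY": "env-key", "PATH": "/usr/bin"},
    file_values={"ncbi_api_key": "file-key", "ncbi_email": "you@example.com"},
)
print(settings["ncbi_api_key"])  # env-key: the environment beats the config file
```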

Output

GEOtcha produces:

  • gse_summary.csv — One row per GSE with series-level metadata (or .tsv / .parquet with --format)
  • gsm/<GSE_ID>_samples.csv — Per-GSE file with sample-level metadata
  • manifest.json (in run state dir) — Audit trail: run_id, query, timestamps, stage timings, counts, masked settings
  • review_queue.csv — Low-confidence harmonized fields flagged for manual review (always CSV)
  • With --harmonize: additional _harmonized, _source, _confidence, and _ontology_id columns
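
The harmonized columns are easy to post-process with the standard library. The sketch below scans a summary file for low-confidence cells; it assumes columns are named `<field>_harmonized` and `<field>_confidence`, that the ID column is `gse_id`, and it uses a hypothetical threshold of 0.75. The sample CSV is invented for the example.

```python
import csv
import io


def low_confidence_fields(csv_text, threshold=0.75):
    """Return (gse_id, field, value, confidence) for cells under the threshold."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, raw in row.items():
            if not column.endswith("_confidence") or not raw:
                continue
            field = column[: -len("_confidence")]
            confidence = float(raw)
            if confidence < threshold:
                value = row.get(f"{field}_harmonized", "")
                flagged.append((row.get("gse_id", ""), field, value, confidence))
    return flagged


# Invented sample standing in for gse_summary.csv after a --harmonize run.
sample = (
    "gse_id,tissue_harmonized,tissue_confidence,disease_harmonized,disease_confidence\n"
    "GSE12345,colon,1.0,ulcerative colitis,0.7\n"
)
print(low_confidence_fields(sample))
```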

Fields extracted

Level   Fields
GSE     ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info
GSM     ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status

Interactive Flow

$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.

Run on a subset first? [Y/n]: Y
Subset size [5]: 5

Processing 5/182 datasets...
 [████████████████] 5/5 complete

Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)

Proceed with remaining 177 datasets? [Y/n]:

Use --yes or --non-interactive to skip all prompts (useful for CI and scripted workflows).

Resume

Interrupted runs can be resumed with geotcha resume <run_id>. On resume, rows already written to gse_summary.csv are merged with the newly extracted records and deduplicated by gse_id.
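
A merge keyed on gse_id can be sketched as below. This illustrates the idea rather than the tool's code; in particular, letting a newer row replace an older one with the same ID is an assumption, and `merge_rows` is a hypothetical helper.

```python
def merge_rows(existing, new, key="gse_id"):
    """Merge two lists of row dicts, keeping one row per key.

    Rows from `new` replace rows from `existing` with the same key;
    that tie-break is an assumption, not documented behaviour.
    """
    merged = {row[key]: row for row in existing}
    for row in new:
        merged[row[key]] = row
    return list(merged.values())


before = [{"gse_id": "GSE1", "title": "a"}, {"gse_id": "GSE2", "title": "b"}]
after = [{"gse_id": "GSE2", "title": "b (re-extracted)"}, {"gse_id": "GSE3", "title": "c"}]
print([row["gse_id"] for row in merge_rows(before, after)])  # ['GSE1', 'GSE2', 'GSE3']
```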

Disease Expansion

GEOtcha automatically expands disease keywords to capture related terms using two sources:

  1. Hand-curated aliases for common abbreviations (IBD, SLE, RA, COPD, etc.)
  2. DOID ontology subtypes: any disease in the ~12,000-term Disease Ontology is auto-expanded with its subtypes (e.g., "breast cancer" also searches "breast carcinoma", "triple-negative breast cancer", etc.)

Examples:

  • IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease + DOID subtypes
  • breast cancer → breast cancer, breast carcinoma, male/female breast cancer, luminal breast carcinoma, ...
  • melanoma → melanoma, skin melanoma, uveal melanoma, mucosal melanoma, ...

Short ambiguous abbreviations (UC, CD) are excluded from Entrez search queries but used for post-search relevance filtering.
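
The split between search terms and filter terms can be sketched as follows. The alias table is a tiny hypothetical stand-in for the shipped aliases and DOID subtypes, and the OR-joined query string is only one plausible shape, not necessarily what GEOtcha sends to Entrez.

```python
# Hypothetical alias table; the shipped aliases and DOID subtypes are far larger.
ALIASES = {
    "ibd": [
        "inflammatory bowel disease",
        "ulcerative colitis",
        "Crohn's disease",
        "UC",
        "CD",
    ],
}
AMBIGUOUS = {"UC", "CD"}  # short abbreviations kept out of the search query


def expand(keyword):
    """Return (search_terms, filter_terms) for a keyword."""
    terms = [keyword] + ALIASES.get(keyword.lower(), [])
    search_terms = [t for t in terms if t not in AMBIGUOUS]
    return search_terms, terms


def entrez_query(search_terms):
    """OR the quoted terms together into one query string."""
    return " OR ".join(f'"{t}"' for t in search_terms)


search, filt = expand("IBD")
print(entrez_query(search))
print(filt)  # still contains UC and CD for post-search relevance filtering
```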

Filtering

GEOtcha automatically filters search results to human RNA-seq datasets. By default, single-cell RNA-seq datasets are excluded (scRNA-seq, snRNA-seq, 10x Genomics, Drop-seq, etc.) since most bulk RNA-seq meta-analyses don't want these mixed in.

Single-cell filtering happens at two levels:

  • Search level: eSummary title/summary scanned for scRNA-seq keywords
  • Sample level: GSM library_source checked for "single cell"
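
The two checks can be approximated with a keyword pattern and a substring test. The keyword list below is illustrative and built from the examples named in this section; the tool's actual list is not given in full, and `looks_single_cell` is a hypothetical helper.

```python
import re

# Illustrative keyword list; not the tool's actual list.
SCRNA_PATTERN = re.compile(
    r"scrna-?seq|snrna-?seq|single[- ]cell|single[- ]nucle(?:us|i)"
    r"|10x genomics|drop-?seq",
    re.IGNORECASE,
)


def looks_single_cell(title, summary="", library_source=""):
    """True if series text or a sample's library_source suggests single-cell data."""
    if "single cell" in library_source.lower():
        return True  # sample-level check
    return bool(SCRNA_PATTERN.search(f"{title} {summary}"))  # search-level check


print(looks_single_cell("scRNA-seq of colon biopsies"))    # True
print(looks_single_cell("Bulk RNA-seq of colon biopsies"))  # False
print(looks_single_cell("RNA-seq", library_source="transcriptomic single cell"))  # True
```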

To include single-cell datasets, use the --include-scrna flag:

geotcha run "IBD" --include-scrna
geotcha extract GSE12345 --include-scrna

Or set it in config:

geotcha config set include_scrna true

Harmonization

Three-tier harmonization pipeline. Rules always run when --harmonize is set; the ML and LLM layers are optional:

1. Rules (always on with --harmonize)

  • Gender: male/M/man → "male"
  • Age: "45 years", "45yo" → "45"
  • Tissue: mapped to UBERON ontology (~4,000 terms from OBO Foundry)
  • Disease: mapped to Disease Ontology (~12,000 DOID terms + common abbreviations like UC, IBD, COPD)
  • Cell type: mapped to Cell Ontology (~3,000 CL terms)
  • Treatment: ~300 drugs/stimuli with ChEBI IDs, brand name synonyms (e.g., Remicade → infliximab)
  • Timepoint: "week 8", "W8" → "W8"
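
The simpler rules amount to lookups and regular expressions. The sketch below reproduces the gender, age, and timepoint examples from the list; the exact patterns, the female mappings, and the function names are assumptions, not the shipped rules.

```python
import re

# The male variants come from the list above; the female ones are assumed.
GENDER = {
    "male": "male", "m": "male", "man": "male",
    "female": "female", "f": "female", "woman": "female",
}


def normalize_gender(raw):
    """Map a gender variant to its canonical form, or None if unknown."""
    return GENDER.get(raw.strip().lower())


def normalize_age(raw):
    """'45 years', '45yo' -> '45'; None when no leading number is found."""
    match = re.match(
        r"\s*(\d+(?:\.\d+)?)\s*(?:years?|yrs?|yo|y)?\b", raw, re.IGNORECASE
    )
    return match.group(1) if match else None


def normalize_timepoint(raw):
    """'week 8', 'W8' -> 'W8'; None for anything else."""
    match = re.fullmatch(r"\s*(?:week|wk|w)\s*(\d+)\s*", raw, re.IGNORECASE)
    return f"W{match.group(1)}" if match else None


print(normalize_gender("M"), normalize_age("45yo"), normalize_timepoint("week 8"))
```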

Five confidence tiers: exact (1.0) → synonym (0.85) → normalized-exact (0.80) → token-set overlap (0.75) → substring (0.70). Fields below the review threshold are flagged for manual review.
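
A tiered matcher of this kind tries the strictest comparison first and falls back step by step. The sketch below uses the five scores quoted above, but how each tier is tested is an assumption (for example, "token-set overlap" is read here as "same tokens in any order"), and `match_confidence` is a hypothetical helper.

```python
import re


def _normalize(text):
    """Lowercase and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_confidence(raw, label, synonyms=()):
    """Score `raw` against an ontology label, trying the strictest tier first."""
    if raw == label:
        return "exact", 1.0
    if raw in synonyms:
        return "synonym", 0.85
    norm_raw, norm_label = _normalize(raw), _normalize(label)
    if norm_raw == norm_label:
        return "normalized-exact", 0.80
    raw_tokens, label_tokens = set(norm_raw.split()), set(norm_label.split())
    if raw_tokens and raw_tokens == label_tokens:
        return "token-set", 0.75
    if norm_label and norm_label in norm_raw:
        return "substring", 0.70
    return None, 0.0


print(match_confidence("Ulcerative-Colitis", "ulcerative colitis"))
print(match_confidence("sigmoid colon biopsy", "colon"))
```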

Extraction uses 40+ GEO characteristic keys (tissue, disease, cell type, treatment) and a source_name parser that splits concatenated metadata (e.g., "colon, ulcerative colitis, male, 45y") into structured fields.
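
The source_name example can be reproduced with a small classifier over comma-separated parts. This is a sketch under stated assumptions: the vocabularies are tiny hypothetical stand-ins for the ontology data, and the real parser may split and classify differently.

```python
import re

# Tiny illustrative vocabularies; the real tool draws on its ontology data.
TISSUES = {"colon", "ileum", "blood"}
DISEASES = {"ulcerative colitis", "crohn's disease"}
GENDERS = {"male", "female"}
AGE = re.compile(r"(\d+)\s*(?:years?|yrs?|yo|y)", re.IGNORECASE)


def parse_source_name(source_name):
    """Split a comma-separated source_name and assign each part to a field."""
    fields = {}
    for part in source_name.split(","):
        token = part.strip().lower()
        age = AGE.fullmatch(token)
        # setdefault keeps the first match if a field appears twice.
        if token in TISSUES:
            fields.setdefault("tissue", token)
        elif token in DISEASES:
            fields.setdefault("disease", token)
        elif token in GENDERS:
            fields.setdefault("gender", token)
        elif age:
            fields.setdefault("age", age.group(1))
    return fields


print(parse_source_name("colon, ulcerative colitis, male, 45y"))
```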

Ontology mappings are shipped as JSON package data (src/geotcha/data/ontology/) with ~27,000 synonyms extracted from official OBO sources. To regenerate from upstream ontologies:

python scripts/build_ontologies.py

2. ML (--ml-mode hybrid or --ml-mode full)

  • GLiNER-BioMed: zero-shot biomedical NER for disease, tissue, cell type, treatment, gender
  • SapBERT + FAISS: entity linking to UBERON/DOID/CL/ChEBI ontology terms via pre-built FAISS indices
  • Build indices once with geotcha ml build-index, check status with geotcha ml status
  • In hybrid mode, ML only fills fields where rules produced low confidence or no value
  • Low-confidence ML predictions flag records with needs_review=True

3. LLM (--llm)

  • Optional LLM-assisted harmonization for ambiguous free-text values
  • Supports OpenAI, Anthropic, and Ollama providers

Each field tracks provenance: _harmonized, _source (rule/ml/llm), _confidence, and _ontology_id.

Documentation

Full documentation: install the docs extra with pip install "geotcha[docs]" and run mkdocs serve, or see the docs/ directory.

License

MIT
