GEOtcha

Extract and harmonize RNA-seq metadata from NCBI GEO.

GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.

Installation

pip install geotcha

With optional extras:

pip install "geotcha[ml]"       # ML harmonization (GLiNER + SapBERT)
pip install "geotcha[parquet]"  # Parquet export support
pip install "geotcha[llm]"      # LLM harmonization

For development:

git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"

Quick Start

Search for datasets

geotcha search "inflammatory bowel disease"

Full pipeline with subset testing

geotcha run "IBD" --subset 5 --output ./results/ --harmonize

Extract from specific GSE IDs

geotcha extract GSE12345 GSE67890 --output ./results/

Output format selection

# CSV (default), TSV, or Parquet
geotcha run "IBD" --harmonize --format csv
geotcha extract GSE12345 -f tsv
geotcha run "IBD" -f parquet --output ./results/

Parquet output requires the parquet extra: pip install "geotcha[parquet]".

Include single-cell RNA-seq datasets (excluded by default)

geotcha run "IBD" --subset 5 --include-scrna

With ML harmonization (zero-shot NER)

pip install "geotcha[ml]"
geotcha run "IBD" --harmonize --ml-mode hybrid

In hybrid mode, ML fills in only the fields that are missing or low-confidence, using GLiNER biomedical NER. Use --ml-mode full to run ML on all fields.

With LLM harmonization

pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic

Combined: rules + ML + LLM

pip install "geotcha[ml,llm]"
geotcha run "IBD" --harmonize --ml-mode hybrid --llm

The harmonization chain runs in order: rules → ML → LLM. Each layer only upgrades fields that are still missing or low-confidence.
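The layering described above can be sketched in a few lines of Python. This is a minimal illustration and not GEOtcha's implementation: the FieldValue class, the run_chain function, the toy layers, and the 0.8 threshold are all hypothetical, and the rule that a later layer replaces a value only when it is more confident is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FieldValue:
    """Hypothetical container for one harmonized field."""
    value: Optional[str]
    source: Optional[str]
    confidence: float


# A layer maps raw text to (value, confidence), or None if it has no answer.
Layer = Callable[[str], Optional[tuple[str, float]]]


def run_chain(raw: str, layers: list[tuple[str, Layer]],
              threshold: float = 0.8) -> FieldValue:
    """Run layers in order; stop once a value is confident enough."""
    result = FieldValue(None, None, 0.0)
    for name, layer in layers:
        # A confident value is never overwritten by a later layer.
        if result.value is not None and result.confidence >= threshold:
            break
        candidate = layer(raw)
        if candidate is None:
            continue
        value, confidence = candidate
        # Assumption: a later layer only replaces a weaker earlier value.
        if result.value is None or confidence > result.confidence:
            result = FieldValue(value, name, confidence)
    return result


# Toy stand-ins for the three tiers (not real models).
def rule_layer(raw: str) -> Optional[tuple[str, float]]:
    return ("male", 1.0) if raw.strip().lower() in {"male", "m", "man"} else None


def ml_layer(raw: str) -> Optional[tuple[str, float]]:
    return ("male", 0.6) if "boy" in raw.lower() else None


def llm_layer(raw: str) -> Optional[tuple[str, float]]:
    return ("male", 0.9) if "boy" in raw.lower() else None


LAYERS = [("rule", rule_layer), ("ml", ml_layer), ("llm", llm_layer)]
```

With these toy layers, run_chain("M", LAYERS) stops at the rule tier, while run_chain("boy, 12y", LAYERS) falls through the rules, gets a low-confidence ML value, and is then upgraded by the LLM tier.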

Structured JSON logging

geotcha run "IBD" --log-json --output ./results/
geotcha extract GSE12345 --log-json

Emits structured JSON log lines to stderr — useful for log aggregation in production pipelines.
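For downstream processing, the captured stderr stream can be read one line at a time. The sketch below assumes only that each log line is a single JSON object; the README does not list the log fields, so the function name iter_json_logs and any keys you look up are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_json_logs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSON log line, skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Non-JSON output (for example a stray traceback line) is ignored.
            continue
        if isinstance(record, dict):
            yield record
```

A typical use would be to redirect stderr to a file (2> run.log) and pass the open file object to iter_json_logs.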

Run report

# After a pipeline run completes, view a summary:
geotcha report <run_id>

# Write report.json to a custom directory:
geotcha report <run_id> --output ./reports/

Prints run metadata (query, ID counts, failures, stage timings) and writes a report.json file.

CI / non-interactive mode

geotcha run "IBD" --non-interactive --output ./results/
geotcha run "IBD" --yes --subset 10 --harmonize

Python SDK

from geotcha import GEOtchaClient

client = GEOtchaClient(ncbi_api_key="...")
ids = client.search("ulcerative colitis")
records = client.extract(ids[:5])
records = client.harmonize(records, ml_mode="hybrid")
client.export(records, output_dir="./results", fmt="parquet")

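The SDK does not depend on Typer or Rich, so it is safe to use in notebooks, scripts, and downstream pipelines. GSE records that fail to parse are skipped silently, so the result may contain fewer records than the number of IDs requested.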

Configuration

# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"

# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"

# View current configuration
geotcha config show

# Validate configuration
geotcha config validate

Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
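The precedence order can be illustrated with a small lookup function. This is a sketch of the idea rather than GEOtcha's code: resolve_setting is a hypothetical name, and the mapping from a key to its environment variable (GEOTCHA_ plus the upper-cased key) is an assumption, since the README only gives the GEOTCHA_* prefix.

```python
import os
from typing import Any, Mapping, Optional


def resolve_setting(key: str,
                    cli_flags: Mapping[str, Any],
                    config_file: Mapping[str, Any],
                    defaults: Mapping[str, Any],
                    environ: Optional[Mapping[str, str]] = None) -> Any:
    """Return the first value found: CLI > environment > config file > defaults."""
    env = os.environ if environ is None else environ
    # 1. An explicitly passed CLI flag wins.
    if cli_flags.get(key) is not None:
        return cli_flags[key]
    # 2. Environment variable (naming scheme is an assumption).
    env_key = "GEOTCHA_" + key.upper()
    if env_key in env:
        return env[env_key]
    # 3. Value from the config file, already parsed into a mapping.
    if key in config_file:
        return config_file[key]
    # 4. Built-in default, or None if there is none.
    return defaults.get(key)
```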

Output

GEOtcha produces:

  • gse_summary.csv — One row per GSE with series-level metadata (or .tsv / .parquet with --format)
  • gsm/<GSE_ID>_samples.csv — Per-GSE file with sample-level metadata
  • manifest.json (in run state dir) — Audit trail: run_id, query, timestamps, stage timings, counts, masked settings
  • review_queue.csv — Low-confidence harmonized fields flagged for manual review (always CSV)
  • With --harmonize: additional _harmonized, _source, _confidence, and _ontology_id columns
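When working with harmonized output in Python, it can help to group the extra columns by field. The sketch below uses only the standard library and assumes the columns are named as the field name plus one of the four suffixes (for example tissue_harmonized); that naming, the 0 to 1 confidence scale, the 0.8 threshold, and the sample rows are all illustrative assumptions.

```python
import csv
import io

SUFFIXES = ("_harmonized", "_source", "_confidence", "_ontology_id")


def harmonized_fields(header):
    """Map each field that has a *_harmonized column to its provenance columns."""
    columns = set(header)
    marker = "_harmonized"
    # Keying on *_harmonized avoids misreading raw columns such as library_source.
    bases = [c[: -len(marker)] for c in header if c.endswith(marker)]
    return {b: [b + s for s in SUFFIXES if b + s in columns] for b in bases}


def low_confidence_rows(rows, field, threshold=0.8):
    """Return rows whose <field>_confidence is below the threshold."""
    flagged = []
    for row in rows:
        try:
            if float(row.get(f"{field}_confidence", "")) < threshold:
                flagged.append(row)
        except (TypeError, ValueError):
            continue  # empty or non-numeric confidence: leave the row out
    return flagged


# Made-up sample data, only to exercise the helpers above.
SAMPLE = """gse_id,tissue,tissue_harmonized,tissue_source,tissue_confidence,tissue_ontology_id,library_source
GSE12345,colon biopsy,colon,rule,0.95,UBERON:0001155,transcriptomic
GSE67890,gut,intestine,ml,0.55,UBERON:0000160,transcriptomic
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
```

On real output, replace io.StringIO(SAMPLE) with an open handle on gse_summary.csv or one of the per-GSE sample files.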

Fields extracted

  • GSE (series level): ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info
  • GSM (sample level): ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status

Interactive Flow

$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.

Run on a subset first? [Y/n]: Y
Subset size [5]: 5

Processing 5/182 datasets...
 [████████████████] 5/5 complete

Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)

Proceed with remaining 177 datasets? [Y/n]:

Use --yes or --non-interactive to skip all prompts (useful for CI and scripted workflows).
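For scripted workflows driven from Python, one option is to build the command line and hand it to subprocess. The sketch below uses only flags shown in this README; the helper names are hypothetical, and relying on a non-zero exit code to signal failure is an assumption that should be checked against the installed version.

```python
import subprocess
from typing import List, Optional


def build_run_command(query: str, output_dir: str,
                      subset: Optional[int] = None,
                      harmonize: bool = False) -> List[str]:
    """Assemble a prompt-free `geotcha run` invocation as an argv list."""
    cmd = ["geotcha", "run", query, "--non-interactive", "--output", output_dir]
    if subset is not None:
        cmd += ["--subset", str(subset)]
    if harmonize:
        cmd.append("--harmonize")
    return cmd


def run_pipeline(query: str, output_dir: str, **options) -> None:
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_run_command(query, output_dir, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with queries that contain spaces, such as "inflammatory bowel disease".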

Resume

Interrupted runs can be resumed with geotcha resume <run_id>. Resume merges the rows already written to gse_summary.csv with the newly extracted records and deduplicates them by gse_id.
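The merge step amounts to a keyed union of two row sets. The sketch below shows one way to do it with plain dictionaries; merge_by_gse_id is a hypothetical helper, and letting the newly extracted row win on a duplicate gse_id is an assumption, since the README does not say which copy is kept.

```python
def merge_by_gse_id(previous, new):
    """Combine two lists of row dicts, keeping one row per gse_id."""
    merged = {}
    for row in list(previous) + list(new):
        # Later rows overwrite earlier ones with the same key (assumption);
        # dict ordering keeps each ID at its first-seen position.
        merged[row["gse_id"]] = row
    return list(merged.values())
```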

Disease Expansion

GEOtcha automatically expands disease keywords to capture related terms:

  • IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease (abbreviations like UC, CD are used for relevance filtering)
  • SLE → systemic lupus erythematosus, lupus
  • RA → rheumatoid arthritis
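Conceptually, the expansion is a lookup from an abbreviation to its related terms. The sketch below hard-codes the three examples listed above; the EXPANSIONS table and expand_query function are hypothetical, and keeping the original keyword as the first search term is an assumption.

```python
# Expansion table built from the three examples above (illustrative only).
EXPANSIONS = {
    "ibd": ["inflammatory bowel disease", "ulcerative colitis", "Crohn's disease"],
    "sle": ["systemic lupus erythematosus", "lupus"],
    "ra": ["rheumatoid arthritis"],
}


def expand_query(keyword: str) -> list:
    """Return the keyword followed by any related terms known for it."""
    term = keyword.strip()
    # Lookup is case-insensitive; unknown keywords pass through unchanged.
    return [term] + EXPANSIONS.get(term.lower(), [])
```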

Filtering

GEOtcha automatically filters search results to human RNA-seq datasets. By default, single-cell RNA-seq datasets are excluded (scRNA-seq, snRNA-seq, 10x Genomics, Drop-seq, etc.) since most bulk RNA-seq meta-analyses don't want these mixed in.

Single-cell filtering happens at two levels:

  • Search level: eSummary title/summary scanned for scRNA-seq keywords
  • Sample level: GSM library_source checked for "single cell"
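The two checks can be approximated as follows. This is a simplified sketch, not the tool's actual filter: the keyword pattern covers only the four examples named above, and both function names are hypothetical.

```python
import re

# Keywords taken from the examples above; the real list is likely longer.
SC_PATTERN = re.compile(r"scrna-?seq|snrna-?seq|10x genomics|drop-?seq",
                        re.IGNORECASE)


def looks_single_cell(title: str, summary: str) -> bool:
    """Search-level check: scan title and summary text for keywords."""
    return bool(SC_PATTERN.search(f"{title} {summary}"))


def sample_is_single_cell(library_source: str) -> bool:
    """Sample-level check: look for "single cell" in the library source."""
    return "single cell" in library_source.lower()
```

The search-level check is cheap but can miss datasets that never mention the technology in their title or summary, which is presumably why a second check at the sample level is useful.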

To include single-cell datasets, use the --include-scrna flag:

geotcha run "IBD" --include-scrna
geotcha extract GSE12345 --include-scrna

Or set it in config:

geotcha config set include_scrna true

Harmonization

Harmonization is a three-tier pipeline. Rules run whenever --harmonize is set; the ML and LLM tiers are opt-in:

1. Rules (always on with --harmonize)

  • Gender: male/M/man → "male"
  • Age: "45 years", "45yo" → "45"
  • Tissue: mapped to UBERON ontology terms
  • Disease: mapped to Disease Ontology terms
  • Timepoint: "week 8", "W8" → "W8"
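The gender, age, and timepoint rules above are simple enough to sketch with regular expressions. The functions below reproduce only the examples listed; the function names, the extra unit spellings (yrs, y, wk), and returning None for unrecognized input are illustrative assumptions.

```python
import re
from typing import Optional


def normalize_gender(raw: str) -> Optional[str]:
    """Map male/M/man to "male"; other values are out of scope here."""
    return "male" if raw.strip().lower() in {"male", "m", "man"} else None


def normalize_age(raw: str) -> Optional[str]:
    """Strip a trailing unit: "45 years" and "45yo" both become "45"."""
    match = re.fullmatch(r"\s*(\d+)\s*(?:years?|yrs?|yo|y)?\s*", raw,
                         re.IGNORECASE)
    return match.group(1) if match else None


def normalize_timepoint(raw: str) -> Optional[str]:
    """Normalize week notation: "week 8" and "W8" both become "W8"."""
    match = re.fullmatch(r"\s*(?:week|wk|w)\s*(\d+)\s*", raw, re.IGNORECASE)
    return f"W{match.group(1)}" if match else None
```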

2. ML (--ml-mode hybrid or --ml-mode full)

  • GLiNER-BioMed: zero-shot biomedical NER for disease, tissue, cell type, treatment, gender
  • SapBERT: entity linking to UBERON/DOID ontology terms (currently a scaffold; building the ontology index is deferred)
  • In hybrid mode, ML only fills fields where rules produced low confidence or no value
  • Low-confidence ML predictions flag records with needs_review=True

3. LLM (--llm)

  • Optional LLM-assisted harmonization for ambiguous free-text values
  • Supports OpenAI, Anthropic, and Ollama providers

Each field tracks provenance: _harmonized, _source (rule/ml/llm), _confidence, and _ontology_id.

License

MIT
