Extract and harmonize RNA-seq metadata from NCBI GEO

Maintainers

shantanubafna

GEOtcha

GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.

Installation

pip install geotcha

With optional extras:

pip install "geotcha[ml]"       # ML harmonization (GLiNER + SapBERT + FAISS)
pip install "geotcha[parquet]"  # Parquet export support
pip install "geotcha[llm]"      # LLM harmonization

For development:

git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"

Quick Start

Search for datasets

geotcha search "inflammatory bowel disease"

Full pipeline with subset testing

geotcha run "IBD" --subset 5 --output ./results/ --harmonize

Extract from specific GSE IDs

geotcha extract GSE12345 GSE67890 --output ./results/

Output format selection

# CSV (default), TSV, or Parquet
geotcha run "IBD" --harmonize --format csv
geotcha extract GSE12345 -f tsv
geotcha run "IBD" -f parquet --output ./results/

Parquet output requires the parquet extra (pip install "geotcha[parquet]").

Include single-cell RNA-seq datasets (excluded by default)

geotcha run "IBD" --subset 5 --include-scrna

Disease packs

# List available packs
geotcha packs

# Use a pack for optimized search
geotcha run "IBD" --pack ibd --harmonize
geotcha run "breast cancer" --pack oncology --harmonize

Available packs: ibd, oncology, neurodegeneration, autoimmune, metabolic.

With ML harmonization (zero-shot NER + entity linking)

pip install "geotcha[ml]"

# Build FAISS ontology indices (one-time)
geotcha ml build-index

# Run with ML
geotcha run "IBD" --harmonize --ml-mode hybrid

In hybrid mode, ML fills in only the fields that rules left missing or low-confidence, using GLiNER biomedical NER and SapBERT entity linking. Use --ml-mode full to run ML on all fields.

With LLM harmonization

pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic

Combined: rules + ML + LLM

pip install "geotcha[ml,llm]"
geotcha run "IBD" --harmonize --ml-mode hybrid --llm

The harmonization chain runs in order: rules → ML → LLM. Each layer only upgrades fields that are still missing or low-confidence.
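The layering described above can be sketched as follows. This is a minimal illustration, not GEOtcha's implementation: the `FieldValue` dataclass, the three toy layer functions, and the 0.75 cut-off for "low-confidence" are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical cut-off: values below it count as "low-confidence".
LOW_CONFIDENCE = 0.75


@dataclass
class FieldValue:
    value: Optional[str]
    source: str  # "rule", "ml", or "llm"
    confidence: float


# A layer maps raw text to a candidate value, or None if it has no answer.
Layer = Callable[[str], Optional[FieldValue]]


def needs_upgrade(current: Optional[FieldValue]) -> bool:
    """A field is open for upgrade while it is missing or low-confidence."""
    return current is None or current.value is None or current.confidence < LOW_CONFIDENCE


def harmonize_field(raw: str, layers: list[Layer]) -> Optional[FieldValue]:
    """Run layers in order; later layers only touch fields that still need work."""
    current: Optional[FieldValue] = None
    for layer in layers:
        if not needs_upgrade(current):
            break
        candidate = layer(raw)
        # Accept a candidate only if it improves on what is already there.
        if candidate is not None and (current is None or candidate.confidence > current.confidence):
            current = candidate
    return current


# Toy layers standing in for rules, ML, and LLM (illustrative only).
def rules(raw: str) -> Optional[FieldValue]:
    return FieldValue("male", "rule", 1.0) if raw.strip().lower() in {"m", "male", "man"} else None


def ml(raw: str) -> Optional[FieldValue]:
    return FieldValue("colon", "ml", 0.9) if "colon" in raw.lower() else None


def llm(raw: str) -> Optional[FieldValue]:
    return FieldValue("unknown", "llm", 0.5)


chain = [rules, ml, llm]
print(harmonize_field("M", chain))
print(harmonize_field("sigmoid colon biopsy", chain))
```

Because the loop stops as soon as a field is both present and confident, the more expensive layers only see the values that the cheaper ones could not settle.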

Structured JSON logging

geotcha run "IBD" --log-json --output ./results/
geotcha extract GSE12345 --log-json

Emits structured JSON log lines to stderr, which is useful for log aggregation in production pipelines.

Benchmark harmonization quality

# Run against bundled fixtures (100 curated datasets)
geotcha benchmark

# Custom fixtures and output
geotcha benchmark --input ./my_fixtures/ --output ./report.json

# Benchmark with ML enabled
geotcha benchmark --ml-mode hybrid

Produces a JSON report with per-field exact match, completeness, ontology coverage, and confidence metrics.
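To make two of these metrics concrete, here is a small sketch of how per-field exact match and completeness could be computed. The definitions are assumptions for illustration (exact match is measured over records that have an expected value; completeness over all records), and `field_metrics` is a hypothetical helper, not part of GEOtcha.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def field_metrics(predicted: list[Record], expected: list[Record], field: str) -> dict[str, float]:
    """Compute exact match and completeness for one field over paired records."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    total = len(predicted)
    # Completeness: share of records where the pipeline produced any value.
    filled = sum(1 for p in predicted if p.get(field))
    # Exact match: share of labelled records where the value equals the label.
    labelled = [(p, e) for p, e in zip(predicted, expected) if e.get(field)]
    matches = sum(1 for p, e in labelled if p.get(field) == e.get(field))
    return {
        "exact_match": matches / len(labelled) if labelled else 0.0,
        "completeness": filled / total if total else 0.0,
    }


predicted = [{"tissue": "colon"}, {"tissue": None}, {"tissue": "ileum"}, {"tissue": "blood"}]
expected = [{"tissue": "colon"}, {"tissue": "colon"}, {"tissue": "ileum"}, {"tissue": "liver"}]
metrics = field_metrics(predicted, expected, "tissue")
print(metrics)
```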

Run report

# After a pipeline run completes, view a summary:
geotcha report <run_id>

# Write report.json to a custom directory:
geotcha report <run_id> --output ./reports/

Prints run metadata (query, ID counts, failures, stage timings) and writes a report.json file.

CI / non-interactive mode

geotcha run "IBD" --non-interactive --output ./results/
geotcha run "IBD" --yes --subset 10 --harmonize

Python SDK

from geotcha import GEOtchaClient

client = GEOtchaClient(ncbi_api_key="...")
ids = client.search("ulcerative colitis")
records = client.extract(ids[:5])
records = client.harmonize(records, ml_mode="hybrid")
client.export(records, output_dir="./results", fmt="parquet")

# Benchmark harmonization quality
report = client.benchmark()
print(report["summary"]["overall_exact_match"])

The SDK does not depend on Typer or Rich, so it is safe to use in notebooks, scripts, and downstream pipelines. GSEs that fail to parse are skipped silently, so the result may contain fewer records than the number of IDs passed in.

Configuration

# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"

# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"

# View current configuration
geotcha config show

# Validate configuration
geotcha config validate

Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
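The precedence order can be read as a simple first-match lookup. The sketch below is illustrative only: `resolve_setting` is a hypothetical helper, and the mapping from a setting name to its environment variable (upper-cased name with the GEOTCHA_ prefix) is an assumption.

```python
import os
from typing import Any, Mapping, Optional

ENV_PREFIX = "GEOTCHA_"


def resolve_setting(
    name: str,
    cli_flags: Mapping[str, Any],
    config_file: Mapping[str, Any],
    defaults: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> Any:
    """Return the first value found, checking sources from highest to lowest priority."""
    environ = os.environ if environ is None else environ
    # 1. CLI flags win when they were actually supplied.
    if cli_flags.get(name) is not None:
        return cli_flags[name]
    # 2. Environment variables (assumed naming, e.g. GEOTCHA_NCBI_EMAIL).
    env_key = ENV_PREFIX + name.upper()
    if env_key in environ:
        return environ[env_key]
    # 3. Values from the config file.
    if name in config_file:
        return config_file[name]
    # 4. Built-in defaults.
    return defaults.get(name)


defaults = {"ncbi_email": None, "include_scrna": False}
config_file = {"ncbi_email": "file@example.com"}
environ = {"GEOTCHA_NCBI_EMAIL": "env@example.com"}

print(resolve_setting("ncbi_email", {}, config_file, defaults, environ))
print(resolve_setting("ncbi_email", {"ncbi_email": "cli@example.com"}, config_file, defaults, environ))
```

Note that values read from the environment arrive as strings, so a real implementation would also need to coerce them to the expected type.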

Output

GEOtcha produces:

- gse_summary.csv: one row per GSE with series-level metadata (.tsv or .parquet when selected with --format)
- gsm/<GSE_ID>_samples.csv: one file per GSE with sample-level metadata
- manifest.json (in the run state directory): audit trail with run_id, query, timestamps, stage timings, counts, and masked settings
- review_queue.csv: low-confidence harmonized fields flagged for manual review (always CSV)
- With --harmonize: additional _harmonized, _source, _confidence, and _ontology_id columns

Fields extracted

Level	Fields
GSE	ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info
GSM	ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status

Interactive Flow

$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.

Run on a subset first? [Y/n]: Y
Subset size [5]: 5

Processing 5/182 datasets...
 [████████████████] 5/5 complete

Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)

Proceed with remaining 177 datasets? [Y/n]:

Use --yes or --non-interactive to skip all prompts (useful for CI and scripted workflows).

Resume

Interrupted runs can be resumed with geotcha resume <run_id>. Resume correctly merges previously extracted rows in gse_summary.csv with newly extracted records, deduplicating by gse_id.
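The merge step amounts to a keyed union of two row sets. The sketch below shows one way to do it; `merge_gse_rows` is a hypothetical helper, and the choice that a freshly extracted row replaces a saved row with the same gse_id is an assumption, since the text does not say which side wins.

```python
def merge_gse_rows(previous: list[dict], new: list[dict]) -> list[dict]:
    """Merge two lists of rows, keeping exactly one row per gse_id."""
    merged: dict[str, dict] = {}
    for row in previous:
        merged[row["gse_id"]] = row
    for row in new:
        # Assumption: the newly extracted row replaces the saved one.
        merged[row["gse_id"]] = row
    # Dicts preserve insertion order, so previously seen IDs keep their position.
    return list(merged.values())


previous = [{"gse_id": "GSE1", "title": "old"}, {"gse_id": "GSE2", "title": "kept"}]
new = [{"gse_id": "GSE1", "title": "new"}, {"gse_id": "GSE3", "title": "added"}]
rows = merge_gse_rows(previous, new)
print([row["gse_id"] for row in rows])
```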

Disease Expansion

GEOtcha automatically expands disease keywords to capture related terms using two sources:

- Hand-curated aliases for common abbreviations (IBD, SLE, RA, COPD, etc.)
- DOID ontology subtypes: any disease in the 12,000-term Disease Ontology is auto-expanded with its subtypes (e.g., "breast cancer" also searches "breast carcinoma", "triple-negative breast cancer", etc.)

Examples:

- IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease + DOID subtypes
- breast cancer → breast cancer, breast carcinoma, male/female breast cancer, luminal breast carcinoma, ...
- melanoma → melanoma, skin melanoma, uveal melanoma, mucosal melanoma, ...

Short ambiguous abbreviations (UC, CD) are excluded from Entrez search queries but used for post-search relevance filtering.
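A rough sketch of this expansion logic follows. The tables are toy data built from the examples in this section, `expand_keyword` is a hypothetical name, and the way aliases and subtypes are stored is an assumption; the real tool draws on the full Disease Ontology.

```python
# Toy lookup tables, limited to examples mentioned in the text.
ALIASES = {
    "IBD": ["inflammatory bowel disease", "ulcerative colitis", "Crohn's disease", "UC", "CD"],
}
SUBTYPES = {
    "breast cancer": ["breast carcinoma", "triple-negative breast cancer"],
}
# Short abbreviations that are too ambiguous to send to Entrez.
AMBIGUOUS = {"UC", "CD"}


def expand_keyword(keyword: str) -> tuple[list[str], list[str]]:
    """Return (search_terms, relevance_terms) for a disease keyword."""
    candidates = [keyword, *ALIASES.get(keyword, []), *SUBTYPES.get(keyword.lower(), [])]
    # Deduplicate while preserving order.
    relevance_terms = list(dict.fromkeys(candidates))
    # Ambiguous abbreviations stay out of the query but remain usable for filtering.
    search_terms = [term for term in relevance_terms if term not in AMBIGUOUS]
    return search_terms, relevance_terms


search_terms, relevance_terms = expand_keyword("IBD")
print(search_terms)
print(relevance_terms)
```

Keeping two separate lists lets the query stay precise while the post-search relevance check can still recognize records that only mention the short form.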

Filtering

GEOtcha automatically filters search results to human RNA-seq datasets. By default, single-cell datasets (scRNA-seq, snRNA-seq, 10x Genomics, Drop-seq, etc.) are excluded, because most bulk RNA-seq meta-analyses should not have them mixed in.

Single-cell filtering happens at two levels:

- Search level: eSummary title/summary scanned for scRNA-seq keywords
- Sample level: GSM library_source checked for "single cell"
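The two checks can be sketched as a pair of predicates. The keyword pattern below only covers the examples named in this section, the function names are hypothetical, and dropping a whole dataset when any sample looks single-cell is an assumption about how the two levels combine.

```python
import re

# Keywords taken from the examples in the text; the real list is likely longer.
SC_PATTERN = re.compile(
    r"scrna-?seq|snrna-?seq|single[- ]cell|10x genomics|drop-?seq",
    re.IGNORECASE,
)


def search_level_hit(title: str, summary: str) -> bool:
    """Search level: scan the eSummary title and summary for single-cell keywords."""
    return bool(SC_PATTERN.search(f"{title} {summary}"))


def sample_level_hit(library_source: str) -> bool:
    """Sample level: check a GSM library_source value for 'single cell'."""
    return "single cell" in library_source.lower()


def keep_dataset(title: str, summary: str, library_sources: list[str], include_scrna: bool = False) -> bool:
    """Decide whether a dataset survives the single-cell filter."""
    if include_scrna:
        return True
    if search_level_hit(title, summary):
        return False
    # Assumption: one single-cell sample is enough to exclude the dataset.
    return not any(sample_level_hit(source) for source in library_sources)


print(keep_dataset("Bulk RNA-seq of colon biopsies", "", ["transcriptomic"]))
print(keep_dataset("scRNA-seq of colon biopsies", "", ["transcriptomic"]))
```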

To include single-cell datasets, use the --include-scrna flag:

geotcha run "IBD" --include-scrna
geotcha extract GSE12345 --include-scrna

Or set it in config:

geotcha config set include_scrna true

Harmonization

Harmonization is a three-tier pipeline. Rules always run when --harmonize is set; the ML and LLM layers are opt-in:

1. Rules (always on with `--harmonize`)

- Gender: male/M/man → "male"
- Age: "45 years", "45yo" → "45"
- Tissue: mapped to UBERON ontology (~4,000 terms from OBO Foundry)
- Disease: mapped to Disease Ontology (~12,000 DOID terms + common abbreviations like UC, IBD, COPD)
- Cell type: mapped to Cell Ontology (~3,000 CL terms)
- Treatment: ~300 drugs/stimuli with ChEBI IDs, brand name synonyms (e.g., Remicade → infliximab)
- Timepoint: "week 8", "W8" → "W8"
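The gender, age, and timepoint rules are simple enough to sketch with a lookup table and two regular expressions. The patterns below are a guess at the kind of rule involved, covering the examples listed above; the function names are hypothetical and the female mappings are added by analogy.

```python
import re
from typing import Optional

# Male mappings come from the text; female mappings are assumed by analogy.
GENDER_MAP = {
    "male": "male", "m": "male", "man": "male",
    "female": "female", "f": "female", "woman": "female",
}
AGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:years?(?:\s+old)?|yrs?|yo|y)?\s*$", re.IGNORECASE)
TIMEPOINT_RE = re.compile(r"^\s*(?:week|wk|w)\s*(\d+)\s*$", re.IGNORECASE)


def normalize_gender(raw: str) -> Optional[str]:
    """Map common gender spellings to a canonical value."""
    return GENDER_MAP.get(raw.strip().lower())


def normalize_age(raw: str) -> Optional[str]:
    """Strip unit suffixes such as 'years' or 'yo' and keep the number."""
    match = AGE_RE.match(raw)
    return match.group(1) if match else None


def normalize_timepoint(raw: str) -> Optional[str]:
    """Collapse week-style timepoints to the compact 'W<n>' form."""
    match = TIMEPOINT_RE.match(raw)
    return f"W{int(match.group(1))}" if match else None


print(normalize_gender("M"), normalize_age("45 years"), normalize_timepoint("week 8"))
```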

Five confidence tiers: exact (1.0) → synonym (0.85) → normalized-exact (0.80) → token-set overlap (0.75) → substring (0.70). Fields below the review threshold are flagged for manual review.
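One way to picture the tiers is as a cascade of progressively looser comparisons, tried from strictest to loosest. The scores below are the ones listed above; everything else is an assumption for illustration, including the toy ontology with placeholder IDs, the exact definition of each comparison, and the `match_term` name.

```python
import re
from typing import Optional

# Tier names and scores, strictest first.
TIERS = [
    ("exact", 1.0),
    ("synonym", 0.85),
    ("normalized_exact", 0.80),
    ("token_set", 0.75),
    ("substring", 0.70),
]

# Toy ontology with placeholder IDs: first name is the label, the rest are synonyms.
ONTOLOGY = {
    "EX:0001": ["colon", "large bowel"],
    "EX:0002": ["blood", "peripheral blood"],
}


def _normalize(text: str) -> str:
    """Lower-case and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def _matches(tier: str, raw: str, names: list[str]) -> bool:
    lowered = raw.strip().lower()
    normalized = _normalize(raw)
    label, synonyms = names[0], names[1:]
    if tier == "exact":
        return lowered == label
    if tier == "synonym":
        return lowered in synonyms
    if tier == "normalized_exact":
        return normalized in {_normalize(name) for name in names}
    if tier == "token_set":
        raw_tokens = set(normalized.split())
        return any(set(_normalize(name).split()) <= raw_tokens for name in names)
    if tier == "substring":
        return any(_normalize(name) in normalized for name in names)
    return False


def match_term(raw: str) -> Optional[tuple[str, str, float]]:
    """Return (term_id, tier, confidence) for the strictest tier that matches."""
    for tier, score in TIERS:
        for term_id, names in ONTOLOGY.items():
            if _matches(tier, raw, names):
                return term_id, tier, score
    return None


for value in ["colon", "large bowel", "Large-Bowel", "sigmoid colon biopsy", "colonic mucosa"]:
    print(value, "->", match_term(value))
```

Trying the strict tiers first means a value only receives a lower score when no stronger evidence is available.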

Extraction uses 40+ GEO characteristic keys (tissue, disease, cell type, treatment) and a source_name parser that splits concatenated metadata (e.g., "colon, ulcerative colitis, male, 45y") into structured fields.
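The source_name splitting can be pictured as "split on delimiters, then classify each piece". The sketch below handles the example from the text with small toy vocabularies; the delimiters, vocabularies, and `parse_source_name` name are assumptions, and the real parser presumably relies on the full ontology mappings.

```python
import re

# Toy vocabularies, just large enough for the example.
TISSUES = {"colon", "ileum", "blood"}
DISEASES = {"ulcerative colitis", "crohn's disease"}
GENDERS = {"male", "female"}
AGE_RE = re.compile(r"^(\d+)\s*(?:y|yo|yrs?|years?)$")


def parse_source_name(source_name: str) -> dict[str, str]:
    """Split a concatenated source_name and assign each piece to a field."""
    fields: dict[str, str] = {}
    for part in re.split(r"[,;|]", source_name):
        token = part.strip().lower()
        if not token:
            continue
        age = AGE_RE.match(token)
        # setdefault keeps the first value seen for each field.
        if age:
            fields.setdefault("age", age.group(1))
        elif token in GENDERS:
            fields.setdefault("gender", token)
        elif token in DISEASES:
            fields.setdefault("disease", token)
        elif token in TISSUES:
            fields.setdefault("tissue", token)
    return fields


print(parse_source_name("colon, ulcerative colitis, male, 45y"))
```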

Ontology mappings are shipped as JSON package data (src/geotcha/data/ontology/) with ~27,000 synonyms extracted from official OBO sources. To regenerate from upstream ontologies:

python scripts/build_ontologies.py

2. ML (`--ml-mode hybrid` or `--ml-mode full`)

- GLiNER-BioMed: zero-shot biomedical NER for disease, tissue, cell type, treatment, gender
- SapBERT + FAISS: entity linking to UBERON/DOID/CL/ChEBI ontology terms via pre-built FAISS indices
- Build the indices once with geotcha ml build-index; check their status with geotcha ml status
- In hybrid mode, ML only fills fields where rules produced low confidence or no value
- Low-confidence ML predictions flag records with needs_review=True

3. LLM (`--llm`)

- Optional LLM-assisted harmonization for ambiguous free-text values
- Supports OpenAI, Anthropic, and Ollama providers

Each field tracks provenance: _harmonized, _source (rule/ml/llm), _confidence, and _ontology_id.

Documentation

Full documentation: run pip install "geotcha[docs]" and then mkdocs serve, or browse the docs/ directory.

License

MIT

Release history

- 0.9.0 (this version): Mar 7, 2026
- 0.8.0: Mar 7, 2026
- 0.6.1: Mar 5, 2026
- 0.6.0: Mar 5, 2026
- 0.1.1: Feb 22, 2026
- 0.1.0: Feb 22, 2026

Download files

Source Distribution

geotcha-0.9.0.tar.gz (713.3 kB), uploaded Mar 7, 2026, Source

Built Distribution

geotcha-0.9.0-py3-none-any.whl (666.0 kB), uploaded Mar 7, 2026, Python 3

File details

Details for the file geotcha-0.9.0.tar.gz.

File metadata

Download URL: geotcha-0.9.0.tar.gz
Upload date: Mar 7, 2026
Size: 713.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for geotcha-0.9.0.tar.gz
Algorithm	Hash digest
SHA256	`a8dad7a6c6d74273bbf2f21329b043fcd41a215eb7d801e53145e08376042291`
MD5	`62d077d93004a3a202240879fcc64a1b`
BLAKE2b-256	`3d3e12ae5d15e981d12a9dde6380b0da6189cea7e9144a69cdc6439a3cf06745`
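
A downloaded file can be checked against these digests with Python's standard hashlib module. This is a minimal sketch: the helper names and the assumption that the sdist sits in the current directory are illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for geotcha-0.9.0.tar.gz.
EXPECTED_SHA256 = "a8dad7a6c6d74273bbf2f21329b043fcd41a215eb7d801e53145e08376042291"


def file_digests(path, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Compute SHA256 and BLAKE2b-256 hex digests of a file, reading it in chunks."""
    sha256 = hashlib.sha256()
    blake2b_256 = hashlib.blake2b(digest_size=32)  # 32 bytes = 256 bits
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            blake2b_256.update(chunk)
    return {"sha256": sha256.hexdigest(), "blake2b_256": blake2b_256.hexdigest()}


def matches_sha256(path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return file_digests(path)["sha256"] == expected_hex.strip().lower()


# Assumed download location; the check is skipped if the file is not present.
sdist = Path("geotcha-0.9.0.tar.gz")
if sdist.exists():
    print("SHA256 OK" if matches_sha256(sdist, EXPECTED_SHA256) else "SHA256 MISMATCH")
```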

Provenance

The following attestation bundles were made for geotcha-0.9.0.tar.gz:

Publisher: release.yml on shantanubafna/GEOtcha

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: geotcha-0.9.0.tar.gz
- Subject digest: a8dad7a6c6d74273bbf2f21329b043fcd41a215eb7d801e53145e08376042291
- Sigstore transparency entry: 1057506079
- Sigstore integration time: Mar 7, 2026
Source repository:
- Permalink: shantanubafna/GEOtcha@ad82c4945b0e4b32915eb2df8891dfd3ee450e1d
- Branch / Tag: refs/tags/v0.9.0
- Owner: https://github.com/shantanubafna
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ad82c4945b0e4b32915eb2df8891dfd3ee450e1d
- Trigger Event: push

File details

Details for the file geotcha-0.9.0-py3-none-any.whl.

File metadata

Download URL: geotcha-0.9.0-py3-none-any.whl
Upload date: Mar 7, 2026
Size: 666.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for geotcha-0.9.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8eb38aab4d060dffcb647a89a2c3dc01c77e672b2f449e989c23ef6078088e8d`
MD5	`595b73481ed1998cd6fa8083637d1762`
BLAKE2b-256	`b63e23829cc2812b962a0fc542f922b538c201b9213ad71aa9151c88e697e6fe`

Provenance

The following attestation bundles were made for geotcha-0.9.0-py3-none-any.whl:

Publisher: release.yml on shantanubafna/GEOtcha

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: geotcha-0.9.0-py3-none-any.whl
- Subject digest: 8eb38aab4d060dffcb647a89a2c3dc01c77e672b2f449e989c23ef6078088e8d
- Sigstore transparency entry: 1057506087
- Sigstore integration time: Mar 7, 2026
Source repository:
- Permalink: shantanubafna/GEOtcha@ad82c4945b0e4b32915eb2df8891dfd3ee450e1d
- Branch / Tag: refs/tags/v0.9.0
- Owner: https://github.com/shantanubafna
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ad82c4945b0e4b32915eb2df8891dfd3ee450e1d
- Trigger Event: push
