Extract and harmonize RNA-seq metadata from NCBI GEO

GEOtcha

GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.

Installation

pip install geotcha

For development:

git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"

Quick Start

Search for datasets

geotcha search "inflammatory bowel disease"

Full pipeline with subset testing

geotcha run "IBD" --subset 5 --output ./results/ --harmonize

Extract from specific GSE IDs

geotcha extract GSE12345 GSE67890 --output ./results/

Include single-cell RNA-seq datasets (excluded by default)

geotcha run "IBD" --subset 5 --include-scrna

With LLM harmonization

pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic

CI / non-interactive mode

geotcha run "IBD" --non-interactive --output ./results/
geotcha run "IBD" --yes --subset 10 --harmonize

Configuration

# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"

# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"

# View current configuration
geotcha config show

Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
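
The priority order above can be pictured as a simple first-match lookup. This is a minimal sketch, not GEOtcha's actual code: the function name `resolve_setting`, the `DEFAULTS` table, and the exact mapping from a key to a `GEOTCHA_<KEY>` environment variable are assumptions made for illustration.

```python
# Hypothetical defaults; the real keys and default values may differ.
DEFAULTS = {"ncbi_api_key": None, "ncbi_email": None, "include_scrna": False}


def resolve_setting(key, cli_flags, environ, file_config, defaults=DEFAULTS):
    """Return the first value found, checking sources from highest to lowest priority."""
    # 1. CLI flags win when the flag was actually passed.
    if cli_flags.get(key) is not None:
        return cli_flags[key]
    # 2. Environment variables (assumed naming: GEOTCHA_ + upper-cased key).
    env_name = "GEOTCHA_" + key.upper()
    if env_name in environ:
        return environ[env_name]
    # 3. Values read from the config file.
    if key in file_config:
        return file_config[key]
    # 4. Built-in defaults.
    return defaults.get(key)


cli_flags = {"ncbi_email": None}  # flag not passed on the command line
environ = {"GEOTCHA_NCBI_EMAIL": "env@example.com"}
file_config = {"ncbi_email": "file@example.com", "ncbi_api_key": "FILE_KEY"}

email = resolve_setting("ncbi_email", cli_flags, environ, file_config)
api_key = resolve_setting("ncbi_api_key", cli_flags, environ, file_config)
print(email, api_key)
```

In this sketch the environment variable overrides the config file for `ncbi_email`, while `ncbi_api_key` falls through to the file value because neither a flag nor an environment variable is set.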

Output

GEOtcha produces:

  • gse_summary.csv — One row per GSE with series-level metadata
  • gsm/<GSE_ID>_samples.csv — Per-GSE file with sample-level metadata
  • manifest.json (in run state dir) — Audit trail: run_id, query, timestamps, counts, masked settings
  • With --harmonize: additional _harmonized columns with standardized values
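
For downstream analysis, the files listed above can be read with the Python standard library. The sketch below relies only on the file layout described here (`gse_summary.csv` plus `gsm/<GSE_ID>_samples.csv`); the helper name `load_outputs` is hypothetical, and no particular column names are assumed.

```python
import csv
from pathlib import Path


def load_outputs(output_dir):
    """Read gse_summary.csv and every gsm/<GSE_ID>_samples.csv under output_dir."""
    output_dir = Path(output_dir)

    # Series-level table: one dict per GSE row.
    with open(output_dir / "gse_summary.csv", newline="", encoding="utf-8") as fh:
        summary = list(csv.DictReader(fh))

    # Sample-level tables, keyed by the GSE ID taken from the file name.
    samples = {}
    for path in sorted((output_dir / "gsm").glob("*_samples.csv")):
        gse_id = path.name[: -len("_samples.csv")]
        with open(path, newline="", encoding="utf-8") as fh:
            samples[gse_id] = list(csv.DictReader(fh))

    return summary, samples
```

After a run with `--output ./results/`, calling `load_outputs("./results")` should return the series rows and a dictionary of sample rows per GSE, provided the output directory follows the layout above.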

Fields extracted

  • GSE level: ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info
  • GSM level: ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status

Interactive Flow

$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.

Run on a subset first? [Y/n]: Y
Subset size [5]: 5

Processing 5/182 datasets...
 [████████████████] 5/5 complete

Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)

Proceed with remaining 177 datasets? [Y/n]:

Use --yes or --non-interactive to skip all prompts (useful for CI and scripted workflows).

Resume

Interrupted runs can be resumed with geotcha resume <run_id>. Resuming merges the rows already in gse_summary.csv with the newly extracted records and deduplicates them by gse_id.
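
The merge-and-deduplicate step can be sketched as a dictionary keyed by `gse_id`. This is an illustration only: the function name `merge_by_gse_id` is hypothetical, and the choice that a newly extracted row replaces an earlier row with the same ID is an assumption, since the behaviour on conflicts is not stated here.

```python
def merge_by_gse_id(previous_rows, new_rows):
    """Merge two lists of row dicts, keeping one row per gse_id.

    Assumption: when the same gse_id appears in both lists, the newly
    extracted row replaces the earlier one.
    """
    merged = {row["gse_id"]: row for row in previous_rows}
    for row in new_rows:
        merged[row["gse_id"]] = row
    return list(merged.values())


previous = [{"gse_id": "GSE1", "title": "old"}, {"gse_id": "GSE2", "title": "kept"}]
new = [{"gse_id": "GSE1", "title": "new"}, {"gse_id": "GSE3", "title": "added"}]
merged = merge_by_gse_id(previous, new)
print([row["gse_id"] for row in merged])  # ['GSE1', 'GSE2', 'GSE3']
```

Because Python dictionaries preserve insertion order, rows from the interrupted run keep their original position and new series are appended at the end.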

Disease Expansion

GEOtcha automatically expands disease keywords to capture related terms:

  • IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease (abbreviations like UC, CD are used for relevance filtering)
  • SLE → systemic lupus erythematosus, lupus
  • RA → rheumatoid arthritis
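
One way to picture the expansion is a lookup table from abbreviation to related terms. The sketch below copies only the three examples listed above; the names `EXPANSIONS`, `expand_keyword`, and `build_query`, the case-insensitive lookup, and the OR-joined query syntax are assumptions, not GEOtcha's documented behaviour.

```python
# Expansion table copied from the examples above; the tool's real table is likely larger.
EXPANSIONS = {
    "IBD": ["inflammatory bowel disease", "ulcerative colitis", "Crohn's disease"],
    "SLE": ["systemic lupus erythematosus", "lupus"],
    "RA": ["rheumatoid arthritis"],
}


def expand_keyword(keyword):
    """Return the keyword followed by its related terms, without duplicates."""
    terms = [keyword]
    for term in EXPANSIONS.get(keyword.strip().upper(), []):
        if term.lower() not in (t.lower() for t in terms):
            terms.append(term)
    return terms


def build_query(keyword):
    """Join the expanded terms into an OR query (the query syntax is an assumption)."""
    return " OR ".join(f'"{term}"' for term in expand_keyword(keyword))


print(expand_keyword("SLE"))
print(build_query("RA"))
```

Keywords without an entry in the table pass through unchanged, so an unlisted disease name is searched as typed.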

Filtering

GEOtcha automatically filters search results to human RNA-seq datasets. By default, single-cell RNA-seq datasets (scRNA-seq, snRNA-seq, 10x Genomics, Drop-seq, etc.) are excluded, because most bulk RNA-seq meta-analyses should not mix them in.

Single-cell filtering happens at two levels:

  • Search level: eSummary title/summary scanned for scRNA-seq keywords
  • Sample level: GSM library_source checked for "single cell"
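
The two checks above could look roughly like the following. This is a hedged sketch: the keyword pattern is built from the examples named in this section and is not the tool's actual list, and the function names are hypothetical.

```python
import re

# Keywords drawn from the examples in this section; the tool's actual list is an assumption.
SCRNA_PATTERN = re.compile(
    r"scrna-?seq|snrna-?seq|single[- ]cell|single[- ]nucle(?:us|i)|10x genomics|drop-?seq",
    re.IGNORECASE,
)


def looks_single_cell_series(title, summary):
    """Search level: scan the series title and summary for single-cell keywords."""
    return bool(SCRNA_PATTERN.search(f"{title} {summary}"))


def looks_single_cell_sample(library_source):
    """Sample level: check the GSM library_source value for "single cell"."""
    return "single cell" in (library_source or "").lower()


print(looks_single_cell_series("10x Genomics profiling of colon", ""))
print(looks_single_cell_sample("transcriptomic"))
```

A keyword scan of this kind can miss single-cell datasets whose title and summary do not mention the technology, which is one reason a second check at the sample level is useful.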

To include single-cell datasets, use the --include-scrna flag:

geotcha run "IBD" --include-scrna
geotcha extract GSE12345 --include-scrna

Or set it in config:

geotcha config set include_scrna true

Harmonization

Rule-based normalization for:

  • Gender: male/M/man → "male"
  • Age: "45 years", "45yo" → "45"
  • Tissue: mapped to UBERON ontology terms
  • Disease: mapped to Disease Ontology terms
  • Timepoint: "week 8", "W8" → "W8"

Optional LLM-assisted harmonization (--llm) for ambiguous free-text values.
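
The gender, age, and timepoint rules listed above can be approximated with a few small functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, the female mapping and the extra unit spellings (`yrs`, `wk`) are additions for illustration, and the ontology mappings for tissue and disease are not covered.

```python
import re


def harmonize_gender(value):
    """Map common spellings to "male" or "female"; return None if unrecognized."""
    v = value.strip().lower()
    if v in {"male", "m", "man"}:
        return "male"
    if v in {"female", "f", "woman"}:  # assumed by symmetry with the male rule
        return "female"
    return None


def harmonize_age(value):
    """Extract the number from strings such as "45 years" or "45yo"."""
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*(?:years?(?:\s+old)?|yrs?|yo|y)?\s*",
        value,
        re.IGNORECASE,
    )
    return match.group(1) if match else None


def harmonize_timepoint(value):
    """Normalize week-based timepoints such as "week 8" or "W8" to "W8"."""
    match = re.fullmatch(r"\s*(?:weeks?|wk|w)\s*(\d+)\s*", value, re.IGNORECASE)
    return f"W{match.group(1)}" if match else None


print(harmonize_gender("M"), harmonize_age("45yo"), harmonize_timepoint("week 8"))
```

Values that match none of the rules return `None` in this sketch, which marks the kind of ambiguous free text that the optional `--llm` step is meant to handle.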

License

MIT
