GEOtcha

Extract and harmonize RNA-seq metadata from NCBI GEO.

GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.

Installation

pip install geotcha

For development:

git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"

Quick Start

Search for datasets

geotcha search "inflammatory bowel disease"

Full pipeline with subset testing

geotcha run "IBD" --subset 5 --output ./results/ --harmonize

Extract from specific GSE IDs

geotcha extract GSE12345 GSE67890 --output ./results/

Include single-cell RNA-seq datasets (excluded by default)

geotcha run "IBD" --subset 5 --include-scrna

With LLM harmonization

pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic

Configuration

# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"

# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"

# View current configuration
geotcha config show

Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
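
The priority order above can be sketched in a few lines of Python. This is only an illustration of the precedence rule, not GEOtcha's actual implementation: the function name resolve_settings, the DEFAULTS values, and the mapping from GEOTCHA_NCBI_API_KEY to ncbi_api_key are all assumptions made for the example.

```python
import os
from collections import ChainMap

# Hypothetical defaults, used only for this illustration.
DEFAULTS = {"include_scrna": "false", "ncbi_api_key": "", "ncbi_email": ""}


def resolve_settings(cli_flags, config_file, environ=None):
    """Merge settings so that earlier sources win: CLI > env > file > defaults."""
    environ = os.environ if environ is None else environ
    prefix = "GEOTCHA_"
    # Assumed convention: GEOTCHA_NCBI_API_KEY maps to the key ncbi_api_key.
    env_settings = {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }
    # Drop unset CLI flags (None) so they do not shadow lower-priority sources.
    cli_settings = {k: v for k, v in cli_flags.items() if v is not None}
    # ChainMap looks keys up left to right, so the first mapping has priority.
    return ChainMap(cli_settings, env_settings, config_file, DEFAULTS)


settings = resolve_settings(
    cli_flags={"ncbi_email": None, "include_scrna": "true"},
    config_file={"ncbi_email": "you@example.com", "ncbi_api_key": "FILE_KEY"},
    environ={"GEOTCHA_NCBI_API_KEY": "ENV_KEY"},
)
print(dict(settings))
```

Here include_scrna comes from the CLI flag, ncbi_api_key from the environment (which outranks the config file), and ncbi_email from the config file because the CLI flag was left unset.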

Output

GEOtcha produces:

  • gse_summary.csv — One row per GSE with series-level metadata
  • gsm/<GSE_ID>_samples.csv — Per-GSE file with sample-level metadata
  • With --harmonize: additional _harmonized columns with standardized values
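
For downstream analysis, the output layout described above can be read back with the Python standard library. The sketch below assumes only the file names listed above and that harmonized columns carry a _harmonized suffix; the helper names load_results and harmonized_columns are hypothetical, and the actual column names in your files may differ.

```python
import csv
from pathlib import Path


def load_results(output_dir):
    """Return (series_rows, samples_by_gse) read from an output directory."""
    out = Path(output_dir)
    with open(out / "gse_summary.csv", newline="", encoding="utf-8") as handle:
        series_rows = list(csv.DictReader(handle))

    samples_by_gse = {}
    for path in sorted((out / "gsm").glob("*_samples.csv")):
        # File names follow <GSE_ID>_samples.csv, so strip the suffix.
        gse_id = path.name[: -len("_samples.csv")]
        with open(path, newline="", encoding="utf-8") as handle:
            samples_by_gse[gse_id] = list(csv.DictReader(handle))
    return series_rows, samples_by_gse


def harmonized_columns(rows):
    """List column names ending in _harmonized (empty if there are no rows)."""
    if not rows:
        return []
    return [name for name in rows[0] if name.endswith("_harmonized")]
```

Calling load_results("./results") after a run would give one dictionary per GSE plus a mapping from each GSE ID to its sample rows.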

Fields extracted

  • GSE (series level): ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info
  • GSM (sample level): ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status

Interactive Flow

$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.

Run on a subset first? [Y/n]: Y
Subset size [5]: 5

Processing 5/182 datasets...
 [████████████████] 5/5 complete

Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)

Proceed with remaining 177 datasets? [Y/n]:

Disease Expansion

GEOtcha automatically expands disease keywords to capture related terms:

  • IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease (abbreviations such as UC and CD are used for relevance filtering)
  • SLE → systemic lupus erythematosus, lupus
  • RA → rheumatoid arthritis
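
Keyword expansion of this kind can be pictured as a simple lookup table. The sketch below uses only the three mappings listed above; the names DISEASE_SYNONYMS and expand_disease are hypothetical, and GEOtcha's internal table may be larger or structured differently.

```python
# Mappings copied from the list above; keys are lower-cased abbreviations.
DISEASE_SYNONYMS = {
    "ibd": ["inflammatory bowel disease", "ulcerative colitis", "Crohn's disease"],
    "sle": ["systemic lupus erythematosus", "lupus"],
    "ra": ["rheumatoid arthritis"],
}


def expand_disease(keyword):
    """Return the keyword followed by any known related terms, without duplicates."""
    terms = [keyword]
    seen = {keyword.strip().lower()}
    for synonym in DISEASE_SYNONYMS.get(keyword.strip().lower(), []):
        if synonym.lower() not in seen:
            seen.add(synonym.lower())
            terms.append(synonym)
    return terms


print(expand_disease("IBD"))
```

A keyword with no entry in the table is returned unchanged as a single-item list, so unknown diseases are still searched literally.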

Filtering

GEOtcha automatically filters search results to human RNA-seq datasets. By default, single-cell RNA-seq datasets (scRNA-seq, snRNA-seq, 10x Genomics, Drop-seq, etc.) are excluded, since most bulk RNA-seq meta-analyses should not mix them in.

Single-cell filtering happens at two levels:

  • Search level: eSummary title/summary scanned for scRNA-seq keywords
  • Sample level: GSM library_source checked for "single cell"
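
The two-level check above might look roughly like the following. This is a sketch under stated assumptions: the keyword pattern covers only the terms named in this section, and the record layout (a dict with title, summary, and library_sources keys) and the function names are hypothetical, not GEOtcha's internal data model.

```python
import re

# Keywords taken from this section; the real tool may scan for more.
SC_PATTERN = re.compile(
    r"scrna-?seq|snrna-?seq|single[- ]cell|10x genomics|drop-?seq",
    re.IGNORECASE,
)


def looks_single_cell(title, summary):
    """Search level: scan the series title and summary for scRNA-seq keywords."""
    return bool(SC_PATTERN.search(f"{title} {summary}"))


def sample_is_single_cell(library_source):
    """Sample level: check a GSM library_source value for "single cell"."""
    return "single cell" in library_source.lower()


def keep_series(record, include_scrna=False):
    """Return True if the series should be kept under the current settings."""
    if include_scrna:
        return True
    if looks_single_cell(record["title"], record["summary"]):
        return False
    return not any(sample_is_single_cell(s) for s in record["library_sources"])
```

The sample-level check catches series whose title and summary never mention single-cell methods but whose samples are annotated as such.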

To include single-cell datasets, use the --include-scrna flag:

geotcha run "IBD" --include-scrna
geotcha extract GSE12345 --include-scrna

Or set it in config:

geotcha config set include_scrna true

Harmonization

Rule-based normalization for:

  • Gender: male/M/man → "male"
  • Age: "45 years", "45yo" → "45"
  • Tissue: mapped to UBERON ontology terms
  • Disease: mapped to Disease Ontology terms
  • Timepoint: "week 8", "W8" → "W8"
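
The gender, age, and timepoint rules above lend themselves to small lookup and regular-expression helpers. The sketch below reproduces only the examples listed; the function names are hypothetical, the female mappings are an assumption by analogy with the male ones, and values that match no rule are returned unchanged. Tissue and disease ontology mapping is not shown.

```python
import re

# male/M/man come from the list above; the female entries are assumed by analogy.
GENDER_MAP = {
    "male": "male", "m": "male", "man": "male",
    "female": "female", "f": "female", "woman": "female",
}

AGE_PATTERN = re.compile(
    r"\s*(\d+(?:\.\d+)?)\s*(?:years?(?:\s+old)?|yrs?|yo|y)?\s*$", re.IGNORECASE
)
TIMEPOINT_PATTERN = re.compile(r"\s*(?:week|w)\s*(\d+)\s*$", re.IGNORECASE)


def normalize_gender(value):
    """Map known gender spellings to a standard label."""
    return GENDER_MAP.get(value.strip().lower(), value)


def normalize_age(value):
    """Reduce ages given in years ("45 years", "45yo") to the bare number."""
    match = AGE_PATTERN.match(value)
    return match.group(1) if match else value


def normalize_timepoint(value):
    """Rewrite week-based timepoints ("week 8", "W8") as "W8"."""
    match = TIMEPOINT_PATTERN.match(value)
    return f"W{match.group(1)}" if match else value


print(normalize_gender("M"), normalize_age("45yo"), normalize_timepoint("week 8"))
```

Returning unmatched values untouched keeps the original text available for a later pass, such as the optional LLM step.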

Optional LLM-assisted harmonization (--llm) for ambiguous free-text values.

License

MIT
