GEOtcha

Extract and harmonize RNA-seq metadata from NCBI GEO.

GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.

Installation

pip install geotcha

For development:

git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"

Quick Start

Search for datasets

geotcha search "inflammatory bowel disease"

Full pipeline with subset testing

geotcha run "IBD" --subset 5 --output ./results/ --harmonize

Extract from specific GSE IDs

geotcha extract GSE12345 GSE67890 --output ./results/

With LLM harmonization

pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic

Configuration

# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"

# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"

# View current configuration
geotcha config show

Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
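The priority order above can be pictured as a simple first-match lookup. The following is a minimal sketch, not GEOtcha's actual implementation: the function name `resolve_setting`, the `DEFAULTS` table, and the exact environment variable spelling (`GEOTCHA_` plus the upper-cased setting name) are assumptions made for illustration.

```python
import os

# Hypothetical defaults; GEOtcha's real default values are not documented here.
DEFAULTS = {"ncbi_api_key": None, "ncbi_email": None}


def resolve_setting(name, cli_flags, config_file, environ=os.environ, defaults=DEFAULTS):
    """Return the first value found, checking sources from highest to lowest priority."""
    # 1. CLI flags (a missing key or None means the flag was not passed).
    if cli_flags.get(name) is not None:
        return cli_flags[name]
    # 2. Environment variables; the GEOTCHA_<NAME> spelling is an assumption.
    env_key = "GEOTCHA_" + name.upper()
    if env_key in environ:
        return environ[env_key]
    # 3. Values loaded from the config file (passed in as a plain dict).
    if name in config_file:
        return config_file[name]
    # 4. Built-in defaults.
    return defaults.get(name)
```

Passing the sources in as plain dictionaries keeps the lookup easy to test without touching the real environment or config file.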

Output

GEOtcha produces:

  • gse_summary.csv — One row per GSE with series-level metadata
  • gsm/<GSE_ID>_samples.csv — Per-GSE file with sample-level metadata
  • With --harmonize: additional _harmonized columns with standardized values
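For downstream analysis, the output layout above can be read back with the Python standard library. This is a sketch under stated assumptions: `load_results` and `harmonized_columns` are hypothetical helpers (not part of GEOtcha), and the sketch assumes harmonized columns can be recognized by a `_harmonized` suffix, as the list above suggests. No specific column names are assumed.

```python
import csv
from pathlib import Path


def load_results(output_dir):
    """Read gse_summary.csv and every gsm/<GSE_ID>_samples.csv under output_dir."""
    output_dir = Path(output_dir)
    with open(output_dir / "gse_summary.csv", newline="", encoding="utf-8") as handle:
        series_rows = list(csv.DictReader(handle))

    samples_by_gse = {}
    for path in sorted((output_dir / "gsm").glob("*_samples.csv")):
        # Recover the GSE ID by stripping the fixed file-name suffix.
        gse_id = path.name[: -len("_samples.csv")]
        with open(path, newline="", encoding="utf-8") as handle:
            samples_by_gse[gse_id] = list(csv.DictReader(handle))
    return series_rows, samples_by_gse


def harmonized_columns(rows):
    """List column names ending in _harmonized (expected only with --harmonize)."""
    if not rows:
        return []
    return [name for name in rows[0] if name.endswith("_harmonized")]
```

A call such as `load_results("./results")` would return the series rows plus a dictionary of sample rows keyed by GSE ID.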

Fields extracted

  • GSE (series level): ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info
  • GSM (sample level): ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status

Interactive Flow

$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.

Run on a subset first? [Y/n]: Y
Subset size [5]: 5

Processing 5/182 datasets...
 [████████████████] 5/5 complete

Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)

Proceed with remaining 177 datasets? [Y/n]:
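The filtering step in the session above (Homo sapiens + RNA-seq) could in principle be expressed as an NCBI GEO DataSets search string. The sketch below is illustrative only: whether GEOtcha filters inside the query or after retrieval is not stated here, and `build_geo_query` is a hypothetical helper. The bracketed field tags follow NCBI's GEO DataSets query syntax.

```python
def build_geo_query(terms, organism="Homo sapiens",
                    dataset_type="expression profiling by high throughput sequencing"):
    """Combine disease terms with organism, dataset-type, and series filters."""
    # Any one of the disease terms may match.
    disease_clause = " OR ".join(f'"{term}"[All Fields]' for term in terms)
    return (
        f"({disease_clause})"
        f' AND "{organism}"[Organism]'
        f' AND "{dataset_type}"[DataSet Type]'
        # Restrict results to series (GSE) records.
        " AND gse[Entry Type]"
    )
```

The resulting string can be pasted into the GEO DataSets search box to check what a given set of terms returns.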

Disease Expansion

GEOtcha automatically expands disease keywords to capture related terms:

  • IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease (abbreviations such as UC and CD are used for relevance filtering)
  • SLE → systemic lupus erythematosus, lupus
  • RA → rheumatoid arthritis
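The expansion examples above amount to a lookup from an abbreviation to related terms. A minimal sketch follows, assuming a plain dictionary: the `EXPANSIONS` table contains only the three examples listed here and `expand_keyword` is a hypothetical name, so neither reflects GEOtcha's actual data or API.

```python
# Illustrative table built from the three examples above; not GEOtcha's actual data.
EXPANSIONS = {
    "IBD": ["inflammatory bowel disease", "ulcerative colitis", "Crohn's disease"],
    "SLE": ["systemic lupus erythematosus", "lupus"],
    "RA": ["rheumatoid arthritis"],
}


def expand_keyword(keyword):
    """Return the keyword followed by any related terms, without duplicates."""
    terms = [keyword]
    seen = {keyword.lower()}
    # Lookup is case-insensitive so "ibd" and "IBD" expand the same way.
    for term in EXPANSIONS.get(keyword.strip().upper(), []):
        if term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return terms
```

Keywords without an entry in the table are returned unchanged as a one-item list.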

Harmonization

Rule-based normalization for:

  • Gender: male/M/man → "male"
  • Age: "45 years", "45yo" → "45"
  • Tissue: mapped to UBERON ontology terms
  • Disease: mapped to Disease Ontology terms
  • Timepoint: "week 8", "W8" → "W8"
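Three of the rules listed above (gender, age, timepoint) can be sketched with a lookup table and regular expressions. This is an illustration under stated assumptions, not GEOtcha's implementation: the function names and the `GENDER_SYNONYMS` table are hypothetical, only the example inputs shown above are covered, and unrecognized values return `None`. The ontology mappings for tissue and disease are not sketched.

```python
import re

# Hypothetical synonym table; only the "male" examples come from the list above.
GENDER_SYNONYMS = {"male": "male", "m": "male", "man": "male"}


def harmonize_gender(value):
    """Map known synonyms to a canonical label; return None when unrecognized."""
    return GENDER_SYNONYMS.get(value.strip().lower())


def harmonize_age(value):
    """Extract the leading number from strings such as '45 years' or '45yo'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", value)
    return match.group(1) if match else None


def harmonize_timepoint(value):
    """Normalize 'week 8' or 'W8' to 'W8'; return None for other formats."""
    match = re.fullmatch(r"\s*(?:week|w)\s*(\d+)\s*", value, flags=re.IGNORECASE)
    return f"W{match.group(1)}" if match else None
```

Returning `None` for unmatched input makes it easy to see which values a rule did not cover.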

Optional LLM-assisted harmonization (--llm) for ambiguous free-text values.

License

MIT
