Extract and harmonize RNA-seq metadata from NCBI GEO

GEOtcha

GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.

Installation

pip install geotcha

For development:

git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"

Quick Start

Search for datasets

geotcha search "inflammatory bowel disease"

Full pipeline with subset testing

geotcha run "IBD" --subset 5 --output ./results/ --harmonize

Extract from specific GSE IDs

geotcha extract GSE12345 GSE67890 --output ./results/

Include single-cell RNA-seq datasets (excluded by default)

geotcha run "IBD" --subset 5 --include-scrna

With LLM harmonization

pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic

CI / non-interactive mode

geotcha run "IBD" --non-interactive --output ./results/
geotcha run "IBD" --yes --subset 10 --harmonize

Configuration

# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"

# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"

# View current configuration
geotcha config show

Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
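
The priority order above can be pictured with a small Python sketch. This is not GEOtcha's actual code: the function name `resolve_setting`, the `DEFAULTS` table, and the exact mapping from a key such as `ncbi_email` to an environment variable named `GEOTCHA_NCBI_EMAIL` are assumptions made for illustration.

```python
import os

# Hypothetical defaults; the real keys and default values may differ.
DEFAULTS = {"ncbi_api_key": None, "ncbi_email": None, "include_scrna": False}


def resolve_setting(key, cli_flags, file_config, environ=None):
    """Return the value for `key` from the highest-priority source that sets it."""
    environ = os.environ if environ is None else environ

    # 1. A CLI flag wins over everything else.
    if cli_flags.get(key) is not None:
        return cli_flags[key]

    # 2. Environment variable (assumed naming: GEOTCHA_ + upper-cased key).
    env_name = "GEOTCHA_" + key.upper()
    if env_name in environ:
        return environ[env_name]

    # 3. Value loaded from the config file.
    if key in file_config:
        return file_config[key]

    # 4. Built-in default.
    return DEFAULTS.get(key)
```

With a config file that sets `ncbi_email` and an environment that sets `GEOTCHA_NCBI_EMAIL`, this sketch returns the environment value unless the same key is also passed as a CLI flag. Environment values arrive as strings, so a real implementation would also need type coercion for settings such as `include_scrna`.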

Output

GEOtcha produces:

gse_summary.csv — One row per GSE with series-level metadata
gsm/<GSE_ID>_samples.csv — Per-GSE file with sample-level metadata
manifest.json (in run state dir) — Audit trail: run_id, query, timestamps, counts, masked settings
With --harmonize: additional _harmonized columns with standardized values
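
For downstream analysis, the output files listed above could be loaded with the standard library along these lines. This is a sketch that assumes only the file layout described here (`gse_summary.csv` plus `gsm/<GSE_ID>_samples.csv`); the helper name `load_results` is hypothetical, and no particular column names are assumed.

```python
import csv
from pathlib import Path


def load_results(output_dir):
    """Load the series summary and the per-GSE sample tables from an output directory."""
    output_dir = Path(output_dir)

    # One row per GSE.
    with open(output_dir / "gse_summary.csv", newline="", encoding="utf-8") as fh:
        series = list(csv.DictReader(fh))

    # One file per GSE; the GSE ID is recovered from the file name.
    samples = {}
    suffix = "_samples.csv"
    for path in sorted((output_dir / "gsm").glob("*" + suffix)):
        gse_id = path.name[: -len(suffix)]
        with open(path, newline="", encoding="utf-8") as fh:
            samples[gse_id] = list(csv.DictReader(fh))

    return series, samples
```

Calling `load_results("./results/")` after a run would return a list of series rows and a dictionary mapping each GSE ID to its sample rows, each row being a plain `dict` keyed by column header.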

Fields extracted

Level	Fields
GSE	ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info
GSM	ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status

Interactive Flow

$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.

Run on a subset first? [Y/n]: Y
Subset size [5]: 5

Processing 5/182 datasets...
 [████████████████] 5/5 complete

Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)

Proceed with remaining 177 datasets? [Y/n]:

Use --yes or --non-interactive to skip all prompts (useful for CI and scripted workflows).
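
In a scripted workflow, the prompt-free invocation can be wrapped from Python as sketched below. The helper names `build_run_command` and `run_geotcha` are hypothetical. Only flags shown in this README are used, and combining `--non-interactive` with `--subset` in one call is an assumption.

```python
import subprocess


def build_run_command(query, output_dir, subset=None, harmonize=False):
    """Assemble a `geotcha run` command line that never prompts."""
    cmd = ["geotcha", "run", query, "--non-interactive", "--output", str(output_dir)]
    if subset is not None:
        cmd += ["--subset", str(subset)]
    if harmonize:
        cmd.append("--harmonize")
    return cmd


def run_geotcha(query, output_dir, **options):
    """Run geotcha; check=True raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_run_command(query, output_dir, **options), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with multi-word queries such as "inflammatory bowel disease".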

Resume

Interrupted runs can be resumed with geotcha resume <run_id>. Resuming merges the rows already written to gse_summary.csv with the newly extracted records and deduplicates them by gse_id.
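
The merge step can be pictured as follows. This is a minimal sketch, not GEOtcha's implementation: the function name `merge_by_gse_id` is hypothetical, and the rule that a newly extracted row replaces an older row with the same `gse_id` is an assumption, since the README does not say which copy wins.

```python
def merge_by_gse_id(previous_rows, new_rows):
    """Merge two lists of row dicts, keeping a single row per gse_id."""
    # Dicts preserve insertion order, so rows from the earlier run keep their position.
    merged = {row["gse_id"]: row for row in previous_rows}
    for row in new_rows:
        # Assumption: the newer record replaces the older one on a duplicate ID.
        merged[row["gse_id"]] = row
    return list(merged.values())
```

Keying on `gse_id` makes the merge idempotent: resuming the same run twice with the same records yields the same set of rows.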

Disease Expansion

GEOtcha automatically expands disease keywords to capture related terms:

IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease (abbreviations like UC, CD are used for relevance filtering)
SLE → systemic lupus erythematosus, lupus
RA → rheumatoid arthritis
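
One way to think about the expansion is a lookup table keyed by abbreviation. The sketch below is illustrative only: the `EXPANSIONS` table contains just the three examples listed above, and `expand_query` is a hypothetical name rather than a function of the package.

```python
# Hypothetical lookup table built from the three examples in this section.
EXPANSIONS = {
    "IBD": ["inflammatory bowel disease", "ulcerative colitis", "Crohn's disease"],
    "SLE": ["systemic lupus erythematosus", "lupus"],
    "RA": ["rheumatoid arthritis"],
}


def expand_query(keyword):
    """Return the original keyword followed by any related search terms."""
    terms = [keyword]
    seen = {keyword.lower()}
    for term in EXPANSIONS.get(keyword.strip().upper(), []):
        if term.lower() not in seen:  # avoid repeating a term already present
            terms.append(term)
            seen.add(term.lower())
    return terms
```

Under these assumptions, `expand_query("IBD")` returns the keyword plus its three related terms, while an unknown keyword such as "asthma" is returned unchanged as a one-element list.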

Filtering

GEOtcha automatically filters search results to human RNA-seq datasets. By default, single-cell RNA-seq datasets (scRNA-seq, snRNA-seq, 10x Genomics, Drop-seq, etc.) are excluded, because most bulk RNA-seq meta-analyses should not mix them in.

Single-cell filtering happens at two levels:

Search level: eSummary title/summary scanned for scRNA-seq keywords
Sample level: GSM library_source checked for "single cell"
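
The two levels can be sketched as two small predicates. The keyword pattern below is an assumption based on the technologies named in this section, not the list GEOtcha actually uses, and all function names are hypothetical.

```python
import re

# Hypothetical keyword pattern; the real keyword list may be longer or different.
SC_PATTERN = re.compile(
    r"single[\s-]?cell|scrna|snrna|10x genomics|drop-?seq",
    re.IGNORECASE,
)


def looks_single_cell_series(title, summary):
    """Search level: scan the series title and summary for single-cell keywords."""
    return bool(SC_PATTERN.search(f"{title} {summary}"))


def is_single_cell_sample(library_source):
    """Sample level: check the GSM library_source value for "single cell"."""
    return "single cell" in library_source.lower()


def keep_series(title, summary, include_scrna=False):
    """Keep everything when include_scrna is set, otherwise drop single-cell hits."""
    return include_scrna or not looks_single_cell_series(title, summary)
```

Keyword matching on free text can produce false positives, for example a bulk study whose summary merely mentions a single-cell reference atlas, which is one reason a second check at the sample level is useful.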

To include single-cell datasets, use the --include-scrna flag:

geotcha run "IBD" --include-scrna
geotcha extract GSE12345 --include-scrna

Or set it in config:

geotcha config set include_scrna true

Harmonization

Rule-based normalization for:

Gender: male/M/man → "male"
Age: "45 years", "45yo" → "45"
Tissue: mapped to UBERON ontology terms
Disease: mapped to Disease Ontology terms
Timepoint: "week 8", "W8" → "W8"

Optional LLM-assisted harmonization (--llm) for ambiguous free-text values.
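
A few of the rule-based normalizations can be illustrated with regular expressions. This is a sketch under stated assumptions rather than GEOtcha's rule set: the function names are hypothetical, the female mappings are assumed by symmetry with the male example, and returning `None` for unrecognized values is a design choice of this example. The ontology mappings for tissue and disease are not covered.

```python
import re

# The male mappings come from the list above; the female ones are assumed by symmetry.
GENDER_MAP = {
    "male": "male", "m": "male", "man": "male",
    "female": "female", "f": "female", "woman": "female",
}


def harmonize_gender(value):
    """Map common spellings to a canonical label, or None if unrecognized."""
    return GENDER_MAP.get(value.strip().lower())


def harmonize_age(value):
    """Strip a trailing unit such as 'years' or 'yo' and return the number as text."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*(?:years?|yrs?|yo|y)?\s*$", value, re.IGNORECASE)
    return match.group(1) if match else None


def harmonize_timepoint(value):
    """Normalize week-based timepoints such as 'week 8' or 'W8' to 'W8'."""
    match = re.match(r"\s*(?:week|wk|w)\s*(\d+)\s*$", value, re.IGNORECASE)
    return f"W{match.group(1)}" if match else None
```

Values that fall through to `None` here are the kind of ambiguous free text that the optional `--llm` step is meant to handle.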

License

MIT

Release history

0.9.0 (Mar 7, 2026)
0.8.0 (Mar 7, 2026)
0.6.1 (Mar 5, 2026)
0.6.0 (Mar 5, 2026), this version
0.1.1 (Feb 22, 2026)
0.1.0 (Feb 22, 2026)

