Extract and harmonize RNA-seq metadata from NCBI GEO

Maintainers

shantanubafna

GEOtcha

GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.

Installation

pip install geotcha

With optional extras:

pip install "geotcha[ml]"       # ML harmonization (GLiNER + SapBERT)
pip install "geotcha[parquet]"  # Parquet export support
pip install "geotcha[llm]"      # LLM harmonization

For development:

git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"

Quick Start

Search for datasets

geotcha search "inflammatory bowel disease"

Full pipeline with subset testing

geotcha run "IBD" --subset 5 --output ./results/ --harmonize

Extract from specific GSE IDs

geotcha extract GSE12345 GSE67890 --output ./results/

Output format selection

# CSV (default), TSV, or Parquet
geotcha run "IBD" --harmonize --format csv
geotcha extract GSE12345 -f tsv
geotcha run "IBD" -f parquet --output ./results/

Parquet output requires the parquet extra: pip install "geotcha[parquet]".

Include single-cell RNA-seq datasets (excluded by default)

geotcha run "IBD" --subset 5 --include-scrna

With ML harmonization (zero-shot NER)

pip install "geotcha[ml]"
geotcha run "IBD" --harmonize --ml-mode hybrid

In hybrid mode, ML fills in only the fields that are missing or low-confidence after the rules pass, using GLiNER biomedical NER. Use --ml-mode full to run ML on all fields.

With LLM harmonization

pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic

Combined: rules + ML + LLM

pip install "geotcha[ml,llm]"
geotcha run "IBD" --harmonize --ml-mode hybrid --llm

The harmonization chain runs in order: rules → ML → LLM. Each layer only upgrades fields that are still missing or low-confidence.

Structured JSON logging

geotcha run "IBD" --log-json --output ./results/
geotcha extract GSE12345 --log-json

Emits structured JSON log lines to stderr — useful for log aggregation in production pipelines.
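
Because each log line is a JSON object, a downstream script can parse captured stderr line by line. The sketch below makes no claim about GEOtcha's actual log schema: the `level` and `message` keys in the sample are hypothetical placeholders, and `parse_json_log_lines` is an illustrative helper, not part of the package.

```python
import json


def parse_json_log_lines(lines):
    """Yield one dict per line that parses as a JSON object; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray plain-text warning mixed into stderr
        if isinstance(record, dict):
            yield record


# Hypothetical sample lines; real GEOtcha field names may differ.
sample = [
    '{"level": "info", "message": "search started"}',
    "not json",
    "",
    '{"level": "error", "message": "parse failed"}',
]
records = list(parse_json_log_lines(sample))
errors = [r for r in records if r.get("level") == "error"]
```

In practice the lines would come from a file captured with a shell redirect such as `2> run.log`.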

Run report

# After a pipeline run completes, view a summary:
geotcha report <run_id>

# Write report.json to a custom directory:
geotcha report <run_id> --output ./reports/

Prints run metadata (query, ID counts, failures, stage timings) and writes a report.json file.

CI / non-interactive mode

geotcha run "IBD" --non-interactive --output ./results/
geotcha run "IBD" --yes --subset 10 --harmonize

Python SDK

from geotcha import GEOtchaClient

client = GEOtchaClient(ncbi_api_key="...")
ids = client.search("ulcerative colitis")
records = client.extract(ids[:5])
records = client.harmonize(records, ml_mode="hybrid")
client.export(records, output_dir="./results", fmt="parquet")

The SDK does not depend on Typer or Rich, so it is safe for notebooks, scripts, and downstream pipelines. Failed GSE parses are silently skipped, so the returned records may cover fewer series than the IDs requested.

Configuration

# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"

# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"

# View current configuration
geotcha config show

# Validate configuration
geotcha config validate

Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
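
The priority order above can be pictured as a first-match lookup across four sources. This is a minimal sketch of that idea, not GEOtcha's own resolver: `resolve_setting` is a hypothetical helper, and the mapping from a setting name to `GEOTCHA_<NAME>` in upper case is an assumption based on the `GEOTCHA_*` pattern.

```python
def resolve_setting(name, cli_flags, environ, file_config, defaults,
                    env_prefix="GEOTCHA_"):
    """Return the first value found, checking sources from highest to lowest priority."""
    # 1. CLI flags (None means the flag was not passed)
    if cli_flags.get(name) is not None:
        return cli_flags[name]
    # 2. Environment variables; the upper-cased name is an assumed convention
    env_key = env_prefix + name.upper()
    if env_key in environ:
        return environ[env_key]
    # 3. Values loaded from the config file
    if name in file_config:
        return file_config[name]
    # 4. Built-in defaults
    return defaults.get(name)


env = {"GEOTCHA_NCBI_EMAIL": "env@example.com"}
file_cfg = {"ncbi_email": "file@example.com", "include_scrna": "true"}
defaults = {"include_scrna": "false", "output": "./output"}

email = resolve_setting("ncbi_email", {}, env, file_cfg, defaults)
scrna = resolve_setting("include_scrna", {}, env, file_cfg, defaults)
output = resolve_setting("output", {"output": "./results"}, env, file_cfg, defaults)
```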

Output

GEOtcha produces:

gse_summary.csv — One row per GSE with series-level metadata (or .tsv / .parquet with --format)
gsm/<GSE_ID>_samples.csv — Per-GSE file with sample-level metadata
manifest.json (in run state dir) — Audit trail: run_id, query, timestamps, stage timings, counts, masked settings
review_queue.csv — Low-confidence harmonized fields flagged for manual review (always CSV)
With --harmonize: additional _harmonized, _source, _confidence, and _ontology_id columns

Fields extracted

Level	Fields
GSE	ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info
GSM	ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status

Interactive Flow

$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.

Run on a subset first? [Y/n]: Y
Subset size [5]: 5

Processing 5/182 datasets...
 [████████████████] 5/5 complete

Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)

Proceed with remaining 177 datasets? [Y/n]:

Use --yes or --non-interactive to skip all prompts (useful for CI and scripted workflows).

Resume

Interrupted runs can be resumed with geotcha resume <run_id>. Resume merges the rows already written to gse_summary.csv with the newly extracted records and deduplicates them by gse_id.
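
As an illustration of deduplicating by gse_id, the sketch below merges two lists of row dicts. It is not the package's resume code: `merge_by_gse_id` is a hypothetical helper, and letting the newer row win on a collision is an assumption.

```python
def merge_by_gse_id(previous_rows, new_rows):
    """Merge two lists of row dicts, keeping one row per gse_id."""
    merged = {}
    for row in previous_rows:
        merged[row["gse_id"]] = row
    for row in new_rows:
        # Assumption: a freshly extracted row replaces the earlier one.
        merged[row["gse_id"]] = row
    # Dicts keep insertion order, so rows stay in first-seen order.
    return list(merged.values())


previous = [
    {"gse_id": "GSE1", "title": "old"},
    {"gse_id": "GSE2", "title": "b"},
]
new = [
    {"gse_id": "GSE1", "title": "new"},
    {"gse_id": "GSE3", "title": "c"},
]
merged_rows = merge_by_gse_id(previous, new)
```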

Disease Expansion

GEOtcha automatically expands disease keywords to capture related terms:

IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease (abbreviations like UC, CD are used for relevance filtering)
SLE → systemic lupus erythematosus, lupus
RA → rheumatoid arthritis
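
One way to picture the expansion is a lookup table keyed by abbreviation. The sketch below uses only the three mappings listed above; the table name, the `expand_query` helper, and the choice to keep the original keyword as the first search term are assumptions, not GEOtcha's internals.

```python
# Mappings copied from the list above; the real table may be larger.
DISEASE_EXPANSIONS = {
    "ibd": ["inflammatory bowel disease", "ulcerative colitis", "Crohn's disease"],
    "sle": ["systemic lupus erythematosus", "lupus"],
    "ra": ["rheumatoid arthritis"],
}


def expand_query(keyword):
    """Return the keyword followed by any known related terms."""
    cleaned = keyword.strip()
    return [cleaned] + DISEASE_EXPANSIONS.get(cleaned.lower(), [])


ibd_terms = expand_query("IBD")
unknown_terms = expand_query("asthma")
```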

Filtering

GEOtcha automatically filters search results to human RNA-seq datasets. By default, single-cell RNA-seq datasets are excluded (scRNA-seq, snRNA-seq, 10x Genomics, Drop-seq, etc.) since most bulk RNA-seq meta-analyses don't want these mixed in.

Single-cell filtering happens at two levels:

Search level: eSummary title/summary scanned for scRNA-seq keywords
Sample level: GSM library_source checked for "single cell"
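
The two checks can be sketched as simple case-insensitive substring tests. The keyword tuple below is assembled from the terms named in this section and is not the package's actual list; all three function names are hypothetical.

```python
# Keywords taken from the description above; the real list may differ.
SCRNA_KEYWORDS = (
    "scrna-seq",
    "snrna-seq",
    "10x genomics",
    "drop-seq",
    "single cell",
    "single-cell",
)


def looks_single_cell(text):
    """True if any single-cell keyword occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SCRNA_KEYWORDS)


def series_is_single_cell(title, summary):
    """Search-level check on the series title and summary."""
    return looks_single_cell(f"{title} {summary}")


def sample_is_single_cell(library_source):
    """Sample-level check on the GSM library_source value."""
    return "single cell" in library_source.lower()
```

A series would then be dropped when either check fires, unless --include-scrna is set.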

To include single-cell datasets, use the --include-scrna flag:

geotcha run "IBD" --include-scrna
geotcha extract GSE12345 --include-scrna

Or set it in config:

geotcha config set include_scrna true

Harmonization

Harmonization runs as a three-tier pipeline. Rules always run when --harmonize is set; the ML and LLM tiers are optional:

1. Rules (always on with `--harmonize`)

Gender: male/M/man → "male"
Age: "45 years", "45yo" → "45"
Tissue: mapped to UBERON ontology terms
Disease: mapped to Disease Ontology terms
Timepoint: "week 8", "W8" → "W8"
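
The gender, age, and timepoint rules above lend themselves to small lookup tables and regular expressions. The following is a sketch of that approach under stated assumptions, not GEOtcha's rule set: the function names and patterns are illustrative, and the female entries are an assumed mirror of the male examples.

```python
import re

# "male" entries come from the example above; "female" entries are assumed.
GENDER_MAP = {
    "male": "male", "m": "male", "man": "male",
    "female": "female", "f": "female", "woman": "female",
}

AGE_RE = re.compile(r"^\s*(\d+)\s*(?:years?|yrs?|yo|y)?\s*$", re.IGNORECASE)
TIMEPOINT_RE = re.compile(r"^\s*(?:week|wk|w)\s*(\d+)\s*$", re.IGNORECASE)


def normalize_gender(value):
    """Map a free-text gender to a canonical label, or None if unknown."""
    return GENDER_MAP.get(value.strip().lower())


def normalize_age(value):
    """Extract the number of years as a string, or None if not recognised."""
    match = AGE_RE.match(value)
    return match.group(1) if match else None


def normalize_timepoint(value):
    """Normalise week-based timepoints to the form W<number>."""
    match = TIMEPOINT_RE.match(value)
    return f"W{match.group(1)}" if match else None
```

Values that no rule recognises return None, which is the case a later ML or LLM tier would pick up.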

2. ML (`--ml-mode hybrid` or `--ml-mode full`)

GLiNER-BioMed: zero-shot biomedical NER for disease, tissue, cell type, treatment, gender
SapBERT: entity linking to UBERON/DOID ontology terms (currently a scaffold; index building is deferred)
In hybrid mode, ML only fills fields where rules produced low confidence or no value
Low-confidence ML predictions flag records with needs_review=True

3. LLM (`--llm`)

Optional LLM-assisted harmonization for ambiguous free-text values
Supports OpenAI, Anthropic, and Ollama providers

Each field tracks provenance: _harmonized, _source (rule/ml/llm), _confidence, and _ontology_id.
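
To make the layering and provenance concrete, here is a minimal sketch of a chain in which a later tier only replaces a missing or low-confidence result. It is an illustration, not the package's implementation: `FieldResult`, `run_chain`, the 0.8 threshold, and the toy layers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldResult:
    harmonized: Optional[str]   # corresponds to the _harmonized column
    source: Optional[str]       # "rule", "ml", or "llm"
    confidence: float
    ontology_id: Optional[str] = None


def run_chain(raw_value, layers, threshold=0.8):
    """Apply layers in order; stop once a result meets the confidence threshold."""
    best = FieldResult(None, None, 0.0)
    for source, layer in layers:
        if best.harmonized is not None and best.confidence >= threshold:
            break  # already confident, later tiers are not consulted
        candidate = layer(raw_value)
        if candidate is None:
            continue  # this tier had nothing to offer
        value, confidence, ontology_id = candidate
        if best.harmonized is None or confidence > best.confidence:
            best = FieldResult(value, source, confidence, ontology_id)
    return best


def toy_rules(value):
    # Toy rule tier: only recognises the literal "M".
    return ("male", 1.0, None) if value == "M" else None


def toy_ml(value):
    # Toy ML tier: always answers with a fixed label and confidence.
    return ("ulcerative colitis", 0.9, None)


layers = [("rule", toy_rules), ("ml", toy_ml)]
from_rules = run_chain("M", layers)
from_ml = run_chain("UC colon biopsy", layers)
```

Recording the winning tier in `source` is what lets a reviewer later tell which values came from rules and which from a model.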

License

MIT

Release history

Version	Released
0.9.0	Mar 7, 2026
0.8.0	Mar 7, 2026
0.6.1 (this version)	Mar 5, 2026
0.6.0	Mar 5, 2026
0.1.1	Feb 22, 2026
0.1.0	Feb 22, 2026

Download files

Source Distribution

geotcha-0.6.1.tar.gz (63.7 kB), uploaded Mar 5, 2026 (Source)

Built Distribution

geotcha-0.6.1-py3-none-any.whl (50.5 kB), uploaded Mar 5, 2026 (Python 3)

File details

Details for the file geotcha-0.6.1.tar.gz.

File metadata

Download URL: geotcha-0.6.1.tar.gz
Upload date: Mar 5, 2026
Size: 63.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for geotcha-0.6.1.tar.gz
Algorithm	Hash digest
SHA256	`5f51954a6d616c1e8dc5d767cbd2d119d1322231202ab024e1c94b2547a4b5c9`
MD5	`0eaba80efb2becbd205bd439daed238e`
BLAKE2b-256	`30bd1fab112d85c193d17a31af1c98030307c19aed5b7fe4f428e6c066b2cee8`
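
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a small sketch; `sha256_of_file` and `matches_expected` are illustrative helper names, and the local file path passed to them is whatever the download was saved as.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("geotcha-0.6.1.tar.gz", "<SHA256 from the table>")` should return True for an intact download.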

Provenance

The following attestation bundles were made for geotcha-0.6.1.tar.gz:

Publisher: release.yml on shantanubafna/GEOtcha

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: geotcha-0.6.1.tar.gz
- Subject digest: 5f51954a6d616c1e8dc5d767cbd2d119d1322231202ab024e1c94b2547a4b5c9
- Sigstore transparency entry: 1038404907
- Sigstore integration time: Mar 5, 2026
Source repository:
- Permalink: shantanubafna/GEOtcha@284a0e18a49898f41ab4ff7dfa2ecb2834f03d24
- Branch / Tag: refs/tags/v0.6.1
- Owner: https://github.com/shantanubafna
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@284a0e18a49898f41ab4ff7dfa2ecb2834f03d24
- Trigger Event: push

File details

Details for the file geotcha-0.6.1-py3-none-any.whl.

File metadata

Download URL: geotcha-0.6.1-py3-none-any.whl
Upload date: Mar 5, 2026
Size: 50.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for geotcha-0.6.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3a3c314a83554d0b7bd1581a13d70527456cf646f1a12f1528cff67a8794af6c`
MD5	`4656c00b613ffec1c03782a74a9699de`
BLAKE2b-256	`f0cda145d91a42541f560b6596b4a6f8f55dd68385d2b74fdf32846f094dcbd3`

Provenance

The following attestation bundles were made for geotcha-0.6.1-py3-none-any.whl:

Publisher: release.yml on shantanubafna/GEOtcha

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: geotcha-0.6.1-py3-none-any.whl
- Subject digest: 3a3c314a83554d0b7bd1581a13d70527456cf646f1a12f1528cff67a8794af6c
- Sigstore transparency entry: 1038404970
- Sigstore integration time: Mar 5, 2026
Source repository:
- Permalink: shantanubafna/GEOtcha@284a0e18a49898f41ab4ff7dfa2ecb2834f03d24
- Branch / Tag: refs/tags/v0.6.1
- Owner: https://github.com/shantanubafna
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@284a0e18a49898f41ab4ff7dfa2ecb2834f03d24
- Trigger Event: push
