GEOtcha
Extract and harmonize RNA-seq metadata from NCBI GEO.
GEOtcha is a CLI tool that helps researchers search GEO by disease keyword, filter to human RNA-seq datasets, extract structured metadata at both series (GSE) and sample (GSM) levels, and harmonize the results into standardized output files.
Installation
pip install geotcha
For development:
git clone https://github.com/shantanubafna/GEOtcha.git
cd GEOtcha
pip install -e ".[dev]"
Quick Start
Search for datasets
geotcha search "inflammatory bowel disease"
Full pipeline with subset testing
geotcha run "IBD" --subset 5 --output ./results/ --harmonize
Extract from specific GSE IDs
geotcha extract GSE12345 GSE67890 --output ./results/
With LLM harmonization
pip install "geotcha[llm]"
geotcha run "IBD" --harmonize --llm --llm-provider anthropic
Configuration
# Set your NCBI API key (recommended for higher rate limits)
geotcha config set ncbi_api_key "YOUR_KEY"
# Set your email for NCBI Entrez
geotcha config set ncbi_email "you@example.com"
# View current configuration
geotcha config show
Configuration priority: CLI flags > environment variables (GEOTCHA_*) > config file (~/.config/geotcha/config.toml) > defaults.
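The priority order above can be illustrated with a minimal Python sketch. This is not GEOtcha's actual code: the function name resolve_setting, the DEFAULTS dictionary, and the exact environment variable spelling (GEOTCHA_ plus the upper-cased setting name) are assumptions made for illustration.

```python
import os

# Hypothetical defaults; the real defaults are not documented here.
DEFAULTS = {"ncbi_api_key": None, "ncbi_email": None}


def resolve_setting(name, cli_flags, file_config, environ=None, defaults=DEFAULTS):
    """Return the first value found, checking sources from highest to lowest priority."""
    environ = os.environ if environ is None else environ

    # 1. CLI flags win when they were actually supplied.
    if cli_flags.get(name) is not None:
        return cli_flags[name]

    # 2. Environment variables (assumed to be named GEOTCHA_<SETTING>).
    env_name = "GEOTCHA_" + name.upper()
    if env_name in environ:
        return environ[env_name]

    # 3. Values loaded from the config file.
    if name in file_config:
        return file_config[name]

    # 4. Built-in defaults.
    return defaults.get(name)
```

Under these assumptions, a value exported as GEOTCHA_NCBI_EMAIL would override the config file but would itself be overridden by a CLI flag.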
Output
GEOtcha produces:
- gse_summary.csv: One row per GSE with series-level metadata
- gsm/<GSE_ID>_samples.csv: Per-GSE file with sample-level metadata
- With --harmonize: additional _harmonized columns with standardized values
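For downstream analysis, the output layout described above could be read back with the Python standard library, as in the sketch below. The helper name load_results is hypothetical, and no column names are assumed beyond whatever header row each CSV contains.

```python
import csv
from pathlib import Path


def load_results(output_dir):
    """Load the series summary and the per-GSE sample tables from an output directory."""
    out = Path(output_dir)

    # One row per GSE.
    with open(out / "gse_summary.csv", newline="", encoding="utf-8") as fh:
        series = list(csv.DictReader(fh))

    # One file per GSE under gsm/, named <GSE_ID>_samples.csv.
    samples = {}
    for path in sorted((out / "gsm").glob("*_samples.csv")):
        gse_id = path.name[: -len("_samples.csv")]
        with open(path, newline="", encoding="utf-8") as fh:
            samples[gse_id] = list(csv.DictReader(fh))

    return series, samples
```

Calling load_results("./results/") would return a list of series rows and a dictionary mapping each GSE ID to its sample rows, each row being a dict keyed by column header.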
Fields extracted
| Level | Fields |
|---|---|
| GSE | ID, URL, title, organism, experiment type, platform, sample counts, PubMed links, tissue, disease, treatment, timepoint, gender, age, responder info |
| GSM | ID, title, source, organism, platform, instrument, library strategy, tissue, cell type, disease, gender, age, treatment, timepoint, responder status |
Interactive Flow
$ geotcha run "IBD"
Searching GEO for: IBD, ulcerative colitis, Crohn's disease...
Found 347 datasets. After filtering (Homo sapiens + RNA-seq): 182 datasets.
Run on a subset first? [Y/n]: Y
Subset size [5]: 5
Processing 5/182 datasets...
[████████████████] 5/5 complete
Results: ./output/gse_summary.csv (5 rows), ./output/gsm/ (5 files)
Proceed with remaining 177 datasets? [Y/n]:
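The "Homo sapiens + RNA-seq" filter shown in this session could, in principle, be expressed as a GEO DataSets search term. The sketch below only builds such a query string; it is an assumption about one way to phrase the filter, not a description of how GEOtcha performs it, and the function name build_geo_query is hypothetical.

```python
def build_geo_query(terms):
    """Combine disease terms with organism, data type, and entry type restrictions."""
    # Quote each term so multi-word phrases are searched as phrases.
    disease = " OR ".join(f'"{t}"' for t in terms)
    return (
        f"({disease})"
        ' AND "Homo sapiens"[Organism]'
        ' AND "expression profiling by high throughput sequencing"[DataSet Type]'
        ' AND "gse"[Entry Type]'
    )
```

The resulting string could be pasted into the GEO DataSets search box or passed to an Entrez search against the gds database.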
Disease Expansion
GEOtcha automatically expands disease keywords to capture related terms:
- IBD → inflammatory bowel disease, ulcerative colitis, Crohn's disease (abbreviations such as UC and CD are used for relevance filtering)
- SLE → systemic lupus erythematosus, lupus
- RA → rheumatoid arthritis
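The expansion listed above amounts to a keyword lookup. A minimal sketch, assuming a simple in-memory table (the names EXPANSIONS and expand_keyword are hypothetical, and the real tool may store or match terms differently):

```python
# Expansion table copied from the examples above; not an exhaustive list.
EXPANSIONS = {
    "IBD": ["inflammatory bowel disease", "ulcerative colitis", "Crohn's disease"],
    "SLE": ["systemic lupus erythematosus", "lupus"],
    "RA": ["rheumatoid arthritis"],
}


def expand_keyword(keyword):
    """Return the original keyword followed by any related search terms."""
    key = keyword.strip()
    terms = [key]
    for related in EXPANSIONS.get(key.upper(), []):
        # Skip a related term that merely repeats the keyword.
        if related.lower() != key.lower():
            terms.append(related)
    return terms
```

Keywords without an entry in the table are returned unchanged, so a search for an unlisted disease still runs on the term as typed.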
Harmonization
Rule-based normalization for:
- Gender: male/M/man → "male"
- Age: "45 years", "45yo" → "45"
- Tissue: mapped to UBERON ontology terms
- Disease: mapped to Disease Ontology terms
- Timepoint: "week 8", "W8" → "W8"
Optional LLM-assisted harmonization (--llm) for ambiguous free-text values.
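The gender, age, and timepoint rules above can be sketched with a lookup table and regular expressions. This is an illustrative approximation only: the function names and the exact patterns are assumptions, and the ontology mappings for tissue and disease are not covered.

```python
import re

# Only the variants named above; a real table would also cover other values.
GENDER_MAP = {"male": "male", "m": "male", "man": "male"}


def normalize_gender(value):
    """Map known gender spellings to a standard label; leave others unchanged."""
    return GENDER_MAP.get(value.strip().lower(), value)


def normalize_age(value):
    """Strip a trailing unit such as 'years' or 'yo' from a whole-number age."""
    m = re.fullmatch(r"\s*(\d+)\s*(?:years?|yrs?|yo|y)?\s*", value, flags=re.IGNORECASE)
    return m.group(1) if m else value


def normalize_timepoint(value):
    """Rewrite week-based timepoints such as 'week 8' as 'W8'."""
    m = re.fullmatch(r"\s*(?:week|wk|w)\s*(\d+)\s*", value, flags=re.IGNORECASE)
    return f"W{m.group(1)}" if m else value
```

Values that match no rule are returned as-is, which is the kind of ambiguous free text that an LLM-assisted pass would then be asked to resolve.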
License
MIT