GeoVectorSearch is a lightweight Python SDK and command-line tool for semantic discovery of GEO datasets suitable for differential gene expression analysis. Powered by FAISS-based vector search and optional GPT-based filtering, it helps researchers and developers quickly identify relevant RNA-seq or microarray datasets.
🧬 GeoVectorSearch
GeoVectorSearch is a lightweight Python SDK and command-line tool for discovering high-quality GEO gene expression datasets relevant to a disease or biological condition — optimized for differential expression (DE) analysis.
It combines semantic search using sentence embeddings with optional GPT-based filtering to help you rapidly identify suitable datasets for your research or pipeline.
🔍 Features
- ✅ Natural language search for GEO datasets
- ⚡ Fast vector search using FAISS and prebuilt sentence embeddings
- 🧠 Optional GPT filtering to assess dataset quality for DE analysis, in two modes: basic filtering, and enhanced filtering, which sorts datasets into tiers:
  - Tier 1: Highly suitable for DE studies
  - Tier 2: Suitable for DE studies, but the samples come from cell lines, organoids, or xenografts
  - Tier 3: Not directly suitable for DE studies, but usable for exploratory studies
- 🧬 Supports microarray and RNA-seq datasets
- 🖥️ Interactive CLI for a smooth user experience
- 🧩 Easy to integrate into larger pipelines or SDKs
- 💾 Save results locally for downstream analysis
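To give a feel for what the vector search step does, here is a minimal sketch of nearest-neighbour ranking by cosine similarity. It is not the package's implementation: GeoVectorSearch uses FAISS and prebuilt sentence embeddings, whereas this sketch swaps in a brute-force pure-Python search over toy vectors. The names `cosine_similarity`, `top_k`, and `toy_index`, as well as the dataset IDs and vector values, are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query_vec, dataset_vecs, k=2):
    """Return the k (dataset_id, score) pairs most similar to the query."""
    scored = [
        (dataset_id, cosine_similarity(query_vec, vec))
        for dataset_id, vec in dataset_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of
# dimensions. IDs and values are hypothetical.
toy_index = {
    "GSE_A": [0.9, 0.1, 0.0],
    "GSE_B": [0.1, 0.9, 0.0],
    "GSE_C": [0.7, 0.3, 0.1],
}

hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for dataset_id, score in hits:
    print(f"{dataset_id}: {score:.3f}")
```

A brute-force scan like this is linear in the number of datasets; an index library such as FAISS serves the same purpose at a much larger scale.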
📦 Installation
Install using your preferred package manager:
uv pip install geo-pysearch
Or clone the repository and install locally:
git clone https://github.com/Tinfloz/geo-vector-search.git
cd geo-vector-search
uv pip install .
🧪 Example (Python SDK)
from geo_pysearch.sdk import search_datasets

results = search_datasets(
    query="duchenne muscular dystrophy",
    dataset_type="microarray",
    gpt_filter_type="enhanced",
    top_k=50,
    use_gpt_filter=True,
    return_all_gpt_results=True,
)
print(results.head())
Convenience functions:
from geo_pysearch.sdk import search_microarray, search_rnaseq
search_microarray("breast cancer")
search_rnaseq("lung fibrosis", use_gpt_filter=True)
💻 Example (CLI)
Launch the interactive CLI:
geo-search
- Use the arrow keys to select dataset type and filtering options
- Enter your disease query
- Results will be saved to a local CSV file in a new directory
- Review and use the datasets for downstream DE analysis
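If you want to reproduce the "save to a CSV file in a new directory" step in your own pipeline, the following sketch shows one way to do it with the standard library only. It is an illustration, not the CLI's actual code: the helper names `save_results` and `load_results`, the directory name, and the `gse_id` and `score` columns are all assumptions.

```python
import csv
import tempfile
from pathlib import Path


def save_results(rows, out_dir, filename="results.csv"):
    """Write a non-empty list of dict rows to out_dir/filename as CSV."""
    if not rows:
        raise ValueError("no rows to save")
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    csv_path = out_path / filename
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return csv_path


def load_results(csv_path):
    """Read the CSV back as a list of dicts (all values come back as strings)."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Hypothetical result rows; real output columns may differ.
demo_rows = [
    {"gse_id": "GSE_A", "score": 0.99},
    {"gse_id": "GSE_C", "score": 0.91},
]

demo_dir = Path(tempfile.mkdtemp()) / "geo_search_results"
saved = save_results(demo_rows, demo_dir)
reloaded = load_results(saved)
print(f"Saved {len(reloaded)} rows to {saved.name}")
```

Because the Python SDK example calls `results.head()`, the SDK result appears to be DataFrame-like; if it is a pandas DataFrame, `results.to_csv(path, index=False)` would do the same job in one line.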
🧠 GPT Filtering (Optional)
If enabled, the SDK uses GPT to evaluate whether each dataset is suitable for differential gene expression analysis. You can configure GPT behavior with:
- Adjustable confidence thresholds
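As a rough illustration of how a confidence threshold and the tiers from the enhanced filter might be used together downstream, consider the sketch below. The tier descriptions come from the feature list above; everything else is an assumption, including the record fields (`gse_id`, `tier`, `confidence`), the helper names, and the default threshold of 0.7. The package's real output format may differ.

```python
from collections import defaultdict

# Tier meanings as described in the feature list.
TIER_LABELS = {
    1: "Highly suitable for DE studies",
    2: "Suitable, but samples come from cell lines, organoids, or xenografts",
    3: "Not directly suitable; usable for exploratory studies",
}


def filter_by_confidence(records, threshold=0.7):
    """Keep only records whose confidence is at or above the threshold."""
    return [r for r in records if r["confidence"] >= threshold]


def group_by_tier(records):
    """Map each tier number to the list of dataset IDs assigned to it."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["tier"]].append(record["gse_id"])
    return dict(grouped)


# Hypothetical GPT-filter output; IDs, tiers, and scores are made up.
records = [
    {"gse_id": "GSE_A", "tier": 1, "confidence": 0.92},
    {"gse_id": "GSE_B", "tier": 2, "confidence": 0.75},
    {"gse_id": "GSE_C", "tier": 1, "confidence": 0.40},
    {"gse_id": "GSE_D", "tier": 3, "confidence": 0.81},
]

kept = filter_by_confidence(records, threshold=0.7)
tiers = group_by_tier(kept)
for tier in sorted(tiers):
    print(f"Tier {tier} ({TIER_LABELS[tier]}): {tiers[tier]}")
```

Raising the threshold trades recall for precision: fewer datasets survive, but those that do carry a stronger suitability judgement.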
📁 Project Structure
gse-pysearch/
├── geo_pysearch/
│ ├── data/ # Prebuilt FAISS index, vectors, metadata
│ ├── vector_search/
│ │ ├── vector_search.py
│ │ ├── gpt_filter.py
│ │ └── tiered_gpt_filter.py
│ ├── sdk.py # Main SDK interface
│ └── cli.py # CLI implementation
├── examples/ # Example usage scripts
└── .env # Optional environment variables
🛠️ Requirements
- Python 3.12+
- faiss-cpu, pandas, sentence-transformers
📖 License
GNU General Public License v3.0
This project is licensed under the GNU GPLv3, which guarantees end users the freedom to run, study, share, and modify the software.
If you redistribute or modify this software, your contributions must also be licensed under the same terms.
References
This project implements semantic query generation and evidence extraction strategies inspired by:
- Deka, P., Jurek-Loughrey, A., & others. (2022). Evidence Extraction to Validate Medical Claims in Fake News Detection. International Conference on Health Information Science, pp. 3–15.
- Deka, P., & Jurek-Loughrey, A. (2021). Unsupervised Keyword Combination Query Generation from Online Health Related Content for Evidence-Based Fact Checking. The 23rd International Conference on Information Integration and Web Intelligence, pp. 267–277.