A retrieval-augmented biomedical literature framework for evidence discovery, citation mapping, and downstream omics analysis
RAG-Powered Biomedical Evidence Framework (ragbio)
A reusable retrieval-augmented generation (RAG) toolkit for biomedical knowledge discovery built on PubMed literature, vector search, and Ollama-based LLMs (DeepSeek / LLaMA3).
ragbio enables study-aware ingestion, embedding, and querying of biomedical literature to support gene–disease–therapy exploration, summarization, and network visualization.
ragbio is published as a pip-installable Python package and is designed for integration into research pipelines and bioinformatics workflows.
Overview
The RAG-powered assistant enables:
- Semantic search over PubMed abstracts
- Study-scoped literature ingestion for reproducibility
- Summarization of complex biomedical evidence using LLMs
- Citation-aware responses grounded in PubMed IDs
- Modular ingestion → embedding → retrieval pipeline
- Optional gene–disease–drug network visualization
Example questions
- Which genes are linked to oxidative stress in Alzheimer’s disease?
- What therapies target amyloid pathways according to recent literature?
- Summarize evidence connecting TP53 variants to cancer therapies.
Architecture
User Question
│
▼
FAISS Vector Retrieval (PubMed Abstracts)
│
▼
Top-K Relevant Abstracts
│
▼
Ollama LLM (DeepSeek / LLaMA3)
│
▼
Grounded Biomedical Summary + PMIDs
│
▼
(Optional) Gene–Disease–Drug Network Visualization
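The last two stages of the diagram (prompting the LLM with the top-K abstracts, then returning a summary grounded in PMIDs) can be illustrated with a minimal sketch. This is not ragbio's implementation: the function names, the prompt wording, the `[PMID:<id>]` citation format, and the placeholder records are all assumptions made for illustration.

```python
import re

def build_prompt(question, abstracts):
    """Assemble a prompt that asks the LLM to cite PMIDs from the given context.

    `abstracts` is a list of (pmid, text) pairs, e.g. the top-K retrieval hits.
    """
    context = "\n\n".join(f"[PMID:{pmid}] {text}" for pmid, text in abstracts)
    return (
        "Answer using only the abstracts below and cite each claim as [PMID:<id>].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def cited_pmids(answer):
    """Return the set of PMIDs cited in an answer as [PMID:<digits>]."""
    return set(re.findall(r"\[PMID:(\d+)\]", answer))

def ungrounded_citations(answer, abstracts):
    """PMIDs cited in the answer that were not among the retrieved abstracts."""
    retrieved = {str(pmid) for pmid, _ in abstracts}
    return cited_pmids(answer) - retrieved

# Placeholder records and a made-up answer (not real PubMed data).
hits = [
    ("11111111", "Placeholder abstract about an amyloid-targeting therapy."),
    ("22222222", "Placeholder abstract about oxidative stress genes."),
]
prompt = build_prompt("What therapies target amyloid pathways?", hits)
fake_answer = "Therapy X targets amyloid [PMID:11111111]; see also [PMID:33333333]."
print(ungrounded_citations(fake_answer, hits))  # {'33333333'}
```

Checking cited PMIDs against the retrieved set is one simple way to flag citations the model may have invented.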
Installation
Install from PyPI (recommended)
pip install ragbio
Development install (from source)
git clone https://github.com/<your-username>/rag-gene-discovery-assistant.git
cd rag-gene-discovery-assistant
pip install -e .
Usage
1. Ingest PubMed Literature (study-aware)
python -m ragbio.utils.rag_data_loader \
--study Alzheimer_CaseStudy \
--search "Alzheimer Disease AND therapy" \
--retmax 500 \
--retstart 0
This creates the following structure (default: data/PubMed/):
PubMed/
├── Abstracts/Alzheimer_CaseStudy/
├── Metadata/Alzheimer_CaseStudy/
├── PDFs/Alzheimer_CaseStudy/
└── Index/Alzheimer_CaseStudy/
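For pipelines that need to locate these folders programmatically, the layout above can be reproduced with `pathlib`. This is a sketch only: `study_paths` and `ensure_study_dirs` are hypothetical helpers, not part of the ragbio API, and the example writes into a temporary directory rather than `data/`.

```python
from pathlib import Path
import tempfile

# Top-level folders shown in the tree above.
SUBDIRS = ("Abstracts", "Metadata", "PDFs", "Index")

def study_paths(root, study):
    """Map each top-level folder name to its per-study directory under root/PubMed."""
    base = Path(root) / "PubMed"
    return {name: base / name / study for name in SUBDIRS}

def ensure_study_dirs(root, study):
    """Create the per-study directories if they do not exist and return them."""
    paths = study_paths(root, study)
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths

# Demonstrate in a throwaway directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    paths = ensure_study_dirs(tmp, "Alzheimer_CaseStudy")
    created = sorted(p.relative_to(tmp).as_posix() for p in paths.values())
    print(created)
```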
2. Generate Embeddings & Build FAISS Index
python -m ragbio.embeddings.embedding_engine \
--study Alzheimer_CaseStudy
- Reads from Abstracts/<study>/
- Writes FAISS index to Index/<study>/
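To give a sense of what a vector index does at query time, here is a pure-Python stand-in: exhaustive cosine-similarity search over normalized vectors, which is what a flat inner-product FAISS index computes when the vectors are L2-normalized. Whether ragbio uses a flat index or this metric is not stated here, so treat that as an assumption; the 3-dimensional vectors are toy data, whereas real embeddings would come from the configured Ollama model.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query, vectors, k):
    """Return (index, score) for the k vectors most similar to the query.

    Scores are inner products of unit vectors, i.e. cosine similarities.
    """
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(vectors)
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Toy embeddings standing in for abstract vectors.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
hits = top_k([1.0, 0.0, 0.0], vectors, k=2)
print([i for i, _ in hits])  # [0, 1]
```

The returned indices would then be mapped back to PMIDs so that each retrieved abstract can be cited.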
3. Run RAG Queries
python -m ragbio.pipeline.rag_pipeline \
--query "Which therapies target amyloid pathways in Alzheimer’s disease?" \
--top_k 10 \
--structured \
--study Alzheimer_CaseStudy
Outputs are generated per study for clean provenance and reproducibility.
4. Visualize Gene–Disease–Drug Networks (optional)
Launch the Streamlit app:
streamlit run ragbio/pipeline/rag_cytoscape_streamlit.py --study Alzheimer_CaseStudy
This reads structured outputs and visualizes gene–disease–drug relationships as an interactive network.
Example: Gene–disease–drug co-occurrence network derived from PubMed abstracts.
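A co-occurrence network of this kind can be derived by counting how often two entities are mentioned in the same abstract. The sketch below assumes a simple record format (a PMID mapped to a set of `(entity_type, name)` pairs); ragbio's actual structured output schema is not shown here and may differ, and the record keys and `DrugX` are placeholders.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(records):
    """Count how many abstracts mention each unordered pair of entities.

    `records` maps a PMID to the set of (entity_type, name) pairs found in it.
    """
    edges = Counter()
    for entities in records.values():
        # Sorting gives every pair a canonical order, so (a, b) and (b, a) merge.
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1
    return edges

# Placeholder records (hypothetical keys and drug name).
records = {
    "PMID_A": {("gene", "APP"), ("disease", "Alzheimer disease")},
    "PMID_B": {("gene", "APP"), ("disease", "Alzheimer disease"), ("drug", "DrugX")},
}
edges = cooccurrence_edges(records)
for (a, b), weight in edges.most_common():
    print(a[1], "--", b[1], weight)
```

Edge weights like these can be used as line thickness or as a filter threshold when rendering the network.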
5. Optional: Notebook Exploration
Explore example workflows in:
notebooks/RAG_GeneDiscovery_Assistant.ipynb
Technologies Used
| Category | Tools |
|---|---|
| Embeddings | Ollama embedding models (configurable) |
| LLMs | DeepSeek, LLaMA3 (via Ollama) |
| Retrieval | FAISS |
| Data Sources | PubMed (NCBI Entrez) |
| Visualization | Streamlit, Cytoscape |
| Language | Python 3.10+ |
Design Principles
- Study-first organization for reproducibility
- Separation of concerns (ingestion ≠ embedding ≠ retrieval)
- Grounded answers with PubMed citations
- Composable modules usable outside the CLI
- Safe defaults with override via CLI or environment variables
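The last principle, defaults that can be overridden from the CLI or the environment, is commonly implemented with a fixed precedence order. The sketch below shows one such order (CLI flag, then environment variable, then default). The `--data-dir` flag and the `RAGBIO_DATA_DIR` variable are hypothetical names chosen for illustration and are not confirmed ragbio options.

```python
import argparse
import os

DEFAULT_DATA_DIR = "data/PubMed"  # default location mentioned in the usage section

def resolve_data_dir(argv=None, env=None):
    """Pick the data directory: CLI flag first, then environment, then default."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", default=None)  # hypothetical flag
    args, _ = parser.parse_known_args(argv)
    # RAGBIO_DATA_DIR is a hypothetical variable name.
    return args.data_dir or env.get("RAGBIO_DATA_DIR") or DEFAULT_DATA_DIR

env = {"RAGBIO_DATA_DIR": "/scratch/pubmed"}
print(resolve_data_dir([], {}))                          # data/PubMed
print(resolve_data_dir([], env))                         # /scratch/pubmed
print(resolve_data_dir(["--data-dir", "/tmp/pm"], env))  # /tmp/pm
```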
Future Enhancements
- Neo4j-backed gene–disease–drug knowledge graphs
- Comparative evaluation of DeepSeek vs BioGPT outputs
- Variant-level evidence integration
- API integration via FastAPI or Django
- Automated citation grounding and confidence scoring
- Multi-study dashboards and comparisons