RAG pipeline for gene-disease literature summarization
RAG-Powered Gene Discovery Assistant (ragbio)
A generative AI tool for biomedical knowledge discovery using Hugging Face embeddings and Ollama LLMs (DeepSeek / LLaMA3). This project integrates retrieval-augmented generation (RAG) with PubMed literature and gene annotation data to summarize gene–disease relationships. It is distributed as a Python package (ragbio) that can be imported and reused across bioinformatics projects.
Overview
The RAG-powered assistant enables:
- Semantic search over PubMed abstracts and gene annotations.
- Summarization of complex biomedical information.
- Citation tracking with PubMed IDs.
- Modular and reusable pipeline for gene–disease exploration.
- Case study–specific queries using query_name to organize outputs and visualizations.
Example queries:
- "Which genes are linked to oxidative stress in Alzheimer’s disease?"
- "Summarize recent findings about TP53 variants in cancer."
Architecture
User Query
│
▼
Vector Retrieval (BioBERT / BioSentVec Embeddings)
│
▼
Top Abstracts + Gene Annotations
│
▼
Ollama LLM (DeepSeek / LLaMA3)
│
▼
Summarized Biomedical Answer + Citations
│
▼
Optional: Gene–Disease–Drug Network (Cytoscape / Streamlit)
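The retrieval and prompt-assembly stages of this flow can be sketched in plain Python. This is a minimal illustration, not ragbio's implementation: a bag-of-words vector stands in for the BioBERT / BioSentVec embeddings, a sorted list stands in for the FAISS index, and the corpus, the PMID_A-style identifiers, and the function names embed, cosine, retrieve, and build_prompt are all made up for the example. The resulting prompt is what would be handed to the Ollama model.

```python
import math
from collections import Counter

# Toy corpus: identifiers and texts are placeholders, not real PubMed records.
ABSTRACTS = {
    "PMID_A": "oxidative stress genes in alzheimer disease neurons",
    "PMID_B": "tp53 variants and tumor suppression in cancer",
    "PMID_C": "alzheimer disease risk and apoe gene variants",
}

def embed(text):
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, top_k=2):
    """Return the top_k (id, text) pairs ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(
        ABSTRACTS.items(),
        key=lambda item: cosine(q, embed(item[1])),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, hits):
    """Assemble an LLM prompt that carries each abstract's ID for citation."""
    context = "\n".join(f"[{pmid}] {text}" for pmid, text in hits)
    return (
        "Answer using only the abstracts below and cite their IDs.\n"
        f"{context}\n\nQuestion: {query}"
    )

query = "genes linked to alzheimer disease"
hits = retrieve(query)
prompt = build_prompt(query, hits)
print(prompt)
```

Keeping the identifier next to each abstract in the prompt is what lets the generated summary cite its sources.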
Installation
Clone and install as a package:
git clone https://github.com/<your-username>/rag-gene-discovery-assistant.git
cd rag-gene-discovery-assistant
pip install -e .
Example requirements.txt
langchain
faiss-cpu
sentence-transformers
biopython
pymed
requests
sqlite-utils
ollama
pandas
beautifulsoup4
streamlit
Usage
1. Fetch PubMed Data
from ragbio.utils.rag_data_loader import main as fetch_pubmed_data
# Download abstracts and metadata
fetch_pubmed_data()
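The internals of fetch_pubmed_data are not shown here. As an illustration of what a PubMed search step typically involves, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint and extracts PMIDs from its JSON response. The helper names build_esearch_url and extract_pmids are hypothetical and not part of ragbio, and the sample response uses placeholder IDs.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmax=20):
    """Build an ESearch URL that asks PubMed for up to retmax matching PMIDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"

def extract_pmids(esearch_json):
    """Pull the PMID list out of a parsed ESearch JSON response."""
    return esearch_json.get("esearchresult", {}).get("idlist", [])

url = build_esearch_url("Alzheimer disease AND oxidative stress", retmax=5)

# Placeholder response in the ESearch JSON layout; the IDs are not real PMIDs.
sample_response = {"esearchresult": {"idlist": ["111", "222"]}}
pmids = extract_pmids(sample_response)
```

The URL can be fetched with requests (already in the requirements list). NCBI asks clients to rate-limit their requests, so a real loader would also add a delay or an API key.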
2. Run RAG Query (with query_name)
from ragbio.pipeline.rag_pipeline import RAGAssistant
# Use query_name to organize output
assistant = RAGAssistant(output_dir="output/Alzheimer_CaseStudy", query_name="Alzheimer_CaseStudy")
summary, pmids, structured = assistant.run_pipeline(
    "genes linked to Alzheimer’s disease",
    top_k=10,
    structured=True
)
Outputs are stored under output/<query_name>/ for easier tracking.
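To make the output/<query_name>/ convention concrete, here is a minimal sketch of how per-query results could be written to their own directory. The function save_result, the file name summary.json, and the JSON keys are assumptions for illustration; the files ragbio actually writes may differ.

```python
import json
from pathlib import Path

def save_result(base_dir, query_name, summary, pmids):
    """Write one query's summary and PMIDs under base_dir/query_name/."""
    out_dir = Path(base_dir) / query_name
    out_dir.mkdir(parents=True, exist_ok=True)  # create the case-study folder if needed
    out_path = out_dir / "summary.json"  # hypothetical file name
    payload = {"query_name": query_name, "summary": summary, "pmids": pmids}
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return out_path
```

Calling save_result("output", "Alzheimer_CaseStudy", summary, pmids) would then place the file at output/Alzheimer_CaseStudy/summary.json, keeping each case study's artifacts separate.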
3. Visualize Gene–Disease–Drug Networks
You can visualize the structured outputs in Cytoscape via Streamlit:
streamlit run ragbio/pipeline/rag_cytoscape_streamlit.py -- --query_name Alzheimer_CaseStudy
This reads all JSON files from output/<query_name>/ and plots the gene–target–drug–disease network.
Example Output Network
Figure 1: Gene–disease–drug co-occurrence network generated from top PubMed abstracts.
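A co-occurrence network like this one can be derived from the JSON files with a few lines of Python. The sketch below assumes each file holds a record (or a list of records) with optional "gene", "disease", and "drug" keys; that schema and the function names load_records and cooccurrence_edges are assumptions, since the structured output format is not documented here.

```python
import json
from collections import Counter
from itertools import combinations
from pathlib import Path

def load_records(query_dir):
    """Read every *.json file in query_dir into one flat list of records."""
    records = []
    for path in sorted(Path(query_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        records.extend(data if isinstance(data, list) else [data])
    return records

def cooccurrence_edges(records, fields=("gene", "disease", "drug")):
    """Count how often each pair of entities appears in the same record."""
    edges = Counter()
    for rec in records:
        # Keep only the fields that are present and non-empty in this record.
        nodes = [rec[field] for field in fields if rec.get(field)]
        for a, b in combinations(nodes, 2):
            edges[(a, b)] += 1
    return edges
```

The resulting Counter maps (source, target) pairs to edge weights, which is the shape most graph front ends expect for an edge list.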
4. Optional: Explore in Notebook
Open notebooks/RAG_GeneDiscovery_Assistant.ipynb to see example queries, visualizations, and outputs.
Technologies Used
| Category | Tool |
|---|---|
| Embeddings | BioBERT, BioSentVec (Hugging Face) |
| LLM Backend | DeepSeek / LLaMA3 (Ollama) |
| Retrieval | FAISS |
| Data Sources | PubMed, UniProt, NCBI Gene |
| Language | Python 3.10+ |
| Frameworks | LangChain, Sentence Transformers, Streamlit |
Future Enhancements
- Compare DeepSeek/LLaMA3 with BioGPT outputs.
- Integrate Neo4j for gene–disease–drug knowledge graph visualization.
- Fine-tune LLMs on curated variant interpretation reports for improved clinical relevance.
- Extend package API for direct integration in Django, FastAPI, and Streamlit apps.
- Add support for multiple query_name outputs to track different case studies.
Download files
File details
Details for the file ragbio-0.1.13.tar.gz.
File metadata
- Download URL: ragbio-0.1.13.tar.gz
- Upload date:
- Size: 314.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 355e441e7f293349e151e737461bf75050e707149f4fbc728eddff224f7f4714 |
| MD5 | 620ae11370c5950901a64a0d4104c7d1 |
| BLAKE2b-256 | 886a2ec5f1bd89c003be226c58beb02fac2d381fbf82e0431448569082feccf9 |
File details
Details for the file ragbio-0.1.13-py3-none-any.whl.
File metadata
- Download URL: ragbio-0.1.13-py3-none-any.whl
- Upload date:
- Size: 312.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b3dfc35267b45f0cc57fba8877da322b91429f8a270bef2422731f82b62a95d |
| MD5 | fd7b15e9c5c600efd296e62a04170cd8 |
| BLAKE2b-256 | dc242258809d9b74a8ca4f242fff2e0710b7bf4931298e52997448b7aa979848 |