RAG pipeline for gene-disease literature summarization
RAG-Powered Gene Discovery Assistant (ragbio)
A generative AI tool for biomedical knowledge discovery, built on Hugging Face embeddings and LLMs served through Ollama (DeepSeek / LLaMA3). It applies retrieval-augmented generation (RAG) to PubMed literature and gene annotation data to summarize gene–disease relationships. It is distributed as a Python package (ragbio), so it can be imported into other bioinformatics projects.
Overview
The RAG-powered assistant enables:
- Semantic search over PubMed abstracts and gene annotations.
- Summarization of complex biomedical information.
- Citation tracking with PubMed IDs.
- Modular and reusable pipeline for gene-disease exploration.
Example queries:
- "Which genes are linked to oxidative stress in Alzheimer’s disease?"
- "Summarize recent findings about TP53 variants in cancer."
Architecture
User Query
│
▼
Vector Retrieval (BioBERT / BioSentVec Embeddings)
│
▼
Top Abstracts + Gene Annotations
│
▼
Ollama LLM (DeepSeek / LLaMA3)
│
▼
Summarized Biomedical Answer + Citations
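The flow above can be illustrated with a toy, dependency-free sketch. This is not the package's implementation: a bag-of-words count vector stands in for BioBERT / BioSentVec embeddings, a plain sort stands in for the FAISS index, and the last step only builds the prompt that would be handed to the LLM. The function names (`embed`, `retrieve`, `build_prompt`) and the abstracts with their IDs are hypothetical placeholders.

```python
import math
import re
from collections import Counter

# Toy corpus: IDs and texts are made-up placeholders, not real PubMed records.
ABSTRACTS = {
    "PMID-A": "SNCA mutations are associated with Parkinson's disease.",
    "PMID-B": "TP53 variants drive tumour progression in several cancers.",
    "PMID-C": "LRRK2 and SNCA contribute to Parkinson's disease risk.",
}


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (not BioBERT)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, abstracts, k=2):
    """Return the k (id, text) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(
        abstracts.items(),
        key=lambda item: cosine(q, embed(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, hits):
    """Assemble the cited context block that would be sent to the LLM."""
    context = "\n".join(f"[{pmid}] {text}" for pmid, text in hits)
    return (
        "Answer using only the sources below and cite their IDs.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


query = "genes linked to Parkinson's disease"
hits = retrieve(query, ABSTRACTS)
prompt = build_prompt(query, hits)
print(prompt)
```

Keeping the source ID next to each retrieved abstract in the prompt is what lets the final answer carry citations back to PubMed.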
Installation
Clone and install as a package:
git clone https://github.com/<your-username>/rag-gene-discovery-assistant.git
cd rag-gene-discovery-assistant
pip install -e .
Example requirements.txt
langchain
faiss-cpu
sentence-transformers
biopython
pymed
requests
sqlite-utils
ollama
pandas
beautifulsoup4
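Several of these distributions are imported under a different name than the one passed to pip (for example `faiss-cpu` is imported as `faiss`). A quick way to confirm the environment is complete is sketched below; the `missing_packages` helper is a hypothetical name for illustration, not part of ragbio.

```python
from importlib.util import find_spec

# Distribution name (as used by pip) -> top-level import name.
IMPORT_NAMES = {
    "langchain": "langchain",
    "faiss-cpu": "faiss",
    "sentence-transformers": "sentence_transformers",
    "biopython": "Bio",
    "pymed": "pymed",
    "requests": "requests",
    "sqlite-utils": "sqlite_utils",
    "ollama": "ollama",
    "pandas": "pandas",
    "beautifulsoup4": "bs4",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in import_names.items()
        if find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing: " + ", ".join(missing))
else:
    print("All dependencies found.")
```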
Usage
1. Fetch PubMed Data
from ragbio.utils.data_loader import main as fetch_pubmed_data
# Download abstracts and metadata
fetch_pubmed_data()
2. Run RAG Query
from ragbio import run_rag_query
result = run_rag_query("genes linked to Parkinson's disease")
print(result["summary"])
print(result["citations"])
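The exact shape of the returned dictionary is not documented here. Assuming `result["citations"]` is a list of PubMed IDs, a small helper such as the hypothetical `format_answer` below could render the summary with one PubMed link per citation. The example dictionary uses made-up values in place of a real `run_rag_query` result.

```python
def format_answer(result):
    """Render a summary followed by one PubMed link per citation.

    Assumes result["citations"] is a list of PubMed IDs; adjust this if the
    package returns a different structure.
    """
    lines = [result["summary"], "", "Sources:"]
    for pmid in result["citations"]:
        lines.append(f"- PMID {pmid}: https://pubmed.ncbi.nlm.nih.gov/{pmid}/")
    return "\n".join(lines)


# Placeholder result with made-up values, standing in for run_rag_query(...).
example = {"summary": "Example summary text.", "citations": ["00000001", "00000002"]}
print(format_answer(example))
```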
Example Output Network
Here is an example of the gene–disease–drug network generated by the RAG pipeline:
Figure 1: Gene–disease–drug co-occurrence network generated from top PubMed abstracts.
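How the package builds this network is not shown here. As a rough sketch of the idea, a co-occurrence network can be derived by counting how many abstracts mention each pair of entities together. The entity list, the substring matching, and the `cooccurrence_edges` name are simplifying assumptions for illustration; a real pipeline would likely use named-entity recognition or curated vocabularies.

```python
from collections import Counter
from itertools import combinations

# Hypothetical entity vocabulary mapping each name to its type.
ENTITIES = {"SNCA": "gene", "LRRK2": "gene", "Parkinson": "disease", "levodopa": "drug"}


def cooccurrence_edges(abstracts, entities=ENTITIES):
    """Count how many abstracts mention each pair of entities together."""
    edges = Counter()
    for text in abstracts:
        lowered = text.lower()
        # Case-insensitive substring match; sorted so each pair has one key.
        present = sorted(name for name in entities if name.lower() in lowered)
        edges.update(combinations(present, 2))
    return edges


# Made-up example abstracts, not real PubMed records.
abstracts = [
    "SNCA and LRRK2 variants in Parkinson disease.",
    "Levodopa response in Parkinson patients carrying LRRK2 mutations.",
]
edges = cooccurrence_edges(abstracts)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```

The resulting edge weights can be loaded into any graph library for plotting, with node colour taken from the entity type.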
3. Optional: Explore in Notebook
Open notebooks/RAG_GeneDiscovery_Assistant.ipynb to see example queries, visualizations, and outputs.
Technologies Used
| Category | Tool |
|---|---|
| Embeddings | BioBERT, BioSentVec (Hugging Face) |
| LLM Backend | DeepSeek / LLaMA3 (Ollama) |
| Retrieval | FAISS |
| Data Sources | PubMed, UniProt, NCBI Gene |
| Language | Python 3.10+ |
| Frameworks | LangChain (optional), Sentence Transformers |
Future Enhancements
- Compare DeepSeek/LLaMA3 with BioGPT outputs.
- Integrate Neo4j for gene–disease–drug knowledge graph visualization.
- Fine-tune LLMs on curated variant interpretation reports for improved clinical relevance.
- Extend package API for direct integration in Django, FastAPI, and Streamlit apps.
Author
Manish Kumar, Senior Bioinformatics Software Developer | AI Researcher | Data Science Enthusiast
File details
Details for the file ragbio-0.1.12.tar.gz.
File metadata
- Download URL: ragbio-0.1.12.tar.gz
- Upload date:
- Size: 315.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 04354dc6fd68869579b00f639f607d8745198b47e9bdaa59481670db1bc2d443 |
| MD5 | 4cf2f0e2ae138c5fa77131c79c2e7f72 |
| BLAKE2b-256 | 345362f7f4e6ea6e004f8ef0bcfe3006736331f55847fa6dea5ab395b84f1663 |
File details
Details for the file ragbio-0.1.12-py3-none-any.whl.
File metadata
- Download URL: ragbio-0.1.12-py3-none-any.whl
- Upload date:
- Size: 311.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ee16b8fb8cf5208b9db6521a9d904f80bf758be4202a8a8cf400a93b0af5555f |
| MD5 | aeff1c7e2ebd71678f40f420cc162ed3 |
| BLAKE2b-256 | 49efb76938c77f3c7b1ea66f2f25e49ae4b169e1f8050de61183a14e6ac4d4fe |