RAG pipeline for gene-disease literature summarization

RAG-Powered Gene Discovery Assistant (ragbio)

A generative AI tool for biomedical knowledge discovery, built on Hugging Face embeddings and LLMs served through Ollama (DeepSeek / LLaMA3). It combines retrieval-augmented generation (RAG) with PubMed literature and gene annotation data to summarize gene–disease relationships. It is distributed as a Python package (ragbio) that can be imported into other bioinformatics projects.


Overview

The RAG-powered assistant enables:

  • Semantic search over PubMed abstracts and gene annotations.
  • Summarization of complex biomedical information.
  • Citation tracking with PubMed IDs.
  • A modular, reusable pipeline for gene–disease exploration.

Example queries:

  • "Which genes are linked to oxidative stress in Alzheimer’s disease?"
  • "Summarize recent findings about TP53 variants in cancer."

Architecture

User Query
│
▼
Vector Retrieval (BioBERT / BioSentVec Embeddings)
│
▼
Top Abstracts + Gene Annotations
│
▼
Ollama LLM (DeepSeek / LLaMA3)
│
▼
Summarized Biomedical Answer + Citations
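
The retrieval and prompt-assembly stages of the diagram above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: a bag-of-words vector stands in for BioBERT embeddings, a sorted list stands in for a FAISS index, and the corpus, the placeholder IDs (PMID-A, ...), and the function names embed, retrieve, and build_prompt are all hypothetical.

```python
import math
import re
from collections import Counter

# Toy corpus with placeholder IDs; these are not real PubMed records.
ABSTRACTS = {
    "PMID-A": "SNCA and LRRK2 variants are associated with Parkinson disease risk.",
    "PMID-B": "TP53 mutations drive tumor progression in many cancers.",
    "PMID-C": "Oxidative stress and APOE contribute to Alzheimer disease pathology.",
}


def embed(text):
    """Bag-of-words term counts; a stand-in for a dense sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=2):
    """Return the IDs of the k abstracts most similar to the query."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(text)), pmid) for pmid, text in ABSTRACTS.items()),
        reverse=True,
    )
    return [pmid for _, pmid in scored[:k]]


def build_prompt(query, pmids):
    """Assemble the LLM prompt, tagging each abstract with its ID for citation."""
    context = "\n".join(f"[{p}] {ABSTRACTS[p]}" for p in pmids)
    return (
        "Answer using only the context below and cite sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    query = "Parkinson disease genes"
    print(build_prompt(query, retrieve(query)))
```

Tagging each abstract with its ID inside the prompt is one simple way to let the model's answer carry citations back to the source records.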

Installation

Clone and install as a package:

git clone https://github.com/<your-username>/rag-gene-discovery-assistant.git
cd rag-gene-discovery-assistant
pip install -e .
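
To confirm the editable install worked, a short check along these lines may help. It uses only the standard library; the helper name package_status is hypothetical, and the import name ragbio is taken from the Usage section below.

```python
from importlib import metadata, util


def package_status(name):
    """Return (importable, version); version is None if no distribution is found."""
    importable = util.find_spec(name) is not None
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


if __name__ == "__main__":
    print(package_status("ragbio"))
```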

Example requirements.txt

langchain
faiss-cpu
sentence-transformers
biopython
pymed
requests
sqlite-utils
ollama
pandas
beautifulsoup4

Usage

1. Fetch PubMed Data

from ragbio.utils.data_loader import main as fetch_pubmed_data

# Download abstracts and metadata
fetch_pubmed_data()
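
This README does not document how the loader talks to PubMed. As a rough illustration of the kind of request involved, the sketch below builds NCBI E-utilities URLs for a search followed by an abstract fetch. It is an assumption about the general approach, not the package's code: the requirements list pymed and biopython, so the real loader may well go through those libraries instead. The function names are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

# Base URL of the public NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """URL that searches PubMed and returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """URL that fetches plain-text abstracts for a list of PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("TP53 cancer", retmax=5))
```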

2. Run RAG Query

from ragbio import run_rag_query

result = run_rag_query("genes linked to Parkinson's disease")
print(result["summary"])
print(result["citations"])
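
The result can also be rendered as a short report with links back to PubMed. The sketch below assumes that result["citations"] is a list of PubMed IDs; the README does not specify its shape, so adjust accordingly. format_result is a hypothetical helper, not part of ragbio.

```python
def format_result(result):
    """Render a result dict as plain text with one PubMed link per citation."""
    summary = str(result.get("summary", "")).strip()
    lines = [summary or "(no summary returned)", "", "Sources:"]
    citations = result.get("citations") or []  # assumed: a list of PMIDs
    if not citations:
        lines.append("  (none)")
    for pmid in citations:
        lines.append(f"  PMID {pmid}: https://pubmed.ncbi.nlm.nih.gov/{pmid}/")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(format_result({"summary": "Example summary.", "citations": ["12345"]}))
```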

Example Output Network

Here is an example of the gene–disease–drug network generated by the RAG pipeline:

Figure 1: Gene–disease–drug co-occurrence network generated from top PubMed abstracts.
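
How the package builds this network is not described here. One common way to derive a co-occurrence graph is to count how often two entities are mentioned in the same abstract, as sketched below. The function name cooccurrence_edges and the naive substring matching are illustrative assumptions; a real pipeline would more likely use a named-entity recognizer.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(abstracts, entities):
    """Count, for each entity pair, the abstracts in which both appear."""
    edges = Counter()
    for text in abstracts:
        lowered = text.lower()
        # Naive case-insensitive substring match, sorted for stable pair order.
        present = sorted(e for e in entities if e.lower() in lowered)
        edges.update(combinations(present, 2))
    return edges


if __name__ == "__main__":
    # Toy sentences, not real abstracts.
    docs = [
        "SNCA, Parkinson disease and levodopa are mentioned together here.",
        "LRRK2 and SNCA are both mentioned with Parkinson disease.",
    ]
    names = ["SNCA", "LRRK2", "Parkinson disease", "levodopa"]
    for (a, b), weight in cooccurrence_edges(docs, names).most_common():
        print(a, "--", b, weight)
```

The resulting edge weights can be passed to any graph library for layout and drawing.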


3. Optional: Explore in Notebook

Open notebooks/RAG_GeneDiscovery_Assistant.ipynb to see example queries, visualizations, and outputs.


Technologies Used

Category       Tool
Embeddings     BioBERT, BioSentVec (Hugging Face)
LLM Backend    DeepSeek / LLaMA3 (Ollama)
Retrieval      FAISS
Data Sources   PubMed, UniProt, NCBI Gene
Language       Python 3.10+
Frameworks     LangChain (optional), Sentence Transformers
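
For readers unfamiliar with the LLM backend, the sketch below shows roughly what a non-streaming call to a local Ollama server looks like over its HTTP API. It is not taken from ragbio, which may use the ollama Python client listed in the requirements instead. The model tag "llama3" and the function names are assumptions, and generate needs a running Ollama server with that model pulled.

```python
import json
from urllib import request

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama3"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3"):
    """Send the request and return the generated text (requires a live server)."""
    with request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```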

Future Enhancements

  • Compare DeepSeek/LLaMA3 with BioGPT outputs.
  • Integrate Neo4j for gene–disease–drug knowledge graph visualization.
  • Fine-tune LLMs on curated variant interpretation reports for improved clinical relevance.
  • Extend package API for direct integration in Django, FastAPI, and Streamlit apps.

Author

Manish Kumar
Senior Bioinformatics Software Developer | AI Researcher | Data Science Enthusiast
