RAG pipeline for gene-disease literature summarization

RAG-Powered Gene Discovery Assistant (ragbio)

A generative AI tool for biomedical knowledge discovery, built on Hugging Face embeddings and Ollama LLMs (DeepSeek / LLaMA3). The project combines retrieval-augmented generation (RAG) with PubMed literature and gene annotation data to summarize gene–disease relationships. It is distributed as a Python package that can be imported and reused across bioinformatics projects.


Overview

The RAG-powered assistant enables:

  • Semantic search over PubMed abstracts and gene annotations.
  • Summarization of complex biomedical information.
  • Citation tracking with PubMed IDs.
  • Modular and reusable pipeline for gene–disease exploration.
  • Case study–specific queries using query_name to organize outputs and visualizations.
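
The citation tracking listed above can be illustrated with a small sketch. It is not taken from the ragbio source: the function names build_prompt and cited_pmids, the [PMID:...] tag format, and the placeholder PMIDs are all assumptions made for illustration. The idea is to tag each retrieved abstract with its PubMed ID in the prompt, then keep only the IDs in the model's answer that were actually part of the retrieved context.

```python
import re

def build_prompt(question, abstracts):
    """Assemble an LLM prompt whose context lines are tagged with PMIDs."""
    context = "\n".join(f"[PMID:{pmid}] {text}" for pmid, text in abstracts)
    return (
        "Answer using only the abstracts below and cite PMIDs in brackets.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def cited_pmids(answer, allowed):
    """Return PMIDs cited in the answer, in first-seen order, dropping
    any ID that was not part of the retrieved context."""
    seen = []
    for pmid in re.findall(r"\[PMID:(\d+)\]", answer):
        if pmid in allowed and pmid not in seen:
            seen.append(pmid)
    return seen

# Hypothetical abstracts; the PMIDs are placeholders, not real citations.
abstracts = [
    ("1001", "Gene A is associated with oxidative stress."),
    ("1002", "Gene B variants alter amyloid processing."),
]
prompt = build_prompt("Which genes are linked to oxidative stress?", abstracts)

# A made-up model answer that cites one retrieved and one unknown PMID.
answer = "Gene A is implicated [PMID:1001]; see also [PMID:9999]."
print(cited_pmids(answer, {pmid for pmid, _ in abstracts}))
```

Filtering against the retrieved set is one simple way to avoid reporting a PubMed ID that the model invented.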

Example queries:

  • "Which genes are linked to oxidative stress in Alzheimer’s disease?"
  • "Summarize recent findings about TP53 variants in cancer."

Architecture

User Query
│
▼
Vector Retrieval (BioBERT / BioSentVec Embeddings)
│
▼
Top Abstracts + Gene Annotations
│
▼
Ollama LLM (DeepSeek / LLaMA3)
│
▼
Summarized Biomedical Answer + Citations
│
▼
Optional: Gene–Disease–Drug Network (Cytoscape / Streamlit)
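
The vector retrieval step in the diagram can be sketched as a top-k cosine similarity search. This is a toy stand-in written with the standard library only: the three-dimensional vectors and the names cosine and top_k are assumptions, whereas the package itself lists BioBERT / BioSentVec embeddings and FAISS for this stage.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (assumed values); real ones come from an embedding model.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.0, 0.0], doc_vecs, k=2))  # ['doc_a', 'doc_b']
```

The ids returned by the search are what the next stage would use to look up abstracts and gene annotations for the LLM prompt.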

Installation

Clone and install as a package:

git clone https://github.com/<your-username>/rag-gene-discovery-assistant.git
cd rag-gene-discovery-assistant
pip install -e .

Example requirements.txt

langchain
faiss-cpu
sentence-transformers
biopython
pymed
requests
sqlite-utils
ollama
pandas
beautifulsoup4
streamlit
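
Several of the packages above are imported under a different name than the one used with pip (for example faiss-cpu is imported as faiss and biopython as Bio). A quick check like the following can confirm the environment is complete before running the pipeline. It is a convenience sketch, not part of ragbio, and the helper name missing_packages is an assumption.

```python
import importlib.util

# pip distribution name -> top-level import name
IMPORT_NAMES = {
    "langchain": "langchain",
    "faiss-cpu": "faiss",
    "sentence-transformers": "sentence_transformers",
    "biopython": "Bio",
    "pymed": "pymed",
    "requests": "requests",
    "sqlite-utils": "sqlite_utils",
    "ollama": "ollama",
    "pandas": "pandas",
    "beautifulsoup4": "bs4",
    "streamlit": "streamlit",
}

def missing_packages(import_names):
    """Return the distribution names whose module cannot be found."""
    return sorted(
        dist
        for dist, module in import_names.items()
        if importlib.util.find_spec(module) is None
    )

missing = missing_packages(IMPORT_NAMES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```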

Usage

1. Fetch PubMed Data

from ragbio.utils.rag_data_loader import main as fetch_pubmed_data

# Download abstracts and metadata
fetch_pubmed_data()
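
What the loader does internally is not documented here. As background, PubMed searches are commonly issued through the NCBI E-utilities esearch endpoint, and the sketch below only builds such a request URL without sending it. The helper name esearch_url and the example search term are assumptions, and this is not necessarily how ragbio fetches its data.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, retmax=20):
    """Build an esearch URL that would return up to retmax PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

url = esearch_url("Alzheimer disease AND oxidative stress", retmax=10)
print(url)
```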

2. Run RAG Query (with query_name)

from ragbio.pipeline.rag_pipeline import RAGAssistant

# Use query_name to organize output
assistant = RAGAssistant(output_dir="output/Alzheimer_CaseStudy", query_name="Alzheimer_CaseStudy")
summary, pmids, structured = assistant.run_pipeline(
    "genes linked to Alzheimer’s disease",
    top_k=10,
    structured=True
)

Outputs are stored under output/<query_name>/ for easier tracking.
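
The per-query layout can be pictured with the following sketch, which writes one JSON file under <base>/<query_name>/. The file name result.json, the record fields, and the helper save_result are assumptions for illustration; the files ragbio actually writes may differ.

```python
import json
import tempfile
from pathlib import Path

def save_result(base_dir, query_name, summary, pmids):
    """Write one result file under <base_dir>/<query_name>/ and return its path."""
    out_dir = Path(base_dir) / query_name
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "result.json"  # hypothetical file name
    record = {"query_name": query_name, "summary": summary, "pmids": pmids}
    out_file.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out_file

# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    path = save_result(base, "Alzheimer_CaseStudy", "placeholder summary", ["1001"])
    print(path.relative_to(base))
```

Keeping one directory per query_name means results from different case studies never overwrite each other.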


3. Visualize Gene–Disease–Drug Networks

You can visualize the structured outputs in Cytoscape via Streamlit:

streamlit run ragbio/pipeline/rag_cytoscape_streamlit.py -- --query_name Alzheimer_CaseStudy

This reads all JSON files from output/<query_name>/ and plots the gene–target–drug–disease network.
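
As a rough sketch of how such a network could be derived, the code below reads every JSON file in a directory and counts how often two entities appear in the same record. The schema is an assumption (each file is an object with an "entities" list), as is the function name cooccurrence_edges; the actual structured output of ragbio may be organized differently.

```python
import json
import tempfile
from collections import Counter
from itertools import combinations
from pathlib import Path

def cooccurrence_edges(output_dir):
    """Count entity pairs that appear together in the same JSON record."""
    edges = Counter()
    for path in sorted(Path(output_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # Assumed schema: {"entities": ["TP53", "cancer", ...]}
        entities = sorted(set(record.get("entities", [])))
        for a, b in combinations(entities, 2):
            edges[(a, b)] += 1
    return edges

# Demo with two made-up records written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text(json.dumps({"entities": ["TP53", "cancer"]}))
    Path(tmp, "b.json").write_text(json.dumps({"entities": ["TP53", "cancer", "drugX"]}))
    demo_edges = cooccurrence_edges(tmp)

for (a, b), weight in sorted(demo_edges.items()):
    print(a, b, weight)
```

Each counted pair corresponds to a weighted edge that a graph viewer could draw.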

Example Output Network

[Image: RAG Network Graph]

Figure 1: Gene–disease–drug co-occurrence network generated from top PubMed abstracts.


4. Optional: Explore in Notebook

Open notebooks/RAG_GeneDiscovery_Assistant.ipynb to see example queries, visualizations, and outputs.


Technologies Used

Category      Tool
Embeddings    BioBERT, BioSentVec (Hugging Face)
LLM Backend   DeepSeek / LLaMA3 (Ollama)
Retrieval     FAISS
Data Sources  PubMed, UniProt, NCBI Gene
Language      Python 3.10+
Frameworks    LangChain, Sentence Transformers, Streamlit

Future Enhancements

  • Compare DeepSeek/LLaMA3 with BioGPT outputs.
  • Integrate Neo4j for gene–disease–drug knowledge graph visualization.
  • Fine-tune LLMs on curated variant interpretation reports for improved clinical relevance.
  • Extend package API for direct integration in Django, FastAPI, and Streamlit apps.
  • Add support for multiple query_name outputs to track different case studies.

Download files

Source Distribution

ragbio-0.1.13.tar.gz (314.9 kB view details)

Uploaded Source

Built Distribution

ragbio-0.1.13-py3-none-any.whl (312.8 kB view details)

Uploaded Python 3

File details

Details for the file ragbio-0.1.13.tar.gz.

File metadata

  • Download URL: ragbio-0.1.13.tar.gz
  • Upload date:
  • Size: 314.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for ragbio-0.1.13.tar.gz
Algorithm    Hash digest
SHA256       355e441e7f293349e151e737461bf75050e707149f4fbc728eddff224f7f4714
MD5          620ae11370c5950901a64a0d4104c7d1
BLAKE2b-256  886a2ec5f1bd89c003be226c58beb02fac2d381fbf82e0431448569082feccf9

File details

Details for the file ragbio-0.1.13-py3-none-any.whl.

File metadata

  • Download URL: ragbio-0.1.13-py3-none-any.whl
  • Upload date:
  • Size: 312.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for ragbio-0.1.13-py3-none-any.whl
Algorithm    Hash digest
SHA256       3b3dfc35267b45f0cc57fba8877da322b91429f8a270bef2422731f82b62a95d
MD5          fd7b15e9c5c600efd296e62a04170cd8
BLAKE2b-256  dc242258809d9b74a8ca4f242fff2e0710b7bf4931298e52997448b7aa979848
