RAG pipeline for gene-disease literature summarization

RAG-Powered Gene Discovery Assistant (ragbio)

A generative AI tool for biomedical knowledge discovery built on Hugging Face embeddings and Ollama-served LLMs (DeepSeek / LLaMA3). It combines retrieval-augmented generation (RAG) with PubMed literature and gene annotation data to summarize gene–disease relationships. The project is distributed as a Python package, so it can be imported and reused across bioinformatics projects.


Overview

The RAG-powered assistant enables:

  • Semantic search over PubMed abstracts and gene annotations.
  • Summarization of complex biomedical information.
  • Citation tracking with PubMed IDs.
  • Modular and reusable pipeline for gene–disease exploration.
  • Case study–specific queries using query_name to organize outputs and visualizations.
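As a rough illustration of the citation-tracking idea, the sketch below tags each retrieved abstract with its PubMed ID before it goes to the LLM, then keeps only the IDs in the answer that were actually retrieved. The function names, the (pmid, text) input shape, and the "PMID:" tag format are assumptions for this example, not ragbio's internal API, and the IDs are placeholders.

```python
import re

def build_context(abstracts):
    """Format retrieved abstracts so each one carries its PubMed ID."""
    # `abstracts` is a list of (pmid, text) pairs; this shape is an assumption.
    return "\n\n".join(f"[PMID:{pmid}] {text}" for pmid, text in abstracts)

def cited_pmids(answer, known_pmids):
    """Return PMIDs mentioned in the answer that were actually retrieved."""
    seen = []
    for pmid in re.findall(r"PMID:\s*(\d+)", answer):
        # Drop IDs the model may have invented, and skip duplicates.
        if pmid in known_pmids and pmid not in seen:
            seen.append(pmid)
    return seen

# Placeholder IDs and text, for illustration only.
abstracts = [
    ("11111111", "Placeholder abstract text about gene A."),
    ("22222222", "Placeholder abstract text about gene B."),
]
context = build_context(abstracts)
answer = "Gene A is implicated (PMID: 11111111). See also PMID:99999999."
print(cited_pmids(answer, {pmid for pmid, _ in abstracts}))  # ['11111111']
```

Filtering against the retrieved set is one simple way to keep unsupported citations out of the final summary.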

Example queries:

  • "Which genes are linked to oxidative stress in Alzheimer’s disease?"
  • "Summarize recent findings about TP53 variants in cancer."

Architecture

User Query
│
▼
Vector Retrieval (BioBERT / BioSentVec Embeddings)
│
▼
Top Abstracts + Gene Annotations
│
▼
Ollama LLM (DeepSeek / LLaMA3)
│
▼
Summarized Biomedical Answer + Citations
│
▼
Optional: Gene–Disease–Drug Network (Cytoscape / Streamlit)
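
The vector retrieval step in the diagram can be sketched with a toy example. The code below swaps the real components for stand-ins: a bag-of-words counter instead of BioBERT / BioSentVec embeddings, and a sorted list instead of a FAISS index. All names here are hypothetical; only the ranking idea (embed the query, score each document, keep the top k) carries over.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, top_k=2):
    """Rank documents by similarity to the query and keep the best top_k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:top_k]

# Placeholder corpus; ids and texts are illustrative only.
docs = [
    {"id": "doc-1", "text": "oxidative stress genes in alzheimer disease"},
    {"id": "doc-2", "text": "tp53 variants in cancer"},
    {"id": "doc-3", "text": "soil microbiome diversity"},
]
hits = retrieve("oxidative stress alzheimer", docs, top_k=1)
print([h["id"] for h in hits])  # ['doc-1']
```

In the real pipeline the top-ranked abstracts and gene annotations would then be passed to the Ollama LLM as context.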

Installation

Clone and install as a package:

git clone https://github.com/<your-username>/rag-gene-discovery-assistant.git
cd rag-gene-discovery-assistant
pip install -e .

Example requirements.txt

langchain
faiss-cpu
sentence-transformers
biopython
pymed
requests
sqlite-utils
ollama
pandas
beautifulsoup4
streamlit

Usage

1. Fetch PubMed Data

from ragbio.utils.rag_data_loader import main as fetch_pubmed_data

# Download abstracts and metadata
fetch_pubmed_data()
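
The source does not describe how the loader queries PubMed. For orientation, the sketch below builds a search URL for NCBI's public E-utilities `esearch` endpoint using only the standard library. The helper name `pubmed_search_url` is hypothetical, and ragbio's loader may well use Biopython or pymed instead.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, retmax=20):
    """Build an esearch URL that returns up to `retmax` PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

url = pubmed_search_url("TP53 AND cancer", retmax=5)
print(url)
```

The URL can be fetched with `requests.get(url)`; the returned IDs are then typically passed to the `efetch` endpoint to download the abstracts.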

2. Run RAG Query (with query_name)

from ragbio.pipeline.rag_pipeline import RAGAssistant

# Use query_name to organize output
assistant = RAGAssistant(output_dir="output/Alzheimer_CaseStudy", query_name="Alzheimer_CaseStudy")
summary, pmids, structured = assistant.run_pipeline(
    "genes linked to Alzheimer’s disease",
    top_k=10,
    structured=True
)

Outputs are stored under output/<query_name>/ for easier tracking.
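
A minimal sketch of that directory layout, assuming one JSON file per run. The helper `save_result`, the file name `result.json`, and the payload keys are hypothetical; the package's actual file names and schema are not documented here.

```python
import json
import tempfile
from pathlib import Path

def save_result(base_dir, query_name, summary, pmids, structured):
    """Write one run's results to <base_dir>/<query_name>/result.json."""
    out_dir = Path(base_dir) / query_name
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "result.json"  # hypothetical file name
    payload = {
        "query_name": query_name,
        "summary": summary,
        "pmids": pmids,
        "structured": structured,
    }
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path

# Demonstrate with a temporary directory and placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_result(tmp, "Alzheimer_CaseStudy", "placeholder summary", ["00000000"], [])
    loaded = json.loads(saved.read_text(encoding="utf-8"))
    print(saved.relative_to(tmp).as_posix())  # Alzheimer_CaseStudy/result.json
```

Keeping each case study in its own directory means later runs with a different query_name do not overwrite earlier results.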


3. Visualize Gene–Disease–Drug Networks

You can visualize the structured outputs in Cytoscape via Streamlit:

streamlit run ragbio/pipeline/rag_cytoscape_streamlit.py -- --query_name Alzheimer_CaseStudy

This reads all JSON files from output/<query_name>/ and plots the gene–target–drug–disease network.
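
The sketch below shows one way such a network could be assembled: load every JSON file in the query directory, then count how often two entities appear in the same record. The record schema (`genes`, `drugs`, `diseases` lists) and both function names are assumptions for illustration; the Streamlit script's real input format may differ.

```python
import json
from collections import Counter
from itertools import combinations
from pathlib import Path

def load_records(query_dir):
    """Read every *.json file in the directory into a flat list of records."""
    records = []
    for path in sorted(Path(query_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # A file may hold a single record or a list of records.
        records.extend(data if isinstance(data, list) else [data])
    return records

def cooccurrence_edges(records, fields=("genes", "drugs", "diseases")):
    """Count how often each pair of entities appears in the same record."""
    edges = Counter()
    for rec in records:
        entities = sorted({name for f in fields for name in rec.get(f, [])})
        for a, b in combinations(entities, 2):
            edges[(a, b)] += 1
    return edges

# Placeholder records; entity names are illustrative only.
records = [
    {"genes": ["GENE_A"], "drugs": ["DRUG_X"], "diseases": ["DISEASE_1"]},
    {"genes": ["GENE_A", "GENE_B"], "diseases": ["DISEASE_1"]},
]
edges = cooccurrence_edges(records)
print(edges[("DISEASE_1", "GENE_A")])  # 2
```

The edge counts can serve as edge weights when the graph is drawn.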

Example Output Network

[Figure: RAG Network Graph]

Figure 1: Gene–disease–drug co-occurrence network generated from top PubMed abstracts.


4. Optional: Explore in Notebook

Open notebooks/RAG_GeneDiscovery_Assistant.ipynb to see example queries, visualizations, and outputs.


Technologies Used

Category      Tool
Embeddings    BioBERT, BioSentVec (Hugging Face)
LLM Backend   DeepSeek / LLaMA3 (Ollama)
Retrieval     FAISS
Data Sources  PubMed, UniProt, NCBI Gene
Language      Python 3.10+
Frameworks    LangChain, Sentence Transformers, Streamlit

Future Enhancements

  • Compare DeepSeek/LLaMA3 with BioGPT outputs.
  • Integrate Neo4j for gene–disease–drug knowledge graph visualization.
  • Fine-tune LLMs on curated variant interpretation reports for improved clinical relevance.
  • Extend package API for direct integration in Django, FastAPI, and Streamlit apps.
  • Add support for multiple query_name outputs to track different case studies.
