A retrieval-augmented biomedical literature framework for evidence discovery, citation mapping, and downstream omics analysis

RAG-Powered Biomedical Evidence Framework (ragbio)

A reusable retrieval-augmented generation (RAG) toolkit for biomedical knowledge discovery built on PubMed literature, vector search, and Ollama-based LLMs (DeepSeek / LLaMA3).

ragbio enables study-aware ingestion, embedding, and querying of biomedical literature to support gene–disease–therapy exploration, summarization, and network visualization.

ragbio is published as a pip-installable Python package and is designed for integration into research pipelines and bioinformatics workflows.


Overview

The RAG-powered assistant enables:

  • Semantic search over PubMed abstracts
  • Study-scoped literature ingestion for reproducibility
  • Summarization of complex biomedical evidence using LLMs
  • Citation-aware responses grounded in PubMed IDs
  • Modular ingestion → embedding → retrieval pipeline
  • Optional gene–disease–drug network visualization

Example questions

  • Which genes are linked to oxidative stress in Alzheimer’s disease?
  • What therapies target amyloid pathways according to recent literature?
  • Summarize evidence connecting TP53 variants to cancer therapies.

Architecture

User Question
│
▼
FAISS Vector Retrieval (PubMed Abstracts)
│
▼
Top-K Relevant Abstracts
│
▼
Ollama LLM (DeepSeek / LLaMA3)
│
▼
Grounded Biomedical Summary + PMIDs
│
▼
(Optional) Gene–Disease–Drug Network Visualization
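
The LLM stage of the diagram can be illustrated with a short sketch. This is not ragbio's actual implementation: `build_prompt` and `unknown_pmids` are hypothetical helpers, the prompt wording and the `PMID:<digits>` citation convention are assumptions, and the records are placeholders. It shows one way the top-k abstracts might be packed into a grounded prompt, and how the PMIDs cited in an answer could be checked against the retrieved set.

```python
import re


def build_prompt(question, abstracts):
    """Assemble a grounded prompt from (pmid, text) pairs.

    Hypothetical helper; the wording is an assumption, not ragbio's prompt.
    """
    context = "\n\n".join(f"[PMID:{pmid}] {text}" for pmid, text in abstracts)
    return (
        "Answer using only the abstracts below and cite each claim as PMID:<id>.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


def unknown_pmids(answer, abstracts):
    """Return PMIDs cited in the answer that were not among the retrieved abstracts."""
    retrieved = {str(pmid) for pmid, _ in abstracts}
    cited = set(re.findall(r"PMID:(\d+)", answer))
    return sorted(cited - retrieved)


# Placeholder records: these PMIDs and texts are made up for illustration.
top_k = [
    ("11111111", "Placeholder abstract about an amyloid-targeting therapy."),
    ("22222222", "Placeholder abstract about oxidative stress markers."),
]
prompt = build_prompt("Which therapies target amyloid pathways?", top_k)

# A mock answer that cites one retrieved PMID and one that was never retrieved.
flagged = unknown_pmids("Therapy A is reported (PMID:11111111, PMID:99999999).", top_k)
print(flagged)  # ['99999999']
```

A check like this only confirms that a cited PMID was in the retrieved context; it does not verify that the abstract actually supports the claim.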

Installation

Install from PyPI (recommended)

pip install ragbio

Development install (from source)

git clone https://github.com/<your-username>/rag-gene-discovery-assistant.git
cd rag-gene-discovery-assistant
pip install -e .

Usage

1. Ingest PubMed Literature (study-aware)

python -m ragbio.utils.rag_data_loader \
  --study Alzheimer_CaseStudy \
  --search "Alzheimer Disease AND therapy" \
  --retmax 500 \
  --retstart 0

This creates the following structure (default: data/PubMed/):

PubMed/
├── Abstracts/Alzheimer_CaseStudy/
├── Metadata/Alzheimer_CaseStudy/
├── PDFs/Alzheimer_CaseStudy/
└── Index/Alzheimer_CaseStudy/
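
The study-scoped layout above is easy to reproduce with `pathlib`. The sketch below is an assumption about how such paths could be resolved, not the loader's own code: `study_dirs` and `ensure_study_dirs` are hypothetical helper names, and only the folder names and the `data/PubMed` default come from this README.

```python
from pathlib import Path

# Top-level folders shown in the tree above.
SUBDIRS = ("Abstracts", "Metadata", "PDFs", "Index")


def study_dirs(study, root="data/PubMed"):
    """Map each top-level folder name to its study-scoped path."""
    base = Path(root)
    return {name: base / name / study for name in SUBDIRS}


def ensure_study_dirs(study, root="data/PubMed"):
    """Create the study-scoped folders if they are missing and return them."""
    dirs = study_dirs(study, root)
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


dirs = study_dirs("Alzheimer_CaseStudy")
print(dirs["Index"].as_posix())  # data/PubMed/Index/Alzheimer_CaseStudy
```

Keeping the study name as the innermost folder means a second study never overwrites the abstracts or index of the first.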

2. Generate Embeddings & Build FAISS Index

python -m ragbio.embeddings.embedding_engine \
  --study Alzheimer_CaseStudy

This step:

  • Reads from Abstracts/<study>/
  • Writes FAISS index to Index/<study>/
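
To make the embedding-to-index step concrete, here is a toy, pure-Python stand-in for an exact (brute-force) vector index. It is not FAISS and not ragbio's code: the `FlatIndex` class, the 3-dimensional vectors, and the document labels are all illustrative assumptions. It shows the general idea of normalizing vectors on insert and ranking by inner product (which equals cosine similarity for unit vectors) at query time.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class FlatIndex:
    """Toy exact-search index: stores unit vectors and scans all of them per query."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, doc_id, vec):
        self.ids.append(doc_id)
        self.vectors.append(normalize(vec))

    def search(self, query, top_k=10):
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), doc_id)
            for doc_id, v in zip(self.ids, self.vectors)
        ]
        # Highest similarity first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:top_k]]


index = FlatIndex()
index.add("doc-a", [1.0, 0.0, 0.0])
index.add("doc-b", [0.0, 1.0, 0.0])
index.add("doc-c", [0.7, 0.7, 0.0])

hits = index.search([1.0, 0.1, 0.0], top_k=2)
print([doc_id for doc_id, _ in hits])  # ['doc-a', 'doc-c']
```

In a real pipeline the vectors would come from an embedding model and the linear scan would be replaced by a FAISS index; the ranking principle is the same.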

3. Run RAG Queries

python -m ragbio.pipeline.rag_pipeline \
  --query "Which therapies target amyloid pathways in Alzheimer’s disease?" \
  --top_k 10 \
  --structured \
  --study Alzheimer_CaseStudy

Outputs are generated per study for clean provenance and reproducibility.


4. Visualize Gene–Disease–Drug Networks (optional)

Launch the Streamlit app:

streamlit run ragbio/pipeline/rag_cytoscape_streamlit.py -- --study Alzheimer_CaseStudy

This reads structured outputs and visualizes gene–disease–drug relationships as an interactive network.

Figure (RAG Network Graph): example gene–disease–drug co-occurrence network derived from PubMed abstracts.
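
A co-occurrence network of this kind can be derived by counting how often two entities appear in the same abstract. The sketch below is a minimal illustration under stated assumptions: the record shape (`pmid` plus a list of `(type, name)` entities) is hypothetical and is not ragbio's structured-output schema, `cooccurrence_edges` is an invented helper name, and the records are placeholders.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(records):
    """Count how many abstracts mention each unordered pair of entities.

    Each record is assumed to look like
    {"pmid": ..., "entities": [(type, name), ...]}; this shape is hypothetical.
    """
    edges = Counter()
    for record in records:
        # Deduplicate and sort so each pair has one canonical orientation.
        unique = sorted(set(record["entities"]))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Placeholder records for illustration only.
records = [
    {"pmid": "1", "entities": [("gene", "APP"), ("disease", "Alzheimer disease")]},
    {
        "pmid": "2",
        "entities": [
            ("gene", "APP"),
            ("disease", "Alzheimer disease"),
            ("drug", "DrugX"),
        ],
    },
]

edges = cooccurrence_edges(records)
for (a, b), weight in edges.most_common():
    print(a[1], "--", b[1], weight)
```

The resulting edge weights can be passed to any graph renderer; co-occurrence indicates that two entities were mentioned together, not that a causal or therapeutic relationship exists.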


5. Optional: Notebook Exploration

Explore example workflows in:

notebooks/RAG_GeneDiscovery_Assistant.ipynb

Technologies Used

Category         Tools
Embeddings       Ollama embedding models (configurable)
LLMs             DeepSeek, LLaMA3 (via Ollama)
Retrieval        FAISS
Data Sources     PubMed (NCBI Entrez)
Visualization    Streamlit, Cytoscape
Language         Python 3.10+

Design Principles

  • Study-first organization for reproducibility
  • Separation of concerns (ingestion ≠ embedding ≠ retrieval)
  • Grounded answers with PubMed citations
  • Composable modules usable outside the CLI
  • Safe defaults with override via CLI or environment variables
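
The last principle, a safe default that can be overridden from the CLI or the environment, might look like the sketch below. The `--data-dir` flag and the `RAGBIO_DATA_DIR` variable are hypothetical names chosen for illustration and are not confirmed ragbio options; only the `data/PubMed` default comes from this README.

```python
import argparse
import os

DEFAULT_DATA_DIR = "data/PubMed"


def resolve_data_dir(argv=None, environ=None):
    """Pick the data directory: CLI flag first, then environment, then the default.

    `--data-dir` and `RAGBIO_DATA_DIR` are hypothetical names.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", default=None)
    # Ignore unrelated flags so the helper can sit alongside other options.
    args, _unknown = parser.parse_known_args(argv)
    if args.data_dir:
        return args.data_dir
    return environ.get("RAGBIO_DATA_DIR", DEFAULT_DATA_DIR)


print(resolve_data_dir([], {}))                               # data/PubMed
print(resolve_data_dir([], {"RAGBIO_DATA_DIR": "/mnt/lit"}))  # /mnt/lit
print(resolve_data_dir(["--data-dir", "/cli"], {"RAGBIO_DATA_DIR": "/mnt/lit"}))  # /cli
```

Resolving the precedence in one function keeps the rule explicit and makes it easy to test without touching the real environment.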

Future Enhancements

  • Neo4j-backed gene–disease–drug knowledge graphs
  • Comparative evaluation of DeepSeek vs BioGPT outputs
  • Variant-level evidence integration
  • API support via FastAPI / Django
  • Automated citation grounding and confidence scoring
  • Multi-study dashboards and comparisons
