A study-aware biomedical RAG framework for PubMed retrieval, citation-grounded summaries, and downstream omics integration

RAG-Powered Gene Discovery Assistant (ragbio)

ragbio is a study-aware, retrieval-augmented generation (RAG) toolkit for biomedical knowledge discovery built on PubMed literature, FAISS vector search, and Ollama-based large language models (DeepSeek, LLaMA-family).

It is designed as a reusable Python package that can operate standalone or as a backend service inside larger platforms such as OmniBioAI, powering chat, literature summarization, and downstream bioinformatics workflows.


Key Capabilities

ragbio enables:

  • Semantic search over PubMed abstracts using FAISS
  • Study-scoped ingestion and indexing for reproducibility
  • Multi-study search (single study or global across all studies)
  • Instant PMID retrieval (FAISS-only, no LLM calls)
  • LLM-based biomedical summarization grounded in retrieved literature
  • Structured JSON outputs for literature summarizers and reporting
  • Optional drug–target–disease extraction and KG population
  • Progress reporting hooks for real-time UI dashboards
  • Cache-aware retrieval for low-latency interactive use

Example Questions

  • Which genes are associated with oxidative stress in Alzheimer’s disease?
  • What therapies target amyloid pathways according to recent literature?
  • Summarize evidence linking TP53 variants to cancer therapies.
  • Return PMIDs related to BRCA1 drug resistance (no summarization).

High-Level Architecture

User Query
│
├─► FAISS Retrieval (study-specific or multi-study)
│
├─► Top-K PubMed Abstracts
│
├─► (Optional) LLM Summarization (RAG)
│
├─► (Optional) Structured Extraction (JSON)
│
└─► Outputs:
     • PMIDs (instant)
     • Grounded summaries
     • Structured JSON (literature summarizer / KG)

Important design choice (v1.1): FAISS retrieval is executed exactly once per request, and all downstream steps reuse the same retrieved documents. There is no duplicate search.
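The retrieve-once design can be sketched as below. This is only an illustration of the control flow, not ragbio's internal code: `handle_request`, `Doc`, and the toy `fake_retrieve` callable are hypothetical names introduced here, and the PMIDs are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Doc:
    pmid: str
    abstract: str


def handle_request(
    query: str,
    retrieve: Callable[[str, int], list[Doc]],
    summarize: Optional[Callable[[str, list[Doc]], str]] = None,
    extract: Optional[Callable[[list[Doc]], dict]] = None,
    top_k: int = 10,
) -> dict:
    """Run retrieval once, then hand the same documents to every optional step."""
    docs = retrieve(query, top_k)  # the only retrieval call in the request
    result: dict = {"pmids": [d.pmid for d in docs]}
    if summarize is not None:
        result["summary"] = summarize(query, docs)
    if extract is not None:
        result["structured"] = extract(docs)
    return result


# Toy stand-ins so the sketch runs without FAISS or an LLM.
calls = {"retrieve": 0}


def fake_retrieve(query: str, top_k: int) -> list[Doc]:
    calls["retrieve"] += 1
    docs = [Doc("PMID_A", "abstract text A"), Doc("PMID_B", "abstract text B")]
    return docs[:top_k]


out = handle_request(
    "amyloid",
    fake_retrieve,
    summarize=lambda q, docs: f"{len(docs)} abstracts considered for '{q}'",
    extract=lambda docs: {"n_docs": len(docs)},
)
```

Because the summarizer and the extractor both receive `docs`, the PMIDs reported to the user are the same documents the summary was grounded in.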


Installation

Install from PyPI (recommended)

pip install ragbio

Development install (from source)

git clone https://github.com/man4ish/omnibioai-rag.git
cd omnibioai-rag
pip install -e .

Data Organization (Study-Aware)

By default, all PubMed data is organized under:

data/PubMed/
├── Abstracts/<study>/
├── Metadata/<study>/
├── PDFs/<study>/
└── Index/<study>/

This enables:

  • Clean separation of case studies
  • Reproducible indexing
  • Safe multi-study search
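
As a rough illustration of the layout above, the per-study directories can be resolved with `pathlib`. The helpers `study_paths` and `list_indexed_studies` are hypothetical names for this sketch, not part of the ragbio API. Treating every subdirectory of `Index/` as an indexed study is also an assumption.

```python
from pathlib import Path

# Subdirectory names taken from the layout shown above.
SUBDIRS = ("Abstracts", "Metadata", "PDFs", "Index")


def study_paths(study: str, root: Path = Path("data/PubMed")) -> dict[str, Path]:
    """Map each data category to its directory for one study."""
    return {name: root / name / study for name in SUBDIRS}


def list_indexed_studies(root: Path = Path("data/PubMed")) -> list[str]:
    """Return the names of studies that have a directory under Index/."""
    index_root = root / "Index"
    if not index_root.is_dir():
        return []
    return sorted(p.name for p in index_root.iterdir() if p.is_dir())


paths = study_paths("Alzheimer_CaseStudy")
```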

Usage Guide

1️⃣ Ingest PubMed Literature (Study-Aware)

python -m ragbio.utils.rag_data_loader \
  --study Alzheimer_CaseStudy \
  --search "Alzheimer Disease AND therapy" \
  --retmax 500 \
  --retstart 0

This step:

  • Fetches PubMed abstracts and metadata
  • Stores results under Abstracts/<study>/
  • Optionally downloads open-access PDFs
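
The `--search`, `--retmax`, and `--retstart` flags mirror the paging parameters of NCBI's E-utilities ESearch endpoint. The sketch below shows how such a request URL could be built. It assumes the loader talks to ESearch in this way, which the text above does not confirm. `esearch_url` and `page_starts` are hypothetical helpers, and no network call is made.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, retmax: int = 500, retstart: int = 0) -> str:
    """Build an ESearch URL for one page of PubMed results."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,      # page size
        "retstart": retstart,  # offset of the first record on this page
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def page_starts(total: int, retmax: int) -> list[int]:
    """Offsets needed to walk through `total` records, `retmax` at a time."""
    return list(range(0, total, retmax))


url = esearch_url("Alzheimer Disease AND therapy", retmax=500, retstart=0)
```

To fetch more than one page, the same command would be repeated with `--retstart` set to each offset returned by `page_starts`.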

2️⃣ Generate Embeddings & Build FAISS Index

python -m ragbio.embeddings.embedding_engine \
  --study Alzheimer_CaseStudy

This step:

  • Reads abstracts from Abstracts/<study>/
  • Generates embeddings via Ollama
  • Writes FAISS index to Index/<study>/
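
To show what the index is used for at query time, here is a pure-Python stand-in for exact inner-product search, the operation FAISS's `IndexFlatIP` performs. It is not FAISS and not ragbio's code. Whether ragbio uses a flat index or normalized vectors is an assumption, and the three-dimensional "embeddings" are toy values.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(
    index: list[tuple[str, list[float]]],
    query_vec: list[float],
    top_k: int,
) -> list[tuple[str, float]]:
    """Brute-force top-k search by inner product over unit vectors."""
    q = normalize(query_vec)
    scored = [(pmid, sum(a * b for a, b in zip(q, vec))) for pmid, vec in index]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings keyed by placeholder PMIDs.
toy_embeddings = {
    "PMID_A": [1.0, 0.0, 0.0],
    "PMID_B": [0.0, 1.0, 0.0],
    "PMID_C": [1.0, 1.0, 0.0],
}
index = [(pmid, normalize(vec)) for pmid, vec in toy_embeddings.items()]
hits = search(index, [1.0, 0.1, 0.0], top_k=2)
```

A real index replaces the linear scan with FAISS, but the contract is the same: a query vector in, the `top_k` nearest PMIDs and their scores out.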

3️⃣ Run RAG Queries (CLI)

python -m ragbio.pipeline.rag_pipeline \
  --query "Which therapies target amyloid pathways in Alzheimer’s disease?" \
  --top_k 10 \
  --structured \
  --study Alzheimer_CaseStudy

Outputs include:

  • Grounded summary
  • Supporting PMIDs
  • Optional structured JSON
  • Optional Neo4j KG updates

4️⃣ Instant PMID Retrieval (No LLM)

For low-latency applications (chat, TES, pipelines):

python -m ragbio.pipeline.rag_pipeline \
  --query "TP53 apoptosis cancer therapy" \
  --pmids-only \
  --top_k 20

✔ FAISS-only ✔ Cache-aware ✔ Suitable for real-time UI


Python API (Recommended for Integration)

Public API (v1.1)

from ragbio.pipeline import (
    RAGAssistant,
    get_pmids,
    run_rag_json,
)

Instant PMID Retrieval

pmids = get_pmids(
    query="BRCA1 drug resistance",
    top_k=20,
    study=None,   # search across ALL studies
)

Structured RAG Output (Literature Summarizer)

result = run_rag_json(
    query="TP53 variants and chemotherapy response",
    top_k=10,
    study="Cancer_Study",
)

Returns structured JSON suitable for:

  • Literature summarization
  • ReportingService
  • Downstream AI agents

Advanced Usage (Long-Lived Assistant)

assistant = RAGAssistant(study="Alzheimer_CaseStudy")

pmids = assistant.get_pmids("amyloid beta clearance")

data = assistant.run_rag_json(
    "amyloid beta clearance therapies",
    structured=True,
)

Multi-Study Search (v1.1)

  • study="<name>" (e.g. "default") → search a single study
  • study=None or "*" → search all indexed studies
  • Results are merged, ranked, and deduplicated

This allows:

  • Cross-project reuse of indexed literature
  • Global chat-style queries
  • Meta-analysis across studies
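
One plausible way to merge, rank, and deduplicate per-study hits is sketched below. `merge_study_hits` is a hypothetical helper, not the library's implementation. Keeping the best score per PMID is an assumed deduplication rule, and scores are only comparable if all studies were indexed with the same embedding model and metric.

```python
def merge_study_hits(
    hits_by_study: dict[str, list[tuple[str, float]]],
    top_k: int,
) -> list[tuple[str, float, str]]:
    """Merge per-study (pmid, score) hits into one ranked, deduplicated list."""
    best: dict[str, tuple[float, str]] = {}
    for study, hits in hits_by_study.items():
        for pmid, score in hits:
            # Keep only the highest-scoring occurrence of each PMID.
            if pmid not in best or score > best[pmid][0]:
                best[pmid] = (score, study)
    ranked = sorted(
        ((pmid, score, study) for pmid, (score, study) in best.items()),
        key=lambda hit: hit[1],
        reverse=True,
    )
    return ranked[:top_k]


merged = merge_study_hits(
    {
        "Alzheimer_CaseStudy": [("PMID_A", 0.91), ("PMID_B", 0.80)],
        "Cancer_Study": [("PMID_B", 0.85), ("PMID_C", 0.60)],
    },
    top_k=3,
)
```

Here `PMID_B` appears in both studies and is kept once, with the higher of its two scores.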

Caching & Performance

  • PMID retrieval results are cached per:
    • query
    • study
    • index version
    • embedding model
  • The cache is invalidated automatically when the index changes
  • Designed for sub-second responses in chat workflows
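
A cache keyed on those four fields could look like the sketch below. `cache_key` and `PMIDCache` are hypothetical names, and lowercasing and whitespace-collapsing the query is an assumption made here, not documented behavior. Putting the index version in the key gives automatic invalidation: a rebuilt index produces new keys, so stale entries are never hit.

```python
import hashlib
import json
from typing import Callable, Optional


def cache_key(
    query: str,
    study: Optional[str],
    index_version: str,
    embedding_model: str,
) -> str:
    """Derive a stable key from everything that can change the retrieval result."""
    payload = json.dumps(
        {
            "query": " ".join(query.lower().split()),  # assumed normalization
            "study": study or "*",                     # None means all studies
            "index_version": index_version,
            "embedding_model": embedding_model,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PMIDCache:
    """In-memory cache; compute() runs only on a miss."""

    def __init__(self) -> None:
        self._store: dict[str, list[str]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], list[str]]) -> list[str]:
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]


cache = PMIDCache()
key_v1 = cache_key("BRCA1 drug resistance", None, "v1", "model-a")
key_v2 = cache_key("BRCA1 drug resistance", None, "v2", "model-a")
```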


Progress Reporting (UI-Ready)

All major steps emit progress events that can be wired to:

  • OmniBioAI progress bars
  • WebSocket updates
  • TES run monitors

Example stages:

  • retrieval_start
  • retrieval_complete
  • llm_summarization
  • structured_extraction_complete
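
The text above does not show how a progress hook is registered, so the following is only a sketch of the callback pattern, using the stage names listed. `run_with_progress` and the `on_progress` parameter are hypothetical, and the pipeline body is replaced by placeholders.

```python
from typing import Callable, Optional

# A callback receives the stage name and a small payload dict.
ProgressCallback = Callable[[str, dict], None]


def run_with_progress(
    query: str,
    on_progress: Optional[ProgressCallback] = None,
) -> list[str]:
    """Emit one event per stage; the real work is stubbed out."""

    def emit(stage: str, **payload) -> None:
        if on_progress is not None:
            on_progress(stage, payload)

    emit("retrieval_start", query=query)
    pmids = ["PMID_A", "PMID_B"]  # placeholder for the retrieval result
    emit("retrieval_complete", n_hits=len(pmids))
    emit("llm_summarization")
    emit("structured_extraction_complete")
    return pmids


# A UI layer would forward events to a progress bar or WebSocket;
# here they are simply collected in a list.
events: list[tuple[str, dict]] = []
run_with_progress(
    "amyloid beta clearance",
    on_progress=lambda stage, payload: events.append((stage, payload)),
)
```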

Technologies Used

Category           Tools
Language           Python 3.10+
Retrieval          FAISS
Embeddings         Ollama embedding models
LLMs               DeepSeek, LLaMA-family (via Ollama)
Data Source        PubMed (NCBI Entrez)
Graph (optional)   Neo4j
UI (optional)      Streamlit, Cytoscape

Design Principles

  • Study-first organization
  • Explicit retrieval control
  • No hidden FAISS calls
  • Composable APIs
  • Safe defaults, override when needed
  • Platform-friendly (OmniBioAI, TES, agents)

Roadmap

v1.1 (current)

  • Multi-study search
  • Instant PMID retrieval
  • Structured JSON output
  • Cache-aware retrieval
  • Public Python API

v1.2+

  • Streaming RAG responses
  • Retrieval metrics & dashboards
  • Neo4j-first knowledge graphs
  • FastAPI / Django service mode
  • Citation confidence scoring
  • Multi-study comparative dashboards

License

MIT License
