A study-aware biomedical RAG framework for PubMed retrieval, citation-grounded summaries, and downstream omics integration

Project description

RAG-Powered Gene Discovery Assistant (`ragbio`)

ragbio is a study-aware, retrieval-augmented generation (RAG) toolkit for biomedical knowledge discovery built on PubMed literature, FAISS vector search, and Ollama-based large language models (DeepSeek, LLaMA-family).

It is designed as a reusable Python package that can operate standalone or as a backend service inside larger platforms such as OmniBioAI, powering chat, literature summarization, and downstream bioinformatics workflows.

Key Capabilities

ragbio enables:

Semantic search over PubMed abstracts using FAISS
Study-scoped ingestion and indexing for reproducibility
Multi-study search (single study or global across all studies)
Instant PMID retrieval (FAISS-only, no LLM calls)
LLM-based biomedical summarization grounded in retrieved literature
Structured JSON outputs for literature summarizers and reporting
Optional drug–target–disease extraction and KG population
Progress reporting hooks for real-time UI dashboards
Cache-aware retrieval for low-latency interactive use

Example Questions

Which genes are associated with oxidative stress in Alzheimer’s disease?
What therapies target amyloid pathways according to recent literature?
Summarize evidence linking TP53 variants to cancer therapies.
Return PMIDs related to BRCA1 drug resistance (no summarization).

High-Level Architecture

User Query
│
├─► FAISS Retrieval (study-specific or multi-study)
│
├─► Top-K PubMed Abstracts
│
├─► (Optional) LLM Summarization (RAG)
│
├─► (Optional) Structured Extraction (JSON)
│
└─► Outputs:
     • PMIDs (instant)
     • Grounded summaries
     • Structured JSON (literature summarizer / KG)

Important design choice (v1.1): FAISS retrieval is executed exactly once per request, and all downstream steps reuse the same retrieved documents. There is no duplicate search.
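
The retrieve-once rule can be pictured with a small, self-contained sketch. This is not ragbio's internal code: `Request`, `run_request`, and the stand-in retriever are hypothetical names used only to show one retrieval call feeding every optional downstream step.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    """One query plus the documents retrieved for it (hypothetical type)."""
    query: str
    top_k: int = 10
    docs: list = field(default_factory=list)


def run_request(
    req: Request,
    retrieve: Callable[[str, int], list],
    summarize: Optional[Callable[[list], str]] = None,
    extract: Optional[Callable[[list], dict]] = None,
) -> dict:
    # The only retrieval call made for this request.
    req.docs = retrieve(req.query, req.top_k)
    out = {"pmids": [d["pmid"] for d in req.docs]}
    # Optional steps receive the already-retrieved documents.
    if summarize is not None:
        out["summary"] = summarize(req.docs)
    if extract is not None:
        out["structured"] = extract(req.docs)
    return out


# Stand-in retriever that counts how often it is called.
calls = {"n": 0}


def fake_retrieve(query: str, top_k: int) -> list:
    calls["n"] += 1
    return [{"pmid": str(1000 + i), "text": f"abstract {i}"} for i in range(top_k)]


result = run_request(
    Request("amyloid beta clearance", top_k=3),
    retrieve=fake_retrieve,
    summarize=lambda docs: f"{len(docs)} abstracts summarized",
    extract=lambda docs: {"n_docs": len(docs)},
)
```

Passing the retrieved documents along explicitly, rather than letting each step search again, is what keeps the summary, the structured output, and the PMID list consistent with one another.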

Installation

Install from PyPI (recommended)

pip install ragbio

Development install (from source)

git clone https://github.com/man4ish/omnibioai-rag.git
cd omnibioai-rag
pip install -e .

Data Organization (Study-Aware)

By default, all PubMed data is organized under:

data/PubMed/
├── Abstracts/<study>/
├── Metadata/<study>/
├── PDFs/<study>/
└── Index/<study>/

This enables:

Clean separation of case studies
Reproducible indexing
Safe multi-study search
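
As a rough illustration of the layout above, the per-study directories can be derived from a study name with `pathlib`. The helper `study_paths` is a hypothetical name for this sketch and is not part of the ragbio API; only the directory names come from the tree shown above.

```python
from pathlib import Path

# Sub-directories taken from the layout shown above.
SUBDIRS = ("Abstracts", "Metadata", "PDFs", "Index")


def study_paths(study: str, root: Path = Path("data/PubMed")) -> dict:
    """Map each data category to its study-specific directory."""
    return {name: root / name / study for name in SUBDIRS}


paths = study_paths("Alzheimer_CaseStudy")
for name, path in paths.items():
    print(f"{name:10s} {path}")
```

Because every category is keyed by the study name, two studies never write into the same directory.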

Usage Guide

1️⃣ Ingest PubMed Literature (Study-Aware)

python -m ragbio.utils.rag_data_loader \
  --study Alzheimer_CaseStudy \
  --search "Alzheimer Disease AND therapy" \
  --retmax 500 \
  --retstart 0

This step:

Fetches PubMed abstracts and metadata
Stores results under Abstracts/<study>/
Optionally downloads open-access PDFs
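
The `--retmax` and `--retstart` flags share their names with the paging parameters of the NCBI E-utilities `esearch` endpoint. Assuming they are passed through to that endpoint (the loader's internals are not documented here), paging through a large result set could look like the sketch below. `esearch_urls` is a hypothetical helper; it only builds URLs and makes no network calls.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_urls(term: str, total: int, batch: int = 100, start: int = 0) -> list:
    """Build one esearch URL per page of `batch` records (hypothetical helper)."""
    urls = []
    for retstart in range(start, start + total, batch):
        # The last page may be shorter than a full batch.
        retmax = min(batch, start + total - retstart)
        params = {
            "db": "pubmed",
            "term": term,
            "retstart": retstart,
            "retmax": retmax,
            "retmode": "json",
        }
        urls.append(f"{ESEARCH}?{urlencode(params)}")
    return urls


urls = esearch_urls("Alzheimer Disease AND therapy", total=500, batch=200)
```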

2️⃣ Generate Embeddings & Build FAISS Index

python -m ragbio.embeddings.embedding_engine \
  --study Alzheimer_CaseStudy

This step:

Reads abstracts from Abstracts/<study>/
Generates embeddings via Ollama
Writes FAISS index to Index/<study>/
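
To show what the index is used for, here is a pure-Python sketch of exhaustive cosine-similarity search, which is the computation an exact (flat) inner-product index performs over normalized vectors. It is an illustration only: the vectors are toy values rather than Ollama embeddings, `top_k_hits` is a hypothetical name, and the FAISS index type ragbio actually builds is not stated here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_hits(query_vec, doc_vecs, k):
    """Return the k (pmid, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for pmid, vec in doc_vecs.items():
        d = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), pmid))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(pmid, score) for score, pmid in scored[:k]]


# Toy three-dimensional "embeddings" keyed by made-up PMIDs.
toy_index = {
    "111": [1.0, 0.0, 0.0],
    "222": [0.9, 0.1, 0.0],
    "333": [0.0, 1.0, 0.0],
}
hits = top_k_hits([1.0, 0.0, 0.0], toy_index, k=2)
```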

3️⃣ Run RAG Queries (CLI)

python -m ragbio.pipeline.rag_pipeline \
  --query "Which therapies target amyloid pathways in Alzheimer’s disease?" \
  --top_k 10 \
  --structured \
  --study Alzheimer_CaseStudy

Outputs include:

Grounded summary
Supporting PMIDs
Optional structured JSON
Optional Neo4j KG updates

4️⃣ Instant PMID Retrieval (No LLM)

For low-latency applications (chat, TES, pipelines):

python -m ragbio.pipeline.rag_pipeline \
  --query "TP53 apoptosis cancer therapy" \
  --pmids-only \
  --top_k 20

✔ FAISS-only
✔ Cache-aware
✔ Suitable for real-time UI

Python API (Recommended for Integration)

Public API (v1.1)

from ragbio.pipeline import (
    RAGAssistant,
    get_pmids,
    run_rag_json,
)

Instant PMID Retrieval

pmids = get_pmids(
    query="BRCA1 drug resistance",
    top_k=20,
    study=None,   # search across ALL studies
)

Structured RAG Output (Literature Summarizer)

result = run_rag_json(
    query="TP53 variants and chemotherapy response",
    top_k=10,
    study="Cancer_Study",
)

Returns structured JSON suitable for:

Literature summarization
ReportingService
Downstream AI agents

Advanced Usage (Long-Lived Assistant)

assistant = RAGAssistant(study="Alzheimer_CaseStudy")

pmids = assistant.get_pmids("amyloid beta clearance")

data = assistant.run_rag_json(
    "amyloid beta clearance therapies",
    structured=True,
)

Multi-Study Search (v1.1)

study="<name>" (for example "default") → search that one study
study=None or "*" → search all indexed studies
Results are merged, ranked, and deduplicated

This allows:

Cross-project reuse of indexed literature
Global chat-style queries
Meta-analysis across studies
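
The merge, rank, and deduplicate step can be sketched as follows. This is one plausible approach, not ragbio's documented algorithm: `merge_hits` is a hypothetical name, higher scores are assumed to be better, and scores from different studies are assumed to be comparable (same embedding model and similarity metric).

```python
def merge_hits(per_study: dict, top_k: int) -> list:
    """Merge per-study (pmid, score) hits, keeping the best score per PMID."""
    best = {}
    for study, hits in per_study.items():
        for pmid, score in hits:
            current = best.get(pmid)
            # Deduplicate: a PMID indexed in several studies is kept once.
            if current is None or score > current["score"]:
                best[pmid] = {"pmid": pmid, "score": score, "study": study}
    # Rank globally across all studies.
    ranked = sorted(best.values(), key=lambda hit: hit["score"], reverse=True)
    return ranked[:top_k]


merged = merge_hits(
    {
        "Alzheimer_CaseStudy": [("111", 0.91), ("222", 0.80)],
        "Cancer_Study": [("222", 0.85), ("333", 0.60)],
    },
    top_k=3,
)
```

Keeping the originating study on each hit makes it possible to show where a result came from in a global, chat-style query.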

Caching & Performance

PMID retrieval results are cached per:
- query
- study
- index version
- embedding model

The cache is invalidated automatically when the index changes.
Retrieval is designed for sub-second responses in chat workflows.
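
A cache keyed on those four fields might look like the sketch below. It is an assumption-laden illustration, not ragbio's cache: `cache_key` and `cached_pmids` are hypothetical names, whitespace normalization of the query is an added assumption, and the index version is passed in as an opaque string.

```python
import hashlib
import json
from typing import Callable, Optional


def cache_key(query: str, study: Optional[str], index_version: str,
              embedding_model: str) -> str:
    """Hash the four fields that identify a cached retrieval result."""
    payload = json.dumps(
        {
            "query": " ".join(query.split()),  # assumption: collapse whitespace
            "study": study or "*",             # None means "all studies"
            "index_version": index_version,
            "embedding_model": embedding_model,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


_cache = {}


def cached_pmids(query: str, study: Optional[str], index_version: str,
                 embedding_model: str, search: Callable[[str], list]) -> list:
    """Run `search` only when no result is cached for this exact key."""
    key = cache_key(query, study, index_version, embedding_model)
    if key not in _cache:
        _cache[key] = search(query)
    return _cache[key]


search_calls = []


def fake_search(query: str) -> list:
    search_calls.append(query)
    return ["111", "222"]


first = cached_pmids("TP53 apoptosis", "Cancer_Study", "v1", "model-a", fake_search)
second = cached_pmids("TP53  apoptosis", "Cancer_Study", "v1", "model-a", fake_search)
rebuilt = cached_pmids("TP53 apoptosis", "Cancer_Study", "v2", "model-a", fake_search)
```

Because the index version is part of the key, rebuilding the index produces new keys and the stale entries are simply never hit again, which is one straightforward way to get automatic invalidation.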

Progress Reporting (UI-Ready)

All major steps emit progress events that can be wired to:

OmniBioAI progress bars
WebSocket updates
TES run monitors

Example stages:

retrieval_start
retrieval_complete
llm_summarization
structured_extraction_complete
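
How ragbio exposes these events is not documented here, so the following is only a sketch of the general callback pattern. The `on_progress` parameter, its `(stage, info)` signature, and `run_with_progress` are hypothetical; only the stage names come from the list above.

```python
from typing import Callable, Optional

# Hypothetical callback type: receives a stage name and a dict of details.
ProgressCallback = Callable[[str, dict], None]


def run_with_progress(query: str,
                      on_progress: Optional[ProgressCallback] = None) -> list:
    """Toy pipeline that reports each stage through an optional callback."""

    def emit(stage: str, **info) -> None:
        if on_progress is not None:
            on_progress(stage, info)

    emit("retrieval_start", query=query)
    pmids = ["111", "222"]  # placeholder for real retrieval results
    emit("retrieval_complete", n_hits=len(pmids))
    emit("llm_summarization")
    emit("structured_extraction_complete")
    return pmids


# A UI layer would forward events to a progress bar or WebSocket;
# here they are just collected in a list.
events = []
run_with_progress("TP53 apoptosis", on_progress=lambda stage, info: events.append((stage, info)))
```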

Technologies Used

Category	Tools
Language	Python 3.10+
Retrieval	FAISS
Embeddings	Ollama embedding models
LLMs	DeepSeek, LLaMA-family (via Ollama)
Data Source	PubMed (NCBI Entrez)
Graph (optional)	Neo4j
UI (optional)	Streamlit, Cytoscape

Design Principles

Study-first organization
Explicit retrieval control
No hidden FAISS calls
Composable APIs
Safe defaults, override when needed
Platform-friendly (OmniBioAI, TES, agents)

Roadmap

v1.1 (current)

Multi-study search
Instant PMID retrieval
Structured JSON output
Cache-aware retrieval
Public Python API

v1.2+

Streaming RAG responses
Retrieval metrics & dashboards
Neo4j-first knowledge graphs
FastAPI / Django service mode
Citation confidence scoring
Multi-study comparative dashboards

License

MIT License

Project details

Release history

Version	Release date
2.0.3 (this version)	Apr 18, 2026
2.0.2	Apr 11, 2026
2.0.1	Jan 23, 2026
0.2.0	Jan 19, 2026
0.1.19	Apr 11, 2026
0.1.18	Jan 23, 2026
0.1.16	Dec 24, 2025
0.1.15	Dec 22, 2025
0.1.14	Dec 22, 2025
0.1.13	Dec 11, 2025
0.1.12	Dec 11, 2025

