RAG MCP Server — watches folders, chunks documents, serves semantic search via MCP
document-rag-mcp
A high-performance Model Context Protocol (MCP) server for local document search and extraction. It recursively scans and watches configured directories for .txt, .md, and .pdf files, indexes their content, and exposes that content to LLMs through MCP tools.
📖 Full Documentation: https://janlo.github.io/document-rag-mcp/
Key Features
- Hybrid Search (Semantic + BM25): Blends dense semantic vector search (ChromaDB) with sparse keyword search (SQLite FTS5), merging the two result rankings with Reciprocal Rank Fusion (RRF).
- Section-Grain Chunking: Text from all pages is unified and chunked as a single stream, then mapped back to its primary page and section via character offsets, preventing artificial boundaries at page borders.
- TOC-Aware Extractor: Extracts PDF headings using the document's own Table of Contents (TOC), falling back to typography-aware layout detection if TOC is missing.
- Incremental Indexing: Uses content hashing (SHA-256) at both the file and chunk levels. Files that have not changed are skipped completely, and modified files only re-embed chunks that actually changed.
- Auto-Pruning: Automatically detects when files are deleted from the disk and prunes them from the index.
- Multimodal OCR: Detects scanned or text-less PDF pages and routes them through an optional vision-capable LLM.
- MCP Native: Exposes tools for hybrid search, collection statistics, metadata analysis, and full document text/binary content retrieval.
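To illustrate the Reciprocal Rank Fusion step named under Hybrid Search, here is a minimal sketch of the general technique. It is not taken from this project's code: the function name, the chunk IDs, and the constant k=60 (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs; each list is ordered best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank), so high ranks weigh more.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the order stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic_hits = ["c3", "c1", "c7"]  # hypothetical vector-search ranking
keyword_hits = ["c1", "c9", "c3"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Chunks that appear in both rankings (here c1 and c3) accumulate two contributions and therefore move ahead of chunks found by only one retriever.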
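The offset-based page mapping described under Section-Grain Chunking can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it uses fixed-size character chunks and treats the "primary page" as the page on which a chunk starts; both choices, and all function names, are hypothetical.

```python
from bisect import bisect_right


def build_stream(pages):
    """Join page texts into one string and record each page's start offset."""
    starts, parts, offset = [], [], 0
    for text in pages:
        starts.append(offset)
        parts.append(text)
        offset += len(text)
    return "".join(parts), starts


def chunk_with_pages(pages, size):
    """Yield (page_number, chunk) pairs; page numbers are 1-based."""
    stream, starts = build_stream(pages)
    for begin in range(0, len(stream), size):
        # The number of page starts <= begin is the page the chunk begins on.
        page = bisect_right(starts, begin)
        yield page, stream[begin:begin + size]


pages = ["alpha beta ", "gamma delta ", "epsilon"]
chunks = list(chunk_with_pages(pages, size=8))
```

Because chunking runs over the joined stream, a chunk may straddle a page border (the second chunk here does) while still being attributed to a single page.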
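The two-level hashing behind Incremental Indexing might look roughly like the sketch below. Only the use of SHA-256 at file and chunk level comes from the description above; the function names and the way previously seen hashes are passed in are assumptions, standing in for whatever the index actually stores.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def plan_reindex(file_bytes, chunks, known_file_hash, known_chunk_hashes):
    """Return (file_hash, chunks_to_embed) for one file.

    known_file_hash and known_chunk_hashes represent state saved by a
    previous run (hypothetical storage layout).
    """
    file_hash = sha256_hex(file_bytes)
    if file_hash == known_file_hash:
        return file_hash, []  # unchanged file: skip it completely
    # Changed file: only chunks whose hash is new need to be embedded again.
    to_embed = [
        chunk for chunk in chunks
        if sha256_hex(chunk.encode("utf-8")) not in known_chunk_hashes
    ]
    return file_hash, to_embed


old_chunks = ["intro text", "method text"]
new_chunks = ["intro text", "method text, revised"]
old_bytes = "".join(old_chunks).encode("utf-8")
new_bytes = "".join(new_chunks).encode("utf-8")
old_file_hash = sha256_hex(old_bytes)
known = {sha256_hex(c.encode("utf-8")) for c in old_chunks}

_, unchanged = plan_reindex(old_bytes, old_chunks, old_file_hash, known)
_, changed = plan_reindex(new_bytes, new_chunks, old_file_hash, known)
```

In this example the untouched file yields no work, and the edited file re-embeds only the one chunk whose content changed.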
Quick Start
1. Installation
Ensure you have uv installed, then synchronize the environment:
git clone https://github.com/janlo/document-rag-mcp.git
cd document-rag-mcp
uv sync --group dev
2. Configuration
Copy the example configuration:
cp config.example.yaml config.yaml
Then edit config.yaml to specify the folders you want to watch.
3. CLI Commands
- Ingest:
uv run document-rag-mcp ingest
- Search:
uv run document-rag-mcp search "your query"
- Start MCP Server:
uv run document-rag-mcp serve
File details
Details for the file document_rag_mcp-0.2.1.tar.gz.
File metadata
- Download URL: document_rag_mcp-0.2.1.tar.gz
- Upload date:
- Size: 277.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 20da60f1ac71aceaa2f456ec9a2be112c2d5e305cd04498ec054a09504f19ad2 |
| MD5 | afae5cb8c6d419305844d70714233491 |
| BLAKE2b-256 | e65dfe440dee3bfaf216e5f0c0b6bca70b8835df8359a734d318984149bef1ef |
Provenance
The following attestation bundles were made for document_rag_mcp-0.2.1.tar.gz:
Publisher: release.yml on janLo/document-rag-mcp
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: document_rag_mcp-0.2.1.tar.gz
- Subject digest: 20da60f1ac71aceaa2f456ec9a2be112c2d5e305cd04498ec054a09504f19ad2
- Sigstore transparency entry: 1841208543
- Sigstore integration time:
- Permalink: janLo/document-rag-mcp@d1898a6c9c0cb0e8b0d29efa965af480ad154f27
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/janLo
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d1898a6c9c0cb0e8b0d29efa965af480ad154f27
- Trigger Event: push
File details
Details for the file document_rag_mcp-0.2.1-py3-none-any.whl.
File metadata
- Download URL: document_rag_mcp-0.2.1-py3-none-any.whl
- Upload date:
- Size: 28.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 588a8c6348e3da69b41edc013c555f506f05eae1bc67ef2804cb025fccc4532a |
| MD5 | c808fd31775a9cd2969d757c7b4cb50c |
| BLAKE2b-256 | 80000857473cc0171ea4d04f3e1e1f4f1e4f925ac9bd9055d544d11ca3c944bb |
Provenance
The following attestation bundles were made for document_rag_mcp-0.2.1-py3-none-any.whl:
Publisher: release.yml on janLo/document-rag-mcp
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: document_rag_mcp-0.2.1-py3-none-any.whl
- Subject digest: 588a8c6348e3da69b41edc013c555f506f05eae1bc67ef2804cb025fccc4532a
- Sigstore transparency entry: 1841208618
- Sigstore integration time:
- Permalink: janLo/document-rag-mcp@d1898a6c9c0cb0e8b0d29efa965af480ad154f27
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/janLo
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d1898a6c9c0cb0e8b0d29efa965af480ad154f27
- Trigger Event: push