Local stdio MCP server that turns folders of dense documents into a vector embedding DB Claude can search.
vectorise-mcp
Local stdio MCP server that turns folders of dense documents (PDFs, Word, text, markdown) into a hybrid retrieval database that Claude Desktop can semantically search mid-conversation. Built so Claude can effectively work with corpora far larger than its context window — point it at 100M+ tokens of reports and ask questions; it pulls only the relevant chunks.
Fully offline after first model download. No API keys. Free.
Why this is built for quality, not just for ticking a box
Cheap RAG implementations ("just embed with MiniLM and dot-product search") fail badly on dense documents. They miss rare terms, conflate similar sentences, and rank irrelevant chunks at the top. This server uses a stack designed for real retrieval quality:
| Stage | What it does | Why |
|---|---|---|
| BGE-small-en-v1.5 embeddings | Dense semantic vectors (384-dim, normalized) | Top-tier MTEB scores at small size; far better than MiniLM on technical English. |
| SQLite FTS5 BM25 | Keyword retrieval in parallel | Catches rare terms, names, IDs, acronyms that pure semantic search misses. |
| Reciprocal Rank Fusion | Merges vector + keyword candidates | Robust hybrid signal — neither side dominates. |
| bge-reranker-base cross-encoder | Re-scores top-50 jointly with the query | Massive precision boost; cross-encoders consistently rank 5-10 points higher than bi-encoders alone. |
| Sentence-aware chunking | 384-tok chunks, 96-tok overlap, sentence-bounded | Preserves coherence; overlap stops boundary loss. |
| SHA1 incremental reindex | Only re-embeds changed files | Cheap to keep up to date as the folder evolves. |
This is the same retrieval pattern used in production search systems (e.g., Anthropic's contextual retrieval, Vespa hybrid recipes).
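To make the sentence-aware chunking row concrete, here is a minimal sketch of packing whole sentences into overlapping chunks. It is not the package's implementation: the function names are hypothetical, the regex sentence splitter is deliberately naive, and whitespace-separated words stand in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer).

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences, max_tokens=384, overlap_tokens=96):
    """Pack whole sentences into chunks of roughly max_tokens, carrying up to
    overlap_tokens of trailing sentences into the next chunk."""

    def count(sentence: str) -> int:
        # Whitespace words stand in for tokens in this sketch.
        return len(sentence.split())

    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = count(sent)
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Walk backwards and keep whole sentences that fit the overlap budget.
            carried, carried_len = [], 0
            for prev in reversed(current):
                if carried_len + count(prev) > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += count(prev)
            current, current_len = carried, carried_len
        # A single very long sentence can still push a chunk past max_tokens.
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks start and end on sentence boundaries, the overlap is "up to" 96 tokens rather than exactly 96; the sketch favors coherence over a fixed stride.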
Stack
| Component | Library |
|---|---|
| MCP server | mcp SDK (FastMCP) |
| Embeddings | BAAI/bge-small-en-v1.5 (384-dim, ~130MB, CPU) |
| Reranker | BAAI/bge-reranker-base (~110MB, CPU) |
| Vector DB | sqlite-vec (single-file SQLite extension) |
| Keyword DB | SQLite FTS5 (BM25) |
| PDF | pypdf |
| DOCX | python-docx |
Install
# core (text-based docs only)
pip install vectorise-mcp
# with OCR for scanned PDFs + images (.png .jpg .tiff .bmp .webp)
pip install "vectorise-mcp[ocr]"
vectorise-mcp setup # downloads ~250MB models (+30MB OCR if installed)
Python ≥ 3.10. pip cannot reliably run post-install hooks (PEP 517), so models download on setup (or first serve boot if you skip setup). After that, fully offline.
OCR: uses rapidocr-onnxruntime (pure Python, ONNX, no system Tesseract install) and pypdfium2 (no Poppler). When installed, scanned PDF pages auto-fall-back to OCR; image files become first-class indexable docs.
Wire into Claude Desktop
Edit claude_desktop_config.json:
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Linux: `~/.config/Claude/claude_desktop_config.json`
{
  "mcpServers": {
    "vectorise": {
      "command": "vectorise-mcp",
      "args": ["serve"]
    }
  }
}
Ready-to-paste configs are also in the repo:
| File | Use |
|---|---|
| `claude_desktop_config.example.json` | Minimal config: just this server. |
| `examples/claude_desktop_config.windows.json` | Windows-pathed variant with comment. |
| `examples/claude_desktop_config.macos.json` | macOS-pathed variant with comment. |
| `examples/claude_desktop_config.linux.json` | Linux-pathed variant with comment. |
| `examples/claude_desktop_config.advanced.json` | Pinned Python interpreter + env tuning for GPU. |
If you already have other MCP servers in your config, merge the "vectorise" key into your existing mcpServers object — don't overwrite the whole file.
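If you would rather script the merge than edit the file by hand, something like the following sketch could work. The helper name `merge_vectorise_entry` is hypothetical and not part of the package; it assumes the config file, if present, contains valid JSON, and it writes a `.bak` copy next to the original before changing anything.

```python
import json
import shutil
from pathlib import Path


def merge_vectorise_entry(config_path: Path) -> dict:
    """Add or replace only the "vectorise" server entry, keeping all others."""
    if config_path.exists():
        # Keep a backup beside the original before touching it.
        shutil.copy2(config_path, config_path.with_name(config_path.name + ".bak"))
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    servers = config.setdefault("mcpServers", {})
    servers["vectorise"] = {"command": "vectorise-mcp", "args": ["serve"]}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it with the path for your platform from the list above, for example `merge_vectorise_entry(Path.home() / ".config/Claude/claude_desktop_config.json")` on Linux.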
Restart Claude Desktop. Server eagerly loads both models on boot — first tool call is instant, no surprise pause.
Use it
In a Claude Desktop conversation:
"Index `C:\Users\me\Documents\Q1Reports` and tell me what the revenue projections were."
Claude will:
- Call `index_folder`, which streams progress notifications such as "Indexing 47 files, ~2 min remaining…".
- Call `search`: hybrid retrieval + cross-encoder rerank → top 5 chunks.
- Synthesize an answer, citing source file + page.
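The hybrid step inside `search` merges a vector ranking and a keyword ranking with Reciprocal Rank Fusion. As a rough illustration of that technique (not the server's actual code), each chunk scores the sum of `1 / (k + rank)` over every list it appears in. The function name and the constant `k=60` are assumptions; 60 is the value commonly used in the RRF literature, and this README does not say which value the server uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Only the rank matters, so differently scaled scores
            # (cosine similarity vs. BM25) never need to be normalized.
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["c1", "c2", "c3"]   # hypothetical ids from the vector index
keyword_hits = ["c3", "c1", "c4"]  # hypothetical ids from FTS5 BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Chunks found by both retrievers (`c1`, `c3`) rise above chunks found by only one, which is the property that keeps either side from dominating. The fused list is what a cross-encoder would then re-score.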
MCP tools exposed
| Tool | What it does |
|---|---|
| `index_folder(folder_path, collection?)` | Walk folder, embed + index every supported doc. SHA1-safe to re-run. |
| `reindex_folder(collection)` | Re-scan source folder. Re-embed only changed files; drop deleted. |
| `list_collections()` | All indexed collections with size, doc count, indexed-at timestamp. |
| `search(collection, query, k=5, candidate_pool=75, file_glob?, subdirectory?, page_min?, page_max?, min_similarity?)` | Hybrid + reranked. Filters: filename glob, path substring, PDF page range, similarity floor. Claude is encouraged to raise `k` (10-20) and `candidate_pool` (150-300) for hard queries. |
| `delete_collection(collection)` | Drops the `.db` file. Returns freed MB. |
All tools have docstrings — Claude reads them automatically.
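The "re-embed only changed files; drop deleted" behavior of `reindex_folder` can be pictured as a hash comparison. The sketch below is an illustration under assumptions, not the package's code: `plan_reindex` and the `known` mapping (relative path to stored SHA1) are hypothetical names, and the real server may store hashes differently.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large PDFs are never read into memory at once."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_reindex(folder: Path, known: dict[str, str]):
    """Compare current file hashes against stored ones.

    Returns (changed_or_new, deleted, current_hashes)."""
    current = {
        p.relative_to(folder).as_posix(): sha1_of_file(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }
    changed = [name for name, digest in current.items() if known.get(name) != digest]
    deleted = [name for name in known if name not in current]
    return changed, deleted, current
```

Only the files in `changed` would need re-chunking and re-embedding; entries in `deleted` would have their chunks removed from the collection.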
Performance
| Metric | Value |
|---|---|
| Indexing throughput (CPU) | ~80 chunks/sec (bge-small) |
| 100-page PDF index time | ~3 sec |
| 50 × 100-page PDFs | ~3 min |
| 100M-token corpus | ~40 min one-time |
| Search latency (K=5, ≤500K chunks) | ~150ms (vector + FTS + rerank) |
| Disk per chunk | ~2 KB |
GPU auto-detected by sentence-transformers if PyTorch sees CUDA/MPS — 5-10× faster.
Supported document types
- `.pdf`: page-aware extraction via `pypdf`. Pages with empty/sparse text auto-fall-back to OCR (if `[ocr]` extra installed).
- `.docx`: paragraphs + tables via `python-docx`
- `.txt`, `.md`, `.markdown`: UTF-8 text
- `.png`, `.jpg`, `.jpeg`, `.tiff`, `.tif`, `.bmp`, `.webp`: OCR via RapidOCR (requires `[ocr]` extra)
Unsupported files are skipped silently. Images without OCR installed are skipped with a warning.
Storage
All collections live under ~/.vectorise-mcp/. One .db file per collection, fully self-contained (vector + keyword + chunks + metadata). Portable — copy the file to share, back up the folder for safekeeping.
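For a quick look at what is on disk outside of Claude, a few lines of Python can list the collection files. This sketch assumes the `.db` files sit directly under `~/.vectorise-mcp/` and that the file stem matches the collection name; both are guesses from the description above, and `list_collection_files` is a hypothetical helper, not a package API.

```python
from pathlib import Path


def list_collection_files(root: Path = Path.home() / ".vectorise-mcp"):
    """Return (name, size in MB) for every .db file directly under root."""
    if not root.is_dir():
        return []
    return [
        (path.stem, round(path.stat().st_size / 1_000_000, 2))
        for path in sorted(root.glob("*.db"))
    ]
```

Inside a conversation, the `list_collections()` tool remains the authoritative view, since it also reports doc counts and timestamps.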
Configuration via env vars
| Var | Default | Purpose |
|---|---|---|
| `VECTORISE_MCP_EMBED_MODEL` | `BAAI/bge-small-en-v1.5` | Override embedding model. Must produce 384-dim vectors. |
| `VECTORISE_MCP_RERANKER_MODEL` | `BAAI/bge-reranker-base` | Override cross-encoder reranker. |
| `VECTORISE_MCP_EMBED_BATCH` | `32` | Embedding batch size (lower if OOM). |
| `VECTORISE_MCP_RERANKER_BATCH` | `16` | Rerank batch size. |
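As a rough picture of how these variables resolve, the sketch below reads each one and falls back to the documented default. The variable names and defaults come from the table; the `load_settings` function and the dictionary keys are hypothetical and do not describe the package's internals.

```python
import os


def load_settings(env=None) -> dict:
    """Resolve settings from environment variables, using documented defaults."""
    env = os.environ if env is None else env
    return {
        "embed_model": env.get("VECTORISE_MCP_EMBED_MODEL", "BAAI/bge-small-en-v1.5"),
        "reranker_model": env.get("VECTORISE_MCP_RERANKER_MODEL", "BAAI/bge-reranker-base"),
        # Batch sizes arrive as strings from the environment, so convert them.
        "embed_batch": int(env.get("VECTORISE_MCP_EMBED_BATCH", "32")),
        "reranker_batch": int(env.get("VECTORISE_MCP_RERANKER_BATCH", "16")),
    }
```

For example, `load_settings({"VECTORISE_MCP_EMBED_BATCH": "8"})` would keep every default except the embedding batch size, which is the kind of change to try first when indexing runs out of memory.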
Troubleshooting
sqlite-vec extension fails to load
The PyPI package ships prebuilt binaries; ensure your Python's sqlite has enable_load_extension (true on standard CPython).
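A quick way to check the interpreter before digging further is to probe the standard `sqlite3` module directly. This diagnostic sketch uses only the standard library; the two function names are made up for illustration. It tests whether extension loading is available at all and whether the bundled SQLite was compiled with FTS5, and it does not load `sqlite-vec` itself.

```python
import sqlite3


def sqlite_supports_extensions() -> bool:
    """True if this Python's sqlite3 module allows loading extensions."""
    conn = sqlite3.connect(":memory:")
    try:
        # The method is absent when Python was built without extension support.
        conn.enable_load_extension(True)
        conn.enable_load_extension(False)
        return True
    except (AttributeError, sqlite3.Error):
        return False
    finally:
        conn.close()


def sqlite_has_fts5() -> bool:
    """True if the bundled SQLite can create an FTS5 virtual table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE probe USING fts5(body)")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()
```

If `sqlite_supports_extensions()` returns `False`, the usual cause is a Python build compiled without loadable-extension support, in which case switching to a standard CPython build is the likely fix.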
Indexing very slow
- First file is slow — model loading (~10s).
- Install GPU PyTorch for a big speedup. Check `nvidia-smi` / Activity Monitor.
Claude doesn't see the server
- Quit Claude Desktop fully; restart.
- Validate JSON (no trailing commas).
- Run `vectorise-mcp serve` in a terminal. It should hang awaiting stdio; errors there point to an install issue.
Search returns weak results
- Try a larger `candidate_pool` (e.g. 100). Recall trades against latency.
- Check the chunk text in `search` results. If chunks are tiny, your source may be image-only PDFs: pypdf can't OCR, so install the `[ocr]` extra and reindex, or run a separate OCR step first.
License
MIT.