Local stdio MCP server that turns folders of dense documents into a vector embedding DB Claude can search.

vectorise-mcp

Local stdio MCP server that turns folders of dense documents (PDFs, Word, text, markdown) into a hybrid retrieval database that Claude Desktop can semantically search mid-conversation. Built so Claude can effectively work with corpora far larger than its context window — point it at 100M+ tokens of reports and ask questions; it pulls only the relevant chunks.

Fully offline after first model download. No API keys. Free.

Why this is built for quality, not just for ticking a box

Cheap RAG implementations ("just embed with MiniLM and dot-product search") fail badly on dense documents. They miss rare terms, conflate similar sentences, and rank irrelevant chunks at the top. This server uses a stack designed for real retrieval quality:

| Stage | What it does | Why |
| --- | --- | --- |
| BGE-small-en-v1.5 embeddings | Dense semantic vectors (384-dim, normalized) | Top-tier MTEB scores at small size; far better than MiniLM on technical English. |
| SQLite FTS5 BM25 | Keyword retrieval in parallel | Catches rare terms, names, IDs, acronyms that pure semantic search misses. |
| Reciprocal Rank Fusion | Merges vector + keyword candidates | Robust hybrid signal; neither side dominates. |
| bge-reranker-base cross-encoder | Re-scores top-50 jointly with the query | Large precision boost; cross-encoder reranking typically scores 5-10 points higher on retrieval benchmarks than bi-encoder retrieval alone. |
| Sentence-aware chunking | 384-tok chunks, 96-tok overlap, sentence-bounded | Preserves coherence; overlap stops boundary loss. |
| SHA1 incremental reindex | Only re-embeds changed files | Cheap to keep up to date as the folder evolves. |

This is the same retrieval pattern used in production search systems (e.g., Anthropic's contextual retrieval, Vespa hybrid recipes).
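
The fusion stage can be illustrated with a minimal sketch of Reciprocal Rank Fusion. The function name, the chunk IDs, and the constant k=60 (the value commonly used with RRF) are assumptions for illustration; the server's actual implementation and constant may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical candidate lists from the vector index and the FTS5 index.
vector_hits = ["c7", "c2", "c9"]
keyword_hits = ["c2", "c4", "c7"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

Chunks found by both retrievers (c2, c7) rise above chunks found by only one, and because only ranks are used, the raw BM25 and cosine scores never need to be put on a common scale.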

Stack

| Component | Library |
| --- | --- |
| MCP server | mcp SDK (FastMCP) |
| Embeddings | BAAI/bge-small-en-v1.5 (384-dim, ~130MB, CPU) |
| Reranker | BAAI/bge-reranker-base (~110MB, CPU) |
| Vector DB | sqlite-vec (single-file SQLite extension) |
| Keyword DB | SQLite FTS5 (BM25) |
| PDF | pypdf |
| DOCX | python-docx |

Install

# core (text-based docs only)
pip install vectorise-mcp

# with OCR for scanned PDFs + images (.png .jpg .tiff .bmp .webp)
pip install "vectorise-mcp[ocr]"

vectorise-mcp setup       # downloads ~250MB models (+30MB OCR if installed)

Requires Python ≥ 3.10. pip cannot reliably run post-install hooks (PEP 517), so the models are downloaded by vectorise-mcp setup, or on the first vectorise-mcp serve boot if you skip setup. After that, the server runs fully offline.

OCR: uses rapidocr-onnxruntime (pure Python, ONNX, no system Tesseract install) and pypdfium2 (no Poppler). When installed, scanned PDF pages auto-fall-back to OCR; image files become first-class indexable docs.

Wire into Claude Desktop

Edit claude_desktop_config.json:

  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "vectorise": {
      "command": "vectorise-mcp",
      "args": ["serve"]
    }
  }
}

Ready-to-paste configs are also in the repo:

| File | Use |
| --- | --- |
| claude_desktop_config.example.json | Minimal config: just this server. |
| examples/claude_desktop_config.windows.json | Windows-pathed variant with comment. |
| examples/claude_desktop_config.macos.json | macOS-pathed variant with comment. |
| examples/claude_desktop_config.linux.json | Linux-pathed variant with comment. |
| examples/claude_desktop_config.advanced.json | Pinned Python interpreter + env tuning for GPU. |

If you already have other MCP servers in your config, merge the "vectorise" key into your existing mcpServers object — don't overwrite the whole file.
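
If you would rather script the merge than edit by hand, something along these lines should work. It is a sketch only: merge_vectorise_entry is a hypothetical helper, not part of vectorise-mcp, and you pass it the config path for your platform from the list above.

```python
import json
from pathlib import Path


def merge_vectorise_entry(config_path: Path) -> dict:
    """Add the "vectorise" server to a Claude Desktop config, keeping other entries.

    Hypothetical helper for illustration; not shipped with vectorise-mcp.
    """
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            # json.loads is strict, so a trailing comma raises here
            # before anything is written back.
            config = json.loads(text)
    servers = config.setdefault("mcpServers", {})
    servers["vectorise"] = {"command": "vectorise-mcp", "args": ["serve"]}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

On macOS, for example, the call would be merge_vectorise_entry(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"). Back up the file first; the helper rewrites it in place.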

Restart Claude Desktop. The server loads both models eagerly at boot, so the first tool call does not pause for model loading.

Use it

In a Claude Desktop conversation:

"Index C:\Users\me\Documents\Q1Reports and tell me what the revenue projections were."

Claude will:

  1. Call index_folder — streams progress notifications: "Indexing 47 files, ~2 min remaining…"
  2. Call search — hybrid retrieval + cross-encoder rerank → top 5 chunks.
  3. Synthesize answer, citing source file + page.

MCP tools exposed

| Tool | What it does |
| --- | --- |
| index_folder(folder_path, collection?) | Walk folder, embed + index every supported doc. SHA1-safe to re-run. |
| reindex_folder(collection) | Re-scan source folder. Re-embed only changed files; drop deleted. |
| list_collections() | All indexed collections with size, doc count, indexed-at timestamp. |
| search(collection, query, k=5, candidate_pool=75, file_glob?, subdirectory?, page_min?, page_max?, min_similarity?) | Hybrid + reranked. Filters: filename glob, path substring, PDF page range, similarity floor. Claude is encouraged to raise k (10-20) and candidate_pool (150-300) for hard queries. |
| delete_collection(collection) | Drops the .db file. Returns freed MB. |

All tools have docstrings — Claude reads them automatically.
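
The "re-embed only changed files; drop deleted" behaviour of reindex_folder can be pictured as a hash comparison. The sketch below is an assumption about how such a check might look, not the server's code: plan_reindex and the stored mapping (relative path to the SHA1 recorded at the last index) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large PDFs are not read into memory at once."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_reindex(folder: Path, stored: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (changed_or_new, deleted) relative paths.

    `stored` maps relative path -> SHA1 from the previous run (hypothetical layout).
    """
    current = {
        p.relative_to(folder).as_posix(): sha1_of_file(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }
    # A file needs re-embedding if it is new or its hash differs.
    changed = [rel for rel, digest in current.items() if stored.get(rel) != digest]
    # A file previously indexed but no longer on disk should be dropped.
    deleted = [rel for rel in stored if rel not in current]
    return changed, deleted
```

Hashing content rather than comparing modification times means a file that was touched or copied but not edited is not re-embedded.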

Performance

| Metric | Value |
| --- | --- |
| Indexing throughput (CPU) | ~80 chunks/sec (bge-small) |
| 100-page PDF index time | ~3 sec |
| 50 × 100-page PDFs | ~3 min |
| 100M-token corpus | ~40 min one-time |
| Search latency (k=5, ≤500K chunks) | ~150ms (vector + FTS + rerank) |
| Disk per chunk | ~2 KB |

GPU auto-detected by sentence-transformers if PyTorch sees CUDA/MPS — 5-10× faster.

Supported document types

  • .pdf — page-aware extraction via pypdf. Pages with empty/sparse text auto-fall-back to OCR (if [ocr] extra installed).
  • .docx — paragraphs + tables via python-docx
  • .txt, .md, .markdown — UTF-8 text
  • .png, .jpg, .jpeg, .tiff, .tif, .bmp, .webp — OCR via RapidOCR (requires [ocr] extra)

Unsupported files are skipped silently. Images without OCR installed are skipped with a warning.

Storage

All collections live under ~/.vectorise-mcp/. One .db file per collection, fully self-contained (vector + keyword + chunks + metadata). Portable — copy the file to share, back up the folder for safekeeping.

Configuration via env vars

| Var | Default | Purpose |
| --- | --- | --- |
| VECTORISE_MCP_EMBED_MODEL | BAAI/bge-small-en-v1.5 | Override embedding model. Must produce 384-dim vectors. |
| VECTORISE_MCP_RERANKER_MODEL | BAAI/bge-reranker-base | Override cross-encoder reranker. |
| VECTORISE_MCP_EMBED_BATCH | 32 | Embedding batch size (lower if OOM). |
| VECTORISE_MCP_RERANKER_BATCH | 16 | Rerank batch size. |
Troubleshooting

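sqlite-vec extension fails to load

  • The PyPI package ships prebuilt binaries, so the usual cause is the interpreter: your Python's sqlite3 module must support enable_load_extension. Most CPython builds do, but some (for example certain macOS system or custom-built interpreters) are compiled without it.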
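
A quick way to check whether your interpreter supports loadable SQLite extensions, using only the standard library. can_load_extensions is a hypothetical helper name; the sqlite3 calls are the standard ones.

```python
import sqlite3


def can_load_extensions() -> bool:
    """Return True if this Python's sqlite3 module allows loading extensions."""
    conn = sqlite3.connect(":memory:")
    try:
        # The method is missing entirely when Python was built without
        # loadable-extension support.
        if not hasattr(conn, "enable_load_extension"):
            return False
        conn.enable_load_extension(True)
        conn.enable_load_extension(False)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


print("extension loading supported:", can_load_extensions())
```

If this prints False, installing a different Python build (and reinstalling vectorise-mcp into it) is likely the simplest fix.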

Indexing very slow

  • First file is slow — model loading (~10s).
  • Install GPU PyTorch for big speedup. Check nvidia-smi / Activity Monitor.

Claude doesn't see the server

  • Quit Claude Desktop fully; restart.
  • Validate JSON (no trailing commas).
  • Run vectorise-mcp serve in a terminal — should hang awaiting stdio. Errors there = install issue.

Search returns weak results

  • Try larger candidate_pool (e.g. 100). Recall trades against latency.
  • Check the chunk text in search results. If chunks are tiny or empty, your source may be image-only PDFs. pypdf cannot OCR on its own; install the [ocr] extra so that pages with little or no extractable text fall back to OCR.

License

MIT.
