Local stdio MCP server that turns folders of dense documents into a vector embedding DB Claude can search.

vectorise-mcp

Local stdio MCP server that turns folders of dense documents (PDFs, Word, text, markdown) into a hybrid retrieval database that Claude Desktop can semantically search mid-conversation. Built so Claude can effectively work with corpora far larger than its context window — point it at 100M+ tokens of reports and ask questions; it pulls only the relevant chunks.

Fully offline after first model download. No API keys. Free.

Why this is built for quality, not just for ticking a box

Cheap RAG implementations ("just embed with MiniLM and dot-product search") fail badly on dense documents. They miss rare terms, conflate similar sentences, and rank irrelevant chunks at the top. This server uses a stack designed for real retrieval quality:

Stage	What it does	Why
BGE-small-en-v1.5 embeddings	Dense semantic vectors (384-dim, normalized)	Top-tier MTEB scores at small size; far better than MiniLM on technical English.
SQLite FTS5 BM25	Keyword retrieval in parallel	Catches rare terms, names, IDs, acronyms that pure semantic search misses.
Reciprocal Rank Fusion	Merges vector + keyword candidates	Robust hybrid signal — neither side dominates.
bge-reranker-base cross-encoder	Re-scores top-50 jointly with the query	Massive precision boost; cross-encoders consistently rank 5-10 points higher than bi-encoders alone.
Sentence-aware chunking	384-tok chunks, 96-tok overlap, sentence-bounded	Preserves coherence; overlap stops boundary loss.
SHA1 incremental reindex	Only re-embeds changed files	Cheap to keep up to date as the folder evolves.

This is the same hybrid-plus-rerank pattern described in Anthropic's contextual retrieval write-up and in Vespa's hybrid search recipes.
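
To make the sentence-aware chunking stage concrete, here is a minimal sketch. It is not the server's actual code: tokens are approximated by whitespace-separated words, the sentence splitter is a simple regex, and the function name `chunk_sentences` is hypothetical. The defaults mirror the 384-token chunks and 96-token overlap from the table.

```python
import re


def chunk_sentences(text, max_tokens=384, overlap_tokens=96):
    """Pack whole sentences into chunks of roughly max_tokens tokens.

    Tokens are approximated by whitespace-separated words (an assumption;
    a real implementation would count model tokens). Sentences are never
    split, so an unusually long sentence can push a chunk over budget.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            carried, carried_len = [], 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_len + m > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += m
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Tiny budgets so the overlap is visible in a short example.
demo = chunk_sentences("A b c. D e f. G h i. J k l.", max_tokens=6, overlap_tokens=3)
```

With these small budgets each chunk holds two sentences and repeats the last sentence of the previous chunk, which is the boundary-preserving behaviour the overlap is meant to provide.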

Stack

Component	Library
MCP server	`mcp` SDK (FastMCP)
Embeddings	`BAAI/bge-small-en-v1.5` (384-dim, ~130MB, CPU)
Reranker	`BAAI/bge-reranker-base` (~110MB, CPU)
Vector DB	`sqlite-vec` (single-file SQLite extension)
Keyword DB	SQLite FTS5 (BM25)
PDF	`pypdf`
DOCX	`python-docx`

Install

# core (text-based docs only)
pip install vectorise-mcp

# with OCR for scanned PDFs + images (.png .jpg .tiff .bmp .webp)
pip install "vectorise-mcp[ocr]"

vectorise-mcp setup       # downloads ~250MB models (+30MB OCR if installed)

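Requires Python ≥ 3.10. pip has no reliable post-install hook (wheels built under PEP 517 are simply unpacked), so the models are downloaded when you run vectorise-mcp setup, or on the first vectorise-mcp serve start if you skip setup. After that, the server runs fully offline.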

OCR: uses rapidocr-onnxruntime (pure Python, ONNX, no system Tesseract install) and pypdfium2 (no Poppler). When installed, scanned PDF pages auto-fall-back to OCR; image files become first-class indexable docs.

Wire into Claude Desktop

Edit claude_desktop_config.json:

Windows: %APPDATA%\Claude\claude_desktop_config.json
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "vectorise": {
      "command": "vectorise-mcp",
      "args": ["serve"]
    }
  }
}

Ready-to-paste configs are also in the repo:

File	Use
`claude_desktop_config.example.json`	Minimal config — just this server.
`examples/claude_desktop_config.windows.json`	Windows-pathed variant with comment.
`examples/claude_desktop_config.macos.json`	macOS-pathed variant with comment.
`examples/claude_desktop_config.linux.json`	Linux-pathed variant with comment.
`examples/claude_desktop_config.advanced.json`	Pinned Python interpreter + env tuning for GPU.

If you already have other MCP servers in your config, merge the "vectorise" key into your existing mcpServers object — don't overwrite the whole file.
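
If you prefer to script the merge rather than edit by hand, something like the following would work. It is a sketch only: the helper name `add_vectorise_server` and the `other` server entry are hypothetical, and you would still need to read and write the config file at the path for your platform.

```python
import json


def add_vectorise_server(config_text):
    """Return config JSON with the "vectorise" entry merged in.

    Existing entries under "mcpServers" are preserved; an empty
    config is treated as an empty object.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers["vectorise"] = {"command": "vectorise-mcp", "args": ["serve"]}
    return json.dumps(config, indent=2)


# Hypothetical existing config with one other server already registered.
existing = '{"mcpServers": {"other": {"command": "other-mcp"}}}'
merged = json.loads(add_vectorise_server(existing))
```

Back up the original file before overwriting it with the merged output.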

Restart Claude Desktop. The server loads both models eagerly at startup, so the first tool call does not pause to load them.

Use it

In a Claude Desktop conversation:

"Index C:\Users\me\Documents\Q1Reports and tell me what the revenue projections were."

Claude will:

Call index_folder — streams progress notifications: "Indexing 47 files, ~2 min remaining…"
Call search — hybrid retrieval + cross-encoder rerank → top 5 chunks.
Synthesize answer, citing source file + page.
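
The hybrid retrieval inside the search step merges the vector and keyword candidate lists with Reciprocal Rank Fusion before reranking. A minimal sketch of that fusion follows; the function name, the chunk IDs, and the constant k=60 (a commonly used default) are assumptions, not details taken from the server.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c3", "c1", "c7"]   # hypothetical dense-search order
keyword_hits = ["c1", "c9", "c3"]  # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale, and chunks found by both lists rise above chunks found by one.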

MCP tools exposed

Tool	What it does
`index_folder(folder_path, collection?)`	Walk the folder, then embed and index every supported document. Safe to re-run, since files are tracked by SHA1 hash.
`reindex_folder(collection)`	Re-scan source folder. Re-embed only changed files; drop deleted.
`list_collections()`	All indexed collections with size, doc count, indexed-at timestamp.
`search(collection, query, k=5, candidate_pool=75, file_glob?, subdirectory?, page_min?, page_max?, min_similarity?)`	Hybrid + reranked. Filters: filename glob, path substring, PDF page range, similarity floor. Claude is encouraged to raise `k` (10-20) and `candidate_pool` (150-300) for hard queries.
`delete_collection(collection)`	Drops the .db file. Returns freed MB.

All tools have docstrings — Claude reads them automatically.
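
The change detection behind reindex_folder can be pictured as a comparison of stored and current file hashes. The sketch below is illustrative only: `file_sha1`, `plan_reindex`, and the path-to-hash mapping are hypothetical and say nothing about how the server actually stores its metadata.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha1(path, block_size=1 << 20):
    """Return the hex SHA1 of a file, read in blocks to bound memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_reindex(folder, known_hashes):
    """Compare a folder against hashes stored by a previous run.

    known_hashes maps relative path -> SHA1 (a hypothetical structure).
    Returns (changed, deleted): paths to re-embed and paths to drop.
    """
    current = {
        p.relative_to(folder).as_posix(): file_sha1(p)
        for p in Path(folder).rglob("*")
        if p.is_file()
    }
    changed = sorted(p for p, h in current.items() if known_hashes.get(p) != h)
    deleted = sorted(p for p in known_hashes if p not in current)
    return changed, deleted


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("alpha")
    (Path(tmp) / "b.txt").write_text("beta")
    # a.txt is unchanged, b.txt is new, gone.txt no longer exists.
    stored = {"a.txt": file_sha1(Path(tmp) / "a.txt"), "gone.txt": "0" * 40}
    changed, deleted = plan_reindex(tmp, stored)
```

Only the paths in `changed` would need re-embedding, which is what keeps repeated runs cheap.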

Performance

Metric	Value
Indexing throughput (CPU)	~80 chunks/sec (bge-small)
100-page PDF index time	~3 sec
50 × 100-page PDFs	~3 min
100M-token corpus	~40 min one-time
Search latency (k=5, ≤500K chunks)	~150ms (vector + FTS + rerank)
Disk per chunk	~2 KB

GPU auto-detected by sentence-transformers if PyTorch sees CUDA/MPS — 5-10× faster.
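
For a rough sense of index size, the per-chunk disk figure from the table can be combined with the chunking parameters. This back-of-the-envelope sketch assumes every chunk is full and that each chunk advances by chunk size minus overlap (384 - 96 = 288 tokens); real counts will differ because chunks end on sentence boundaries. The function name is hypothetical.

```python
import math


def estimate_index(total_tokens, chunk_tokens=384, overlap_tokens=96, kb_per_chunk=2.0):
    """Estimate chunk count and disk use (MB) for a corpus of total_tokens."""
    stride = chunk_tokens - overlap_tokens  # new tokens covered per chunk
    chunks = math.ceil(total_tokens / stride)
    disk_mb = chunks * kb_per_chunk / 1024
    return chunks, disk_mb


chunks, disk_mb = estimate_index(100_000_000)
print(f"{chunks:,} chunks, about {disk_mb:.0f} MB on disk")
```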

Supported document types

.pdf — page-aware extraction via pypdf. Pages with empty/sparse text auto-fall-back to OCR (if [ocr] extra installed).
.docx — paragraphs + tables via python-docx
.txt, .md, .markdown — UTF-8 text
.png, .jpg, .jpeg, .tiff, .tif, .bmp, .webp — OCR via RapidOCR (requires [ocr] extra)

Unsupported files are skipped silently. Images without OCR installed are skipped with a warning.

Storage

All collections live under ~/.vectorise-mcp/. One .db file per collection, fully self-contained (vector + keyword + chunks + metadata). Portable — copy the file to share, back up the folder for safekeeping.

Configuration via env vars

Var	Default	Purpose
`VECTORISE_MCP_EMBED_MODEL`	`BAAI/bge-small-en-v1.5`	Override embedding model. Must produce 384-dim vectors.
`VECTORISE_MCP_RERANKER_MODEL`	`BAAI/bge-reranker-base`	Override cross-encoder reranker.
`VECTORISE_MCP_EMBED_BATCH`	`32`	Embedding batch size (lower if OOM).
`VECTORISE_MCP_RERANKER_BATCH`	`16`	Rerank batch size.
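
Because Claude Desktop launches the server, these variables are most easily set in the server's config entry rather than in your shell. The sketch below builds such an entry, assuming Claude Desktop's per-server "env" object; the helper name `server_entry` is hypothetical, and the batch sizes are examples only.

```python
import json


def server_entry(embed_batch=32, reranker_batch=16):
    """Build an mcpServers entry that passes batch sizes via env vars."""
    return {
        "command": "vectorise-mcp",
        "args": ["serve"],
        # Environment values must be strings in JSON config.
        "env": {
            "VECTORISE_MCP_EMBED_BATCH": str(embed_batch),
            "VECTORISE_MCP_RERANKER_BATCH": str(reranker_batch),
        },
    }


# Example: lower the embedding batch size on a memory-constrained machine.
entry = server_entry(embed_batch=8)
print(json.dumps({"mcpServers": {"vectorise": entry}}, indent=2))
```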

Troubleshooting

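sqlite-vec extension fails to load

The PyPI package ships prebuilt binaries. Make sure your Python's sqlite3 module supports extension loading (enable_load_extension); most CPython builds do, but some, such as the macOS system Python, are compiled without it.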
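
A quick way to check whether your interpreter supports extension loading is to try enabling it on an in-memory database. This is a diagnostic sketch using only the standard library; the function name is hypothetical, and it does not load sqlite-vec itself.

```python
import sqlite3


def can_load_extensions():
    """Return True if this Python's sqlite3 allows loading extensions."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.enable_load_extension(True)
        return True
    except (AttributeError, sqlite3.Error):
        # AttributeError means sqlite3 was compiled without extension support.
        return False
    finally:
        conn.close()


print("extension loading available:", can_load_extensions())
```

If this prints False, run the server with a different Python build (for example one from python.org, Homebrew, or conda) and reinstall the package there.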

Indexing very slow

First file is slow — model loading (~10s).
Install a GPU-enabled PyTorch build for a large speedup, then confirm the GPU is actually in use with nvidia-smi (CUDA) or Activity Monitor (macOS).

Claude doesn't see the server

Quit Claude Desktop fully; restart.
Validate JSON (no trailing commas).
Run vectorise-mcp serve in a terminal — should hang awaiting stdio. Errors there = install issue.
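
To validate the config without hunting for a stray comma by eye, Python's standard json module reports where parsing fails. A small sketch, with a hypothetical helper name and a deliberately broken sample string:

```python
import json


def check_config(text):
    """Return None if text is valid JSON, else a description of the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Trailing comma after the "command" value makes this invalid JSON.
bad = '{"mcpServers": {"vectorise": {"command": "vectorise-mcp",}}}'
print(check_config(bad))
```

Pass the contents of your claude_desktop_config.json to the function; a result of None means the file parses cleanly.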

Search returns weak results

Try a larger candidate_pool (e.g. 150-300, up from the default 75). Recall trades against latency.
Check the chunk text in search results. If chunks are tiny or empty, the source may be image-only PDFs: pypdf cannot OCR them, so install the [ocr] extra to let those pages fall back to OCR.

License

MIT.


Version 0.8.2, released May 8, 2026.


