mdrag
Give any local Markdown folder a semantic-search MCP server. Runs entirely offline.
Turn ~/Desktop/sales/, ~/Desktop/notes/, or any directory full of Markdown files into a searchable knowledge base that Claude Code, Cursor, Cline, and other MCP clients can query with natural-language questions.
Features
Storage & indexing
- Multi-vault: one MCP server manages many doc folders, each a separate "vault"
- Self-contained: each vault's vector DB lives inside the folder (.mdrag/), so you can move it anywhere
- Incremental indexing: only re-embed files whose mtime changed
- Auto-reindex on save: mdrag serve watches every registered vault with watchdog (1.5 s debounce); new, edited, deleted, and moved .md files are picked up with no manual reindex and no cron
- .mdragignore: a gitignore-style file at the vault root excludes drafts, archives, or whole directories from the index
Retrieval quality
- Chunk-level retrieval: long docs are split by headings (sliding-window fallback at 600 chars / 80 overlap) so mid-doc content stays findable; each doc also gets an "overview" chunk for broad queries
- Hybrid search: dense vector retrieval fused with BM25 keyword matching via best-rank fusion, so specific terms and semantic intent both get through
- Rare-term boost: queries containing digit strings (e.g. "38 种字段", "38 kinds of fields") switch to a BM25-priority fusion so exact-match lookups aren't buried by vector results
- Cross-lingual query expansion: comparison-style queries ("区别" (difference), "对比" (comparison), "compare", "vs") get auto-expanded with bilingual synonyms before embedding, improving recall on mixed-language corpora
- Any embedding model: the default is the multilingual paraphrase-multilingual-MiniLM-L12-v2 (handles Chinese + English + 50 more); swap in any sentence-transformers model
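The heading split with a sliding-window fallback can be sketched roughly as follows. This is an illustration under assumptions, not mdrag's chunker: the function names are hypothetical, only the 600/80 figures come from the list above, and it ignores details such as `#` lines inside code fences, nested heading paths, and the per-document overview chunk.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def sliding_window(text: str, size: int = 600, overlap: int = 80) -> list[str]:
    """Split text into windows of `size` chars that share `overlap` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def chunk_markdown(md: str, size: int = 600, overlap: int = 80) -> list[tuple[str, str]]:
    """Return (heading, chunk_text) pairs, one or more per heading section."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    for line in md.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)
    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue
        # Sections that fit stay whole; oversized ones fall back to windows.
        pieces = [body] if len(body) <= size else sliding_window(body, size, overlap)
        chunks.extend((heading, piece) for piece in pieces)
    return chunks


doc = "# Intro\nshort text\n\n## Details\n" + "x" * 1000
chunks = chunk_markdown(doc)
```

Here the short "Intro" section stays in one piece, while the 1000-character "Details" section becomes two overlapping windows that both carry the heading.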
Stability
- File-locking: concurrent CLI + watcher reindexes on the same vault are serialized via filelock, preventing LanceDB corruption
- Schema versioning: meta.json in each .mdrag/ dir tracks schema version and model; mismatches are caught early with an actionable error
- mdrag doctor: one command that checks Python, registry, per-vault health, model cache, disk usage, and PATH; paste the output into bug reports
- Watcher health in MCP: list_vaults shows a ⚠️ if a vault's auto-reindex is failing (consecutive errors + message), instead of silently serving stale data
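A schema and model check of the kind described above might look like the sketch below. The JSON keys `schema_version` and `model`, the `SCHEMA_VERSION` constant, and the `check_meta` helper are all assumptions made for illustration; the real layout of meta.json is not documented here.

```python
import json
import tempfile
from pathlib import Path

SCHEMA_VERSION = 2  # hypothetical current schema number


class VaultMetaError(RuntimeError):
    """Raised when a vault's stored metadata does not match expectations."""


def check_meta(vault: Path, expected_model: str) -> None:
    """Compare .mdrag/meta.json with the running schema version and model."""
    meta_path = vault / ".mdrag" / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    if meta.get("schema_version") != SCHEMA_VERSION:
        raise VaultMetaError(
            f"{meta_path}: schema {meta.get('schema_version')} != {SCHEMA_VERSION}; "
            "run a full reindex"
        )
    if meta.get("model") != expected_model:
        raise VaultMetaError(
            f"{meta_path}: indexed with {meta.get('model')!r}, "
            f"configured {expected_model!r}; run a full reindex"
        )


with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    (vault / ".mdrag").mkdir()
    (vault / ".mdrag" / "meta.json").write_text(
        json.dumps({"schema_version": 2, "model": "BAAI/bge-small-en-v1.5"})
    )
    check_meta(vault, "BAAI/bge-small-en-v1.5")  # matches: no error raised
    try:
        check_meta(vault, "BAAI/bge-small-zh-v1.5")
        mismatch = None
    except VaultMetaError as exc:
        mismatch = str(exc)
```

Failing before any query runs avoids comparing query vectors from one model against stored vectors from another.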
Interface
- Fully local: no API keys, no cloud; embeddings run on your machine
- MCP tools: list_vaults, search, get_doc, list_tags exposed to Claude Code / Cursor / Cline over stdio
- Match explainability: each search result includes match_reason ("vector+bm25", "bm25 (rare-term)", "bm25 only", "vector only") so AI clients can explain or re-rank
- Quality eval harness: mdrag eval compares any set of indexes on a YAML query suite; Recall@K, MRR, per-query ranking diff
- Frontmatter-aware: title, tags, summary from YAML frontmatter are indexed and searchable
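One way to picture how two ranked lists turn into a single ranking with a match_reason label is the sketch below. It is an interpretation, not mdrag's algorithm: "best-rank fusion" is read here as ordering by each document's best position in either list, the digit test stands in for rare-term detection, and the `fuse` function and the way labels are assigned are assumptions.

```python
import re


def fuse(query: str, vector_hits: list[str], bm25_hits: list[str]) -> list[tuple[str, str]]:
    """Merge two ranked id lists into (doc_id, match_reason) pairs."""
    vec_rank = {doc: i for i, doc in enumerate(vector_hits)}
    bm_rank = {doc: i for i, doc in enumerate(bm25_hits)}
    rare_term = bool(re.search(r"\d", query))  # digit strings favour exact matches

    def sort_key(doc: str) -> tuple:
        v = vec_rank.get(doc, len(vector_hits))  # missing docs rank after the list
        b = bm_rank.get(doc, len(bm25_hits))
        if rare_term:
            return (b, v, doc)          # BM25 order first, vector rank breaks ties
        return (min(v, b), v + b, doc)  # best rank in either list, then combined

    def reason(doc: str) -> str:
        if doc in vec_rank and doc in bm_rank:
            return "bm25 (rare-term)" if rare_term else "vector+bm25"
        return "vector only" if doc in vec_rank else "bm25 only"

    docs = sorted(set(vector_hits) | set(bm25_hits), key=sort_key)
    return [(doc, reason(doc)) for doc in docs]


vector_hits = ["a.md", "b.md", "c.md"]
bm25_hits = ["c.md", "d.md"]
semantic = fuse("pipeline review", vector_hits, bm25_hits)
exact = fuse("38 fields", vector_hits, bm25_hits)
```

With the plain query, the top hit of each list leads the result; with the digit query, the BM25 order takes priority and vector-only hits fall to the end.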
Installation
pip install mdrag
Requires Python ≥ 3.10.
Quickstart (3 steps)
Let's say Bob has a folder ~/Desktop/sales/ full of meeting notes, proposals, and competitor research in Markdown.
1. Register the MCP server (once, globally)
claude mcp add mdrag --scope user -- mdrag serve
This tells Claude Code "there's an MCP server called mdrag; launch it with mdrag serve when needed". You only need to do this once per machine.
2. Register your doc folder as a vault
mdrag vault add sales ~/Desktop/sales
The first time you run this, a ~100MB embedding model downloads (once), then all .md files under ~/Desktop/sales/ get indexed. A .mdrag/ subfolder is created inside sales/ to hold the vector database.
3. Use it from Claude Code
Open Claude Code in any project. Ask:
"Use the mdrag MCP to search my sales vault for the Q4 pipeline review"
Claude will call mcp__mdrag__search(vault="sales", query="Q4 pipeline review") and return the top matching documents.
Adding another folder
No new MCP config is needed. Just register another vault:
mdrag vault add marketing ~/Desktop/marketing
mdrag vault add notes ~/Documents/notes
All vaults are visible through the same MCP server. Claude calls:
mcp__mdrag__list_vaults()   # see all vaults
mcp__mdrag__search(vault="marketing", query="...")
mcp__mdrag__search(vault="notes", query="...")
CLI reference
mdrag serve                          Start the MCP stdio server
mdrag vault add NAME PATH            Register a directory and index it
mdrag vault list                     Show all vaults
mdrag vault info NAME                Show vault details
mdrag vault reindex NAME [--full]    Re-index (incremental or full)
mdrag vault remove NAME [--purge]    Unregister (and optionally delete .mdrag/)
mdrag eval QUERIES INDEX_SPECS...    Compare retrieval quality across indexes
Common options:
- --model MODEL_NAME on vault add: pick a different embedding model
- --no-index on vault add: skip initial indexing (useful when you want to register now and index later)
- --full on vault reindex: rebuild from scratch (required after changing the model)
MCP tools exposed
When mdrag serve is running, these tools are available to the AI client:
| Tool | Purpose |
|---|---|
| list_vaults() | List all registered vaults with their stats |
| search(vault, query, top_k=5, tags="") | Semantic search within a vault; returns the best-matching chunk per doc with heading_path and chunk_text |
| get_doc(vault, path) | Read the full content of a document |
| list_tags(vault) | List all frontmatter tags in a vault with counts |
Frontmatter (optional)
If your Markdown files have YAML frontmatter, mdrag will use it:
---
title: Q4 Pipeline Review
tags: [sales, forecast, 2026-q4]
summary: Overview of deals in play for Q4 2026.
---
# Q4 Pipeline Review
...
- title: used as the result title (falls back to filename)
- tags: searchable via the tags parameter of search
- summary: shown in search results

No frontmatter? It still works: mdrag auto-generates a preview from the file body.
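The fallbacks described above (filename for the title, a body preview for the summary) can be sketched with a deliberately small parser. This is an assumption-laden illustration: `parse_frontmatter` is a hypothetical helper, it only understands flat `key: value` lines and inline `[a, b]` lists rather than full YAML, and the 200-character preview length is invented for the example.

```python
from pathlib import PurePath


def parse_frontmatter(text: str, filename: str) -> dict:
    """Extract title, tags and summary from a leading '---' block, if any."""
    meta: dict = {}
    body = text
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    if stripped and stripped[0] == "---" and "---" in stripped[1:]:
        end = stripped.index("---", 1)  # index of the closing delimiter
        for line in lines[1:end]:
            key, sep, value = line.partition(":")
            if not sep:
                continue
            value = value.strip()
            if value.startswith("[") and value.endswith("]"):
                # Inline list such as [sales, forecast]
                items = value[1:-1].split(",")
                meta[key.strip()] = [v.strip() for v in items if v.strip()]
            else:
                meta[key.strip()] = value
        body = "\n".join(lines[end + 1:])
    return {
        "title": meta.get("title") or PurePath(filename).stem,  # filename fallback
        "tags": meta.get("tags") or [],
        "summary": meta.get("summary") or body.strip()[:200],   # preview fallback
    }


doc = (
    "---\n"
    "title: Q4 Pipeline Review\n"
    "tags: [sales, forecast, 2026-q4]\n"
    "summary: Overview of deals in play for Q4 2026.\n"
    "---\n"
    "# Q4 Pipeline Review\n"
)
with_meta = parse_frontmatter(doc, "q4.md")
without_meta = parse_frontmatter("Plain notes about a call.", "call-notes.md")
```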
Embedding models
| Language | Recommended model | Notes |
|---|---|---|
| Multilingual (default) | paraphrase-multilingual-MiniLM-L12-v2 | ~120MB, handles Chinese + English + 50 more |
| Chinese-only | BAAI/bge-small-zh-v1.5 | ~100MB, higher recall on pure Chinese |
| English-only | BAAI/bge-small-en-v1.5 | ~100MB, higher recall on pure English |
| Higher accuracy | BAAI/bge-base-zh-v1.5 or -en | ~400MB, noticeably slower |
Change the model when registering a vault:
mdrag vault add notes ~/Documents/notes --model BAAI/bge-small-en-v1.5
After changing the model on an existing vault (edit ~/.mdrag/vaults.yaml), run a full rebuild:
mdrag vault reindex notes --full
How it works
┌────────────────────┐      ┌──────────────────────┐
│ ~/Desktop/sales/   │      │ ~/.mdrag/            │
│   meeting-01.md    │      │   vaults.yaml        │ ← registry
│   proposal.md      │      └──────────────────────┘
│   .mdrag/          │ ← LanceDB vector store (per-vault)
│     docs.lance/    │
└──────────┬─────────┘
           │
           │ mdrag serve
           ▼
┌──────────────────────────┐
│ FastMCP stdio server     │
│   tools:                 │
│     search / get_doc /   │
│     list_vaults /        │
│     list_tags            │
└──────────┬───────────────┘
           │ MCP protocol (stdio / JSON-RPC)
           ▼
Claude Code / Cursor / Cline
- Vault registry is at ~/.mdrag/vaults.yaml
- Each vault's vector database lives inside the vault directory at .mdrag/ (self-contained, portable)
- Embeddings use sentence-transformers, stored in LanceDB
- MCP server is built on FastMCP
FAQ
How do I update the index after editing files?
You don't have to. When mdrag serve is running (i.e. Claude Code / Cursor are connected), it watches every registered vault and auto-reindexes on save. A short debounce batches rapid edits.
If mdrag serve isn't running, run an incremental reindex manually:
mdrag vault reindex sales
Only files with changed mtime are re-embedded.
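The mtime comparison behind an incremental pass can be sketched as below. This is a simplified illustration rather than mdrag's indexer: `changed_files` and the `indexed_mtimes` mapping are hypothetical names, and deleted files are not handled.

```python
import os
import tempfile
from pathlib import Path


def changed_files(vault: Path, indexed_mtimes: dict[str, float]) -> list[str]:
    """Return relative paths of .md files that are new or have a different mtime."""
    changed = []
    for path in sorted(vault.rglob("*.md")):
        rel = path.relative_to(vault).as_posix()
        if indexed_mtimes.get(rel) != path.stat().st_mtime:
            changed.append(rel)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    (vault / "a.md").write_text("# A")
    (vault / "b.md").write_text("# B")
    first_pass = changed_files(vault, {})  # nothing indexed yet: both files
    snapshot = {p.name: p.stat().st_mtime for p in vault.glob("*.md")}
    # Simulate an edit to b.md by moving its modification time forward.
    os.utime(vault / "b.md", (snapshot["b.md"] + 5, snapshot["b.md"] + 5))
    second_pass = changed_files(vault, snapshot)  # only b.md needs re-embedding
```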
How do I exclude files from the index?
Put a .mdragignore file at the root of your vault, using gitignore syntax:
# Example: drafts, archives, big log exports
drafts/
archive/**
**/sales-log-*.md
Takes effect on the next index run (auto-watch picks up the change too).
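To show how the three example patterns behave, here is a rough matcher for a small subset of gitignore syntax. It is a sketch only: `compile_ignore` and `is_ignored` are hypothetical helpers, and negation (`!`), `?`, and character classes are not supported, so it should not be read as how mdrag parses the file.

```python
import re


def compile_ignore(lines: list[str]) -> list[re.Pattern]:
    """Translate a small subset of gitignore syntax into anchored regexes."""
    compiled = []
    for raw in lines:
        pattern = raw.strip()
        if not pattern or pattern.startswith("#"):
            continue  # skip blank lines and comments
        dir_only = pattern.endswith("/")
        pattern = pattern.rstrip("/")
        # Patterns without a slash may match at any depth, as in gitignore.
        anchored = "/" in pattern
        regex = ""
        i = 0
        while i < len(pattern):
            if pattern.startswith("**/", i):
                regex += "(?:.*/)?"  # zero or more leading directories
                i += 3
            elif pattern.startswith("**", i):
                regex += ".*"        # anything, including slashes
                i += 2
            elif pattern[i] == "*":
                regex += "[^/]*"     # anything within one path segment
                i += 1
            else:
                regex += re.escape(pattern[i])
                i += 1
        prefix = "" if anchored else "(?:.*/)?"
        suffix = "/.*" if dir_only else "(?:/.*)?"
        compiled.append(re.compile(f"^{prefix}{regex}{suffix}$"))
    return compiled


def is_ignored(rel_path: str, patterns: list[re.Pattern]) -> bool:
    """True if the vault-relative POSIX path matches any ignore pattern."""
    return any(p.match(rel_path) for p in patterns)


patterns = compile_ignore(["# Example", "drafts/", "archive/**", "**/sales-log-*.md"])
```

With these patterns, `is_ignored("drafts/idea.md", patterns)` and `is_ignored("exports/sales-log-2026.md", patterns)` are true, while `is_ignored("notes/meeting.md", patterns)` is false.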
Does it support PDF, DOCX, PPTX, XLSX, etc.?
Not directly. mdrag indexes only .md files. This is by design: conversion is a messy, format-specific
problem, and keeping the core focused on Markdown keeps the index predictable. The recommended
workflow is to convert once, commit the .md output, and let mdrag watch it:
# One-off
pandoc meeting.docx -o docs/meeting.md
pandoc slides.pptx -o docs/slides.md --extract-media=docs/_media
# Bulk conversion with Docling (best quality for PDF/PPTX)
pip install docling
docling raw/*.pdf --to md --output docs/
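# CSV → MD table (emits the |---| separator row after the header so it renders as a table)
python -c "import csv,sys; [print('|'+'|'.join(r)+'|'+('\n|'+'---|'*len(r) if i==0 else '')) for i,r in enumerate(csv.reader(open(sys.argv[1])))]" data.csv > docs/data.md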
Important: strip inline base64 images before indexing. A data:image/...;base64,... payload
can inflate a .md file to multi-MB and break chunking. With pandoc use --extract-media=<dir> or
post-process with sed -E 's/!\[[^]]*\]\(data:image[^)]*\)/<!-- image -->/g'.
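If sed is not available (for example on Windows), the same substitution can be done in Python. This is a small sketch using the same pattern as the sed command above; the function name `strip_inline_images` is made up for the example.

```python
import re

# Same pattern as the sed command: a Markdown image whose target is a data: URI.
DATA_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image[^)]*\)")


def strip_inline_images(markdown: str) -> str:
    """Replace inline base64 images with a short HTML comment placeholder."""
    return DATA_IMAGE.sub("<!-- image -->", markdown)


sample = "Intro ![chart](data:image/png;base64,iVBORw0KGgo=) and ![logo](img/logo.png)."
cleaned = strip_inline_images(sample)
```

Images that point at regular files, like `img/logo.png` in the sample, are left untouched.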
Model download is slow / fails
If you're in China, set a HuggingFace mirror:
export HF_ENDPOINT=https://hf-mirror.com
mdrag vault add sales ~/Desktop/sales
Where is the vector data stored?
- Vault registry: ~/.mdrag/vaults.yaml
- Each vault's vectors: <vault_path>/.mdrag/docs.lance/
Can I share a vault across machines?
Yes. The .mdrag/ folder is self-contained. Sync the whole vault directory (via Dropbox, rsync, git-lfs, whatever) and run mdrag vault add <name> <path> on the other machine. No re-indexing is needed as long as the embedding model matches.
Integrations
Claude Code
claude mcp add mdrag --scope user -- mdrag serve
Or manually in ~/.mcp.json:
{
  "mcpServers": {
    "mdrag": {
      "command": "mdrag",
      "args": ["serve"]
    }
  }
}
Cursor / Cline / other MCP clients
Add the same stdio command to your client's MCP configuration. The command is mdrag serve; it communicates over stdio following the MCP protocol.
Development
git clone https://github.com/andyleimc-source/mdrag
cd mdrag
python -m venv .venv
.venv/bin/pip install -e '.[dev]'
.venv/bin/pytest
Try the example vault shipped in the repo:
mdrag vault add demo ./examples/sample-vault
mdrag vault list
License
MIT. Do whatever you want with it.