mdrag

Give any local Markdown folder a semantic-search MCP server. Runs entirely offline.

Turn ~/Desktop/sales/, ~/Desktop/notes/, or any directory full of Markdown files into a searchable knowledge base that Claude Code, Cursor, Cline, and other MCP clients can query with natural-language questions.

Features

Storage & indexing

  • Multi-vault: one MCP server manages many doc folders, each a separate "vault"
  • Self-contained: each vault's vector DB lives inside the folder (.mdrag/), so you can move it anywhere
  • Incremental indexing: only files whose mtime changed are re-embedded
  • Auto-reindex on save: mdrag serve watches every registered vault with watchdog (1.5s debounce); new, edited, deleted, and moved .md files are picked up with no manual reindex and no cron job
  • .mdragignore: a gitignore-style file at the vault root excludes drafts, archives, or whole directories from the index
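
To make the 1.5s debounce concrete, here is a minimal sketch of the idea in plain Python. It is not mdrag's actual watcher code: the Debouncer class and its notify/poll methods are hypothetical names, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class Debouncer:
    """Collect changed paths and release them after a quiet period (hypothetical helper)."""
    delay: float = 1.5                      # seconds, the debounce value quoted above
    _pending: set = field(default_factory=set)
    _last_event: float = 0.0

    def notify(self, path: str, now: float) -> None:
        """Record a file event; every new event restarts the quiet period."""
        self._pending.add(path)
        self._last_event = now

    def poll(self, now: float) -> set:
        """Return the pending batch once `delay` seconds passed without new events."""
        if self._pending and now - self._last_event >= self.delay:
            batch, self._pending = self._pending, set()
            return batch
        return set()
```

Two saves one second apart end up in a single batch: poll returns nothing until 1.5 seconds after the second save, then hands back both paths at once.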

Retrieval quality

  • Chunk-level retrieval: long docs are split by headings (sliding-window fallback at 600 chars / 80 overlap) so mid-doc content stays findable; each doc also gets an "overview" chunk for broad queries
  • Hybrid search: dense vector retrieval fused with BM25 keyword matching via best-rank fusion, so specific terms and semantic intent both get through
  • Rare-term boost: queries containing digit strings (e.g. "38 种字段", i.e. "38 field types") switch to a BM25-priority fusion so exact-match lookups aren't buried by vector results
  • Cross-lingual query expansion: comparison-style queries ("区别" (difference), "对比" (comparison), "compare", "vs") get auto-expanded with bilingual synonyms before embedding, improving recall on mixed-language corpora
  • Any embedding model: the default is the multilingual paraphrase-multilingual-MiniLM-L12-v2 (handles Chinese + English + 50 more); swap in any sentence-transformers model
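
The README does not spell out how best-rank fusion or the rare-term switch work internally, so the following is only one plausible reading, written as a sketch. It assumes "best-rank" means a document is ordered by the better of its two ranks, and that "BM25-priority" means every BM25 hit is placed ahead of vector-only hits. The fuse function, its sort keys, and the way labels are assigned are all assumptions; only the label strings come from mdrag's documented match_reason values.

```python
import re

INF = float("inf")


def fuse(vector_ids, bm25_ids, query):
    """Merge two ranked ID lists (best first) into [(doc_id, match_reason), ...]."""
    v_rank = {doc: i for i, doc in enumerate(vector_ids)}
    b_rank = {doc: i for i, doc in enumerate(bm25_ids)}
    # Assumption: any digit in the query marks it as a rare-term lookup.
    rare = re.search(r"\d", query) is not None

    def sort_key(doc):
        v, b = v_rank.get(doc, INF), b_rank.get(doc, INF)
        if rare:
            # BM25-priority: every BM25 hit outranks vector-only hits.
            return (0, b) if doc in b_rank else (1, v)
        # Best-rank: the better of the two ranks wins; ties favor docs found by both.
        return (min(v, b), v + b)

    def reason(doc):
        if rare and doc in b_rank:
            return "bm25 (rare-term)"
        if doc in v_rank and doc in b_rank:
            return "vector+bm25"
        return "vector only" if doc in v_rank else "bm25 only"

    docs = dict.fromkeys([*vector_ids, *bm25_ids])  # de-duplicate, keep first-seen order
    return [(doc, reason(doc)) for doc in sorted(docs, key=sort_key)]
```

With vector results ["a", "b", "c"] and BM25 results ["c", "d"], a plain query ranks "c" first because it is the top BM25 hit and appears in both lists, while a query such as "38 fields" moves both BM25 hits ahead of the vector-only ones.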

Stability

  • File-locking: concurrent CLI + watcher reindexes on the same vault are serialized via filelock, preventing LanceDB corruption
  • Schema versioning: meta.json in each .mdrag/ dir tracks schema version and model; mismatches are caught early with an actionable error
  • mdrag doctor: one command that checks Python, the registry, per-vault health, model cache, disk usage, and PATH; paste the output into bug reports
  • Watcher health in MCP: list_vaults shows a ⚠️ if a vault's auto-reindex is failing (consecutive errors + message), instead of silently serving stale data
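
As an illustration of the schema-versioning check, the sketch below reads a meta.json and fails early with an actionable message. The key names ("schema_version", "model"), the SCHEMA_VERSION constant, and the VaultMetaError class are assumptions made for this example; the real file layout may differ.

```python
import json
from pathlib import Path

SCHEMA_VERSION = 3  # hypothetical value, for illustration only


class VaultMetaError(RuntimeError):
    """Raised when a vault's stored metadata does not match what the caller expects."""


def check_meta(vault_dir, expected_model, expected_schema=SCHEMA_VERSION):
    """Load <vault>/.mdrag/meta.json and verify schema version and model name."""
    meta_path = Path(vault_dir) / ".mdrag" / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    if meta.get("schema_version") != expected_schema:
        raise VaultMetaError(
            f"{meta_path}: schema {meta.get('schema_version')!r}, expected "
            f"{expected_schema!r}; run: mdrag vault reindex NAME --full"
        )
    if meta.get("model") != expected_model:
        raise VaultMetaError(
            f"{meta_path}: indexed with {meta.get('model')!r}, but "
            f"{expected_model!r} is configured; run: mdrag vault reindex NAME --full"
        )
    return meta
```

Checking before the first query means a model mismatch surfaces as a clear error rather than as poor search results from incompatible vectors.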

Interface

  • Fully local: no API keys, no cloud; embeddings run on your machine
  • MCP tools: list_vaults, search, get_doc, list_tags exposed to Claude Code / Cursor / Cline over stdio
  • Match explainability: each search result includes match_reason ("vector+bm25", "bm25 (rare-term)", "bm25 only", "vector only") so AI clients can explain or re-rank
  • Quality eval harness: mdrag eval compares any set of indexes on a YAML query suite; Recall@K, MRR, per-query ranking diff
  • Frontmatter-aware: title, tags, summary from YAML frontmatter are indexed and searchable
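
For readers unfamiliar with the two metrics the eval harness reports, here is a small sketch using their standard definitions. It is independent of mdrag's own implementation; the function names are chosen for this example.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant doc (1-based), or 0.0 if none is found."""
    relevant = set(relevant)
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

If a query's only relevant document comes back in second place, Recall@1 is 0.0, Recall@2 is 1.0, and the reciprocal rank is 0.5.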

Installation

pip install mdrag

Requires Python ≥ 3.10.


Quickstart (3 steps)

Let's say Bob has a folder ~/Desktop/sales/ full of meeting notes, proposals, and competitor research in Markdown.

1. Register the MCP server (once, globally)

claude mcp add mdrag --scope user -- mdrag serve

This tells Claude Code "there's an MCP server called mdrag; launch it with mdrag serve when needed". You only need to do this once per machine.

2. Register your doc folder as a vault

mdrag vault add sales ~/Desktop/sales

The first time you run this, a ~100MB embedding model downloads (once), then all .md files under ~/Desktop/sales/ get indexed. A .mdrag/ subfolder is created inside sales/ to hold the vector database.

3. Use it from Claude Code

Open Claude Code in any project. Ask:

"Use the mdrag MCP to search my sales vault for the Q4 pipeline review"

Claude will call mcp__mdrag__search(vault="sales", query="Q4 pipeline review") and return the top matching documents.
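
Under the hood, that tool call travels to mdrag serve as a JSON-RPC message over stdio. The sketch below builds such a message following the MCP tools/call request shape. It is simplified: a real session begins with an initialize handshake, and the request ID here is arbitrary. The mcp__mdrag__ prefix is how Claude Code names the tool on its side; the server itself sees the plain tool name search.

```python
import json


def tool_call_request(request_id, tool, arguments):
    """Build an MCP `tools/call` JSON-RPC request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


msg = tool_call_request(1, "search", {"vault": "sales", "query": "Q4 pipeline review"})
# The stdio transport sends one JSON message per line.
line = json.dumps(msg) + "\n"
```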


Adding another folder

No new MCP config is needed. Just register another vault:

mdrag vault add marketing ~/Desktop/marketing
mdrag vault add notes ~/Documents/notes

All vaults are visible through the same MCP server. Claude calls:

mcp__mdrag__list_vaults()                          → see all vaults
mcp__mdrag__search(vault="marketing", query="...")
mcp__mdrag__search(vault="notes", query="...")

CLI reference

mdrag serve                          Start the MCP stdio server
mdrag vault add NAME PATH            Register a directory and index it
mdrag vault list                     Show all vaults
mdrag vault info NAME                Show vault details
mdrag vault reindex NAME [--full]    Re-index (incremental or full)
mdrag vault remove NAME [--purge]    Unregister (and optionally delete .mdrag/)
mdrag search VAULT QUERY [-k N]      Search a vault from the shell (debugging)
mdrag eval QUERIES INDEX_SPECS...    Compare retrieval quality across indexes

mdrag search runs the same hybrid retrieval as the MCP search tool, which is useful for verifying a reindex or debugging without an MCP client. Add --json for machine-readable output, --tags a,b to filter.
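
If you want to script against the CLI, a thin wrapper like the one below may be enough. It uses only the flags listed above (-k, --json, --tags); the helper names are made up for this sketch, and since the shape of the JSON output is not documented here, the result is returned as parsed without assuming any fields.

```python
import json
import subprocess


def build_search_cmd(vault, query, k=5, tags=()):
    """Assemble the argv for `mdrag search` with machine-readable output."""
    cmd = ["mdrag", "search", vault, query, "-k", str(k), "--json"]
    if tags:
        cmd += ["--tags", ",".join(tags)]
    return cmd


def run_search(vault, query, k=5, tags=()):
    """Run the CLI and return its parsed JSON output (schema not assumed)."""
    proc = subprocess.run(
        build_search_cmd(vault, query, k, tags),
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)
```

Passing the query as a single list element avoids shell-quoting problems with spaces or non-ASCII text.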

Common options:

  • --model MODEL_NAME on vault add: pick a different embedding model
  • --no-index on vault add: skip initial indexing (useful when you want to register a vault now and index it later)
  • --full on vault reindex: rebuild from scratch (required after changing the model)

MCP tools exposed

When mdrag serve is running, these tools are available to the AI client:

Tool                                      Purpose
list_vaults()                             List all registered vaults with their stats
search(vault, query, top_k=5, tags="")    Semantic search within a vault; returns the best-matching chunk per doc with heading_path and chunk_text
get_doc(vault, path)                      Read the full content of a document
list_tags(vault)                          List all frontmatter tags in a vault with counts
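
To show where heading_path and chunk_text could come from, here is a rough sketch of heading-based chunking with the sliding-window fallback from the feature list (600 chars, 80 overlap). It is an illustration, not mdrag's chunker: the " > " separator in heading_path is an assumption, the per-doc "overview" chunk is omitted, and headings inside code fences are not special-cased.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def sliding_window(text, size=600, overlap=80):
    """Cut text into windows of `size` chars, each sharing `overlap` chars with the previous one."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def chunk_markdown(md, size=600, overlap=80):
    """Split Markdown by headings; sections longer than `size` get windowed."""
    chunks, path, buf = [], [], []

    def flush():
        text = "\n".join(buf).strip()
        buf.clear()
        if text:
            for piece in sliding_window(text, size, overlap):
                chunks.append({"heading_path": " > ".join(path), "chunk_text": piece})

    for line in md.splitlines():
        m = HEADING.match(line)
        if m:
            flush()                      # text so far belongs to the previous heading
            level = len(m.group(1))
            del path[level - 1:]         # drop headings at this level or deeper
            path.append(m.group(2).strip())
        else:
            buf.append(line)
    flush()
    return chunks
```

A 1000-character section under "## B" inside "# A" yields two chunks with heading_path "A > B": one of 600 characters and one of 480, overlapping by 80.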

Frontmatter (optional)

If your Markdown files have YAML frontmatter, mdrag will use it:

---
title: Q4 Pipeline Review
tags: [sales, forecast, 2026-q4]
summary: Overview of deals in play for Q4 2026.
---

# Q4 Pipeline Review
...

  • title: used as the result title (falls back to filename)
  • tags: searchable via the tags parameter of search
  • summary: shown in search results

No frontmatter? It still works: mdrag auto-generates a preview from the file body.
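
The fallback behavior described above can be sketched with the standard library alone. This toy parser only understands flat "key: value" lines and inline [a, b] lists, which is enough for the example shown; a real implementation would use a proper YAML parser. The function names and the 200-character preview length are assumptions.

```python
def split_frontmatter(text):
    """Return (meta, body); meta is empty when there is no frontmatter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:                      # opening delimiter without a closing one
        return {}, text
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        meta[key.strip()] = value
    return meta, "\n".join(lines[end + 1:])


def describe(filename, text, preview_chars=200):
    """Pick title, tags, and summary, falling back to filename and a body preview."""
    meta, body = split_frontmatter(text)
    return {
        "title": meta.get("title") or filename,
        "tags": meta.get("tags") or [],
        "summary": meta.get("summary") or " ".join(body.split())[:preview_chars],
    }
```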


Embedding models

Language                  Recommended model                                  Notes
Multilingual (default)    paraphrase-multilingual-MiniLM-L12-v2              ~120MB, handles Chinese + English + 50 more
Chinese-only              BAAI/bge-small-zh-v1.5                             ~100MB, higher recall on pure Chinese
English-only              BAAI/bge-small-en-v1.5                             ~100MB, higher recall on pure English
Higher accuracy           BAAI/bge-base-zh-v1.5 or BAAI/bge-base-en-v1.5     ~400MB, noticeably slower
Change the model when registering a vault:

mdrag vault add notes ~/Documents/notes --model BAAI/bge-small-en-v1.5

After changing the model on an existing vault (edit ~/.mdrag/vaults.yaml), run a full rebuild:

mdrag vault reindex notes --full

How it works

 ┌────────────────────┐        ┌──────────────────────┐
 │ ~/Desktop/sales/   │        │ ~/.mdrag/            │
 │   meeting-01.md    │        │   vaults.yaml        │  ← registry
 │   proposal.md      │        └──────────────────────┘
 │   .mdrag/          │ ← LanceDB vector store (per-vault)
 │     docs.lance/    │
 └──────────┬─────────┘
            │
            │ mdrag serve
            ▼
 ┌──────────────────────────┐
 │   FastMCP stdio server   │
 │   tools:                 │
 │     search / get_doc /   │
 │     list_vaults /        │
 │     list_tags            │
 └──────────┬───────────────┘
            │ MCP protocol (stdio / JSON-RPC)
            ▼
     Claude Code / Cursor / Cline

  • Vault registry is at ~/.mdrag/vaults.yaml
  • Each vault's vector database lives inside the vault directory at .mdrag/, so it is self-contained and portable
  • Embeddings use sentence-transformers, stored in LanceDB
  • MCP server is built on FastMCP

FAQ

How do I update the index after editing files?

You don't have to. When mdrag serve is running (i.e. Claude Code / Cursor are connected), it watches every registered vault and auto-reindexes on save. A short debounce batches rapid edits.

If serve isn't running, run an incremental reindex manually:

mdrag vault reindex sales

Only files with changed mtime are re-embedded.
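
The mtime comparison can be pictured roughly as follows. This is a sketch of the general technique, not mdrag's indexer: it assumes the previous run's mtimes are available as a {relative_path: mtime} dict and that anything under .mdrag/ is skipped.

```python
from pathlib import Path


def changed_files(vault_dir, known_mtimes):
    """Return (to_embed, to_delete) relative to the mtimes recorded by the last run."""
    root = Path(vault_dir)
    current = {
        p.relative_to(root).as_posix(): p.stat().st_mtime
        for p in root.rglob("*.md")
        if ".mdrag" not in p.relative_to(root).parts   # assumption: skip the index dir
    }
    # New files are missing from known_mtimes; edited files have a different mtime.
    to_embed = sorted(rel for rel, m in current.items() if known_mtimes.get(rel) != m)
    # Files recorded earlier but no longer on disk should leave the index.
    to_delete = sorted(set(known_mtimes) - set(current))
    return to_embed, to_delete
```

Untouched files never reach the embedding model, which is why an incremental run on a large vault finishes quickly.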

How do I exclude files from the index?

Put a .mdragignore file at the root of your vault, using gitignore syntax:

# Example: drafts, archives, big log exports
drafts/
archive/**
**/sales-log-*.md

Takes effect on the next index run (auto-watch picks up the change too).
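
To get a feel for how the three example patterns behave, here is a deliberately simplified matcher built on the standard library. It covers only the pattern shapes shown above (trailing-slash directories, a trailing /**, a leading **/, and plain globs) and ignores gitignore features such as negation with "!". The is_ignored function is a hypothetical helper, not part of mdrag.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(rel_path, patterns):
    """Approximate gitignore matching for a vault-relative POSIX path."""
    parts = PurePosixPath(rel_path).parts
    for raw in patterns:
        pat = raw.strip()
        if not pat or pat.startswith("#"):
            continue                                  # blank line or comment
        if pat.endswith("/**"):
            if rel_path.startswith(pat[:-2]):         # "archive/**" -> prefix "archive/"
                return True
        elif pat.endswith("/"):
            if pat.rstrip("/") in parts[:-1]:         # directory name at any depth
                return True
        elif pat.startswith("**/"):
            if fnmatchcase(parts[-1], pat[3:]):       # file name in any directory
                return True
        elif fnmatchcase(rel_path, pat):
            return True
    return False


PATTERNS = ["# Example: drafts, archives, big log exports",
            "drafts/", "archive/**", "**/sales-log-*.md"]
```

With these patterns, drafts/idea.md, archive/2020/old.md, and q1/sales-log-jan.md are excluded, while notes/meeting.md is still indexed.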

Does it support PDF, DOCX, PPTX, XLSX, etc.?

Not directly. mdrag only indexes .md files. This is by design: conversion is a messy, format-specific problem, and keeping the core focused on Markdown keeps the index predictable.

Use the companion tool mdpack to convert a directory of mixed-format docs to clean Markdown, then point mdrag at the output:

pip install mdpack
brew install pandoc            # needed for DOCX

mdpack convert ~/Desktop/reports                   # writes ~/Desktop/reports/converted/
mdrag vault add reports ~/Desktop/reports/converted

mdpack mirrors the source directory, injects source/converter/converted_at frontmatter so mdrag can trace results back to the original file, and strips inline base64 images (which would otherwise inflate .md files to multiple MB and break chunking). It supports .docx, .xlsx, and .csv today; PDF, PPTX, and HTML are on the 0.2 roadmap.

For one-off conversions without installing mdpack, pandoc still works:

pandoc meeting.docx -o docs/meeting.md --wrap=none

Model download is slow / fails

If you're in China, set a HuggingFace mirror:

export HF_ENDPOINT=https://hf-mirror.com
mdrag vault add sales ~/Desktop/sales

Where is the vector data stored?

  • Vault registry: ~/.mdrag/vaults.yaml
  • Each vault's vectors: <vault_path>/.mdrag/docs.lance/

Can I share a vault across machines?

Yes. The .mdrag/ folder is self-contained. Sync the whole vault directory (via Dropbox, rsync, git-lfs, whatever) and run mdrag vault add <name> <path> on the other machine. No re-indexing is needed as long as the embedding model matches.


Integrations

Claude Code

claude mcp add mdrag --scope user -- mdrag serve

Or manually in ~/.mcp.json:

{
  "mcpServers": {
    "mdrag": {
      "command": "mdrag",
      "args": ["serve"]
    }
  }
}
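
If you manage that file from a script, merging the entry rather than overwriting the file keeps any other servers you have configured. A minimal sketch, with helper names invented for this example:

```python
import json
from pathlib import Path


def add_mdrag_server(config):
    """Return a copy of an MCP client config with the mdrag stdio server added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))      # copy so the input stays untouched
    servers["mdrag"] = {"command": "mdrag", "args": ["serve"]}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path="~/.mcp.json"):
    """Read the config if it exists, add the mdrag entry, and write it back."""
    path = Path(path).expanduser()
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.write_text(json.dumps(add_mdrag_server(config), indent=2) + "\n", encoding="utf-8")
```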

Cursor / Cline / other MCP clients

Add the same stdio command to your client's MCP configuration. The command is mdrag serve; it communicates over stdio following the MCP protocol.


Development

git clone https://github.com/andyleimc-source/mdrag
cd mdrag
python -m venv .venv
.venv/bin/pip install -e ".[dev]"    # quoted so zsh does not expand the brackets
.venv/bin/pytest

Try the example vault shipped in the repo:

mdrag vault add demo ./examples/sample-vault
mdrag vault list

License

MIT. Do whatever you want with it.
