MCP server for semantic document search with boundary-aware chunking
Doc Index MCP
What Is This For?
A local-first semantic search server for your documents. Index PDFs, Word docs, PowerPoints, Excel files, and text/markdown, then search them using natural language via the Model Context Protocol (MCP).
- Semantic search - Find relevant content using natural language queries
- Boundary-aware chunking - Respects document structure (chapters, sections, headers)
- Table extraction - Extract tables from documents as CSV
- Fully local - No external APIs, no cloud services, no PyTorch
- Lightweight - ONNX-based embeddings (~50MB vs ~2GB for PyTorch)
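To make "boundary-aware" concrete, here is a minimal sketch of chunking that never lets a chunk cross a Markdown header. It is not the server's actual chunker: the name `chunk_by_headers`, the whitespace-based token count, and the use of 256 as a default are illustrative assumptions (the ~256 figure comes from the Data Flow diagram).

```python
import re

# Matches Markdown ATX headers such as "# Title" or "### Subsection".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headers(text, max_tokens=256):
    """Split Markdown into chunks that never cross a header boundary.

    Tokens are approximated by whitespace-separated words (an assumption;
    a real tokenizer would count differently).
    """
    chunks = []
    title = None
    words = []

    def flush():
        # Emit the buffered words as one chunk tagged with its section title.
        if words:
            chunks.append({"section": title, "text": " ".join(words)})
            words.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # a header always closes the current chunk
            title = match.group(2).strip()
            continue
        for word in line.split():
            words.append(word)
            if len(words) >= max_tokens:
                flush()  # size limit reached inside a section
    flush()
    return chunks


doc = "# Intro\none two three\n# Setup\nfour five"
for chunk in chunk_by_headers(doc, max_tokens=2):
    print(chunk)
```

Because a header forces a flush, a chunk can be shorter than the limit but never mixes text from two sections.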
Supported Formats
| Format | Extensions | Notes |
|---|---|---|
| Text | .txt | Plain text |
| Markdown | .md, .markdown | Preserves headers for boundaries |
| PDF | .pdf | Text extraction with page markers |
| Word | .docx | Paragraphs, headings, tables |
| PowerPoint | .pptx | Slides, notes, tables |
| Excel | .xlsx, .xls | Sheets as tables |
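The table above amounts to a lookup from file extension to format. A small sketch of that lookup follows; the names `SUPPORTED_FORMATS` and `detect_format` are hypothetical and not taken from the server's code.

```python
from pathlib import Path

# Extension-to-format mapping copied from the Supported Formats table.
SUPPORTED_FORMATS = {
    ".txt": "Text",
    ".md": "Markdown",
    ".markdown": "Markdown",
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".xlsx": "Excel",
    ".xls": "Excel",
}


def detect_format(file_path):
    """Return the format name for a path, or None if the extension is unsupported."""
    return SUPPORTED_FORMATS.get(Path(file_path).suffix.lower())


print(detect_format("docs/manual.pdf"))
print(detect_format("notes.odt"))
```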
Why No External Services?
| Component | Traditional RAG | This Server |
|---|---|---|
| Embeddings | OpenAI API / hosted model | Local ONNX model (fastembed) |
| Vector DB | Pinecone / Weaviate / Qdrant | Local file (usearch) |
| Storage | Cloud / managed DB | Local .docindex/ directory |
| Dependencies | PyTorch (~2GB) | ONNX Runtime (~50MB) |
Tools
doc_index
Index a document for semantic search.
{
  "file_path": "docs/manual.pdf",
  "source_name": "manual"
}
doc_search
Search indexed documents using natural language.
{
  "query": "how to configure authentication",
  "top_k": 5,
  "expand_to_boundary": "section",
  "max_return_tokens": 4096
}
Parameters:
- query - Search query
- sources - Filter to specific sources (optional)
- top_k - Number of results (default: 5)
- expand_to_boundary - Expand results to full "section" or "chapter"
- max_return_tokens - Token budget for results (default: 4096)
- include_siblings - Include sibling sections when expanding
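MCP clients normally send tool calls for you, but it can help to see the message shape. The sketch below builds a JSON-RPC 2.0 `tools/call` request for `doc_search`, filling in the documented defaults. The helper name `build_doc_search_call` is hypothetical, and the parameter validation is an illustration, not the server's behavior.

```python
import json

# Optional parameters listed in the doc_search documentation.
ALLOWED_OPTIONS = {
    "sources",
    "top_k",
    "expand_to_boundary",
    "max_return_tokens",
    "include_siblings",
}


def build_doc_search_call(query, request_id=1, **options):
    """Build a JSON-RPC 2.0 `tools/call` message for the doc_search tool."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown doc_search parameters: {sorted(unknown)}")
    # Start from the documented defaults, then apply caller overrides.
    arguments = {"query": query, "top_k": 5, "max_return_tokens": 4096}
    arguments.update(options)
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "doc_search", "arguments": arguments},
    }


message = build_doc_search_call(
    "how to configure authentication", expand_to_boundary="section"
)
print(json.dumps(message, indent=2))
```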
doc_list
List all indexed sources.
doc_chunk
Retrieve a specific chunk by ID with optional neighbors.
{
  "chunk_id": "manual:42",
  "neighbors": 2
}
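The example ID `manual:42` suggests a `source:index` layout. Under that assumption, and further assuming zero-based indices and that `neighbors` means "this many chunks on each side", neighbor lookup could be sketched as follows. Both function names are hypothetical.

```python
def parse_chunk_id(chunk_id):
    """Split 'source:index' into (source, index). The format is assumed from the example."""
    source, _, index = chunk_id.rpartition(":")
    if not source or not index.isdigit():
        raise ValueError(f"unexpected chunk ID: {chunk_id!r}")
    return source, int(index)


def neighbor_ids(chunk_id, neighbors, total_chunks):
    """IDs of the chunk plus up to `neighbors` chunks on each side, clamped to the source."""
    source, index = parse_chunk_id(chunk_id)
    first = max(0, index - neighbors)
    last = min(total_chunks - 1, index + neighbors)
    return [f"{source}:{i}" for i in range(first, last + 1)]


print(neighbor_ids("manual:42", neighbors=2, total_chunks=100))
```

Clamping keeps the window inside the source, so a chunk at the start or end of a document simply returns fewer neighbors.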
read_document
Read a document without indexing. Returns formatted text.
{
  "file_path": "report.pdf",
  "max_chars": 100000
}
list_tables
List all tables in a document.
{
  "file_path": "data.xlsx"
}
extract_table
Extract a specific table as CSV.
{
  "file_path": "data.xlsx",
  "table_index": 0,
  "max_rows": 100
}
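For a sense of what "extract as CSV" with a row limit involves, here is a minimal sketch using Python's standard `csv` module. The function name `table_to_csv` is hypothetical, and treating the first row as a header that does not count toward `max_rows` is an assumption rather than documented behavior.

```python
import csv
import io


def table_to_csv(rows, max_rows=100):
    """Serialize a table (list of rows) to CSV: header plus at most max_rows data rows."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    header, *data = rows
    writer.writerow(header)
    writer.writerows(data[:max_rows])  # truncate the body, keep the header
    return buffer.getvalue()


table = [["name", "qty"], ["bolt", 4], ["nut, hex", 9], ["washer", 2]]
print(table_to_csv(table, max_rows=2))
```

The `csv` writer quotes cells that contain commas, so a value like `nut, hex` survives the round trip.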
Installation
pip install -r requirements.txt
Or with uv:
uv pip install -r requirements.txt
Configuration
Add to your Claude Desktop or MCP client config:
{
  "mcpServers": {
    "doc-index": {
      "command": "python",
      "args": ["/path/to/doc-index-mcp/src/server.py"],
      "env": {
        "MCP_WORKING_DIR": "/path/to/your/project",
        "DOC_INDEX_DIR": "/path/to/store/indices"
      }
    }
  }
}
Environment Variables
| Variable | Description | Default |
|---|---|---|
| MCP_WORKING_DIR | Base directory for resolving file paths | Current working directory |
| DOC_INDEX_DIR | Directory for storing vector indices | .docindex in working dir |
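The defaults in the table can be expressed in a few lines. This is a sketch of the documented fallback rules only; the function names `resolve_dirs` and `resolve_file` are hypothetical, and the server may resolve paths differently in detail.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Return (working_dir, index_dir) using the documented defaults."""
    env = os.environ if env is None else env
    # MCP_WORKING_DIR falls back to the current working directory.
    working_dir = Path(env.get("MCP_WORKING_DIR") or Path.cwd())
    # DOC_INDEX_DIR falls back to .docindex inside the working directory.
    index_dir = Path(env.get("DOC_INDEX_DIR") or working_dir / ".docindex")
    return working_dir, index_dir


def resolve_file(file_path, working_dir):
    """Resolve a tool's file_path argument against the working directory."""
    path = Path(file_path)
    return path if path.is_absolute() else working_dir / path


working_dir, index_dir = resolve_dirs({"MCP_WORKING_DIR": "/path/to/your/project"})
print(index_dir)
print(resolve_file("docs/manual.pdf", working_dir))
```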
Architecture
Everything runs locally: no external APIs, databases, or embedding servers are required. The embedding model (BAAI/bge-small-en-v1.5, ONNX format) is downloaded once.
flowchart TB
    subgraph Client["MCP Client (Claude Desktop, etc.)"]
        LLM[LLM]
    end
    subgraph MCP["Doc Index MCP Server"]
        Server[server.py]
        subgraph Services["Local Services"]
            Loader[Document Loader<br/>PDF, DOCX, PPTX, XLSX]
            Chunker[Boundary-Aware<br/>Chunker]
            Embedder[Embedder<br/>ONNX Runtime]
            VectorStore[Vector Store<br/>usearch]
        end
    end
    subgraph Storage["Local Filesystem"]
        Docs[(Source<br/>Documents)]
        Index[(".docindex/<br/>├── manifest.json<br/>└── vectors/<br/> ├── index.usearch<br/> ├── chunks.jsonl<br/> └── boundaries.json")]
    end
    subgraph Models["Embedded Model (downloaded once)"]
        ONNX[BAAI/bge-small-en-v1.5<br/>ONNX format ~50MB]
    end
    LLM <-->|MCP Protocol| Server
    Server --> Loader
    Server --> Chunker
    Server --> Embedder
    Server --> VectorStore
    Loader -->|read| Docs
    VectorStore <-->|read/write| Index
    Embedder -->|load once| ONNX
    style Client fill:#e1f5fe
    style Storage fill:#fff3e0
    style Models fill:#f3e5f5
    style MCP fill:#e8f5e9
Data Flow
flowchart LR
    subgraph Index["Indexing"]
        direction TB
        A[Document] --> B[Load & Extract Text]
        B --> C[Detect Boundaries]
        C --> D[Chunk ~256 tokens]
        D --> E[Generate Embeddings]
        E --> F[Save to Disk]
    end
    subgraph Search["Searching"]
        direction TB
        G[Query] --> H[Embed Query]
        H --> I[Vector Similarity Search]
        I --> J[Expand to Boundaries]
        J --> K[Return Results]
    end
    Index -.->|stored in .docindex/| Search
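The search half of the flow can be illustrated with a toy example. The sketch below ranks chunks by cosine similarity using a brute-force scan, which stands in for the usearch index, and then expands the best hit to its full section. The vectors are hand-written two-dimensional placeholders rather than real embeddings, and the chunk layout (`section`, `text`, `vector` keys) is an assumption, not the format of chunks.jsonl.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, chunks, top_k=5):
    """Rank chunks by similarity to the query vector (brute force, not usearch)."""
    ranked = sorted(
        chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )
    return ranked[:top_k]


def expand_to_section(hit, chunks):
    """Join the text of every chunk in the hit's section, in document order."""
    return " ".join(c["text"] for c in chunks if c["section"] == hit["section"])


chunks = [
    {"section": "Auth", "text": "Set the API key.", "vector": [1.0, 0.0]},
    {"section": "Auth", "text": "Rotate keys monthly.", "vector": [0.8, 0.6]},
    {"section": "Install", "text": "Run the installer.", "vector": [0.0, 1.0]},
]

best = search([0.9, 0.1], chunks, top_k=1)[0]
print(expand_to_section(best, chunks))
```

Expanding after ranking means a single matching chunk can pull in the rest of its section, which is the purpose of the `expand_to_boundary` option.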
License
MIT