MCP server for semantic document search with boundary-aware chunking

Doc Index MCP

What is This For?

A local-first semantic search server for your documents. Index PDFs, Word docs, PowerPoints, Excel files, and text/markdown, then search them using natural language via the Model Context Protocol (MCP).

- Semantic search: find relevant content using natural-language queries
- Boundary-aware chunking: respects document structure (chapters, sections, headers)
- Table extraction: extract tables from documents as CSV
- Fully local: no external APIs, no cloud services, no PyTorch
- Lightweight: ONNX-based embeddings (~50MB vs ~2GB for PyTorch)
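
To make "boundary-aware chunking" concrete, here is a minimal sketch for Markdown input. It is not the server's actual chunker: the function names are hypothetical, headers are the only boundary it detects, and whitespace-separated words stand in for tokens. The idea it illustrates is that text is first split at structural boundaries and only then cut to size, so no chunk straddles two sections.

```python
import re

# Assumption: ATX-style Markdown headers ("# Title") mark section boundaries.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown_text):
    """Split Markdown into (title, body) pairs at header lines."""
    sections = []
    title, lines = None, []

    def flush():
        # Keep a section if it has a title or any non-blank body text.
        if title is not None or any(line.strip() for line in lines):
            sections.append((title, "\n".join(lines).strip()))

    for line in markdown_text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections


def chunk_document(markdown_text, max_tokens=256):
    """Chunk each section separately so chunks never cross a header."""
    chunks = []
    for title, body in split_sections(markdown_text):
        words = body.split()  # crude stand-in for a real tokenizer
        for start in range(0, len(words), max_tokens):
            piece = " ".join(words[start:start + max_tokens])
            chunks.append({"section": title, "text": piece})
    return chunks
```

Because every chunk records the section it came from, a search hit can later be expanded back to its enclosing section.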

Supported Formats

| Format | Extensions | Notes |
|--------|------------|-------|
| Text | `.txt` | Plain text |
| Markdown | `.md`, `.markdown` | Preserves headers for boundaries |
| PDF | `.pdf` | Text extraction with page markers |
| Word | `.docx` | Paragraphs, headings, tables |
| PowerPoint | `.pptx` | Slides, notes, tables |
| Excel | `.xlsx`, `.xls` | Sheets as tables |

Why No External Services?

| Component | Traditional RAG | This Server |
|-----------|-----------------|-------------|
| Embeddings | OpenAI API / hosted model | Local ONNX model (fastembed) |
| Vector DB | Pinecone / Weaviate / Qdrant | Local file (usearch) |
| Storage | Cloud / managed DB | Local `.docindex/` directory |
| Dependencies | PyTorch (~2GB) | ONNX Runtime (~50MB) |

Tools

`doc_index`

Index a document for semantic search.

{
  "file_path": "docs/manual.pdf",
  "source_name": "manual"
}
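
An MCP client normally builds the wire message for you, but it can help to see how the arguments above travel to the server. The sketch below wraps them in a JSON-RPC 2.0 `tools/call` request, the method MCP uses for tool invocation. The helper name `build_tool_call` is hypothetical and not part of this project.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Wrap tool arguments in an MCP `tools/call` JSON-RPC request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "doc_index",
    {"file_path": "docs/manual.pdf", "source_name": "manual"},
)
print(json.dumps(request, indent=2))
```

The same envelope applies to the other tools; only `name` and `arguments` change.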

`doc_search`

Search indexed documents using natural language.

{
  "query": "how to configure authentication",
  "top_k": 5,
  "expand_to_boundary": "section",
  "max_return_tokens": 4096
}

Parameters:

- `query` - Search query
- `sources` - Filter to specific sources (optional)
- `top_k` - Number of results (default: 5)
- `expand_to_boundary` - Expand results to full "section" or "chapter"
- `max_return_tokens` - Token budget for results (default: 4096)
- `include_siblings` - Include sibling sections when expanding
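
One way `top_k` and `max_return_tokens` could interact is sketched below. This is an illustration under assumptions, not the server's implementation: results are ranked by score, capped at `top_k`, and then admitted in order until the token budget would be exceeded. The function name, the `score` and `text` keys, and the use of word counts as a token estimate are all hypothetical.

```python
def apply_token_budget(results, top_k=5, max_return_tokens=4096):
    """Keep the best-scoring results that fit within the token budget."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]
    kept, used = [], 0
    for result in ranked:
        cost = len(result["text"].split())  # word count as a rough token estimate
        if used + cost > max_return_tokens:
            break  # stop at the first result that would overflow the budget
        kept.append(result)
        used += cost
    return kept
```

Expanding a hit to its full section or chapter makes each result larger, so a budget like this is what keeps an expanded response from overflowing the client's context window.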

`doc_list`

List all indexed sources.

`doc_chunk`

Retrieve a specific chunk by ID with optional neighbors.

{
  "chunk_id": "manual:42",
  "neighbors": 2
}
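
The example ID `manual:42` suggests a `source:index` layout. Assuming that layout, zero-based sequential indices, and that `neighbors` means "this many chunks on each side" (none of which the description above confirms), neighbor lookup could be sketched like this. Both helper names are hypothetical.

```python
def parse_chunk_id(chunk_id):
    """Split an ID such as 'manual:42' into ('manual', 42)."""
    source, _, index = chunk_id.rpartition(":")
    if not source or not index.isdigit():
        raise ValueError(f"unexpected chunk id: {chunk_id!r}")
    return source, int(index)


def neighbor_ids(chunk_id, neighbors, total_chunks):
    """Return the IDs of the chunk and its neighbors, clamped to the source."""
    source, index = parse_chunk_id(chunk_id)
    first = max(0, index - neighbors)
    last = min(total_chunks - 1, index + neighbors)
    return [f"{source}:{i}" for i in range(first, last + 1)]
```

Clamping at both ends means a request near the start or end of a document simply returns fewer neighbors instead of failing.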

`read_document`

Read a document without indexing it. Returns the document's content as formatted text.

{
  "file_path": "report.pdf",
  "max_chars": 100000
}

`list_tables`

List all tables in a document.

{
  "file_path": "data.xlsx"
}

`extract_table`

Extract a specific table as CSV.

{
  "file_path": "data.xlsx",
  "table_index": 0,
  "max_rows": 100
}
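
As a rough picture of what "extract as CSV" with a row limit involves, the sketch below serializes an in-memory table using Python's standard `csv` module. The function name is hypothetical, and whether the server counts the header row toward `max_rows` is not stated above; here every row counts.

```python
import csv
import io


def table_to_csv(rows, max_rows=100):
    """Serialize the first max_rows rows of a table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows[:max_rows]:
        writer.writerow(row)  # quoting of commas and quotes is handled by csv
    return buffer.getvalue()
```

Using the `csv` writer instead of joining cells with commas matters for real spreadsheets, where cells often contain commas, quotes, or line breaks.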

Installation

pip install -r requirements.txt

Or with uv:

uv pip install -r requirements.txt

Configuration

Add the server to your Claude Desktop (or other MCP client) configuration:

{
  "mcpServers": {
    "doc-index": {
      "command": "python",
      "args": ["/path/to/doc-index-mcp/src/server.py"],
      "env": {
        "MCP_WORKING_DIR": "/path/to/your/project",
        "DOC_INDEX_DIR": "/path/to/store/indices"
      }
    }
  }
}

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `MCP_WORKING_DIR` | Base directory for resolving file paths | Current working directory |
| `DOC_INDEX_DIR` | Directory for storing vector indices | `.docindex` in working dir |
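
The defaults in the table can be read as the following resolution logic. This is a sketch of the documented behavior, not code from `server.py`: the function names are hypothetical, and the rule that absolute file paths bypass the working directory is an assumption.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Return (working_dir, index_dir) using the documented defaults."""
    env = os.environ if env is None else env
    working_dir = Path(env.get("MCP_WORKING_DIR") or Path.cwd())
    index_dir = Path(env.get("DOC_INDEX_DIR") or working_dir / ".docindex")
    return working_dir, index_dir


def resolve_file(file_path, working_dir):
    """Resolve a tool's file_path against the working directory."""
    path = Path(file_path)
    # Assumption: absolute paths are used as given.
    return path if path.is_absolute() else working_dir / path
```

With `MCP_WORKING_DIR=/path/to/your/project`, a call such as `doc_index` with `"file_path": "docs/manual.pdf"` would then read `/path/to/your/project/docs/manual.pdf`.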

Architecture

Everything runs locally: no external APIs, databases, or embedding servers are required. The embedding model (BAAI/bge-small-en-v1.5, ONNX format) is downloaded once.

flowchart TB
    subgraph Client["MCP Client (Claude Desktop, etc.)"]
        LLM[LLM]
    end

    subgraph MCP["Doc Index MCP Server"]
        Server[server.py]

        subgraph Services["Local Services"]
            Loader[Document Loader<br/>PDF, DOCX, PPTX, XLSX]
            Chunker[Boundary-Aware<br/>Chunker]
            Embedder[Embedder<br/>ONNX Runtime]
            VectorStore[Vector Store<br/>usearch]
        end
    end

    subgraph Storage["Local Filesystem"]
        Docs[(Source<br/>Documents)]
        Index[(".docindex/<br/>├── manifest.json<br/>└── vectors/<br/>    ├── index.usearch<br/>    ├── chunks.jsonl<br/>    └── boundaries.json")]
    end

    subgraph Models["Embedding Model (downloaded once)"]
        ONNX[BAAI/bge-small-en-v1.5<br/>ONNX format ~50MB]
    end

    LLM <-->|MCP Protocol| Server
    Server --> Loader
    Server --> Chunker
    Server --> Embedder
    Server --> VectorStore

    Loader -->|read| Docs
    VectorStore <-->|read/write| Index
    Embedder -->|load once| ONNX

    style Client fill:#e1f5fe
    style Storage fill:#fff3e0
    style Models fill:#f3e5f5
    style MCP fill:#e8f5e9

Data Flow

flowchart LR
    subgraph Index["Indexing"]
        direction TB
        A[Document] --> B[Load & Extract Text]
        B --> C[Detect Boundaries]
        C --> D[Chunk ~256 tokens]
        D --> E[Generate Embeddings]
        E --> F[Save to Disk]
    end

    subgraph Search["Searching"]
        direction TB
        G[Query] --> H[Embed Query]
        H --> I[Vector Similarity Search]
        I --> J[Expand to Boundaries]
        J --> K[Return Results]
    end

    Index -.->|stored in .docindex/| Search
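
The "Embed Query" and "Vector Similarity Search" steps reduce to comparing one query vector against the stored chunk vectors. The sketch below does this by brute force in pure Python as a stand-in for usearch. Cosine similarity is an assumption (the metric the server uses is not stated above), the toy two-dimensional vectors replace real embeddings, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, index, top_k=5):
    """Return the top_k (score, chunk_id) pairs, best match first."""
    scored = [(cosine(query_vector, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy index: chunk ID -> embedding vector.
index = {
    "manual:0": [1.0, 0.0],
    "manual:1": [0.0, 1.0],
    "manual:2": [1.0, 1.0],
}
hits = search([1.0, 0.0], index, top_k=2)
```

A dedicated vector index answers the same question without scanning every stored vector, which is what keeps search fast as the number of chunks grows.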

License

MIT

Project details

Release history

This version

0.1.0

Mar 12, 2026

0.1.0.dev0 pre-release

Mar 12, 2026

Download files

Source Distribution

doc_index_mcp-0.1.0.tar.gz (196.6 kB)

Uploaded Mar 12, 2026 Source

Built Distribution

doc_index_mcp-0.1.0-py3-none-any.whl (72.8 kB)

Uploaded Mar 12, 2026 Python 3

File details

Details for the file doc_index_mcp-0.1.0.tar.gz.

File metadata

Download URL: doc_index_mcp-0.1.0.tar.gz
Upload date: Mar 12, 2026
Size: 196.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for doc_index_mcp-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c61f03ed535b84a46ab2357f2f6eacbe823fc9a2f27e0ef392835b9c7782b441`
MD5	`645256d72cef58c8b8d04a0a3b486b49`
BLAKE2b-256	`0b44b9aa9d9fa8abdce8a53908a8f2477fd7a2fe3e064cb3c77b7257f005605b`

Provenance

The following attestation bundles were made for doc_index_mcp-0.1.0.tar.gz:

Publisher: publish.yml on mike-anderson/doc-index-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: doc_index_mcp-0.1.0.tar.gz
- Subject digest: c61f03ed535b84a46ab2357f2f6eacbe823fc9a2f27e0ef392835b9c7782b441
- Sigstore transparency entry: 1088963786
- Sigstore integration time: Mar 12, 2026
Source repository:
- Permalink: mike-anderson/doc-index-mcp@d0765bfcd9c6e8d0c33363c3c7a3d1f4895cb9e5
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mike-anderson
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d0765bfcd9c6e8d0c33363c3c7a3d1f4895cb9e5
- Trigger Event: push

File details

Details for the file doc_index_mcp-0.1.0-py3-none-any.whl.

File metadata

Download URL: doc_index_mcp-0.1.0-py3-none-any.whl
Upload date: Mar 12, 2026
Size: 72.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for doc_index_mcp-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`175a649a3e290c812685c1508a58dbbe71ddcf10bc17fd63e8fbb42487abf412`
MD5	`b30ce26e11cea19542019444853ea2c5`
BLAKE2b-256	`cf0849f9f4d7e6ff5fe76980343f5eb5eabcab62c7524b11ae5ec41145157244`

Provenance

The following attestation bundles were made for doc_index_mcp-0.1.0-py3-none-any.whl:

Publisher: publish.yml on mike-anderson/doc-index-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: doc_index_mcp-0.1.0-py3-none-any.whl
- Subject digest: 175a649a3e290c812685c1508a58dbbe71ddcf10bc17fd63e8fbb42487abf412
- Sigstore transparency entry: 1088963809
- Sigstore integration time: Mar 12, 2026
Source repository:
- Permalink: mike-anderson/doc-index-mcp@d0765bfcd9c6e8d0c33363c3c7a3d1f4895cb9e5
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mike-anderson
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d0765bfcd9c6e8d0c33363c3c7a3d1f4895cb9e5
- Trigger Event: push
