MCP server for semantic document search with boundary-aware chunking

Doc Index MCP

What is This For?

A local-first semantic search server for your documents. Index PDFs, Word docs, PowerPoints, Excel files, and text/markdown, then search them using natural language via the Model Context Protocol (MCP).

  • Semantic search - Find relevant content using natural language queries
  • Boundary-aware chunking - Respects document structure (chapters, sections, headers)
  • Table extraction - Extract tables from documents as CSV
  • Fully local - No external APIs, no cloud services, no PyTorch
  • Lightweight - ONNX-based embeddings (~50MB vs ~2GB for PyTorch)
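
The boundary-aware chunking idea in the list above can be illustrated with a small sketch. This is not the server's actual chunker: it assumes Markdown-style `#` headers are the only boundaries and approximates tokens by whitespace-separated words, and the function name `chunk_by_boundaries` is hypothetical.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")  # Markdown-style header lines


def chunk_by_boundaries(text: str, max_tokens: int = 256) -> list[dict]:
    """Split text into chunks that never span a header boundary.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    chunks: list[dict] = []
    section = "(preamble)"
    words: list[str] = []

    def flush() -> None:
        # Emit the buffered words as one chunk tagged with its section title.
        if words:
            chunks.append({"section": section, "text": " ".join(words)})
            words.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # a header always closes the current chunk
            section = match.group(2).strip()
            continue
        for word in line.split():
            words.append(word)
            if len(words) >= max_tokens:
                flush()  # size limit reached inside a section
    flush()
    return chunks
```

Because every chunk carries its section title, a search hit can later be expanded to the whole section by collecting the chunks that share that title.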

Supported Formats

| Format | Extensions | Notes |
| --- | --- | --- |
| Text | .txt | Plain text |
| Markdown | .md, .markdown | Preserves headers for boundaries |
| PDF | .pdf | Text extraction with page markers |
| Word | .docx | Paragraphs, headings, tables |
| PowerPoint | .pptx | Slides, notes, tables |
| Excel | .xlsx, .xls | Sheets as tables |

Why No External Services?

| Component | Traditional RAG | This Server |
| --- | --- | --- |
| Embeddings | OpenAI API / hosted model | Local ONNX model (fastembed) |
| Vector DB | Pinecone / Weaviate / Qdrant | Local file (usearch) |
| Storage | Cloud / managed DB | Local .docindex/ directory |
| Dependencies | PyTorch (~2GB) | ONNX Runtime (~50MB) |

Tools

doc_index

Index a document for semantic search.

{
  "file_path": "docs/manual.pdf",
  "source_name": "manual"
}
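
An MCP client normally sends arguments like these for you. For readers curious about the wire format, the sketch below wraps them in the JSON-RPC 2.0 `tools/call` request that MCP defines. The helper name `build_tool_call` is hypothetical, and transport details (stdio framing, initialization handshake) are left out.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def build_tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


request = build_tool_call(
    "doc_index",
    {"file_path": "docs/manual.pdf", "source_name": "manual"},
)
print(request)
```

The same helper works for the other tools in this section by changing the tool name and arguments.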

doc_search

Search indexed documents using natural language.

{
  "query": "how to configure authentication",
  "top_k": 5,
  "expand_to_boundary": "section",
  "max_return_tokens": 4096
}

Parameters:

  • query - Search query
  • sources - Filter to specific sources (optional)
  • top_k - Number of results (default: 5)
  • expand_to_boundary - Expand each result to its enclosing "section" or "chapter"
  • max_return_tokens - Token budget for results (default: 4096)
  • include_siblings - Include sibling sections when expanding
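
To make the `max_return_tokens` budget concrete, here is a minimal sketch of how a budget of this kind could trim a ranked result list. It is an illustration only: the server's real tokenizer and trimming policy are not documented here, so the word-count estimate and the function name `apply_token_budget` are assumptions.

```python
def apply_token_budget(results: list[dict], max_return_tokens: int = 4096) -> list[dict]:
    """Keep the highest-ranked results whose combined size fits the budget."""
    kept = []
    used = 0
    for result in results:  # assumed to be sorted best-first
        cost = len(result["text"].split())  # crude word-count token estimate
        if used + cost > max_return_tokens:
            break  # stop at the first result that would exceed the budget
        kept.append(result)
        used += cost
    return kept
```

Expanding results to a full section or chapter makes each result larger, so fewer of them fit under the same budget.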

doc_list

List all indexed sources.

doc_chunk

Retrieve a specific chunk by ID with optional neighbors.

{
  "chunk_id": "manual:42",
  "neighbors": 2
}
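
The example ID suggests a `source:index` layout. Assuming that layout and zero-based indices (neither is confirmed by the documentation), the neighbor window could be computed as in this sketch; `neighbor_ids` is a hypothetical helper.

```python
def neighbor_ids(chunk_id: str, neighbors: int, total_chunks: int) -> list[str]:
    """Return the IDs of a chunk and its neighbors, clamped to the source's range."""
    # Split on the last colon so source names containing ":" still parse.
    source, _, index_text = chunk_id.rpartition(":")
    index = int(index_text)
    first = max(0, index - neighbors)
    last = min(total_chunks - 1, index + neighbors)
    return [f"{source}:{i}" for i in range(first, last + 1)]
```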

read_document

Read a document without indexing. Returns formatted text.

{
  "file_path": "report.pdf",
  "max_chars": 100000
}

list_tables

List all tables in a document.

{
  "file_path": "data.xlsx"
}

extract_table

Extract a specific table as CSV.

{
  "file_path": "data.xlsx",
  "table_index": 0,
  "max_rows": 100
}
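
As a rough picture of what CSV extraction with a row limit involves, the sketch below serializes a table held as a list of rows using Python's standard `csv` module. Whether the server counts the header row toward `max_rows` is not stated, so this sketch simply truncates all rows; `table_to_csv` is a hypothetical name.

```python
import csv
import io


def table_to_csv(rows: list[list], max_rows: int = 100) -> str:
    """Serialize at most max_rows rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows[:max_rows])  # csv handles quoting of commas and quotes
    return buffer.getvalue()


table = [["name", "qty"], ["bolt, M4", 3], ["nut", 5]]
print(table_to_csv(table, max_rows=2))
```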

Installation

pip install -r requirements.txt

Or with uv:

uv pip install -r requirements.txt

Configuration

Add the server to your Claude Desktop or other MCP client configuration:

{
  "mcpServers": {
    "doc-index": {
      "command": "python",
      "args": ["/path/to/doc-index-mcp/src/server.py"],
      "env": {
        "MCP_WORKING_DIR": "/path/to/your/project",
        "DOC_INDEX_DIR": "/path/to/store/indices"
      }
    }
  }
}

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| MCP_WORKING_DIR | Base directory for resolving file paths | Current working directory |
| DOC_INDEX_DIR | Directory for storing vector indices | .docindex in working dir |
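
The defaults described above can be expressed in a few lines. This is a sketch of the documented behavior rather than the server's code; the function names `resolve_dirs` and `resolve_file` are hypothetical, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path


def resolve_dirs(env=None) -> tuple[Path, Path]:
    """Return (working_dir, index_dir) using the documented defaults."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to its default.
    working_dir = Path(env.get("MCP_WORKING_DIR") or os.getcwd())
    index_dir = Path(env.get("DOC_INDEX_DIR") or working_dir / ".docindex")
    return working_dir, index_dir


def resolve_file(file_path: str, working_dir: Path) -> Path:
    """Resolve a tool's file_path argument against the working directory."""
    path = Path(file_path)
    return path if path.is_absolute() else working_dir / path
```

With `MCP_WORKING_DIR` set to your project root, a relative argument such as `docs/manual.pdf` resolves inside that project.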

Architecture

Everything runs locally. Once the embedding model has been downloaded, no external APIs, databases, or embedding servers are required.

flowchart TB
    subgraph Client["MCP Client (Claude Desktop, etc.)"]
        LLM[LLM]
    end

    subgraph MCP["Doc Index MCP Server"]
        Server[server.py]

        subgraph Services["Local Services"]
            Loader[Document Loader<br/>PDF, DOCX, PPTX, XLSX]
            Chunker[Boundary-Aware<br/>Chunker]
            Embedder[Embedder<br/>ONNX Runtime]
            VectorStore[Vector Store<br/>usearch]
        end
    end

    subgraph Storage["Local Filesystem"]
        Docs[(Source<br/>Documents)]
        Index[(".docindex/<br/>├── manifest.json<br/>└── vectors/<br/>    ├── index.usearch<br/>    ├── chunks.jsonl<br/>    └── boundaries.json")]
    end

    subgraph Models["Embedded Model (downloaded once)"]
        ONNX[BAAI/bge-small-en-v1.5<br/>ONNX format ~50MB]
    end

    LLM <-->|MCP Protocol| Server
    Server --> Loader
    Server --> Chunker
    Server --> Embedder
    Server --> VectorStore

    Loader -->|read| Docs
    VectorStore <-->|read/write| Index
    Embedder -->|load once| ONNX

    style Client fill:#e1f5fe
    style Storage fill:#fff3e0
    style Models fill:#f3e5f5
    style MCP fill:#e8f5e9

Data Flow

flowchart LR
    subgraph Index["Indexing"]
        direction TB
        A[Document] --> B[Load & Extract Text]
        B --> C[Detect Boundaries]
        C --> D[Chunk ~256 tokens]
        D --> E[Generate Embeddings]
        E --> F[Save to Disk]
    end

    subgraph Search["Searching"]
        direction TB
        G[Query] --> H[Embed Query]
        H --> I[Vector Similarity Search]
        I --> J[Expand to Boundaries]
        J --> K[Return Results]
    end

    Index -.->|stored in .docindex/| Search
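
The "Vector Similarity Search" step can be illustrated with a brute-force cosine similarity ranking. This is a stand-in for exposition only: the server uses usearch, an approximate nearest-neighbor index, and real embeddings come from the ONNX model rather than the toy vectors assumed here. The names `cosine` and `search` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], index: dict[str, list[float]], top_k: int = 5):
    """Rank every stored chunk vector against the query and keep the best top_k."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(chunk_id, score) for score, chunk_id in scored[:top_k]]


# Toy 2-dimensional "embeddings" keyed by chunk ID.
toy_index = {
    "manual:0": [1.0, 0.0],
    "manual:1": [0.0, 1.0],
    "manual:2": [0.7, 0.7],
}
hits = search([1.0, 0.1], toy_index, top_k=2)
print([chunk_id for chunk_id, _ in hits])
```

Brute force scans every vector, which is fine for a sketch; an approximate index trades a little recall for much faster lookups on large collections.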

License

MIT
