MCP server for semantic document search with boundary-aware chunking

Doc Index MCP

What Is This For?

A local-first semantic search server for your documents. Index PDFs, Word documents, PowerPoint decks, Excel workbooks, and plain-text or Markdown files, then search them in natural language through the Model Context Protocol (MCP).

  • Semantic search - Find relevant content using natural language queries
  • Boundary-aware chunking - Respects document structure (chapters, sections, headers)
  • Table extraction - Extract tables from documents as CSV
  • Fully local - No external APIs, no cloud services, no PyTorch
  • Lightweight - ONNX-based embeddings (~50MB vs ~2GB for PyTorch)
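
To make "boundary-aware chunking" concrete, here is a minimal sketch for Markdown input. It is not the server's actual chunker: the function names are hypothetical, headers are detected with a simple regex, and whitespace-separated words stand in for real tokens. The point it illustrates is that chunks are cut per section, so no chunk spans a header.

```python
import re

# Matches Markdown ATX headers such as "# Title" or "### Subsection".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown_text):
    """Split Markdown into (heading, body) pairs at header lines."""
    sections = []
    heading, body = None, []
    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the previous section before starting a new one.
            if heading is not None or any(l.strip() for l in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if heading is not None or any(l.strip() for l in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


def chunk_sections(markdown_text, max_tokens=256):
    """Chunk each section on its own so no chunk crosses a header boundary."""
    chunks = []
    for heading, body in split_sections(markdown_text):
        words = body.split()  # assumption: words approximate tokens
        for start in range(0, len(words), max_tokens):
            chunks.append({
                "section": heading,
                "text": " ".join(words[start:start + max_tokens]),
            })
    return chunks


chunks = chunk_sections("# A\none two three\n# B\nfour five", max_tokens=2)
for chunk in chunks:
    print(chunk["section"], "|", chunk["text"])
```

Because each chunk records the section it came from, a search hit can later be expanded back to the whole section.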

Supported Formats

| Format | Extensions | Notes |
|---|---|---|
| Text | .txt | Plain text |
| Markdown | .md, .markdown | Preserves headers for boundaries |
| PDF | .pdf | Text extraction with page markers |
| Word | .docx | Paragraphs, headings, tables |
| PowerPoint | .pptx | Slides, notes, tables |
| Excel | .xlsx, .xls | Sheets as tables |

Why No External Services?

| Component | Traditional RAG | This Server |
|---|---|---|
| Embeddings | OpenAI API / hosted model | Local ONNX model (fastembed) |
| Vector DB | Pinecone / Weaviate / Qdrant | Local file (usearch) |
| Storage | Cloud / managed DB | Local .docindex/ directory |
| Dependencies | PyTorch (~2GB) | ONNX Runtime (~50MB) |

Tools

doc_index

Index a document for semantic search.

{
  "file_path": "docs/manual.pdf",
  "source_name": "manual"
}
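
MCP clients normally send these arguments for you, but it can help to see the message on the wire. The sketch below builds the JSON-RPC `tools/call` request that MCP defines for invoking a tool. The `tool_call_request` helper is hypothetical, and the transport (stdio framing, session initialization) is left out.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def tool_call_request(name, arguments):
    """Build an MCP 'tools/call' JSON-RPC request for the named tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = tool_call_request(
    "doc_index",
    {"file_path": "docs/manual.pdf", "source_name": "manual"},
)
print(json.dumps(request, indent=2))
```

The same helper works for the other tools below: only `name` and `arguments` change.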

doc_search

Search indexed documents using natural language.

{
  "query": "how to configure authentication",
  "top_k": 5,
  "expand_to_boundary": "section",
  "max_return_tokens": 4096
}

Parameters:

  • query - Search query
  • sources - Filter to specific sources (optional)
  • top_k - Number of results (default: 5)
  • expand_to_boundary - Expand each result to its full enclosing "section" or "chapter"
  • max_return_tokens - Token budget for results (default: 4096)
  • include_siblings - Include sibling sections when expanding
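
One plausible way `top_k` and `max_return_tokens` could interact is sketched below: rank the hits, keep the best `top_k`, then stop adding results once the token budget would be exceeded. This is an illustration under stated assumptions, not the server's code. The `apply_budget` function and the result dictionaries are hypothetical, and word counts stand in for token counts.

```python
def apply_budget(results, top_k=5, max_return_tokens=4096):
    """Keep the highest-scoring results that fit inside the token budget."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]
    kept, used = [], 0
    for result in ranked:
        cost = len(result["text"].split())  # assumption: words ~ tokens
        if used + cost > max_return_tokens:
            break  # the next-best result no longer fits
        kept.append(result)
        used += cost
    return kept


hits = [
    {"score": 0.9, "text": "a b c"},
    {"score": 0.5, "text": "d e"},
    {"score": 0.7, "text": "f g h i"},
]
print([r["score"] for r in apply_budget(hits, top_k=2, max_return_tokens=10)])
```

Expanding results to a full section or chapter makes each one larger, so fewer of them fit in the same budget.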

doc_list

List all indexed sources.

doc_chunk

Retrieve a specific chunk by ID with optional neighbors.

{
  "chunk_id": "manual:42",
  "neighbors": 2
}
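
The example ID `manual:42` suggests a `source:index` layout. Assuming that layout, zero-based sequential indexes, and that `neighbors` means "this many chunks on each side" (none of which the text above confirms), the set of chunks returned could be computed like this. `neighbor_ids` is a hypothetical helper.

```python
def neighbor_ids(chunk_id, neighbors, total_chunks):
    """Return the IDs of a chunk and its neighbors, clamped to the source."""
    # rpartition keeps any ':' inside the source name intact.
    source, _, index_text = chunk_id.rpartition(":")
    index = int(index_text)
    first = max(0, index - neighbors)
    last = min(total_chunks - 1, index + neighbors)
    return [f"{source}:{i}" for i in range(first, last + 1)]


print(neighbor_ids("manual:42", 2, total_chunks=100))
```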

read_document

Read a document without indexing. Returns formatted text.

{
  "file_path": "report.pdf",
  "max_chars": 100000
}

list_tables

List all tables in a document.

{
  "file_path": "data.xlsx"
}

extract_table

Extract a specific table as CSV.

{
  "file_path": "data.xlsx",
  "table_index": 0,
  "max_rows": 100
}
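
As a rough illustration of what "extract as CSV" with a row cap involves, the sketch below serializes a table held as a list of rows using Python's standard `csv` module. The `table_to_csv` function is hypothetical, and whether the server counts the header row toward `max_rows` is an assumption made here for simplicity.

```python
import csv
import io


def table_to_csv(rows, max_rows=100):
    """Serialize at most max_rows rows (header included) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows[:max_rows])
    return buffer.getvalue()


table = [["name", "qty"], ["bolt, M4", 10], ["nut", 5]]
print(table_to_csv(table, max_rows=2))
```

Using `csv.writer` rather than joining on commas means that cells containing commas or quotes are escaped correctly.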

Installation

pip install -r requirements.txt

Or with uv:

uv pip install -r requirements.txt

Configuration

Add the server to your Claude Desktop (or other MCP client) configuration:

{
  "mcpServers": {
    "doc-index": {
      "command": "python",
      "args": ["/path/to/doc-index-mcp/src/server.py"],
      "env": {
        "MCP_WORKING_DIR": "/path/to/your/project",
        "DOC_INDEX_DIR": "/path/to/store/indices"
      }
    }
  }
}

Environment Variables

| Variable | Description | Default |
|---|---|---|
| MCP_WORKING_DIR | Base directory for resolving file paths | Current working directory |
| DOC_INDEX_DIR | Directory for storing vector indices | .docindex in working dir |
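
The defaults above can be read as the following resolution logic. This is a sketch of the documented behavior rather than the server's source: `resolve_dirs` and `resolve_file` are hypothetical names, and treating an empty variable the same as an unset one is an assumption.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Return (working_dir, index_dir) using the documented defaults."""
    env = os.environ if env is None else env
    working_dir = Path(env.get("MCP_WORKING_DIR") or Path.cwd())
    index_dir = Path(env.get("DOC_INDEX_DIR") or working_dir / ".docindex")
    return working_dir, index_dir


def resolve_file(file_path, working_dir):
    """Resolve a tool's file_path argument against the working directory."""
    path = Path(file_path)
    return path if path.is_absolute() else working_dir / path


working_dir, index_dir = resolve_dirs({"MCP_WORKING_DIR": "/tmp/proj"})
print(index_dir)
print(resolve_file("docs/manual.pdf", working_dir))
```

With this reading, a relative `file_path` such as `docs/manual.pdf` is looked up under `MCP_WORKING_DIR`.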

Architecture

Everything runs locally: no external APIs, databases, or embedding servers are required. The embedding model is downloaded once and loaded from disk after that.

flowchart TB
    subgraph Client["MCP Client (Claude Desktop, etc.)"]
        LLM[LLM]
    end

    subgraph MCP["Doc Index MCP Server"]
        Server[server.py]

        subgraph Services["Local Services"]
            Loader[Document Loader<br/>PDF, DOCX, PPTX, XLSX]
            Chunker[Boundary-Aware<br/>Chunker]
            Embedder[Embedder<br/>ONNX Runtime]
            VectorStore[Vector Store<br/>usearch]
        end
    end

    subgraph Storage["Local Filesystem"]
        Docs[(Source<br/>Documents)]
        Index[(".docindex/<br/>├── manifest.json<br/>└── vectors/<br/>    ├── index.usearch<br/>    ├── chunks.jsonl<br/>    └── boundaries.json")]
    end

    subgraph Models["Embedded Model (downloaded once)"]
        ONNX[BAAI/bge-small-en-v1.5<br/>ONNX format ~50MB]
    end

    LLM <-->|MCP Protocol| Server
    Server --> Loader
    Server --> Chunker
    Server --> Embedder
    Server --> VectorStore

    Loader -->|read| Docs
    VectorStore <-->|read/write| Index
    Embedder -->|load once| ONNX

    style Client fill:#e1f5fe
    style Storage fill:#fff3e0
    style Models fill:#f3e5f5
    style MCP fill:#e8f5e9

Data Flow

flowchart LR
    subgraph Index["Indexing"]
        direction TB
        A[Document] --> B[Load & Extract Text]
        B --> C[Detect Boundaries]
        C --> D[Chunk ~256 tokens]
        D --> E[Generate Embeddings]
        E --> F[Save to Disk]
    end

    subgraph Search["Searching"]
        direction TB
        G[Query] --> H[Embed Query]
        H --> I[Vector Similarity Search]
        I --> J[Expand to Boundaries]
        J --> K[Return Results]
    end

    Index -.->|stored in .docindex/| Search
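
The "Vector Similarity Search" step can be sketched as a brute-force cosine ranking. The server uses usearch for this. The pure-Python version below is only a stand-in to show the idea, with hypothetical names and tiny two-dimensional vectors in place of real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=5):
    """Rank every stored chunk vector against the query; return the best top_k."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:top_k]]


# Toy index: chunk ID -> embedding (real embeddings have hundreds of dimensions).
index = {
    "manual:0": [1.0, 0.0],
    "manual:1": [0.0, 1.0],
    "manual:2": [1.0, 1.0],
}
for chunk_id, score in search([1.0, 0.1], index, top_k=2):
    print(chunk_id, round(score, 3))
```

A dedicated vector index avoids scanning every vector on each query, which matters once a collection holds many thousands of chunks.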

License

MIT
