MCP server for local document indexing and search using LanceDB

MCP Document Indexer

A Python-based MCP (Model Context Protocol) server for local document indexing and search using LanceDB vector database and local LLMs.

Features

- Real-time Document Monitoring: Automatically indexes new and modified documents in configured folders
- Multi-format Support: Handles PDF, Word (docx/doc), text, Markdown, and RTF files
- Local LLM Integration: Uses Ollama for document summarization and keyword extraction, so nothing leaves your computer
- Vector Search: Semantic search using LanceDB and sentence transformers
- MCP Integration: Exposes search and catalog tools via Model Context Protocol
- Incremental Indexing: Only processes changed files to save resources
- Performance Optimized: Designed to run acceptably on standard laptops (e.g. M1/M2 MacBook)

Installation

Prerequisites

Python 3.9+ installed
uv package manager:

curl -LsSf https://astral.sh/uv/install.sh | sh

Ollama (for local LLM):

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (e.g., llama3.2)
ollama pull llama3.2:3b

Install MCP Document Indexer

# Clone the repository
git clone https://github.com/yairwein/mcp-doc-indexer.git
cd mcp-doc-indexer

# Install with uv
uv sync

# Or add it as a dependency to an existing uv project
uv add mcp-doc-indexer

Configuration

Configure the indexer using environment variables or a .env file:

# Folders to monitor (comma-separated)
WATCH_FOLDERS="/Users/me/Documents,/Users/me/Research"

# LanceDB storage path
LANCEDB_PATH="./vector_index"

# Ollama model for summarization
LLM_MODEL="llama3.2:3b"

# Text chunking settings
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Embedding model (sentence-transformers)
EMBEDDING_MODEL="all-MiniLM-L6-v2"

# File types to index
FILE_EXTENSIONS=".pdf,.docx,.doc,.txt,.md,.rtf"

# Maximum file size in MB
MAX_FILE_SIZE_MB=100

# Ollama API URL
OLLAMA_BASE_URL="http://localhost:11434"
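
The settings above can be read into a typed object along these lines. This is a minimal sketch, not the project's actual loader: the `IndexerConfig` class and `load_config` function are hypothetical names, and the fallback values simply mirror the example values shown above (the real defaults are not documented here).

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class IndexerConfig:
    """Hypothetical container; the project's real config class may differ."""
    watch_folders: List[str]
    lancedb_path: str
    llm_model: str
    chunk_size: int
    chunk_overlap: int
    embedding_model: str
    file_extensions: List[str]
    max_file_size_mb: int
    ollama_base_url: str


def _split_csv(value: str) -> List[str]:
    """Split a comma-separated value, dropping blanks and stray whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_config(env: Mapping[str, str] = os.environ) -> IndexerConfig:
    """Build a config from environment variables.

    Fallbacks are assumptions copied from the README's example values.
    """
    config = IndexerConfig(
        watch_folders=_split_csv(env.get("WATCH_FOLDERS", "")),
        lancedb_path=env.get("LANCEDB_PATH", "./vector_index"),
        llm_model=env.get("LLM_MODEL", "llama3.2:3b"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        file_extensions=_split_csv(
            env.get("FILE_EXTENSIONS", ".pdf,.docx,.doc,.txt,.md,.rtf")
        ),
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "100")),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
    )
    # An overlap at least as large as the chunk would never advance through the text.
    if config.chunk_overlap >= config.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return config
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without touching the real environment.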

Usage

Run as Standalone Service

# Set environment variables
export WATCH_FOLDERS="/path/to/documents"
export LANCEDB_PATH="./my_index"

# Run the indexer
uv run python -m src.main

Integrate with Claude Desktop

Add to your Claude Desktop configuration (on macOS, ~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "doc-indexer": {
      "command": "uv",
      "args": [
        "run",
        "--directory",
        "/path/to/mcp-doc-indexer",
        "python",
        "-m",
        "src.main"
      ],
      "env": {
        "WATCH_FOLDERS": "/Users/me/Documents,/Users/me/Research",
        "LANCEDB_PATH": "/Users/me/.mcp-doc-index",
        "LLM_MODEL": "llama3.2:3b"
      }
    }
  }
}
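
If the configuration file already lists other MCP servers, editing the JSON by hand risks overwriting them. The sketch below merges the entry shown above into an existing configuration dictionary. The helper name `add_mcp_server` is hypothetical; reading and writing the file itself is left to the caller.

```python
import json
from copy import deepcopy
from typing import Any, Dict, Mapping


def add_mcp_server(
    config: Dict[str, Any],
    name: str,
    repo_dir: str,
    env: Mapping[str, str],
) -> Dict[str, Any]:
    """Return a copy of `config` with a uv-launched server entry added under `name`."""
    updated = deepcopy(config)  # leave the caller's dictionary untouched
    servers = updated.setdefault("mcpServers", {})
    servers[name] = {
        "command": "uv",
        "args": ["run", "--directory", repo_dir, "python", "-m", "src.main"],
        "env": dict(env),
    }
    return updated


existing = {"mcpServers": {"some-other-server": {"command": "example"}}}
merged = add_mcp_server(
    existing,
    "doc-indexer",
    "/path/to/mcp-doc-indexer",
    {"WATCH_FOLDERS": "/Users/me/Documents", "LANCEDB_PATH": "/Users/me/.mcp-doc-index"},
)
print(json.dumps(merged, indent=2))
```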

MCP Tools

The indexer exposes the following tools via MCP:

`search_documents`

Search for documents using natural language queries.

Parameters:
- query: Search query text
- limit: Maximum number of results (default: 10)
- search_type: "documents" or "chunks"
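
For clients other than Claude Desktop, an MCP tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request for `search_documents` using the parameters listed above. The helper name `build_search_request` is hypothetical, and defaulting `search_type` to "documents" is an assumption, since the default is not stated here.

```python
import json
from typing import Any, Dict


def build_search_request(
    query: str,
    limit: int = 10,
    search_type: str = "documents",  # assumed default; not documented above
    request_id: int = 1,
) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 `tools/call` request for the search_documents tool."""
    if search_type not in ("documents", "chunks"):
        raise ValueError('search_type must be "documents" or "chunks"')
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search_documents",
            "arguments": {
                "query": query,
                "limit": limit,
                "search_type": search_type,
            },
        },
    }


print(json.dumps(build_search_request("machine learning", limit=5), indent=2))
```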

`get_catalog`

List all indexed documents with summaries.

Parameters:
- skip: Number of documents to skip (default: 0)
- limit: Maximum documents to return (default: 100)
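
The `skip` and `limit` parameters allow paging through a large catalog. A minimal sketch of that loop follows. The `fetch_page` callable stands in for whatever client code actually calls `get_catalog`, and the assumption that each page comes back as a list of dictionaries is illustrative only.

```python
from typing import Callable, Dict, Iterator, List


def iter_catalog(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 100,
) -> Iterator[Dict]:
    """Yield every catalog entry by requesting pages until a short page appears."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        yield from page
        if len(page) < page_size:  # a short (or empty) page means the end was reached
            break
        skip += page_size


# Stand-in for a real get_catalog call, backed by an in-memory list.
fake_index = [{"file_path": f"/docs/{i}.pdf"} for i in range(250)]


def fake_fetch(skip: int, limit: int) -> List[Dict]:
    return fake_index[skip:skip + limit]


print(sum(1 for _ in iter_catalog(fake_fetch)))
```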

`get_document_info`

Get detailed information about a specific document.

Parameters:
- file_path: Path to the document

`reindex_document`

Force reindexing of a specific document.

Parameters:
- file_path: Path to the document to reindex

`get_indexing_stats`

Get current indexing statistics.

Example Usage in Claude

Once configured, you can use the indexer in Claude:

"Search my documents for information about machine learning"
"Show me all PDFs I've indexed"
"What documents mention Python programming?"
"Get details about /Users/me/Documents/report.pdf"
"Reindex the latest version of my thesis"

Architecture

┌─────────────────┐     ┌──────────────┐     ┌─────────────┐
│  File Monitor   │────▶│   Document   │────▶│  Local LLM  │
│   (Watchdog)    │     │    Parser    │     │  (Ollama)   │
└─────────────────┘     └──────────────┘     └─────────────┘
                               │                      │
                               ▼                      ▼
                        ┌──────────────┐     ┌─────────────┐
                        │   LanceDB    │◀────│  Embeddings │
                        │   Storage    │     │  (ST Model) │
                        └──────────────┘     └─────────────┘
                               │
                               ▼
                        ┌──────────────┐
                        │  FastMCP     │
                        │   Server     │
                        └──────────────┘
                               │
                               ▼
                        ┌──────────────┐
                        │    Claude    │
                        │   Desktop    │
                        └──────────────┘

File Processing Pipeline

1. File Detection: Watchdog monitors configured folders for changes
2. Document Parsing: Extracts text from PDF, Word, text, Markdown, and RTF files
3. Text Chunking: Splits documents into overlapping chunks for better retrieval
4. LLM Processing: Generates summaries and extracts keywords using Ollama
5. Embedding Generation: Creates vector embeddings using sentence transformers
6. Vector Storage: Stores documents and chunks in LanceDB
7. MCP Exposure: Makes search and catalog tools available via MCP
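
To illustrate the text chunking step, here is a minimal sliding-window sketch that uses the `CHUNK_SIZE` and `CHUNK_OVERLAP` values from the configuration. It assumes chunks are measured in characters; the indexer's actual splitter may work on tokens or sentence boundaries instead, and `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into chunks of up to `chunk_size` characters.

    Each chunk starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is the usual motivation for this approach.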

Performance Considerations

- Incremental Indexing: Only changed files are reprocessed
- Async Processing: Multiple documents are processed concurrently
- Batch Operations: Efficient batch indexing for multiple files
- Debouncing: Prevents duplicate processing of rapidly changing files
- Size Limits: Configurable maximum file size to prevent memory issues
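
Two of these ideas, incremental indexing and debouncing, can be sketched with the standard library alone. The sketch assumes change detection by SHA-256 content hash and a fixed quiet period per path. The indexer may instead compare modification times or rely on Watchdog-specific logic, and every name below is hypothetical.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Dict, List


def file_fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(path: Path, known: Dict[str, str]) -> bool:
    """Return True (and record the new fingerprint) when the content changed."""
    fingerprint = file_fingerprint(path)
    key = str(path)
    if known.get(key) == fingerprint:
        return False
    known[key] = fingerprint
    return True


class Debouncer:
    """Hold back paths until no new event has arrived for `delay_seconds`."""

    def __init__(self, delay_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._delay = delay_seconds
        self._clock = clock
        self._last_event: Dict[str, float] = {}

    def record(self, path: str) -> None:
        """Note a change event; repeated events restart the quiet period."""
        self._last_event[path] = self._clock()

    def ready(self) -> List[str]:
        """Pop and return the paths that have been quiet long enough."""
        now = self._clock()
        due = [p for p, seen in self._last_event.items() if now - seen >= self._delay]
        for p in due:
            del self._last_event[p]
        return due
```

A watcher would call `record` on every file event, poll `ready` periodically, and pass each returned path through `needs_reindex` before doing any expensive parsing.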

Troubleshooting

Ollama Not Available

If Ollama is not running or the model isn't available, the indexer falls back to simple text extraction without summarization.

# Check Ollama status
ollama list

# Pull required model
ollama pull llama3.2:3b
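
The same check can be made programmatically against the Ollama HTTP API, whose `/api/tags` endpoint lists locally available models. This is a sketch of one way to decide whether to fall back, not the indexer's own logic; the function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def model_in_tags(payload: Dict[str, Any], model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    names = {entry.get("name") for entry in payload.get("models", [])}
    return model in names


def ollama_has_model(
    model: str,
    base_url: str = "http://localhost:11434",
    timeout: float = 2.0,
) -> bool:
    """Return True if Ollama is reachable at `base_url` and lists `model`."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError is a subclass);
        # ValueError covers a response body that is not valid JSON.
        return False
    return model_in_tags(payload, model)


if not ollama_has_model("llama3.2:3b"):
    print("Ollama or the model is unavailable; summaries would be skipped")
```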

Permission Issues

Ensure the user running the indexer has read access to monitored folders:

chmod -R u+rX /path/to/documents

Memory Usage

For large document collections, consider:

- Reducing CHUNK_SIZE to create smaller chunks
- Limiting MAX_FILE_SIZE_MB to skip very large files
- Using a smaller embedding model
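
To get a feel for how these settings interact, the rough estimate below counts chunks and raw embedding bytes for a given amount of text. It assumes character-based sliding-window chunking and 4-byte float32 vectors, and uses the 384-dimensional output of all-MiniLM-L6-v2. It ignores stored text, metadata, and LanceDB overhead, so treat the numbers as a lower bound rather than a measurement.

```python
import math

EMBEDDING_DIM = 384      # output dimension of all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4    # assumed storage type for each vector component


def estimate_chunks(total_chars: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Estimate how many overlapping chunks a text of `total_chars` produces."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    if total_chars <= 0:
        return 0
    step = chunk_size - chunk_overlap
    return max(1, math.ceil((total_chars - chunk_overlap) / step))


def estimate_embedding_bytes(
    total_chars: int,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    dim: int = EMBEDDING_DIM,
) -> int:
    """Estimate raw vector storage: one `dim`-sized float32 vector per chunk."""
    return estimate_chunks(total_chars, chunk_size, chunk_overlap) * dim * BYTES_PER_FLOAT32


for size, overlap in [(1000, 200), (500, 100)]:
    chunks = estimate_chunks(1_000_000, size, overlap)
    megabytes = estimate_embedding_bytes(1_000_000, size, overlap) / 1e6
    print(f"CHUNK_SIZE={size}: {chunks} chunks, about {megabytes:.2f} MB of vectors")
```

Under these assumptions, smaller chunks reduce the size of each unit handed to the embedding model and LLM but increase the number of stored vectors, so the setting is a trade-off rather than a pure saving.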

Development

Running Tests

uv run pytest tests/

Code Formatting

uv run black src/
uv run ruff check src/

Building Package

uv build

License

MIT License - See LICENSE file for details

Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request

Support

For issues or questions:

- Open an issue on GitHub
- Check the troubleshooting section
- Review logs in the console output

Project details

Version 0.1.0, released Feb 13, 2026

