Document Search MCP Server with extensible plugin architecture

Document Search MCP

A Model Context Protocol (MCP) server that provides intelligent document search across multiple sources, starting with Google Drive integration.

Overview

This MCP server lets AI assistants such as Claude Desktop search and retrieve documents from connected sources. It implements the Model Context Protocol and provides an extensible architecture for adding new document connectors.

Features

  • ๐Ÿ” Multi-source document search - Search across Google Drive documents, sheets, and presentations
  • ๐Ÿ” OAuth 2.0 authentication - Secure authentication with environment-based credentials
  • ๐Ÿ“„ Full content retrieval - Get complete document content for analysis
  • ๐Ÿ”Œ Extensible plugin system - Ready framework for custom enhancements
  • ๐Ÿ—๏ธ Modular architecture - Clean separation of connectors, models, and search orchestration

Quick Start

Prerequisites

  • Python 3.11+
  • Google OAuth 2.0 credentials (for Google Drive integration)

Installation

# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone <repository-url>
cd document-search-mcp
uv venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"

Configuration

  1. Set up Google OAuth credentials:
export GOOGLE_CLIENT_ID="your-client-id"
export GOOGLE_CLIENT_SECRET="your-client-secret"
  2. Configure Claude Desktop by adding to ~/.claude/mcp_servers.json:
{
  "mcpServers": {
    "document-search": {
      "command": "document-search-mcp"
    }
  }
}
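
The entry above can also be merged into an existing configuration programmatically, which avoids overwriting other servers already listed under mcpServers. The following is a minimal sketch and not part of the project; the helper name add_mcp_server is hypothetical.

```python
import json


def add_mcp_server(config: dict, name: str, command: str) -> dict:
    """Return a copy of config with an MCP server entry added or replaced.

    Hypothetical helper: existing entries under "mcpServers" are preserved
    and the input dictionary is left unmodified.
    """
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    servers[name] = {"command": command}
    updated["mcpServers"] = servers
    return updated


if __name__ == "__main__":
    # Print the JSON that would be written to the configuration file.
    merged = add_mcp_server({}, "document-search", "document-search-mcp")
    print(json.dumps(merged, indent=2))
```

Writing the result back to disk is left out on purpose, so the sketch has no side effects on a real configuration file.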

Usage

# Start the MCP server
document-search-mcp

# Or with debug logging
document-search-mcp --log-level DEBUG

MCP Tools

The server provides these MCP tools:

  • search_documents - Search across connected document sources
  • get_document_content - Retrieve full content from documents
  • list_sources - Show configured document sources and status
  • setup_google_drive - OAuth setup and configuration wizard

Supported Document Sources

  • ✅ Google Drive - Google Docs, Sheets, and Slides with OAuth 2.0
  • 🚧 Confluence - Planned (connector interface ready)
  • 🚧 SharePoint - Planned
  • 🚧 Other sources - Framework ready for extension

Development

Running Tests

# Run all tests with coverage
pytest tests/ --cov=src --cov-report=html

# Run specific test categories
pytest -m unit              # Unit tests only
pytest -m integration       # Integration tests only  
pytest -m "not slow"        # Skip slow tests

Code Quality

# Type checking
mypy src/

# Linting and formatting
ruff check src/              # Lint check
ruff format src/             # Auto-format code

# Security scanning
bandit -r src/               # Security issues
safety check                 # Vulnerable dependencies

Adding New Document Connectors

Create a new connector by extending the base class:

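from __future__ import annotations

from collections.abc import AsyncIterator
from typing import Any

from src.connectors.base_connector import DocumentConnector
from src.models.document import Document
# DocumentMatch must be imported as well; its module is not shown in this README.


class MySourceConnector(DocumentConnector):
    async def get_documents(self, options: dict[str, Any] | None = None) -> AsyncIterator[Document]:
        # Async generator: yield documents one at a time as they are fetched
        documents: list[Document] = []  # Replace with documents fetched from your source
        for document in documents:
            yield document

    async def get_document(self, document_id: str) -> Document:
        # Fetch and return a single document by its ID
        raise NotImplementedError

    async def search_documents(self, query: str, options: dict[str, Any] | None = None) -> list[DocumentMatch]:
        # Run the query against your source and return ranked matches
        raise NotImplementedError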

Architecture

Core Components

  • MCP Server (src/server/mcp_server.py) - Main MCP protocol implementation
  • Document Connectors (src/connectors/) - Modular interfaces for document sources
  • Search Orchestrator (src/server/search_orchestrator.py) - Multi-source search coordination
  • Plugin System (src/plugins/) - Extensible framework for enhancements
  • Data Models (src/models/) - Document and search models with Pydantic validation
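
One way to picture the Search Orchestrator's role is a concurrent fan-out over all connectors followed by a merge of the results. The sketch below is an illustration only and does not reflect the actual contents of search_orchestrator.py; Match, StaticSource, and search_all are hypothetical names.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Match:  # hypothetical result type
    source: str
    title: str
    score: float


class StaticSource:
    """Hypothetical connector that returns canned results or fails."""

    def __init__(self, name: str, matches: list[Match], fail: bool = False) -> None:
        self.name = name
        self._matches = matches
        self._fail = fail

    async def search_documents(self, query: str) -> list[Match]:
        if self._fail:
            raise ConnectionError(f"{self.name} is unavailable")
        return [m for m in self._matches if query.lower() in m.title.lower()]


async def search_all(sources: list[StaticSource], query: str, limit: int = 10) -> list[Match]:
    """Query every source concurrently and merge results by descending score."""
    results = await asyncio.gather(
        *(source.search_documents(query) for source in sources),
        return_exceptions=True,
    )
    merged: list[Match] = []
    for result in results:
        if isinstance(result, BaseException):
            continue  # one failing source should not sink the whole search
        merged.extend(result)
    merged.sort(key=lambda m: m.score, reverse=True)
    return merged[:limit]
```

Passing return_exceptions=True to asyncio.gather is the design choice worth noting: an outage in one source degrades the result set instead of failing the whole request.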

Project Structure

src/
├── main.py                 # CLI entry point with Click interface
├── models/                 # Pydantic data models
├── connectors/             # Document source connectors
│   ├── base_connector.py   # Abstract base class
│   └── google_drive_connector.py  # Google Drive implementation
├── server/                 # MCP server implementation
│   ├── mcp_server.py       # Main MCP protocol handling
│   └── search_orchestrator.py     # Multi-source coordination
└── plugins/                # Plugin system framework
    └── base_plugin.py      # Plugin interfaces

tests/
└── test_basic.py          # Basic functionality tests

config/
├── config.yaml            # Default configuration
└── config.yaml.local      # Local development config

Configuration

The server uses environment-based configuration with automatic persistence:

  • OAuth credentials: Set via environment variables (never hardcoded)
  • Configuration file: Automatically saved to ~/.config/document-search-mcp/config.yaml
  • Setup wizard: Use the setup_google_drive MCP tool for guided OAuth setup
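
Reading credentials from the environment and failing early when one is missing might look like the sketch below. The variable names GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET come from the Quick Start; the GoogleCredentials class and load_google_credentials function are hypothetical and not taken from the project's code.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass

REQUIRED_VARIABLES = ("GOOGLE_CLIENT_ID", "GOOGLE_CLIENT_SECRET")


@dataclass(frozen=True)
class GoogleCredentials:  # hypothetical container
    client_id: str
    client_secret: str


def load_google_credentials(env: Mapping[str, str] = os.environ) -> GoogleCredentials:
    """Read OAuth credentials from env, raising if any are unset or empty."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return GoogleCredentials(
        client_id=env["GOOGLE_CLIENT_ID"],
        client_secret=env["GOOGLE_CLIENT_SECRET"],
    )
```

Accepting the environment as a parameter keeps the function easy to test without modifying the real process environment.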

Google Drive Setup Process

  1. Set environment variables with your Google OAuth credentials
  2. Use setup_google_drive MCP tool with step: "start"
  3. Visit provided OAuth URL to authorize access
  4. Complete setup with step: "complete" and redirect URL
  5. Configuration persists automatically for future use
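
Steps 2 to 4 correspond to the standard OAuth 2.0 authorization-code flow. The sketch below shows the two pieces that involve URLs: building the authorization link and pulling the code out of the redirect. It is not the project's implementation. The function names, the read-only Drive scope, and the localhost redirect URI in the usage note are assumptions; the authorization endpoint is Google's documented one.

```python
from urllib.parse import parse_qs, urlencode, urlparse

AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
# Assumption: read-only Drive access is sufficient for search and retrieval.
DRIVE_READONLY_SCOPE = "https://www.googleapis.com/auth/drive.readonly"


def build_authorization_url(client_id: str, redirect_uri: str, state: str) -> str:
    """Build the URL the user visits to grant access (the "start" step)."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": DRIVE_READONLY_SCOPE,
        "access_type": "offline",  # request a refresh token
        "state": state,
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"


def extract_authorization_code(redirect_url: str, expected_state: str) -> str:
    """Validate the redirect URL and return its code (the "complete" step)."""
    query = parse_qs(urlparse(redirect_url).query)
    if "error" in query:
        raise ValueError(f"Authorization failed: {query['error'][0]}")
    if query.get("state", [""])[0] != expected_state:
        raise ValueError("State mismatch in redirect URL")
    if "code" not in query:
        raise ValueError("Redirect URL contains no authorization code")
    return query["code"][0]
```

For example, build_authorization_url("my-client-id", "http://localhost:8080/callback", state) yields the link to open in a browser, where state is a random value such as secrets.token_urlsafe(16). Exchanging the returned code for tokens is a further step not shown here.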

CI/CD Pipeline

The project uses GitLab CI with a PyPI publishing pipeline:

Pipeline Stages

  • validate - Code quality checks (ruff, mypy, bandit, safety)
  • build - Python package building
  • test - Package integrity testing and unit tests
  • publish - PyPI publishing (manual/tag-triggered)

Running Validation Locally

# Complete validation suite (matches CI)
ruff check src/
ruff format --check src/
mypy src/
bandit -r src/
safety check
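
If you prefer to see every failing check in one pass instead of stopping at the first, the same commands can be driven from a small script. This is a sketch, not something shipped with the project; run_checks is a hypothetical helper and the command list simply mirrors the suite above.

```python
import subprocess

# The same commands as the validation suite above.
CHECKS = [
    ["ruff", "check", "src/"],
    ["ruff", "format", "--check", "src/"],
    ["mypy", "src/"],
    ["bandit", "-r", "src/"],
    ["safety", "check"],
]


def run_checks(commands: list[list[str]]) -> list[list[str]]:
    """Run every command and return those that failed or could not be started."""
    failed = []
    for command in commands:
        try:
            result = subprocess.run(command, check=False)
        except FileNotFoundError:
            # The tool is not installed or not on PATH.
            failed.append(command)
            continue
        if result.returncode != 0:
            failed.append(command)
    return failed
```

Calling run_checks(CHECKS) from the repository root returns an empty list when everything passes.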

Package Management

This project uses uv for fast Python package management:

# Development environment setup
uv venv
uv pip install -e ".[dev]"

# Package building
python -m build

# Validate package
python -c "import tomllib; tomllib.load(open('pyproject.toml', 'rb'))"

Current Implementation Status

✅ Completed

  • Complete MCP server implementation with Google Drive integration
  • OAuth 2.0 authentication with environment-based credentials
  • Document search and content retrieval across Google Docs/Sheets/Slides
  • Extensible plugin architecture and data models
  • Comprehensive test framework with markers and coverage
  • GitLab CI/CD pipeline for Python package publishing
  • Type safety with strict mypy configuration (all type errors resolved)
  • Code formatting and linting with ruff

🚧 In Progress/Planned

  • Additional document connectors (Confluence, SharePoint, etc.)
  • Semantic search with vector embeddings
  • Plugin implementations for specific domains
  • Enhanced metadata extraction and filtering
  • Web-based configuration interface

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Run the full validation suite:
    ruff check src/ && ruff format --check src/ && mypy src/ && bandit -r src/ && safety check
    
  5. Submit a pull request

License

[Add your license information here]

Support

For issues and feature requests, please use the project's issue tracker.
