MCP server for PDF processing and semantic search with ChromaDB

MCP PDF ChromaDB Server

A Python-based Model Context Protocol (MCP) server that provides PDF document processing, vectorization, and semantic search capabilities using ChromaDB with local embeddings.

Available on PyPI: https://pypi.org/project/mcp-pdf-chroma/

Quick Start

Get started in 2 minutes:

# No installation required!
uvx mcp-pdf-chroma

Then add this to your MCP client configuration:

For VSCode (GitHub Copilot / Cline):

{
  "mcp.servers": {
    "pdf-chroma": {
      "command": "uvx",
      "args": ["mcp-pdf-chroma"]
    }
  }
}

📖 Detailed VSCode setup guide: See VSCODE_SETUP.md for command palette method, troubleshooting, and advanced configuration.

For Claude Desktop:

{
  "mcpServers": {
    "pdf-chroma": {
      "command": "uvx",
      "args": ["mcp-pdf-chroma"]
    }
  }
}

That's it! The server is now ready to process PDFs and perform semantic search.

Documentation

NPX_VS_UVX.md - How uvx serves as the npx equivalent for Python MCP servers
VSCODE_SETUP.md - Complete VSCode setup guide with troubleshooting
EXAMPLES.md - Usage examples and sample queries
Call Logging - See documentation/CALL_LOGGING.md

Features

PDF Loading: Download and process PDFs from URLs
Local Embeddings: Uses sentence-transformers for local embedding generation (no API required)
Persistent Storage: ChromaDB for vector storage with metadata
Semantic Search: Search documents using natural language queries
Page Extraction: Retrieve specific pages from loaded PDFs
Metadata Tracking: In-memory metadata store that is persisted to disk (see METADATA_PERSISTENCE_FILE)
Call Logging: Automatic logging of PDF loads and search queries to call_log.txt

Installation

Using uvx (like npx - no installation required):

uvx mcp-pdf-chroma

This is equivalent to Node.js's npx - it downloads and runs the package in isolation without installing it globally.

Note: Requires uv to be installed. Install with: pip install uv or see https://github.com/astral-sh/uv

Usage

Running the Server

uvx mcp-pdf-chroma

The server runs in stdio mode and communicates with MCP clients via standard input/output.

MCP Client Configuration

Configure your MCP client to use the server with uvx:

For VSCode (GitHub Copilot / Cline)

📖 Complete VSCode Setup Guide: See VSCODE_SETUP.md for detailed instructions, troubleshooting, and advanced configuration.

Method 1: Using the Command Palette (Recommended)

1. Open VSCode.
2. Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (macOS).
3. Run "Preferences: Open User Settings (JSON)".
4. Add:

{
  "mcp.servers": {
    "pdf-chroma": {
      "command": "uvx",
      "args": ["mcp-pdf-chroma"]
    }
  }
}

Method 2: Manual settings.json Edit

Open VSCode settings file:
- Windows: %APPDATA%\Code\User\settings.json
- macOS: ~/Library/Application Support/Code/User/settings.json
- Linux: ~/.config/Code/User/settings.json
Add:

{
  "mcp.servers": {
    "pdf-chroma": {
      "command": "uvx",
      "args": ["mcp-pdf-chroma"]
    }
  }
}

With Environment Variables:

{
  "mcp.servers": {
    "pdf-chroma": {
      "command": "uvx",
      "args": ["mcp-pdf-chroma"],
      "env": {
        "CHROMA_DB_PATH": "./my_chroma_db",
        "PDF_CACHE_DIR": "./my_pdfs",
        "CHUNK_SIZE": "1000",
        "MAX_PDF_SIZE_MB": "50"
      }
    }
  }
}

For Claude Desktop

Configuration File Location:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

Configuration:

{
  "mcpServers": {
    "pdf-chroma": {
      "command": "uvx",
      "args": ["mcp-pdf-chroma"]
    }
  }
}
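Editing the file by hand risks overwriting other servers already listed under mcpServers. A minimal Python sketch for merging the entry instead, assuming the config file is plain JSON at one of the paths above. The helper name add_server is illustrative and not part of this package:

```python
import json
from pathlib import Path


def add_server(config_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Merge one MCP server entry into a Claude Desktop style config file.

    Hypothetical helper: existing entries under "mcpServers" are preserved.
    """
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)

    # Add or replace only the named server, leaving the others untouched.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": args}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example (macOS path from the list above):
# add_server(
#     Path.home() / "Library/Application Support/Claude/claude_desktop_config.json",
#     "pdf-chroma", "uvx", ["mcp-pdf-chroma"],
# )
```

Restart Claude Desktop after changing the file so the new entry is picked up.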

Environment Variables

You can optionally configure the server using environment variables or a .env file:

# Database paths (created automatically if they don't exist)
CHROMA_DB_PATH=./chroma_db
PDF_CACHE_DIR=./pdf_cache
METADATA_PERSISTENCE_FILE=./metadata_store.json

# Embedding model (downloads automatically on first use)
EMBEDDING_MODEL=all-MiniLM-L6-v2

# Text processing
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Limits
MAX_PDF_SIZE_MB=50

These are optional - the server will use sensible defaults if not specified.
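As a rough illustration of how such settings can be resolved, here is a minimal sketch that reads the variables above with fallbacks. It assumes the defaults equal the example values shown; the package's actual config.py is not reproduced here and may differ, and the names Settings and load_settings are hypothetical:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    chroma_db_path: str
    pdf_cache_dir: str
    metadata_persistence_file: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    max_pdf_size_mb: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ).

    Default values are assumptions copied from the example above.
    """
    env = os.environ if env is None else env
    settings = Settings(
        chroma_db_path=env.get("CHROMA_DB_PATH", "./chroma_db"),
        pdf_cache_dir=env.get("PDF_CACHE_DIR", "./pdf_cache"),
        metadata_persistence_file=env.get(
            "METADATA_PERSISTENCE_FILE", "./metadata_store.json"
        ),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        max_pdf_size_mb=int(env.get("MAX_PDF_SIZE_MB", "50")),
    )
    # An overlap as large as the chunk itself would never advance through the text.
    if not 0 <= settings.chunk_overlap < settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be non-negative and smaller than CHUNK_SIZE")
    return settings
```

Passing a plain dict instead of os.environ, for example load_settings({"CHUNK_SIZE": "500"}), makes the behavior easy to check without touching the real environment.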

Available Tools

1. load_pdf

Download a PDF from a URL, split its text into chunks, and insert the embedded chunks into ChromaDB.

Parameters:

url (required): URL of the PDF file
filename (optional): Custom name for the document

Example:

{
  "url": "https://arxiv.org/pdf/2301.12345.pdf",
  "filename": "attention_paper"
}

Returns:

{
  "filename": "attention_paper",
  "source_url": "https://arxiv.org/pdf/2301.12345.pdf",
  "filesize": 2458624,
  "filesize_mb": "2.34 MB",
  "total_pages": 42,
  "total_chunks": 156,
  "created_at": "2024-01-15T10:30:00.000Z",
  "status": "success",
  "message": "Successfully loaded 'attention_paper' with 156 chunks from 42 pages"
}
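MCP clients normally build the wire message for you, but it can help to see what a tool invocation looks like. MCP uses JSON-RPC 2.0, and tools are invoked with the tools/call method. The sketch below only constructs the request; it does not perform the initialize handshake a real client sends first, and the helper name tool_call_request is illustrative:

```python
import json
from itertools import count

# Each JSON-RPC request needs an id so the response can be matched to it.
_request_ids = count(1)


def tool_call_request(tool: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 "tools/call" request as a single line."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


request = tool_call_request(
    "load_pdf",
    {"url": "https://arxiv.org/pdf/2301.12345.pdf", "filename": "attention_paper"},
)
print(request)
```

The same helper works for the other tools by changing the tool name and arguments.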

2. search_text

Run a semantic (embedding-based) search over the loaded documents in the vector database.

Parameters:

query (required): Search query/question
top_k (required): Number of results to return
filename (optional): Filter by document filename

Example:

{
  "query": "What is the attention mechanism?",
  "top_k": 5,
  "filename": "attention_paper"
}

Returns:

{
  "query": "What is the attention mechanism?",
  "top_k": 5,
  "filename_filter": "attention_paper",
  "count": 5,
  "results": [
    {
      "text": "The attention mechanism allows...",
      "filename": "attention_paper",
      "page_number": 3,
      "chunk_index": 12,
      "similarity_score": 0.8542,
      "source_url": "https://arxiv.org/pdf/2301.12345.pdf"
    }
  ]
}
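The similarity_score field suggests that results are ranked by vector similarity. A minimal sketch of top-k retrieval by cosine similarity over toy two-dimensional vectors follows. How the server derives similarity_score from ChromaDB's distances is not documented here, so the metric, the rounding, and the helper names are assumptions:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunks, k, filename=None):
    """Return the k most similar chunks, optionally restricted to one document."""
    candidates = [c for c in chunks if filename is None or c["filename"] == filename]
    scored = [
        {
            "text": c["text"],
            "filename": c["filename"],
            "similarity_score": round(cosine_similarity(query_vec, c["embedding"]), 4),
        }
        for c in candidates
    ]
    scored.sort(key=lambda r: r["similarity_score"], reverse=True)
    return scored[:k]


# Toy data: real embeddings from all-MiniLM-L6-v2 have 384 dimensions.
chunks = [
    {"text": "attention", "filename": "attention_paper", "embedding": [1.0, 0.0]},
    {"text": "unrelated", "filename": "attention_paper", "embedding": [0.0, 1.0]},
    {"text": "mixed", "filename": "other_paper", "embedding": [1.0, 1.0]},
]
print(top_k_chunks([1.0, 0.0], chunks, k=2))
```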

3. get_metadata

Retrieve metadata for a loaded document.

Parameters:

filename (required): Name of the document

Example:

{
  "filename": "attention_paper"
}

Returns:

{
  "filename": "attention_paper",
  "source_url": "https://arxiv.org/pdf/2301.12345.pdf",
  "filesize": 2458624,
  "filesize_mb": "2.34 MB",
  "total_pages": 42,
  "total_chunks": 156,
  "created_at": "2024-01-15T10:30:00.000Z",
  "last_accessed": "2024-01-15T11:45:00.000Z",
  "chunk_size": 1000,
  "chunk_overlap": 200
}
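The chunk_size and chunk_overlap values describe how the document text was split before embedding. As a simplified illustration, the sketch below uses a fixed character window that advances by chunk_size minus chunk_overlap. The Dependencies section lists langchain for text processing, which may split on separators rather than at fixed offsets, so treat this as an approximation and the function name split_text as hypothetical:

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size character windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(chr(ord("a") + i % 26) for i in range(2600))
pieces = split_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

With overlap, the tail of one chunk repeats at the head of the next, which helps a sentence that straddles a boundary remain searchable.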

4. give_page

Get full text of a specific page.

Parameters:

filename (required): Name of the document
page_number (required): Page number (1-indexed)

Example:

{
  "filename": "attention_paper",
  "page_number": 5
}

Returns:

{
  "filename": "attention_paper",
  "page_number": 5,
  "total_pages": 42,
  "text": "Full text content of page 5..."
}
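Because page_number is 1-indexed, a lookup against a zero-based list needs an offset and a bounds check. A minimal sketch, assuming the page texts are held in a Python list; the function name get_page_text and the error wording are illustrative, not the server's actual behavior:

```python
def get_page_text(pages: list[str], page_number: int) -> dict:
    """Return one page by its 1-indexed number, rejecting out-of-range values."""
    total = len(pages)
    if not 1 <= page_number <= total:
        raise ValueError(f"page_number must be between 1 and {total}, got {page_number}")
    return {
        "page_number": page_number,
        "total_pages": total,
        "text": pages[page_number - 1],  # convert to a zero-based index
    }


pages = ["first page", "second page", "third page"]
print(get_page_text(pages, 2))
```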

Architecture

mcp_pdf_chroma/
├── src/
│   └── mcp_pdf_chroma/
│       ├── __init__.py
│       ├── server.py          # Main MCP server
│       ├── config.py           # Configuration management
│       ├── metadata_store.py   # In-memory metadata storage
│       ├── pdf_processor.py    # PDF downloading and processing
│       └── vector_db.py        # ChromaDB integration
├── pyproject.toml
├── requirements.txt
├── call_log.txt               # Automatic call logging (created at runtime)
└── README.md

Call Logging

The server automatically logs PDF loads and search queries to call_log.txt. This includes:

Logged Actions

PDF Loading - Logs complete metadata when a PDF is processed:

[2025-12-14T18:15:31.472759Z] LOAD_PDF
{
  "url": "https://example.com/document.pdf",
  "filename": "my_document",
  "metadata": {
    "filesize": 2458624,
    "filesize_mb": "2.34 MB",
    "total_pages": 42,
    "total_chunks": 156,
    "created_at": "2025-12-14T18:15:31.472041Z",
    "status": "success"
  }
}

Search Queries - Logs all search requests from agents:

[2025-12-14T18:15:31.563683Z] SEARCH_TEXT
{
  "query": "What is the attention mechanism?",
  "top_k": 5,
  "filename_filter": "attention_paper",
  "results_count": 5
}

Log File Location

The log file is created in the working directory where the server is started:

Default: ./call_log.txt
Format: Timestamped JSON entries
Rotation: None built in; the file grows until you rotate or delete it manually
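Since each entry is a bracketed timestamp and action name followed by a JSON object, the log can be parsed with the standard library. The sketch below assumes every entry follows the layout of the examples above (header line, then one JSON object); the function name parse_call_log is hypothetical, and malformed entries are not handled:

```python
import json
import re

# Header lines look like: [2025-12-14T18:15:31.472759Z] LOAD_PDF
HEADER = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\][ \t]+(?P<action>[A-Z_]+)[ \t]*$", re.MULTILINE
)


def parse_call_log(text: str) -> list[dict]:
    """Parse timestamped JSON entries from the contents of call_log.txt."""
    decoder = json.JSONDecoder()
    entries = []
    for match in HEADER.finditer(text):
        start = text.find("{", match.end())  # payload begins at the next "{"
        if start == -1:
            continue
        payload, _end = decoder.raw_decode(text, start)
        entries.append(
            {
                "timestamp": match.group("timestamp"),
                "action": match.group("action"),
                "payload": payload,
            }
        )
    return entries


sample_log = """[2025-12-14T18:15:31.563683Z] SEARCH_TEXT
{
  "query": "What is the attention mechanism?",
  "top_k": 5,
  "filename_filter": "attention_paper",
  "results_count": 5
}
"""
for entry in parse_call_log(sample_log):
    print(entry["timestamp"], entry["action"], entry["payload"]["query"])
```

To analyze a real log, read the file first, for example parse_call_log(open("call_log.txt", encoding="utf-8").read()).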

Monitoring Usage

You can monitor the log in real-time:

# Watch the log file
tail -f call_log.txt

# View recent entries
tail -n 50 call_log.txt

# Search for specific queries
grep "SEARCH_TEXT" call_log.txt

For detailed information on call logging, see documentation/CALL_LOGGING.md.

Development

Running Tests

pytest tests/

Code Formatting

black src/

Type Checking

mypy src/

Dependencies

mcp: Model Context Protocol SDK
langchain: PDF loading and text processing
chromadb: Vector database
sentence-transformers: Local embeddings
pypdf: PDF parsing
requests: HTTP downloads

Performance

Embedding Speed: ~100-500 chunks/second (hardware dependent)
Search Speed: Sub-second for collections up to 100K chunks
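Storage: likely 2.5-3 KB per chunk at the default settings (a 384-dimensional float32 embedding from all-MiniLM-L6-v2 is 1,536 bytes, plus up to about 1,000 characters of text and the metadata), before index overhead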
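A back-of-the-envelope estimate of per-chunk storage can be worked out from the defaults. The sketch assumes CHUNK_SIZE counts characters stored at one byte each, embeddings are float32 with the 384 dimensions that all-MiniLM-L6-v2 produces, and roughly 200 bytes of metadata per chunk. The metadata figure is a guess, and ChromaDB's index overhead is ignored:

```python
def estimate_chunk_bytes(
    chunk_chars: int = 1000,      # CHUNK_SIZE, assuming 1 byte per character
    embedding_dims: int = 384,    # output size of all-MiniLM-L6-v2
    bytes_per_float: int = 4,     # float32
    metadata_bytes: int = 200,    # assumed, not measured
) -> int:
    """Approximate raw bytes stored per chunk (text + embedding + metadata)."""
    return chunk_chars + embedding_dims * bytes_per_float + metadata_bytes


per_chunk = estimate_chunk_bytes()
print(f"{per_chunk} bytes per chunk")
print(f"{per_chunk * 100_000 / 1024**2:.0f} MiB for 100K chunks")
```

Measure the size of the CHROMA_DB_PATH directory on a real collection before relying on these numbers.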

Troubleshooting

First-Time Setup

On first run, the server will:

Download the embedding model (~80MB for all-MiniLM-L6-v2)
Create the database directories automatically
Initialize the ChromaDB collection

This is normal and only happens once. Ensure you have internet connectivity for the initial model download.

Installation Issues

If you encounter installation errors:

# Upgrade pip first
pip install --upgrade pip

# Try installing with verbose output to see what's failing
pip install -v mcp-pdf-chroma

# If you have dependency conflicts, use a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install mcp-pdf-chroma

Command Not Found

If the mcp-pdf-chroma command is not found after a pip installation:

# Check if it's installed
pip show mcp-pdf-chroma

# Use the Python module syntax instead
python -m mcp_pdf_chroma.server

# Or reinstall with --force-reinstall
pip install --force-reinstall mcp-pdf-chroma

Large PDFs

If you encounter problems with large PDFs:

Reduce CHUNK_SIZE in the configuration if memory use is the issue
Increase MAX_PDF_SIZE_MB if the PDF is rejected for exceeding the size limit
Load PDFs one at a time rather than in large batches

Embedding Model Download

On first run, the embedding model will be downloaded (~80MB for all-MiniLM-L6-v2). Ensure you have internet connectivity.

ChromaDB Persistence

ChromaDB data is stored in CHROMA_DB_PATH. To reset the database, delete this directory.
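A small sketch of that reset in Python, assuming the server is stopped first and that CHROMA_DB_PATH falls back to ./chroma_db when unset. The function name reset_chroma_db is illustrative; note that the separate metadata file (METADATA_PERSISTENCE_FILE) is not touched here and may also need removing:

```python
import os
import shutil
from pathlib import Path


def reset_chroma_db(path=None) -> bool:
    """Delete the ChromaDB directory. Returns True if something was removed."""
    raw = path if path is not None else os.environ.get("CHROMA_DB_PATH", "./chroma_db")
    target = Path(raw).resolve()

    # Guard against wiping a filesystem root or the home directory by mistake.
    if target == Path(target.anchor) or target == Path.home():
        raise ValueError(f"refusing to delete {target}")

    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

The directory is recreated automatically the next time the server starts, as described under First-Time Setup.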

License

MIT License

Updates

To update a pip installation to the latest version:

pip install --upgrade mcp-pdf-chroma

To check your current version:

pip show mcp-pdf-chroma

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

For development setup:

# Clone the repository
git clone https://github.com/yourusername/mcp_pdf_chroma.git
cd mcp_pdf_chroma

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black src/

Release history

1.0.1 - Dec 14, 2025 (this version)
1.0.0 - Dec 14, 2025
