MCP server for PDF processing and semantic search with ChromaDB

MCP PDF ChromaDB Server

A Python-based Model Context Protocol (MCP) server that provides PDF document processing, vectorization, and semantic search capabilities using ChromaDB with local embeddings.

Features

PDF Loading: Download and process PDFs from URLs
Local Embeddings: Uses sentence-transformers for local embedding generation (no API required)
Persistent Storage: ChromaDB for vector storage with metadata
Semantic Search: Search documents using natural language queries
Page Extraction: Retrieve specific pages from loaded PDFs
Metadata Tracking: In-memory metadata store, persisted to a JSON file (METADATA_PERSISTENCE_FILE)
Call Logging: Automatic logging of all PDF processing and search queries to call_log.txt

Installation

Prerequisites

Python 3.10 or higher
pip

Install from PyPI (Recommended)

# Install the package
pip install mcp-pdf-chroma

# Or install with dev dependencies
pip install "mcp-pdf-chroma[dev]"

Install from source

git clone <repository-url>
cd mcp_pdf_chroma

# Install the package and its dependencies in editable mode
pip install -e .

# Or install with dev dependencies
pip install -e ".[dev]"

Configuration

Create a .env file in the project root (optional):

# Database paths
CHROMA_DB_PATH=./chroma_db
PDF_CACHE_DIR=./pdf_cache
METADATA_PERSISTENCE_FILE=./metadata_store.json

# Embedding model
EMBEDDING_MODEL=all-MiniLM-L6-v2

# Text processing
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Limits
MAX_PDF_SIZE_MB=50
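
The variables above could be read along the following lines. This is a minimal sketch, not the package's actual config.py: the `Settings` class and `load_settings` function are hypothetical names, the fallback values simply mirror the sample file, and it assumes the .env contents have already been exported into the environment (for example by a loader such as python-dotenv).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in the sample .env."""
    chroma_db_path: str
    pdf_cache_dir: str
    metadata_persistence_file: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    max_pdf_size_mb: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    settings = Settings(
        # Fallbacks are assumptions copied from the sample file above.
        chroma_db_path=env.get("CHROMA_DB_PATH", "./chroma_db"),
        pdf_cache_dir=env.get("PDF_CACHE_DIR", "./pdf_cache"),
        metadata_persistence_file=env.get(
            "METADATA_PERSISTENCE_FILE", "./metadata_store.json"
        ),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        max_pdf_size_mb=int(env.get("MAX_PDF_SIZE_MB", "50")),
    )
    # An overlap as large as the chunk itself would never advance through the text.
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.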

Usage

Running the Server

# Run as MCP server (stdio)
python -m mcp_pdf_chroma.server

# Or use the installed script
mcp-pdf-chroma

MCP Client Configuration

Add to your MCP client settings (e.g., Claude Desktop):

If installed from PyPI:

{
  "mcpServers": {
    "pdf-chroma": {
      "command": "mcp-pdf-chroma"
    }
  }
}

If installed from source:

{
  "mcpServers": {
    "pdf-chroma": {
      "command": "python",
      "args": ["-m", "mcp_pdf_chroma.server"],
      "cwd": "/path/to/mcp_pdf_chroma"
    }
  }
}

Available Tools

1. load_pdf

Download a PDF from a URL, split it into chunks, and store the embedded chunks in ChromaDB.

Parameters:

url (required): URL of the PDF file
filename (optional): Custom name for the document

Example:

{
  "url": "https://arxiv.org/pdf/2301.12345.pdf",
  "filename": "attention_paper"
}

Returns:

{
  "filename": "attention_paper",
  "source_url": "https://arxiv.org/pdf/2301.12345.pdf",
  "filesize": 2458624,
  "filesize_mb": "2.34 MB",
  "total_pages": 42,
  "total_chunks": 156,
  "created_at": "2024-01-15T10:30:00.000Z",
  "status": "success",
  "message": "Successfully loaded 'attention_paper' with 156 chunks from 42 pages"
}
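
The sample values for filesize and filesize_mb are consistent with dividing the byte count by 1024 × 1024 and rounding to two decimals. The sketch below illustrates that arithmetic together with a size check against MAX_PDF_SIZE_MB. Both function names are hypothetical, and the real server may format or enforce the limit differently.

```python
def format_filesize_mb(num_bytes: int) -> str:
    """Render a byte count as megabytes, using 1 MB = 1024 * 1024 bytes."""
    return f"{num_bytes / (1024 * 1024):.2f} MB"


def exceeds_size_limit(num_bytes: int, max_pdf_size_mb: int = 50) -> bool:
    """Return True when a download is larger than the configured limit."""
    return num_bytes > max_pdf_size_mb * 1024 * 1024


print(format_filesize_mb(2458624))   # 2.34 MB
print(exceeds_size_limit(2458624))   # False
```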

2. search_text

Run a semantic search over the chunks stored in the vector database.

Parameters:

query (required): Search query/question
top_k (required): Number of results to return
filename (optional): Filter by document filename

Example:

{
  "query": "What is the attention mechanism?",
  "top_k": 5,
  "filename": "attention_paper"
}

Returns:

{
  "query": "What is the attention mechanism?",
  "top_k": 5,
  "filename_filter": "attention_paper",
  "count": 5,
  "results": [
    {
      "text": "The attention mechanism allows...",
      "filename": "attention_paper",
      "page_number": 3,
      "chunk_index": 12,
      "similarity_score": 0.8542,
      "source_url": "https://arxiv.org/pdf/2301.12345.pdf"
    }
  ]
}
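
To give a feel for what a ranked result list like the one above represents, here is a toy sketch of top-k retrieval by cosine similarity over embedding vectors. It is an illustration only: the README does not say how similarity_score is computed, the real search is delegated to ChromaDB, and the function names and tiny two-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_matches(query_vec, chunk_vecs, top_k):
    """Score every chunk against the query and keep the top_k best matches."""
    scored = [
        {"chunk_index": i, "similarity_score": round(cosine_similarity(query_vec, v), 4)}
        for i, v in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda r: r["similarity_score"], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings: chunk 0 points the same way as the query.
chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_matches([1.0, 0.0], chunks, top_k=2))
```

A vector database performs the same ranking with an approximate nearest-neighbour index rather than scoring every chunk in turn.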

3. get_metadata

Retrieve metadata for a loaded document.

Parameters:

filename (required): Name of the document

Example:

{
  "filename": "attention_paper"
}

Returns:

{
  "filename": "attention_paper",
  "source_url": "https://arxiv.org/pdf/2301.12345.pdf",
  "filesize": 2458624,
  "filesize_mb": "2.34 MB",
  "total_pages": 42,
  "total_chunks": 156,
  "created_at": "2024-01-15T10:30:00.000Z",
  "last_accessed": "2024-01-15T11:45:00.000Z",
  "chunk_size": 1000,
  "chunk_overlap": 200
}
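
The chunk_size and chunk_overlap fields describe how the document text was split before embedding. A simplified sketch of overlapping fixed-width chunking follows. The `split_text` helper is hypothetical, and since the README lists langchain for text processing, the real splitter probably breaks on separators such as paragraphs rather than at fixed character offsets.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
pieces = split_text(sample)
print(len(pieces), [len(p) for p in pieces])  # 3 [1000, 1000, 400]
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.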

4. give_page

Get full text of a specific page.

Parameters:

filename (required): Name of the document
page_number (required): Page number (1-indexed)

Example:

{
  "filename": "attention_paper",
  "page_number": 5
}

Returns:

{
  "filename": "attention_paper",
  "page_number": 5,
  "total_pages": 42,
  "text": "Full text content of page 5..."
}
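
Because page_number is 1-indexed, a lookup has to shift it by one before indexing into a list of page texts. The sketch below shows that conversion with a bounds check. The `get_page_text` name and the in-memory list of pages are assumptions for illustration, not the server's actual storage layout.

```python
def get_page_text(pages: list, page_number: int) -> dict:
    """Return the text of a 1-indexed page from a list of page strings."""
    total_pages = len(pages)
    if not 1 <= page_number <= total_pages:
        raise ValueError(
            f"page_number must be between 1 and {total_pages}, got {page_number}"
        )
    return {
        "page_number": page_number,
        "total_pages": total_pages,
        "text": pages[page_number - 1],  # convert to a 0-based list index
    }


pages = ["first page", "second page", "third page"]
print(get_page_text(pages, 2)["text"])  # second page
```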

Architecture

mcp_pdf_chroma/
├── src/
│   └── mcp_pdf_chroma/
│       ├── __init__.py
│       ├── server.py          # Main MCP server
│       ├── config.py           # Configuration management
│       ├── metadata_store.py   # In-memory metadata storage
│       ├── pdf_processor.py    # PDF downloading and processing
│       └── vector_db.py        # ChromaDB integration
├── pyproject.toml
├── requirements.txt
├── call_log.txt               # Automatic call logging (created at runtime)
└── README.md

Call Logging

The server automatically logs all actions to call_log.txt. This includes:

Logged Actions

PDF Loading - Logs complete metadata when a PDF is processed:

[2025-12-14T18:15:31.472759Z] LOAD_PDF
{
  "url": "https://example.com/document.pdf",
  "filename": "my_document",
  "metadata": {
    "filesize": 2458624,
    "filesize_mb": "2.34 MB",
    "total_pages": 42,
    "total_chunks": 156,
    "created_at": "2025-12-14T18:15:31.472041Z",
    "status": "success"
  }
}

Search Queries - Logs all search requests from agents:

[2025-12-14T18:15:31.563683Z] SEARCH_TEXT
{
  "query": "What is the attention mechanism?",
  "top_k": 5,
  "filename_filter": "attention_paper",
  "results_count": 5
}

Log File Location

The log file is created in the working directory where the server is started:

Default: ./call_log.txt
Format: Timestamped JSON entries
Rotation: None (the file grows indefinitely, so truncate or rotate it manually)
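
An entry in the format shown above, a bracketed UTC timestamp and action name followed by indented JSON, could be produced roughly as follows. This is a sketch of the format only: `format_log_entry` and `append_log_entry` are hypothetical names, and the blank line between entries is an assumption.

```python
import json
from datetime import datetime, timezone


def format_log_entry(action: str, payload: dict, now=None) -> str:
    """Render one log entry: '[timestamp] ACTION' followed by indented JSON."""
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return f"[{stamp}] {action}\n{json.dumps(payload, indent=2)}\n\n"


def append_log_entry(path: str, action: str, payload: dict) -> None:
    """Append an entry to the log file, creating the file if it does not exist."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(format_log_entry(action, payload))


print(format_log_entry("SEARCH_TEXT", {"query": "attention", "top_k": 5}))
```

Opening the file in append mode is what makes it grow without bound, which is why rotation has to be handled outside the server.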

Monitoring Usage

You can monitor the log in real-time:

# Watch the log file
tail -f call_log.txt

# View recent entries
tail -n 50 call_log.txt

# Search for specific queries
grep "SEARCH_TEXT" call_log.txt
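
For analysis beyond tail and grep, the log can be parsed into structured entries. The sketch below assumes every entry follows the layout shown under Logged Actions (a `[timestamp] ACTION` header line followed by one JSON object). `parse_call_log` is a hypothetical helper, not part of the package.

```python
import json
import re
from collections import Counter

# Header line such as: [2025-12-14T18:15:31.563683Z] SEARCH_TEXT
HEADER = re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<action>[A-Z_]+)\s*$", re.MULTILINE)


def parse_call_log(text: str) -> list:
    """Return a list of {'timestamp', 'action', 'payload'} dicts from log text."""
    decoder = json.JSONDecoder()
    entries = []
    for match in HEADER.finditer(text):
        body_start = text.find("{", match.end())
        if body_start == -1:
            continue  # header without a JSON body
        # raw_decode reads exactly one JSON value starting at body_start.
        payload, _ = decoder.raw_decode(text, body_start)
        entries.append(
            {"timestamp": match["ts"], "action": match["action"], "payload": payload}
        )
    return entries


sample_log = """[2025-12-14T18:15:31.472759Z] LOAD_PDF
{
  "url": "https://example.com/document.pdf",
  "filename": "my_document"
}

[2025-12-14T18:15:31.563683Z] SEARCH_TEXT
{
  "query": "What is the attention mechanism?",
  "top_k": 5
}
"""

entries = parse_call_log(sample_log)
print(Counter(e["action"] for e in entries))
```

In practice the text would come from reading call_log.txt; the inline sample keeps the example self-contained.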

Development

Running Tests

pytest tests/

Code Formatting

black src/

Type Checking

mypy src/

Dependencies

mcp: Model Context Protocol SDK
langchain: PDF loading and text processing
chromadb: Vector database
sentence-transformers: Local embeddings
pypdf: PDF parsing
requests: HTTP downloads

Performance

Embedding Speed: ~100-500 chunks/second (hardware dependent)
Search Speed: Sub-second for collections up to 100K chunks
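Storage: on the order of a few KB per chunk with the default settings (a 384-dimensional float32 embedding from all-MiniLM-L6-v2 is about 1.5 KB on its own, before chunk text and metadata)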

Troubleshooting

Large PDFs

If you encounter memory issues with large PDFs:

Reduce CHUNK_SIZE in the configuration
Process PDFs in smaller batches

MAX_PDF_SIZE_MB does not reduce memory use; raise it only if a PDF you need is larger than the configured limit.

Embedding Model Download

On first run, the embedding model will be downloaded (~80MB for all-MiniLM-L6-v2). Ensure you have internet connectivity.

ChromaDB Persistence

ChromaDB data is stored in CHROMA_DB_PATH. To reset the database, delete this directory.

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Release history

1.0.1

Dec 14, 2025

1.0.0 (this version)

Dec 14, 2025

Download files

Source Distribution

mcp_pdf_chroma-1.0.0.tar.gz (15.3 kB)

Uploaded Dec 14, 2025 Source

Built Distribution

mcp_pdf_chroma-1.0.0-py3-none-any.whl (14.9 kB)

Uploaded Dec 14, 2025 Python 3

File details

Details for the file mcp_pdf_chroma-1.0.0.tar.gz.

File metadata

Download URL: mcp_pdf_chroma-1.0.0.tar.gz
Upload date: Dec 14, 2025
Size: 15.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for mcp_pdf_chroma-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`cfaaae8074ea1142b3f9d37d769b4256faf25bed4b8e30fb54648fdeb0b9c88d`
MD5	`21b85cb6e07f8860bf9794d2b2506093`
BLAKE2b-256	`35e5c5296bd23e96bfedb15a60216a358d2041b2a7478cfd9c9c851151fb20eb`

File details

Details for the file mcp_pdf_chroma-1.0.0-py3-none-any.whl.

File metadata

Download URL: mcp_pdf_chroma-1.0.0-py3-none-any.whl
Upload date: Dec 14, 2025
Size: 14.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for mcp_pdf_chroma-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f2c37f117d914dbef9689ac022fa5fce1e8ca20090adb77d07b4b2e5ea994df8`
MD5	`d6f4d6ed3b1796d5ee4480bdda976da3`
BLAKE2b-256	`9ea5ba8334d3a26233f6d822c16548ba0962ca8d9b9c99051ad6611c18546380`
