MCP server for PDF processing and semantic search with ChromaDB
MCP PDF ChromaDB Server
A Python-based Model Context Protocol (MCP) server that provides PDF document processing, vectorization, and semantic search capabilities using ChromaDB with local embeddings.
Features
- PDF Loading: Download and process PDFs from URLs
- Local Embeddings: Uses sentence-transformers for local embedding generation (no API required)
- Persistent Storage: ChromaDB for vector storage with metadata
- Semantic Search: Search documents using natural language queries
- Page Extraction: Retrieve specific pages from loaded PDFs
- Metadata Tracking: In-memory metadata storage with persistence
- Call Logging: Automatic logging of all PDF processing and search queries to call_log.txt
Installation
Prerequisites
- Python 3.10 or higher
- pip
Install from PyPI (Recommended)
# Install the package
pip install mcp-pdf-chroma
# Or install with dev dependencies
pip install "mcp-pdf-chroma[dev]"
Install from source
git clone <repository-url>
cd mcp_pdf_chroma
# Install dependencies
pip install -e .
# Or install with dev dependencies
pip install -e ".[dev]"
Configuration
Create a .env file in the project root (optional):
# Database paths
CHROMA_DB_PATH=./chroma_db
PDF_CACHE_DIR=./pdf_cache
METADATA_PERSISTENCE_FILE=./metadata_store.json
# Embedding model
EMBEDDING_MODEL=all-MiniLM-L6-v2
# Text processing
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
# Limits
MAX_PDF_SIZE_MB=50
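As a rough illustration of how these variables could be consumed, the sketch below reads them from the environment and falls back to the example values shown above. The `Settings` class and `load_settings` function are hypothetical names, and treating the example values as defaults is an assumption; the package's own config.py may be organized differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; not the package's actual class."""
    chroma_db_path: str
    pdf_cache_dir: str
    metadata_persistence_file: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    max_pdf_size_mb: int


def load_settings(env=None):
    """Read the documented variables, falling back to the example values."""
    env = os.environ if env is None else env
    settings = Settings(
        chroma_db_path=env.get("CHROMA_DB_PATH", "./chroma_db"),
        pdf_cache_dir=env.get("PDF_CACHE_DIR", "./pdf_cache"),
        metadata_persistence_file=env.get(
            "METADATA_PERSISTENCE_FILE", "./metadata_store.json"
        ),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        max_pdf_size_mb=int(env.get("MAX_PDF_SIZE_MB", "50")),
    )
    # An overlap as large as the chunk itself would never advance through the text.
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise in tests.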
Usage
Running the Server
# Run as MCP server (stdio)
python -m mcp_pdf_chroma.server
# Or use the installed script
mcp-pdf-chroma
MCP Client Configuration
Add to your MCP client settings (e.g., Claude Desktop):
If installed from PyPI:
{
  "mcpServers": {
    "pdf-chroma": {
      "command": "mcp-pdf-chroma"
    }
  }
}
If installed from source:
{
  "mcpServers": {
    "pdf-chroma": {
      "command": "python",
      "args": ["-m", "mcp_pdf_chroma.server"],
      "cwd": "/path/to/mcp_pdf_chroma"
    }
  }
}
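The server can also be driven from a script instead of a desktop client. The sketch below uses the stdio client from the MCP Python SDK to call the search_text tool described under Available Tools. It assumes the `mcp` package is installed and that the `mcp-pdf-chroma` script is on the PATH; `build_search_args` and `run_search` are hypothetical helper names, not part of this package.

```python
def build_search_args(query, top_k, filename=None):
    """Build the arguments object for the search_text tool."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    args = {"query": query, "top_k": top_k}
    if filename is not None:
        args["filename"] = filename  # optional filter by document name
    return args


async def run_search(query, top_k=5, filename=None):
    """Start the server over stdio, call search_text, return the text items."""
    # Imported here so build_search_args stays usable without the SDK installed.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command="mcp-pdf-chroma", args=[])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "search_text",
                arguments=build_search_args(query, top_k, filename),
            )
            # Tool results arrive as a list of content items; text items carry .text.
            return [item.text for item in result.content if hasattr(item, "text")]
```

To try it, run `asyncio.run(run_search("What is the attention mechanism?"))` from a script that imports `asyncio`, after loading at least one PDF.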
Available Tools
1. load_pdf
Load a PDF from a URL and insert into ChromaDB.
Parameters:
- url (required): URL of the PDF file
- filename (optional): Custom name for the document
Example:
{
"url": "https://arxiv.org/pdf/2301.12345.pdf",
"filename": "attention_paper"
}
Returns:
{
"filename": "attention_paper",
"source_url": "https://arxiv.org/pdf/2301.12345.pdf",
"filesize": 2458624,
"filesize_mb": "2.34 MB",
"total_pages": 42,
"total_chunks": 156,
"created_at": "2024-01-15T10:30:00.000Z",
"status": "success",
"message": "Successfully loaded 'attention_paper' with 156 chunks from 42 pages"
}
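The filesize and filesize_mb fields in the response above are consistent with a binary megabyte (1024 * 1024 bytes). A minimal sketch of that conversion, together with a size check against MAX_PDF_SIZE_MB, might look as follows. The function names are hypothetical, and whether the server compares sizes exactly this way is an assumption.

```python
BYTES_PER_MB = 1024 * 1024  # binary megabyte, matching the example response


def format_filesize_mb(size_bytes):
    """Render a byte count the way the filesize_mb field is shown."""
    return f"{size_bytes / BYTES_PER_MB:.2f} MB"


def exceeds_limit(size_bytes, max_pdf_size_mb=50):
    """Return True when a download is larger than the configured limit."""
    return size_bytes > max_pdf_size_mb * BYTES_PER_MB
```

For the example document, `format_filesize_mb(2458624)` gives "2.34 MB", well under the 50 MB limit from the sample configuration.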
2. search_text
Search for text in the vector database.
Parameters:
- query (required): Search query/question
- top_k (required): Number of results to return
- filename (optional): Filter by document filename
Example:
{
"query": "What is the attention mechanism?",
"top_k": 5,
"filename": "attention_paper"
}
Returns:
{
  "query": "What is the attention mechanism?",
  "top_k": 5,
  "filename_filter": "attention_paper",
  "count": 5,
  "results": [
    {
      "text": "The attention mechanism allows...",
      "filename": "attention_paper",
      "page_number": 3,
      "chunk_index": 12,
      "similarity_score": 0.8542,
      "source_url": "https://arxiv.org/pdf/2301.12345.pdf"
    }
  ]
}
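A client will often want to discard weak matches and cite the page each hit came from. The sketch below works on a response shaped like the one above. It assumes a higher similarity_score means a closer match, and the 0.5 threshold is an arbitrary illustration; `best_hits` and `cite` are hypothetical helpers.

```python
def best_hits(response, min_score=0.5):
    """Keep results at or above min_score, strongest first."""
    hits = [r for r in response["results"] if r["similarity_score"] >= min_score]
    return sorted(hits, key=lambda r: r["similarity_score"], reverse=True)


def cite(hit):
    """Format one result as a short source reference."""
    return (
        f'{hit["filename"]}, p. {hit["page_number"]} '
        f'(score {hit["similarity_score"]:.2f})'
    )


# Sample data copied from the response shown above.
response = {
    "query": "What is the attention mechanism?",
    "top_k": 5,
    "results": [
        {
            "text": "The attention mechanism allows...",
            "filename": "attention_paper",
            "page_number": 3,
            "chunk_index": 12,
            "similarity_score": 0.8542,
            "source_url": "https://arxiv.org/pdf/2301.12345.pdf",
        }
    ],
}

citations = [cite(hit) for hit in best_hits(response)]
```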
3. get_metadata
Retrieve metadata for a loaded document.
Parameters:
- filename (required): Name of the document
Example:
{
"filename": "attention_paper"
}
Returns:
{
"filename": "attention_paper",
"source_url": "https://arxiv.org/pdf/2301.12345.pdf",
"filesize": 2458624,
"filesize_mb": "2.34 MB",
"total_pages": 42,
"total_chunks": 156,
"created_at": "2024-01-15T10:30:00.000Z",
"last_accessed": "2024-01-15T11:45:00.000Z",
"chunk_size": 1000,
"chunk_overlap": 200
}
4. give_page
Get full text of a specific page.
Parameters:
- filename (required): Name of the document
- page_number (required): Page number (1-indexed)
Example:
{
"filename": "attention_paper",
"page_number": 5
}
Returns:
{
"filename": "attention_paper",
"page_number": 5,
"total_pages": 42,
"text": "Full text content of page 5..."
}
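Because page_number is 1-indexed while Python lists are 0-indexed, callers that keep their own list of pages need a small conversion. The helper below is a hypothetical sketch of that bounds check; it is not taken from the server code.

```python
def page_index(page_number, total_pages):
    """Convert a 1-indexed page number to a 0-based list index."""
    if not 1 <= page_number <= total_pages:
        raise ValueError(
            f"page_number must be between 1 and {total_pages}, got {page_number}"
        )
    return page_number - 1
```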
Architecture
mcp_pdf_chroma/
├── src/
│   └── mcp_pdf_chroma/
│       ├── __init__.py
│       ├── server.py          # Main MCP server
│       ├── config.py          # Configuration management
│       ├── metadata_store.py  # In-memory metadata storage
│       ├── pdf_processor.py   # PDF downloading and processing
│       └── vector_db.py       # ChromaDB integration
├── pyproject.toml
├── requirements.txt
├── call_log.txt               # Automatic call logging (created at runtime)
└── README.md
Call Logging
The server automatically logs all actions to call_log.txt. This includes:
Logged Actions
- PDF Loading - Logs complete metadata when a PDF is processed:
[2025-12-14T18:15:31.472759Z] LOAD_PDF { "url": "https://example.com/document.pdf", "filename": "my_document", "metadata": { "filesize": 2458624, "filesize_mb": "2.34 MB", "total_pages": 42, "total_chunks": 156, "created_at": "2025-12-14T18:15:31.472041Z", "status": "success" } }
- Search Queries - Logs all search requests from agents:
[2025-12-14T18:15:31.563683Z] SEARCH_TEXT { "query": "What is the attention mechanism?", "top_k": 5, "filename_filter": "attention_paper", "results_count": 5 }
Log File Location
The log file is created in the working directory where the server is started:
- Default: ./call_log.txt
- Format: Timestamped JSON entries
- Rotation: Manual (file grows indefinitely)
Monitoring Usage
You can monitor the log in real-time:
# Watch the log file
tail -f call_log.txt
# View recent entries
tail -n 50 call_log.txt
# Search for specific queries
grep "SEARCH_TEXT" call_log.txt
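For anything beyond tail and grep, the entries can be parsed in Python. The sketch below assumes each entry has the shape shown under Logged Actions: a bracketed timestamp, an upper-case action name, then a JSON object. It tolerates the JSON spanning several lines. `parse_call_log` is a hypothetical helper, not part of the package.

```python
import json
import re
from collections import Counter

# "[timestamp] ACTION " prefix that precedes each JSON payload.
ENTRY_HEAD = re.compile(r"\[(?P<ts>[^\]]+)\]\s+(?P<action>[A-Z_]+)\s+")


def parse_call_log(text):
    """Yield (timestamp, action, payload) tuples from call_log.txt contents."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        match = ENTRY_HEAD.search(text, pos)
        if match is None:
            return
        try:
            # raw_decode stops at the end of the JSON value and reports where.
            payload, end = decoder.raw_decode(text, match.end())
        except json.JSONDecodeError:
            pos = match.end()  # skip a malformed entry rather than abort
            continue
        yield match["ts"], match["action"], payload
        pos = end


sample = (
    '[2025-12-14T18:15:31.472759Z] LOAD_PDF '
    '{"url": "https://example.com/document.pdf", "filename": "my_document"}\n'
    '[2025-12-14T18:15:31.563683Z] SEARCH_TEXT '
    '{"query": "What is the attention mechanism?", "top_k": 5, "results_count": 5}\n'
)

entries = list(parse_call_log(sample))
counts = Counter(action for _, action, _ in entries)
```

In practice, `sample` would be replaced by the contents of call_log.txt read from disk.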
Development
Running Tests
pytest tests/
Code Formatting
black src/
Type Checking
mypy src/
Dependencies
- mcp: Model Context Protocol SDK
- langchain: PDF loading and text processing
- chromadb: Vector database
- sentence-transformers: Local embeddings
- pypdf: PDF parsing
- requests: HTTP downloads
Performance
- Embedding Speed: ~100-500 chunks/second (hardware dependent)
- Search Speed: Sub-second for collections up to 100K chunks
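- Storage: a few KB per chunk at the default settings (up to ~1 KB of chunk text plus ~1.5 KB for a 384-dimensional float32 embedding, plus metadata and index overhead)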
Troubleshooting
Large PDFs
If you encounter memory issues with large PDFs:
- Reduce CHUNK_SIZE in configuration
- Increase MAX_PDF_SIZE_MB if needed
- Process PDFs in smaller batches
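To get a feel for how CHUNK_SIZE and CHUNK_OVERLAP trade chunk count against chunk size, the sketch below splits text with a plain sliding character window. This is a simplified stand-in: the server lists langchain as a dependency, and its splitters typically break on separators such as paragraphs, so real chunk counts will differ somewhat. `chunk_text` is a hypothetical function.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 5000 characters of repeating letters as stand-in page text.
text = "".join(chr(97 + i % 26) for i in range(5000))
default_chunks = chunk_text(text)            # 1000 / 200
smaller_chunks = chunk_text(text, 500, 200)  # 500 / 200
```

With these inputs the default settings produce 6 chunks and the smaller setting produces 16, so reducing CHUNK_SIZE lowers the size of each chunk while increasing the number of embeddings to compute and store.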
Embedding Model Download
On first run, the embedding model will be downloaded (~80MB for all-MiniLM-L6-v2). Ensure you have internet connectivity.
ChromaDB Persistence
ChromaDB data is stored in CHROMA_DB_PATH. To reset the database, delete this directory.
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.