MCP server for PDF processing and semantic search with ChromaDB
MCP PDF ChromaDB Server
A Python-based Model Context Protocol (MCP) server that provides PDF document processing, vectorization, and semantic search capabilities using ChromaDB with local embeddings.
Available on PyPI: https://pypi.org/project/mcp-pdf-chroma/
Quick Start
Get started in 2 minutes:
# No installation required!
uvx mcp-pdf-chroma
Then add this to your MCP client configuration:
For VSCode (GitHub Copilot / Cline):
{
"mcp.servers": {
"pdf-chroma": {
"command": "uvx",
"args": ["mcp-pdf-chroma"]
}
}
}
📖 Detailed VSCode setup guide: See VSCODE_SETUP.md for command palette method, troubleshooting, and advanced configuration.
For Claude Desktop:
{
"mcpServers": {
"pdf-chroma": {
"command": "uvx",
"args": ["mcp-pdf-chroma"]
}
}
}
That's it! The server is now ready to process PDFs and perform semantic search.
Documentation
- NPX_VS_UVX.md - Understanding uvx, the npx equivalent for Python MCP servers
- VSCODE_SETUP.md - Complete VSCode setup guide with troubleshooting
- EXAMPLES.md - Usage examples and sample queries
- Call Logging - See documentation/CALL_LOGGING.md
Features
- PDF Loading: Download and process PDFs from URLs
- Local Embeddings: Uses sentence-transformers for local embedding generation (no API required)
- Persistent Storage: ChromaDB for vector storage with metadata
- Semantic Search: Search documents using natural language queries
- Page Extraction: Retrieve specific pages from loaded PDFs
- Metadata Tracking: In-memory metadata storage with persistence
- Call Logging: Automatic logging of all PDF processing and search queries to call_log.txt
Installation
Using uvx (like npx - no installation required):
uvx mcp-pdf-chroma
This is equivalent to Node.js's npx - it downloads and runs the package in isolation without installing it globally.
Note: Requires uv to be installed. Install with: pip install uv or see https://github.com/astral-sh/uv
Usage
Running the Server
uvx mcp-pdf-chroma
The server runs in stdio mode and communicates with MCP clients via standard input/output.
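To make the stdio exchange concrete, here is a minimal sketch of the kind of message an MCP client writes to the server's standard input. MCP messages are JSON-RPC 2.0, and the stdio transport sends each message as a single line of JSON. The helper name `build_tool_call` is hypothetical, and a real client must complete the MCP `initialize` handshake before calling tools; in practice an MCP client or SDK does all of this for you.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single line of JSON."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, which the
    # newline-delimited stdio transport requires.
    return json.dumps(message)


line = build_tool_call(
    1,
    "search_text",
    {"query": "What is the attention mechanism?", "top_k": 5},
)
print(line)
```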
MCP Client Configuration
Configure your MCP client to use the server with uvx:
For VSCode (GitHub Copilot / Cline)
📖 Complete VSCode Setup Guide: See VSCODE_SETUP.md for detailed instructions, troubleshooting, and advanced configuration.
Method 1: Using VSCode Settings UI (Recommended)
- Open VSCode
- Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (macOS)
- Type "Preferences: Open User Settings (JSON)"
- Add:
{
"mcp.servers": {
"pdf-chroma": {
"command": "uvx",
"args": ["mcp-pdf-chroma"]
}
}
}
Method 2: Manual settings.json Edit
- Open VSCode settings file:
  - Windows: %APPDATA%\Code\User\settings.json
  - macOS: ~/Library/Application Support/Code/User/settings.json
  - Linux: ~/.config/Code/User/settings.json
- Add:
{
"mcp.servers": {
"pdf-chroma": {
"command": "uvx",
"args": ["mcp-pdf-chroma"]
}
}
}
With Environment Variables:
{
"mcp.servers": {
"pdf-chroma": {
"command": "uvx",
"args": ["mcp-pdf-chroma"],
"env": {
"CHROMA_DB_PATH": "./my_chroma_db",
"PDF_CACHE_DIR": "./my_pdfs",
"CHUNK_SIZE": "1000",
"MAX_PDF_SIZE_MB": "50"
}
}
}
}
For Claude Desktop
Configuration File Location:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
- Linux: ~/.config/Claude/claude_desktop_config.json
Configuration:
{
"mcpServers": {
"pdf-chroma": {
"command": "uvx",
"args": ["mcp-pdf-chroma"]
}
}
}
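If the config file already lists other servers, pasting the snippet by hand risks overwriting them. The following is a minimal sketch of merging the entry programmatically instead; the helper names `add_pdf_chroma` and `update_config_file` are hypothetical and not part of this package.

```python
import json
from pathlib import Path

# The same entry shown in the configuration above.
SERVER_ENTRY = {"command": "uvx", "args": ["mcp-pdf-chroma"]}


def add_pdf_chroma(config: dict) -> dict:
    """Return a copy of `config` with the pdf-chroma server registered."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["pdf-chroma"] = dict(SERVER_ENTRY)
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the config at `path` (if any), merge the entry, and write it back."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_pdf_chroma(existing), indent=2), encoding="utf-8")


# Example (Linux path from the list above):
# update_config_file(Path.home() / ".config/Claude/claude_desktop_config.json")
```

Existing entries under `mcpServers` are preserved; only the `pdf-chroma` key is added or replaced.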
Environment Variables
You can optionally configure the server using environment variables or a .env file:
# Database paths (created automatically if they don't exist)
CHROMA_DB_PATH=./chroma_db
PDF_CACHE_DIR=./pdf_cache
METADATA_PERSISTENCE_FILE=./metadata_store.json
# Embedding model (downloads automatically on first use)
EMBEDDING_MODEL=all-MiniLM-L6-v2
# Text processing
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
# Limits
MAX_PDF_SIZE_MB=50
These are optional - the server will use sensible defaults if not specified.
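As an illustration of how such settings are typically resolved, here is a minimal sketch that reads the variables above and falls back to defaults. It assumes the values listed above are the defaults, which this README does not state explicitly; the `Settings` class and `load_settings` function are hypothetical and may not match the package's own config.py. The sketch reads only the process environment and does not parse a .env file.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    chroma_db_path: str
    pdf_cache_dir: str
    metadata_persistence_file: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    max_pdf_size_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using assumed defaults."""
    settings = Settings(
        chroma_db_path=env.get("CHROMA_DB_PATH", "./chroma_db"),
        pdf_cache_dir=env.get("PDF_CACHE_DIR", "./pdf_cache"),
        metadata_persistence_file=env.get(
            "METADATA_PERSISTENCE_FILE", "./metadata_store.json"
        ),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        max_pdf_size_mb=int(env.get("MAX_PDF_SIZE_MB", "50")),
    )
    # An overlap as large as the chunk itself would never advance through the text.
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings


print(load_settings({"CHUNK_SIZE": "500"}))
```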
Available Tools
1. load_pdf
Load a PDF from a URL and insert into ChromaDB.
Parameters:
- url (required): URL of the PDF file
- filename (optional): Custom name for the document
Example:
{
"url": "https://arxiv.org/pdf/2301.12345.pdf",
"filename": "attention_paper"
}
Returns:
{
"filename": "attention_paper",
"source_url": "https://arxiv.org/pdf/2301.12345.pdf",
"filesize": 2458624,
"filesize_mb": "2.34 MB",
"total_pages": 42,
"total_chunks": 156,
"created_at": "2024-01-15T10:30:00.000Z",
"status": "success",
"message": "Successfully loaded 'attention_paper' with 156 chunks from 42 pages"
}
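The `filesize` and `filesize_mb` fields in the sample response are consistent with a megabyte of 1024 × 1024 bytes (2458624 bytes is 2.34 MB on that basis, but 2.46 MB with 10^6 bytes). The sketch below reproduces that arithmetic and applies the same unit to a size limit. Both helper names are hypothetical, and whether the server enforces MAX_PDF_SIZE_MB in exactly this way is an assumption.

```python
BYTES_PER_MB = 1024 * 1024


def format_size_mb(num_bytes: int) -> str:
    """Format a byte count the way `filesize_mb` appears in the sample response."""
    return f"{num_bytes / BYTES_PER_MB:.2f} MB"


def within_limit(num_bytes: int, max_pdf_size_mb: int = 50) -> bool:
    """Return True if a file of `num_bytes` fits under the size limit (assumed semantics)."""
    return num_bytes <= max_pdf_size_mb * BYTES_PER_MB


print(format_size_mb(2458624))  # 2.34 MB
print(within_limit(2458624))    # True
```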
2. search_text
Search for text in the vector database.
Parameters:
- query (required): Search query/question
- top_k (required): Number of results to return
- filename (optional): Filter by document filename
Example:
{
"query": "What is the attention mechanism?",
"top_k": 5,
"filename": "attention_paper"
}
Returns:
{
"query": "What is the attention mechanism?",
"top_k": 5,
"filename_filter": "attention_paper",
"count": 5,
"results": [
{
"text": "The attention mechanism allows...",
"filename": "attention_paper",
"page_number": 3,
"chunk_index": 12,
"similarity_score": 0.8542,
"source_url": "https://arxiv.org/pdf/2301.12345.pdf"
}
]
}
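A common follow-up to a search is to fetch the full pages behind the best hits with give_page. The sketch below shows one way a client might pick those pages from a response shaped like the one above. The function name and the 0.5 threshold are hypothetical, and it assumes a higher `similarity_score` means a closer match, which this README does not state.

```python
from collections import defaultdict


def pages_above_threshold(response: dict, min_score: float = 0.5) -> dict[str, list[int]]:
    """Group page numbers of sufficiently similar hits by document filename."""
    pages: dict[str, set[int]] = defaultdict(set)
    for hit in response.get("results", []):
        if hit["similarity_score"] >= min_score:
            pages[hit["filename"]].add(hit["page_number"])
    # Sort so pages can be requested in reading order.
    return {name: sorted(numbers) for name, numbers in pages.items()}


sample = {
    "results": [
        {"filename": "attention_paper", "page_number": 3, "similarity_score": 0.8542},
        {"filename": "attention_paper", "page_number": 7, "similarity_score": 0.41},
        {"filename": "attention_paper", "page_number": 3, "similarity_score": 0.77},
    ]
}
print(pages_above_threshold(sample))
```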
3. get_metadata
Retrieve metadata for a loaded document.
Parameters:
- filename (required): Name of the document
Example:
{
"filename": "attention_paper"
}
Returns:
{
"filename": "attention_paper",
"source_url": "https://arxiv.org/pdf/2301.12345.pdf",
"filesize": 2458624,
"filesize_mb": "2.34 MB",
"total_pages": 42,
"total_chunks": 156,
"created_at": "2024-01-15T10:30:00.000Z",
"last_accessed": "2024-01-15T11:45:00.000Z",
"chunk_size": 1000,
"chunk_overlap": 200
}
4. give_page
Get full text of a specific page.
Parameters:
- filename (required): Name of the document
- page_number (required): Page number (1-indexed)
Example:
{
"filename": "attention_paper",
"page_number": 5
}
Returns:
{
"filename": "attention_paper",
"page_number": 5,
"total_pages": 42,
"text": "Full text content of page 5..."
}
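Because `page_number` is 1-indexed and bounded by `total_pages`, it can help to validate a range on the client before issuing several give_page calls. This is a small sketch with a hypothetical helper name; it only builds the argument objects and does not contact the server.

```python
def page_requests(filename: str, total_pages: int, first: int, last: int) -> list[dict]:
    """Build give_page arguments for pages first..last (inclusive, 1-indexed)."""
    if not 1 <= first <= last <= total_pages:
        raise ValueError(f"page range {first}-{last} is outside 1-{total_pages}")
    return [{"filename": filename, "page_number": n} for n in range(first, last + 1)]


for args in page_requests("attention_paper", total_pages=42, first=5, last=7):
    print(args)
```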
Architecture
mcp_pdf_chroma/
├── src/
│ └── mcp_pdf_chroma/
│ ├── __init__.py
│ ├── server.py # Main MCP server
│ ├── config.py # Configuration management
│ ├── metadata_store.py # In-memory metadata storage
│ ├── pdf_processor.py # PDF downloading and processing
│ └── vector_db.py # ChromaDB integration
├── pyproject.toml
├── requirements.txt
├── call_log.txt # Automatic call logging (created at runtime)
└── README.md
Call Logging
The server automatically logs all actions to call_log.txt. This includes:
Logged Actions
- PDF Loading - Logs complete metadata when a PDF is processed:
  [2025-12-14T18:15:31.472759Z] LOAD_PDF { "url": "https://example.com/document.pdf", "filename": "my_document", "metadata": { "filesize": 2458624, "filesize_mb": "2.34 MB", "total_pages": 42, "total_chunks": 156, "created_at": "2025-12-14T18:15:31.472041Z", "status": "success" } }
- Search Queries - Logs all search requests from agents:
  [2025-12-14T18:15:31.563683Z] SEARCH_TEXT { "query": "What is the attention mechanism?", "top_k": 5, "filename_filter": "attention_paper", "results_count": 5 }
Log File Location
The log file is created in the working directory where the server is started:
- Default: ./call_log.txt
- Format: Timestamped JSON entries
- Rotation: Manual (file grows indefinitely)
Monitoring Usage
You can monitor the log in real-time:
# Watch the log file
tail -f call_log.txt
# View recent entries
tail -n 50 call_log.txt
# Search for specific queries
grep "SEARCH_TEXT" call_log.txt
For detailed information on call logging, see documentation/CALL_LOGGING.md.
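For analysis beyond grep, the entries can be parsed in Python. The sketch below assumes each entry occupies a single line in the form `[timestamp] ACTION {json}`, as in the samples in this section; if the server writes pretty-printed, multi-line JSON, the pattern would need adjusting. The names `parse_entry` and `count_actions` are hypothetical.

```python
import json
import re
from collections import Counter

# [timestamp] ACTION {json payload}
ENTRY = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+(?P<action>[A-Z_]+)\s+(?P<payload>\{.*\})\s*$"
)


def parse_entry(line: str):
    """Return (timestamp, action, payload dict), or None if the line does not match."""
    match = ENTRY.match(line)
    if match is None:
        return None
    return match["timestamp"], match["action"], json.loads(match["payload"])


def count_actions(lines) -> Counter:
    """Count log entries per action name, skipping lines that do not parse."""
    parsed = (parse_entry(line) for line in lines)
    return Counter(entry[1] for entry in parsed if entry is not None)


sample_line = (
    '[2025-12-14T18:15:31.563683Z] SEARCH_TEXT { "query": "What is the attention '
    'mechanism?", "top_k": 5, "filename_filter": "attention_paper", "results_count": 5 }'
)
timestamp, action, payload = parse_entry(sample_line)
print(action, payload["query"], payload["results_count"])
```

To summarize a real log, pass the open file to `count_actions`, for example `count_actions(open("call_log.txt", encoding="utf-8"))`.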
Development
Running Tests
pytest tests/
Code Formatting
black src/
Type Checking
mypy src/
Dependencies
- mcp: Model Context Protocol SDK
- langchain: PDF loading and text processing
- chromadb: Vector database
- sentence-transformers: Local embeddings
- pypdf: PDF parsing
- requests: HTTP downloads
Performance
- Embedding Speed: ~100-500 chunks/second (hardware dependent)
- Search Speed: Sub-second for collections up to 100K chunks
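- Storage: a few KB per chunk (text + embedding + metadata); a 384-dimensional embedding from all-MiniLM-L6-v2 stored as 32-bit floats is about 1.5 KB on its own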
Troubleshooting
First-Time Setup
On first run, the server will:
- Download the embedding model (~80MB for all-MiniLM-L6-v2)
- Create the database directories automatically
- Initialize the ChromaDB collection
This is normal and only happens once. Ensure you have internet connectivity for the initial model download.
Installation Issues
If you encounter installation errors:
# Upgrade pip first
pip install --upgrade pip
# Try installing with verbose output to see what's failing
pip install -v mcp-pdf-chroma
# If you have dependency conflicts, use a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install mcp-pdf-chroma
Command Not Found
If the mcp-pdf-chroma command is not found after installation:
# Check if it's installed
pip show mcp-pdf-chroma
# Use the Python module syntax instead
python -m mcp_pdf_chroma.server
# Or reinstall with --force-reinstall
pip install --force-reinstall mcp-pdf-chroma
Large PDFs
If you encounter memory issues with large PDFs:
- Reduce CHUNK_SIZE in configuration
- Increase MAX_PDF_SIZE_MB if needed
- Process PDFs in smaller batches
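To see how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a simplified character-based sliding-window chunker. It is an illustration only: the server's actual splitter is not documented here and may split on paragraph or sentence boundaries instead, so real chunk counts will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


page = "".join(chr(97 + i % 26) for i in range(2000))  # 2,000 characters of sample text
print(len(chunk_text(page)))            # 3
print(len(chunk_text(page, 500, 200)))  # 6
```

Smaller chunks mean more chunks to embed and store for the same document, so lowering CHUNK_SIZE reduces the size of each unit of work but increases the total count.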
Embedding Model Download
On first run, the embedding model will be downloaded (~80MB for all-MiniLM-L6-v2). Ensure you have internet connectivity.
ChromaDB Persistence
ChromaDB data is stored in CHROMA_DB_PATH. To reset the database, delete this directory.
License
MIT License
Links
- PyPI Package: https://pypi.org/project/mcp-pdf-chroma/
- GitHub Repository: https://github.com/yourusername/mcp_pdf_chroma
- Issue Tracker: https://github.com/yourusername/mcp_pdf_chroma/issues
Updates
To update to the latest version:
pip install --upgrade mcp-pdf-chroma
To check your current version:
pip show mcp-pdf-chroma
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
For development setup:
# Clone the repository
git clone https://github.com/yourusername/mcp_pdf_chroma.git
cd mcp_pdf_chroma
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest tests/
# Format code
black src/