quarry-mcp
Extract searchable knowledge from any document. Expose it to LLMs via MCP.
Quarry ingests documents into a local vector database, then serves semantic search over that content through the Model Context Protocol (MCP). PDF ingestion is implemented today; text-file, image, and audio ingestion are on the roadmap. Point Claude Code or Claude Desktop at your documents and ask questions.
Why Quarry?
If your documents are already machine-readable text (TXT, Markdown, DOCX), mcp-local-rag is a solid zero-config option — one npx command and you're searching.
Quarry exists for documents that aren't text yet:
- Scanned PDFs — Board packs, legal filings, archival records. No embedded text, just page images. Quarry classifies each page, routes image pages through AWS Textract OCR, and extracts text from the rest.
- Mixed-format PDFs — Some pages are text, some are scans. Quarry handles both in a single pipeline.
- Images — Photos of whiteboards, receipts, handwritten notes. (Planned: Epic 3)
- Audio — Meeting recordings, interviews, podcasts. (Planned: Epic 4)
Quarry also preserves full page text alongside chunks, so LLMs can reference surrounding context when a search hit lands mid-page.
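The per-page routing described above can be pictured with a small sketch. It assumes a simple heuristic (a page with almost no embedded text is treated as a scan); the names `PageKind`, `classify_page`, and `route_pages`, and the 20-character threshold, are illustrative assumptions and not Quarry's actual implementation.

```python
from enum import Enum


class PageKind(Enum):
    TEXT = "text"    # embedded text can be extracted directly
    IMAGE = "image"  # no usable text layer, so the page needs OCR


def classify_page(embedded_text: str, min_chars: int = 20) -> PageKind:
    """Hypothetical heuristic: too little embedded text means the page is a scan."""
    visible = "".join(embedded_text.split())  # ignore whitespace-only content
    return PageKind.TEXT if len(visible) >= min_chars else PageKind.IMAGE


def route_pages(pages: list[str]) -> dict[PageKind, list[int]]:
    """Group 1-based page numbers by the extraction path they would take."""
    routes: dict[PageKind, list[int]] = {PageKind.TEXT: [], PageKind.IMAGE: []}
    for number, text in enumerate(pages, start=1):
        routes[classify_page(text)].append(number)
    return routes
```

In a mixed-format PDF, the `TEXT` page numbers would go to direct text extraction and the `IMAGE` page numbers to OCR, with the results merged back in page order.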
Features
- PDF ingestion with automatic text/image classification per page
- OCR via AWS Textract for scanned and image-based documents
- Text extraction via PyMuPDF for text-based PDF pages
- Sentence-aware chunking with configurable overlap
- Local vector embeddings using snowflake-arctic-embed-m-v1.5 (768-dim)
- LanceDB for fast, local vector storage (no external database)
- MCP server with 4 tools: search_documents, ingest, get_documents, get_page
- CLI for ingestion, search, and document management
- Full page text preserved alongside chunks for LLM reference
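To make "sentence-aware chunking with configurable overlap" concrete, here is a minimal sketch. It assumes a naive regex sentence splitter and character-based overlap; `split_sentences` and `chunk_text` are hypothetical names, and Quarry's real chunker may differ. The defaults mirror the documented `CHUNK_MAX_CHARS` and `CHUNK_OVERLAP_CHARS` values.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, max_chars: int = 1800, overlap_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks, carrying a character tail forward."""
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one (the overlap).
            tail = current[-overlap_chars:] if overlap_chars else ""
            current = f"{tail} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because sentences are never split, `max_chars` acts as a target and not a hard limit: a single long sentence, or a tail plus a sentence, can exceed it.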
Quick Start
# Clone and install
git clone https://github.com/jmf-pobox/quarry-mcp.git
cd quarry-mcp
uv sync
# Configure AWS credentials (required for OCR)
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_DEFAULT_REGION=us-east-1
# Ingest a PDF
uv run quarry ingest /path/to/document.pdf
# Search
uv run quarry search "revenue growth in 2024"
# List indexed documents
uv run quarry list
Installation
Requires Python 3.13+ and uv.
uv sync
For development:
uv pip install -e ".[dev]"
AWS Setup
Quarry uses AWS Textract for OCR and stages image pages in S3 for the Textract job. Your IAM user needs a policy like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-bucket/*"
    }
  ]
}
Set your S3 bucket via environment variable or .env file:
export S3_BUCKET=your-bucket-name
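The two Textract permissions above correspond to an asynchronous start-then-poll pattern. The sketch below shows only the polling half, under stated assumptions: `wait_for_job` is a hypothetical helper, and `get_status` is an injected callable standing in for whatever wraps the Textract status call. The defaults mirror the documented `TEXTRACT_POLL_INTERVAL` and `TEXTRACT_MAX_WAIT` settings.

```python
import time
from collections.abc import Callable


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    max_wait: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job leaves IN_PROGRESS, or give up after max_wait seconds."""
    waited = 0.0
    while True:
        status = get_status()
        if status != "IN_PROGRESS":
            return status  # e.g. SUCCEEDED or FAILED; the caller decides what to do
        if waited >= max_wait:
            raise TimeoutError(f"job still running after {max_wait:.0f}s")
        sleep(poll_interval)
        waited += poll_interval
```

Injecting `sleep` keeps the loop testable without real delays; in production the default `time.sleep` applies.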
Usage
MCP Server (Claude Code)
Add to your Claude Code configuration:
claude mcp add quarry -- uv run --directory /path/to/quarry-mcp python -m quarry mcp
After restarting Claude Code, four tools are available:
| Tool | Description |
|---|---|
| search_documents | Semantic search across all indexed documents |
| ingest | OCR and index a new PDF |
| get_documents | List all indexed documents with metadata |
| get_page | Retrieve full OCR text for a specific page |
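Conceptually, semantic search ranks stored chunk embeddings by similarity to the query embedding. The sketch below illustrates that idea in plain Python, assuming cosine similarity as the metric; in Quarry the vector search is delegated to LanceDB, and the names `cosine_similarity` and `top_k` are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: list[float], indexed: dict[str, list[float]], k: int = 5
) -> list[tuple[str, float]]:
    """Return the k chunk ids whose embeddings are most similar to the query."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

The `k` parameter plays the same role as a result limit on a search call.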
MCP Server (Claude Desktop)
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "quarry": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/quarry-mcp", "python", "-m", "quarry", "mcp"],
      "env": {
        "AWS_ACCESS_KEY_ID": "your-key",
        "AWS_SECRET_ACCESS_KEY": "your-secret",
        "AWS_DEFAULT_REGION": "us-east-1",
        "S3_BUCKET": "your-bucket"
      }
    }
  }
}
CLI
# Ingest a document
uv run quarry ingest report.pdf
# Re-ingest (overwrite existing)
uv run quarry ingest report.pdf --overwrite
# Search across all documents
uv run quarry search "board governance structure"
# Search with result limit
uv run quarry search "quarterly revenue" -n 5
# List indexed documents
uv run quarry list
Multiple Indices
Run separate MCP server instances with different data directories:
{
  "mcpServers": {
    "legal-docs": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/quarry-mcp", "python", "-m", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/legal/lancedb" }
    },
    "financial-reports": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/quarry-mcp", "python", "-m", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/financial/lancedb" }
    }
  }
}
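When the number of indices grows, the configuration above becomes repetitive. As a convenience sketch, a short script can generate it from a name-to-path mapping. `build_mcp_config` is a hypothetical helper, not part of Quarry, and the `module` default assumes the server is started with `python -m quarry mcp` as in the Claude Code command.

```python
import json


def build_mcp_config(repo_dir: str, indices: dict[str, str], module: str = "quarry") -> dict:
    """Map each index name to a server entry with its own LANCEDB_PATH."""
    return {
        "mcpServers": {
            name: {
                "command": "uv",
                "args": ["run", "--directory", repo_dir, "python", "-m", module, "mcp"],
                "env": {"LANCEDB_PATH": db_path},
            }
            for name, db_path in indices.items()
        }
    }


if __name__ == "__main__":
    config = build_mcp_config(
        "/path/to/quarry-mcp",
        {
            "legal-docs": "/data/legal/lancedb",
            "financial-reports": "/data/financial/lancedb",
        },
    )
    print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into the Claude Desktop configuration file; any AWS variables still need to be added to each `env` block if OCR is used.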
Configuration
All settings are configurable via environment variables or a .env file:
| Variable | Default | Description |
|---|---|---|
| AWS_ACCESS_KEY_ID | | AWS access key |
| AWS_SECRET_ACCESS_KEY | | AWS secret key |
| AWS_DEFAULT_REGION | us-east-1 | AWS region |
| S3_BUCKET | ocr-7f3a1b2e4c5d4e8f9a1b3c5d7e9f2a4b | S3 bucket for Textract uploads |
| LANCEDB_PATH | ./data/lancedb | Path to LanceDB storage |
| EMBEDDING_MODEL | Snowflake/snowflake-arctic-embed-m-v1.5 | HuggingFace embedding model |
| CHUNK_MAX_CHARS | 1800 | Target max characters per chunk (~450 tokens) |
| CHUNK_OVERLAP_CHARS | 200 | Character overlap between consecutive chunks |
| TEXTRACT_POLL_INTERVAL | 5 | Seconds between Textract status checks |
| TEXTRACT_MAX_WAIT | 900 | Maximum seconds to wait for Textract job |
Architecture
Input (PDF)
  │
  ├─ Text pages ──→ PyMuPDF text extraction
  │                        │
  └─ Image pages ─→ S3 upload → Textract async OCR → parse → S3 cleanup
                           │
                     Page contents
                           │
          Sentence-aware chunking (with overlap)
                           │
             snowflake-arctic-embed-m-v1.5
                           │
              LanceDB (local vector store)
                           │
                  ┌────────┴────────┐
                  │                 │
             MCP Server            CLI
         (stdio transport)   (typer + rich)
Each chunk stores both its text fragment and the full page raw text, so LLMs can reference surrounding context when a search result is relevant.
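The storage layout described above can be sketched as a record type. The field names here are assumptions for illustration and do not reflect Quarry's actual LanceDB schema; `context_window` is a hypothetical helper showing how the full page text lets a caller widen a hit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    document_name: str
    page_number: int
    chunk_index: int
    text: str           # fragment that is embedded and searched
    page_raw_text: str  # full page text, stored alongside for context


def build_records(
    document_name: str, page_number: int, page_text: str, chunks: list[str]
) -> list[ChunkRecord]:
    """Attach the same full page text to every chunk cut from that page."""
    return [
        ChunkRecord(document_name, page_number, index, chunk, page_text)
        for index, chunk in enumerate(chunks)
    ]


def context_window(record: ChunkRecord, radius: int = 80) -> str:
    """Return the hit plus up to `radius` characters of page text on each side."""
    start = record.page_raw_text.find(record.text)
    if start == -1:  # chunk not found verbatim (e.g. whitespace was normalized)
        return record.text
    begin = max(0, start - radius)
    end = min(len(record.page_raw_text), start + len(record.text) + radius)
    return record.page_raw_text[begin:end]
```

Duplicating the page text per chunk costs storage but keeps each search hit self-contained, with no second lookup needed to show surrounding context.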
Roadmap
Epic 1: PDF Pipeline ✓
Core ingestion and search for PDF documents.
- PDF page analysis (text vs image classification)
- Text extraction via PyMuPDF
- OCR via AWS Textract (async API with polling)
- Sentence-aware chunking with configurable overlap
- Local vector embeddings (snowflake-arctic-embed-m-v1.5)
- LanceDB vector storage with PyArrow schema
- MCP server with search, ingest, list, and page retrieval
- CLI with progress display
- Test suite (62 tests across 9 modules)
Epic 2: Text Document Ingestion
Direct ingestion of text-based formats without OCR.
- Plain text files (.txt)
- Markdown (.md)
- LaTeX (.tex)
- DOCX
- String ingestion (raw text/markdown/HTML without a file)
- Configurable page/section boundary detection
Epic 3: Image Format Support
OCR for standalone image files (not wrapped in PDF).
- Common formats: PNG, JPG, TIFF, BMP, WebP
- Single-image and batch ingestion
- Image preprocessing for OCR quality (deskew, contrast)
Epic 4: Audio Transcription
Speech-to-text ingestion for audio content.
- Audio format support: MP3, WAV, M4A, FLAC
- AWS Transcribe or Whisper integration
- Speaker diarization
- Timestamped chunks for source reference
Epic 5: Ingestion Quality
Post-processing to improve extracted text quality.
- LLM-based OCR error correction
- Chunk quality scoring and filtering
- Duplicate and near-duplicate detection
- Table and figure extraction
Epic 6: Multi-Index Management
First-class support for organizing documents into collections.
- Named indices with isolated storage
- Per-index configuration (embedding model, chunk size)
- Cross-index search
- Index metadata and statistics
Standalone Tasks
- Expose delete_document via MCP and CLI
- Add status tool to MCP server (document/chunk counts, DB size, model info)
Development
# Run all quality gates
uv run ruff check .
uv run ruff format --check .
uv run mypy src/quarry tests
uv run pytest
# Auto-format
uv run ruff format .
The project enforces strict mypy, comprehensive ruff rules, and requires all tests to pass before every commit.
License