# quarry-mcp
Index any document. Search with natural language. Works with Claude Code and Claude Desktop.
## Quick Start
```bash
pip install quarry-mcp
quarry install            # downloads embedding model (~500MB), configures MCP
quarry ingest notes.md    # index a file — no cloud account needed
quarry search "my topic"
```
That's it. Quarry works locally out of the box.
## What It Does
Quarry turns documents into searchable knowledge for LLMs. You feed it files; it chunks and embeds them into a local vector database and exposes semantic search via MCP tools or a CLI.
Supported formats: PDF, images (PNG, JPG, TIFF, BMP, WebP), text files (TXT, Markdown, LaTeX, DOCX), and source code (30+ languages).
How each format is processed:
| Source | What happens | Result |
|---|---|---|
| PDF (text pages) | Text extraction via PyMuPDF | Prose chunks |
| PDF (image pages) | OCR (local or cloud) | Prose chunks |
| Images | OCR (local or cloud) | Prose chunks |
| Text files | Split by headings / sections / paragraphs | Section chunks |
| Source code | Tree-sitter AST parsing (functions, classes) | Code chunks |
## Installation
```bash
pip install quarry-mcp
quarry install
```
`quarry install` creates `~/.quarry/data/`, downloads the embedding model, and writes MCP config for Claude Code and Claude Desktop.
Verify with `quarry doctor`:
```
✓ Python version: 3.13.1
✓ Data directory: /Users/you/.quarry/data/lancedb
✓ Local OCR: RapidOCR engine OK
○ AWS credentials: Not configured (optional — needed for OCR_BACKEND=textract)
✓ Embedding model: snowflake-arctic-embed-m-v1.5 cached
✓ Core imports: 8 modules OK
```
## Usage

### CLI
```bash
# Ingest
quarry ingest report.pdf
quarry ingest whiteboard.jpg
quarry ingest src/main.py
quarry ingest report.pdf --overwrite

# Search
quarry search "authentication logic"
quarry search "quarterly revenue" -n 5

# Manage
quarry list
quarry delete report.pdf
quarry collections
quarry delete-collection math

# Directory sync — register a folder, then sync to pick up changes
quarry register /path/to/docs --collection my-docs
quarry sync
quarry registrations
quarry deregister my-docs
```
### MCP Server
`quarry install` configures Claude Code and Claude Desktop automatically. Manual setup:
Claude Code:
```bash
claude mcp add quarry -- uvx --from quarry-mcp quarry mcp
```
Claude Desktop (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "quarry": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"]
    }
  }
}
```
Use the absolute path to `uvx` for Desktop (e.g. `/opt/homebrew/bin/uvx`). `quarry install` resolves this automatically.
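To find the absolute path on your machine:

```bash
command -v uvx    # prints e.g. /opt/homebrew/bin/uvx
```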
Available tools:
| Tool | Description |
|---|---|
| `search_documents` | Semantic search across indexed documents |
| `ingest` | Ingest a file (PDF, image, text, source code) |
| `ingest_text` | Index raw text content directly |
| `get_documents` | List indexed documents with metadata |
| `get_page` | Retrieve full text for a specific page |
| `delete_document` | Remove a document and its chunks |
| `delete_collection` | Remove all documents in a collection |
| `list_collections` | List collections with document/chunk counts |
| `register_directory` | Register a directory for sync |
| `deregister_directory` | Remove a directory registration |
| `sync_all_registrations` | Sync all registered directories |
| `list_registrations` | List registered directories |
| `status` | Database stats: counts, storage size, model info |
**Claude Desktop note:** Uploaded files live in a sandbox that Quarry cannot access. Use `ingest_text` with extracted content for uploads. For files on your Mac, provide the local path to `ingest`.
## Configuration
All settings via environment variables:
| Variable | Default | Description |
|---|---|---|
| `OCR_BACKEND` | `local` | `local` (RapidOCR, offline) or `textract` (AWS) |
| `LANCEDB_PATH` | `~/.quarry/data/lancedb` | Vector database location |
| `CHUNK_MAX_CHARS` | `1800` | Target max characters per chunk (~450 tokens) |
| `CHUNK_OVERLAP_CHARS` | `200` | Overlap between consecutive chunks |
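For example, to keep a project-specific index with larger chunks (paths and values here are illustrative, not recommendations):

```bash
# Use a separate database and larger chunks for this shell session.
export LANCEDB_PATH="$HOME/projects/research/lancedb"
export CHUNK_MAX_CHARS=2400
export CHUNK_OVERLAP_CHARS=300
quarry ingest thesis.pdf
quarry search "related work on retrieval"
```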
### OCR Backends
Quarry ships with two OCR backends:
| Backend | Speed | Quality | Setup |
|---|---|---|---|
| local (default) | ~7-8s/page | Good for semantic search | None |
| textract | ~2-3s/page | Excellent character accuracy | AWS credentials + S3 bucket |
The local backend uses RapidOCR (PaddleOCR models via ONNX Runtime, CPU-only, ~214 MB). No cloud account needed.
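Because the backend is selected per process via `OCR_BACKEND`, you can default to local OCR and reach for Textract only on difficult scans (file names illustrative):

```bash
quarry ingest whiteboard.jpg                         # default: local RapidOCR
OCR_BACKEND=textract quarry ingest faded-scan.pdf    # cloud OCR for this run only
```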
### AWS Textract Setup
Only needed if you want cloud OCR. Set these environment variables:
| Variable | Default | Description |
|---|---|---|
| `AWS_ACCESS_KEY_ID` | | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | | AWS secret key |
| `AWS_DEFAULT_REGION` | `us-east-1` | AWS region |
| `S3_BUCKET` | `ocr-7f3a1b2e4c5d4e8f9a1b3c5d7e9f2a4b` | S3 bucket for Textract uploads |
Your IAM user needs `textract:DetectDocumentText`, `textract:StartDocumentTextDetection`, `textract:GetDocumentTextDetection`, and `s3:PutObject`/`s3:GetObject`/`s3:DeleteObject` on your bucket.
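A typical shell setup might look like this (placeholder values; the bucket must already exist and match `S3_BUCKET`):

```bash
export OCR_BACKEND=textract
export AWS_ACCESS_KEY_ID="<access-key-id>"
export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
export AWS_DEFAULT_REGION=us-east-1
export S3_BUCKET="<your-textract-bucket>"
quarry ingest scanned-report.pdf
```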
### Multiple Indices
Run separate MCP instances with different data directories:
```json
{
  "mcpServers": {
    "legal-docs": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/legal/lancedb" }
    },
    "financial-reports": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/financial/lancedb" }
    }
  }
}
```
### Advanced Configuration
| Variable | Default | Description |
|---|---|---|
| `TEXTRACT_POLL_INITIAL` | `5.0` | Initial Textract polling interval (seconds) |
| `TEXTRACT_POLL_MAX` | `30.0` | Max polling interval (1.5x exponential backoff) |
| `TEXTRACT_MAX_WAIT` | `900` | Max wait for a Textract job (seconds) |
| `REGISTRY_PATH` | `~/.quarry/data/registry.db` | Directory sync SQLite database |
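For very large scanned documents that exceed the default 15-minute wait (900 seconds), the Textract limits can be raised; values illustrative:

```bash
export TEXTRACT_MAX_WAIT=1800   # allow up to 30 minutes per Textract job
export TEXTRACT_POLL_MAX=60.0   # back off to at most 60s between polls
```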
## Architecture
```
Connectors                Formats                    Transformations
    │                        │                            │
    ├─ Local filesystem      ├─ PDF ───────┬─ text  ──→  PyMuPDF extraction
    │  (register + sync)     │             └─ image ──→  OCR (local or Textract)
    │                        │
    └─ Google Drive          ├─ Images ──────────────→  OCR (local or Textract)
       (planned)             │
                             ├─ Text files ──────────→  Section-aware splitting
                             │
                             ├─ Source code ─────────→  Tree-sitter AST splitting
                             │
                             └─ Raw text ────────────→  Direct chunking
                                                          │
                                                      Indexing
                                                          ├─ Sentence-aware chunking
                                                          ├─ Vector embeddings (768-dim)
                                                          └─ LanceDB storage
                                                          │
                                                       Query
                                                          ├─ Semantic search
                                                          └─ Collection filtering
                                                          │
                                                      Interface
                                                          ├─ MCP Server (stdio)
                                                          └─ CLI (typer + rich)
```
## Roadmap
- Spreadsheets (XLSX, CSV) via tabular serialization
- Presentations (PPTX) with speaker notes
- HTML with structure-aware splitting
- Search filters by content type and file format
- Google Drive connector
## Development
```bash
uv run ruff check .
uv run ruff format --check .
uv run mypy src/ tests/
uv run pytest
```
## License