quarry-mcp

Extract searchable knowledge from any document. Expose it to LLMs via MCP.

Unlock the knowledge trapped on your hard drive. Works with Claude Code and Claude Desktop.
Quick Start
One-liner install (Python 3.10+ required):
curl -fsSL https://raw.githubusercontent.com/jmf-pobox/quarry-mcp/main/install.sh | bash
This installs uv (if needed), quarry-mcp, downloads the embedding model, and configures Claude Code and Claude Desktop.
Or install manually:
pip install quarry-mcp
quarry install # downloads embedding model (~500MB), configures MCP
Then start using it:
quarry ingest-file notes.md # index a file — no cloud account needed
quarry search "my topic"
That's it. Quarry works locally out of the box.
What It Does
You have years of knowledge buried in PDFs, scanned documents, notes, spreadsheets, and source code. Quarry extracts that knowledge, makes it searchable by meaning, and gives your LLM access to it.
This is not media search — Quarry doesn't find images or match audio. It reads every document the way you would, extracts the text and structure, and indexes the knowledge inside. A scanned whiteboard becomes searchable prose. A spreadsheet becomes structured data an LLM can reason about. Source code becomes semantic units an LLM can reference.
Supported formats: PDF, images (PNG, JPG, TIFF, BMP, WebP), presentations (PPTX), spreadsheets (XLSX, CSV), HTML, webpages (via URL), text files (TXT, Markdown, LaTeX, DOCX), and source code (30+ languages).
How each format is processed:
| Source | What happens | Result |
|---|---|---|
| PDF (text pages) | Text extraction via PyMuPDF | Prose chunks |
| PDF (image pages) | OCR (local or cloud) | Prose chunks |
| Images | OCR (local or cloud) | Prose chunks |
| Spreadsheets | LaTeX tabular serialization via openpyxl | Tabular chunks |
| HTML | Boilerplate stripping, Markdown conversion | Section chunks |
| Presentations | Slide-per-chunk with tables as LaTeX via python-pptx | Slide chunks |
| Text files | Split by headings / sections / paragraphs | Section chunks |
| Source code | Tree-sitter AST parsing (functions, classes) | Code chunks |
Every format is converted to text optimized for LLM consumption. Structured formats like spreadsheets and presentation tables are serialized to LaTeX to preserve tabular relationships while remaining token-efficient. The goal is always the same: turn your files into knowledge an LLM can use.
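To make the LaTeX serialization concrete, here is a minimal sketch of turning a grid of spreadsheet cells into a tabular block. This is illustrative only, not Quarry's actual serializer; the function name and column-spec choice are assumptions.

```python
def rows_to_latex_tabular(rows: list[list[str]]) -> str:
    """Serialize a 2-D grid of cells as a LaTeX tabular environment."""
    if not rows:
        return ""
    ncols = max(len(r) for r in rows)
    spec = "l" * ncols  # left-align every column (illustrative choice)
    lines = [f"\\begin{{tabular}}{{{spec}}}"]
    for row in rows:
        padded = list(row) + [""] * (ncols - len(row))  # ragged rows padded
        lines.append(" & ".join(padded) + " \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines)

print(rows_to_latex_tabular([["Quarter", "Revenue"], ["Q1", "4.2M"]]))
```

The cell separators (&) and row terminators (\\) carry the row/column relationships explicitly, which is what lets an LLM reason about a table after it has been flattened to text.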
Installation
pip install quarry-mcp
quarry install
quarry install creates ~/.quarry/data/, downloads the embedding model, and writes MCP config for Claude Code and Claude Desktop.
Verify with quarry doctor:
quarry-mcp 0.5.0
✓ Python version: 3.13.1
✓ Data directory: /Users/you/.quarry/data/default/lancedb
✓ Local OCR: RapidOCR engine OK
○ AWS credentials: Not configured (optional — needed for OCR_BACKEND=textract)
✓ Embedding model: snowflake-arctic-embed-m-v1.5 (ONNX INT8) cached (67.73 MB)
✓ Core imports: 9 modules OK
✓ Claude Code MCP: configured
✓ Claude Desktop MCP: configured
✓ Storage: 42.5 MB in /Users/you/.quarry/data
All checks passed.
Usage
Quarry exposes most operations through both a CLI and MCP tools, with gaps shown as — below. The CLI is for terminal use; MCP tools are what Claude Code and Claude Desktop call on your behalf.
| Operation | CLI | MCP tool |
|---|---|---|
| Search | | |
| Semantic search | quarry search "query" | search_documents |
| Ingestion | | |
| Ingest a file | quarry ingest-file &lt;path&gt; | ingest_file |
| Ingest a URL | quarry ingest-url &lt;url&gt; | ingest_url |
| Ingest inline text | — | ingest_content |
| Documents | | |
| List documents | quarry list | get_documents |
| Get page text | — | get_page |
| Delete a document | quarry delete &lt;name&gt; | delete_document |
| Collections | | |
| List collections | quarry collections | list_collections |
| Delete a collection | quarry delete-collection &lt;name&gt; | delete_collection |
| Directory sync | | |
| Register a directory | quarry register &lt;path&gt; | register_directory |
| Sync all registrations | quarry sync | sync_all_registrations |
| List registrations | quarry registrations | list_registrations |
| Deregister | quarry deregister &lt;collection&gt; | deregister_directory |
| System | | |
| Database status | — | status |
| List databases | quarry databases | — |
| Install / setup | quarry install | — |
| Health check | quarry doctor | — |
| HTTP API server | quarry serve | — |
| MCP server | quarry mcp | — |
CLI examples
quarry ingest-file report.pdf --overwrite # replace existing data
quarry ingest-file report.pdf --db work # target a named database
quarry search "revenue" --limit 5 # limit results
quarry search "tests" --page-type code # filter by content type
quarry search "revenue" --source-format .xlsx # filter by source format
quarry register /path/to/docs --collection docs # explicit collection name
MCP setup
quarry install configures Claude Code and Claude Desktop automatically. Manual setup:
Claude Code:
claude mcp add quarry -- uvx --from quarry-mcp quarry mcp
Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"quarry": {
"command": "/path/to/uvx",
"args": ["--from", "quarry-mcp", "quarry", "mcp"]
}
}
}
Use the absolute path to uvx for Desktop (e.g. /opt/homebrew/bin/uvx). quarry install resolves this automatically.
Claude Desktop note: Uploaded files live in a sandbox that Quarry cannot access. Use ingest_content with extracted content for uploads. For files on your Mac, provide the local path to ingest_file.
Configuration
All settings via environment variables:
| Variable | Default | Description |
|---|---|---|
| OCR_BACKEND | local | local (RapidOCR, offline) or textract (AWS) |
| EMBEDDING_BACKEND | onnx | onnx (local, offline) or sagemaker (AWS) |
| QUARRY_ROOT | ~/.quarry/data | Base directory for all databases and logs |
| LANCEDB_PATH | ~/.quarry/data/default/lancedb | Vector database location (overrides --db) |
| REGISTRY_PATH | ~/.quarry/data/default/registry.db | Directory sync SQLite database |
| LOG_PATH | ~/.quarry/data/quarry.log | Log file location (rotating, 5 MB max) |
| CHUNK_MAX_CHARS | 1800 | Target max characters per chunk (~450 tokens) |
| CHUNK_OVERLAP_CHARS | 200 | Overlap between consecutive chunks |
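The two chunking settings bound how text is windowed. Quarry's real splitter is sentence-aware, but a hypothetical fixed-size version shows how the max-size and overlap values interact:

```python
import os

# Read the same env vars Quarry documents; defaults match the table above.
MAX_CHARS = int(os.environ.get("CHUNK_MAX_CHARS", "1800"))
OVERLAP = int(os.environ.get("CHUNK_OVERLAP_CHARS", "200"))

def chunk_text(text: str, max_chars: int = MAX_CHARS, overlap: int = OVERLAP) -> list[str]:
    """Split text into windows of at most max_chars characters, where each
    window repeats the last `overlap` characters of the previous one."""
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap  # advance by this much per chunk
    return [text[i : i + max_chars] for i in range(0, len(text) - overlap, step)]

chunks = chunk_text("x" * 4000)  # 4000 chars -> 3 chunks with defaults
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, at the cost of indexing about 11% (200/1800) of the text twice.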
OCR Backends
Quarry ships with two OCR backends:
| Backend | Per-page OCR | End-to-end ingestion | Quality | Setup |
|---|---|---|---|---|
| local (default) | ~7-8s/page | Faster overall | Good for semantic search | None |
| textract | ~2-3s/page | Slower (network overhead) | Excellent character accuracy | AWS credentials + S3 bucket |
The local backend uses RapidOCR (PaddleOCR models via ONNX Runtime, CPU-only, ~214 MB). No cloud account needed.
When to use Textract: Degraded scans, faxes, or low-resolution images where RapidOCR struggles. For clean digital PDFs and presentations, local OCR produces identical search results (see Benchmarks).
AWS Textract Setup
Only needed if you want cloud OCR. Set these environment variables:
| Variable | Default | Description |
|---|---|---|
| AWS_ACCESS_KEY_ID | | AWS access key |
| AWS_SECRET_ACCESS_KEY | | AWS secret key |
| AWS_DEFAULT_REGION | us-east-1 | AWS region (must match S3 bucket region) |
| S3_BUCKET | | S3 bucket for Textract uploads (must be in AWS_DEFAULT_REGION) |
Your IAM user needs textract:DetectDocumentText, textract:StartDocumentTextDetection, textract:GetDocumentTextDetection, and s3:PutObject/GetObject/DeleteObject on your bucket. All AWS resources (S3 bucket, SageMaker endpoint) must be in the same region. See docs/AWS-SETUP.md for full setup including IAM policy and region strategy.
SageMaker Embedding Setup
Optional cloud embedding for ingestion. Search always uses local ONNX regardless of this setting. Local ONNX embedding sustains ~11 chunks/s on a laptop — sufficient for most workloads. SageMaker is designed for large-scale batch ingestion (thousands of files) where parallelism across workers offsets the network overhead. For small-to-medium collections, local is faster (see Benchmarks).
- Deploy the endpoint (requires an AWS CLI profile with admin access):
./infra/manage-stack.sh deploy # serverless (default, pay-per-request)
./infra/manage-stack.sh deploy realtime # persistent instance (~$0.12/hr)
The script uses the QUARRY_DEPLOY_PROFILE env var (default: admin). Set it to match your AWS CLI profile name if different.
- Configure quarry:
| Variable | Default | Description |
|---|---|---|
| EMBEDDING_BACKEND | onnx | Set to sagemaker to use cloud embedding for ingestion |
| SAGEMAKER_ENDPOINT_NAME | | Endpoint name (e.g. quarry-embedding) |
| AWS_DEFAULT_REGION | us-east-1 | Must match the endpoint's deployed region |
- Tear down when not in use:
./infra/manage-stack.sh destroy
The serverless endpoint scales to zero when idle — you only pay per inference request. The realtime endpoint (~$0.12/hr) eliminates cold starts for sustained workloads. The management script packages a custom inference handler (CLS-token pooling + L2 normalization), uploads it to S3, and deploys the CloudFormation stack. See docs/AWS-SETUP.md for IAM setup and region strategy.
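The handler's post-processing (CLS-token pooling followed by L2 normalization) can be sketched in a few lines. This is an illustrative reconstruction from the description above, not the handler's actual code; shapes and names are assumptions:

```python
import math

def cls_pool_and_normalize(hidden_states: list[list[float]]) -> list[float]:
    """hidden_states: one sequence's token embeddings, [seq_len][hidden_dim].

    CLS-token pooling keeps only the first token's vector; L2 normalization
    scales it to unit length so dot product equals cosine similarity.
    """
    cls = hidden_states[0]  # the CLS token is conventionally first
    norm = math.sqrt(sum(x * x for x in cls)) or 1.0  # guard zero vector
    return [x / norm for x in cls]

vec = cls_pool_and_normalize([[3.0, 4.0], [1.0, 0.0]])  # -> [0.6, 0.8]
```

Normalizing at the endpoint means stored vectors and query vectors are directly comparable by dot product, which is the usual convention for arctic-embed-style retrieval models.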
Named Databases
Use --db to keep separate databases for different projects:
quarry ingest-file report.pdf --db work
quarry ingest-file paper.pdf --db personal
quarry search "revenue" --db work
quarry databases # list all databases with stats
Each database resolves to ~/.quarry/data/<name>/lancedb with its own registry. Start an MCP server against a named database:
{
"mcpServers": {
"work": {
"command": "/path/to/uvx",
"args": ["--from", "quarry-mcp", "quarry", "mcp", "--db", "work"]
},
"personal": {
"command": "/path/to/uvx",
"args": ["--from", "quarry-mcp", "quarry", "mcp", "--db", "personal"]
}
}
}
LANCEDB_PATH still works as an override for edge cases.
Advanced Configuration
| Variable | Default | Description |
|---|---|---|
| TEXTRACT_POLL_INITIAL | 5.0 | Initial Textract polling interval (seconds) |
| TEXTRACT_POLL_MAX | 30.0 | Max polling interval (1.5x exponential backoff) |
| TEXTRACT_MAX_WAIT | 900 | Max wait for Textract job (seconds) |
| TEXTRACT_MAX_IMAGE_BYTES | 10485760 | Max image size for Textract sync API (10 MB) |
| SAGEMAKER_ENDPOINT_NAME | | SageMaker endpoint for cloud embedding |
| EMBEDDING_MODEL | Snowflake/snowflake-arctic-embed-m-v1.5 | Embedding model identifier (cache key) |
| EMBEDDING_DIMENSION | 768 | Embedding vector dimension |
Benchmarks
Tested on 19 files (~44 MB) of university course material: clean digital PDFs and PPTX presentations. Single-threaded ingestion, M-series Mac.
Ingestion speed
| Configuration | Time | vs Local |
|---|---|---|
| Local (RapidOCR + ONNX) | 107s | 1x |
| Cloud (Textract + SageMaker serverless) | 983s | 9.2x slower |
| Cloud (Textract + SageMaker realtime) | 658s | 6.1x slower |
Search quality
Five test queries across both configurations returned identical top-1 results. Similarity scores differed by 0.003-0.04 between local INT8 and cloud FP32 embeddings — negligible for retrieval ranking.
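Why a 0.04 score shift is negligible: ranking only changes when two candidates' scores are closer together than the perturbation. A toy check with made-up scores (not the benchmark data):

```python
def top1(scores: dict[str, float]) -> str:
    """Document with the highest similarity score."""
    return max(scores, key=scores.get)

# Hypothetical results where the leader's margin (0.11) exceeds the
# worst-case combined shift (0.04 down for the leader, 0.04 up for #2).
local = {"lecture-03.pdf": 0.82, "syllabus.pdf": 0.71, "notes.md": 0.64}
cloud = {"lecture-03.pdf": 0.78, "syllabus.pdf": 0.75, "notes.md": 0.66}

assert top1(local) == top1(cloud) == "lecture-03.pdf"
```

INT8 quantization perturbs absolute scores, but retrieval only depends on the order of scores, so top-k results stay identical whenever score gaps exceed the quantization error.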
When cloud backends help
| Scenario | Recommendation |
|---|---|
| Clean digital PDFs, presentations, text | Local. Faster, free, same search quality. |
| Degraded scans, faxes, low-resolution images | Textract. Better character accuracy on noisy input. |
| Thousands of files, batch ingestion | SageMaker realtime. Parallelism across workers offsets overhead at scale. |
| Small-to-medium collections (<100 files) | Local. Cloud overhead dominates at this scale. |
Architecture
Connectors Formats Transformations
│ │ │
└─ Local filesystem ├─ PDF ──────┬─ text ──→ PyMuPDF extraction
(register + sync) │ └─ image ─→ OCR (local or Textract)
│
├─ Images ─────────────→ OCR (local or Textract)
│
├─ Spreadsheets ───────→ LaTeX tabular serialization
│
├─ Presentations ──────→ Slide extraction + LaTeX tables
│
├─ HTML ───────────────→ Boilerplate stripping + Markdown
│
├─ Text files ─────────→ Section-aware splitting
│ (TXT, MD, LaTeX, DOCX)
│
├─ Source code ─────────→ Tree-sitter AST splitting
│
└─ Raw text ───────────→ Direct chunking
│
Indexing
│
├─ Sentence-aware chunking
├─ Chunk metadata (page_type, source_format)
├─ Embedding (local ONNX or SageMaker cloud)
└─ LanceDB storage
│
Query
│
├─ Semantic search
├─ Collection filtering
└─ Format filtering (page_type, source_format)
│
Interface
│
├─ MCP Server (stdio)
├─ CLI (typer + rich)
└─ HTTP API (for quarry-menubar)
Library API
Quarry is fully typed (py.typed) and can be used as a Python library:
from pathlib import Path
from quarry import Settings, get_db, ingest_content, ingest_document, search
from quarry.backends import get_embedding_backend
# Load settings from environment variables
settings = Settings()
db = get_db(settings.lancedb_path)
# Ingest a file
result = ingest_document(Path("report.pdf"), db, settings, collection="work")
# Ingest inline content
result = ingest_content("Quarterly revenue was $4.2M.", "notes.txt", db, settings)
# Search
backend = get_embedding_backend(settings)
vector = backend.embed_query("revenue figures")
results = search(db, vector, limit=5, collection_filter="work")
for r in results:
    print(r["text"], r["_distance"])
The public API surface is in quarry/__init__.py. Pipeline functions accept a progress_callback: Callable[[str], None] for status updates during ingestion.
Roadmap
- macOS menu bar companion app — native macOS search interface (in development)
- Google Drive connector
- quarry sync --watch for live filesystem monitoring
- PII detection and redaction
For product vision and positioning, see PR/FAQ.
Development
uv run ruff check .
uv run ruff format --check .
uv run mypy src/ tests/
uv run pytest
See CONTRIBUTING.md for setup, architecture, and how to add new formats.
Documentation
- Changelog
- AWS Setup Guide -- IAM, S3, SageMaker deployment, region strategy
- Search Quality and Tuning
- Backend Abstraction Design
- Non-Functional Design
- PR/FAQ -- product vision and positioning
License