
Extract searchable knowledge from any document. Expose it to LLMs via MCP.


quarry-mcp


Unlock the knowledge trapped on your hard drive. Works with Claude Code and Claude Desktop.

Quick Start

One-liner install (Python 3.10+ required):

curl -fsSL https://raw.githubusercontent.com/jmf-pobox/quarry-mcp/main/install.sh | bash

This installs uv (if needed) and quarry-mcp, downloads the embedding model, and configures Claude Code and Claude Desktop.

Or install manually:

pip install quarry-mcp
quarry install          # downloads embedding model (~500MB), configures MCP

Then start using it:

quarry ingest-file notes.md  # index a file — no cloud account needed
quarry search "my topic"

That's it. Quarry works locally out of the box.

What It Does

You have years of knowledge buried in PDFs, scanned documents, notes, spreadsheets, and source code. Quarry extracts that knowledge, makes it searchable by meaning, and gives your LLM access to it.

This is not media search — Quarry doesn't find images or match audio. It reads every document the way you would, extracts the text and structure, and indexes the knowledge inside. A scanned whiteboard becomes searchable prose. A spreadsheet becomes structured data an LLM can reason about. Source code becomes semantic units an LLM can reference.

Supported formats: PDF, images (PNG, JPG, TIFF, BMP, WebP), presentations (PPTX), spreadsheets (XLSX, CSV), HTML, webpages (via URL), text files (TXT, Markdown, LaTeX, DOCX), and source code (30+ languages).

How each format is processed:

| Source | What happens | Result |
| --- | --- | --- |
| PDF (text pages) | Text extraction via PyMuPDF | Prose chunks |
| PDF (image pages) | OCR (local or cloud) | Prose chunks |
| Images | OCR (local or cloud) | Prose chunks |
| Spreadsheets | LaTeX tabular serialization via openpyxl | Tabular chunks |
| HTML | Boilerplate stripping, Markdown conversion | Section chunks |
| Presentations | Slide-per-chunk with tables as LaTeX via python-pptx | Slide chunks |
| Text files | Split by headings / sections / paragraphs | Section chunks |
| Source code | Tree-sitter AST parsing (functions, classes) | Code chunks |

Every format is converted to text optimized for LLM consumption. Structured formats like spreadsheets and presentation tables are serialized to LaTeX to preserve tabular relationships while remaining token-efficient. The goal is always the same: turn your files into knowledge an LLM can use.
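To make the serialization concrete, here is a made-up example of what a small two-column sheet might look like after conversion (illustrative only; the exact output format is Quarry's own):

```latex
% Hypothetical serialization of a two-row revenue sheet
\begin{tabular}{ll}
Quarter & Revenue \\
Q1      & 4.2M    \\
Q2      & 4.8M    \\
\end{tabular}
```

A compact tabular form like this keeps row/column relationships explicit without the token overhead of verbose markup.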

Installation

pip install quarry-mcp
quarry install

quarry install creates ~/.quarry/data/, downloads the embedding model, and writes MCP config for Claude Code and Claude Desktop.

Verify with quarry doctor:

quarry-mcp 0.5.0

  ✓ Python version: 3.13.1
  ✓ Data directory: /Users/you/.quarry/data/default/lancedb
  ✓ Local OCR: RapidOCR engine OK
  ○ AWS credentials: Not configured (optional — needed for OCR_BACKEND=textract)
  ✓ Embedding model: snowflake-arctic-embed-m-v1.5 (ONNX INT8) cached (67.73 MB)
  ✓ Core imports: 9 modules OK
  ✓ Claude Code MCP: configured
  ✓ Claude Desktop MCP: configured
  ✓ Storage: 42.5 MB in /Users/you/.quarry/data

All checks passed.

Usage

Quarry exposes most operations through both a CLI and MCP tools, with a few gaps, shown in the table below. The CLI is for terminal use; MCP tools are what Claude Code and Claude Desktop call on your behalf.

| Operation | CLI | MCP tool |
| --- | --- | --- |
| **Search** | | |
| Semantic search | `quarry search "query"` | `search_documents` |
| **Ingestion** | | |
| Ingest a file | `quarry ingest-file <path>` | `ingest_file` |
| Ingest a URL | `quarry ingest-url <url>` | `ingest_url` |
| Ingest inline text | - | `ingest_content` |
| **Documents** | | |
| List documents | `quarry list` | `get_documents` |
| Get page text | - | `get_page` |
| Delete a document | `quarry delete <name>` | `delete_document` |
| **Collections** | | |
| List collections | `quarry collections` | `list_collections` |
| Delete a collection | `quarry delete-collection <name>` | `delete_collection` |
| **Directory sync** | | |
| Register a directory | `quarry register <path>` | `register_directory` |
| Sync all registrations | `quarry sync` | `sync_all_registrations` |
| List registrations | `quarry registrations` | `list_registrations` |
| Deregister | `quarry deregister <collection>` | `deregister_directory` |
| **System** | | |
| Database status | - | `status` |
| List databases | `quarry databases` | - |
| Install / setup | `quarry install` | - |
| Health check | `quarry doctor` | - |
| HTTP API server | `quarry serve` | - |
| MCP server | `quarry mcp` | - |

CLI examples

quarry ingest-file report.pdf --overwrite          # replace existing data
quarry ingest-file report.pdf --db work            # target a named database
quarry search "revenue" --limit 5                  # limit results
quarry search "tests" --page-type code             # filter by content type
quarry search "revenue" --source-format .xlsx      # filter by source format
quarry register /path/to/docs --collection docs    # explicit collection name

MCP setup

quarry install configures Claude Code and Claude Desktop automatically. Manual setup:

Claude Code:

claude mcp add quarry -- uvx --from quarry-mcp quarry mcp

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "quarry": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"]
    }
  }
}

Use the absolute path to uvx for Desktop (e.g. /opt/homebrew/bin/uvx). quarry install resolves this automatically.

Claude Desktop note: Uploaded files live in a sandbox that Quarry cannot access. Use ingest_content with extracted content for uploads. For files on your Mac, provide the local path to ingest_file.

Configuration

All settings via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `OCR_BACKEND` | `local` | `local` (RapidOCR, offline) or `textract` (AWS) |
| `EMBEDDING_BACKEND` | `onnx` | `onnx` (local, offline) or `sagemaker` (AWS) |
| `QUARRY_ROOT` | `~/.quarry/data` | Base directory for all databases and logs |
| `LANCEDB_PATH` | `~/.quarry/data/default/lancedb` | Vector database location (overrides `--db`) |
| `REGISTRY_PATH` | `~/.quarry/data/default/registry.db` | Directory sync SQLite database |
| `LOG_PATH` | `~/.quarry/data/quarry.log` | Log file location (rotating, 5 MB max) |
| `CHUNK_MAX_CHARS` | `1800` | Target max characters per chunk (~450 tokens) |
| `CHUNK_OVERLAP_CHARS` | `200` | Overlap between consecutive chunks |
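`CHUNK_MAX_CHARS` and `CHUNK_OVERLAP_CHARS` together describe a sliding window. A minimal sketch of that idea (illustrative only; Quarry's actual chunker is sentence-aware, as described under Architecture):

```python
def split_overlapping(text: str, max_chars: int = 1800, overlap: int = 200) -> list[str]:
    """Naive sliding-window split: each chunk shares `overlap` chars with the next."""
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap
    return [text[i : i + max_chars] for i in range(0, len(text) - overlap, step)]

# A 4000-char document yields three windows: 0-1800, 1600-3400, 3200-4000.
doc = "".join(str(i % 10) for i in range(4000))
chunks = split_overlapping(doc)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.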

OCR Backends

Quarry ships with two OCR backends:

| Backend | Per-page OCR | End-to-end ingestion | Quality | Setup |
| --- | --- | --- | --- | --- |
| `local` (default) | ~7-8 s/page | Faster overall | Good for semantic search | None |
| `textract` | ~2-3 s/page | Slower (network overhead) | Excellent character accuracy | AWS credentials + S3 bucket |

The local backend uses RapidOCR (PaddleOCR models via ONNX Runtime, CPU-only, ~214 MB). No cloud account needed.

When to use Textract: Degraded scans, faxes, or low-resolution images where RapidOCR struggles. For clean digital PDFs and presentations, local OCR produces identical search results (see Benchmarks).

AWS Textract Setup

Only needed if you want cloud OCR. Set these environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `AWS_ACCESS_KEY_ID` | - | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | - | AWS secret key |
| `AWS_DEFAULT_REGION` | `us-east-1` | AWS region (must match S3 bucket region) |
| `S3_BUCKET` | - | S3 bucket for Textract uploads (must be in `AWS_DEFAULT_REGION`) |

Your IAM user needs textract:DetectDocumentText, textract:StartDocumentTextDetection, textract:GetDocumentTextDetection, and s3:PutObject/GetObject/DeleteObject on your bucket. All AWS resources (S3 bucket, SageMaker endpoint) must be in the same region. See docs/AWS-SETUP.md for full setup including IAM policy and region strategy.
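A minimal IAM policy sketch covering the actions listed above (the bucket name is a placeholder; see docs/AWS-SETUP.md for the authoritative policy):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET/*"
    }
  ]
}
```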

SageMaker Embedding Setup

Optional cloud embedding for ingestion. Search always uses local ONNX regardless of this setting. Local ONNX embedding sustains ~11 chunks/s on a laptop — sufficient for most workloads. SageMaker is designed for large-scale batch ingestion (thousands of files) where parallelism across workers offsets the network overhead. For small-to-medium collections, local is faster (see Benchmarks).
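As a back-of-envelope check on the ~11 chunks/s figure (the collection size here is hypothetical):

```python
local_rate = 11   # chunks/s, the local ONNX figure quoted above
chunks = 10_000   # hypothetical medium-large collection
minutes = chunks / local_rate / 60
print(f"~{minutes:.1f} minutes to embed locally")  # ~15.2 minutes
```

At that rate, only collections well beyond this size start to justify the network overhead of a cloud endpoint.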

  1. Deploy the endpoint (requires an AWS CLI profile with admin access):
./infra/manage-stack.sh deploy              # serverless (default, pay-per-request)
./infra/manage-stack.sh deploy realtime     # persistent instance (~$0.12/hr)

The script uses the QUARRY_DEPLOY_PROFILE env var (default: admin). Set it to match your AWS CLI profile name if different.

  2. Configure quarry:

| Variable | Default | Description |
| --- | --- | --- |
| `EMBEDDING_BACKEND` | `onnx` | Set to `sagemaker` to use cloud embedding for ingestion |
| `SAGEMAKER_ENDPOINT_NAME` | - | Endpoint name (e.g. `quarry-embedding`) |
| `AWS_DEFAULT_REGION` | `us-east-1` | Must match the endpoint's deployed region |

  3. Tear down when not in use:
./infra/manage-stack.sh destroy

The serverless endpoint scales to zero when idle — you only pay per inference request. The realtime endpoint (~$0.12/hr) eliminates cold starts for sustained workloads. The management script packages a custom inference handler (CLS-token pooling + L2 normalization), uploads it to S3, and deploys the CloudFormation stack. See docs/AWS-SETUP.md for IAM setup and region strategy.

Named Databases

Use --db to keep separate databases for different projects:

quarry ingest-file report.pdf --db work
quarry ingest-file paper.pdf --db personal
quarry search "revenue" --db work
quarry databases  # list all databases with stats

Each database resolves to ~/.quarry/data/<name>/lancedb with its own registry. Start an MCP server against a named database:

{
  "mcpServers": {
    "work": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp", "--db", "work"]
    },
    "personal": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp", "--db", "personal"]
    }
  }
}

LANCEDB_PATH still works as an override for edge cases.
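A sketch of how that resolution rule could be expressed (illustrative only, not Quarry's actual code; the real logic lives in its `Settings`):

```python
import os
from pathlib import Path

def resolve_lancedb_path(db: str = "default") -> Path:
    """Mimic the documented rule: LANCEDB_PATH wins, else <QUARRY_ROOT>/<db>/lancedb."""
    override = os.environ.get("LANCEDB_PATH")
    if override:
        return Path(override)
    root = Path(os.environ.get("QUARRY_ROOT", str(Path.home() / ".quarry" / "data")))
    return root / db / "lancedb"
```

So `--db work` lands in `~/.quarry/data/work/lancedb` unless `LANCEDB_PATH` is set.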

Advanced Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `TEXTRACT_POLL_INITIAL` | `5.0` | Initial Textract polling interval (seconds) |
| `TEXTRACT_POLL_MAX` | `30.0` | Max polling interval (1.5x exponential backoff) |
| `TEXTRACT_MAX_WAIT` | `900` | Max wait for a Textract job (seconds) |
| `TEXTRACT_MAX_IMAGE_BYTES` | `10485760` | Max image size for the Textract sync API (10 MB) |
| `SAGEMAKER_ENDPOINT_NAME` | - | SageMaker endpoint for cloud embedding |
| `EMBEDDING_MODEL` | `Snowflake/snowflake-arctic-embed-m-v1.5` | Embedding model identifier (cache key) |
| `EMBEDDING_DIMENSION` | `768` | Embedding vector dimension |
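The three Textract polling knobs describe a capped exponential backoff (total wait is bounded separately by `TEXTRACT_MAX_WAIT`). A sketch of the resulting delay schedule, assuming the 1.5x factor noted above:

```python
def poll_schedule(initial: float = 5.0, cap: float = 30.0,
                  factor: float = 1.5, n: int = 6) -> list[float]:
    """First n polling delays under capped exponential backoff (illustrative)."""
    delays, d = [], initial
    for _ in range(n):
        delays.append(min(d, cap))
        d *= factor
    return delays

print(poll_schedule())  # [5.0, 7.5, 11.25, 16.875, 25.3125, 30.0]
```

After five polls the delay hits the 30-second cap and stays there until the job finishes or `TEXTRACT_MAX_WAIT` elapses.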

Benchmarks

Tested on 19 files (~44 MB) of university course material: clean digital PDFs and PPTX presentations. Single-threaded ingestion, M-series Mac.

Ingestion speed

| Configuration | Time | vs Local |
| --- | --- | --- |
| Local (RapidOCR + ONNX) | 107 s | 1x |
| Cloud (Textract + SageMaker serverless) | 983 s | 9.2x slower |
| Cloud (Textract + SageMaker realtime) | 658 s | 6.1x slower |

Search quality

Five test queries across both configurations returned identical top-1 results. Similarity scores differed by 0.003-0.04 between local INT8 and cloud FP32 embeddings — negligible for retrieval ranking.

When cloud backends help

| Scenario | Recommendation |
| --- | --- |
| Clean digital PDFs, presentations, text | Local. Faster, free, same search quality. |
| Degraded scans, faxes, low-resolution images | Textract. Better character accuracy on noisy input. |
| Thousands of files, batch ingestion | SageMaker realtime. Parallelism across workers offsets overhead at scale. |
| Small-to-medium collections (<100 files) | Local. Cloud overhead dominates at this scale. |

Architecture

Connectors                Formats              Transformations
  │                         │                        │
  └─ Local filesystem       ├─ PDF ──────┬─ text ──→ PyMuPDF extraction
      (register + sync)     │            └─ image ─→ OCR (local or Textract)
                            │
                            ├─ Images ─────────────→ OCR (local or Textract)
                            │
                            ├─ Spreadsheets ───────→ LaTeX tabular serialization
                            │
                            ├─ Presentations ──────→ Slide extraction + LaTeX tables
                            │
                            ├─ HTML ───────────────→ Boilerplate stripping + Markdown
                            │
                            ├─ Text files ─────────→ Section-aware splitting
                            │   (TXT, MD, LaTeX, DOCX)
                            │
                            ├─ Source code ─────────→ Tree-sitter AST splitting
                            │
                            └─ Raw text ───────────→ Direct chunking
                                                         │
                                                  Indexing
                                                    │
                                                    ├─ Sentence-aware chunking
                                                    ├─ Chunk metadata (page_type, source_format)
                                                    ├─ Embedding (local ONNX or SageMaker cloud)
                                                    └─ LanceDB storage
                                                         │
                                                  Query
                                                    │
                                                    ├─ Semantic search
                                                    ├─ Collection filtering
                                                    └─ Format filtering (page_type, source_format)
                                                         │
                                                  Interface
                                                    │
                                                    ├─ MCP Server (stdio)
                                                    ├─ CLI (typer + rich)
                                                    └─ HTTP API (for quarry-menubar)

Library API

Quarry is fully typed (py.typed) and can be used as a Python library:

from pathlib import Path
from quarry import Settings, get_db, ingest_content, ingest_document, search
from quarry.backends import get_embedding_backend

# Load settings from environment variables
settings = Settings()
db = get_db(settings.lancedb_path)

# Ingest a file
result = ingest_document(Path("report.pdf"), db, settings, collection="work")

# Ingest inline content
result = ingest_content("Quarterly revenue was $4.2M.", "notes.txt", db, settings)

# Search
backend = get_embedding_backend(settings)
vector = backend.embed_query("revenue figures")
results = search(db, vector, limit=5, collection_filter="work")
for r in results:
    print(r["text"], r["_distance"])

The public API surface is in quarry/__init__.py. Pipeline functions accept a progress_callback: Callable[[str], None] for status updates during ingestion.
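Since pipeline functions accept that callback, a caller can collect or log status lines during ingestion. A hypothetical sketch (the status message is made up; only the `Callable[[str], None]` signature comes from the text above):

```python
from typing import Callable

def make_progress_logger() -> tuple[Callable[[str], None], list[str]]:
    """Return a callback matching Callable[[str], None], plus the list it appends to."""
    log: list[str] = []
    def on_progress(message: str) -> None:
        log.append(message)
    return on_progress, log

on_progress, log = make_progress_logger()
# Hypothetical usage alongside the ingest_document call shown earlier:
# ingest_document(Path("report.pdf"), db, settings, progress_callback=on_progress)
on_progress("ingestion started")  # stand-in for a real status line
```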

Roadmap

  • macOS menu bar companion app — native macOS search interface (in development)
  • Google Drive connector
  • quarry sync --watch for live filesystem monitoring
  • PII detection and redaction

For product vision and positioning, see PR/FAQ.

Development

uv run ruff check .
uv run ruff format --check .
uv run mypy src/ tests/
uv run pytest

See CONTRIBUTING.md for setup, architecture, and how to add new formats.

Documentation

License

MIT
