
Extract searchable knowledge from any document. Expose it to LLMs via MCP.


quarry-mcp


Unlock the knowledge trapped on your hard drive. Works with Claude Code and Claude Desktop.

Quick Start

One-liner install (Python 3.10+ required):

curl -fsSL https://raw.githubusercontent.com/jmf-pobox/quarry-mcp/main/install.sh | bash

This installs uv (if needed) and quarry-mcp, downloads the embedding model, and configures Claude Code and Claude Desktop.

Or install manually:

pip install quarry-mcp
quarry install          # downloads embedding model (~500MB), configures MCP

Then start using it:

quarry ingest-file notes.md  # index a file — no cloud account needed
quarry search "my topic"

That's it. Quarry works locally out of the box.

What It Does

You have years of knowledge buried in PDFs, scanned documents, notes, spreadsheets, and source code. Quarry extracts that knowledge, makes it searchable by meaning, and gives your LLM access to it.

This is not media search — Quarry doesn't find images or match audio. It reads every document the way you would, extracts the text and structure, and indexes the knowledge inside. A scanned whiteboard becomes searchable prose. A spreadsheet becomes structured data an LLM can reason about. Source code becomes semantic units an LLM can reference.

Supported formats: PDF, images (PNG, JPG, TIFF, BMP, WebP), presentations (PPTX), spreadsheets (XLSX, CSV), HTML, webpages (via URL), text files (TXT, Markdown, LaTeX, DOCX), and source code (30+ languages).

How each format is processed:

| Source | What happens | Result |
| --- | --- | --- |
| PDF (text pages) | Text extraction via PyMuPDF | Prose chunks |
| PDF (image pages) | OCR (local or cloud) | Prose chunks |
| Images | OCR (local or cloud) | Prose chunks |
| Spreadsheets | LaTeX tabular serialization via openpyxl | Tabular chunks |
| HTML | Boilerplate stripping, Markdown conversion | Section chunks |
| Presentations | Slide-per-chunk with tables as LaTeX via python-pptx | Slide chunks |
| Text files | Split by headings / sections / paragraphs | Section chunks |
| Source code | Tree-sitter AST parsing (functions, classes) | Code chunks |

Every format is converted to text optimized for LLM consumption. Structured formats like spreadsheets and presentation tables are serialized to LaTeX to preserve tabular relationships while remaining token-efficient. The goal is always the same: turn your files into knowledge an LLM can use.
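To make the serialization concrete, here is a made-up example of what a small two-column sheet might look like after conversion (illustrative only; the exact output format is Quarry's own):

```latex
% Hypothetical serialization of a two-row revenue sheet
\begin{tabular}{ll}
Quarter & Revenue \\
Q1      & 4.2M    \\
Q2      & 4.8M    \\
\end{tabular}
```

A compact tabular form like this keeps row/column relationships explicit without the token overhead of verbose markup.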

Installation

pip install quarry-mcp
quarry install

quarry install creates ~/.quarry/data/, downloads the embedding model, and writes MCP config for Claude Code and Claude Desktop.

Verify with quarry doctor:

quarry-mcp 0.5.0

  ✓ Python version: 3.13.1
  ✓ Data directory: /Users/you/.quarry/data/default/lancedb
  ✓ Local OCR: RapidOCR engine OK
  ○ AWS credentials: Not configured (optional — needed for OCR_BACKEND=textract)
  ✓ Embedding model: snowflake-arctic-embed-m-v1.5 (ONNX INT8) cached (67.73 MB)
  ✓ Core imports: 9 modules OK
  ✓ Claude Code MCP: configured
  ✓ Claude Desktop MCP: configured
  ✓ Storage: 42.5 MB in /Users/you/.quarry/data

All checks passed.

Usage

Quarry exposes most operations through both a CLI and MCP tools, with a few gaps, shown in the table below. The CLI is for terminal use; MCP tools are what Claude Code and Claude Desktop call on your behalf.

| Operation | CLI | MCP tool |
| --- | --- | --- |
| **Search** | | |
| Semantic search | `quarry search "query"` | `search_documents` |
| **Ingestion** | | |
| Ingest a file | `quarry ingest-file <path>` | `ingest_file` |
| Ingest a URL | `quarry ingest-url <url>` | `ingest_url` |
| Ingest inline text | - | `ingest_content` |
| **Documents** | | |
| List documents | `quarry list` | `get_documents` |
| Get page text | - | `get_page` |
| Delete a document | `quarry delete <name>` | `delete_document` |
| **Collections** | | |
| List collections | `quarry collections` | `list_collections` |
| Delete a collection | `quarry delete-collection <name>` | `delete_collection` |
| **Directory sync** | | |
| Register a directory | `quarry register <path>` | `register_directory` |
| Sync all registrations | `quarry sync` | `sync_all_registrations` |
| List registrations | `quarry registrations` | `list_registrations` |
| Deregister | `quarry deregister <collection>` | `deregister_directory` |
| **System** | | |
| Database status | - | `status` |
| List databases | `quarry databases` | - |
| Install / setup | `quarry install` | - |
| Health check | `quarry doctor` | - |
| HTTP API server | `quarry serve` | - |
| MCP server | `quarry mcp` | - |

CLI examples

quarry ingest-file report.pdf --overwrite          # replace existing data
quarry ingest-file report.pdf --db work            # target a named database
quarry search "revenue" --limit 5                  # limit results
quarry search "tests" --page-type code             # filter by content type
quarry search "revenue" --source-format .xlsx      # filter by source format
quarry register /path/to/docs --collection docs    # explicit collection name

MCP setup

quarry install configures Claude Code and Claude Desktop automatically. Manual setup:

Claude Code:

claude mcp add quarry -- uvx --from quarry-mcp quarry mcp

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "quarry": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"]
    }
  }
}

Use the absolute path to uvx for Desktop (e.g. /opt/homebrew/bin/uvx). quarry install resolves this automatically.

Claude Desktop note: Uploaded files live in a sandbox that Quarry cannot access. Use ingest_content with extracted content for uploads. For files on your Mac, provide the local path to ingest_file.

Configuration

All settings via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `OCR_BACKEND` | `local` | `local` (RapidOCR, offline) or `textract` (AWS) |
| `EMBEDDING_BACKEND` | `onnx` | `onnx` (local, offline) or `sagemaker` (AWS) |
| `QUARRY_ROOT` | `~/.quarry/data` | Base directory for all databases and logs |
| `LANCEDB_PATH` | `~/.quarry/data/default/lancedb` | Vector database location (overrides `--db`) |
| `REGISTRY_PATH` | `~/.quarry/data/default/registry.db` | Directory sync SQLite database |
| `LOG_PATH` | `~/.quarry/data/quarry.log` | Log file location (rotating, 5 MB max) |
| `CHUNK_MAX_CHARS` | `1800` | Target max characters per chunk (~450 tokens) |
| `CHUNK_OVERLAP_CHARS` | `200` | Overlap between consecutive chunks |
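`CHUNK_MAX_CHARS` and `CHUNK_OVERLAP_CHARS` together describe a sliding window. A minimal sketch of that idea (illustrative only; Quarry's actual chunker is sentence-aware, as described under Architecture):

```python
def split_overlapping(text: str, max_chars: int = 1800, overlap: int = 200) -> list[str]:
    """Naive sliding-window split: each chunk shares `overlap` chars with the next."""
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap
    return [text[i : i + max_chars] for i in range(0, len(text) - overlap, step)]

# A 4000-char document yields three windows: 0-1800, 1600-3400, 3200-4000.
doc = "".join(str(i % 10) for i in range(4000))
chunks = split_overlapping(doc)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.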

OCR Backends

Quarry ships with two OCR backends:

| Backend | Per-page OCR | End-to-end ingestion | Quality | Setup |
| --- | --- | --- | --- | --- |
| `local` (default) | ~7-8 s/page | Faster overall | Good for semantic search | None |
| `textract` | ~2-3 s/page | Slower (network overhead) | Excellent character accuracy | AWS credentials + S3 bucket |

The local backend uses RapidOCR (PaddleOCR models via ONNX Runtime, CPU-only, ~214 MB). No cloud account needed.

When to use Textract: Degraded scans, faxes, or low-resolution images where RapidOCR struggles. For clean digital PDFs and presentations, local OCR produces identical search results (see Benchmarks).

AWS Textract Setup

Only needed if you want cloud OCR. Set these environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `AWS_ACCESS_KEY_ID` | - | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | - | AWS secret key |
| `AWS_DEFAULT_REGION` | `us-east-1` | AWS region (must match S3 bucket region) |
| `S3_BUCKET` | - | S3 bucket for Textract uploads (must be in `AWS_DEFAULT_REGION`) |

Your IAM user needs textract:DetectDocumentText, textract:StartDocumentTextDetection, textract:GetDocumentTextDetection, and s3:PutObject/GetObject/DeleteObject on your bucket. All AWS resources (S3 bucket, SageMaker endpoint) must be in the same region. See docs/AWS-SETUP.md for full setup including IAM policy and region strategy.
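A minimal IAM policy sketch covering the actions listed above (the bucket name is a placeholder; see docs/AWS-SETUP.md for the authoritative policy):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET/*"
    }
  ]
}
```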

SageMaker Embedding Setup

Optional cloud embedding for ingestion. Search always uses local ONNX regardless of this setting. Local ONNX embedding sustains ~11 chunks/s on a laptop — sufficient for most workloads. SageMaker is designed for large-scale batch ingestion (thousands of files) where parallelism across workers offsets the network overhead. For small-to-medium collections, local is faster (see Benchmarks).
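As a back-of-envelope check on the ~11 chunks/s figure (the collection size here is hypothetical):

```python
local_rate = 11   # chunks/s, the local ONNX figure quoted above
chunks = 10_000   # hypothetical medium-large collection
minutes = chunks / local_rate / 60
print(f"~{minutes:.1f} minutes to embed locally")  # ~15.2 minutes
```

At that rate, only collections well beyond this size start to justify the network overhead of a cloud endpoint.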

  1. Deploy the endpoint (requires an AWS CLI profile with admin access):
./infra/manage-stack.sh deploy              # serverless (default, pay-per-request)
./infra/manage-stack.sh deploy realtime     # persistent instance (~$0.12/hr)

The script uses the QUARRY_DEPLOY_PROFILE env var (default: admin). Set it to match your AWS CLI profile name if different.

  2. Configure quarry:

| Variable | Default | Description |
| --- | --- | --- |
| `EMBEDDING_BACKEND` | `onnx` | Set to `sagemaker` to use cloud embedding for ingestion |
| `SAGEMAKER_ENDPOINT_NAME` | - | Endpoint name (e.g. `quarry-embedding`) |
| `AWS_DEFAULT_REGION` | `us-east-1` | Must match the endpoint's deployed region |

  3. Tear down when not in use:
./infra/manage-stack.sh destroy

The serverless endpoint scales to zero when idle — you only pay per inference request. The realtime endpoint (~$0.12/hr) eliminates cold starts for sustained workloads. The management script packages a custom inference handler (CLS-token pooling + L2 normalization), uploads it to S3, and deploys the CloudFormation stack. See docs/AWS-SETUP.md for IAM setup and region strategy.

Named Databases

Use --db to keep separate databases for different projects:

quarry ingest-file report.pdf --db work
quarry ingest-file paper.pdf --db personal
quarry search "revenue" --db work
quarry databases  # list all databases with stats

Each database resolves to ~/.quarry/data/<name>/lancedb with its own registry. Start an MCP server against a named database:

{
  "mcpServers": {
    "work": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp", "--db", "work"]
    },
    "personal": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp", "--db", "personal"]
    }
  }
}

LANCEDB_PATH still works as an override for edge cases.
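A sketch of how that resolution rule could be expressed (illustrative only, not Quarry's actual code; the real logic lives in its `Settings`):

```python
import os
from pathlib import Path

def resolve_lancedb_path(db: str = "default") -> Path:
    """Mimic the documented rule: LANCEDB_PATH wins, else <QUARRY_ROOT>/<db>/lancedb."""
    override = os.environ.get("LANCEDB_PATH")
    if override:
        return Path(override)
    root = Path(os.environ.get("QUARRY_ROOT", str(Path.home() / ".quarry" / "data")))
    return root / db / "lancedb"
```

So `--db work` lands in `~/.quarry/data/work/lancedb` unless `LANCEDB_PATH` is set.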

Advanced Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `TEXTRACT_POLL_INITIAL` | `5.0` | Initial Textract polling interval (seconds) |
| `TEXTRACT_POLL_MAX` | `30.0` | Max polling interval (1.5x exponential backoff) |
| `TEXTRACT_MAX_WAIT` | `900` | Max wait for a Textract job (seconds) |
| `TEXTRACT_MAX_IMAGE_BYTES` | `10485760` | Max image size for the Textract sync API (10 MB) |
| `SAGEMAKER_ENDPOINT_NAME` | - | SageMaker endpoint for cloud embedding |
| `EMBEDDING_MODEL` | `Snowflake/snowflake-arctic-embed-m-v1.5` | Embedding model identifier (cache key) |
| `EMBEDDING_DIMENSION` | `768` | Embedding vector dimension |
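The three Textract polling knobs describe a capped exponential backoff (total wait is bounded separately by `TEXTRACT_MAX_WAIT`). A sketch of the resulting delay schedule, assuming the 1.5x factor noted above:

```python
def poll_schedule(initial: float = 5.0, cap: float = 30.0,
                  factor: float = 1.5, n: int = 6) -> list[float]:
    """First n polling delays under capped exponential backoff (illustrative)."""
    delays, d = [], initial
    for _ in range(n):
        delays.append(min(d, cap))
        d *= factor
    return delays

print(poll_schedule())  # [5.0, 7.5, 11.25, 16.875, 25.3125, 30.0]
```

After five polls the delay hits the 30-second cap and stays there until the job finishes or `TEXTRACT_MAX_WAIT` elapses.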

Benchmarks

Tested on 19 files (~44 MB) of university course material: clean digital PDFs and PPTX presentations. Single-threaded ingestion, M-series Mac.

Ingestion speed

| Configuration | Time | vs Local |
| --- | --- | --- |
| Local (RapidOCR + ONNX) | 107 s | 1x |
| Cloud (Textract + SageMaker serverless) | 983 s | 9.2x slower |
| Cloud (Textract + SageMaker realtime) | 658 s | 6.1x slower |

Search quality

Five test queries across both configurations returned identical top-1 results. Similarity scores differed by 0.003-0.04 between local INT8 and cloud FP32 embeddings — negligible for retrieval ranking.

When cloud backends help

| Scenario | Recommendation |
| --- | --- |
| Clean digital PDFs, presentations, text | Local. Faster, free, same search quality. |
| Degraded scans, faxes, low-resolution images | Textract. Better character accuracy on noisy input. |
| Thousands of files, batch ingestion | SageMaker realtime. Parallelism across workers offsets overhead at scale. |
| Small-to-medium collections (<100 files) | Local. Cloud overhead dominates at this scale. |

Architecture

Connectors                Formats              Transformations
  │                         │                        │
  └─ Local filesystem       ├─ PDF ──────┬─ text ──→ PyMuPDF extraction
      (register + sync)     │            └─ image ─→ OCR (local or Textract)
                            │
                            ├─ Images ─────────────→ OCR (local or Textract)
                            │
                            ├─ Spreadsheets ───────→ LaTeX tabular serialization
                            │
                            ├─ Presentations ──────→ Slide extraction + LaTeX tables
                            │
                            ├─ HTML ───────────────→ Boilerplate stripping + Markdown
                            │
                            ├─ Text files ─────────→ Section-aware splitting
                            │   (TXT, MD, LaTeX, DOCX)
                            │
                            ├─ Source code ─────────→ Tree-sitter AST splitting
                            │
                            └─ Raw text ───────────→ Direct chunking
                                                         │
                                                  Indexing
                                                    │
                                                    ├─ Sentence-aware chunking
                                                    ├─ Chunk metadata (page_type, source_format)
                                                    ├─ Embedding (local ONNX or SageMaker cloud)
                                                    └─ LanceDB storage
                                                         │
                                                  Query
                                                    │
                                                    ├─ Semantic search
                                                    ├─ Collection filtering
                                                    └─ Format filtering (page_type, source_format)
                                                         │
                                                  Interface
                                                    │
                                                    ├─ MCP Server (stdio)
                                                    ├─ CLI (typer + rich)
                                                    └─ HTTP API (for quarry-menubar)

Library API

Quarry is fully typed (py.typed) and can be used as a Python library:

from pathlib import Path
from quarry import Settings, get_db, ingest_content, ingest_document, search
from quarry.backends import get_embedding_backend

# Load settings from environment variables
settings = Settings()
db = get_db(settings.lancedb_path)

# Ingest a file
result = ingest_document(Path("report.pdf"), db, settings, collection="work")

# Ingest inline content
result = ingest_content("Quarterly revenue was $4.2M.", "notes.txt", db, settings)

# Search
backend = get_embedding_backend(settings)
vector = backend.embed_query("revenue figures")
results = search(db, vector, limit=5, collection_filter="work")
for r in results:
    print(r["text"], r["_distance"])

The public API surface is in quarry/__init__.py. Pipeline functions accept a progress_callback: Callable[[str], None] for status updates during ingestion.
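Since pipeline functions accept that callback, a caller can collect or log status lines during ingestion. A hypothetical sketch (the status message is made up; only the `Callable[[str], None]` signature comes from the text above):

```python
from typing import Callable

def make_progress_logger() -> tuple[Callable[[str], None], list[str]]:
    """Return a callback matching Callable[[str], None], plus the list it appends to."""
    log: list[str] = []
    def on_progress(message: str) -> None:
        log.append(message)
    return on_progress, log

on_progress, log = make_progress_logger()
# Hypothetical usage alongside the ingest_document call shown earlier:
# ingest_document(Path("report.pdf"), db, settings, progress_callback=on_progress)
on_progress("ingestion started")  # stand-in for a real status line
```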

Roadmap

  • macOS menu bar companion app — native macOS search interface (in development)
  • Google Drive connector
  • quarry sync --watch for live filesystem monitoring
  • PII detection and redaction

For product vision and positioning, see PR/FAQ.

Development

uv run ruff check .
uv run ruff format --check .
uv run mypy src/ tests/
uv run pytest

See CONTRIBUTING.md for setup, architecture, and how to add new formats.

Documentation

License

MIT
