Extract searchable knowledge from any document. Expose it to LLMs via MCP.

quarry-mcp

A document intelligence pipeline for LLMs. Ingest anything, search everything.

Quarry transforms documents into searchable knowledge through format-aware ingestion, intelligent content transformations, local vector indexing, and semantic search — exposed as both an MCP server and a CLI.

Why Quarry?

Most RAG tools handle only plain text. Quarry handles the full spectrum:

Capability	What Quarry does	Typical RAG tool
Formats	PDF, images, code, text, spreadsheets (planned)	Text files only
Transformations	OCR, tree-sitter AST splitting, LaTeX tabular (planned)	None — expects pre-processed text
Indexing	Vector embeddings, incremental sync, collections	Basic embedding
Query	Semantic search, format filters (planned), full page context	Vector similarity only
Interface	MCP server + CLI	Usually one or the other
OCR	Local (RapidOCR, offline) and cloud (AWS Textract)	None

Quarry works out of the box with local capabilities — no cloud accounts or API keys required to get started. For demanding use cases (OCR on scanned documents, large-scale ingestion), cloud backends like AWS Textract are available as drop-in upgrades. The goal: simple enough for non-engineers, complete enough for the most demanding personal knowledge bases.

Capabilities

Formats

Quarry ingests these document types today:

PDF — automatic text/image classification per page. Text pages use PyMuPDF; image pages route through OCR.
Images — PNG, JPG, TIFF (multi-page), BMP, WebP.
Text files — TXT, Markdown, LaTeX, DOCX. Section-aware splitting by headings and structure.
Source code — 30+ languages via tree-sitter AST splitting. Functions, classes, and imports become semantic sections.
Raw text — paste content directly via ingest_text for uploads or clipboard.
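
To make "section-aware splitting" concrete, here is a minimal sketch of splitting a text file on Markdown headings and LaTeX `\section{}` commands. It is an illustration only, not Quarry's implementation: the `HEADING` pattern and the `split_sections` name are assumptions made for this example.

```python
import re

# Assumed heading pattern: Markdown "#" headings or LaTeX \section{...},
# \subsection{...}, \subsubsection{...}. Quarry's real rules may differ.
HEADING = re.compile(r"^(#{1,6}\s+.+|\\(?:sub)*section\{.+\})\s*$")


def split_sections(text: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets ''."""
    sections: list[tuple[str, str]] = []
    heading, body = "", []

    def flush() -> None:
        # Skip a completely empty leading section.
        if heading or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in text.splitlines():
        if HEADING.match(line):
            flush()  # close the previous section before starting a new one
            heading, body = line.strip(), []
        else:
            body.append(line)
    flush()
    return sections


if __name__ == "__main__":
    sample = "intro\n# Setup\ninstall it\n\\section{Usage}\nrun it\n"
    for title, content in split_sections(sample):
        print(repr(title), "->", repr(content))
```

Each section can then be passed to the chunker on its own, so a chunk never straddles two headings.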

Transformations

Each format goes through a content-specific transformation before indexing:

Source	Transformation	Output
PDF (text pages)	PyMuPDF extraction	Prose chunks
PDF (image pages)	Local OCR (RapidOCR) or AWS Textract	Prose chunks
Images	Local OCR (RapidOCR) or AWS Textract	Prose chunks
Text files	Section-aware splitting (headings, `\section{}`, paragraphs)	Section chunks
Source code	Tree-sitter AST parsing (functions, classes, imports)	Code chunks
Spreadsheets (planned)	pandas → LaTeX tabular	Tabular chunks
Presentations (planned)	Slide + speaker notes extraction	Slide chunks
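
The source-code row can be illustrated with a small stand-in. Quarry uses tree-sitter so it can cover many languages; the sketch below swaps in Python's standard-library `ast` module, which handles Python only, to show the same idea of turning imports, functions, and classes into separate chunks. The function name and the chunk dictionary layout are hypothetical.

```python
import ast


def split_python_source(source: str) -> list[dict[str, str]]:
    """Split top-level imports, functions, and classes into labelled chunks."""
    tree = ast.parse(source)
    chunks: list[dict[str, str]] = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            kind, name = "import", ""
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind, name = "function", node.name
        elif isinstance(node, ast.ClassDef):
            kind, name = "class", node.name
        else:
            continue  # other top-level statements are ignored in this sketch
        # get_source_segment returns the exact text the node spans
        # (decorators are not included for functions and classes).
        text = ast.get_source_segment(source, node) or ""
        chunks.append({"kind": kind, "name": name, "text": text})
    return chunks


if __name__ == "__main__":
    code = "import os\n\ndef f(x):\n    return x + 1\n\nclass C:\n    pass\n"
    for chunk in split_python_source(code):
        print(chunk["kind"], chunk["name"])
```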

Connectors

Local filesystem — ingest individual files or register directories for incremental sync. Detects new, changed, and deleted files via mtime+size comparison. Parallel ingestion via ThreadPoolExecutor.
Google Drive (planned) — cloud document source.
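
The mtime+size comparison described for the local filesystem connector can be sketched as follows. This is a hedged illustration, not Quarry's code: `scan`, `diff`, and `ingest_parallel` are hypothetical names, and the caller supplies the `ingest` function.

```python
from collections.abc import Callable, Iterable
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

Fingerprint = tuple[int, int]  # (mtime in nanoseconds, size in bytes)


def scan(root: Path) -> dict[str, Fingerprint]:
    """Fingerprint every regular file under root, keyed by relative path."""
    result: dict[str, Fingerprint] = {}
    for path in root.rglob("*"):
        if path.is_file():
            st = path.stat()
            result[str(path.relative_to(root))] = (st.st_mtime_ns, st.st_size)
    return result


def diff(
    previous: dict[str, Fingerprint], current: dict[str, Fingerprint]
) -> tuple[list[str], list[str], list[str]]:
    """Return (new, changed, deleted) paths between two scans."""
    new = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    changed = sorted(
        key for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    )
    return new, changed, deleted


def ingest_parallel(
    paths: Iterable[str], ingest: Callable[[str], object], workers: int = 4
) -> None:
    """Run a caller-supplied ingest function over paths in a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(ingest, paths))
```

Comparing only mtime and size avoids reading file contents, at the cost of missing an edit that preserves both values.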

Indexing

Vector embeddings — snowflake-arctic-embed-m-v1.5 (768-dim), runs locally.
Sentence-aware chunking — 1800-char target with 200-char overlap. Preserves sentence boundaries.
Incremental sync — register directories, sync on demand. Only re-indexes changed files.
Collections — organize documents by project, topic, or source.
Full page context — each chunk retains the complete page text for LLM reference.
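
As a rough sketch of sentence-aware chunking with overlap, the function below packs whole sentences into chunks up to a target size and carries trailing sentences forward as overlap. The defaults mirror the documented 1800/200 values, but the sentence-splitting regex and the function name are assumptions, and Quarry's actual chunker may behave differently.

```python
import re


def chunk_sentences(
    text: str, max_chars: int = 1800, overlap_chars: int = 200
) -> list[str]:
    """Pack whole sentences into chunks, repeating trailing sentences as overlap."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > overlap_chars:
                    break
                carried.insert(0, prev)
            current = carried
        # A single sentence longer than max_chars is kept whole, never cut.
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    demo = "Aaaa. Bbbb. Cccc. Dddd."
    print(chunk_sentences(demo, max_chars=11, overlap_chars=5))
```

With the tiny limits in the demo, each chunk holds two sentences and repeats the last sentence of the previous chunk.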

Query

Semantic search — vector similarity across all indexed documents.
Collection filtering — scope searches to specific collections.
Content type and format filters (planned) — filter by page_type (code, text, spreadsheet) or source_format (.pdf, .py, .xlsx).
Hybrid search (planned) — combine vector similarity with document-level ranking.

Interface

MCP server — 13 tools for ingestion, search, sync, and document management. Works with Claude Code and Claude Desktop.
CLI — same capabilities via quarry command with Rich progress display.

Quick Start

pip install quarry-mcp

# Set up data directory, download embedding model, configure MCP clients
quarry install

# Check everything is working
quarry doctor

# Ingest documents — works locally, no cloud account needed
quarry ingest notes.md
quarry ingest src/main.py

# Search
quarry search "authentication logic"

# List indexed documents
quarry list

PDF and image OCR works locally out of the box via RapidOCR. For higher accuracy on scanned documents, configure AWS Textract — see AWS Setup below.

Installation

pip install quarry-mcp
quarry install

quarry install creates the data directory (~/.quarry/data/lancedb/), downloads the embedding model (~500 MB), and configures MCP for Claude Code and Claude Desktop.

Run quarry doctor to verify your environment:

  ✓ Python version: 3.13.1
  ✓ Data directory: /Users/you/.quarry/data/lancedb
  ✓ Local OCR: RapidOCR engine OK
  ○ AWS credentials: Not configured (optional — needed for OCR_BACKEND=textract)
  ✓ Embedding model: snowflake-arctic-embed-m-v1.5 cached
  ✓ Core imports: 8 modules OK

OCR Backends

Quarry ships with two OCR backends:

Backend	Set via	Speed	Quality	Setup
local (default)	`OCR_BACKEND=local`	~7-8s/page	Good — reads names, amounts, dates, descriptions accurately. Minor artifacts: occasional fullwidth punctuation, spacing inconsistencies on dense forms.	None
textract	`OCR_BACKEND=textract`	~2-3s/page	Excellent — production-grade character accuracy	AWS credentials + S3 bucket

The local backend uses RapidOCR (PaddleOCR models via ONNX Runtime, CPU-only, ~214 MB). In testing on a 6-page scanned boatyard invoice, it extracted task descriptions, dollar amounts, labor hours, and addresses accurately enough for semantic search.
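
If the fullwidth punctuation and spacing artifacts matter for your own downstream processing, one possible cleanup is Unicode NFKC normalization plus whitespace collapsing. This is a suggestion sketched with the standard library, not something Quarry is documented to do, and `normalize_ocr_text` is a hypothetical name. Note that NFKC also folds other compatibility characters such as ligatures and superscripts.

```python
import re
import unicodedata


def normalize_ocr_text(text: str) -> str:
    """Fold fullwidth characters to ASCII and collapse runs of spaces per line."""
    # NFKC maps fullwidth forms such as U+FF1A to their ASCII equivalents.
    folded = unicodedata.normalize("NFKC", text)
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in folded.splitlines()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(normalize_ocr_text("Total：  ＄１，２００．５０"))
```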

AWS Setup

Optional: only needed if you use the Textract backend. Your IAM user needs a policy that grants these permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-bucket/*"
    }
  ]
}

Set your S3 bucket:

export S3_BUCKET=your-bucket-name

Usage

MCP Server

quarry install configures both Claude Code and Claude Desktop automatically. To configure manually:

Claude Code:

claude mcp add quarry -- uvx --from quarry-mcp quarry mcp

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "quarry": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"]
    }
  }
}

Use the absolute path to uvx for Desktop (e.g. /opt/homebrew/bin/uvx) since Desktop has a limited PATH. quarry install resolves this automatically.

MCP Tools

Tool	Description
`search_documents`	Semantic search across all indexed documents
`ingest`	Ingest a file (PDF, image, text, source code)
`ingest_text`	Index raw text content directly (for uploads or pasted text)
`get_documents`	List all indexed documents with metadata
`get_page`	Retrieve full text for a specific page
`delete_document`	Remove a document and all its chunks
`delete_collection`	Remove all documents in a collection
`list_collections`	List all collections with document and chunk counts
`register_directory`	Register a directory for incremental sync
`deregister_directory`	Remove a directory registration
`sync_all_registrations`	Sync all registered directories (ingest new/changed, remove deleted)
`list_registrations`	List all registered directories
`status`	Database stats: document/chunk counts, registrations, storage size, model info

Claude Desktop note: Uploaded files live in a container that Quarry cannot access. For uploaded files, use ingest_text with the extracted content. For files on your Mac, provide the local path to ingest.
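
MCP clients normally build tool calls for you, but it can help to see the shape of one. MCP uses JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The sketch below builds such a request for `search_documents`; the argument names `query` and `limit` are assumptions, since the table above does not list each tool's parameters.

```python
import json


def tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


if __name__ == "__main__":
    # "query" and "limit" are hypothetical argument names for illustration.
    print(tool_call(1, "search_documents", {"query": "quarterly revenue", "limit": 5}))
```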

CLI

# Ingest documents
quarry ingest report.pdf
quarry ingest whiteboard.jpg
quarry ingest notes.md
quarry ingest report.pdf --overwrite

# Search
quarry search "board governance structure"
quarry search "quarterly revenue" -n 5

# Manage documents
quarry list
quarry delete report.pdf
quarry collections
quarry delete-collection math

# Register directories for incremental sync
quarry register /path/to/courses/ml-101 --collection ml-101
quarry register /path/to/courses/stats-200
quarry registrations
quarry sync
quarry sync --workers 8
quarry deregister ml-101

# Environment
quarry doctor
quarry install

Multiple Indices

Run separate MCP server instances with different data directories:

{
  "mcpServers": {
    "legal-docs": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/legal/lancedb" }
    },
    "financial-reports": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/financial/lancedb" }
    }
  }
}

Configuration

All settings are configurable via environment variables:

Variable	Default	Description
`OCR_BACKEND`	`local`	OCR backend: `local` (RapidOCR) or `textract` (AWS)
`AWS_ACCESS_KEY_ID`		AWS access key (textract only)
`AWS_SECRET_ACCESS_KEY`		AWS secret key (textract only)
`AWS_DEFAULT_REGION`	`us-east-1`	AWS region
`S3_BUCKET`	`ocr-7f3a1b2e4c5d4e8f9a1b3c5d7e9f2a4b`	S3 bucket for Textract uploads
`LANCEDB_PATH`	`~/.quarry/data/lancedb`	Path to LanceDB storage
`EMBEDDING_MODEL`	`Snowflake/snowflake-arctic-embed-m-v1.5`	HuggingFace embedding model
`CHUNK_MAX_CHARS`	`1800`	Target max characters per chunk (~450 tokens)
`CHUNK_OVERLAP_CHARS`	`200`	Character overlap between consecutive chunks
`TEXTRACT_POLL_INITIAL`	`5.0`	Initial seconds between Textract status checks
`TEXTRACT_POLL_MAX`	`30.0`	Maximum polling interval (exponential backoff, 1.5x)
`TEXTRACT_MAX_WAIT`	`900`	Maximum seconds to wait for Textract job
`REGISTRY_PATH`	`~/.quarry/data/registry.db`	Path to directory registration SQLite database
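
To see how the three Textract polling settings interact, the sketch below generates a polling schedule from the documented defaults: start at 5.0 seconds, multiply by 1.5 after each check, cap at 30.0 seconds, and stop once the total wait would exceed 900 seconds. How Quarry enforces the overall limit is an assumption here, and `poll_intervals` is a hypothetical name.

```python
from collections.abc import Iterator


def poll_intervals(
    initial: float = 5.0,
    maximum: float = 30.0,
    factor: float = 1.5,
    max_wait: float = 900.0,
) -> Iterator[float]:
    """Yield sleep intervals until the cumulative wait would exceed max_wait."""
    elapsed, interval = 0.0, initial
    while elapsed + interval <= max_wait:
        yield interval
        elapsed += interval
        # Exponential backoff, capped at the maximum interval.
        interval = min(interval * factor, maximum)


if __name__ == "__main__":
    schedule = list(poll_intervals())
    print(schedule[:6], "total:", sum(schedule))
```

With the defaults, the interval reaches the 30-second cap on the sixth check and stays there.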

Architecture

Connectors                Formats              Transformations
  │                         │                        │
  ├─ Local filesystem       ├─ PDF ──────┬─ text ──→ PyMuPDF extraction
  │   (register + sync)     │            └─ image ─→ OCR (local or Textract)
  │                         │
  └─ Google Drive           ├─ Images ─────────────→ OCR (local or Textract)
     (planned)              │
                            ├─ Text files ─────────→ Section-aware splitting
                            │
                            ├─ Source code ────────→ Tree-sitter AST splitting
                            │
                            ├─ Spreadsheets ───────→ LaTeX tabular (planned)
                            │
                            └─ Raw text ───────────→ Direct chunking
                                                         │
                                                  Indexing
                                                    │
                                                    ├─ Sentence-aware chunking
                                                    ├─ Vector embeddings
                                                    └─ LanceDB storage
                                                         │
                                                  Query
                                                    │
                                                    ├─ Semantic search
                                                    ├─ Collection filtering
                                                    └─ Format filters (planned)
                                                         │
                                                  Interface
                                                    │
                                                    ├─ MCP Server (stdio)
                                                    └─ CLI (typer + rich)

Roadmap

Formats

Spreadsheets — XLSX, XLS, CSV ingestion via LaTeX tabular serialization
Presentations — PPTX slide extraction with speaker notes
HTML — web page ingestion with structure-aware splitting
Email — EML/MBOX with header, body, and attachment extraction

Transformations

PII detection — identify and redact sensitive information before indexing

Connectors

Google Drive — cloud document source with incremental sync

Query

Search filters — filter by content type and file format for targeted retrieval
Hybrid search — combine vector similarity with document-level ranking

Development

# Run all quality gates
uv run ruff check .
uv run ruff format --check .
uv run mypy src/quarry tests
uv run pytest

The project enforces strict mypy, comprehensive ruff rules, and requires all tests to pass before every commit.

License

MIT

Release history

0.6.0 (Feb 15, 2026)
0.4.2 (Feb 12, 2026)
0.4.1 (Feb 12, 2026)
0.4.0 (Feb 12, 2026), this version
0.3.0 (Feb 10, 2026)
0.2.1 (Feb 9, 2026)
0.2.0 (Feb 9, 2026)
0.1.3 (Feb 9, 2026)
0.1.2 (Feb 8, 2026)
0.1.1 (Feb 8, 2026)
0.1.0 (Feb 8, 2026)

Download files

Source Distribution

quarry_mcp-0.4.0.tar.gz (35.5 kB)

Uploaded Feb 12, 2026 Source

Built Distribution

quarry_mcp-0.4.0-py3-none-any.whl (44.4 kB)

Uploaded Feb 12, 2026 Python 3

File details

Details for the file quarry_mcp-0.4.0.tar.gz.

File metadata

Download URL: quarry_mcp-0.4.0.tar.gz
Upload date: Feb 12, 2026
Size: 35.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for quarry_mcp-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`5ce176361f676cde681a82016cd5510231589183edc30720f27de39080ce7042`
MD5	`0f92fcbd1b93bff755522e89bfabc675`
BLAKE2b-256	`a9bc225514b5c4bff4b7219992a17ce5cd09b59002cb2ddef2605844fdea5154`

File details

Details for the file quarry_mcp-0.4.0-py3-none-any.whl.

File metadata

Download URL: quarry_mcp-0.4.0-py3-none-any.whl
Upload date: Feb 12, 2026
Size: 44.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for quarry_mcp-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0c728bd83b7635b22d2621a790e12ab63f53348f9552bb15553bd0cc0da54ac5`
MD5	`f6fe9fdf53de1bc31207995a1b3c5cc6`
BLAKE2b-256	`c63adc90ded25120d663ea1afe83fb4ad9a329981e5dd15192ff9b1bf0ef8b71`
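
To check a downloaded file against the SHA256 digests listed above, something like the following standard-library sketch should work. The helper names are hypothetical; the expected digest is whichever value from the tables matches the file you downloaded.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Compare the SHA256 of data with an expected hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids early-exit comparison; case is normalized first.
    return hmac.compare_digest(actual, expected_hex.strip().lower())


def file_matches(path: str, expected_hex: str) -> bool:
    """Read a file from disk and check it against an expected SHA256 digest."""
    return sha256_matches(Path(path).read_bytes(), expected_hex)
```

For example, `file_matches("quarry_mcp-0.4.0.tar.gz", "<SHA256 from the table>")` returns True only when the local file is byte-for-byte the published one.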
