
Extract searchable knowledge from any document. Expose it to LLMs via MCP.


quarry-mcp


Index any document. Search with natural language. Works with Claude Code and Claude Desktop.

Quick Start

pip install quarry-mcp
quarry install          # downloads embedding model (~500MB), configures MCP
quarry ingest notes.md  # index a file — no cloud account needed
quarry search "my topic"

That's it. Quarry works locally out of the box.

What It Does

Quarry turns documents into searchable knowledge for LLMs. You feed it files; it chunks and embeds them into a local vector database, then exposes semantic search via MCP tools or a CLI.

Supported formats: PDF, images (PNG, JPG, TIFF, BMP, WebP), text files (TXT, Markdown, LaTeX, DOCX), and source code (30+ languages).

How each format is processed:

| Source | What happens | Result |
|---|---|---|
| PDF (text pages) | Text extraction via PyMuPDF | Prose chunks |
| PDF (image pages) | OCR (local or cloud) | Prose chunks |
| Images | OCR (local or cloud) | Prose chunks |
| Text files | Split by headings / sections / paragraphs | Section chunks |
| Source code | Tree-sitter AST parsing (functions, classes) | Code chunks |
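The routing above can be pictured as a mapping from file extension to processing route. A hypothetical sketch of the idea, not Quarry's internals (the route names and table are illustrative):

```python
from pathlib import Path

# Hypothetical routing table mirroring the format matrix above.
ROUTES = {
    ".pdf": "pymupdf-or-ocr",  # text pages extracted, image pages OCR'd
    ".png": "ocr", ".jpg": "ocr", ".tiff": "ocr", ".bmp": "ocr", ".webp": "ocr",
    ".md": "section-split", ".txt": "section-split", ".tex": "section-split",
    ".py": "tree-sitter", ".rs": "tree-sitter",  # 30+ languages in practice
}

def route(path: str) -> str:
    """Pick a processing route for a file, defaulting to plain text splitting."""
    return ROUTES.get(Path(path).suffix.lower(), "section-split")
```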

Installation

pip install quarry-mcp
quarry install

quarry install creates ~/.quarry/data/, downloads the embedding model, and writes MCP config for Claude Code and Claude Desktop.

Verify with quarry doctor:

  ✓ Python version: 3.13.1
  ✓ Data directory: /Users/you/.quarry/data/lancedb
  ✓ Local OCR: RapidOCR engine OK
  ○ AWS credentials: Not configured (optional — needed for OCR_BACKEND=textract)
  ✓ Embedding model: snowflake-arctic-embed-m-v1.5 cached
  ✓ Core imports: 8 modules OK

Usage

CLI

# Ingest
quarry ingest report.pdf
quarry ingest whiteboard.jpg
quarry ingest src/main.py
quarry ingest report.pdf --overwrite

# Search
quarry search "authentication logic"
quarry search "quarterly revenue" -n 5

# Manage
quarry list
quarry delete report.pdf
quarry collections
quarry delete-collection math

# Directory sync — register a folder, then sync to pick up changes
quarry register /path/to/docs --collection my-docs
quarry sync
quarry registrations
quarry deregister my-docs
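Conceptually, sync is change detection: compare each file's modification time against what was recorded at the last sync, and re-ingest anything new or modified. A minimal illustrative sketch (Quarry's actual registry lives in SQLite; this in-memory version is not its implementation):

```python
import os

def changed_files(directory: str, last_seen: dict[str, float]) -> list[str]:
    """Return files that are new or modified since the recorded mtimes."""
    changed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if last_seen.get(path) != mtime:
                changed.append(path)
                last_seen[path] = mtime  # record for the next sync
    return changed
```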

MCP Server

quarry install configures Claude Code and Claude Desktop automatically. Manual setup:

Claude Code:

claude mcp add quarry -- uvx --from quarry-mcp quarry mcp

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "quarry": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"]
    }
  }
}

Use the absolute path to uvx for Desktop (e.g. /opt/homebrew/bin/uvx). quarry install resolves this automatically.

Available tools:

| Tool | Description |
|---|---|
| search_documents | Semantic search across indexed documents |
| ingest | Ingest a file (PDF, image, text, source code) |
| ingest_text | Index raw text content directly |
| get_documents | List indexed documents with metadata |
| get_page | Retrieve full text for a specific page |
| delete_document | Remove a document and its chunks |
| delete_collection | Remove all documents in a collection |
| list_collections | List collections with document/chunk counts |
| register_directory | Register a directory for sync |
| deregister_directory | Remove a directory registration |
| sync_all_registrations | Sync all registered directories |
| list_registrations | List registered directories |
| status | Database stats: counts, storage size, model info |

Claude Desktop note: Uploaded files live in a sandbox that Quarry cannot access. Use ingest_text with extracted content for uploads. For files on your Mac, provide the local path to ingest.

Configuration

All settings via environment variables:

| Variable | Default | Description |
|---|---|---|
| OCR_BACKEND | local | local (RapidOCR, offline) or textract (AWS) |
| LANCEDB_PATH | ~/.quarry/data/lancedb | Vector database location |
| CHUNK_MAX_CHARS | 1800 | Target max characters per chunk (~450 tokens) |
| CHUNK_OVERLAP_CHARS | 200 | Overlap between consecutive chunks |
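The two chunking parameters behave roughly like a sliding window: each chunk is at most CHUNK_MAX_CHARS long and begins CHUNK_OVERLAP_CHARS before the previous chunk ended. A sketch of the windowing arithmetic only (Quarry's real chunker is sentence-aware; this naive version is not):

```python
def window_chunks(text: str, max_chars: int = 1800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows, overlapping neighbours by `overlap` chars."""
    step = max_chars - overlap  # each new window starts this far into the text
    return [text[i:i + max_chars] for i in range(0, max(len(text) - overlap, 1), step)]
```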

OCR Backends

Quarry ships with two OCR backends:

| Backend | Speed | Quality | Setup |
|---|---|---|---|
| local (default) | ~7-8 s/page | Good for semantic search | None |
| textract | ~2-3 s/page | Excellent character accuracy | AWS credentials + S3 bucket |

The local backend uses RapidOCR (PaddleOCR models via ONNX Runtime, CPU-only, ~214 MB). No cloud account needed.

AWS Textract Setup

Only needed if you want cloud OCR. Set these environment variables:

| Variable | Default | Description |
|---|---|---|
| AWS_ACCESS_KEY_ID | — | AWS access key |
| AWS_SECRET_ACCESS_KEY | — | AWS secret key |
| AWS_DEFAULT_REGION | us-east-1 | AWS region |
| S3_BUCKET | ocr-7f3a1b2e4c5d4e8f9a1b3c5d7e9f2a4b | S3 bucket for Textract uploads |

Your IAM user needs textract:DetectDocumentText, textract:StartDocumentTextDetection, textract:GetDocumentTextDetection, and s3:PutObject/GetObject/DeleteObject on your bucket.
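Those permissions translate into an IAM policy along these lines. This is a sketch, not a policy shipped with Quarry; YOUR_BUCKET is a placeholder for your own bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET/*"
    }
  ]
}
```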

Multiple Indices

Run separate MCP instances with different data directories:

{
  "mcpServers": {
    "legal-docs": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/legal/lancedb" }
    },
    "financial-reports": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/financial/lancedb" }
    }
  }
}

Advanced Configuration

| Variable | Default | Description |
|---|---|---|
| TEXTRACT_POLL_INITIAL | 5.0 | Initial Textract polling interval (seconds) |
| TEXTRACT_POLL_MAX | 30.0 | Max polling interval (1.5x exponential backoff) |
| TEXTRACT_MAX_WAIT | 900 | Max wait for a Textract job (seconds) |
| REGISTRY_PATH | ~/.quarry/data/registry.db | Directory sync SQLite database |
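The three polling variables describe a capped exponential backoff: wait 5 s, then 1.5x longer on each retry, capped at 30 s per interval, giving up once 900 s have elapsed. A sketch of the resulting schedule (a hypothetical helper, not Quarry's code):

```python
def poll_schedule(initial: float = 5.0, factor: float = 1.5,
                  cap: float = 30.0, max_wait: float = 900.0) -> list[float]:
    """Successive wait times, stopping before the cumulative wait exceeds max_wait."""
    waits, delay, elapsed = [], initial, 0.0
    while elapsed + delay <= max_wait:
        waits.append(delay)
        elapsed += delay
        delay = min(delay * factor, cap)  # back off, but never beyond the cap
    return waits
```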

Architecture

Connectors                Formats              Transformations
  │                         │                        │
  ├─ Local filesystem       ├─ PDF ──────┬─ text ──→ PyMuPDF extraction
  │   (register + sync)     │            └─ image ─→ OCR (local or Textract)
  │                         │
  └─ Google Drive           ├─ Images ─────────────→ OCR (local or Textract)
     (planned)              │
                            ├─ Text files ─────────→ Section-aware splitting
                            │
                            ├─ Source code ────────→ Tree-sitter AST splitting
                            │
                            └─ Raw text ───────────→ Direct chunking
                                                         │
                                                  Indexing
                                                    │
                                                    ├─ Sentence-aware chunking
                                                    ├─ Vector embeddings (768-dim)
                                                    └─ LanceDB storage
                                                         │
                                                  Query
                                                    │
                                                    ├─ Semantic search
                                                    └─ Collection filtering
                                                         │
                                                  Interface
                                                    │
                                                    ├─ MCP Server (stdio)
                                                    └─ CLI (typer + rich)

Roadmap

  • Spreadsheets (XLSX, CSV) via tabular serialization
  • Presentations (PPTX) with speaker notes
  • HTML with structure-aware splitting
  • Search filters by content type and file format
  • Google Drive connector

Development

uv run ruff check .
uv run ruff format --check .
uv run mypy src/ tests/
uv run pytest

License

MIT
