
Extract searchable knowledge from any document. Expose it to LLMs via MCP.


quarry-mcp


Index any document. Search with natural language. Works with Claude Code and Claude Desktop.

Quick Start

pip install quarry-mcp
quarry install          # downloads embedding model (~500MB), configures MCP
quarry ingest notes.md  # index a file — no cloud account needed
quarry search "my topic"

That's it. Quarry works locally out of the box.

What It Does

Quarry turns documents into searchable knowledge for LLMs. You feed it files; it chunks and embeds them into a local vector database, then exposes semantic search via MCP tools or a CLI.

Supported formats: PDF, images (PNG, JPG, TIFF, BMP, WebP), text files (TXT, Markdown, LaTeX, DOCX), and source code (30+ languages).

How each format is processed:

| Source | What happens | Result |
|---|---|---|
| PDF (text pages) | Text extraction via PyMuPDF | Prose chunks |
| PDF (image pages) | OCR (local or cloud) | Prose chunks |
| Images | OCR (local or cloud) | Prose chunks |
| Text files | Split by headings / sections / paragraphs | Section chunks |
| Source code | Tree-sitter AST parsing (functions, classes) | Code chunks |
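The routing above can be pictured as a mapping from file extension to processing route. A hypothetical sketch of the idea, not Quarry's internals (the route names and table are illustrative):

```python
from pathlib import Path

# Hypothetical routing table mirroring the format matrix above.
ROUTES = {
    ".pdf": "pymupdf-or-ocr",  # text pages extracted, image pages OCR'd
    ".png": "ocr", ".jpg": "ocr", ".tiff": "ocr", ".bmp": "ocr", ".webp": "ocr",
    ".md": "section-split", ".txt": "section-split", ".tex": "section-split",
    ".py": "tree-sitter", ".rs": "tree-sitter",  # 30+ languages in practice
}

def route(path: str) -> str:
    """Pick a processing route for a file, defaulting to plain text splitting."""
    return ROUTES.get(Path(path).suffix.lower(), "section-split")
```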

Installation

pip install quarry-mcp
quarry install

quarry install creates ~/.quarry/data/, downloads the embedding model, and writes MCP config for Claude Code and Claude Desktop.

Verify with quarry doctor:

  ✓ Python version: 3.13.1
  ✓ Data directory: /Users/you/.quarry/data/lancedb
  ✓ Local OCR: RapidOCR engine OK
  ○ AWS credentials: Not configured (optional — needed for OCR_BACKEND=textract)
  ✓ Embedding model: snowflake-arctic-embed-m-v1.5 cached
  ✓ Core imports: 8 modules OK

Usage

CLI

# Ingest
quarry ingest report.pdf
quarry ingest whiteboard.jpg
quarry ingest src/main.py
quarry ingest report.pdf --overwrite

# Search
quarry search "authentication logic"
quarry search "quarterly revenue" -n 5

# Manage
quarry list
quarry delete report.pdf
quarry collections
quarry delete-collection math

# Directory sync — register a folder, then sync to pick up changes
quarry register /path/to/docs --collection my-docs
quarry sync
quarry registrations
quarry deregister my-docs
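Conceptually, sync is change detection: compare each file's modification time against what was recorded at the last sync, and re-ingest anything new or modified. A minimal illustrative sketch (Quarry's actual registry lives in SQLite; this in-memory version is not its implementation):

```python
import os

def changed_files(directory: str, last_seen: dict[str, float]) -> list[str]:
    """Return files that are new or modified since the recorded mtimes."""
    changed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if last_seen.get(path) != mtime:
                changed.append(path)
                last_seen[path] = mtime  # record for the next sync
    return changed
```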

MCP Server

quarry install configures Claude Code and Claude Desktop automatically. Manual setup:

Claude Code:

claude mcp add quarry -- uvx --from quarry-mcp quarry mcp

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "quarry": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"]
    }
  }
}

Use the absolute path to uvx for Desktop (e.g. /opt/homebrew/bin/uvx). quarry install resolves this automatically.

Available tools:

| Tool | Description |
|---|---|
| search_documents | Semantic search across indexed documents |
| ingest | Ingest a file (PDF, image, text, source code) |
| ingest_text | Index raw text content directly |
| get_documents | List indexed documents with metadata |
| get_page | Retrieve full text for a specific page |
| delete_document | Remove a document and its chunks |
| delete_collection | Remove all documents in a collection |
| list_collections | List collections with document/chunk counts |
| register_directory | Register a directory for sync |
| deregister_directory | Remove a directory registration |
| sync_all_registrations | Sync all registered directories |
| list_registrations | List registered directories |
| status | Database stats: counts, storage size, model info |

Claude Desktop note: Uploaded files live in a sandbox that Quarry cannot access. Use ingest_text with extracted content for uploads. For files on your Mac, provide the local path to ingest.

Configuration

All settings via environment variables:

| Variable | Default | Description |
|---|---|---|
| OCR_BACKEND | local | local (RapidOCR, offline) or textract (AWS) |
| LANCEDB_PATH | ~/.quarry/data/lancedb | Vector database location |
| CHUNK_MAX_CHARS | 1800 | Target max characters per chunk (~450 tokens) |
| CHUNK_OVERLAP_CHARS | 200 | Overlap between consecutive chunks |
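The two chunking parameters behave roughly like a sliding window: each chunk is at most CHUNK_MAX_CHARS long and begins CHUNK_OVERLAP_CHARS before the previous chunk ended. A sketch of the windowing arithmetic only (Quarry's real chunker is sentence-aware; this naive version is not):

```python
def window_chunks(text: str, max_chars: int = 1800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows, overlapping neighbours by `overlap` chars."""
    step = max_chars - overlap  # each new window starts this far into the text
    return [text[i:i + max_chars] for i in range(0, max(len(text) - overlap, 1), step)]
```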

OCR Backends

Quarry ships with two OCR backends:

| Backend | Speed | Quality | Setup |
|---|---|---|---|
| local (default) | ~7-8 s/page | Good for semantic search | None |
| textract | ~2-3 s/page | Excellent character accuracy | AWS credentials + S3 bucket |

The local backend uses RapidOCR (PaddleOCR models via ONNX Runtime, CPU-only, ~214 MB). No cloud account needed.

AWS Textract Setup

Only needed if you want cloud OCR. Set these environment variables:

| Variable | Default | Description |
|---|---|---|
| AWS_ACCESS_KEY_ID | — | AWS access key |
| AWS_SECRET_ACCESS_KEY | — | AWS secret key |
| AWS_DEFAULT_REGION | us-east-1 | AWS region |
| S3_BUCKET | ocr-7f3a1b2e4c5d4e8f9a1b3c5d7e9f2a4b | S3 bucket for Textract uploads |

Your IAM user needs textract:DetectDocumentText, textract:StartDocumentTextDetection, textract:GetDocumentTextDetection, and s3:PutObject/GetObject/DeleteObject on your bucket.
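Those permissions translate into an IAM policy along these lines. This is a sketch, not a policy shipped with Quarry; YOUR_BUCKET is a placeholder for your own bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET/*"
    }
  ]
}
```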

Multiple Indices

Run separate MCP instances with different data directories:

{
  "mcpServers": {
    "legal-docs": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/legal/lancedb" }
    },
    "financial-reports": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/financial/lancedb" }
    }
  }
}

Advanced Configuration

| Variable | Default | Description |
|---|---|---|
| TEXTRACT_POLL_INITIAL | 5.0 | Initial Textract polling interval (seconds) |
| TEXTRACT_POLL_MAX | 30.0 | Max polling interval (1.5x exponential backoff) |
| TEXTRACT_MAX_WAIT | 900 | Max wait for a Textract job (seconds) |
| REGISTRY_PATH | ~/.quarry/data/registry.db | Directory sync SQLite database |
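The three polling variables describe a capped exponential backoff: wait 5 s, then 1.5x longer on each retry, capped at 30 s per interval, giving up once 900 s have elapsed. A sketch of the resulting schedule (a hypothetical helper, not Quarry's code):

```python
def poll_schedule(initial: float = 5.0, factor: float = 1.5,
                  cap: float = 30.0, max_wait: float = 900.0) -> list[float]:
    """Successive wait times, stopping before the cumulative wait exceeds max_wait."""
    waits, delay, elapsed = [], initial, 0.0
    while elapsed + delay <= max_wait:
        waits.append(delay)
        elapsed += delay
        delay = min(delay * factor, cap)  # back off, but never beyond the cap
    return waits
```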

Architecture

Connectors                Formats              Transformations
  │                         │                        │
  ├─ Local filesystem       ├─ PDF ──────┬─ text ──→ PyMuPDF extraction
  │   (register + sync)     │            └─ image ─→ OCR (local or Textract)
  │                         │
  └─ Google Drive           ├─ Images ─────────────→ OCR (local or Textract)
     (planned)              │
                            ├─ Text files ─────────→ Section-aware splitting
                            │
                            ├─ Source code ────────→ Tree-sitter AST splitting
                            │
                            └─ Raw text ───────────→ Direct chunking
                                                         │
                                                  Indexing
                                                    │
                                                    ├─ Sentence-aware chunking
                                                    ├─ Vector embeddings (768-dim)
                                                    └─ LanceDB storage
                                                         │
                                                  Query
                                                    │
                                                    ├─ Semantic search
                                                    └─ Collection filtering
                                                         │
                                                  Interface
                                                    │
                                                    ├─ MCP Server (stdio)
                                                    └─ CLI (typer + rich)

Roadmap

  • Spreadsheets (XLSX, CSV) via tabular serialization
  • Presentations (PPTX) with speaker notes
  • HTML with structure-aware splitting
  • Search filters by content type and file format
  • Google Drive connector

Development

uv run ruff check .
uv run ruff format --check .
uv run mypy src/ tests/
uv run pytest

License

MIT
