
quarry-mcp


Extract searchable knowledge from any document. Expose it to LLMs via MCP.

Quarry ingests PDFs, images, text files, source code, and raw text into a local vector database, then serves semantic search over that content through the Model Context Protocol. Point Claude Code or Claude Desktop at your documents and ask questions.

Why Quarry?

If your documents are already machine-readable text (TXT, Markdown, DOCX), mcp-local-rag is a solid zero-config option — one npx command and you're searching.

Quarry exists for documents that aren't text yet:

  • Scanned PDFs — Board packs, legal filings, archival records. No embedded text, just page images. Quarry classifies each page, routes image pages through AWS Textract OCR, and extracts text from the rest.
  • Mixed-format PDFs — Some pages are text, some are scans. Quarry handles both in a single pipeline.
  • Images — Photos of whiteboards, receipts, handwritten notes. PNG, JPG, TIFF (multi-page), BMP, WebP.
  • Text files — TXT, Markdown, LaTeX, DOCX. No OCR needed, straight to chunking.
  • Source code — Python, JavaScript, TypeScript, Rust, Go, Java, C/C++, and 20+ more languages. Tree-sitter splits code into semantic sections (functions, classes, imports).
  • Raw text — Paste content directly via ingest_text. Use this from Claude Desktop for uploaded files.

Quarry also preserves full page text alongside chunks, so LLMs can reference surrounding context when a search hit lands mid-page.

Features

  • PDF ingestion with automatic text/image classification per page
  • Image ingestion — PNG, JPG, TIFF (multi-page), BMP, WebP via Textract OCR
  • Text file ingestion — TXT, Markdown, LaTeX, DOCX
  • Source code ingestion — 30+ languages via tree-sitter AST splitting (Python, JS/TS, Rust, Go, Java, C/C++, Ruby, Swift, Kotlin, and more)
  • Raw text ingestion — ingest content directly without a file on disk
  • OCR via AWS Textract for scanned and image-based documents
  • Text extraction via PyMuPDF for text-based PDF pages
  • Sentence-aware chunking with configurable overlap
  • Local vector embeddings using snowflake-arctic-embed-m-v1.5 (768-dim)
  • LanceDB for fast, local vector storage (no external database)
  • Directory registration and incremental sync — register directories, detect new/changed/deleted files via mtime+size, re-index in parallel
  • MCP server with 13 tools: search_documents, ingest, ingest_text, get_documents, get_page, delete_document, delete_collection, list_collections, register_directory, deregister_directory, sync_all_registrations, list_registrations, status
  • CLI for ingestion, search, document management, directory registration, and sync
  • Full page text preserved alongside chunks for LLM reference
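To make "sentence-aware chunking with configurable overlap" concrete, here is a simplified sketch of the idea — greedy sentence packing with a character-overlap tail. This is an illustration only, not Quarry's actual chunker (which is more robust than a regex sentence split):

```python
import re

def chunk_sentences(text: str, max_chars: int = 1800, overlap_chars: int = 200) -> list[str]:
    """Greedy sentence packing with character overlap (illustrative sketch)."""
    # Naive sentence split on terminal punctuation; the real pipeline is smarter.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            # Seed the next chunk with the tail of this one so adjacent
            # chunks share `overlap_chars` characters of context.
            current = current[-overlap_chars:] if overlap_chars > 0 else ""
        current = (current + " " + sentence).strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

The defaults mirror the documented settings (1800 chars ≈ 450 tokens, 200-char overlap); overlap keeps a search hit from losing the sentence that introduced it at a chunk boundary.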

Quick Start

pip install quarry-mcp

# Set up data directory, download embedding model, configure MCP clients
quarry install

# Check everything is working
quarry doctor

# Configure AWS credentials (required for OCR)
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_DEFAULT_REGION=us-east-1

# Ingest a PDF
quarry ingest /path/to/document.pdf

# Search
quarry search "revenue growth in 2024"

# List indexed documents
quarry list

Installation

pip install quarry-mcp
quarry install

quarry install creates the data directory (~/.quarry/data/lancedb/), downloads the embedding model (~500MB), and configures MCP for Claude Code and Claude Desktop.

Run quarry doctor to verify your environment:

  ✓ Python version: 3.13.1
  ✓ Data directory: /Users/you/.quarry/data/lancedb
  ✓ AWS credentials: AKIA****YMUH (via shared-credentials-file)
  ✓ Embedding model: snowflake-arctic-embed-m-v1.5 cached
  ✓ Core imports: 5 modules OK

AWS Setup

Quarry uses AWS Textract for OCR. Your IAM user needs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-bucket/*"
    }
  ]
}

Set your S3 bucket:

export S3_BUCKET=your-bucket-name
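The policy above covers both Textract code paths: DetectDocumentText is the synchronous API for single images, while StartDocumentTextDetection / GetDocumentTextDetection run asynchronous jobs that read the document from S3 (hence the bucket permissions). A hypothetical routing helper shows the split — the function name and rules here are illustrative, mirroring the architecture diagram below, not Quarry's actual code:

```python
from pathlib import Path

# Illustrative routing rules: single-page images go through the sync
# DetectDocumentText call; PDFs and multi-page TIFFs are uploaded to S3
# and processed via the async StartDocumentTextDetection job API.
SYNC_FORMATS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}
ASYNC_FORMATS = {".pdf", ".tif", ".tiff"}

def needs_async_ocr(path: str) -> bool:
    """Return True if the file must go through the async Textract job API."""
    suffix = Path(path).suffix.lower()
    if suffix in ASYNC_FORMATS:
        return True
    if suffix in SYNC_FORMATS:
        return False
    raise ValueError(f"unsupported OCR format: {suffix}")
```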

Usage

MCP Server

quarry install configures both Claude Code and Claude Desktop automatically. To configure manually:

Claude Code:

claude mcp add quarry -- uvx --from quarry-mcp quarry mcp

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "quarry": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"]
    }
  }
}

Use the absolute path to uvx for Claude Desktop (e.g. /opt/homebrew/bin/uvx), since Desktop runs MCP servers with a limited PATH. quarry install resolves the path automatically.

MCP Tools

  • search_documents — Semantic search across all indexed documents
  • ingest — OCR and index a file (PDF, image, TXT, MD, TEX, DOCX, source code)
  • ingest_text — Index raw text content directly (for uploads or pasted text)
  • get_documents — List all indexed documents with metadata
  • get_page — Retrieve full text for a specific page
  • delete_document — Remove a document and all its chunks
  • delete_collection — Remove all documents in a collection
  • list_collections — List all collections with document and chunk counts
  • register_directory — Register a directory for incremental sync
  • deregister_directory — Remove a directory registration
  • sync_all_registrations — Sync all registered directories (ingest new/changed, remove deleted)
  • list_registrations — List all registered directories
  • status — Database stats: document/chunk counts, registrations, storage size, model info

Claude Desktop note: Uploaded files live in a container that Quarry cannot access. For uploaded files, use ingest_text with the extracted content. For files on your Mac, provide the local path to ingest.

CLI

# Ingest documents
quarry ingest report.pdf
quarry ingest whiteboard.jpg
quarry ingest notes.md
quarry ingest report.pdf --overwrite

# Search
quarry search "board governance structure"
quarry search "quarterly revenue" -n 5

# Manage documents
quarry list
quarry delete report.pdf
quarry collections
quarry delete-collection math

# Register directories for incremental sync
quarry register /path/to/courses/ml-101 --collection ml-101
quarry register /path/to/courses/stats-200
quarry registrations
quarry sync
quarry sync --workers 8
quarry deregister ml-101

# Environment
quarry doctor
quarry install

Multiple Indices

Run separate MCP server instances with different data directories:

{
  "mcpServers": {
    "legal-docs": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/legal/lancedb" }
    },
    "financial-reports": {
      "command": "/path/to/uvx",
      "args": ["--from", "quarry-mcp", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/financial/lancedb" }
    }
  }
}

Configuration

All settings are configurable via environment variables:

  • AWS_ACCESS_KEY_ID — AWS access key (no default)
  • AWS_SECRET_ACCESS_KEY — AWS secret key (no default)
  • AWS_DEFAULT_REGION — AWS region (default: us-east-1)
  • S3_BUCKET — S3 bucket for Textract uploads (default: ocr-7f3a1b2e4c5d4e8f9a1b3c5d7e9f2a4b)
  • LANCEDB_PATH — Path to LanceDB storage (default: ~/.quarry/data/lancedb)
  • EMBEDDING_MODEL — HuggingFace embedding model (default: Snowflake/snowflake-arctic-embed-m-v1.5)
  • CHUNK_MAX_CHARS — Target max characters per chunk, ~450 tokens (default: 1800)
  • CHUNK_OVERLAP_CHARS — Character overlap between consecutive chunks (default: 200)
  • TEXTRACT_POLL_INITIAL — Initial seconds between Textract status checks (default: 5.0)
  • TEXTRACT_POLL_MAX — Maximum polling interval, reached via 1.5x exponential backoff (default: 30.0)
  • TEXTRACT_MAX_WAIT — Maximum seconds to wait for a Textract job (default: 900)
  • REGISTRY_PATH — Path to the directory-registration SQLite database (default: ~/.quarry/data/registry.db)
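These settings follow the usual override-with-default pattern; a minimal sketch of how an integer setting like CHUNK_MAX_CHARS could be resolved (an illustrative helper, not Quarry's internals):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default

# Environment variables override the documented defaults.
chunk_max_chars = env_int("CHUNK_MAX_CHARS", 1800)
chunk_overlap_chars = env_int("CHUNK_OVERLAP_CHARS", 200)
```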

Architecture

Input
  │
  ├─ PDF ─────────┬─ Text pages ──→ PyMuPDF extraction
  │               └─ Image pages ─→ S3 → Textract async OCR → S3 cleanup
  │
  ├─ Images ──────→ Textract sync OCR (BMP/WebP converted to PNG)
  │                 TIFF multi-page → S3 → Textract async OCR
  │
  ├─ Text files ──→ Direct text extraction (TXT, MD, TEX, DOCX)
  │
  ├─ Source code ─→ Tree-sitter AST splitting (30+ languages)
  │
  └─ Raw text ────→ ingest_text (from uploads, clipboard, etc.)
                          │
                    Sentence-aware chunking (with overlap)
                          │
                    snowflake-arctic-embed-m-v1.5
                          │
                    LanceDB (local vector store)
                          │
                 ┌────────┴────────┐
                 │                 │
             MCP Server         CLI
         (stdio transport)   (typer + rich)

Incremental Sync
  │
  Directory Registry (SQLite, WAL mode)
  │
  ├─ register → track directory + collection mapping
  ├─ sync ────→ walk directory, compare mtime+size
  │              ├─ new/changed → ThreadPoolExecutor → ingest pipeline
  │              ├─ unchanged  → skip
  │              └─ deleted    → remove from LanceDB + registry
  └─ deregister → remove tracking + optionally clean data
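The mtime+size comparison above can be sketched in a few lines — a simplified stand-in for the registry logic, with illustrative names:

```python
import os

def file_signature(path: str) -> tuple[float, int]:
    """(mtime, size) pair used to detect changes without hashing file contents."""
    st = os.stat(path)
    return (st.st_mtime, st.st_size)

def diff_directory(root: str, known: dict[str, tuple[float, int]]):
    """Classify files under root as new, changed, unchanged, or deleted."""
    new, changed, unchanged = [], [], []
    seen = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            seen.add(path)
            sig = file_signature(path)
            if path not in known:
                new.append(path)           # never seen: ingest
            elif known[path] != sig:
                changed.append(path)       # mtime or size moved: re-ingest
            else:
                unchanged.append(path)     # skip
    deleted = [p for p in known if p not in seen]  # remove from index
    return new, changed, unchanged, deleted
```

Comparing (mtime, size) trades a little precision for speed: a sync pass is one directory walk plus a stat per file, with no content hashing.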

Each chunk stores both its text fragment and the full page raw text, so LLMs can reference surrounding context when a search result is relevant.
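A hypothetical record shape for such a row — field names here are illustrative, not Quarry's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    document: str        # source file name
    page: int            # page number the chunk came from
    text: str            # the chunk itself (what gets embedded and matched)
    page_text: str       # full raw text of the page, for surrounding context
    vector: list[float]  # 768-dim embedding of `text`

record = ChunkRecord(
    document="report.pdf",
    page=3,
    text="Revenue grew 12% year over year...",
    page_text="(full text of page 3)",
    vector=[0.0] * 768,
)
```

Storing page_text denormalized alongside each chunk costs disk space but lets an LLM pull the whole page in one lookup after a hit.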

Development

# Run all quality gates
uv run ruff check .
uv run ruff format --check .
uv run mypy src/quarry tests
uv run pytest

The project enforces strict mypy, comprehensive ruff rules, and requires all tests to pass before every commit.

License

MIT
