quarry-mcp

Extract searchable knowledge from any document. Expose it to LLMs via MCP.

Quarry ingests documents into a local vector database, then serves semantic search over that content through the Model Context Protocol (MCP). PDFs are supported today; text files, images, and audio are on the roadmap. Point Claude Code or Claude Desktop at your documents and ask questions.

Why Quarry?

If your documents are already machine-readable text (TXT, Markdown, DOCX), mcp-local-rag is a solid zero-config option — one npx command and you're searching.

Quarry exists for documents that aren't text yet:

Scanned PDFs — Board packs, legal filings, archival records. No embedded text, just page images. Quarry classifies each page, routes image pages through AWS Textract OCR, and extracts text from the rest.
Mixed-format PDFs — Some pages are text, some are scans. Quarry handles both in a single pipeline.
Images — Photos of whiteboards, receipts, handwritten notes. (Planned: Epic 3)
Audio — Meeting recordings, interviews, podcasts. (Planned: Epic 4)
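
The per-page routing described for scanned and mixed-format PDFs can be pictured as a threshold on how much embedded text a page carries. This is a minimal sketch, not Quarry's actual classifier: `PageKind`, `classify_page`, `route_pages`, and the 50-character threshold are all assumptions made for illustration.

```python
from enum import Enum


class PageKind(Enum):
    TEXT = "text"    # page has an embedded text layer: extract it directly
    IMAGE = "image"  # page is a scan or photo: send it to OCR


def classify_page(embedded_text: str, min_chars: int = 50) -> PageKind:
    """Hypothetical heuristic: a page with almost no embedded text is a scan."""
    visible = "".join(embedded_text.split())  # drop all whitespace before counting
    return PageKind.TEXT if len(visible) >= min_chars else PageKind.IMAGE


def route_pages(pages: list[str]) -> dict[PageKind, list[int]]:
    """Group 1-based page numbers by the extraction path they would take."""
    routes: dict[PageKind, list[int]] = {PageKind.TEXT: [], PageKind.IMAGE: []}
    for number, text in enumerate(pages, start=1):
        routes[classify_page(text)].append(number)
    return routes
```

In a mixed-format PDF, the `IMAGE` group would go to Textract and the `TEXT` group to direct extraction, with both results merged back in page order.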

Quarry also preserves full page text alongside chunks, so LLMs can reference surrounding context when a search hit lands mid-page.

Features

PDF ingestion with automatic text/image classification per page
OCR via AWS Textract for scanned and image-based documents
Text extraction via PyMuPDF for text-based PDF pages
Sentence-aware chunking with configurable overlap
Local vector embeddings using snowflake-arctic-embed-m-v1.5 (768-dim)
LanceDB for fast, local vector storage (no external database)
MCP server with 4 tools: search_documents, ingest, get_documents, get_page
CLI for ingestion, search, and document management
Full page text preserved alongside chunks for LLM reference

Quick Start

# Clone and install
git clone https://github.com/jmf-pobox/quarry-mcp.git
cd quarry-mcp
uv sync

# Configure AWS credentials (required for OCR)
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_DEFAULT_REGION=us-east-1

# Ingest a PDF
uv run quarry ingest /path/to/document.pdf

# Search
uv run quarry search "revenue growth in 2024"

# List indexed documents
uv run quarry list

Installation

Requires Python 3.13+ and uv.

uv sync

For development:

uv pip install -e ".[dev]"

AWS Setup

Quarry uses AWS Textract for OCR and stages uploads in S3. Your IAM user needs the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-bucket/*"
    }
  ]
}

Set your S3 bucket via environment variable or .env file:

export S3_BUCKET=your-bucket-name

Usage

MCP Server (Claude Code)

Register the server with Claude Code:

claude mcp add quarry -- uv run --directory /path/to/quarry-mcp python -m quarry mcp

After restarting Claude Code, four tools are available:

Tool	Description
`search_documents`	Semantic search across all indexed documents
`ingest`	OCR and index a new PDF
`get_documents`	List all indexed documents with metadata
`get_page`	Retrieve the full text of a specific page
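
To give a feel for what semantic search does with embeddings, here is a toy ranking over in-memory vectors. In Quarry the search itself is handled by LanceDB; the README does not state the distance metric, so cosine similarity is an assumption here, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], chunks: dict[str, list[float]], k: int = 5
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

Real embeddings from snowflake-arctic-embed-m-v1.5 have 768 dimensions rather than the two or three you would use to try this out, but the ranking step has the same shape.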

MCP Server (Claude Desktop)

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (the macOS config location):

{
  "mcpServers": {
    "quarry": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/quarry-mcp", "python", "-m", "quarry", "mcp"],
      "env": {
        "AWS_ACCESS_KEY_ID": "your-key",
        "AWS_SECRET_ACCESS_KEY": "your-secret",
        "AWS_DEFAULT_REGION": "us-east-1",
        "S3_BUCKET": "your-bucket"
      }
    }
  }
}

CLI

# Ingest a document
uv run quarry ingest report.pdf

# Re-ingest (overwrite existing)
uv run quarry ingest report.pdf --overwrite

# Search across all documents
uv run quarry search "board governance structure"

# Search with result limit
uv run quarry search "quarterly revenue" -n 5

# List indexed documents
uv run quarry list

Multiple Indices

Run separate MCP server instances with different data directories:

{
  "mcpServers": {
    "legal-docs": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/quarry-mcp", "python", "-m", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/legal/lancedb" }
    },
    "financial-reports": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/quarry-mcp", "python", "-m", "quarry", "mcp"],
      "env": { "LANCEDB_PATH": "/data/financial/lancedb" }
    }
  }
}
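
If you maintain more than a couple of indices, it may be easier to generate this block than to edit it by hand. The sketch below builds one server entry per data directory. `build_mcp_config` is a hypothetical helper, and the default module name `quarry` is an assumption taken from the Claude Code command; adjust it if your checkout differs.

```python
import json


def build_mcp_config(repo_dir: str, indices: dict[str, str], module: str = "quarry") -> dict:
    """Build an mcpServers mapping with one server instance per LanceDB path."""
    servers = {}
    for name, lancedb_path in indices.items():
        servers[name] = {
            "command": "uv",
            "args": ["run", "--directory", repo_dir, "python", "-m", module, "mcp"],
            # Each instance gets its own storage directory, which keeps indices isolated.
            "env": {"LANCEDB_PATH": lancedb_path},
        }
    return {"mcpServers": servers}


def render_mcp_config(repo_dir: str, indices: dict[str, str]) -> str:
    """Serialize the config as indented JSON, ready to paste into a config file."""
    return json.dumps(build_mcp_config(repo_dir, indices), indent=2)
```

Calling `render_mcp_config("/path/to/quarry-mcp", {"legal-docs": "/data/legal/lancedb"})` returns the JSON text for a single-index setup. AWS credentials are left out here and would still need to be added to each `env` block.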

Configuration

All settings are configurable via environment variables or a .env file:

Variable	Default	Description
`AWS_ACCESS_KEY_ID`	(none)	AWS access key
`AWS_SECRET_ACCESS_KEY`	(none)	AWS secret key
`AWS_DEFAULT_REGION`	`us-east-1`	AWS region
`S3_BUCKET`	`ocr-7f3a1b2e4c5d4e8f9a1b3c5d7e9f2a4b`	S3 bucket for Textract uploads
`LANCEDB_PATH`	`./data/lancedb`	Path to LanceDB storage
`EMBEDDING_MODEL`	`Snowflake/snowflake-arctic-embed-m-v1.5`	HuggingFace embedding model
`CHUNK_MAX_CHARS`	`1800`	Target max characters per chunk (~450 tokens)
`CHUNK_OVERLAP_CHARS`	`200`	Character overlap between consecutive chunks
`TEXTRACT_POLL_INTERVAL`	`5`	Seconds between Textract status checks
`TEXTRACT_MAX_WAIT`	`900`	Maximum seconds to wait for Textract job
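
As an illustration of how these variables and defaults fit together, the following sketch reads a subset of them from the environment. It is not Quarry's own settings code: `Settings` and `load_settings` are hypothetical names, .env parsing is omitted, and the overlap check is an added assumption rather than documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    lancedb_path: str
    embedding_model: str
    chunk_max_chars: int
    chunk_overlap_chars: int
    textract_poll_interval: int
    textract_max_wait: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (default: os.environ), using the table's defaults."""
    source = os.environ if env is None else env
    settings = Settings(
        lancedb_path=source.get("LANCEDB_PATH", "./data/lancedb"),
        embedding_model=source.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-m-v1.5"),
        chunk_max_chars=int(source.get("CHUNK_MAX_CHARS", "1800")),
        chunk_overlap_chars=int(source.get("CHUNK_OVERLAP_CHARS", "200")),
        textract_poll_interval=int(source.get("TEXTRACT_POLL_INTERVAL", "5")),
        textract_max_wait=int(source.get("TEXTRACT_MAX_WAIT", "900")),
    )
    # Assumed sanity check: an overlap as large as the chunk would never advance.
    if settings.chunk_overlap_chars >= settings.chunk_max_chars:
        raise ValueError("CHUNK_OVERLAP_CHARS must be smaller than CHUNK_MAX_CHARS")
    return settings
```

Passing an explicit mapping, for example `load_settings({"CHUNK_MAX_CHARS": "1000"})`, makes it easy to try out overrides without touching the real environment.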

Architecture

Input (PDF)
  │
  ├─ Text pages ──→ PyMuPDF text extraction
  │                        │
  └─ Image pages ─→ S3 upload → Textract async OCR → parse → S3 cleanup
                           │
                     Page contents
                           │
                     Sentence-aware chunking (with overlap)
                           │
                     snowflake-arctic-embed-m-v1.5
                           │
                     LanceDB (local vector store)
                           │
                  ┌────────┴────────┐
                  │                 │
              MCP Server         CLI
          (stdio transport)   (typer + rich)

Each chunk stores both its text fragment and the raw text of the full page, so an LLM can read the surrounding context of any search hit.
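
The chunking stage can be sketched roughly as follows. This is a simplified stand-in, not Quarry's implementation: the regex sentence splitter is naive, and `split_sentences`, `chunk_page`, and the `text` / `page_raw_text` keys are hypothetical names. The defaults mirror `CHUNK_MAX_CHARS` and `CHUNK_OVERLAP_CHARS`.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_page(
    page_text: str, max_chars: int = 1800, overlap_chars: int = 200
) -> list[dict[str, str]]:
    """Pack whole sentences into chunks, carrying a character tail into the next chunk."""
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(page_text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # Seed the next chunk with the last overlap_chars characters of this one.
            tail = current[-overlap_chars:] if overlap_chars > 0 else ""
            current = f"{tail} {sentence}".strip()
        else:
            # max_chars is a target: a single very long sentence still becomes one chunk.
            current = candidate
    if current:
        chunks.append(current)
    # Every chunk keeps the full page text alongside its own fragment.
    return [{"text": chunk, "page_raw_text": page_text} for chunk in chunks]
```

Storing the page text on every chunk duplicates data, but it means a search hit can return its surrounding context without a second lookup.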

Roadmap

Epic 1: PDF Pipeline ✓

Core ingestion and search for PDF documents.

PDF page analysis (text vs image classification)
Text extraction via PyMuPDF
OCR via AWS Textract (async API with polling)
Sentence-aware chunking with configurable overlap
Local vector embeddings (snowflake-arctic-embed-m-v1.5)
LanceDB vector storage with PyArrow schema
MCP server with search, ingest, list, and page retrieval
CLI with progress display
Test suite (62 tests across 9 modules)
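
The "async API with polling" item, together with the `TEXTRACT_POLL_INTERVAL` and `TEXTRACT_MAX_WAIT` settings, suggests a loop along these lines. This is a sketch under assumptions: `wait_for_job` is a hypothetical helper, and `get_status` stands in for whatever call reads the Textract job status (which reports `IN_PROGRESS` until the job finishes).

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    max_wait: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it leaves IN_PROGRESS or max_wait seconds have been slept."""
    waited = 0.0
    while True:
        status = get_status()
        if status != "IN_PROGRESS":
            return status  # for example SUCCEEDED or FAILED
        if waited >= max_wait:
            raise TimeoutError(f"job still running after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval
```

Injecting `sleep` keeps the loop testable without real delays; in normal use the default `time.sleep` applies.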

Epic 2: Text Document Ingestion

Direct ingestion of text-based formats without OCR.

Plain text files (.txt)
Markdown (.md)
LaTeX (.tex)
DOCX
String ingestion (raw text/markdown/HTML without a file)
Configurable page/section boundary detection

Epic 3: Image Format Support

OCR for standalone image files (not wrapped in PDF).

Common formats: PNG, JPG, TIFF, BMP, WebP
Single-image and batch ingestion
Image preprocessing for OCR quality (deskew, contrast)

Epic 4: Audio Transcription

Speech-to-text ingestion for audio content.

Audio format support: MP3, WAV, M4A, FLAC
AWS Transcribe or Whisper integration
Speaker diarization
Timestamped chunks for source reference

Epic 5: Ingestion Quality

Post-processing to improve extracted text quality.

LLM-based OCR error correction
Chunk quality scoring and filtering
Duplicate and near-duplicate detection
Table and figure extraction

Epic 6: Multi-Index Management

First-class support for organizing documents into collections.

Named indices with isolated storage
Per-index configuration (embedding model, chunk size)
Cross-index search
Index metadata and statistics

Standalone Tasks

Expose delete_document via MCP and CLI
Add status tool to MCP server (document/chunk counts, DB size, model info)

Development

# Run all quality gates
uv run ruff check .
uv run ruff format --check .
uv run mypy src/quarry tests
uv run pytest

# Auto-format
uv run ruff format .

The project enforces strict mypy and comprehensive ruff rules, and requires all tests to pass before every commit.

License

MIT
