
Smart document-to-Markdown conversion for AI agents



shuck-file

Feed any document to your AI agent in one command.

shuck-file converts documents to clean Markdown for AI agents and LLMs. Small files are output directly; large files return a document map with section summaries, token counts, and actionable next steps, so agents pull only what they need.

Why shuck-file?

AI agents can't read binary documents. They need a bridge that's context-aware:

  • Small file → shuck report.docx → full Markdown on stdout
  • Large file → shuck report.docx → document map with sections and extraction options
  • Targeted extraction → shuck report.docx --sections s1,s3 → only what you need
  • Search → shuck report.docx --grep "revenue" → find without reading everything

Supported Formats

| Format | Extension | Library | What's Preserved |
|--------|-----------|---------|------------------|
| Word | .docx | python-docx | Headings, bold/italic, lists, tables |
| PDF | .pdf | pdfplumber | Text content, page breaks |
| Excel | .xlsx | openpyxl | All sheets as Markdown tables |
| PowerPoint | .pptx | python-pptx | Titles, text, tables, speaker notes |
| CSV | .csv | stdlib | All rows/columns as a table |
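
To make the CSV row concrete, here is a minimal sketch of turning CSV text into a Markdown table with only the standard library. The function name `csv_to_markdown` and the pipe-escaping rule are assumptions for illustration; this is not shuck-file's actual extractor.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown pipe table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row):
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [cell.replace("|", "\\|").strip() for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "|" + "|".join(["---"] * width) + "|"]
    lines.extend(render(row) for row in body)
    return "\n".join(lines)


print(csv_to_markdown("region,revenue\nEMEA,1200\nAPAC,950\n"))
```

Padding short rows keeps every line at the same column count, which most Markdown renderers need before they will draw a table.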

Installation

Via pip (recommended)

pip install shuck-file

This installs the shuck CLI command and the MCP server.

From source

git clone https://github.com/Shan-Zhu/shuck-file.git
cd shuck-file
pip install -e .

Quick Start

# Convert a document
shuck report.docx

# Force full output (bypass map mode)
shuck large-report.pdf --all

# Search within a document
shuck report.pdf --grep "revenue"

Usage

Auto-Routing (default)

Small files are output directly; large files return a document map.

# Small file → direct Markdown output
shuck document.pdf

# Large file → document map with sections table + next steps
shuck large-report.pdf
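
The routing decision above can be pictured as a size check on the converted Markdown. The sketch below is illustrative only: the threshold, the four-characters-per-token estimate, and the names `choose_route` and `Route` are assumptions, since the README does not say how shuck counts tokens or where it draws the line.

```python
from dataclasses import dataclass

# Hypothetical values: the real threshold and token counter are not documented here.
MAP_THRESHOLD_TOKENS = 4000
CHARS_PER_TOKEN = 4


def estimate_tokens(markdown: str) -> int:
    """Rough token estimate: about four characters per token."""
    return max(1, len(markdown) // CHARS_PER_TOKEN) if markdown else 0


@dataclass
class Route:
    mode: str    # "full" or "map"
    tokens: int


def choose_route(markdown: str, force_full: bool = False, force_map: bool = False) -> Route:
    """Pick full output or map mode; the force flags override the size check."""
    tokens = estimate_tokens(markdown)
    if force_full:   # comparable in spirit to --all / pull
        return Route("full", tokens)
    if force_map:    # comparable in spirit to probe
        return Route("map", tokens)
    mode = "full" if tokens <= MAP_THRESHOLD_TOKENS else "map"
    return Route(mode, tokens)


print(choose_route("x" * 400))
print(choose_route("x" * 40000))
```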

Extraction Options

# Force full output (bypass map mode)
shuck report.pdf --all

# Extract specific sections
shuck report.pdf --sections s1,s3

# Tables only
shuck report.pdf --tables-only

# Search within document
shuck report.pdf --grep "revenue"

# Token budget (smart compression)
shuck report.pdf --budget 4000

# Combinations work
shuck report.pdf --sections s2,s3 --budget 2000
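
When an agent harness drives the CLI from Python, it can help to assemble these flags in one place. The helper below is a sketch that uses only the flags shown above; the name `build_shuck_command` and its keyword arguments are hypothetical and not part of shuck-file.

```python
import shlex
from typing import List, Optional, Sequence


def build_shuck_command(
    path: str,
    sections: Optional[Sequence[str]] = None,
    grep: Optional[str] = None,
    budget: Optional[int] = None,
    full: bool = False,
    tables_only: bool = False,
) -> List[str]:
    """Build an argv list for the shuck CLI from keyword options."""
    argv = ["shuck", path]
    if full:
        argv.append("--all")
    if sections:
        argv += ["--sections", ",".join(sections)]
    if tables_only:
        argv.append("--tables-only")
    if grep is not None:
        argv += ["--grep", grep]
    if budget is not None:
        argv += ["--budget", str(budget)]
    return argv


cmd = build_shuck_command("report.pdf", sections=["s2", "s3"], budget=2000)
print(shlex.join(cmd))  # shuck report.pdf --sections s2,s3 --budget 2000
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to read the Markdown from stdout. Using an argv list rather than a shell string avoids quoting problems with search terms.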

Excel/CSV Specific

# Column headers and types
shuck data.xlsx --schema-only

# Headers + first N rows
shuck data.xlsx --sample 5
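
As a rough picture of what "column headers and types" plus a row sample might involve, the following sketch infers a simple schema from CSV text. The type names (`integer`, `number`, `text`) and both function names are assumptions; shuck's real schema output may look quite different.

```python
import csv
import io


def infer_type(values):
    """Classify a column as integer, number, or text from its non-empty cells."""
    seen = [v.strip() for v in values if v.strip()]
    if not seen:
        return "empty"
    for name, cast in (("integer", int), ("number", float)):
        try:
            for v in seen:
                cast(v)
            return name
        except ValueError:
            continue
    return "text"


def csv_schema(text: str, sample: int = 0):
    """Return (schema, first `sample` data rows) for CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], []
    header, body = rows[0], rows[1:]
    schema = []
    for i, name in enumerate(header):
        column = [row[i] for row in body if i < len(row)]
        schema.append((name, infer_type(column)))
    return schema, body[:sample]


schema, rows = csv_schema("id,price,name\n1,9.5,apple\n2,3,pear\n", sample=1)
print(schema)
print(rows)
```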

Power User Subcommands

# Force map mode (even on small files)
shuck probe document.docx

# Force full extraction (alias for --all)
shuck pull document.docx

Output Control

# Write to file
shuck document.pdf -o output.md

# Write to directory (auto-named)
shuck document.pdf -d ./converted/

# Skip YAML frontmatter
shuck document.pdf --no-frontmatter

# List supported formats
shuck --formats

Map Mode Output

When a file is large, shuck returns a document map:

# Document Map: quarterly-report.pdf

**6 pages | ~12,400 tokens | 6 sections**

## Sections

| # | Title | Type | Tokens | Density |
|---|-------|------|--------|---------|
| s1 | Executive Summary | narrative | 450 | high |
| s2 | Q3 Financial Results | mixed | 2,800 | high |
| s3 | Revenue Breakdown | tabular | 3,200 | high |
| ...

## Next Steps

- `shuck quarterly-report.pdf --all` -- full document (~12,400 tokens)
- `shuck quarterly-report.pdf --sections s1,s2` -- high-density (~3,250 tokens)
- `shuck quarterly-report.pdf --grep "..."` -- search for keywords
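
An agent can read the sections table and decide what to request next. The sketch below parses rows in the format shown above and greedily picks sections that fit a token budget. The regular expression and the function names are assumptions based solely on this sample output.

```python
import re

# Matches rows such as: | s2 | Q3 Financial Results | mixed | 2,800 | high |
ROW = re.compile(
    r"^\|\s*(s\d+)\s*\|\s*(.*?)\s*\|\s*(\w+)\s*\|\s*([\d,]+)\s*\|\s*(\w+)\s*\|\s*$"
)


def parse_sections(map_text: str):
    """Extract section rows from a document map as dictionaries."""
    sections = []
    for line in map_text.splitlines():
        m = ROW.match(line.strip())
        if m:
            sid, title, kind, tokens, density = m.groups()
            sections.append({
                "id": sid, "title": title, "type": kind,
                "tokens": int(tokens.replace(",", "")), "density": density,
            })
    return sections


def pick_within_budget(sections, budget: int):
    """Greedily keep sections, in document order, while they fit the budget."""
    chosen, used = [], 0
    for s in sections:
        if used + s["tokens"] <= budget:
            chosen.append(s["id"])
            used += s["tokens"]
    return chosen, used


SAMPLE = """\
| # | Title | Type | Tokens | Density |
|---|-------|------|--------|---------|
| s1 | Executive Summary | narrative | 450 | high |
| s2 | Q3 Financial Results | mixed | 2,800 | high |
| s3 | Revenue Breakdown | tabular | 3,200 | high |
"""

chosen, used = pick_within_budget(parse_sections(SAMPLE), budget=4000)
print(",".join(chosen), used)  # s1,s2 3250
```

The chosen ids can be joined with commas and passed to `--sections`.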

MCP Server

shuck-file includes an MCP (Model Context Protocol) server, making it available to any MCP-compatible AI tool.

Claude Code

claude mcp add shuck-file -- shuck-file

Or add to your project's .mcp.json:

{
  "mcpServers": {
    "shuck-file": {
      "command": "shuck-file",
      "args": []
    }
  }
}

Cursor

Add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "shuck-file": {
      "command": "shuck-file",
      "args": []
    }
  }
}

Windsurf

Add to your MCP configuration:

{
  "mcpServers": {
    "shuck-file": {
      "command": "shuck-file",
      "args": []
    }
  }
}
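
All three clients use the same `mcpServers` entry, so adding it to an existing configuration can be scripted. This is a minimal sketch: the helper name `add_shuck_server` and the `other-tool` entry are made up for illustration, and reading or writing the actual config file is left out.

```python
import json


def add_shuck_server(config: dict) -> dict:
    """Return a copy of an MCP config with the shuck-file server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["shuck-file"] = {"command": "shuck-file", "args": []}
    merged["mcpServers"] = servers
    return merged


# "other-tool" is a placeholder for whatever servers are already configured.
existing = {"mcpServers": {"other-tool": {"command": "other-tool", "args": []}}}
print(json.dumps(add_shuck_server(existing), indent=2))
```

Copying the dictionaries before editing leaves the original configuration untouched, which makes it easier to diff before writing anything to disk.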

Any MCP Client

shuck-file registers as an MCP server via the mcp.servers entry point. Tools exposed:

  • shuck: Convert a document to Markdown with all options (mode, sections, grep, budget, etc.)
  • list_formats: List supported document formats

Claude Code Plugin

Install as a Claude Code plugin for the /shuck skill:

claude plugin add /path/to/shuck-file

Architecture

src/shuck_file/
├── cli.py                 # CLI entrypoint
├── server.py              # MCP Server (FastMCP)
├── core/
│   ├── router.py          # Auto-routing logic
│   ├── segmenter.py       # Document segmentation
│   ├── mapper.py          # Map mode renderer
│   ├── budget.py          # Smart compression
│   ├── grep.py            # In-document search
│   ├── frontmatter.py     # YAML frontmatter
│   └── models.py          # Data models
├── extractors/
│   ├── base.py            # Base extractor ABC
│   ├── docx_ext.py        # Word extractor
│   ├── pdf_ext.py         # PDF extractor
│   ├── xlsx_ext.py        # Excel extractor
│   ├── pptx_ext.py        # PowerPoint extractor
│   └── csv_ext.py         # CSV extractor
plugin/                    # Claude Code plugin wrapper
tests/
├── test_extractors.py
├── test_router.py
├── test_segmenter.py
├── test_budget.py
└── test_grep.py
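
The layout suggests one extractor per format behind a shared abstract base class. The sketch below shows what that pattern could look like. The class names, the `extensions` attribute, and `find_extractor` are hypothetical and not taken from the shuck-file source.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Extractor(ABC):
    """Hypothetical base class: one subclass per supported format."""

    extensions: tuple = ()

    @abstractmethod
    def extract(self, path: Path) -> str:
        """Return the document at `path` as Markdown."""


class CsvExtractor(Extractor):
    extensions = (".csv",)

    def extract(self, path: Path) -> str:
        # Placeholder body; a real extractor would render a Markdown table.
        return path.read_text(encoding="utf-8")


def find_extractor(path: Path, extractors) -> Extractor:
    """Pick the first extractor whose extensions match the file suffix."""
    suffix = path.suffix.lower()
    for extractor in extractors:
        if suffix in extractor.extensions:
            return extractor
    raise ValueError(f"Unsupported format: {suffix or path.name}")


print(type(find_extractor(Path("data.CSV"), [CsvExtractor()])).__name__)
```

Dispatching on the lowercased suffix means a new format only needs a new subclass and an entry in the extractor list.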

License

MIT
