Lightweight MCP server + CLI for extracting text from PDFs and images. Agent-first, zero-config.

Text Extractor

Lightweight MCP server + CLI for extracting text from PDFs and images, designed for agent workflows.

It is fast on ordinary digital PDFs and can still recover text from difficult documents through a quality-aware fallback chain and optional OCR.

Features

  • Zero-config extraction for standard PDFs
  • Smart fallback chain for low-quality or scanned pages
  • Optional OCR stack for image-heavy PDFs and image files
  • MCP tools ready for VS Code, Claude, and agent runtimes
  • CLI and Python API in one package

Install

Install the package:

pip install text-extractor-lightweight

Optional extras:

# Fast extraction via PyMuPDF (recommended: handles complex fonts, 10-20x faster than pypdf)
pip install "text-extractor-lightweight[fast]"

# OCR support (scanned PDFs, images)
pip install "text-extractor-lightweight[ocr]"

# Better handling of complex layouts
pip install "text-extractor-lightweight[docling]"

# Everything
pip install "text-extractor-lightweight[all]"

System dependencies for OCR:

Windows (winget):

winget install --id tesseract-ocr.tesseract -e
winget install --id oschwartz10612.Poppler -e

CLI

The package is named text-extractor-lightweight. It installs two commands: text-extractor (the CLI) and text-extractor-mcp (the MCP server).

# Extract full text from a PDF
text-extractor report.pdf

# Extract from an image
text-extractor screenshot.png

# Extract only specific pages
text-extractor report.pdf --pages 1-5

# Show document metadata/summary only
text-extractor report.pdf --info

# Show which strategy would be used
text-extractor report.pdf --strategy

# Chunk large output by token budget
text-extractor large.pdf --chunk-tokens 50000
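
The --chunk-tokens option splits large output by a token budget. How the tool counts tokens and where it cuts is not documented here, so the following is only a minimal sketch of the idea, assuming a rough estimate of four characters per token and splits on paragraph boundaries. The names estimate_tokens and chunk_by_tokens are hypothetical and not part of the package.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def chunk_by_tokens(text: str, budget: int) -> list[str]:
    """Greedily pack paragraphs into chunks.

    The per-paragraph estimates in each chunk sum to at most `budget`.
    A single paragraph larger than the budget becomes its own oversized chunk.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip empty paragraphs
        cost = estimate_tokens(para)
        if current and used + cost > budget:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps sentences intact, at the cost of chunks that are somewhat smaller than the budget allows.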

MCP Server

Run directly with uvx:

uvx --from text-extractor-lightweight text-extractor-mcp

Claude Code

claude mcp add text-extractor -- uvx --from text-extractor-lightweight text-extractor-mcp

VS Code / GitHub Copilot

Add to .vscode/mcp.json:

{
  "servers": {
    "text-extractor": {
      "command": "uvx",
      "args": ["--from", "text-extractor-lightweight", "text-extractor-mcp"]
    }
  }
}
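
If .vscode/mcp.json already lists other servers, the entry above has to be merged in rather than pasted over the file. The helper below is a minimal sketch of that merge. It assumes the file is plain JSON without comments, and add_server is a hypothetical name, not something the package provides.

```python
import json
from pathlib import Path

# Same entry as in the config shown above.
SERVER_ENTRY = {
    "command": "uvx",
    "args": ["--from", "text-extractor-lightweight", "text-extractor-mcp"],
}


def add_server(config_path: Path, name: str = "text-extractor") -> dict:
    """Add or update one entry under "servers", keeping any existing entries."""
    config = {}
    if config_path.exists():
        # Assumes plain JSON; a file containing comments would fail to parse.
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("servers", {})[name] = SERVER_ENTRY
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling add_server(Path(".vscode/mcp.json")) from the workspace root would create the file if it is missing and leave other server entries untouched.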

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "text-extractor": {
      "command": "uvx",
      "args": ["--from", "text-extractor-lightweight", "text-extractor-mcp"]
    }
  }
}

Exposed MCP Tools

Tool                    Description
extract_text_from_file  Extract full text from a PDF or image as Markdown
extract_text_pages      Extract text from a specific page range
get_document_info       Get page count, type, metadata, and token estimate
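
For orientation, this is roughly what an MCP client sends when it invokes one of these tools. MCP uses JSON-RPC 2.0, and tool invocations go through the tools/call method. The sketch only builds the message; a real client also performs the initialize handshake first and handles transport framing. The argument name file_path is a guess for illustration, since the tools' input schemas are not listed here.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(request)


# "file_path" is a hypothetical argument name; the real one is in the tool's
# input schema, which the server returns for an MCP `tools/list` request.
message = build_tool_call(1, "get_document_info", {"file_path": "report.pdf"})
print(message)
```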

Extraction Strategy

Automatic routing selects the best backend by file type and quality:

Image file        -> Tesseract OCR
Digital PDF       -> pymupdf (fast path, recommended)
                     pypdf (fallback if pymupdf not installed)
Garbled PDF text  -> pdfplumber -> docling (optional)
Scanned/image PDF -> pdf2image + Tesseract OCR

PDF fallback chain:

pymupdf -> pdfplumber -> docling -> pdf2image+OCR

PyMuPDF (pip install "text-extractor-lightweight[fast]") is the preferred PDF backend. It correctly decodes custom embedded fonts (where pypdf/pdfplumber emit garbled (cid:N) output) and is 10–20x faster than pypdf.
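
The fallback chain can be pictured as a loop that tries each backend in order and accepts the first result that passes a quality check. The sketch below illustrates that pattern only. The quality heuristic, its thresholds, and the function names are assumptions, not the package's actual implementation, and the backends are stand-in callables.

```python
import re
from typing import Callable, Optional

# Matches the "(cid:123)" placeholders emitted when a font cannot be decoded.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def looks_usable(text: str, min_chars: int = 20, max_cid_ratio: float = 0.05) -> bool:
    """Heuristic quality check (thresholds are illustrative assumptions)."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False  # likely a scanned page with no text layer
    cid_chars = sum(len(m) for m in CID_PATTERN.findall(stripped))
    return cid_chars / len(stripped) <= max_cid_ratio


def extract_with_fallback(
    path: str, backends: list[tuple[str, Callable[[str], str]]]
) -> tuple[Optional[str], str]:
    """Try each (name, extractor) pair in order; return the first usable result."""
    for name, extractor in backends:
        try:
            text = extractor(path)
        except Exception:
            continue  # backend missing or failed; move on to the next one
        if looks_usable(text):
            return name, text
    return None, ""


# Stand-in extractors; real ones would wrap pymupdf, pdfplumber, and so on.
backends = [
    ("pymupdf", lambda p: ""),                          # pretend: no text layer
    ("pdfplumber", lambda p: "(cid:12)(cid:40)" * 10),  # pretend: garbled output
    ("ocr", lambda p: "Quarterly report, page one of three."),
]
print(extract_with_fallback("report.pdf", backends)[0])  # prints "ocr"
```

Ordering the backends from cheapest to most expensive means OCR only runs when the faster extractors produce nothing usable.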

Python API

from text_extractor.extract import extract_text, extract_raw

# Markdown output
markdown = extract_text("report.pdf")

# Structured output
result = extract_raw("report.pdf")
print(result.total_pages, result.estimated_tokens)
for page in result.pages:
    print(f"Page {page.page_number}: {page.char_count} chars")
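
The per-page character counts make it easy to spot pages where extraction returned little or no text, for example scanned pages. The helper below is a small sketch that uses only the attributes shown above (pages, page_number, char_count). The name sparse_pages and the 50-character threshold are illustrative choices, and the demo uses a stand-in object so it runs without the package installed.

```python
from types import SimpleNamespace


def sparse_pages(result, min_chars: int = 50) -> list[int]:
    """Return page numbers whose extracted text is shorter than `min_chars`."""
    return [p.page_number for p in result.pages if p.char_count < min_chars]


# Stand-in object exposing the same attributes as the result shown above.
fake_result = SimpleNamespace(
    total_pages=3,
    pages=[
        SimpleNamespace(page_number=1, char_count=1800),
        SimpleNamespace(page_number=2, char_count=12),
        SimpleNamespace(page_number=3, char_count=0),
    ],
)
print(sparse_pages(fake_result))  # prints [2, 3]
```

With the package installed, the same helper should work on a real result: sparse_pages(extract_raw("report.pdf")).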

Troubleshooting

  • If text-extractor is not found on Windows, ensure the Python Scripts directory is on PATH.
  • In MCP stdio mode the server expects MCP protocol messages on stdin, so do not type or pipe arbitrary input into it; only an MCP client should talk to the server.
  • If OCR is not triggered on scanned docs, confirm Tesseract and Poppler are installed and visible to the process.
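
To check the last point, the snippet below reports whether the OCR executables are visible to the current Python process. It is a minimal sketch that assumes the usual binary names: tesseract for Tesseract and pdftoppm for Poppler, which is the tool pdf2image calls by default. The function name find_ocr_tools is hypothetical.

```python
import shutil


def find_ocr_tools() -> dict:
    """Map each required executable to its resolved path, or None if not on PATH."""
    # `tesseract` is the Tesseract CLI; `pdftoppm` ships with Poppler and is
    # what pdf2image uses by default to rasterize PDF pages.
    return {name: shutil.which(name) for name in ("tesseract", "pdftoppm")}


for name, path in find_ocr_tools().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

Run it with the same interpreter and environment that launches the MCP server, since a GUI-launched client may see a different PATH than an interactive shell.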

Release

Quick publish flow:

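# Build the sdist and wheel into dist/
python -m build
# Validate metadata and README rendering
python -m twine check dist/*
# Upload to TestPyPI first as a trial run
python -m twine upload --repository testpypi dist/*
# Publish to PyPI
python -m twine upload dist/*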

For the full step-by-step process, see RELEASE_CHECKLIST.md.

License

MIT
