Lightweight MCP server + CLI for extracting text from PDFs and images. Agent-first, zero-config.

Text Extractor

Lightweight MCP server + CLI for extracting text from PDFs and images, designed for agent workflows.

It is fast on ordinary digital PDFs and can still recover text from difficult documents through a quality-aware fallback chain and optional OCR.

Features

Zero-config extraction for standard PDFs
Smart fallback chain for low-quality or scanned pages
Optional OCR stack for image-heavy PDFs and image files
MCP tools ready for VS Code, Claude, and agent runtimes
CLI and Python API in one package

Install

Install the package:

pip install text-extractor-lightweight

Optional extras:

# Fast extraction via PyMuPDF (recommended; handles complex fonts, 10-20x faster than pypdf)
pip install "text-extractor-lightweight[fast]"

# OCR support (scanned PDFs, images)
pip install "text-extractor-lightweight[ocr]"

# Better handling of complex layouts
pip install "text-extractor-lightweight[docling]"

# Everything
pip install "text-extractor-lightweight[all]"

System dependencies for OCR:

Tesseract OCR: https://github.com/tesseract-ocr/tesseract
Poppler (pdftoppm) for PDF to image conversion

Windows (winget):

winget install --id tesseract-ocr.tesseract -e
winget install --id oschwartz10612.Poppler -e

CLI

The package is named text-extractor-lightweight. It installs two CLI commands: text-extractor and text-extractor-mcp.

# Extract full text from a PDF
text-extractor report.pdf

# Extract from an image
text-extractor screenshot.png

# Extract only specific pages
text-extractor report.pdf --pages 1-5

# Show document metadata/summary only
text-extractor report.pdf --info

# Show which strategy would be used
text-extractor report.pdf --strategy

# Chunk large output by token budget
text-extractor large.pdf --chunk-tokens 50000
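
The `--chunk-tokens` option splits large output by a token budget. How the package counts tokens and where it splits is not documented here, so the following is only a minimal sketch of the general idea. It assumes a rough heuristic of about 4 characters per token and splits on blank-line paragraph boundaries; `estimate_tokens` and `chunk_by_tokens` are hypothetical names, not part of the package.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, -(-len(text) // chars_per_token))  # ceiling division


def chunk_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_tokens each.

    A single paragraph larger than the budget becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in text.split("\n\n"):
        cost = estimate_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps sentences intact at the cost of uneven chunk sizes; a real implementation would likely use the model's own tokenizer rather than a character count.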

MCP Server

Run directly with uvx:

uvx --from text-extractor-lightweight text-extractor-mcp

Claude Code

claude mcp add text-extractor -- uvx --from text-extractor-lightweight text-extractor-mcp

VS Code / GitHub Copilot

Add to .vscode/mcp.json:

{
  "servers": {
    "text-extractor": {
      "command": "uvx",
      "args": ["--from", "text-extractor-lightweight", "text-extractor-mcp"]
    }
  }
}

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "text-extractor": {
      "command": "uvx",
      "args": ["--from", "text-extractor-lightweight", "text-extractor-mcp"]
    }
  }
}
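
The VS Code and Claude Desktop snippets above differ only in their top-level key (`servers` versus `mcpServers`). As a small sketch, both can be generated from one definition; `mcp_config` is a hypothetical helper written for this example, not something the package provides.

```python
import copy
import json

# Launch command taken from the configuration snippets above.
SERVER_ENTRY = {
    "command": "uvx",
    "args": ["--from", "text-extractor-lightweight", "text-extractor-mcp"],
}


def mcp_config(top_level_key: str, name: str = "text-extractor") -> dict:
    """Build a client config dict; top_level_key is "servers" or "mcpServers"."""
    return {top_level_key: {name: copy.deepcopy(SERVER_ENTRY)}}


if __name__ == "__main__":
    print(json.dumps(mcp_config("servers"), indent=2))     # for .vscode/mcp.json
    print(json.dumps(mcp_config("mcpServers"), indent=2))  # for claude_desktop_config.json
```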

Exposed MCP Tools

| Tool | Description |
| --- | --- |
| `extract_text_from_file` | Extract full text from a PDF or image as markdown |
| `extract_text_pages` | Extract text from a specific page range |
| `get_document_info` | Get page count, type, metadata, and token estimate |

Extraction Strategy

Automatic routing selects the best backend by file type and quality:

Image file        -> Tesseract OCR
Digital PDF       -> pymupdf (fast path, recommended)
                     pypdf (fallback if pymupdf not installed)
Garbled PDF text  -> pdfplumber -> docling (optional)
Scanned/image PDF -> pdf2image + Tesseract OCR

PDF fallback chain:

pymupdf -> pdfplumber -> docling -> pdf2image+OCR

PyMuPDF (pip install "text-extractor-lightweight[fast]") is the preferred PDF backend. It correctly decodes custom embedded fonts (where pypdf/pdfplumber emit garbled (cid:N) output) and is 10–20x faster than pypdf.
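
To illustrate the fallback idea, here is a minimal sketch of a quality-aware chain. It is not the package's implementation: the quality check (minimum length plus the share of `(cid:N)` markers), its thresholds, and the names `looks_usable` and `extract_with_fallback` are all assumptions made for this example, and the backends are stand-in callables rather than the real libraries.

```python
import re
from typing import Callable, Optional

CID_MARKER = re.compile(r"\(cid:\d+\)")


def looks_usable(text: str, min_chars: int = 20, max_cid_ratio: float = 0.05) -> bool:
    """Hypothetical quality check: enough text and few '(cid:N)' markers."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False  # near-empty output suggests a scanned page
    cid_chars = sum(len(m) for m in CID_MARKER.findall(stripped))
    return cid_chars / len(stripped) <= max_cid_ratio


def extract_with_fallback(
    path: str, backends: list[tuple[str, Callable[[str], str]]]
) -> tuple[Optional[str], str]:
    """Try each backend in order; return (name, text) for the first usable result."""
    for name, backend in backends:
        try:
            text = backend(path)
        except Exception:
            continue  # backend missing or failed: move down the chain
        if looks_usable(text):
            return name, text
    return None, ""


if __name__ == "__main__":
    # Stand-in backends that simulate the three typical outcomes.
    backends = [
        ("pymupdf", lambda p: ""),  # no text layer
        ("pdfplumber", lambda p: "(cid:12)(cid:7)(cid:33)(cid:4)"),  # garbled fonts
        ("ocr", lambda p: "Quarterly report: revenue grew in all regions."),
    ]
    print(extract_with_fallback("report.pdf", backends))
```

With these stand-ins, the first two results are rejected (empty and all `(cid:N)` markers respectively), so the chain falls through to the last backend.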

Python API

from text_extractor.extract import extract_text, extract_raw

# Markdown output
markdown = extract_text("report.pdf")

# Structured output
result = extract_raw("report.pdf")
print(result.total_pages, result.estimated_tokens)
for page in result.pages:
    print(f"Page {page.page_number}: {page.char_count} chars")
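
As a hedged usage sketch, the per-page character counts can be used to spot pages that came back nearly empty, which often indicates a scanned page. The helper below relies only on the `pages`, `page_number`, and `char_count` attributes shown above; `sparse_pages` and its threshold of 50 characters are assumptions for illustration, and a stand-in object is used so the snippet runs without the package installed.

```python
from types import SimpleNamespace


def sparse_pages(result, min_chars: int = 50) -> list[int]:
    """Return page numbers whose extracted text is shorter than min_chars."""
    return [p.page_number for p in result.pages if p.char_count < min_chars]


# Stand-in for the object returned by extract_raw("report.pdf").
fake_result = SimpleNamespace(
    pages=[
        SimpleNamespace(page_number=1, char_count=1840),
        SimpleNamespace(page_number=2, char_count=12),
        SimpleNamespace(page_number=3, char_count=0),
    ]
)
print(sparse_pages(fake_result))  # [2, 3]
```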

Troubleshooting

If text-extractor is not found on Windows, ensure the Python Scripts directory is in PATH.
In MCP stdio mode, do not write arbitrary JSON to stdin; only an MCP client should talk to the server.
If OCR is not triggered on scanned docs, confirm Tesseract and Poppler are installed and visible to the process.
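
A quick way to check whether the OCR tools are visible to a process is to look them up on `PATH` from Python. This is a small sketch using only the standard library; `find_ocr_tools` is a hypothetical helper name, and the executable names `tesseract` and `pdftoppm` are the ones mentioned in the Install section. Run it in the same environment that launches the MCP server, since that environment's `PATH` may differ from your interactive shell's.

```python
import shutil


def find_ocr_tools(tools=("tesseract", "pdftoppm")) -> dict:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, location in find_ocr_tools().items():
        print(f"{tool}: {location or 'NOT FOUND on PATH'}")
```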

Release

Quick publish flow:

python -m build
python -m twine check dist/*
python -m twine upload --repository testpypi dist/*
python -m twine upload dist/*

For the full step-by-step process, see RELEASE_CHECKLIST.md.

License

MIT

Version 1.0.1 (Apr 23, 2026)
