Lightweight MCP server + CLI for extracting text from PDFs and images. Agent-first, zero-config.
Text Extractor
Lightweight MCP server + CLI for extracting text from PDFs and images, designed for agent workflows.
It is fast on normal digital PDFs, but can still recover text from difficult documents with quality-aware fallback and optional OCR.
Features
- Zero-config extraction for standard PDFs
- Smart fallback chain for low-quality or scanned pages
- Optional OCR stack for image-heavy PDFs and image files
- MCP tools ready for VS Code, Claude, and agent runtimes
- CLI and Python API in one package
Install
Install the package:
pip install text-extractor-lightweight
Optional extras:
# Fast extraction via PyMuPDF (recommended: handles complex fonts, 10-20x faster than pypdf)
pip install "text-extractor-lightweight[fast]"
# OCR support (scanned PDFs, images)
pip install "text-extractor-lightweight[ocr]"
# Better handling of complex layouts
pip install "text-extractor-lightweight[docling]"
# Everything
pip install "text-extractor-lightweight[all]"
System dependencies for OCR:
- Tesseract OCR: https://github.com/tesseract-ocr/tesseract
- Poppler (pdftoppm) for PDF-to-image conversion
Windows (winget):
winget install --id tesseract-ocr.tesseract -e
winget install --id oschwartz10612.Poppler -e
CLI
The package name is text-extractor-lightweight, and it installs CLI commands text-extractor and text-extractor-mcp.
# Extract full text from a PDF
text-extractor report.pdf
# Extract from an image
text-extractor screenshot.png
# Extract only specific pages
text-extractor report.pdf --pages 1-5
# Show document metadata/summary only
text-extractor report.pdf --info
# Show which strategy would be used
text-extractor report.pdf --strategy
# Chunk large output by token budget
text-extractor large.pdf --chunk-tokens 50000
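The splitting rules behind --chunk-tokens are not described here, so the following is only a minimal sketch of what chunking by token budget can look like. It assumes a rough heuristic of about four characters per token and splits on blank lines; `estimate_tokens` and `chunk_by_tokens` are hypothetical names, not part of the package.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate. ASSUMPTION: about 4 characters per token."""
    return max(1, len(text) // 4)


def chunk_by_tokens(text: str, budget: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most `budget` estimated tokens.

    A single paragraph larger than the budget becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in text.split("\n\n"):
        cost = estimate_tokens(para)
        # Close the current chunk before it would exceed the budget.
        if current and used + cost > budget:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps sentences intact, at the cost of chunks that only approximate the budget.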
MCP Server
Run directly with uvx:
uvx --from text-extractor-lightweight text-extractor-mcp
Claude Code
claude mcp add text-extractor -- uvx --from text-extractor-lightweight text-extractor-mcp
VS Code / GitHub Copilot
Add to .vscode/mcp.json:
{
  "servers": {
    "text-extractor": {
      "command": "uvx",
      "args": ["--from", "text-extractor-lightweight", "text-extractor-mcp"]
    }
  }
}
Claude Desktop
Add to claude_desktop_config.json:
{
  "mcpServers": {
    "text-extractor": {
      "command": "uvx",
      "args": ["--from", "text-extractor-lightweight", "text-extractor-mcp"]
    }
  }
}
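The VS Code and Claude Desktop configs share the same server entry and differ only in the top-level key. As a small sketch, both can be generated from one definition; `mcp_config` is a hypothetical helper, and writing the output to the right file is left to you.

```python
import json

# Server entry shared by both clients, copied from the configs above.
SERVER_ENTRY = {
    "command": "uvx",
    "args": ["--from", "text-extractor-lightweight", "text-extractor-mcp"],
}


def mcp_config(top_level_key: str) -> str:
    """Return JSON config text for a client that expects `top_level_key`."""
    return json.dumps({top_level_key: {"text-extractor": SERVER_ENTRY}}, indent=2)


vscode_json = mcp_config("servers")     # contents for .vscode/mcp.json
claude_json = mcp_config("mcpServers")  # contents for claude_desktop_config.json
```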
Exposed MCP Tools
| Tool | Description |
|---|---|
| extract_text_from_file | Extract full text from a PDF or image as markdown |
| extract_text_pages | Extract text from a specific page range |
| get_document_info | Get page count, type, metadata, and token estimate |
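An MCP client normally builds and sends tool calls for you, after an initialize handshake with the server. Purely to illustrate the shape of the message a client sends, here is a sketch of a JSON-RPC 2.0 `tools/call` request. The argument name `path` is an assumption for illustration; the real argument names come from each tool's input schema.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique per session


def tools_call_request(tool: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single JSON-RPC 2.0 line."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# ASSUMPTION: "path" is a hypothetical argument name; check the tool's schema.
line = tools_call_request("get_document_info", {"path": "report.pdf"})
```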
Extraction Strategy
Automatic routing selects the best backend by file type and quality:
Image file        -> Tesseract OCR
Digital PDF       -> pymupdf (fast path, recommended)
                     pypdf (fallback if pymupdf not installed)
Garbled PDF text  -> pdfplumber -> docling (optional)
Scanned/image PDF -> pdf2image + Tesseract OCR
PDF fallback chain:
pymupdf -> pdfplumber -> docling -> pdf2image+OCR
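The chain above can be pictured as "try each backend in order and keep the first usable result". The sketch below shows that pattern with stand-in backends; it is not the package's implementation. The `min_chars` check and all function names here are assumptions for illustration.

```python
from typing import Callable, Optional

Backend = Callable[[str], str]


def extract_with_fallback(
    path: str,
    backends: list[tuple[str, Backend]],
    min_chars: int = 20,
) -> tuple[Optional[str], str]:
    """Try each backend in order; return (backend_name, text) for the first usable result.

    A backend is skipped if it raises (for example, an optional dependency is
    missing) or returns fewer than `min_chars` non-whitespace characters.
    """
    for name, backend in backends:
        try:
            text = backend(path)
        except Exception:
            continue  # backend unavailable or failed; try the next one
        if len("".join(text.split())) >= min_chars:
            return name, text
    return None, ""


# Stand-in backends for illustration only; they do not call the real libraries.
def fake_pymupdf(path: str) -> str:
    raise ImportError("pymupdf not installed")


def fake_pdfplumber(path: str) -> str:
    return "   "  # e.g. a scanned page with no text layer


def fake_ocr(path: str) -> str:
    return "Quarterly report: revenue grew in all regions."


chain = [
    ("pymupdf", fake_pymupdf),
    ("pdfplumber", fake_pdfplumber),
    ("pdf2image+OCR", fake_ocr),
]
used, text = extract_with_fallback("scan.pdf", chain)
```

Here the first backend is missing and the second returns only whitespace, so the OCR stand-in is the one that produces the result.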
PyMuPDF (pip install "text-extractor-lightweight[fast]") is the preferred PDF backend. It correctly decodes custom embedded fonts (where pypdf/pdfplumber emit garbled (cid:N) output) and is 10–20x faster than pypdf.
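One way to notice the garbled (cid:N) output mentioned above is to measure how much of the extracted text consists of such placeholders. This is a sketch of that idea only; the package's real quality check is not documented here, and the 0.2 threshold is an illustrative assumption.

```python
import re

# Matches placeholders such as (cid:12) emitted when a font cannot be decoded.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Fraction of the text's characters taken up by (cid:N) placeholders."""
    if not text:
        return 0.0
    garbled = sum(len(m.group(0)) for m in CID_TOKEN.finditer(text))
    return garbled / len(text)


def looks_garbled(text: str, threshold: float = 0.2) -> bool:
    """ASSUMPTION: the default threshold is illustrative, not the package's value."""
    return cid_ratio(text) >= threshold
```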
Python API
from text_extractor.extract import extract_text, extract_raw
# Markdown output
markdown = extract_text("report.pdf")
# Structured output
result = extract_raw("report.pdf")
print(result.total_pages, result.estimated_tokens)
for page in result.pages:
    print(f"Page {page.page_number}: {page.char_count} chars")
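Building on the per-page fields shown above, a short helper can flag pages with very little extracted text, which often indicates scanned pages. This sketch uses only `pages`, `page_number`, and `char_count` from the example; `low_text_pages` and the `min_chars` default are hypothetical, and a stand-in object replaces a real result so the code runs without a PDF.

```python
from types import SimpleNamespace


def low_text_pages(result, min_chars: int = 50) -> list[int]:
    """Return page numbers whose extracted text is shorter than `min_chars`."""
    return [p.page_number for p in result.pages if p.char_count < min_chars]


# Stand-in shaped like the structured result, for illustration only.
demo = SimpleNamespace(
    pages=[
        SimpleNamespace(page_number=1, char_count=1800),
        SimpleNamespace(page_number=2, char_count=12),
        SimpleNamespace(page_number=3, char_count=0),
    ]
)
print(low_text_pages(demo))  # [2, 3]
```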
Troubleshooting
- If text-extractor is not found on Windows, ensure the Python Scripts directory is in PATH.
- For MCP stdio mode, do not send random JSON to stdin; only an MCP client should talk to the server.
- If OCR is not triggered on scanned docs, confirm Tesseract and Poppler are installed and visible to the process.
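To check whether the OCR system dependencies are visible, you can look them up on PATH from Python. This is a small diagnostic sketch using the standard library; `find_ocr_tools` is a hypothetical helper. Run it in the same environment as the process that launches the extractor, since PATH can differ between shells and MCP clients.

```python
import shutil
from typing import Optional


def find_ocr_tools() -> dict[str, Optional[str]]:
    """Map each required executable to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in ("tesseract", "pdftoppm")}


for name, location in find_ocr_tools().items():
    print(f"{name}: {location or 'NOT FOUND on PATH'}")
```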
Release
Quick publish flow:
python -m build
python -m twine check dist/*
python -m twine upload --repository testpypi dist/*
python -m twine upload dist/*
For the full step-by-step process, see RELEASE_CHECKLIST.md.
License
MIT
File details
Details for the file text_extractor_lightweight-1.0.1.tar.gz.
File metadata
- Download URL: text_extractor_lightweight-1.0.1.tar.gz
- Upload date:
- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 42e7a1a356825b146a5fe21c130346dd36df58a59bb6f6a2e25ba5df32cec23e |
| MD5 | eb4f48b3fc85b79fd149d2bc1531e781 |
| BLAKE2b-256 | c8adafc353d3fe94a12fca714f08d94ab52f7f2ef14e7301ea37569b48153610 |
File details
Details for the file text_extractor_lightweight-1.0.1-py3-none-any.whl.
File metadata
- Download URL: text_extractor_lightweight-1.0.1-py3-none-any.whl
- Upload date:
- Size: 25.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d2dd1bf1e177bd9ee681873f61f3d8e4bda446ee609bf1a45423a4774ac22007 |
| MD5 | 76c13be39deeee814c157efdf9db0814 |
| BLAKE2b-256 | 3a8175fcb3569dc1773498faaef9811f595afe1a892fcc92cfa77b32fe49cca8 |
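The digests listed for each file can be used to verify a download. A minimal sketch with the standard library, assuming the file has already been downloaded locally; `sha256_of` is a hypothetical helper name.

```python
import hashlib


def sha256_of(path: str, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Compare the returned string with the SHA256 value listed for the file you downloaded; any mismatch means the file should not be installed.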