Local-first document toolkit. Streaming OCR (GLM-OCR via Ollama), PII anonymizer, and 19 chainable PDF/Markdown/HTML tools. MCP server included.
paperloom
Local-first document toolkit. Streaming OCR + PII anonymizer + 19 chainable tools. MCP-native.
paperloom is the Python library and MCP server behind the paperloom web app, a local-first tool for OCR, PDF/Markdown/HTML transforms, and PII redaction. Every tool runs on your machine. No cloud round-trips, no telemetry.
Why paperloom
paperloom rides a state-of-the-art OCR model — GLM-OCR scores 94.62 on OmniDocBench V1.5 (rank #1) and is SOTA on formula / table recognition and information extraction. paperloom commits to tracking the current SOTA: when a stronger open model ships, the Ollama pin gets updated.
paperloom's value-add is agent orchestration around the model:
- 19 chainable tools: pdf-to-images → ocr → anonymize → markdown-to-pdf in one call.
- MCP server with security model: register_file + path allowlist + file_id tokens. Drop-in for Claude Desktop, Claude Code, Cursor, Cline, Agno.
- Built-in PII redaction: OPF model, 8 entity categories, verbatim.
- Streaming SSE — Markdown emits page-by-page as the OCR model writes it.
- One Ollama dep — reuses any GLM-OCR model you already pulled. No multi-GB model zoo download.
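The register_file + path allowlist + file_id pattern named in the list above can be sketched roughly as follows. This is an illustration only, not paperloom's actual implementation: the FileRegistry class and its resolve method are hypothetical, and the only detail taken from this README is the comma-separated directory format used by PAPERLOOM_MCP_ALLOWED_DIRS.

```python
import secrets
from pathlib import Path


class FileRegistry:
    """Hypothetical registry mapping opaque file_id tokens to allowlisted paths."""

    def __init__(self, allowed_dirs: str) -> None:
        # Comma-separated directories, as in PAPERLOOM_MCP_ALLOWED_DIRS.
        self._allowed = [
            Path(d.strip()).expanduser().resolve()
            for d in allowed_dirs.split(",")
            if d.strip()
        ]
        self._files: dict[str, Path] = {}

    def register_file(self, path: str) -> str:
        # Resolve symlinks and ".." first so the check sees the real location.
        real = Path(path).expanduser().resolve()
        if not any(real.is_relative_to(root) for root in self._allowed):
            raise PermissionError(f"{real} is outside the allowlist")
        # Hand back an unguessable token instead of echoing the path.
        file_id = secrets.token_urlsafe(16)
        self._files[file_id] = real
        return file_id

    def resolve(self, file_id: str) -> Path:
        # Tools look up the token; unknown tokens raise KeyError.
        return self._files[file_id]
```

The point of the design is that a client only ever passes tokens to tools, so a prompt-injected path such as ../../etc/passwd is rejected once at registration time rather than checked in every tool.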
For raw model quality on dense scientific PDFs, marker, docling, and MinerU are excellent companion projects — paperloom doesn't try to out-research them, it focuses on the orchestration and privacy layer around the model.
Install
# Library + CLI + MCP server (no PDF rendering, no anonymizer):
uvx paperloom doctor
# Full toolkit:
uvx --with 'paperloom[all]' paperloom doctor
# Or pip:
pip install paperloom # core
pip install 'paperloom[pdf]' # + WeasyPrint (markdown→pdf, html→pdf)
pip install 'paperloom[all]' # everything published on PyPI
pdf extra needs native libs (brew install pango on macOS).
The OPF anonymizer is not a PyPI extra — it's distributed as a git repo. The anonymize tool auto-installs it on first call (~250 MB Python deps + ~4 GB checkpoint). To opt out and install manually:
export PAPERLOOM_AUTO_INSTALL_OPF=0 # disables the auto-installer
uv pip install 'opf @ git+https://github.com/openai/privacy-filter@main'
Use as a library
from paperloom import ocr_to_markdown, anonymize, Chain
# One-shot OCR
md = ocr_to_markdown("scan.pdf")
# Redact PII
clean = anonymize(md, preset="balanced")
# Compose tools
result = Chain([
    ("pdf-to-images", {"dpi": 200}),
    ("ocr-to-markdown", {}),
    ("anonymize", {"preset": "recall"}),
]).run(["doc.pdf"])
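To make the composition idea concrete, here is a minimal sketch of how a list of (step name, options) pairs can be threaded into a pipeline. It is not paperloom's Chain implementation: run_chain, the STEPS registry, and the two toy steps are hypothetical stand-ins chosen so the example runs without Ollama.

```python
from typing import Any, Callable

# Hypothetical step signature: previous output plus an options dict.
Step = Callable[[Any, dict], Any]


def split_pages(doc: str, opts: dict) -> list[str]:
    # Toy stand-in for a step that fans one document out into pages.
    return doc.split(opts.get("sep", "\f"))


def upper_case(pages: list[str], opts: dict) -> list[str]:
    # Toy stand-in for a per-page transform such as OCR or redaction.
    return [p.upper() for p in pages]


STEPS: dict[str, Step] = {"split-pages": split_pages, "upper-case": upper_case}


def run_chain(steps: list[tuple[str, dict]], data: Any) -> Any:
    # Feed each step's output into the next, in order.
    for name, opts in steps:
        if name not in STEPS:
            raise KeyError(f"unknown step: {name}")
        data = STEPS[name](data, opts)
    return data


result = run_chain(
    [("split-pages", {"sep": "|"}), ("upper-case", {})],
    "page one|page two",
)
print(result)  # ['PAGE ONE', 'PAGE TWO']
```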
Use as an MCP server
uvx --from paperloom paperloom-mcp
Wire into Claude Desktop:
{
  "mcpServers": {
    "paperloom": {
      "command": "uvx",
      "args": ["--from", "paperloom", "paperloom-mcp"],
      "env": {
        "PAPERLOOM_MCP_ALLOWED_DIRS": "/Users/you/Documents,/Users/you/Downloads"
      }
    }
  }
}
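If you would rather script this than edit the file by hand, the sketch below merges the entry above into an existing JSON config without clobbering other servers. add_mcp_server is a hypothetical helper, not part of paperloom, and the config path is passed in explicitly because its location depends on your OS and client.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, entry: dict) -> dict:
    # Load the existing config if present; start from an empty one otherwise.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    # Keep servers that are already configured; add or overwrite only this one.
    config.setdefault("mcpServers", {})[name] = entry
    config_path.write_text(json.dumps(config, indent=2))
    return config


# Same values as the JSON snippet above.
PAPERLOOM_ENTRY = {
    "command": "uvx",
    "args": ["--from", "paperloom", "paperloom-mcp"],
    "env": {
        "PAPERLOOM_MCP_ALLOWED_DIRS": "/Users/you/Documents,/Users/you/Downloads"
    },
}
```

Usage would be along the lines of add_mcp_server(Path("path/to/your/client/config.json"), "paperloom", PAPERLOOM_ENTRY), where the path is a placeholder for your client's config file. Back the file up first, since the helper rewrites it in place.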
Use from the CLI
paperloom ocr scan.pdf -o out.md
paperloom anonymize out.md --preset recall
paperloom chain --steps pdf-to-images,ocr-to-markdown,anonymize doc.pdf
paperloom doctor # check Ollama, glm-ocr, OPF, allowlist
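For batch jobs, the CLI can be driven from Python. This sketch only builds and runs the paperloom ocr invocation shown above (an input path plus -o); the ocr_command and ocr_folder helpers are hypothetical, and it assumes the paperloom executable is on PATH.

```python
import subprocess
from pathlib import Path


def ocr_command(pdf: Path, out_dir: Path) -> list[str]:
    # Mirrors `paperloom ocr scan.pdf -o out.md` from the examples above.
    return ["paperloom", "ocr", str(pdf), "-o", str(out_dir / f"{pdf.stem}.md")]


def ocr_folder(src: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src.glob("*.pdf")):
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(ocr_command(pdf, out_dir), check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.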
Requirements
- Ollama with glm-ocr:latest pulled (ollama pull glm-ocr:latest).
- Python 3.11+.
- Hardware: GLM-OCR is RAM-bound. Recommended: Apple Silicon M-series Pro with ≥ 24 GB unified RAM, or x86 with 24 GB+ GPU VRAM. Minimum 16 GB. Below 16 GB the OS can freeze hard enough to require a reboot — don't.
- (Optional) pango for the pdf extra; ~4 GB checkpoint for the anonymizer extra (auto-downloaded on first call).
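A rough preflight check for the first two requirements might look like this. It is not a substitute for paperloom doctor: the helper names are hypothetical, and the sketch assumes Ollama is listening on its default local address (http://localhost:11434) and serves its model list at /api/tags.

```python
import json
import sys
import urllib.request


def python_ok(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    # The requirements list asks for Python 3.11 or newer.
    return version >= (3, 11)


def has_model(tags: dict, name: str = "glm-ocr:latest") -> bool:
    # Expects the shape {"models": [{"name": "..."}, ...]}.
    return any(m.get("name") == name for m in tags.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    # Assumes a local Ollama server; raises URLError if it is not running.
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With Ollama running, has_model(fetch_tags()) reports whether the model has been pulled. The hardware requirement is deliberately not checked here, since available RAM or VRAM cannot be read portably from the standard library.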
License
MIT. Depends on OpenAI Privacy Filter (Apache 2.0) when the anonymizer extra is installed.
File details
Details for the file paperloom-0.1.0.tar.gz.
File metadata
- Download URL: paperloom-0.1.0.tar.gz
- Upload date:
- Size: 55.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | abf9d6eece35e72a18814cb3ce02eac43b53015a45fcd4ae1b4c56f766f6f395 |
| MD5 | 68a9df4aeb8f1b3037401448ee678378 |
| BLAKE2b-256 | ca515cd0580e1c185680a7835bd21804b931b3f24323006dc425d3d952df6039 |
File details
Details for the file paperloom-0.1.0-py3-none-any.whl.
File metadata
- Download URL: paperloom-0.1.0-py3-none-any.whl
- Upload date:
- Size: 67.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 231b8eea2e86b7e54d4554cd3ab1a2eb083c418871a0a6ae7e08562c544986ef |
| MD5 | 3fdfcc44e26c7b44412754c1bbf8b95f |
| BLAKE2b-256 | 1c56dfad84a9f0727bf9e4f9ef97e6bc8697fcecf63b2a59cafc3a3041764faf |