Local-first document toolkit. Streaming OCR (GLM-OCR via Ollama), PII anonymizer, and 19 chainable PDF/Markdown/HTML tools. MCP server included.

Project description

paperloom

Local-first document toolkit. Streaming OCR + PII anonymizer + 19 chainable tools. MCP-native.

paperloom is the Python library and MCP server behind the paperloom web app, a local-first app for OCR, PDF/Markdown/HTML transforms, and PII redaction. Every tool runs on your machine: no cloud round-trips, no telemetry.

Why paperloom

paperloom rides a state-of-the-art OCR model — GLM-OCR scores 94.62 on OmniDocBench V1.5 (rank #1) and is SOTA on formula / table recognition and information extraction. paperloom commits to tracking the current SOTA: when a stronger open model ships, the Ollama pin gets updated.

paperloom's value-add is agent orchestration around the model:

  • 19 chainable tools: pdf-to-images → ocr → anonymize → markdown-to-pdf in one call.
  • MCP server with security model: register_file + path allowlist + file_id tokens. Drop-in for Claude Desktop, Claude Code, Cursor, Cline, Agno.
  • Built-in PII redaction: OPF model, 8 entity categories, verbatim.
  • Streaming SSE: Markdown emits page-by-page as the OCR model writes it.
  • One Ollama dep: reuses any GLM-OCR model you already pulled. No multi-GB model zoo download.

For raw model quality on dense scientific PDFs, marker, docling, and MinerU are excellent companion projects — paperloom doesn't try to out-research them, it focuses on the orchestration and privacy layer around the model.

Install

# Library + CLI + MCP server (no PDF rendering, no anonymizer):
uvx paperloom doctor

# Full toolkit:
uvx --with 'paperloom[all]' paperloom doctor

# Or pip:
pip install paperloom            # core
pip install 'paperloom[pdf]'     # + WeasyPrint (markdown→pdf, html→pdf)
pip install 'paperloom[all]'     # everything published on PyPI

The pdf extra needs native libraries (on macOS: brew install pango).

The OPF anonymizer is not a PyPI extra — it's distributed as a git repo. The anonymize tool auto-installs it on first call (~250 MB Python deps + ~4 GB checkpoint). To opt out and install manually:

export PAPERLOOM_AUTO_INSTALL_OPF=0  # disables the auto-installer
uv pip install 'opf @ git+https://github.com/openai/privacy-filter@main'

Use as a library

from paperloom import ocr_to_markdown, anonymize, Chain

# One-shot OCR
md = ocr_to_markdown("scan.pdf")

# Redact PII
clean = anonymize(md, preset="balanced")

# Compose tools
result = Chain([
    ("pdf-to-images", {"dpi": 200}),
    ("ocr-to-markdown", {}),
    ("anonymize", {"preset": "recall"}),
]).run(["doc.pdf"])
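
To make the composition idea concrete, here is a minimal sketch of how a list of `(name, options)` steps could be executed so that each step's output becomes the next step's input. This is not paperloom's implementation: `run_steps`, `TOY_REGISTRY`, and the two toy tools are hypothetical stand-ins that only mimic the shape of the `Chain` call above.

```python
from typing import Any, Callable

Step = tuple[str, dict[str, Any]]


def run_steps(
    steps: list[Step],
    registry: dict[str, Callable[..., Any]],
    inputs: Any,
) -> Any:
    """Run each named tool in order, feeding its output to the next one."""
    data = inputs
    for name, options in steps:
        if name not in registry:
            raise ValueError(f"unknown tool: {name!r}")
        data = registry[name](data, **options)
    return data


# Toy tools (hypothetical): they only build strings, no real PDF or OCR work.
def split_pages(paths: list[str], dpi: int = 150) -> list[str]:
    return [f"{path}#page1@{dpi}dpi" for path in paths]


def fake_ocr(images: list[str]) -> str:
    return "\n".join(f"text of {image}" for image in images)


TOY_REGISTRY = {"pdf-to-images": split_pages, "ocr-to-markdown": fake_ocr}

result = run_steps(
    [("pdf-to-images", {"dpi": 200}), ("ocr-to-markdown", {})],
    TOY_REGISTRY,
    ["doc.pdf"],
)
```

Keeping steps as plain `(name, options)` pairs means a chain can be stored as data (for example JSON) and validated against the registry before anything runs.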

Use as an MCP server

uvx --from paperloom paperloom-mcp

Wire into Claude Desktop:

{
  "mcpServers": {
    "paperloom": {
      "command": "uvx",
      "args": ["--from", "paperloom", "paperloom-mcp"],
      "env": {
        "PAPERLOOM_MCP_ALLOWED_DIRS": "/Users/you/Documents,/Users/you/Downloads"
      }
    }
  }
}
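
The `PAPERLOOM_MCP_ALLOWED_DIRS` value above is a comma-separated list of directories. A path allowlist of this kind is commonly enforced by resolving the requested path and checking that it sits under one of the allowed roots. The sketch below illustrates that idea only; `parse_allowed_dirs` and `is_allowed` are hypothetical names, the example directories are made up, and the server's real checks may differ.

```python
from pathlib import Path


def parse_allowed_dirs(raw: str) -> list[Path]:
    """Split a comma-separated directory list into resolved absolute paths."""
    return [
        Path(part.strip()).expanduser().resolve()
        for part in raw.split(",")
        if part.strip()
    ]


def is_allowed(path: str, allowed: list[Path]) -> bool:
    """Return True if path, after resolving '..' and symlinks, is under a root."""
    resolved = Path(path).expanduser().resolve()
    # is_relative_to compares whole path components, so "/srv/docs-private"
    # does not count as being inside "/srv/docs".
    return any(resolved.is_relative_to(root) for root in allowed)


# Example directories, assumed for illustration.
allowed = parse_allowed_dirs("/srv/docs, /srv/scans")
inside = is_allowed("/srv/docs/2024/scan.pdf", allowed)
escaped = is_allowed("/srv/docs/../secrets/key.pdf", allowed)
```

Resolving before comparing is the important step: a plain string-prefix check would accept `..` traversal and sibling directories that merely share a prefix.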

Use from the CLI

paperloom ocr scan.pdf -o out.md
paperloom anonymize out.md --preset recall
paperloom chain --steps pdf-to-images,ocr-to-markdown,anonymize doc.pdf
paperloom doctor      # check Ollama, glm-ocr, OPF, allowlist

Requirements

  • Ollama with glm-ocr:latest pulled (ollama pull glm-ocr:latest).
  • Python 3.11+.
  • Hardware: GLM-OCR is RAM-bound. Recommended: Apple Silicon M-series Pro with ≥ 24 GB unified RAM, or x86 with a GPU that has 24 GB+ of VRAM. Minimum: 16 GB. Below 16 GB the OS can freeze hard enough to require a reboot, so do not run it on such machines.
  • (Optional) pango for the pdf extra; ~4 GB checkpoint for the OPF anonymizer (auto-downloaded on first call).

License

MIT. Depends on OpenAI Privacy Filter (Apache 2.0) when the OPF anonymizer is installed.

Download files

Source Distribution

paperloom-0.1.0.tar.gz (55.9 kB)

Uploaded Source

Built Distribution

paperloom-0.1.0-py3-none-any.whl (67.8 kB)

Uploaded Python 3

File details

Details for the file paperloom-0.1.0.tar.gz.

File metadata

  • Download URL: paperloom-0.1.0.tar.gz
  • Upload date:
  • Size: 55.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.7 (macOS)

File hashes

Hashes for paperloom-0.1.0.tar.gz
Algorithm Hash digest
SHA256 abf9d6eece35e72a18814cb3ce02eac43b53015a45fcd4ae1b4c56f766f6f395
MD5 68a9df4aeb8f1b3037401448ee678378
BLAKE2b-256 ca515cd0580e1c185680a7835bd21804b931b3f24323006dc425d3d952df6039
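
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. The following is a minimal sketch; `sha256_of` and `matches_digest` are illustrative helper names, and the local path is assumed to be wherever the archive was saved.

```python
import hashlib
import hmac

# SHA256 digest published above for paperloom-0.1.0.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "abf9d6eece35e72a18814cb3ce02eac43b53015a45fcd4ae1b4c56f766f6f395"
)


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Usage, assuming the archive sits in the current directory:
# matches_digest("paperloom-0.1.0.tar.gz", EXPECTED_SDIST_SHA256)
```

pip can perform the same check at install time through its hash-checking mode (`--require-hashes` with pinned `--hash=sha256:...` entries in a requirements file).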

File details

Details for the file paperloom-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: paperloom-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 67.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.7 (macOS)

File hashes

Hashes for paperloom-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 231b8eea2e86b7e54d4554cd3ab1a2eb083c418871a0a6ae7e08562c544986ef
MD5 3fdfcc44e26c7b44412754c1bbf8b95f
BLAKE2b-256 1c56dfad84a9f0727bf9e4f9ef97e6bc8697fcecf63b2a59cafc3a3041764faf
