
docling-glm-ocr

A docling OCR plugin that delegates text recognition to a remote GLM-OCR model served by vLLM.


GitHub  |  PyPI



Overview

docling-glm-ocr is a docling plugin that replaces the built-in OCR stage with a call to a remote GLM-OCR model hosted on a vLLM server.

Each page crop is sent to the vLLM OpenAI-compatible chat completion endpoint as a base64-encoded image. The model returns Markdown-formatted text which docling merges back into the document structure.

The plugin registers itself under the "glm-ocr-remote" OCR engine key so it can be selected per-request through docling or docling-serve without changing application code.

Requirements

  • Python 3.13+
  • A running vLLM server hosting zai-org/GLM-OCR (or any compatible model)

Installation

# with uv (recommended)
uv add docling-glm-ocr

# with pip
pip install docling-glm-ocr

Usage

Python SDK

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

from docling_glm_ocr import GlmOcrRemoteOptions

pipeline_options = PdfPipelineOptions(
    allow_external_plugins=True,
    ocr_options=GlmOcrRemoteOptions(
        api_url="http://localhost:8001/v1/chat/completions",
        model_name="zai-org/GLM-OCR",
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("document.pdf")
print(result.document.export_to_markdown())

docling-serve

Select the engine per-request via the standard API:

curl -X POST http://localhost:5001/v1/convert/source \
  -H 'Content-Type: application/json' \
  -d '{
    "options": {
      "ocr_engine": "glm-ocr-remote"
    },
    "sources": [{"kind": "http", "url": "https://arxiv.org/pdf/2501.17887"}]
  }'

The server must have DOCLING_SERVE_ALLOW_EXTERNAL_PLUGINS=true set so the plugin is loaded automatically.
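Putting the pieces together, a local docling-serve setup with the plugin enabled might look like the following sketch (the `docling-serve run` entry point is the upstream CLI command and may differ between versions of docling-serve):

```shell
# Install the server and the plugin into the same environment.
pip install docling-serve docling-glm-ocr

# Allow external OCR plugins and point the plugin at the vLLM endpoint.
export DOCLING_SERVE_ALLOW_EXTERNAL_PLUGINS=true
export GLMOCR_REMOTE_OCR_API_URL=http://localhost:8001/v1/chat/completions

docling-serve run
```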

Configuration

All options can be set via environment variables (useful for Docker / Compose deployments) or programmatically via GlmOcrRemoteOptions. Explicit constructor arguments always take precedence over environment variables.
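The precedence rule reduces to a simple fallback chain. The helper below is an illustrative sketch of that behavior, not the plugin's actual resolution code:

```python
import os


def resolve_option(explicit, env_var, default):
    """Resolve a setting: explicit argument > environment variable > default."""
    if explicit is not None:
        return explicit
    return os.environ.get(env_var, default)


# The environment variable is used only when no explicit value is given.
os.environ["GLMOCR_REMOTE_OCR_MODEL_NAME"] = "my-org/my-model"
print(resolve_option(None, "GLMOCR_REMOTE_OCR_MODEL_NAME", "zai-org/GLM-OCR"))
# prints my-org/my-model

# An explicit constructor argument always wins over the environment.
print(resolve_option("zai-org/GLM-OCR", "GLMOCR_REMOTE_OCR_MODEL_NAME", "x"))
# prints zai-org/GLM-OCR
```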

Environment variables

| Variable | Description | Default |
| --- | --- | --- |
| GLMOCR_REMOTE_OCR_API_URL | vLLM chat completion URL | http://localhost:8001/v1/chat/completions |
| GLMOCR_REMOTE_OCR_MODEL_NAME | Model name sent to vLLM | zai-org/GLM-OCR |
| GLMOCR_REMOTE_OCR_PROMPT | Text prompt sent with each image crop | see below |
| GLMOCR_REMOTE_OCR_TIMEOUT | HTTP timeout per crop (seconds) | 120 |
| GLMOCR_REMOTE_OCR_MAX_TOKENS | Max tokens per completion | 16384 |
| GLMOCR_REMOTE_OCR_SCALE | Image crop rendering scale | 3.0 |
| GLMOCR_REMOTE_OCR_MAX_IMAGE_PIXELS | Pixel budget per crop | 4500000 |
| GLMOCR_REMOTE_OCR_MAX_CONCURRENT_REQUESTS | Max concurrent API requests | 10 |
| GLMOCR_REMOTE_OCR_MAX_RETRIES | Max retry attempts for HTTP errors | 3 |
| GLMOCR_REMOTE_OCR_RETRY_BACKOFF_FACTOR | Exponential backoff factor for retries | 2.0 |
| GLMOCR_REMOTE_OCR_LANG | Comma-separated language hint(s) | en |
| GLMOCR_REMOTE_OCR_API_KEY | Bearer token for Authorization header | unset (no header sent) |

GlmOcrRemoteOptions

All options can also be set programmatically via GlmOcrRemoteOptions:

| Option | Type | Description | Default |
| --- | --- | --- | --- |
| api_url | str | OpenAI-compatible chat completion URL | GLMOCR_REMOTE_OCR_API_URL env or http://localhost:8001/v1/chat/completions |
| model_name | str | Model name sent to vLLM | GLMOCR_REMOTE_OCR_MODEL_NAME env or zai-org/GLM-OCR |
| prompt | str | Text prompt for each image crop | GLMOCR_REMOTE_OCR_PROMPT env or default prompt |
| timeout | float | HTTP timeout per crop (seconds) | GLMOCR_REMOTE_OCR_TIMEOUT env or 120 |
| max_tokens | int | Max tokens per completion | GLMOCR_REMOTE_OCR_MAX_TOKENS env or 16384 |
| scale | float | Image crop rendering scale | GLMOCR_REMOTE_OCR_SCALE env or 3.0 |
| max_image_pixels | int | Pixel budget per crop | GLMOCR_REMOTE_OCR_MAX_IMAGE_PIXELS env or 4500000 |
| max_concurrent_requests | int | Max concurrent API requests | GLMOCR_REMOTE_OCR_MAX_CONCURRENT_REQUESTS env or 10 |
| max_retries | int | Max retry attempts for HTTP errors | GLMOCR_REMOTE_OCR_MAX_RETRIES env or 3 |
| retry_backoff_factor | float | Exponential backoff factor for retries | GLMOCR_REMOTE_OCR_RETRY_BACKOFF_FACTOR env or 2.0 |
| lang | list[str] | Language hint (passed to docling) | GLMOCR_REMOTE_OCR_LANG env (comma-separated) or ["en"] |
| api_key | str \| None | Bearer token sent in Authorization header | GLMOCR_REMOTE_OCR_API_KEY env or None (no header) |

Default prompt:

Recognize the text in the image and output in Markdown format.
Preserve the original layout (headings/paragraphs/tables/formulas).
Do not fabricate content that does not exist in the image.
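The retry and concurrency options above interact in the usual way: each crop's request is retried with exponentially growing delays, and a semaphore caps how many requests are in flight at once. The sketch below illustrates that pattern under the assumption that the delay is backoff_factor ** attempt seconds; the plugin's internal scheduling may differ:

```python
import asyncio


async def call_with_retry(send, max_retries=3, backoff_factor=2.0):
    """Retry an async call on failure with exponential backoff.

    Assumed delay schedule: backoff_factor ** attempt seconds
    (1s, 2s, 4s with the defaults); the plugin's actual schedule may differ.
    """
    for attempt in range(max_retries + 1):
        try:
            return await send()
        except Exception:
            if attempt == max_retries:
                raise
            await asyncio.sleep(backoff_factor ** attempt)


async def ocr_all(crops, worker, max_concurrent_requests=10):
    """Process crops concurrently, but never more than the configured limit."""
    sem = asyncio.Semaphore(max_concurrent_requests)

    async def bounded(crop):
        async with sem:
            return await call_with_retry(lambda: worker(crop))

    return await asyncio.gather(*(bounded(c) for c in crops))
```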

Architecture

flowchart LR
    subgraph docling
        Pipeline --> GlmOcrRemoteModel
    end

    subgraph vLLM
        GLMOCR["zai-org/GLM-OCR"]
    end

    GlmOcrRemoteModel -- "POST /v1/chat/completions\n(base64 image)" --> GLMOCR
    GLMOCR -- "Markdown text" --> GlmOcrRemoteModel

For each page the model:

  1. Collects OCR regions from the docling layout analysis
  2. Renders each region using the page backend (scale configurable, default 3×)
  3. Encodes the crop as a base64 PNG data URI
  4. POSTs concurrent chat completion requests to the vLLM endpoint (with retry logic)
  5. Returns the recognized text as TextCell objects for docling to merge
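Steps 3 and 4 amount to building a standard OpenAI-style chat body carrying the crop as a data URI. The sketch below shows the assumed request shape; the field layout follows the OpenAI chat completions format, and the ordering of the content parts is illustrative:

```python
import base64


def build_payload(png_bytes: bytes, prompt: str, model_name: str, max_tokens: int) -> dict:
    """Build an OpenAI-compatible chat completion body with one image crop."""
    data_uri = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model_name,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_uri}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }
```

The resulting dict is what gets POSTed as JSON to the vLLM chat completion endpoint.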

Starting a GLM-OCR vLLM server

docker run -d \
  --rm --name ocr-glm \
  --gpus device=1 \
  --ipc=host \
  -p 8001:8000 \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e "HF_TOKEN=${HF_TOKEN:-}" \
  -e "LD_LIBRARY_PATH=/lib/x86_64-linux-gnu" \
  vllm/vllm-openai:v0.16.0-cu130 \
  zai-org/GLM-OCR \
  --port 8000 \
  --trust-remote-code \
  --max-num-batched-tokens 8192

The plugin will connect to http://localhost:8001/v1/chat/completions by default.

Required: --max-num-batched-tokens 8192

Without this flag, vLLM will reject any high-resolution image with HTTP 400.

In vLLM 0.16.0+ (v1 engine), the encoder cache size is derived from max_num_batched_tokens (default 2048 when chunked prefill is enabled):

encoder_cache_size = max(max_num_batched_tokens, model_max_tokens_per_image)
                   = max(2048, 4800)  ←  4800 is GLM-OCR's model floor
                   = 4800 tokens      ←  too small for real documents

The Glm46VImageProcessor encodes images at approximately 784 pixels per token (patch_size=14 × merge_size=2, squared). A typical A4 page rendered at scale 3× (1785 × 2526 px) produces 5760 tokens; a phone-photo crop at scale 3× can reach 6120 tokens — both exceed the default 4800-token cache and are rejected.
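The 784-pixels-per-token figure corresponds to a grid of 28x28-pixel patches (patch_size 14 x merge_size 2). A back-of-the-envelope estimate that reproduces the numbers above; this is a rough model using simple rounding, not the processor's exact resizing logic:

```python
def estimate_image_tokens(width_px: int, height_px: int, patch_edge: int = 28) -> int:
    """Rough visual-token estimate: one token per 28x28 patch (784 px/token).

    Rounds each dimension to the patch grid; the real Glm46VImageProcessor
    resizing rules may differ at the margins.
    """
    return round(width_px / patch_edge) * round(height_px / patch_edge)


# A4 page rendered at scale 3x, as in the example above.
print(estimate_image_tokens(1785, 2526))  # prints 5760 -- over the 4800 default
```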

Setting --max-num-batched-tokens 8192 raises the encoder cache to max(8192, 4800) = 8192 tokens, which covers all real-world inputs with comfortable headroom.

Note: --limit-mm-per-prompt does not control the encoder cache size in vLLM 0.16.0. That flag only limits the count of images per request.

Development

Setup

git clone https://github.com/DCC-BS/docling-glm-ocr.git
cd docling-glm-ocr
make install

Available commands

make install     Install dependencies and pre-commit hooks
make check       Run all quality checks (ruff lint, format, ty type check)
make test        Run tests with coverage report
make build       Build distribution packages
make publish     Publish to PyPI

Running tests

make test

Tests are in tests/ and use pytest. Coverage reports are generated at coverage.xml and printed to the terminal.

End-to-end tests

The e2e tests hit a real vLLM server and are skipped by default. To run them, set the server URL and use the e2e marker:

GLMOCR_REMOTE_OCR_API_URL=http://localhost:8001/v1/chat/completions pytest -m e2e

Code quality

This project uses:

  • ruff – linting and formatting
  • ty – type checking
  • pre-commit – pre-commit hooks

Run all checks:

make check

Releasing

Releases are published to PyPI automatically. Update the version in pyproject.toml, then trigger the Publish workflow from GitHub Actions:

GitHub → Actions → Publish to PyPI → Run workflow

The workflow tags the commit, builds the package, and publishes to PyPI via trusted publishing.

License

MIT © Data Competence Center Basel-Stadt
