Skip to main content

GLM OCR - Optical Character Recognition powered by GLM

Project description

GLM-OCR

中文阅读

👋 Join our WeChat and Discord community
📖 Check out the GLM-OCR technical report
📍 Use GLM-OCR's API

Model Introduction

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.

Key Features

  • State-of-the-Art Performance: Achieves a score of 94.62 on OmniDocBench V1.5, ranking #1 overall, and delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.

  • Optimized for Real-World Scenarios: Designed and optimized for practical business use cases, maintaining robust performance on complex tables, code-heavy documents, seals, and other challenging real-world layouts.

  • Efficient Inference: With only 0.9B parameters, GLM-OCR supports deployment via vLLM, SGLang, and Ollama, significantly reducing inference latency and compute cost, making it ideal for high-concurrency services and edge deployments.

  • Easy to Use: Fully open-sourced and equipped with a comprehensive SDK and inference toolchain, offering simple installation, one-line invocation, and smooth integration into existing production pipelines.

News & Updates

  • [2026.3.12] GLM-OCR SDK now supports agent-friendly Skill mode — just pip install glmocr + set API key, ready to use via CLI or Python with no GPU or YAML config needed. See: GLM-OCR Skill
  • [2026.3.12] GLM-OCR Technical Report is now available. See: GLM-OCR Technical Report
  • [2026.2.12] Fine-tuning tutorial based on LLaMA-Factory is now available. See: GLM-OCR Fine-tuning Guide

Download Model

Model Download Links Precision
GLM-OCR 🤗 Hugging Face
🤖 ModelScope
BF16

GLM-OCR SDK

We provide an SDK for using GLM-OCR more efficiently and conveniently.

Install SDK

Choose the lightest installation that matches your scenario:

# Cloud / MaaS + local images / PDFs (fastest install)
pip install glmocr

# Self-hosted pipeline (layout detection)
pip install "glmocr[selfhosted]"

# Flask service support
pip install "glmocr[server]"

Install from source for development:

# Install from source
git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .

Model Deployment

Two ways to use GLM-OCR:

Option 1: Zhipu MaaS API (Recommended for Quick Start)

Use the hosted cloud API – no GPU needed. The cloud service runs the complete GLM-OCR pipeline internally, so the SDK simply forwards your request and returns the result.

  1. Get an API key from https://open.bigmodel.cn
  2. Configure config.yaml:
pipeline:
  maas:
    enabled: true # Enable MaaS mode
    api_key: your-api-key # Required

That's it! When maas.enabled=true, the SDK acts as a thin wrapper that:

  • Forwards your documents to the Zhipu cloud API
  • Returns the results directly (Markdown + JSON layout details)
  • No local processing, no GPU required

Input note (MaaS): the upstream API accepts file as a URL or a data:<mime>;base64,... data URI. If you have raw base64 without the data: prefix, wrap it as a data URI (recommended). The SDK will auto-wrap local file paths / bytes / raw base64 into a data URI when calling MaaS.

API documentation: https://docs.bigmodel.cn/cn/guide/models/vlm/glm-ocr

Option 2: Self-host with vLLM / SGLang

Deploy the GLM-OCR model locally for full control. The SDK provides the complete pipeline: layout detection, parallel region OCR, and result formatting.

Install the self-hosted extra first:

pip install "glmocr[selfhosted]"
Using vLLM

Install vLLM:

docker pull vllm/vllm-openai:v0.19.0-ubuntu2404

Or using with pip:

pip install -U "vllm>=0.17.0"

Launch the service:

pip install "transformers>=5.3.0"

vllm serve zai-org/GLM-OCR  --port 8080 --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' --served-model-name glm-ocr

Note Add --max-model-len and --gpu-memory-utilization according to Your own machine to handle large image/pdf

Using SGLang

Install SGLang:

docker pull lmsysorg/sglang:v0.5.10

Or using with pip:

pip install "sglang>=0.5.9"

Launch the service:

pip install "transformers>=5.3.0"

sglang serve --model zai-org/GLM-OCR --port 8080 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --served-model-name glm-ocr

Note Add --context-len and --mem-fraction-static according to Your own machine to handle large image/pdf

Option 3: Ollama/MLX

For specialized deployment scenarios, see the detailed guides:

Update Configuration

After launching the service, configure config.yaml:

pipeline:
  maas:
    enabled: false # Disable MaaS mode (default)
  ocr_api:
    api_host: localhost # or your vLLM/SGLang server address
    api_port: 8080

SDK Usage Guide

CLI

# Parse a single image
glmocr parse examples/source/code.png

# Parse a directory
glmocr parse examples/source/

# Set output directory
glmocr parse examples/source/code.png --output ./results/

# Use a custom config
glmocr parse examples/source/code.png --config my_config.yaml

# Enable debug logging with profiling
glmocr parse examples/source/code.png --log-level DEBUG

# Run layout detection on CPU (keep GPU free for OCR model)
glmocr parse examples/source/code.png --layout-device cpu

# Run layout detection on a specific GPU
glmocr parse examples/source/code.png --layout-device cuda:1

# Override any config value via --set (dotted path, repeatable)
glmocr parse examples/source/code.png --set pipeline.ocr_api.api_port 8080
glmocr parse examples/source/ --set pipeline.layout.use_polygon true --set logging.level DEBUG

Python API

from glmocr import GlmOcr, parse

# Simple function
result = parse("image.png")
result = parse(["img1.png", "img2.jpg"])
result = parse("https://example.com/image.png")
result.save(output_dir="./results")

# Note: a list is treated as pages of a single document.

# Class-based API
with GlmOcr() as parser:
    result = parser.parse("image.png")
    print(result.json_result)
    result.save()

# Place layout model on CPU (useful when GPU is reserved for OCR)
with GlmOcr(layout_device="cpu") as parser:
    result = parser.parse("image.png")

# Place layout model on a specific GPU
with GlmOcr(layout_device="cuda:1") as parser:
    result = parser.parse("image.png")

Flask Service

Install the optional server dependency first:

pip install "glmocr[server]"
# Start service
python -m glmocr.server

# With debug logging
python -m glmocr.server --log-level DEBUG

# Call API
curl -X POST http://localhost:5002/glmocr/parse \
  -H "Content-Type: application/json" \
  -d '{"images": ["./example/source/code.png"]}'

Semantics:

  • images can be a string or a list.
  • A list is treated as pages of a single document.
  • For multiple independent documents, call the endpoint multiple times (one document per request).

Modular Architecture

GLM-OCR uses composable modules for easy customization:

Component Description
PageLoader Preprocessing and image encoding
OCRClient Calls the GLM-OCR model service
PPDocLayoutDetector PP-DocLayout layout detection
ResultFormatter Post-processing, outputs JSON/Markdown

You can extend the behavior by creating custom pipelines:

from glmocr.dataloader import PageLoader
from glmocr.ocr_client import OCRClient
from glmocr.postprocess import ResultFormatter


class MyPipeline:
  def __init__(self, config):
    self.page_loader = PageLoader(config)
    self.ocr_client = OCRClient(config)
    self.formatter = ResultFormatter(config)

  def process(self, request_data):
    # Implement your own processing logic
    pass

Star History

Star History Chart

Acknowledgement

This project is inspired by the excellent work of the following projects and communities:

License

The Code of this repo is under Apache License 2.0.

The GLM-OCR model is released under the MIT License.

The complete OCR pipeline integrates PP-DocLayoutV3 for document layout analysis, which is licensed under the Apache License 2.0. Users should comply with both licenses when using this project.

Citation

If you find GLM-OCR useful in your research, please cite our technical report:

@misc{duan2026glmocrtechnicalreport,
      title={GLM-OCR Technical Report},
      author={Shuaiqi Duan and Yadong Xue and Weihan Wang and Zhe Su and Huan Liu and Sheng Yang and Guobing Gan and Guo Wang and Zihan Wang and Shengdong Yan and Dexin Jin and Yuxuan Zhang and Guohong Wen and Yanfeng Wang and Yutao Zhang and Xiaohan Zhang and Wenyi Hong and Yukuo Cen and Da Yin and Bin Chen and Wenmeng Yu and Xiaotao Gu and Jie Tang},
      year={2026},
      eprint={2603.10910},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.10910},
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

glmocr-0.1.5.tar.gz (103.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

glmocr-0.1.5-py3-none-any.whl (115.2 kB view details)

Uploaded Python 3

File details

Details for the file glmocr-0.1.5.tar.gz.

File metadata

  • Download URL: glmocr-0.1.5.tar.gz
  • Upload date:
  • Size: 103.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for glmocr-0.1.5.tar.gz
Algorithm Hash digest
SHA256 22839b6d7a5dc51c331d1f8f7cfa89b8cda4889cfbb0e4cb64139fa4fb2264da
MD5 87db15e528b5c93ac418d309f6dae21f
BLAKE2b-256 b50a1c53e8d52ddc52074307db886c481116aab4b8effeb75955d1fc08b62cdb

See more details on using hashes here.

File details

Details for the file glmocr-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: glmocr-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 115.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for glmocr-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 9649aec2e70eeb5a8ebed9979092c58ec5c4e075a085fb82c8fa3b04e4e40977
MD5 487213cc7b86107f0de91fde8db7e227
BLAKE2b-256 7004baff846a7f0c766b8b77f77c797c01ace154f7d8f9884a6a92711341c554

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page