MLX-based PP-OCRv6 inference on Apple Silicon

mlx4ocr

Apple Silicon OCR powered by MLX, PP-OCRv6, and an optional VLM backend for GLM-OCR and PaddleOCR-VL.

mlx4ocr reimplements PP-OCRv6 detection and recognition for local macOS inference. It downloads official Hugging Face safetensors weights on demand and runs the OCR pipeline without a PaddlePaddle runtime. For generated-text OCR, it can also run GLM-OCR and PaddleOCR-VL through the optional mlx-vlm extra.

Note: This project is pre-alpha. APIs and output details may change while the MLX port is being completed and validated.

Features

  • PP-OCRv6 text detection and recognition on Apple Silicon with MLX.
  • Official tiny, small, and medium Hugging Face model variants.
  • Image, PDF, and non-recursive directory inputs from the CLI.
  • Plain text, Markdown, and PaddleOCR-style JSON output.
  • Optional GLM-OCR and PaddleOCR-VL generated-text backends through mlx-vlm.
  • Optional saved output layout compatible with document OCR workflows.
  • Optional MCP server and installable agent skill for compatible coding agents.

Requirements

  • macOS on Apple Silicon.
  • Python 3.12 or newer.
  • uv for local development and CLI execution.
  • Internet access on first run to download model weights from Hugging Face.
  • For optional VLM OCR, enough disk space for the selected VLM checkpoint. The main model.safetensors file of the GLM-OCR preset (mlx-community/GLM-OCR-bf16) is about 2.2 GB; the PaddleOCR-VL Hugging Face cache was about 1.8 GB in this project's smoke test.

Installation

Install the CLI from PyPI with uv tool:

uv tool install mlx4ocr

Or install directly from GitHub:

uv tool install git+https://github.com/shuuul/mlx4ocr.git

Or run the CLI without installing it permanently:

uvx --from mlx4ocr mlx4ocr --help

For development from a checkout:

git clone https://github.com/shuuul/mlx4ocr.git
cd mlx4ocr
uv sync --group dev

Optional VLM OCR support through mlx-vlm is available as an extra. Use uv sync from a development checkout, or uv tool install for the standalone CLI:

uv sync --extra vlm
uv tool install 'mlx4ocr[vlm]'

Quick start

Run OCR on an image and print Markdown to stdout:

mlx4ocr --path input.png --format markdown

From a development checkout, you can run the bundled examples with uv run:

uv run mlx4ocr --path examples/images/img_10.jpg --format markdown

Use uvx when running directly from GitHub without installation:

uvx --from git+https://github.com/shuuul/mlx4ocr.git \
  mlx4ocr --path input.png --format markdown

Python API:

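import cv2

from mlx4ocr import PP_OCRv6

# cv2.imread returns None instead of raising when the file cannot be read.
image = cv2.imread("examples/images/img_10.jpg")
if image is None:
    raise FileNotFoundError("could not read examples/images/img_10.jpg")

# Build the pipeline from the medium variant; weights are downloaded on demand.
ocr = PP_OCRv6.from_hub("medium")

try:
    result = ocr.predict(image)
    # Full recognized text, one line per block.
    print(result.result.text)
    # Per-block text, geometry, and scores.
    for block in result.result.blocks:
        print(block.text, block.box, block.detection_score, block.recognition_score)
    print(result.timing)
finally:
    # Close the pipeline even if prediction fails.
    ocr.close()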

The Python OCRResult is block-based. result.result.text contains recognized lines joined with newlines, and result.result.blocks contains OCRTextBlock items with optional geometry and detection/recognition scores.
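Because scores are optional on each block, code that filters by confidence has to decide what to do with unscored blocks. The following is a minimal sketch of one way to do that. The Block dataclass is a hypothetical stand-in that only mirrors the attribute names printed in the example above; it is not the library's OCRTextBlock, and the 0.8 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Block:
    """Hypothetical stand-in mirroring the OCRTextBlock attributes used above."""

    text: str
    box: Optional[Sequence] = None
    detection_score: Optional[float] = None
    recognition_score: Optional[float] = None


def confident_text(blocks, min_score=0.8):
    """Join the text of blocks whose recognition score is at least min_score.

    Blocks with no recognition score (None) are kept, because the score
    is optional and its absence does not imply low confidence.
    """
    kept = [
        b.text
        for b in blocks
        if b.recognition_score is None or b.recognition_score >= min_score
    ]
    return "\n".join(kept)


blocks = [
    Block("Invoice 2024", recognition_score=0.98),
    Block("l1l|1", recognition_score=0.41),  # likely a misread, filtered out
    Block("Total: 42.00"),  # no score available, kept
]
print(confident_text(blocks))
```

With a real result, the same function could be applied to result.result.blocks, assuming the attribute names match those printed above.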

Optional VLM OCR API, using GLM-OCR or PaddleOCR-VL through mlx-vlm:

from mlx4ocr import VLMOCR

# GLM-OCR is the default VLM preset.
ocr = VLMOCR.from_hub()

try:
    result = ocr.predict_path("examples/images/img_10.jpg")
    print(result.text)
finally:
    ocr.close()

Use PaddleOCR-VL by selecting the preset engine:

from mlx4ocr import VLMOCR

ocr = VLMOCR.from_hub(engine="paddleocr-vl", task="chart")

try:
    result = ocr.predict_path("chart.png")
    print(result.text)
finally:
    ocr.close()

This API requires installing the vlm extra. It currently returns the generated text as one OCRTextBlock without geometry, detection scores, or recognition scores.

CLI usage

The CLI accepts image files, PDF files, or a non-recursive directory of supported inputs:

mlx4ocr --path examples/images --format json --output ocr-output

Supported output formats:

  • txt — recognized text only.
  • markdown — recognized text as Markdown, preserving PDF page headings.
  • json — PaddleOCR/PaddleX-compatible PP-OCRv6 res fields with PDF page_index metadata. This format requires OCR blocks with geometry and detection/recognition scores.

PP-OCRv6 is the default CLI engine. Optional VLM CLI inference is available through mlx-vlm after installing the vlm extra. Supported VLM presets are GLM-OCR and PaddleOCR-VL:

uv sync --extra vlm
mlx4ocr --path input.png --engine glm-ocr --format markdown
mlx4ocr --path input.pdf --engine glm-ocr --vlm-task table --max-tokens 1024
mlx4ocr --path chart.png --engine paddleocr-vl --vlm-task chart --format markdown

Use --vlm-model to select a different compatible Hugging Face model. GLM-OCR tasks are text, formula, table, and schema. PaddleOCR-VL tasks are text, formula, table, chart, and schema. The schema task requires a custom --prompt for both VLM engines. JSON output for VLM engines uses a generalized result object with generated text and blocks, not the PaddleX res geometry fields.
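The task rules above can be captured in a small lookup table, which may be useful for validating arguments before launching a slow VLM run. This is a sketch only: check_vlm_request is a hypothetical helper, not part of mlx4ocr, and the CLI may apply its own validation differently.

```python
# Tasks per VLM engine, as listed in the paragraph above.
VLM_TASKS = {
    "glm-ocr": {"text", "formula", "table", "schema"},
    "paddleocr-vl": {"text", "formula", "table", "chart", "schema"},
}


def check_vlm_request(engine, task, prompt=None):
    """Raise ValueError if the engine/task/prompt combination is not allowed."""
    if engine not in VLM_TASKS:
        raise ValueError(f"unknown VLM engine: {engine!r}")
    if task not in VLM_TASKS[engine]:
        allowed = ", ".join(sorted(VLM_TASKS[engine]))
        raise ValueError(f"{engine} does not support task {task!r}; use one of: {allowed}")
    # The schema task needs a custom prompt for both engines.
    if task == "schema" and not prompt:
        raise ValueError("the schema task requires a custom prompt")


check_vlm_request("paddleocr-vl", "chart")          # accepted
check_vlm_request("glm-ocr", "schema", prompt="Extract the invoice fields.")  # accepted
```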

Engine comparison and VLM resource notes

mlx4ocr has three local MLX OCR engine presets:

  • ppocrv6 — default detector/recognizer pipeline. It returns text blocks with geometry and detection/recognition scores.
  • glm-ocr — optional VLM generated-text pipeline. It can produce more natural page text, but it does not return text boxes or confidence scores.
  • paddleocr-vl — optional PaddleOCR-VL generated-text pipeline through mlx-vlm, including chart recognition prompts. It also does not return text boxes or confidence scores.

VLM engines are much heavier than the default PP-OCRv6 path. On first use, mlx-vlm downloads the selected model. The GLM-OCR preset downloads mlx-community/GLM-OCR-bf16; the main safetensors file is about 2.2 GB. PaddleOCR-VL uses PaddlePaddle/PaddleOCR-VL; the local Hugging Face cache was about 1.8 GB in the smoke test below. If a download stalls through Hugging Face/Xet, retry with:

HF_HUB_DISABLE_XET=1 uv run --extra vlm mlx4ocr \
  --path examples/ppocrv6.pdf --engine glm-ocr --format markdown --start 0 --end 0

Smoke-test timings on the development machine for examples/ppocrv6.pdf, with models already cached:

| Engine | Input | Options | Wall time | Max RSS |
| --- | --- | --- | --- | --- |
| ppocrv6 | page 1 | --start 0 --end 0 | ~8.6 s | ~1.17 GB |
| glm-ocr | page 1 | --start 0 --end 0 --max-tokens 128 | ~11 s | ~2.5 GB |
| paddleocr-vl | page 1 | --start 0 --end 0 --max-tokens 128 | ~13.4 s | ~2.49 GB |
| ppocrv6 | full 10-page PDF | default (tokens N/A) | ~1 min 21 s | ~1.30 GB |
| glm-ocr | full 10-page PDF | --max-tokens 256 | ~3 min 37 s | ~3.1 GB |
| paddleocr-vl | full 10-page PDF | --max-tokens 256 | ~5 min 21 s | ~2.48 GB |

These numbers are indicative rather than a guarantee. Actual time and memory depend on the Mac, MLX version, image/PDF resolution, prompt, and --max-tokens. Increase --max-tokens for long VLM pages to reduce truncation; expect processing time to increase with the generated output length.
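For rough planning, the full-PDF rows can be converted into seconds per page and a slowdown factor relative to the default engine. The sketch below only does arithmetic on the approximate smoke-test figures reported above; it assumes time scales evenly across the 10 pages, which is not guaranteed.

```python
# Approximate wall times for the full 10-page PDF, from the smoke test above.
FULL_PDF_SECONDS = {
    "ppocrv6": 1 * 60 + 21,       # ~1 min 21 s
    "glm-ocr": 3 * 60 + 37,       # ~3 min 37 s
    "paddleocr-vl": 5 * 60 + 21,  # ~5 min 21 s
}
PAGES = 10


def per_page_seconds(engine):
    """Average seconds per page, assuming an even split across pages."""
    return FULL_PDF_SECONDS[engine] / PAGES


def slowdown_vs_default(engine):
    """Wall-time ratio relative to the default ppocrv6 engine."""
    return FULL_PDF_SECONDS[engine] / FULL_PDF_SECONDS["ppocrv6"]


for engine in FULL_PDF_SECONDS:
    print(
        f"{engine}: {per_page_seconds(engine):.1f} s/page, "
        f"{slowdown_vs_default(engine):.1f}x the default engine"
    )
```

On these figures the VLM engines come out at roughly 2.7x (glm-ocr) and 4.0x (paddleocr-vl) the default wall time for the full document.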

PDF page ranges use 0-based inclusive page indexes:

mlx4ocr --path docs/report.pdf --format markdown --start 0 --end 2
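Since page numbers in a PDF viewer usually start at 1, an off-by-one error is easy to make here. A small sketch of the conversion follows; page_flags and page_count are hypothetical helpers for illustration, not part of mlx4ocr.

```python
def page_flags(first_page, last_page):
    """Convert 1-based viewer page numbers to 0-based inclusive --start/--end flags."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return ["--start", str(first_page - 1), "--end", str(last_page - 1)]


def page_count(start, end):
    """Number of pages covered by a 0-based inclusive range."""
    return end - start + 1


# Pages 1 to 3 in a viewer correspond to --start 0 --end 2, as in the command above.
print(page_flags(1, 3), page_count(0, 2))
```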

When --output is omitted, results are printed to stdout. When --output is provided, files are written with this layout:

  • <output>/<stem>/ocr/<stem>.txt — plain text output.
  • <output>/<stem>/ocr/<stem>.md — Markdown output.
  • <output>/<stem>/ocr/<stem>.json — JSON output.
  • <output>/<stem>/ocr/<stem>_origin.pdf — original PDF copy for PDF inputs.
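Scripts that post-process the saved files can derive the expected paths from this layout. The sketch below assumes that <stem> is the input file name without its extension, which the layout suggests but this README does not state explicitly; expected_outputs is a hypothetical helper.

```python
from pathlib import Path

# File suffix written for each --format value, per the layout above.
SUFFIXES = {"txt": ".txt", "markdown": ".md", "json": ".json"}


def expected_outputs(output_dir, input_path, fmt):
    """Return the paths expected under <output>/<stem>/ocr/ for one input file."""
    src = Path(input_path)
    base = Path(output_dir) / src.stem / "ocr"
    paths = [base / f"{src.stem}{SUFFIXES[fmt]}"]
    # PDF inputs additionally get a copy of the original document.
    if src.suffix.lower() == ".pdf":
        paths.append(base / f"{src.stem}_origin.pdf")
    return paths


for path in expected_outputs("ocr-output", "docs/report.pdf", "markdown"):
    print(path)
```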

Useful options:

mlx4ocr --help
mlx4ocr --path input.png --variant tiny --format txt
mlx4ocr --path input.pdf --rec-weight-source auto --no-compile
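When driving the CLI from a script, building the argument list in Python avoids shell-quoting problems with paths that contain spaces. The sketch below uses only flags that appear in this README; build_command is a hypothetical helper, and whether every combination of flags is accepted by the CLI is an assumption to check against mlx4ocr --help.

```python
import shlex


def build_command(path, fmt="markdown", output=None, variant=None,
                  engine=None, vlm_task=None, max_tokens=None):
    """Build an mlx4ocr argument list from the options documented above."""
    cmd = ["mlx4ocr", "--path", str(path), "--format", fmt]
    if output is not None:
        cmd += ["--output", str(output)]
    if variant is not None:
        cmd += ["--variant", variant]
    if engine is not None:
        cmd += ["--engine", engine]
    if vlm_task is not None:
        cmd += ["--vlm-task", vlm_task]
    if max_tokens is not None:
        cmd += ["--max-tokens", str(max_tokens)]
    return cmd


cmd = build_command("scans/page one.png", variant="tiny", fmt="txt")
# shlex.join shows the command as it would be typed in a shell.
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```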

MCP server

Optional MCP support is available as an extra:

uv sync --extra mcp
uv run mlx4ocr-mcp

The MCP server exposes an ocr_markdown tool that reads a local image path and returns Markdown OCR output.

Agent skill

This repository includes an agent skill for compatible coding agents. Install it with npx skills:

npx skills add shuuul/mlx4ocr

After installation, compatible agents can use the skill to run mlx4ocr directly from GitHub with uvx or uv tool on macOS.

Model variants

| Variant | Detection Hub repo | Recognition Hub repo |
| --- | --- | --- |
| tiny | PaddlePaddle/PP-OCRv6_tiny_det_safetensors | PaddlePaddle/PP-OCRv6_tiny_rec_safetensors |
| small | PaddlePaddle/PP-OCRv6_small_det_safetensors | PaddlePaddle/PP-OCRv6_small_rec_safetensors |
| medium | PaddlePaddle/PP-OCRv6_medium_det_safetensors | PaddlePaddle/PP-OCRv6_medium_rec_safetensors |
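The repository names in the table follow a regular pattern, so they can be derived rather than hard-coded, for example when pre-fetching weights for offline use. The sketch below simply reproduces that naming pattern; hub_repo is a hypothetical helper and not how mlx4ocr resolves repositories internally.

```python
VARIANTS = ("tiny", "small", "medium")
COMPONENTS = ("det", "rec")  # detection and recognition


def hub_repo(variant, component):
    """Return the Hugging Face repo ID for a PP-OCRv6 variant and component."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component!r}")
    return f"PaddlePaddle/PP-OCRv6_{variant}_{component}_safetensors"


# List every repo in the table above.
for variant in VARIANTS:
    print(variant, hub_repo(variant, "det"), hub_repo(variant, "rec"))
```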

Use det_variant or rec_variant when detection and recognition should use different tiers. For example, MinerU ch_server is closest to small detection with medium recognition:

from mlx4ocr import PP_OCRv6

ocr = PP_OCRv6.from_hub("medium", det_variant="small")

To download model artifacts without constructing the OCR pipeline:

from mlx4ocr import download_model

artifacts = download_model("medium", "det")
print(artifacts.config_data["model_type"])

Recognition weights for small and medium

The Hugging Face small and medium recognition safetensors currently ship swapped head.encoder.conv_block tensors. By default, PP_OCRv6.from_hub(..., rec_weight_source="auto") and load_recognition_model() patch the affected tensors from official Paddle pretrained checkpoints. The checkpoints are downloaded once to .cache/paddle_pretrained/; PaddlePaddle is not required.

Use rec_weight_source="hub" only when you explicitly want the raw Hugging Face safetensors.

Development

uv sync --group dev
uv run pytest
uv run ruff check .
uv run prek run --all-files

See AGENTS.md for architecture notes and coding conventions.

License

MIT License. See LICENSE.
