
Docling plugin for Surya OCR

Docling SuryaOCR Plugin

A Docling plugin that makes the Surya OCR engine available as an OCR backend in Docling.

License: GPL-3.0-only. The plugin must be loaded as an external plugin (allow_external_plugins=True).


Installation (uv)

uv pip install docling-surya

Note:

  • Supported only on Linux x86_64 (matches surya-ocr dependency).
  • Models (~1–2 GB) are downloaded automatically on first use.
  • Downloaded models are cached under ~/.cache/huggingface (or under HF_HOME if it is set).
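
The notes above can be checked before the first run. The following is a minimal sketch using only the Python standard library; the helper names is_supported_platform and hf_cache_root are hypothetical and are not part of docling-surya. It assumes the cache location is resolved exactly as the note describes (HF_HOME if set, otherwise ~/.cache/huggingface).

```python
import os
import platform
from pathlib import Path


def is_supported_platform() -> bool:
    """Return True on Linux x86_64, the only platform listed as supported."""
    return platform.system() == "Linux" and platform.machine() == "x86_64"


def hf_cache_root() -> Path:
    """Return HF_HOME if it is set and non-empty, else ~/.cache/huggingface."""
    override = os.environ.get("HF_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "huggingface"


if __name__ == "__main__":
    print("Supported platform:", is_supported_platform())
    print("Model cache root:", hf_cache_root())
```

Running this on the target machine shows where the model download will land, which helps when the home directory has limited space.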

Python Usage

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling_surya import SuryaOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_model="suryaocr",           # Plugin engine name
    allow_external_plugins=True,     # Required for third-party plugins
    ocr_options=SuryaOcrOptions(
        lang=["en"],                 # OCR language(s)
        use_gpu=True,                # Optional: force GPU
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
        InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options),
    }
)

result = converter.convert("path/to/document.pdf")
print(result.document.export_to_markdown())
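
Instead of printing, the Markdown can be written to disk. This is a small sketch, not part of the plugin: save_markdown is a hypothetical helper that depends only on the standard library, and the commented usage assumes the converter object defined above.

```python
from __future__ import annotations

from pathlib import Path


def save_markdown(markdown: str, source: str | Path, out_dir: str | Path = "output") -> Path:
    """Write markdown to <out_dir>/<source stem>.md and return the new path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (Path(source).stem + ".md")
    target.write_text(markdown, encoding="utf-8")
    return target


# Usage with the converter above (requires docling and docling-surya):
# result = converter.convert("path/to/document.pdf")
# save_markdown(result.document.export_to_markdown(), "path/to/document.pdf")
```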

CLI Usage

# List available external plugins (should show "surya-ocr")
docling --show-external-plugins

# Run conversion with Surya OCR
docling --allow-external-plugins --ocr-engine=suryaocr path/to/document.pdf
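
To run the same command over a folder of PDFs, the CLI can be driven from Python. This is a sketch under stated assumptions: build_command and convert_all are hypothetical helpers, the docling executable is assumed to be on PATH, and only the two flags shown above are used.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_command(pdf: str | Path) -> list[str]:
    """Build the docling CLI invocation shown above for a single file."""
    return ["docling", "--allow-external-plugins", "--ocr-engine=suryaocr", str(pdf)]


def convert_all(folder: str | Path) -> int:
    """Run the CLI on every *.pdf in folder; return how many files were processed."""
    count = 0
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if docling exits with a non-zero status
        subprocess.run(build_command(pdf), check=True)
        count += 1
    return count
```

Each file is converted in its own process, so a failure stops the batch at the offending file rather than silently skipping it.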

Example Script

See examples/docling_with_custom_models.py:

uv run examples/docling_with_custom_models.py

The script processes a sample EPA PDF and prints the Markdown output.


Development with uv

# Clone the repo
git clone https://github.com/harrykhh/docling_surya
cd docling_surya

# Create virtual environment + install deps
uv venv
uv sync --all-extras

# Run tests
uv run pytest

# Run linter
uv run ruff check .

# Build wheel
uv build

# Install locally
uv pip install dist/docling_surya-*.whl

# Publish to PyPI (requires token)
uv publish

Project Structure

docling_surya/
├── pyproject.toml
├── uv.lock
├── README.md
├── LICENSE (GPL-3.0)
├── docling_surya/
│   ├── __init__.py
│   └── plugin.py           # Full SuryaOcrModel + factory
├── examples/
│   └── docling_with_custom_models.py
└── tests/
    └── test_surya_ocr.py

Plugin Registration

The plugin registers via:

[project.entry-points."docling"]
surya-ocr = "docling_surya.plugin"

And exports the OCR engine via:

def ocr_engines():
    return {"ocr_engines": [SuryaOcrModel]}

License & Attribution


Enjoy high-accuracy OCR on complex PDFs with Docling + Surya!
