pygexml

A minimal Python wrapper around the PAGE-XML format for OCR output. It can also import ALTO-XML.

Installation

pip install pygexml

Requires Python 3.12+.

Usage

from pygexml import Page

page = Page.from_xml_file("docs/xml_file.xml")

for line in page.all_text():
    print(line)

All dataclasses are serializable with to_dict/from_dict and to_json/from_json via dataclasses-json.
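
The round-trip contract behind these four method names can be sketched with the standard library alone. DemoPoint below is a hypothetical stand-in, not pygexml's Point, and its hand-written methods only mimic the names that dataclasses-json provides; the real classes may differ in fields and in the options they accept.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DemoPoint:
    """Hypothetical stand-in for illustration; not part of pygexml."""

    x: int
    y: int

    def to_dict(self) -> dict:
        # Convert the dataclass into a plain dict of its fields.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "DemoPoint":
        # Rebuild an instance from a dict produced by to_dict().
        return cls(**data)

    def to_json(self) -> str:
        # Serialize via the dict form.
        return json.dumps(self.to_dict())

    @classmethod
    def from_json(cls, text: str) -> "DemoPoint":
        # Parse JSON text and delegate to from_dict().
        return cls.from_dict(json.loads(text))


point = DemoPoint(x=3, y=7)
restored = DemoPoint.from_json(point.to_json())
```

The property worth relying on is that from_json(to_json(obj)) compares equal to obj, which is also a natural candidate for a property-based test.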

Data model

Class                                 Import from
Page                                  pygexml
Page, TextRegion, TextLine, Coords    pygexml.page
Point, Box, Polygon                   pygexml.geometry

Page, TextRegion and TextLine each expose all_text() and all_words() iterators. Lookups by ID are available via lookup_region() and lookup_textline().
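
To show the hierarchy these classes mirror, here is a minimal sketch that walks a hand-written PAGE-XML snippet with the standard library's xml.etree.ElementTree. It does not use pygexml and is not how pygexml is implemented. The helper names iter_line_texts, find_by_id and parse_points are hypothetical, and splitting words on whitespace is an assumption about what a word iterator might yield.

```python
import xml.etree.ElementTree as ET

# PAGE-XML namespace for the 2019-07-15 schema; other schema versions
# use a different date suffix.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# A tiny hand-written PAGE-XML document, for illustration only.
SAMPLE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.png" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="l1">
        <Coords points="0,0 100,0 100,20 0,20"/>
        <TextEquiv><Unicode>Hello PAGE</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Coords points="0,25 100,25 100,45 0,45"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def iter_line_texts(root):
    """Yield the text of every TextLine in document order."""
    for line in root.iterfind(".//pc:TextLine", NS):
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        if unicode_el is not None and unicode_el.text:
            yield unicode_el.text


def find_by_id(root, tag, element_id):
    """Return the first element with the given tag and id, or None."""
    return root.find(f".//pc:{tag}[@id='{element_id}']", NS)


def parse_points(coords_el):
    """Turn a Coords 'points' attribute into a list of (x, y) tuples."""
    pairs = coords_el.get("points", "").split()
    return [tuple(int(v) for v in pair.split(",")) for pair in pairs]


root = ET.fromstring(SAMPLE)
texts = list(iter_line_texts(root))
words = [word for text in texts for word in text.split()]  # assumed word split
line = find_by_id(root, "TextLine", "l2")
polygon = parse_points(line.find("pc:Coords", NS))
```

Regions contain lines, lines carry their text in TextEquiv/Unicode, and every element has an id attribute and a Coords polygon, which is the structure the iterators and ID lookups operate on.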

Refer to the online API docs for details.

Hypothesis strategies

The pygexml.strategies module provides Hypothesis strategies for all pygexml types, ready to use in property-based tests, including those of downstream projects:

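from hypothesis import given
from pygexml import Page
from pygexml.strategies import st_pages

# process() is a placeholder for your own function under test
@given(st_pages())
def test_my_page_processing(page: Page) -> None:
    assert process(page) is not None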

Refer to the pygexml.strategies API docs for details.

Development

pip install ".[dev,test,docs]"

black pygexml test          # format
mypy pygexml test           # type check
pyright pygexml test        # type check
pytest -v                   # tests
pdoc -o .api_docs pygexml/* # API docs

CI runs on Python 3.12, 3.13 and 3.14. API documentation is published to GitHub Pages on every push to main.

Contributing

Bug reports, feature requests and pull requests are welcome. Feel free to open draft pull requests early to invite discussion and collaboration.

Please note that this project has a Code of Conduct.

Copyright and License

Copyright (c) 2026 Mirko Westermeier, Katharina Dietz (SCDH, University of Münster)

Released under the MIT License.
