pygexml
A minimal Python wrapper around the PAGE-XML format for OCR output. Can also import ALTO-XML.
Installation
pip install pygexml
Requires Python 3.12+.
Usage
from pygexml import Page

page = Page.from_xml_file("docs/xml_file.xml")
for line in page.all_text():
    print(line)
All dataclasses are serializable with to_dict/from_dict and to_json/from_json via dataclasses-json.
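The round trip this enables can be sketched as follows. This is a minimal illustration rather than pygexml code: `StandInPoint` and `json_roundtrip` are hypothetical names, and the stand-in implements `to_json`/`from_json` by hand with the standard library so the example runs without any dependencies. With pygexml installed, the same helper should work on any of its dataclasses, assuming they expose `to_json` and `from_json` as described above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StandInPoint:
    """Hypothetical stand-in for illustration; NOT pygexml's Point class."""

    x: int
    y: int

    def to_json(self) -> str:
        # Hand-written here; pygexml gets this method from dataclasses-json.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "StandInPoint":
        return cls(**json.loads(text))


def json_roundtrip(obj):
    """Serialize obj with to_json() and rebuild it via its class's from_json()."""
    return type(obj).from_json(obj.to_json())


point = StandInPoint(x=3, y=4)
restored = json_roundtrip(point)
print(restored == point)
```

Because dataclasses compare by field values, checking that the restored object equals the original is a quick sanity test for any serialization step in a pipeline.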
Data model
| Class | Import from |
|---|---|
| Page | pygexml |
| Page, TextRegion, TextLine, Coords | pygexml.page |
| Point, Box, Polygon | pygexml.geometry |
Page, TextRegion and TextLine each expose all_text() and all_words() iterators.
Lookups by ID are available via lookup_region() and lookup_textline().
Refer to the online API docs for details.
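Since Page, TextRegion and TextLine share the all_text() and all_words() iterators, a helper can be written once and applied at any level. The sketch below illustrates that idea under assumptions: `StandInLine`, `StandInRegion` and `summarize` are hypothetical names, not pygexml classes, and it assumes that all_text() yields one string per line. Check the API docs for the actual item types before relying on this.

```python
from dataclasses import dataclass, field


@dataclass
class StandInLine:
    """Hypothetical stand-in for a text line; NOT a pygexml class."""

    text: str

    def all_text(self):
        yield self.text

    def all_words(self):
        yield from self.text.split()


@dataclass
class StandInRegion:
    """Hypothetical stand-in for a region holding several lines."""

    lines: list = field(default_factory=list)

    def all_text(self):
        for line in self.lines:
            yield from line.all_text()

    def all_words(self):
        for line in self.lines:
            yield from line.all_words()


def summarize(node):
    """Count lines and words of any object exposing all_text() and all_words()."""
    lines = list(node.all_text())
    word_count = sum(1 for _ in node.all_words())
    # Joining assumes all_text() yields strings.
    return {"lines": len(lines), "words": word_count, "text": "\n".join(lines)}


region = StandInRegion([StandInLine("Dear Sir"), StandInLine("thank you for your letter")])
summary = summarize(region)
print(summary["lines"], summary["words"])
```

The helper only relies on the two iterator names, so it does not need to know whether it was handed a whole page, a region, or a single line.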
Hypothesis strategies
The pygexml.strategies module provides Hypothesis strategies for all pygexml types, ready to use in property-based tests, including those of downstream projects:
from hypothesis import given
from pygexml.strategies import st_pages

@given(st_pages())
def test_my_page_processing(page):
    # process() stands for your own code under test
    assert process(page) is not None
Refer to the pygexml.strategies API docs for details.
Development
pip install ".[dev,test,docs]"
black pygexml test # format
mypy pygexml test # type check
pyright pygexml test # type check
pytest -v # tests
pdoc -o .api_docs pygexml/* # API docs
CI runs on Python 3.12, 3.13 and 3.14. API documentation is published to GitHub Pages on every push to main.
Contributing
Bug reports, feature requests and pull requests are welcome. Feel free to open draft pull requests early to invite discussion and collaboration.
Please note that this project has a Code of Conduct.
Copyright and License
Copyright (c) 2026 Mirko Westermeier, Katharina Dietz (SCDH, University of Münster)
Released under the MIT License.