Utility functions for reading PageXML files

pagexml-tools

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Installing

Using poetry

poetry add pagexml-tools

Using pip

pip install pagexml-tools

Using

PageXML-tools contains functions for parsing and for a range of analysis tasks.

Parsing PageXML files and the Physical Document model

There is a tutorial that demonstrates the physical document model API.

PageXML-tools provides basic functionality for parsing a PageXML file into a document model that represents the content of the file. The HTR/OCR process that generates PageXML recognises text in an image of a physical document.

from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"

page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
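The comment above notes that a text region can contain sub-text regions as well as lines. A minimal sketch of walking that nesting recursively could look like the following. It assumes that nested regions are exposed through a `text_regions` attribute on each region, which the example above does not confirm. `iter_lines` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so that the snippet runs without a PageXML file.

```python
from types import SimpleNamespace


def iter_lines(region):
    """Yield every line under a region, descending into nested text regions.

    Hypothetical helper: assumes `region.lines` and `region.text_regions`
    are lists (either may be missing or None). The attribute name
    `text_regions` for nested regions is an assumption.
    """
    for line in getattr(region, "lines", None) or []:
        yield line
    for sub_region in getattr(region, "text_regions", None) or []:
        yield from iter_lines(sub_region)


# Stand-in objects that mimic only the attributes used here.
inner = SimpleNamespace(
    id="tr-1-1",
    lines=[SimpleNamespace(id="l-2", text="second line")],
    text_regions=[],
)
outer = SimpleNamespace(
    id="tr-1",
    lines=[SimpleNamespace(id="l-1", text="first line")],
    text_regions=[inner],
)
page = SimpleNamespace(id="page-1", lines=[], text_regions=[outer])

line_ids = [line.id for line in iter_lines(page)]
print(line_ids)
```

If the assumption holds, `iter_lines(page_doc)` could be called on a parsed page in the same way. Note that this sketch visits a region's own lines before its sub-regions, which is not necessarily the reading order.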

In addition to the basic parsing and handling of PageXML output, there is functionality to support a range of tasks:

  • reading sets of PageXML files from an archive (tar, zip) file (tutorial),
  • searching in text (keyword in context, keywords or fuzzy search),
  • reading and working with tables (table processing),
  • classifying physical document types in a large set of PageXML documents (tutorial),
  • checking the quality of the HTR/OCR process (tutorial),
  • comparing subsets (tutorial),
  • identifying document sections in sequences of PageXML documents (tutorial),
  • turning text lines into running text (tutorial),
  • supporting different reading orders (tutorial),
  • reinterpreting and restructuring text regions and lines (tutorial),
  • turning physical structure into logical structure.
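Two of the tasks listed above, turning text lines into running text and keyword-in-context search, can be illustrated in plain Python. The sketch below does not use the PageXML-tools functions for these tasks. `lines_to_text` and `keyword_in_context` are hypothetical helpers, and the de-hyphenation rule (join on a trailing `-`) is a simplifying assumption.

```python
def lines_to_text(line_texts):
    """Join line strings into running text, merging words split by a trailing hyphen."""
    text = ""
    for line_text in line_texts:
        line_text = (line_text or "").strip()
        if not line_text:
            continue  # skip empty lines and lines without recognised text
        if text.endswith("-"):
            text = text[:-1] + line_text  # naive de-hyphenation (assumption)
        elif text:
            text = text + " " + line_text
        else:
            text = line_text
    return text


def keyword_in_context(text, keyword, width=10):
    """Return (left context, keyword, right context) for each exact, case-sensitive match."""
    if not keyword:
        return []
    matches = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        matches.append((left, keyword, right))
        start = text.find(keyword, end)
    return matches


# Illustrative input; with a parsed page this could be the `line.text` values.
line_texts = ["The quick brown fox jum-", "ps over the lazy dog.", None, "The end."]
running_text = lines_to_text(line_texts)
kwic = keyword_in_context(running_text, "The", width=6)
print(running_text)
print(kwic)
```

Joining on a trailing hyphen will also merge genuinely hyphenated compounds that happen to break at a line end, so a real pipeline would need a more careful rule.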

Using PageXML-tools in R:

# Command prompt (CMD), first time only: install the package into Python
py -m pip install pagexml-tools
# Install and start Python 3.11 in the background

# R, first time only (in a separate window): install reticulate
install.packages("reticulate")
# R, first: load reticulate
library(reticulate)
# R, next: declare the Python package that is required
py_require(c("pagexml-tools"))

# Python: import the PageXML file parser
from pagexml.parser import parse_pagexml_file

Thanks to Milan van Lange for this R example.


USAGE | CONTRIBUTING | LICENSE
