pagexml-tools

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Utility functions for reading PageXML files

Installing

Using poetry

poetry add pagexml-tools

Using pip

pip install pagexml-tools

Using

PageXML-tools contains functions for parsing PageXML files and for a range of analysis tasks.

Parsing PageXML files and the Physical Document model

There is a tutorial that demonstrates the physical document model API.

PageXML-tools contains basic functionality for parsing a PageXML file into a document model that represents the content of the file. The HTR/OCR process that generates PageXML recognises text in an image of a physical document, so the model describes the physical layout: a page contains text regions, and text regions contain lines.

from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"

page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
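To make the region and line structure above more concrete, here is a minimal standard-library sketch of how a bounding box can be derived from the `points` attribute of a PageXML `Coords` element. It does not use pagexml-tools and is not the library's implementation. The sample XML, the `bounding_box` and `iter_lines` helpers, and the `x`/`y`/`w`/`h` box layout are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2013-07-15 PAGE schema; other schema versions use another date.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hypothetical, hand-written sample; real PageXML files carry more metadata.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="tr-1">
      <Coords points="100,200 900,200 900,400 100,400"/>
      <TextLine id="line-1">
        <Coords points="110,210 890,215 890,290 110,285"/>
        <TextEquiv><Unicode>First line of text</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def bounding_box(points):
    """Return the smallest axis-aligned box around an 'x1,y1 x2,y2 ...' string."""
    pairs = [tuple(int(v) for v in p.split(",")) for p in points.split()]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return {"x": min(xs), "y": min(ys), "w": max(xs) - min(xs), "h": max(ys) - min(ys)}


def iter_lines(xml_text):
    """Yield (region_id, line_id, box, text) for every text line in the page."""
    root = ET.fromstring(xml_text)
    for region in root.iterfind(".//pc:TextRegion", NS):
        for line in region.iterfind("pc:TextLine", NS):
            points = line.find("pc:Coords", NS).get("points")
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
            yield region.get("id"), line.get("id"), bounding_box(points), text


for region_id, line_id, box, text in iter_lines(SAMPLE):
    print(region_id, line_id, box, text)
```

The box is simply the minimum and maximum of the polygon's x and y values, which is why a slightly skewed line polygon still yields an upright rectangle.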

In addition to the basic parsing and handling of PageXML output, there is functionality to support a range of tasks:

  • reading sets of PageXML files from an archive (tar, zip) file (tutorial),
  • searching in text (keyword in context, keywords or fuzzy search),
  • reading and working with tables (table processing),
  • classifying physical document types in a large set of PageXML documents (tutorial),
  • checking the quality of the HTR/OCR process (tutorial),
  • comparing subsets (tutorial),
  • identifying document sections in sequences of PageXML documents (tutorial),
  • turning text lines into running text (tutorial),
  • supporting different reading orders (tutorial),
  • reinterpreting and restructuring text regions and lines (tutorial),
  • turning physical structure into logical structure.
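As an illustration of the keyword-in-context search mentioned in the list above, the following is a minimal standard-library sketch. It is not the pagexml-tools search API. The `keyword_in_context` function, its `window` parameter, and the sample lines are hypothetical.

```python
import re


def keyword_in_context(lines, keyword, window=20):
    """Yield (line_index, left, match, right) for each case-insensitive hit.

    `window` is the number of context characters kept on either side.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    for index, text in enumerate(lines):
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - window):m.start()]
            right = text[m.end():m.end() + window]
            yield index, left, m.group(0), right


# Hypothetical line texts, e.g. collected from line.text while iterating a page.
sample_lines = [
    "Resolutions of the States General",
    "the general assembly was convened",
]

for index, left, match, right in keyword_in_context(sample_lines, "general", window=10):
    print(f"{index}: {left}[{match}]{right}")
```

Keeping the line index with each hit makes it possible to map a match back to the line's ID and coordinates in the document model.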

USAGE | CONTRIBUTING | LICENSE
