pagexml-tools
Utility functions for reading PageXML files
Installing
using poetry
poetry add pagexml-tools
using pip
pip install pagexml-tools
Using
PageXML-tools contains functions for parsing PageXML files and for a range of analysis tasks.
Parsing PageXML files and the Physical Document model
There is a tutorial that demonstrates the physical document model API.
PageXML-tools contains basic functionality for parsing a PageXML file into a document model that represents the content of the file. The HTR/OCR process that generates PageXML recognises text in an image of a physical document, so the model describes the physical layout of that document: a page contains text regions, and text regions contain lines.
from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"
page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
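The comments above mention a bounding box derived from coordinates. As a rough illustration of what that can mean, the sketch below computes an axis-aligned box from a PageXML `points` string (`"x1,y1 x2,y2 ..."`). The function name `bounding_box` and the `x`/`y`/`w`/`h` keys are assumptions made for this example; this is not necessarily how PageXML-tools computes `coords.box`.

```python
def bounding_box(points):
    """Return the axis-aligned bounding box of a PageXML points string.

    `points` uses the PageXML Coords format: "x1,y1 x2,y2 ...".
    The returned keys (x, y, w, h) are an assumption for this sketch.
    """
    xs, ys = [], []
    for pair in points.split():
        x, y = pair.split(",")
        xs.append(int(x))
        ys.append(int(y))
    return {
        "x": min(xs),            # left edge
        "y": min(ys),            # top edge
        "w": max(xs) - min(xs),  # width
        "h": max(ys) - min(ys),  # height
    }


print(bounding_box("10,20 110,20 110,70 10,70"))
# {'x': 10, 'y': 20, 'w': 100, 'h': 50}
```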
In addition to the basic parsing and handling of PageXML output, there is functionality to support a range of tasks:
- reading sets of PageXML files from an archive (tar, zip) file (tutorial),
- searching in text (keyword in context, keywords or fuzzy search),
- classifying physical document types in a large set of PageXML documents (tutorial),
- checking the quality of the HTR/OCR process (tutorial),
- comparing subsets (tutorial),
- identifying document sections in sequences of PageXML documents (tutorial),
- turning text lines into running text (tutorial),
- supporting different reading orders (tutorial),
- reinterpreting and restructuring text regions and lines (tutorial),
- turning physical structure into logical structure.
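To give a feel for the first task in the list, here is a minimal sketch of reading PageXML files from a zip archive using only the Python standard library. It does not use the archive reader in PageXML-tools (see the tutorial for that); the function names `local_name` and `iter_archive_lines` and the sample document are hypothetical and exist only for this illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_archive_lines(archive):
    """Yield (member name, line ID, text) for each TextLine in a zip of PageXML files.

    `archive` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.endswith(".xml"):
                continue  # skip members that are not XML files
            root = ET.fromstring(zf.read(name))
            for elem in root.iter():
                if local_name(elem.tag) != "TextLine":
                    continue
                # Use the TextEquiv/Unicode directly under the TextLine;
                # Word elements may carry their own TextEquiv.
                text = ""
                for child in elem:
                    if local_name(child.tag) == "TextEquiv":
                        for sub in child:
                            if local_name(sub.tag) == "Unicode":
                                text = sub.text or ""
                yield name, elem.get("id"), text


# Build a tiny in-memory archive so the sketch runs without any files on disk.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
sample = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="scan.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("page_001.xml", sample)
    zf.writestr("notes.txt", "not PageXML")
buffer.seek(0)

lines = list(iter_archive_lines(buffer))
for name, line_id, text in lines:
    print(name, line_id, text)
```

Reading members straight from the archive avoids unpacking large collections to disk, which is likely the main reason to work with archives for big sets of scans.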