Utility functions for reading PageXML files
pagexml-tools
Installing
using poetry
poetry add pagexml-tools
using pip
pip install pagexml-tools
Using
PageXML-tools contains functions for parsing PageXML files and for a range of analysis tasks.
Parsing PageXML files and the Physical Document model
There is a tutorial that demonstrates the physical document model API.
PageXML-tools contains basic functionality for parsing a PageXML file and returning a document model that represents the content of the file. The HTR/OCR process that generates PageXML recognises text in an image of a physical document.
from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"
page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)
# print descriptive statistics
print(page_doc.stats)
# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
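To show what the parser is reading, here is a minimal sketch that uses only the Python standard library and does not call pagexml-tools. It assumes the PAGE convention that `Coords/@points` holds space-separated `x,y` pairs and that line text sits in `TextEquiv/Unicode`. The names `points_to_box` and `iter_lines`, the box keys, and the sample document are made up for this illustration; the box that pagexml-tools computes may use different keys or conventions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, Optional, Tuple

# Namespace of the 2013-07-15 PAGE schema; other schema versions use another date.
PAGE_URI = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_URI}

# Tiny hand-written sample, not output from a real HTR/OCR run.
SAMPLE = f"""<PcGts xmlns="{PAGE_URI}">
  <Page imageFilename="example.jpg" imageWidth="200" imageHeight="100">
    <TextRegion id="r1">
      <Coords points="10,20 110,20 110,60 10,60"/>
      <TextLine id="r1-l1">
        <Coords points="12,22 108,22 108,38 12,38"/>
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def points_to_box(points: str) -> Dict[str, int]:
    """Smallest axis-aligned box around an 'x1,y1 x2,y2 ...' points string.

    Hypothetical helper: the key names are chosen for this example only.
    """
    pairs = [pair.split(",") for pair in points.split()]
    if not pairs:
        raise ValueError("points string contains no coordinates")
    xs = [int(x) for x, _ in pairs]
    ys = [int(y) for _, y in pairs]
    return {"x": min(xs), "y": min(ys), "w": max(xs) - min(xs), "h": max(ys) - min(ys)}


def iter_lines(xml_text: str) -> Iterator[Tuple[Optional[str], Optional[str], Dict[str, int], str]]:
    """Yield (region id, line id, line box, line text) for each text line."""
    root = ET.fromstring(xml_text)
    for region in root.iterfind(".//pc:TextRegion", NS):
        for line in region.iterfind("pc:TextLine", NS):
            coords = line.find("pc:Coords", NS)
            if coords is None:
                continue  # skip lines without coordinates in this sketch
            box = points_to_box(coords.get("points", ""))
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
            yield region.get("id"), line.get("id"), box, text


for row in iter_lines(SAMPLE):
    print(row)
```

The sketch only reads the direct `TextLine` children of each region and ignores reading order, which is part of what the document model handles for you.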
In addition to the basic parsing and handling of PageXML output, there is functionality to support a range of tasks:
- reading sets of PageXML files from an archive (tar, zip) file (tutorial),
- searching in text (keyword in context, keywords or fuzzy search),
- classifying physical document types in a large set of PageXML documents (tutorial),
- checking the quality of the HTR/OCR process (tutorial),
- comparing subsets (tutorial),
- identifying document sections in sequences of PageXML documents (tutorial),
- turning text lines into running text (tutorial),
- supporting different reading orders (tutorial),
- reinterpreting and restructuring text regions and lines (tutorial),
- turning physical structure into logical structure.
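As an illustration of one task from the list above, turning text lines into running text, here is a deliberately naive sketch. It is not the algorithm used by pagexml-tools: it assumes that every line-final hyphen marks a word broken across two lines, which will wrongly merge genuine hyphenated compounds that happen to break at a line end. The function name `lines_to_text` and its `hyphens` parameter are hypothetical.

```python
from typing import Iterable


def lines_to_text(lines: Iterable[str], hyphens: str = "-") -> str:
    """Join text lines into one string, merging words broken at a line end.

    Naive rule (an assumption of this sketch): a line that ends in one of
    `hyphens` continues directly into the next line, without a space.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        if not text:
            text = line
        elif text[-1] in hyphens:
            # drop the line-final hyphen and glue the two word halves together
            text = text[:-1] + line
        else:
            text = text + " " + line
    return text


page_lines = ["The docu-", "ment was scan-", "ned in 1990."]
print(lines_to_text(page_lines))
```

With the document model shown earlier, the input could be the `line.text` values of a text region, skipping lines whose text is empty.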