pagexml-tools
Utility functions for reading PageXML files
Installing
Using poetry
poetry add pagexml-tools
Using pip
pip install pagexml-tools
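To confirm that either command succeeded, the installed version can be read from the package metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of pagexml-tools.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="pagexml-tools"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    print(installed_version() or "pagexml-tools is not installed")
```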
Using
PageXML-tools contains functions for parsing PageXML files and for a range of analysis tasks.
Parsing PageXML files and the Physical Document model
There is a tutorial that demonstrates the physical document model API.
PageXML-tools contains basic functionality for parsing a PageXML file into a document model that represents the content of the file. The HTR/OCR process that generates PageXML recognises text in an image of a physical document, so the model describes the physical layout: pages, text regions and lines, each with coordinates.

    from pagexml.parser import parse_pagexml_file

    pagexml_file = "path/to/pagexml_file.xml"
    page_doc = parse_pagexml_file(pagexml_file)

    # a page document has an ID
    print(page_doc.id)

    # print descriptive statistics
    print(page_doc.stats)

    # iterate over text regions and lines
    for tr in page_doc.text_regions:
        # a text_region has an ID and a bounding box derived from its coordinates
        print(tr.id, tr.coords.box)
        # a text_region can have sub-text_regions and lines
        for line in tr.lines:
            # a line has an ID, coordinates and text
            print(line.id, line.coords.box, line.text)

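To make the region/line hierarchy and the bounding boxes more concrete, the sketch below reads a tiny hand-written PageXML fragment with the standard library only. It illustrates the underlying file format, not how pagexml-tools is implemented: the sample content, the helper names `bounding_box` and `read_regions`, and the `(left, top, right, bottom)` box layout are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, not real HTR/OCR output.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="tr_1">
      <Coords points="100,100 900,100 900,400 100,400"/>
      <TextLine id="line_1">
        <Coords points="110,120 880,120 880,180 110,180"/>
        <TextEquiv><Unicode>First line of text</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="line_2">
        <Coords points="110,200 700,200 700,260 110,260"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def bounding_box(points):
    """Return (left, top, right, bottom) for a PageXML 'points' string."""
    coords = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def read_regions(xml_text):
    """Collect the ID, box and lines of every text region in a PageXML string."""
    root = ET.fromstring(xml_text)
    # The root tag looks like "{namespace}PcGts"; reuse its namespace for lookups.
    ns = {"pc": root.tag[1:].split("}")[0]}
    regions = []
    for tr in root.findall(".//pc:TextRegion", ns):
        lines = []
        for line in tr.findall("pc:TextLine", ns):
            unicode_el = line.find("pc:TextEquiv/pc:Unicode", ns)
            lines.append({
                "id": line.get("id"),
                "box": bounding_box(line.find("pc:Coords", ns).get("points")),
                "text": unicode_el.text if unicode_el is not None else None,
            })
        regions.append({
            "id": tr.get("id"),
            "box": bounding_box(tr.find("pc:Coords", ns).get("points")),
            "lines": lines,
        })
    return regions


for region in read_regions(SAMPLE):
    print(region["id"], region["box"])
    for line in region["lines"]:
        print("  ", line["id"], line["box"], line["text"])
```

The sketch assumes a namespaced root element and a `Coords` child on every region and line; real files may need more defensive handling, which is what the library's parser is for.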
In addition to the basic parsing and handling of PageXML output, there is functionality to support a range of tasks:
- reading sets of PageXML files from an archive (tar, zip) file (tutorial),
- searching in text (keyword in context, keywords or fuzzy search),
- classifying physical document types in a large set of PageXML documents (tutorial),
- checking the quality of the HTR/OCR process (tutorial),
- comparing subsets (tutorial),
- identifying document sections in sequences of PageXML documents (tutorial),
- turning text lines into running text (tutorial),
- supporting different reading orders (tutorial),
- reinterpreting and restructuring text regions and lines (tutorial),
- turning physical structure into logical structure.
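Two of the tasks above, reading files from an archive and keyword-in-context search, can be illustrated with the standard library alone. The sketch below does not use or mirror the pagexml-tools API; the function names `read_xml_from_zip` and `keyword_in_context`, the `(line_id, text)` input shape and the `width` parameter are hypothetical choices made for this example.

```python
import io
import re
import zipfile


def read_xml_from_zip(archive):
    """Yield (member_name, text) for every .xml member of a zip archive."""
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if name.lower().endswith(".xml"):
                yield name, zf.read(name).decode("utf-8")


def keyword_in_context(lines, keyword, width=20):
    """Return (line_id, left context, match, right context) for each hit.

    `lines` is an iterable of (line_id, text) pairs; matching ignores case.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for line_id, text in lines:
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((line_id, left, m.group(), right))
    return hits


if __name__ == "__main__":
    # Build a small in-memory archive so the example needs no files on disk.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("page_1.xml", "<PcGts/>")
        zf.writestr("readme.txt", "not a PageXML file")
    buffer.seek(0)
    for name, text in read_xml_from_zip(buffer):
        print(name, len(text))

    sample_lines = [("line_1", "The ship sailed from Amsterdam to Batavia")]
    for hit in keyword_in_context(sample_lines, "amsterdam", width=10):
        print(hit)
```

In practice the line texts would come from a parsed document, for example the `line.id` and `line.text` attributes shown in the parsing example.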
File details
Details for the file pagexml_tools-0.5.0.tar.gz.
File metadata
- Download URL: pagexml_tools-0.5.0.tar.gz
- Upload date:
- Size: 50.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.11.8 Darwin/23.3.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8fbda5e390a5c0199bb968217f77ddb7edc01cbdbc72c6beceac496128f413aa
MD5 | d211734cdd8e2ff1b5682e4f990ed9c7
BLAKE2b-256 | b9d82692da5cf57504e97b9dda07cd52efb02230f2fd54d38d5d9cbb6a20033a
File details
Details for the file pagexml_tools-0.5.0-py3-none-any.whl.
File metadata
- Download URL: pagexml_tools-0.5.0-py3-none-any.whl
- Upload date:
- Size: 54.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.11.8 Darwin/23.3.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1f04160d0ec197618db98776cd0ced37dc805da8d1d978907e8b8412cb4f5551
MD5 | 8ceac659da37a26c9ec75c9573e14a95
BLAKE2b-256 | 93adca93ed859aef4f3f2e8962bb8be049fb719968e7f61f02f16a7faeda6aa3
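A downloaded file can be checked against the SHA256 digests listed above before installing it. The following is a minimal sketch with `hashlib`; the helper names `sha256_of_file` and `matches_digest` and the local file path are assumptions for this example.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical local path; the digest is the SHA256 listed for the sdist.
    expected = "8fbda5e390a5c0199bb968217f77ddb7edc01cbdbc72c6beceac496128f413aa"
    try:
        print(matches_digest("pagexml_tools-0.5.0.tar.gz", expected))
    except FileNotFoundError:
        print("download the file first")
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for a 50 kB archive but is the usual pattern.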