
Wrapper for the PageXML C++ library to ease handling of Page XML files within python.

Project description

The py-pagexml Python package is a library of functions that eases working with omni:us Pages Format XML files (referred to as OPF XML files). From Python it lets you read an OPF file, extract the information it contains, modify or add content, create an OPF from scratch, crop parts of page images, etc.

By default, OPF XML files are validated against the XSD schema when reading, when saving, or on request by calling a function. The documentation of the XSD schema and the schema itself, both included in py-pagexml, can be found at:

The official online documentation of py-pagexml is available at https://omni-us.github.io/pagexml/py-pagexml.

The py-pagexml package can be built in two modes: normal and slim. As the name implies, the slim build is smaller, but more importantly it has fewer library dependencies. As a consequence, some features are not available in the slim build, namely functions related to images, e.g. PageXML.crop, and functions that perform intersections of polygons, e.g. PageXML.selectByOverlap.
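To find out which of the two builds is present in an environment, one option is to query the installed distribution metadata. The following is a minimal sketch, not part of py-pagexml: the helper name `installed_build` is hypothetical, the distribution names `pagexml` and `pagexml_slim` are the ones used in the pip commands of this page, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_build(candidates=("pagexml", "pagexml_slim")):
    """Return (distribution name, version) of the first installed candidate.

    Returns None when none of the candidate distributions is installed.
    """
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


if __name__ == "__main__":
    print(installed_build() or "neither pagexml nor pagexml_slim is installed")
```

An application that relies on image or polygon functions could call such a helper at start-up and fail early when only the slim distribution is found.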

Software dependencies

The core of py-pagexml is a compiled C++ library that links with a few libraries, so it requires installation of dependencies that cannot be automatically obtained from pypi servers.

Docker images are available on Docker Hub with both the runtime and the build dependencies already installed. In particular, the runtime docker images are intended to be used as base images for applications that use pagexml. The specific dependencies for runtime and for building are listed below.

Runtime dependencies

Slim:
  • python3

Normal (in addition to the previous):
  • libopencv-imgcodecs (Ubuntu 18.04/20.04) | libopencv-highgui (Ubuntu 16.04)
  • libopencv-imgproc
  • libopencv-core
  • libgdal
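As a rough pre-flight check for the normal build, the shared libraries behind these packages can be probed from Python. This is only a sketch: the helper `probe_libraries` is hypothetical, the short library names (`opencv_core` and so on) are assumptions derived from the Ubuntu package names, and a missing result may simply mean that the system lookup tools used by `ctypes.util.find_library` are unavailable.

```python
from ctypes.util import find_library

# Assumed short names of the shared libraries behind the Ubuntu packages
# listed above; adjust them if your distribution names them differently
# (for example opencv_highgui instead of opencv_imgcodecs on Ubuntu 16.04).
NORMAL_BUILD_LIBS = ["opencv_core", "opencv_imgproc", "opencv_imgcodecs", "gdal"]


def probe_libraries(names):
    """Map each library name to the file name found on the system, or None."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in probe_libraries(NORMAL_BUILD_LIBS).items():
        print(f"{name}: {found or 'not found'}")
```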

Building dependencies

Slim:
  • python3-setuptools
  • python3-pkgconfig
  • python3-wheel
  • python3-dev
  • swig

Normal (in addition to the previous):
  • libopencv-dev
  • libgdal-dev
  • libboost-all-dev

Installation from wheel binary file

If you have configured a pypi server that includes pagexml, installation is as simple as:

pip3 install pagexml

The slim build has a different name, so the install command would be:

pip3 install pagexml_slim

Otherwise you can install it from a github release. Each release includes multiple wheel files: one for python 3.5 built for Ubuntu 16.04, another for python 3.6 built for Ubuntu 18.04, and another for python 3.8 built for Ubuntu 20.04. Once you have located the appropriate wheel file, copy its link and run the following, replacing the URL with the one you copied:

pip3 install https://github.com/omni-us/pagexml/releases/download/20*/pagexml-20*-linux_x86_64.whl
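Choosing the right release wheel comes down to matching the running interpreter against the three Python/Ubuntu pairs mentioned above. A small sketch of that lookup follows; the names `WHEEL_TARGETS`, `wheel_target` and `cpython_tag` are hypothetical, and the `cpXY` tag is the standard CPython wheel tag, which is assumed (not confirmed here) to appear in the release file names.

```python
import sys

# Python (major, minor) -> Ubuntu release the wheel was built for,
# as described in the text above.
WHEEL_TARGETS = {
    (3, 5): "Ubuntu 16.04",
    (3, 6): "Ubuntu 18.04",
    (3, 8): "Ubuntu 20.04",
}


def wheel_target(version=None):
    """Return the Ubuntu release whose wheel matches the interpreter, or None."""
    if version is None:
        version = sys.version_info[:2]
    return WHEEL_TARGETS.get(tuple(version[:2]))


def cpython_tag(version):
    """Return the CPython wheel tag for a version, e.g. (3, 8) -> 'cp38'."""
    return "cp{}{}".format(*version[:2])


if __name__ == "__main__":
    current = sys.version_info[:2]
    print(cpython_tag(current), wheel_target(current) or "no matching release wheel")
```

When `wheel_target` returns None, none of the listed release wheels matches the interpreter, and building from source as described in the next section is the remaining option.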

Building the wheel file from source

Clone the github repository https://github.com/omni-us/pagexml.git, go to the py-pagexml directory and then run:

pip3 install --editable .[dev]
./setup.py bdist_wheel

To build the slim package, give the --slim command line option, e.g.:

./setup.py bdist_wheel --slim

Simple usage examples

Create a new Page XML adding regions, text and properties

import pagexml
pxml = pagexml.PageXML()

# Create a new page xml
file = 'example_image.jpg'
width = 400
height = 200
pxml.newXml('name-and-version-of-tool', file, width, height)

# Add a text region to the Page
page = pxml.selectNth('//_:Page', 0)
reg = pxml.addTextRegion(page)

# Set text region bounding box with a confidence
pxml.setCoordsBBox(reg, 10, 20, 80, 60, 0.8)

# Set the text for the text region with a confidence
pxml.setTextEquiv(reg, 'lorem ipsum', 0.9)

# Add property to text region
pxml.setProperty(reg, 'key', 'value')

# Add a second page with a text region and specific id
page = pxml.addPage('example_image_2.jpg', 300, 300)
reg = pxml.addTextRegion(page, 'regA')
pxml.setCoordsBBox(reg, 15, 12, 76, 128)

# Write XML to file
pxml.write('example_image.xml')
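To make the bounding box call more concrete, the sketch below shows how a box can be turned into the space-separated list of `x,y` pairs that Page XML stores in the points attribute of a Coords element. It is an illustration only and does not call py-pagexml: the helper `bbox_to_points` is hypothetical, and it assumes that the four numbers passed to setCoordsBBox are xmin, ymin, width and height. The library may order the corners differently or adjust the maximum coordinates.

```python
def bbox_to_points(xmin, ymin, width, height):
    """Return the four corners of a box as an 'x,y x,y x,y x,y' string.

    Assumption: the box is given as top-left corner plus width and height.
    """
    xmax = xmin + width
    ymax = ymin + height
    # Corners in clockwise order starting at the top-left.
    corners = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]
    return " ".join(f"{x},{y}" for x, y in corners)


if __name__ == "__main__":
    print(bbox_to_points(10, 20, 80, 60))  # 10,20 90,20 90,80 10,80
```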

Modify an existing Page XML

# Load an existing XML
import pagexml
pxml = pagexml.PageXML('example_image.xml')

# Add content to loaded XML
pxml.setProperty(pxml.selectNth('//_:Page', 0), 'key', 'value')

# Write XML to file
pxml.write('example_image_2.xml')
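The `_:` prefix in the XPath expressions above appears to stand for the default namespace of the Page XML document. The same idea can be tried out with the standard library alone, as in the sketch below. It does not use py-pagexml: the namespace URI is a placeholder rather than the real OPF namespace, and the tiny document is made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/page"  # placeholder namespace, not the real OPF one

DOC = f"""<PcGts xmlns="{NS}">
  <Page imageFilename="example_image.jpg" imageWidth="400" imageHeight="200">
    <TextRegion id="t1"/>
  </Page>
  <Page imageFilename="example_image_2.jpg" imageWidth="300" imageHeight="300">
    <TextRegion id="regA"/>
  </Page>
</PcGts>"""

root = ET.fromstring(DOC)
prefixes = {"_": NS}  # bind the "_" prefix to the document's default namespace

pages = root.findall(".//_:Page", prefixes)
first_page = pages[0]  # roughly analogous to selectNth('//_:Page', 0)
print(len(pages), first_page.get("imageFilename"))  # 2 example_image.jpg

# Without the prefix nothing matches, because Page lives in a namespace.
print(len(root.findall(".//Page")))  # 0
```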

Crop an element and save image to disk

# Load an existing XML
import pagexml
pxml = pagexml.PageXML('examples/lorem.xml')

# Crop element with specific ID
cropped = pxml.crop('//*[@id="r1_l1"]/_:Coords')[0]

# Save image to disk
pagexml.imwrite(cropped.name+'.png', cropped.image)
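Conceptually, cropping an element means selecting its Coords by XPath, taking the rectangle that encloses the points, and cutting that rectangle out of the page image. The sketch below illustrates this with the standard library and a toy pixel grid. It is not how py-pagexml is implemented: the namespace URI, the document, and the helpers `bounding_box` and `crop_pixels` are all hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/page"  # placeholder namespace, not the real OPF one

DOC = f"""<PcGts xmlns="{NS}"><Page imageWidth="6" imageHeight="4">
  <TextRegion id="r1"><TextLine id="r1_l1">
    <Coords points="1,1 4,1 4,3 1,3"/>
  </TextLine></TextRegion>
</Page></PcGts>"""


def bounding_box(points):
    """Return (xmin, ymin, xmax, ymax) for an 'x,y x,y ...' points string."""
    xy = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs, ys = zip(*xy)
    return min(xs), min(ys), max(xs), max(ys)


def crop_pixels(pixels, box):
    """Cut the inclusive rectangle `box` out of a row-major list of rows."""
    xmin, ymin, xmax, ymax = box
    return [row[xmin:xmax + 1] for row in pixels[ymin:ymax + 1]]


root = ET.fromstring(DOC)
coords = root.find(".//*[@id='r1_l1']/_:Coords", {"_": NS})
box = bounding_box(coords.get("points"))

# Toy 6x4 "image" in which each pixel value encodes its own position.
image = [[10 * y + x for x in range(6)] for y in range(4)]
patch = crop_pixels(image, box)

print(box)       # (1, 1, 4, 3)
print(patch[0])  # [11, 12, 13, 14]
```

Note that PageXML.crop belongs to the image-related functions, which according to the build description above are not available in the slim build.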

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • pagexml_slim-2022.4.12-cp313-cp313-manylinux_2_28_x86_64.whl (2.0 MB): CPython 3.13, manylinux: glibc 2.28+ x86-64
  • pagexml_slim-2022.4.12-cp312-cp312-manylinux_2_28_x86_64.whl (2.0 MB): CPython 3.12, manylinux: glibc 2.28+ x86-64
  • pagexml_slim-2022.4.12-cp311-cp311-manylinux_2_28_x86_64.whl (2.0 MB): CPython 3.11, manylinux: glibc 2.28+ x86-64
  • pagexml_slim-2022.4.12-cp310-cp310-manylinux_2_24_x86_64.whl (1.9 MB): CPython 3.10, manylinux: glibc 2.24+ x86-64
  • pagexml_slim-2022.4.12-cp39-cp39-manylinux_2_24_x86_64.whl (1.9 MB): CPython 3.9, manylinux: glibc 2.24+ x86-64
  • pagexml_slim-2022.4.12-cp38-cp38-manylinux_2_24_x86_64.whl (1.9 MB): CPython 3.8, manylinux: glibc 2.24+ x86-64
  • pagexml_slim-2022.4.12-cp37-cp37m-manylinux_2_24_x86_64.whl (1.9 MB): CPython 3.7m, manylinux: glibc 2.24+ x86-64
  • pagexml_slim-2022.4.12-cp36-cp36m-manylinux_2_24_x86_64.whl (1.9 MB): CPython 3.6m, manylinux: glibc 2.24+ x86-64

