Convert AWS Textract JSON to PRImA PAGE XML
textract2page
Introduction
This software converts OCR results from Amazon AWS Textract Response files to PRImA PAGE XML files.
Installation
In a Python virtualenv:
pip install textract2page
Usage
The package provides a file-based conversion function, available both as a CLI and as a Python API.
The function takes the Textract JSON file and the original image file that was used
as input for the OCR. (The image is needed because Textract stores coordinates as
floating-point ratios of the image size, whereas PAGE uses integer pixel coordinates.)
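The ratio-to-pixel conversion can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the helper name `polygon_to_page_points` is hypothetical, and the rounding and clamping choices are assumptions. It relies only on the Textract `Geometry.Polygon` layout (a list of `X`/`Y` ratios) and the PAGE `points` attribute format (`x1,y1 x2,y2 ...`).

```python
def polygon_to_page_points(polygon, image_width, image_height):
    """Scale Textract ratio coordinates to a PAGE-style points string.

    Hypothetical helper for illustration; textract2page may round or
    clamp differently.
    """
    points = []
    for vertex in polygon:
        # Textract gives X and Y as fractions of the image width and height.
        x = round(vertex["X"] * image_width)
        y = round(vertex["Y"] * image_height)
        # Assumption: keep the result inside the valid pixel index range.
        x = min(max(x, 0), image_width - 1)
        y = min(max(y, 0), image_height - 1)
        points.append(f"{x},{y}")
    return " ".join(points)


# A region covering the left quarter and top half of a 2135 x 3240 image.
polygon = [
    {"X": 0.0, "Y": 0.0},
    {"X": 0.25, "Y": 0.0},
    {"X": 0.25, "Y": 0.5},
    {"X": 0.0, "Y": 0.5},
]
print(polygon_to_page_points(polygon, 2135, 3240))
# prints: 0,0 534,0 534,1620 0,1620
```

Without the pixel dimensions there is no way to perform this scaling, which is why either the image file or its width and height must be supplied.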
Python API
To convert a Textract file example.json for an image file example.jpg to a PAGE example.xml:
from textract2page import convert_file
convert_file("example.json", "example.jpg", "example.xml")
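To convert a whole directory, the single-file call can be wrapped in a loop. The sketch below is an assumption-laden example, not part of the package: `convert_directory` is a hypothetical helper, it assumes each JSON file sits next to an image with the same base name, and it takes the converter as a parameter so that any callable with the same three path arguments as `convert_file` can be used.

```python
from pathlib import Path


def convert_directory(directory, convert, image_suffix=".jpg"):
    """Call convert(json, image, xml) for every JSON file with a matching image.

    Hypothetical helper: `convert` is expected to be
    textract2page.convert_file or a callable with the same arguments.
    """
    written = []
    for json_path in sorted(Path(directory).glob("*.json")):
        image_path = json_path.with_suffix(image_suffix)
        if not image_path.exists():
            # Skip Textract results whose source image is missing.
            continue
        xml_path = json_path.with_suffix(".xml")
        convert(str(json_path), str(image_path), str(xml_path))
        written.append(xml_path)
    return written
```

With the package installed, this would be called as `convert_directory("scans", convert_file)`, where `scans` is a placeholder directory name.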
Alternatively, if you do not have access to the image file, but do know its pixel resolution, use:
from textract2page import convert_file_without_image
convert_file_without_image("example.json",
# just give it a name (will not be read):
"example.jpg",
# set image width so PAGE coordinates will be correct:
2135,
# set image height so PAGE coordinates will be correct:
3240,
"example.xml")
CLI
Analogously, on the command line interface:
# with image file
textract2page example.json example.jpg > example.xml
textract2page -O example.xml example.json example.jpg
# without image file (just its path name)
textract2page --image-width 2135 --image-height 3240 example.json example.jpg > example.xml
textract2page --image-width 2135 --image-height 3240 -O example.xml example.json example.jpg
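If the CLI needs to be driven from a script, the invocations above can be assembled programmatically. The following is a sketch only: `build_command` and `run_conversion` are hypothetical helpers, and they use just the options shown in the examples above (`-O`, `--image-width`, `--image-height`).

```python
import subprocess


def build_command(json_path, image_path, xml_path, width=None, height=None):
    """Assemble a textract2page command line from the options shown above."""
    command = ["textract2page"]
    if width is not None and height is not None:
        # Only needed when the image file itself is unavailable.
        command += ["--image-width", str(width), "--image-height", str(height)]
    command += ["-O", xml_path, json_path, image_path]
    return command


def run_conversion(*args, **kwargs):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(*args, **kwargs), check=True)


print(build_command("example.json", "example.jpg", "example.xml", 2135, 3240))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.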
To list all options, run textract2page --help (or -h).
Testing
Requires installation and a local copy of the repository.
To run the regression tests with pytest:
make deps-test
make test-api
To run the regression tests via the command line:
# optionally:
sudo apt-get install xmlstarlet
make test-cli
(If xmlstarlet is available, then the CLI test will
also validate the result tree. Otherwise, this just
checks the command completes without error.)
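Where xmlstarlet is not installed, a much weaker sanity check on a converted file can be done in Python. This sketch is not what the project's test suite does: `looks_like_page_xml` is a hypothetical helper that only checks well-formedness and the root element name, assuming the output root is `PcGts` as defined by the PAGE schema. It does not validate against the schema.

```python
import xml.etree.ElementTree as ET


def looks_like_page_xml(xml_text):
    """Return True if the text parses as XML with a PcGts root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    # Strip a namespace prefix of the form {uri}name, if present.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name == "PcGts"


sample = '<PcGts xmlns="urn:example:placeholder"><Page/></PcGts>'
print(looks_like_page_xml(sample))          # prints: True
print(looks_like_page_xml("<PcGts><Page>")) # prints: False
```

The namespace URI in `sample` is a placeholder, not the real PAGE namespace.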