convert PubLayNet data into METS/PAGE-XML
ocrd_publaynet
Introduction
This package converts PubLayNet, or similar MS-COCO-based ground-truth data, into an OCR-D compliant (i.e. METS-XML/PAGE-XML based) representation.
Installation
System packages
Install GNU make and wget if you wish to use the Makefile:
# on Debian / Ubuntu:
sudo apt install make wget
Install Python 3 in any case:
# on Debian / Ubuntu:
sudo apt install python3 python3-pip python3-venv
Equivalently:
# on Debian / Ubuntu:
sudo make deps-ubuntu
Python packages
It is strongly recommended to use venv. You can create and activate a virtual environment of your own (which the Makefile will re-use while it is active), or have the Makefile create one for you.
pip install -r requirements.txt
pip install .
Equivalently:
make install
Usage
command-line interface ocrd-import-mscoco
Once installed, the following executable becomes available:
Usage: ocrd-import-mscoco [OPTIONS] COCOFILE DIRECTORY
Convert MS-COCO JSON to METS/PAGE XML files.
Load JSON ``cocofile`` (in MS-COCO format) and chdir to ``directory``
(which it refers to).
Start a METS file mets.xml with references to the image files (under
fileGrp ``OCR-D-IMG``) and their corresponding PAGE-XML annotations (under
fileGrp ``OCR-D-GT-SEG-BLOCK``), as parsed from ``cocofile`` and written
using the same basename.
Options:
--help Show this message and exit.
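To make the input side of the conversion more concrete, here is a minimal sketch of how MS-COCO annotations can be grouped per image and how a COCO bounding box ([x, y, width, height]) maps to a PAGE-style points string. This is not the code of ocrd-import-mscoco itself; the function names bbox_to_points and regions_per_image and the sample data are hypothetical, chosen only for illustration.

```python
import json
from collections import defaultdict


def bbox_to_points(bbox):
    """Turn a COCO bbox [x, y, width, height] into a PAGE-style points string.

    The four corners are listed clockwise, starting at the top-left.
    """
    x, y, w, h = (int(round(v)) for v in bbox)
    corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    return " ".join(f"{px},{py}" for px, py in corners)


def regions_per_image(coco):
    """Group annotations by image file name, resolving category names."""
    categories = {c["id"]: c["name"] for c in coco["categories"]}
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    regions = defaultdict(list)
    for ann in coco["annotations"]:
        regions[file_names[ann["image_id"]]].append(
            (categories[ann["category_id"]], bbox_to_points(ann["bbox"]))
        )
    return dict(regions)


# Tiny hand-made sample in MS-COCO layout (not real PubLayNet data).
SAMPLE = json.loads("""
{
  "images": [{"id": 1, "file_name": "page0001.jpg", "width": 600, "height": 800}],
  "categories": [{"id": 1, "name": "text"}, {"id": 4, "name": "table"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 30, 40]},
    {"id": 11, "image_id": 1, "category_id": 4, "bbox": [100, 200, 300, 150]}
  ]
}
""")

if __name__ == "__main__":
    for file_name, regions in regions_per_image(SAMPLE).items():
        for category, points in regions:
            print(file_name, category, points)
```

In a real conversion, each such region would become a region element in the PAGE-XML file that shares the image's basename; how categories map to PAGE region types is left open in this sketch.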
apply on PubLayNet
To convert the validation subset:
ocrd-import-mscoco publaynet/val.json publaynet/val
This will create a METS file publaynet/val/mets.xml and PAGE files publaynet/val/*.xml for all image files.
To convert the training subset:
ocrd-import-mscoco publaynet/train.json publaynet/train
This will create a METS file publaynet/train/mets.xml and PAGE files publaynet/train/*.xml for all image files.
Equivalently (including download/extraction if necessary):
make convert
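After a conversion, it can be useful to check which files the resulting METS references in each fileGrp. The following is a minimal sketch using only the Python standard library; the helper name files_by_group and the sample METS snippet (including its file names and IDs) are assumptions made for illustration, not output of the tool.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def files_by_group(mets_xml):
    """Map each fileGrp USE value to the list of file hrefs it contains."""
    root = ET.fromstring(mets_xml)
    groups = {}
    for grp in root.iterfind(".//mets:fileGrp", NS):
        groups[grp.get("USE")] = [
            loc.get(XLINK_HREF)
            for loc in grp.iterfind("mets:file/mets:FLocat", NS)
        ]
    return groups


# Hand-written sample (hypothetical file names), shaped like a small METS.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="page0001.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-GT-SEG-BLOCK">
      <mets:file ID="PAGE_0001" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="page0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""

if __name__ == "__main__":
    for use, hrefs in files_by_group(SAMPLE_METS).items():
        print(use, len(hrefs), "file(s)")
    # For a real result, read the file instead, e.g.:
    # with open("publaynet/val/mets.xml", encoding="utf-8") as f:
    #     groups = files_by_group(f.read())
```

If both groups list the same number of files, every image has a corresponding PAGE-XML annotation.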
Note: PubLayNet itself requires approximately 103 GB of disk space. If you already have it elsewhere but still wish to use the Makefile to convert the files, symlink it into this directory so that it does not get downloaded again:
ln -s your/path/to/publaynet publaynet
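Given the size mentioned in the note above, it may be worth checking the free disk space before starting a download. The sketch below uses shutil.disk_usage from the standard library; the helper name has_enough_space is hypothetical, and the threshold simply reuses the approximate 103 GB figure (interpreted here as decimal gigabytes, which is an assumption).

```python
import shutil

REQUIRED_GB = 103  # approximate figure from the note; treated as 10**9 bytes each


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return (ok, free_gb) for the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb, free_gb


if __name__ == "__main__":
    ok, free_gb = has_enough_space(".")
    status = "enough" if ok else "NOT enough"
    print(f"{free_gb:.1f} GB free: {status} for roughly {REQUIRED_GB} GB")
```

Note that the extracted data and the generated PAGE-XML files need room in addition to the downloaded archive, so treat the threshold as a lower bound.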
all Makefile targets
Rules to install ocrd-import-mscoco, and to use it on
PubLayNet (by downloading, extracting and converting).
Targets:
help: this message
deps-ubuntu: install system dependencies for Ubuntu
all: alias for `install download convert`
install: alias for `pip install .`
download: alias for `publaynet.tar.gz`
convert: alias for `publaynet/val/mets.xml publaynet/train/mets.xml`
uninstall: alias for `pip uninstall ocrd_publaynet`
clean-xml: remove results of conversion (METS and PAGE files in `publaynet`)
clean: remove `publaynet` altogether
Variables:
VIRTUAL_ENV: absolute path to (re-)use for the virtual environment
PYTHON: name of the Python binary
PIP: name of the Python packaging binary