ocrd_publaynet

convert PubLayNet data into METS/PAGE-XML

Introduction

This package converts PubLayNet, or similar MS-COCO-based ground-truth data, into an OCR-D compliant (i.e. METS-XML/PAGE-XML based) representation.
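For orientation, here is a minimal sketch of what an MS-COCO layout ground-truth file looks like and how its annotations relate to images. The file names, IDs and boxes are made-up sample data, and `regions_by_image` is a hypothetical helper for illustration, not part of this package.

```python
import json
from collections import defaultdict

# Hypothetical miniature of an MS-COCO ground-truth file;
# real PubLayNet files follow the same layout but are far larger.
COCO_JSON = """
{
  "images": [
    {"id": 1, "file_name": "page_a.jpg", "width": 600, "height": 800}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 40, 500, 120]},
    {"id": 11, "image_id": 1, "category_id": 4, "bbox": [50, 200, 500, 300]}
  ],
  "categories": [
    {"id": 1, "name": "text"},
    {"id": 4, "name": "table"}
  ]
}
"""


def regions_by_image(coco):
    """Map each image file name to a list of (category name, bbox) pairs."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {i["id"]: i["file_name"] for i in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        region = (names[ann["category_id"]], ann["bbox"])
        grouped[files[ann["image_id"]]].append(region)
    return dict(grouped)


coco = json.loads(COCO_JSON)
pages = regions_by_image(coco)
for file_name, regions in pages.items():
    print(file_name, regions)
```

Each image thus carries a list of labelled regions, which is the information a per-page annotation file needs.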

Installation

System packages

Install GNU make and wget if you wish to use the Makefile.

# on Debian / Ubuntu:
sudo apt install make wget

Install Python 3 regardless:

# on Debian / Ubuntu:
sudo apt install python3 python3-pip python3-venv

Equivalently:

# on Debian / Ubuntu:
sudo make deps-ubuntu

Python packages

It is strongly recommended to use venv. You can create and activate a virtual environment of your own (which the Makefile will re-use while it is active), or have the Makefile create one for you.

pip install -r requirements.txt
pip install .

Equivalently:

make install
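Before installing, it can help to confirm that a virtual environment is really active. The following is a small sketch using only the Python standard library; the helper names `in_virtualenv` and `active_venv_path` are hypothetical, and how the Makefile itself detects the environment is not shown here.

```python
import os
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def active_venv_path():
    """Return the VIRTUAL_ENV path exported by an activation script, or None."""
    return os.environ.get("VIRTUAL_ENV")


if not in_virtualenv():
    print("No virtual environment detected; pip would install into", sys.prefix)
else:
    print("Using virtual environment at", active_venv_path() or sys.prefix)
```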

Usage

command-line interface ocrd-import-mscoco

Once installed, the following executable becomes available:

Usage: ocrd-import-mscoco [OPTIONS] COCOFILE DIRECTORY

  Convert MS-COCO JSON to METS/PAGE XML files.

  Load JSON ``cocofile`` (in MS-COCO format) and chdir to ``directory``
  (which it refers to).

  Start a METS file mets.xml with references to the image files (under
  fileGrp ``OCR-D-IMG``) and their corresponding PAGE-XML annotations (under
  fileGrp ``OCR-D-GT-SEG-BLOCK``), as parsed from ``cocofile`` and written
  using the same basename.

Options:
  --help  Show this message and exit.
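To illustrate the kind of mapping such a conversion involves: MS-COCO stores a box as `[x, y, width, height]` and a polygon as a flat coordinate list, whereas PAGE-XML expects a `points` string of space-separated `x,y` pairs. The sketch below shows one way to bridge the two. It is an assumption about the general approach, not the actual code of `ocrd-import-mscoco`, and both function names are hypothetical.

```python
def bbox_to_points(bbox):
    """Turn a COCO box [x, y, width, height] into a PAGE-style points string."""
    x, y, w, h = (int(round(v)) for v in bbox)
    # Walk the four corners clockwise, starting at the top-left.
    corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    return " ".join(f"{px},{py}" for px, py in corners)


def polygon_to_points(flat):
    """Turn a flat COCO polygon [x1, y1, x2, y2, ...] into a points string."""
    if len(flat) % 2:
        raise ValueError("polygon needs an even number of coordinates")
    pairs = zip(flat[0::2], flat[1::2])
    return " ".join(f"{int(round(px))},{int(round(py))}" for px, py in pairs)


print(bbox_to_points([50, 40, 500, 120]))
print(polygon_to_points([10.2, 20.7, 30, 20.7, 30, 45]))
```

Coordinates are rounded to integers here because PAGE `points` values are pixel positions; a real converter might choose a different rounding policy.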

apply on PubLayNet

To convert the validation subset:

ocrd-import-mscoco publaynet/val.json publaynet/val

This creates a METS file publaynet/val/mets.xml and one PAGE file publaynet/val/*.xml per image file.
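To check such a result, one can list the files referenced in each fileGrp of the METS. The following sketch uses only the standard library and parses an inline sample. The sample METS (IDs, file names, attributes) is hypothetical and merely mimics the two file groups named in the help text; the real output may carry further sections and attributes.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, heavily shortened METS for illustration only.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="page_a.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-GT-SEG-BLOCK">
      <mets:file ID="GT_0001" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="page_a.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def files_by_group(mets_xml):
    """Return {fileGrp USE: [referenced file paths]} for a METS document string."""
    root = ET.fromstring(mets_xml)
    href = "{%s}href" % NS["xlink"]
    result = {}
    for grp in root.iterfind(".//mets:fileGrp", NS):
        locations = grp.iterfind("mets:file/mets:FLocat", NS)
        result[grp.get("USE")] = [loc.get(href) for loc in locations]
    return result


groups = files_by_group(SAMPLE_METS)
for use, paths in groups.items():
    print(use, paths)
```

For a real run, the string could be replaced by the contents of publaynet/val/mets.xml; comparing the lengths of the two lists is a quick way to see whether every image has an annotation file.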

To convert the training subset:

ocrd-import-mscoco publaynet/train.json publaynet/train

This creates a METS file publaynet/train/mets.xml and one PAGE file publaynet/train/*.xml per image file.

Equivalently (including download/extraction if necessary):

make convert

Note: PubLayNet itself requires approximately 103 GB of disk space. If you already have it elsewhere but still wish to use the Makefile to convert the files, make sure to symlink it here, so it does not get downloaded twice:

ln -s your/path/to/publaynet publaynet
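A quick pre-flight check along these lines can avoid a failed download. This is a sketch under assumptions: the 103 GB figure is taken from the note above, the helper names are hypothetical, and the Makefile's own logic for deciding whether to download is not reproduced here.

```python
import os
import shutil

REQUIRED_GB = 103  # approximate figure from the note above


def free_gigabytes(path="."):
    """Free space on the filesystem holding `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def needs_download(dataset_dir="publaynet"):
    """True if neither a directory nor a symlink to one exists at dataset_dir."""
    # os.path.isdir follows symlinks, so a link to an existing copy counts.
    return not os.path.isdir(dataset_dir)


if needs_download() and free_gigabytes() < REQUIRED_GB:
    print("Less than %d GB free; consider symlinking an existing copy." % REQUIRED_GB)
```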

all Makefile targets

Rules to install ocrd-import-mscoco, and to use it on
PubLayNet (by downloading, extracting and converting).

Targets:
	help: this message
	deps-ubuntu: install system dependencies for Ubuntu
	all: alias for `install download convert`
	install: alias for `pip install .`
	download: alias for `publaynet.tar.gz`
	convert: alias for `publaynet/val/mets.xml publaynet/train/mets.xml`
	uninstall: alias for `pip uninstall ocrd_publaynet`
	clean-xml: remove results of conversion (METS and PAGE files in `publaynet`)
	clean: remove `publaynet` altogether

Variables:
	VIRTUAL_ENV: absolute path to (re-)use for the virtual environment
	PYTHON: name of the Python binary
	PIP: name of the Python packaging binary
