Skip to main content

Toolset to perform various operations on PAGE XML datasets

Project description

PAGETools - WIP

Small collection of PAGE XML related Python scripts.

Installing

Installation using pip

The suggested method is to install pagetools into a virtual environment using pip:

python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install pagetools

To install the package from source, clone this repository and run inside the project directory

pip install .

Usage

Extraction

Usage: pagetools extract [OPTIONS] XMLS...

  Extract elements as image (optionally with text) files.

Options:
  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to extract (highest
                                  priority).

  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to exclude from
                                  extraction (lowest priority).

  --no-text                       Suppresses text extraction.
  -ie, --image-extension TEXT     Extension of image files. Must be in the
                                  same directory as corresponding XML file.

  -o, --output TEXT               Path where generated files will get saved.
  -e, --enumerate-output          Enumerates output file names instead of
                                  using original names.

  -z, --zip-output                Add generated output to zip archive.
  -bg, --background-color INTEGER...
                                  RGB color code used to fill up background.
                                  Used when padding and / or deskewing.

  --background-mode [median|mean|dominant]
                                  Color calc mode to fill up background
                                  (overwrites -bg / --background-color).

  -p, --padding INTEGER...        Padding in pixels around the line image
                                  cutout (top, bottom, left, right).

  -ad, --auto-deskew              Automatically deskew extracted line images
                                  (Experimental!).

  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
                                  line images.

  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
                                  ground truth.

  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
                                  predicted text.

  --help                          Show this message and exit.

Examples

Only extract TextLine elements:

pagetools extract <Path/to/xml/files>/*.xml -ie <img_extension> -o <Path/to/output/dir> --include TextLine

Pay in mind that --include / --exclude currently work different from e.g. the same arguments in rsync (due to limitations with the click library). Inclusion of certain element types always trumps exclusion of the same type, regardless of the order in the call.

Regularization

Usage: pagetools regularize [OPTIONS] XMLS...

  Regularize the text content of PAGE XML files using custom rulesets.

Options:
  -r, --rules PATH            File(s) which contains serialized ruleset.
  -s, --safe / -us, --unsafe  Creates backups of original files before
                              overwriting.

  --help                      Show this message and exit.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PAGETools-0.2.2.tar.gz (12.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

PAGETools-0.2.2-py3-none-any.whl (17.2 kB view details)

Uploaded Python 3

File details

Details for the file PAGETools-0.2.2.tar.gz.

File metadata

  • Download URL: PAGETools-0.2.2.tar.gz
  • Upload date:
  • Size: 12.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.9.1

File hashes

Hashes for PAGETools-0.2.2.tar.gz
Algorithm Hash digest
SHA256 89f16700a829c1b1ce02209f20323369969faa8fabd0321982dbedc9ce582ace
MD5 95bddf3ba49c1d005b0ae9154bdee39a
BLAKE2b-256 0d6f91f2e5979b3237d8cd6eeb9f8ae08cd1feb0c66fec9c4007e9b035262a31

See more details on using hashes here.

File details

Details for the file PAGETools-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: PAGETools-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 17.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.9.1

File hashes

Hashes for PAGETools-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 d973ba778c61f639be23679335b60050447506717ecb27fe60ee43296dd4c97b
MD5 fd44c795c52a074deff2a0b9826955be
BLAKE2b-256 7472606f458063adfe7dadf4c51f137296342ef6748b4b913db99dbde5e5a528

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page