Toolset to perform various operations on PAGE XML datasets
Small collection of PAGE XML related Python scripts used at the Centre for Philology and Digitality (ZPD), University of Würzburg.
Installing
Installation using pip
The suggested method is to install pagetools into a virtual environment using pip:
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install pagetools
To install the package from source, clone this repository and run the following inside the project directory:
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install .
Usage
Transformations
Extraction
Usage: pagetools extract [OPTIONS] XMLS...
Extract elements as image (optionally with text) files.
Options:
--include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
PAGE XML element types to extract (highest
priority).
--exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
PAGE XML element types to exclude from
extraction (lowest priority).
--no-text Suppresses text extraction.
-ie, --image-extension TEXT Extension of image files. Must be in the
same directory as corresponding XML file.
-o, --output TEXT Path where generated files will get saved.
-e, --enumerate-output Enumerates output file names instead of
using original names.
-z, --zip-output Add generated output to zip archive.
-bg, --background-color INTEGER...
RGB color code used to fill up background.
Used when padding and / or deskewing.
--background-mode [median|mean|dominant]
Color calc mode to fill up background
(overwrites -bg / --background-color).
-p, --padding INTEGER... Padding in pixels around the line image
cutout (top, bottom, left, right).
-ad, --auto-deskew Automatically deskew extracted line images
(Experimental!).
-d, --deskew FLOAT Angle for manual clockwise rotation of the
line images.
-gt, --gt-index INTEGER Index of the TextEquiv elements containing
ground truth.
-pred, --pred-index INTEGER Index of the TextEquiv elements containing
predicted text.
--help Show this message and exit.
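The three --background-mode values can be pictured with a small sketch. This is not the pagetools implementation: the function name `background_color` is hypothetical, and the readings of the modes (per-channel median, per-channel mean, most frequent color) are assumptions based only on the option names.

```python
from collections import Counter
from statistics import mean, median


def background_color(pixels, mode="median"):
    """Derive one RGB fill color from a list of (r, g, b) tuples.

    Assumed semantics, not confirmed by the pagetools source:
    'median' and 'mean' work per channel, 'dominant' picks the most frequent color.
    """
    if mode == "dominant":
        return Counter(pixels).most_common(1)[0][0]
    reducer = {"median": median, "mean": mean}[mode]
    # zip(*pixels) regroups the tuples into one sequence per channel.
    return tuple(int(round(reducer(channel))) for channel in zip(*pixels))


pixels = [(255, 255, 255), (250, 250, 250), (255, 255, 255), (0, 0, 0), (240, 240, 240)]
print(background_color(pixels, "median"))    # (250, 250, 250)
print(background_color(pixels, "mean"))      # (200, 200, 200)
print(background_color(pixels, "dominant"))  # (255, 255, 255)
```

On a mostly bright scan with a few dark ink pixels, the median and dominant colors stay close to the paper color, while the mean is pulled towards the ink.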
Examples
Only extract TextLine elements:
pagetools extract <Path/to/xml/files>/*.xml -ie <img_extension> -o <Path/to/output/dir> --include TextLine --exclude "*"
Keep in mind that --include / --exclude currently work differently from, e.g., the options of the same name in rsync (due to limitations of the click library). Inclusion of an element type always trumps exclusion of the same type, regardless of the order of the options in the call.
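The priority rule can be modelled in a few lines. The sketch below is not taken from the pagetools source: `resolve_element_types` is a hypothetical name, the type list is abbreviated, and the rule "start with every type, drop the excluded ones, then add the included ones back" is an assumption that matches the description and the example call above.

```python
# Hypothetical model of the documented --include / --exclude priority.
# The list is abbreviated; the real CLI accepts more element types.
ELEMENT_TYPES = ["TextRegion", "ImageRegion", "TableRegion", "TextLine"]


def expand(selection):
    """Expand the '*' wildcard to all known element types."""
    return set(ELEMENT_TYPES) if "*" in selection else set(selection)


def resolve_element_types(include=(), exclude=()):
    """Return the element types to extract; inclusion trumps exclusion."""
    remaining = set(ELEMENT_TYPES) - expand(exclude)  # lowest priority
    remaining |= expand(include)                      # highest priority
    # Preserve the canonical order of ELEMENT_TYPES in the result.
    return [t for t in ELEMENT_TYPES if t in remaining]


print(resolve_element_types(include=["TextLine"], exclude=["*"]))
# ['TextLine']
print(resolve_element_types(include=["TextLine"], exclude=["TextLine"]))
# ['TextRegion', 'ImageRegion', 'TableRegion', 'TextLine']
```

Because the included types are added last, swapping the order of the two options in the call cannot change the result.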
line2page
Merges line images and their corresponding text files into page images and PAGE XML files.
Usage: pagetools line2page [OPTIONS]
Links line images and corresponding texts in a page and creates a combined
image and XML-File of each page
Options:
-c, --creator TEXT Creator tag for PAGE XML
-s, --source-folder TEXT Path to images and GT [required]
-i, --image-folder TEXT Path to images
-gt, --gt-folder TEXT Path to GT
-d, --dest-folder TEXT Path to merge objects
-e, --ext TEXT Image extension
-p, --pred BOOLEAN Set flag to also store .pred.txt
-l, --lines INTEGER RANGE Lines per page
-ls, --line-spacing INTEGER RANGE
Spacing between lines in pixel
-b, --border INTEGER RANGE... Border in pixel: top bottom left right
--debug [10|20|30|40|50] Sets the level of feedback to receive:
DEBUG=10, INFO=20, WARNING=30, ERROR=40,
CRITICAL=50
--threads INTEGER RANGE Thread count to be used
--xml-schema [17|19] Sets the year of the xml-Schema to be used
--help Show this message and exit.
Please note that each image file must share its base name with its ground truth file, for example:
foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
bar.bin.png -> bar.gt.txt (& bar.pred.txt)
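The naming convention can be checked before running line2page with a short helper. This is a sketch under the assumption, inferred from the two examples above, that everything before the first dot is the shared base name; `companion_files` is a hypothetical helper and not part of pagetools.

```python
from pathlib import Path


def companion_files(image_path, with_pred=False):
    """Map a line image to the text files expected next to it (assumed scheme)."""
    image = Path(image_path)
    # Everything before the first dot is treated as the shared base name,
    # so 'foo.nrm.png' and 'foo.gt.txt' both belong to 'foo'.
    base = image.name.split(".", 1)[0]
    files = [image.with_name(f"{base}.gt.txt")]
    if with_pred:
        files.append(image.with_name(f"{base}.pred.txt"))
    return files


print([p.name for p in companion_files("lines/foo.nrm.png", with_pred=True)])
# ['foo.gt.txt', 'foo.pred.txt']
```

Calling `Path.exists()` on each returned path gives a quick way to list line images that lack a ground truth file.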
Regularization
Usage: pagetools regularize [OPTIONS] XMLS...
Regularize the text content of PAGE XML files using custom rulesets.
Options:
--remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
Removes specified default ruleset.
--add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
Adds specified default ruleset. Overrides
all other default options.
-nd, --no-default Disables all default rulesets.
-r, --rules PATH File(s) which contains serialized ruleset.
-nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
Normalize unicode for both rules and PAGE
XML tests.
-s, --safe / -us, --unsafe Creates backups of original files before
overwriting.
--help Show this message and exit.
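The effect of normalizing both the rules and the text can be shown with the standard library alone. The sketch below is not the pagetools rule engine: `apply_rule` is a hypothetical stand-in for a single replacement rule, and only `unicodedata.normalize` is a real API.

```python
import unicodedata


def apply_rule(text, source, target, form=None):
    """Replace source with target, optionally normalizing both sides first."""
    if form is not None:
        text = unicodedata.normalize(form, text)
        source = unicodedata.normalize(form, source)
    return text.replace(source, target)


decomposed_text = "Ba\u0308r"  # 'a' followed by a combining diaeresis
composed_rule = "\u00e4"       # precomposed a-umlaut as a single code point

# Without normalization the rule does not match and the text stays unchanged.
print(apply_rule(decomposed_text, composed_rule, "ae") == decomposed_text)  # True
# After NFC normalization of both sides the rule applies.
print(apply_rule(decomposed_text, composed_rule, "ae", form="NFC"))  # Baer
```

The compatibility forms go further than NFC and NFD: NFKC, for example, also resolves the single-character ligature U+FB01 into the two letters "fi".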
Change index
Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET
Change index on TextEquiv elements.
Options:
-s, --safe / -us, --unsafe Creates backups of original files before
overwriting.
--help Show this message and exit.
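What changing an index means at the XML level can be sketched with the standard library. The code below is an illustration, not the pagetools implementation: `change_index` is a hypothetical function, and the sample document is trimmed and not schema-valid (for instance, it has no Coords elements).

```python
import xml.etree.ElementTree as ET

# Trimmed sample document for illustration only.
PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv index="0"><Unicode>ground truth</Unicode></TextEquiv>
        <TextEquiv index="1"><Unicode>prediction</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def change_index(root, source, target):
    """Set index=target on every TextEquiv whose index equals source."""
    changed = 0
    for element in root.iter():
        # Compare the local tag name so the sketch works for any PAGE namespace.
        is_text_equiv = element.tag.rsplit("}", 1)[-1] == "TextEquiv"
        if is_text_equiv and element.get("index") == str(source):
            element.set("index", str(target))
            changed += 1
    return changed


root = ET.fromstring(PAGE_XML)
print(change_index(root, 1, 2))  # 1
```

Only the `index` attribute is rewritten; the Unicode text inside each TextEquiv is left untouched.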
Analytics
Get Codec
Usage: pagetools get-codec [OPTIONS] FILES...
Retrieves codec of PAGE XML files.
Options:
-l, --level [region|line|word|glyph]
-idx, --index INTEGER Considers only text from TextEquiv elements
with a certain index.
-mc, --most-common INTEGER Only prints n most common entries. Shows all
by default.
-o, --output TEXT File to which results are written.
-rw, --remove-whitespace
-of, --output-format [json|csv|txt]
Available result formats.
-freq, --frequencies Outputs character frequencies.
--text-output-newline Inserts new line after every character in
txt output. Only applies when frequencies
aren't output.
--verbose / --silent Choose between verbose or silent output.
--help Show this message and exit.
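Assuming "codec" here means the set of distinct characters in the text content, the idea can be sketched as follows. This is not the pagetools code: `get_codec` and its parameters are hypothetical, the level is given as a PAGE element name rather than the CLI's region|line|word|glyph values, and the sample XML is trimmed and not schema-valid.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Trimmed sample document for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv index="0"><Unicode>abba</Unicode></TextEquiv>
        <TextEquiv index="1"><Unicode>ab c</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def get_codec(xml_string, level="TextLine", index=None, remove_whitespace=False):
    """Count characters in TextEquiv/Unicode directly below `level` elements."""
    counts = Counter()
    for parent in ET.fromstring(xml_string).iter():
        if local_name(parent.tag) != level:
            continue
        for equiv in parent:  # direct children only
            if local_name(equiv.tag) != "TextEquiv":
                continue
            if index is not None and equiv.get("index") != str(index):
                continue
            for child in equiv:
                if local_name(child.tag) == "Unicode" and child.text:
                    counts.update(child.text)
    if remove_whitespace:
        for char in [c for c in counts if c.isspace()]:
            del counts[char]
    return counts


codec = get_codec(SAMPLE, index=0)
print(sorted(codec))  # ['a', 'b']
print(codec["a"])     # 2
```

The keys of the returned `Counter` form the codec, and its values are the character frequencies.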
Get text count
Usage: pagetools get-text-count [OPTIONS] FILES...
Returns the amount of text equiv elements in certain elements for certain
indices.
Options:
-e, --element [TextRegion|TextLine|Word]
-i, --index TEXT [required]
-so, --stats-out TEXT Output directory for detailed stats csv
file.
--help Show this message and exit.