A Python library for parsing, converting and modifying PageXML files.

PyPXML

A Python library for parsing, creating and modifying PageXML files.

Setup

Note: Requires Python 3.11 or newer.

Install from PyPI

pip install pypxml

Install upstream from source

  1. Clone repository: git clone https://github.com/jahtz/pypxml
  2. Install package: cd pypxml && pip install .
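
To confirm the setup, a short check like the following may help. It is a minimal sketch that only uses the Python standard library; the helper names `python_ok` and `installed_version` are illustrative and not part of PyPXML.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 11)  # minimum version stated under Setup


def python_ok(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the documented minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name: str = "pypxml") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("pypxml version:", installed_version())
```

If the second line prints `None`, the package is not installed in the active environment.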

API

PyPXML provides a feature-rich Python API for working with PageXML files.

Full documentation

CLI

$ pypxml --help
Usage: pypxml [OPTIONS] COMMAND [ARGS]...

  A python library for parsing, converting and modifying PageXML files.

Options:
  --help     Show this message and exit.
  --version  Show the version and exit.

Commands:
  get-codec           Extract the character set from PageXML files.
  get-custom          List all custom region attributes in PageXML files.
  get-regions         List all regions in PageXML files.
  get-text            Extract text from PageXML files.
  regularize-codec    Regularize character encodings in PageXML files.
  regularize-regions  Regularize region types in PageXML files.

analytics

get-codec

$ pypxml get-codec --help
Usage: pypxml get-codec [OPTIONS] FILES...

  This tool analyzes the text content of PageXML files and extracts the set of
  characters used.

  It can optionally normalize unicode, remove whitespace, and output character
  frequencies. Results are printed to the console or saved as a CSV file.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output FILE               Path to a CSV file to save the results. If
                                  omitted, results are printed to stdout. If a
                                  directory is given, the file 'codec.csv'
                                  will be created inside it.
  -l, --level [TextRegion|TextLine|Word|Glyph]
                                  PageXML level from which to extract text.
                                  [default: TextLine]
  -i, --index INTEGER             Only consider TextEquiv elements with the
                                  specified index.
  -w, --remove-whitespace         Remove all whitespace characters before
                                  analyzing text.
  -f, --frequencies               Also output character frequencies.
  -n, --normalize [NFC|NFD|NFKC|NFKD]
                                  Normalize unicode before analyzing text.
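
The effect of the `--normalize`, `--remove-whitespace` and `--frequencies` options can be illustrated with plain Python. The sketch below is not PyPXML's implementation; it assumes the text lines have already been extracted, and `char_frequencies` is a hypothetical helper built on the standard library.

```python
import unicodedata
from collections import Counter


def char_frequencies(lines, normalize=None, remove_whitespace=False):
    """Count characters over an iterable of text lines.

    normalize: None, or one of "NFC", "NFD", "NFKC", "NFKD".
    """
    counts = Counter()
    for line in lines:
        if normalize is not None:
            line = unicodedata.normalize(normalize, line)
        if remove_whitespace:
            line = "".join(ch for ch in line if not ch.isspace())
        counts.update(line)
    return counts


# The first line uses a precomposed e-acute, the second one uses
# "e" followed by a combining acute accent (U+0301).
lines = ["Caf\u00e9 au lait", "cafe\u0301"]

nfc = char_frequencies(lines, normalize="NFC", remove_whitespace=True)
nfd = char_frequencies(lines, normalize="NFD", remove_whitespace=True)

print(sorted(nfc))  # the character set under NFC
print(sorted(nfd))  # the character set under NFD
```

Under NFC both spellings collapse into the single precomposed character, while under NFD the combining accent shows up as its own entry. This is why the choice of normalization form changes the reported character set.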

get-regions

$ pypxml get-regions --help
Usage: pypxml get-regions [OPTIONS] FILES...

  Analyzes PageXML files and lists the region types found.

  Optionally includes subtypes, outputs frequencies, and groups by file,
  directory, or globally.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output PATH               CSV file or directory where the results are
                                  saved. If a directory is given, the file
                                  'regions.csv' will be created inside it. If
                                  omitted, results are printed to stdout.
  -l, --level [total|directory|file]
                                  Set the aggregation level for the output.
                                  'total' combines all files, 'directory'
                                  aggregates by parent directory, and 'file'
                                  lists results per individual file.
                                  [default: total]
  -f, --frequencies               Also output the frequency (count) of each
                                  region type.
  -t, --types                     Include subtypes by printing them as
                                  'PageType.type' if available.

get-custom

$ pypxml get-custom --help
Usage: pypxml get-custom [OPTIONS] FILES...

  Analyzes PageXML files and lists the custom region types found.

  Optionally outputs frequencies and groups by file, directory, or globally.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output PATH               CSV file or directory where the results are
                                  saved. If a directory is given, the file
                                  'customs.csv' will be created inside it. If
                                  omitted, results are printed to stdout.
  -l, --level [total|directory|file]
                                  Set the aggregation level for the output.
                                  'total' combines all files, 'directory'
                                  aggregates by parent directory, and 'file'
                                  lists results per individual file.
                                  [default: total]
  -f, --frequencies               Also output the frequency (count) of each
                                  custom attribute.

get-text

$ pypxml get-text --help
Usage: pypxml get-text [OPTIONS] FILES...

  Extract text from PageXML files at the TextLine level.

  Outputs to individual text files, a single file, or prints to the console,
  with optional separators between regions and pages.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output PATH            Output destination. If a directory is
                               specified, a separate text file is created for
                               each PageXML file, ignoring the page separator.
                               If a file is specified, the text from all files
                               is concatenated into that file. If omitted, the
                               text is printed to stdout.
  -i, --index INTEGER          Use only the text from TextEquiv elements at
                               the given index.
  -r, --region-separator TEXT  Separator string inserted between regions. Use
                               "" for an empty line, "\n" for two empty lines,
                               ...
  -p, --page-separator TEXT    Separator string inserted between pages when
                               outputting to a single file or stdout. Ignored
                               when outputting multiple files. Use "" for an
                               empty line, "\n" for two empty lines, ...

regularize

regularize-codec

$ pypxml regularize-codec --help
Usage: pypxml regularize-codec [OPTIONS] FILES...

  Apply character replacement rules to text elements in PageXML files.

  Supports selecting PlainText or Unicode elements and limiting replacements
  to specific element levels.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output DIRECTORY          Directory to save the modified PageXML
                                  files. If omitted, input files will be
                                  overwritten.
  -i, --index INTEGER             Use only TextEquiv elements with the
                                  specified index. Defaults to all TextEquiv
                                  elements if not set.
  -l, --level [TextRegion|TextLine|Word|Glyph]
                                  PageXML element level to process.  [default:
                                  TextLine]
  --plaintext / --unicode         Select the text element to use. Choose from
                                  PlainText (without formatting) or Unicode
                                  (formatted).  [default: unicode]
  -r, --rule TEXT...              Define substring replacement rules. Each
                                  rule is a pair of strings: '--rule SOURCE
                                  TARGET'. Multiple rules can be specified by
                                  repeating the option.  [required]

regularize-regions

$ pypxml regularize-regions --help
Usage: pypxml regularize-regions [OPTIONS] FILES...

  This tool processes PageXML files and updates or removes regions based on
  specified rules.

  Regions are matched by their PageType and optional subtype. Regions matching
  the source specification are either updated to a new type or deleted if
  target is set to 'None'.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output DIRECTORY  Directory to save the modified PageXML files. If
                          omitted, input files will be overwritten.
  -r, --rule TEXT...      Define rules for region regularization. Format:
                          --rule SOURCE TARGET where SOURCE is the original
                          region type (e.g., TextRegion.paragraph,
                          ImageRegion), and TARGET is the new region type. Use
                          a 'None' TARGET to delete the region. Only region
                          PageTypes are allowed. Multiple rules can be
                          specified by repeating this option.  [required]
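
The rule format can be sketched on plain data. The example below represents regions as (PageType, subtype) tuples rather than XML elements and makes two assumptions that the help text does not state: a SOURCE without a subtype matches any subtype, and the first matching rule wins. `parse_spec` and `regularize` are hypothetical helpers.

```python
def parse_spec(spec):
    """Split 'PageType.subtype' into (page_type, subtype or None)."""
    page_type, _, subtype = spec.partition(".")
    return page_type, (subtype or None)


def regularize(regions, rules):
    """Rewrite or drop regions according to (source, target) rules.

    regions: list of (page_type, subtype) tuples.
    rules: list of (source_spec, target_spec) string pairs.
    """
    parsed = [(parse_spec(source), target) for source, target in rules]
    result = []
    for region in regions:
        for (src_type, src_sub), target in parsed:
            if region[0] != src_type:
                continue
            if src_sub is not None and region[1] != src_sub:
                continue
            # First matching rule decides: delete on 'None', else retype.
            region = None if target == "None" else parse_spec(target)
            break
        if region is not None:
            result.append(region)
    return result


regions = [
    ("TextRegion", "paragraph"),
    ("TextRegion", "heading"),
    ("ImageRegion", None),
]
rules = [
    ("TextRegion.heading", "TextRegion.paragraph"),
    ("ImageRegion", "None"),
]
print(regularize(regions, rules))
```

Here the heading region is retyped to a paragraph and the image region is dropped, which mirrors the two outcomes described above.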

ZPD

Developed at the Centre for Philology and Digitality (ZPD), University of Würzburg.
