Synthetic document rendering with parallel ALTO output

PangoLine

PangoLine is a basic tool to render raw (horizontal) text into PDF documents and create parallel ALTO files for each page containing baseline and bounding box information.

It is intended to support the rendering of most of the world's writing systems in order to create synthetic page-level training data for automatic text recognition systems. Functionality is fairly basic for now. PDF output is single column, justified text without word breaking. Paragraphs are split automatically once a page is full.

Installation

You'll need PyGObject and the Pango/Cairo libraries on your system. As PyGObject is only shipped in source form, installing it also requires a C compiler and the usual build environment dependencies. An easier way is to use conda:

~> conda create --name pangoline-py3.11 -c conda-forge python=3.11
~> conda activate pangoline-py3.11
~> conda install -c conda-forge pygobject pango cairo click jinja2 rich pypdfium2 lxml pillow
~> pip install --no-deps .

Usage

Rendering

PangoLine first renders text into vector PDFs and ALTO facsimiles using configurable "physical" dimensions.

~> pangoline render doc.txt
Rendering ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Rendering options such as page size, margins, language, and base direction can be set manually, for example:

~> pangoline render -p 216 279 -l en-us -f "Noto Sans 24" doc.txt

Text can also be styled with Pango Markup. Markup parsing is enabled by default but can be disabled with a switch:

~> pangoline render --no-markup doc.txt
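
When markup parsing is left enabled, literal `&`, `<`, and `>` characters in the input need to be escaped, because Pango Markup follows XML-style syntax. The following is a minimal sketch of preparing such input in Python. The helper name `to_pango_markup` is hypothetical and not part of PangoLine; `<b>` is a standard Pango Markup tag.

```python
from xml.sax.saxutils import escape


def to_pango_markup(raw_line: str, bold: bool = False) -> str:
    """Escape a raw text line so it is safe inside Pango Markup.

    Hypothetical helper for illustration; not part of PangoLine.
    """
    # escape() replaces &, <, and > with the XML entities
    # &amp;, &lt;, and &gt;.
    escaped = escape(raw_line)
    # Optionally wrap the escaped text in the Pango Markup bold tag.
    return f"<b>{escaped}</b>" if bold else escaped


# Build two example lines that could be written to an input text file.
example_lines = [
    to_pango_markup("Fish & chips cost < 10", bold=True),
    to_pango_markup("plain second line"),
]
print("\n".join(example_lines))
```

Text prepared this way can be rendered with the default settings, while unescaped raw text is the use case for `--no-markup`.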

Rasterization

In a second step, those vector files can be rasterized into PNGs, with the coordinates in the ALTO files scaled to the selected resolution (300 dpi by default):

~> pangoline rasterize doc.0.xml doc.1.xml ...
Rasterizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
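
The scaling from physical dimensions to pixels can be illustrated with a short calculation. This is a sketch only: it assumes the physical unit is millimetres (the `-p 216 279` example above resembles US Letter in mm) and that coordinates scale linearly with resolution. The function names are hypothetical and do not reflect PangoLine's internals.

```python
MM_PER_INCH = 25.4


def mm_to_px(value_mm: float, dpi: int = 300) -> int:
    """Convert a length in millimetres to whole pixels at the given dpi.

    Assumes millimetres as the physical unit (an assumption, see above).
    """
    return round(value_mm / MM_PER_INCH * dpi)


def page_size_px(width_mm: float, height_mm: float, dpi: int = 300) -> tuple[int, int]:
    """Pixel dimensions of a page given its physical size in millimetres."""
    return mm_to_px(width_mm, dpi), mm_to_px(height_mm, dpi)


# A 216 x 279 mm page at 300 dpi.
print(page_size_px(216, 279))  # (2551, 3295)
```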

Rasterized files and their ALTO files can be used as-is as training data for automatic text recognition (ATR).
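
To give an idea of how such training data might be consumed, here is a minimal sketch that reads line text, bounding boxes, and baselines from an ALTO document using only the Python standard library. The sample XML is hand-written for this example and may differ from what PangoLine actually emits; in particular, encoding the baseline as a list of points in the `BASELINE` attribute is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample for illustration, not actual PangoLine output.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine HPOS="10" VPOS="20" WIDTH="100" HEIGHT="12" BASELINE="10,30 110,30">
      <String CONTENT="Hello"/><SP/><String CONTENT="world"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def read_lines(alto_xml: str) -> list[dict]:
    """Return text, bounding box, and baseline points for each TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Collect every number in BASELINE and pair them up as (x, y).
        numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", elem.get("BASELINE", ""))]
        baseline = list(zip(numbers[0::2], numbers[1::2]))
        # Bounding box as (x, y, width, height).
        bbox = tuple(float(elem.get(key, 0)) for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        words = [c.get("CONTENT", "") for c in elem.iter() if local_name(c.tag) == "String"]
        lines.append({"text": " ".join(words), "bbox": bbox, "baseline": baseline})
    return lines


for line in read_lines(SAMPLE):
    print(line)
```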

To obtain slightly more realistic input images, it is possible to overlay the rasterized text onto images of writing surfaces.

~> pangoline rasterize -w ~/background_1.jpg doc.0.xml doc.1.xml ...

Rasterization can be invoked with multiple background images in which case they will be sampled randomly for each output page. A tarball with 70 empty paper backgrounds of different origins, digitization qualities, and states of preservation can be found here.
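
Conceptually, per-page random sampling of backgrounds could look like the sketch below. It only illustrates the idea and is not PangoLine's implementation; the function name, the file names, and the `seed` parameter are hypothetical.

```python
import random


def assign_backgrounds(pages: list[str], backgrounds: list[str], seed=None) -> dict[str, str]:
    """Pick one background per page, uniformly at random.

    Hypothetical illustration; a fixed seed makes the choice repeatable.
    """
    rng = random.Random(seed)
    return {page: rng.choice(backgrounds) for page in pages}


# Example file names are placeholders.
pages = ["doc.0.xml", "doc.1.xml", "doc.2.xml"]
backgrounds = ["background_1.jpg", "background_2.jpg"]
print(assign_backgrounds(pages, backgrounds, seed=0))
```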

For larger collections of texts it is advisable to parallelize processing, especially for rasterization with overlays:

~> pangoline --workers 8 render *.txt
~> pangoline --workers 8 rasterize *.xml
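
For very large collections it can be convenient to drive these commands from a script, for example when the shell's glob expansion becomes unwieldy. The sketch below builds the same command lines as above using only the `--workers` option and the `render` and `rasterize` subcommands shown in this document. The helper names are hypothetical, and `run_batch` assumes the `pangoline` executable is on the `PATH`.

```python
import glob
import subprocess


def build_command(subcommand: str, files: list[str], workers: int = 8) -> list[str]:
    """Build the argument list for a parallel pangoline invocation."""
    if subcommand not in ("render", "rasterize"):
        raise ValueError(f"unsupported subcommand: {subcommand}")
    return ["pangoline", "--workers", str(workers), subcommand, *files]


def run_batch(subcommand: str, pattern: str, workers: int = 8):
    """Expand a glob pattern in Python and run pangoline on the matches."""
    files = sorted(glob.glob(pattern))
    if not files:
        return None  # nothing to process
    # check=True raises CalledProcessError if pangoline exits with an error.
    return subprocess.run(build_command(subcommand, files, workers), check=True)


# Usage (requires pangoline to be installed):
#   run_batch("render", "*.txt")
#   run_batch("rasterize", "*.xml")
print(build_command("render", ["a.txt", "b.txt"], workers=4))
```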

Funding

This project was funded in part by the European Union (ERC, MiDRASH, project number 101071829).
