Synthetic document rendering with parallel ALTO output

PangoLine

PangoLine is a basic tool to render raw (horizontal) text into PDF documents and create parallel ALTO files for each page containing baseline and bounding box information.

It is intended to support the rendering of most of the world's writing systems in order to create synthetic page-level training data for automatic text recognition (ATR) systems. Functionality is fairly basic for now. PDF output is single-column, justified text without word breaking. Text is split automatically across pages once a page is full.

Installation

You'll need PyGObject and the Pango/Cairo libraries on your system. As PyGObject is only shipped in source form, this also requires a C compiler and the usual build environment dependencies. An easier way is to use conda:

~> conda create --name pangoline-py3.11 -c conda-forge python=3.11
~> conda activate pangoline-py3.11
~> conda install -c conda-forge pygobject pango cairo click jinja2 rich pypdfium2 lxml pillow

Afterwards either install from PyPI:

~> pip install pangoline-tool

or directly from the checked out git repository:

~> pip install --no-deps .

Usage

Rendering

PangoLine first renders text into vector PDFs and ALTO facsimiles using configurable "physical" dimensions.

~> pangoline render doc.txt
Rendering ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Various options to direct rendering such as page size, margins, language, and base direction can be manually set, for example:

~> pangoline render -p 216 279 -l en-us -f "Noto Sans 24" doc.txt

Text can also be styled with Pango Markup. Parsing is disabled by default but can be enabled with a switch:

~> pangoline render --markup doc.txt
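
Once the input is parsed as Pango Markup, literal `&`, `<` and `>` characters in the text have to be written as XML entities. The following is a minimal sketch for preparing such an input line with the Python standard library; the helper name `styled` and the sample sentence are illustrative and not part of PangoLine.

```python
from xml.sax.saxutils import escape


def styled(text: str, tag: str) -> str:
    """Wrap escaped text in a Pango Markup convenience tag such as b, i or u."""
    return f"<{tag}>{escape(text)}</{tag}>"


# Literal text is escaped, markup tags are left intact.
line = escape("Salt & pepper, 1 < 2: ") + styled("bold", "b") + " and " + styled("italic", "i")
print(line)
```

The printed line can be saved to a text file and rendered with the `--markup` switch.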

It is possible to randomly apply styles to Unicode word segments in the text. One or more styles will be randomly selected from a configurable list of styles:

~> pangoline render --random-markup-probability 0.01 doc.txt

The given value is the probability that at least one style is applied to any particular segment. A subset of all available styles is enabled by default when a probability greater than 0 is given. To change the list of possible styles:

~> pangoline render --random-markup-probability 0.01 --random-markup style_italic --random-markup variant_smallcaps doc.txt

The semantics of each value can be found in the Pango documentation.
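
To illustrate what the probability means, here is a rough sketch of per-segment style selection. It is not PangoLine's implementation: the function `pick_styles` is hypothetical, splitting on whitespace stands in for Unicode word segmentation, and choosing the number of styles uniformly is an assumption.

```python
import random


def pick_styles(styles, probability, rng):
    """Return a possibly empty list of styles for one word segment."""
    if rng.random() >= probability:
        return []  # the segment stays unstyled
    # At least one style is applied; the uniform count is an assumption.
    count = rng.randint(1, len(styles))
    return rng.sample(styles, count)


rng = random.Random(0)  # fixed seed so repeated runs give the same choices
styles = ["style_italic", "variant_smallcaps"]
for word in "a short line of sample text".split():
    print(word, pick_styles(styles, 0.5, rng))
```

With a probability of 0.01, as in the commands above, roughly one segment in a hundred would receive one or more styles.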

Styling with color is treated slightly differently from other styles. In general, colors are selected with the foreground_* style. As Pango knows a large number of colors, the foreground_random alias enables all of them at once:

~> pangoline render --random-markup-probability 0.01 --random-markup foreground_random doc.txt

Rasterization

In a second step those vector files can be rasterized into PNGs and the coordinates in the ALTO files scaled to the selected resolution (300 dpi by default):

~> pangoline rasterize doc.0.xml doc.1.xml ...
Rasterizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
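
Scaling coordinates to a raster resolution is a unit conversion. The sketch below shows the arithmetic only; it assumes that vector coordinates are either PostScript points (the unit PDF uses) or millimetres, and the helper names are hypothetical. Which unit PangoLine writes into its ALTO files is not stated here.

```python
def pt_to_px(points: float, dpi: int = 300) -> int:
    """Convert PostScript points (1/72 inch) to pixels at the given resolution."""
    return round(points * dpi / 72)


def mm_to_px(mm: float, dpi: int = 300) -> int:
    """Convert millimetres to pixels at the given resolution."""
    return round(mm * dpi / 25.4)


# A 216 x 279 page, read as millimetres, at 300 dpi.
print(mm_to_px(216), mm_to_px(279))
# One inch (72 points) maps to exactly `dpi` pixels.
print(pt_to_px(72), pt_to_px(72, dpi=150))
```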

Rasterized files and their ALTOs can be used as is as ATR training data.
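
For downstream tooling, line geometry can be read back from an ALTO file with the standard library. This is a sketch under assumptions: the embedded fragment is a hypothetical minimal ALTO v4 document rather than real PangoLine output, and the baseline is assumed to be stored as a flat list of coordinates in the `BASELINE` attribute.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical minimal ALTO fragment; real files carry more metadata.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine HPOS="100" VPOS="200" WIDTH="1800" HEIGHT="60" BASELINE="100 250 1900 250">
      <String CONTENT="example line"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def read_lines(root):
    """Yield (text, bounding box, baseline points) for every TextLine."""
    for line in root.findall(".//a:TextLine", NS):
        text = " ".join(s.get("CONTENT", "") for s in line.findall("a:String", NS))
        bbox = tuple(float(line.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        # Accept both "x y x y" and "x,y x,y" spellings of the point list.
        nums = [float(v) for v in line.get("BASELINE", "").replace(",", " ").split()]
        baseline = list(zip(nums[0::2], nums[1::2]))
        yield text, bbox, baseline


for text, bbox, baseline in read_lines(ET.fromstring(SAMPLE)):
    print(text, bbox, baseline)
```

To read a file on disk, replace `ET.fromstring(SAMPLE)` with `ET.parse(path).getroot()`.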

To obtain slightly more realistic input images it is possible to overlay the rasterized text onto images of writing surfaces.

~> pangoline rasterize -w ~/background_1.jpg doc.0.xml doc.1.xml ...

Rasterization can be invoked with multiple background images in which case they will be sampled randomly for each output page. A tarball with 70 empty paper backgrounds of different origins, digitization qualities, and states of preservation can be found here.

For larger collections of texts it is advisable to parallelize processing, especially for rasterization with overlays:

~> pangoline --workers 8 render *.txt
~> pangoline --workers 8 rasterize *.xml
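
Both stages can also be chained from a small driver script. The sketch below only uses the options shown in this document (`--workers`, `render`, `rasterize`, `-w`); the function names are hypothetical, and it assumes that the ALTO files are written to the current working directory.

```python
import subprocess
from pathlib import Path


def render_cmd(texts, workers=8):
    """Command line for the rendering stage."""
    return ["pangoline", "--workers", str(workers), "render", *map(str, texts)]


def rasterize_cmd(xmls, workers=8, background=None):
    """Command line for the rasterization stage, optionally with one overlay image."""
    cmd = ["pangoline", "--workers", str(workers), "rasterize"]
    if background is not None:
        cmd += ["-w", str(background)]
    return cmd + [str(x) for x in xmls]


def run_pipeline(text_dir=".", workers=8, background=None):
    """Render every *.txt in text_dir, then rasterize the resulting ALTO files."""
    texts = sorted(Path(text_dir).glob("*.txt"))
    subprocess.run(render_cmd(texts, workers), check=True)
    # Assumption: the ALTO files end up in the current working directory.
    xmls = sorted(Path.cwd().glob("*.xml"))
    subprocess.run(rasterize_cmd(xmls, workers, background), check=True)


# Show the commands without executing them.
print(" ".join(render_cmd(["a.txt", "b.txt"])))
print(" ".join(rasterize_cmd(["a.0.xml"], background="paper.jpg")))
```

Calling `run_pipeline()` in a directory of text files would execute both stages; `check=True` stops the script if either command fails.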

Limitations

In order to achieve proper typesetting quality, Pango requires placing the whole text into a single layout before splitting it into individual pages by translating each line of the layout onto a page surface. This approach limits the maximum print space of a single text to 739.8 meters, roughly 3000 pages depending on paper size and margins, before the 32-bit integer holding the baseline y-offset overflows.
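
The 739.8 meter figure can be reproduced with a short calculation, assuming the offset is a signed 32-bit integer counted in Pango units (1024 per point, with 72 points per inch). The page height and margins in the page estimate are hypothetical values chosen for illustration, not PangoLine's defaults.

```python
PANGO_SCALE = 1024         # Pango units per point
INT32_MAX = 2**31 - 1      # largest signed 32-bit value
MM_PER_POINT = 25.4 / 72   # one point is 1/72 inch

max_points = INT32_MAX / PANGO_SCALE
max_mm = max_points * MM_PER_POINT
print(f"maximum layout height: {max_mm / 1000:.1f} m")  # 739.8 m


def max_pages(page_height_mm, top_margin_mm, bottom_margin_mm):
    """Number of full pages that fit before the offset overflows."""
    print_height = page_height_mm - top_margin_mm - bottom_margin_mm
    return int(max_mm // print_height)


# A4 height with hypothetical 25 mm top and bottom margins.
print(max_pages(297, 25, 25))
```

Smaller margins or taller pages increase the usable height per page and therefore lower the page count at which the limit is reached.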

Funding

This project was funded in part by the European Union (ERC, MiDRASH, project number 101071829).
