Synthetic document rendering with parallel ALTO output
PangoLine
PangoLine is a basic tool to render raw (horizontal) text into PDF documents and create parallel ALTO files for each page containing baseline and bounding box information.
It is intended to support the rendering of most of the world's writing systems in order to create synthetic page-level training data for automatic text recognition (ATR) systems. Functionality is fairly basic for now: PDF output is single-column, justified text without word breaking. Paragraphs are split automatically once a page is full.
Installation
You'll need PyGObject and the Pango/Cairo libraries on your system. As PyGObject is only shipped in source form, this also requires a C compiler and the usual build environment dependencies. An easier way is to use conda:
~> conda create --name pangoline-py3.11 -c conda-forge python=3.11
~> conda activate pangoline-py3.11
~> conda install -c conda-forge pygobject pango cairo click jinja2 rich pypdfium2 lxml pillow
~> pip install --no-deps .
Usage
Rendering
PangoLine renders text first into vector PDFs and ALTO facsimiles using some configurable "physical" dimensions.
~> pangoline render doc.txt
Rendering ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Various options to direct rendering such as page size, margins, language, and base direction can be manually set, for example:
~> pangoline render -p 216 279 -l en-us -f "Noto Sans 24" doc.txt
Text can also be styled with Pango Markup. Markup parsing is enabled by default but can be disabled with a switch:
~> pangoline render --no-markup doc.txt
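With markup parsing enabled, literal `&`, `<`, and `>` characters in the input would be read as markup syntax. As a minimal sketch of preparing such text (the helper name `to_markup` is hypothetical and not part of PangoLine), plain text can be escaped before wrapping it in Pango Markup tags such as `<b>`:

```python
from html import escape


def to_markup(text: str, bold: bool = False) -> str:
    """Escape markup-sensitive characters, optionally wrapping the text in <b>."""
    # escape() replaces &, <, > and both quote characters with entities.
    safe = escape(text)
    return f"<b>{safe}</b>" if bold else safe


print(to_markup("Fish & chips <today>", bold=True))
# prints: <b>Fish &amp; chips &lt;today&gt;</b>
```

If the input text is not meant to carry any styling, passing `--no-markup` avoids the need for escaping altogether.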
Rasterization
In a second step, those vector files can be rasterized into PNGs and the coordinates in the ALTO files scaled to the selected resolution (300 dpi by default):
~> pangoline rasterize doc.0.xml doc.1.xml ...
Rasterizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
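The coordinate scaling amounts to a unit conversion. The following is a minimal sketch, not PangoLine's actual code: it assumes the vector-stage measurements are in millimetres (the `-p 216 279` example suggests this, but it is not stated here), and the names `mm_to_px` and `scale_box` are hypothetical.

```python
MM_PER_INCH = 25.4


def mm_to_px(value_mm: float, dpi: int = 300) -> int:
    """Convert a length in millimetres to whole pixels at the given resolution."""
    return round(value_mm / MM_PER_INCH * dpi)


def scale_box(box_mm, dpi: int = 300):
    """Scale an (x, y, width, height) box from millimetres to pixels."""
    return tuple(mm_to_px(v, dpi) for v in box_mm)


# Assumed page size of 216 x 279 mm rendered at the default 300 dpi.
print((mm_to_px(216), mm_to_px(279)))
# prints: (2551, 3295)
```

Under these assumptions, a 216 x 279 mm page becomes an image of roughly 2551 x 3295 pixels, and every baseline and bounding box coordinate is scaled by the same factor.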
Rasterized files and their ALTOs can be used as is as ATR training data.
To obtain slightly more realistic input images, it is possible to overlay the rasterized text onto images of writing surfaces:
~> pangoline rasterize -w ~/background_1.jpg doc.0.xml doc.1.xml ...
Rasterization can be invoked with multiple background images in which case they will be sampled randomly for each output page. A tarball with 70 empty paper backgrounds of different origins, digitization qualities, and states of preservation can be found here.
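The per-page sampling can be pictured as follows. This is an illustrative sketch only: it assumes uniform sampling with replacement, which is not confirmed here, and the function name `assign_backgrounds` and the file names are hypothetical.

```python
import random


def assign_backgrounds(pages, backgrounds, seed=None):
    """Pick one background per page, uniformly at random with replacement."""
    if not backgrounds:
        raise ValueError("at least one background image is required")
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    return {page: rng.choice(backgrounds) for page in pages}


pages = ["doc.0.xml", "doc.1.xml", "doc.2.xml"]
backgrounds = ["background_1.jpg", "background_2.jpg"]
for page, background in assign_backgrounds(pages, backgrounds, seed=0).items():
    print(f"{page} -> {background}")
```

Because sampling is with replacement in this sketch, the same background can appear on several pages, which is what one would expect when there are more pages than backgrounds.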
For larger collections of texts it is advisable to parallelize processing, especially for rasterization with overlays:
~> pangoline --workers 8 render *.txt
~> pangoline --workers 8 rasterize *.xml
Funding
This project was funded in part by the European Union (ERC, MiDRASH, project number 101071829).