A synthetic data generator for training OCR models

docTR-Synth-Generator

A tool to generate synthetic OCR text recognition datasets - made for docTR

Features

Zero-config: generate a dataset with nothing but an output directory - real words, matching fonts and background images are downloaded automatically.
Multilingual by language code: languages=["de", "ru", "ar", ...] resolves both the words and the fonts for each script (~85 languages), with correct complex-script shaping and right-to-left layout for Arabic/Hebrew.
No more dropped words: any character a local font cannot render triggers an on-demand download of a font that can, instead of silently skipping the word.
Realistic output: supersampled anti-aliasing, background-aware ink colour and contrast (dark-on-light and light-on-dark), faux-bold/outlines, and scanner/camera-style degradations (JPEG artifacts, sensor noise, blur).
Controllable balancing: explicit per-language allocation, a stratified train/val split, optional character-coverage guarantees, and a balance report.
Recognition and detection: produce word/line crops for recognition, or full document-like pages with per-word polygons for detection - both in the formats docTR's training references expect.
Fast & memory-bounded: font objects and decoded backgrounds are cached, with a configurable cache size.

Quickstart (zero configuration)

You no longer need to provide a wordlist or a font directory. With nothing but an output directory and a count, the generator downloads real words for the requested language(s) and automatically fetches matching open-source fonts:

from generator import GenerationConfig, SyntheticDatasetGenerator

config = GenerationConfig(output_dir="output_dataset", num_images=1000)  # English by default
SyntheticDatasetGenerator(config).generate_dataset()

Multilingual is a one-liner - a language code selects both its words and its script, so the correct fonts are pulled in for you:

config = GenerationConfig(
    output_dir="output_dataset",
    num_images=10000,
    languages=["en", "de", "ru", "el", "ar"],  # words + fonts resolved automatically
    bg_image_dir="resources/background_images",  # optional; blank backgrounds otherwise
)
SyntheticDatasetGenerator(config).generate_dataset()

The first run downloads word lists and fonts from public mirrors and caches them (corpus_cache_dir / font_cache_dir). Subsequent runs are offline. To run fully offline from the start, supply your own wordlist_path and font_dir.
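The download-once-then-offline behaviour described above can be illustrated with a small cache-first fetch helper. This is a minimal sketch, not the package's implementation: `cached_fetch`, the hash-based file naming and the injectable `fetch` callable are all assumptions made for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable


def _http_get(url: str) -> bytes:
    """Fetch raw bytes over the network (only reached on a cache miss)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def cached_fetch(url: str, cache_dir: Path, fetch: Callable[[str], bytes] = _http_get) -> Path:
    """Return a local path for `url`, downloading only if it is not cached yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: hash the URL so any URL maps to a safe filename.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        tmp = target.with_suffix(".part")
        tmp.write_bytes(fetch(url))
        tmp.replace(target)  # rename last, so an interrupted download never looks complete
    return target
```

A second call with the same URL and cache directory returns the cached file without touching the network, which is why later runs can work offline.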

Bring your own resources (classic usage)

Supplying a wordlist_path and/or font_dir still works and takes precedence over the automatic downloads:

config = GenerationConfig(
    wordlist_path="resources/corpus/latin_ext_balanced_words.txt",
    font_dir="resources/font",  # e.g. the extracted fonts_v1 release
    bg_image_dir="resources/background_images",  # bundled with the repo
    output_dir="output_dataset",
    num_images=1000,
    val_percent=0.2,
    num_workers=6,
    # If a word contains characters none of your local fonts cover, download a
    # matching font instead of dropping the word (default: True):
    auto_download_fonts=True,
)
SyntheticDatasetGenerator(config).generate_dataset()

Automatic fonts

When no local font covers every character of a word, a matching open-source font (from the Noto family, which aims to cover every script in Unicode) is downloaded, verified for coverage and cached. This prevents words from being silently skipped - the main cause of biased, Latin-only datasets. Disable with auto_download_fonts=False.
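The coverage check behind this fallback can be sketched as follows. The helper names (`font_codepoints`, `missing_chars`, `pick_font`) are hypothetical and not part of the package; reading the character map relies on fontTools' `TTFont.getBestCmap()`, and the rest is plain Python.

```python
from typing import Dict, Optional, Set


def font_codepoints(font_path: str) -> Set[int]:
    """Read the code points a font maps to glyphs (requires the fontTools package)."""
    from fontTools.ttLib import TTFont  # lazy import: the helpers below stay stdlib-only

    font = TTFont(font_path)
    try:
        return set(font.getBestCmap() or {})
    finally:
        font.close()


def missing_chars(word: str, codepoints: Set[int]) -> Set[str]:
    """Characters of `word` that the font cannot render (whitespace is ignored)."""
    return {ch for ch in word if not ch.isspace() and ord(ch) not in codepoints}


def pick_font(word: str, fonts: Dict[str, Set[int]]) -> Optional[str]:
    """Return the first font covering every character, or None if none does."""
    for name, codepoints in fonts.items():
        if not missing_chars(word, codepoints):
            return name
    return None  # this is the point where a fallback font would be downloaded
```

When `pick_font` returns None, a generator following the behaviour described above would fetch a font for the missing script rather than drop the word.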

Automatic words

When no wordlist_path is given, real frequency-ranked words for languages are downloaded (from the open FrequencyWords project, ~85 languages) and cleaned (script filtering, length bounds, punctuation removal). Two realism helpers are applied by default and can be tuned or disabled:

casing_variant_prob (0.3): adds Title/UPPERCASE variants so the model sees capital letters (frequency lists are almost all lowercase).
numeric_token_ratio (0.05): mixes in realistic numbers, dates, prices and codes - the kind of content real documents are full of.
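To make the two helpers concrete, here is a minimal sketch of how casing variants and numeric tokens could be mixed into a word list. The function names, the token formats and the reading of the ratio as "extra tokens relative to the word count" are assumptions for illustration; the package's actual logic may differ.

```python
import random
from typing import List


def add_casing_variants(words: List[str], prob: float, rng: random.Random) -> List[str]:
    """Keep every word and, with probability `prob`, add a Title or UPPERCASE copy."""
    out: List[str] = []
    for word in words:
        out.append(word)
        if rng.random() < prob:
            out.append(rng.choice([word.title(), word.upper()]))
    return out


def numeric_token(rng: random.Random) -> str:
    """Produce one number-like token: plain integer, price, date or short code."""
    kind = rng.choice(["int", "price", "date", "code"])
    if kind == "int":
        return str(rng.randint(0, 99999))
    if kind == "price":
        return f"{rng.randint(0, 999)}.{rng.randint(0, 99):02d}"
    if kind == "date":
        return f"{rng.randint(1, 28):02d}.{rng.randint(1, 12):02d}.{rng.randint(1990, 2030)}"
    return f"{rng.choice('ABCDEFGH')}{rng.randint(0, 9999):04d}"


def mix_in_numbers(words: List[str], ratio: float, rng: random.Random) -> List[str]:
    """Append numeric tokens; here `ratio` is taken relative to the word count."""
    extra = [numeric_token(rng) for _ in range(int(len(words) * ratio))]
    return words + extra
```

With `prob=0.3` roughly a third of the words gain a capitalised variant, which compensates for frequency lists being almost entirely lowercase.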

Automatic backgrounds

When no bg_image_dir is given, a curated set of background images is downloaded and cached automatically (instead of producing blank backgrounds). Supplying your own bg_image_dir takes precedence and skips the download entirely - exactly like fonts and word lists. Disable with auto_download_backgrounds=False, point background_cache_dir somewhere persistent, or pass a background_manifest_url (a newline-separated list of filenames/URLs) to use a different collection.
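A manifest of the kind mentioned above (one filename or URL per line) could be turned into a download list roughly like this. It is a sketch under assumptions: `parse_manifest` is a hypothetical helper, and resolving bare filenames relative to the manifest's own URL is a guess at reasonable behaviour, not documented behaviour.

```python
from typing import List
from urllib.parse import urljoin


def parse_manifest(text: str, manifest_url: str) -> List[str]:
    """Turn a newline-separated manifest into absolute URLs, skipping blank lines."""
    urls: List[str] = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue
        # urljoin leaves absolute URLs untouched and resolves bare filenames
        # against the location of the manifest (assumed convention).
        urls.append(urljoin(manifest_url, entry))
    return urls
```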

Dataset balancing

For multilingual runs the language mix is explicit and controllable instead of being dominated by whichever language has the most words:

config = GenerationConfig(
    output_dir="output_dataset",
    num_images=30000,
    languages=["en", "de", "ru"],
    language_balance="balanced",  # "balanced" (default) or "proportional"
    # language_weights={"en": 0.6, "de": 0.3, "ru": 0.1},  # or set explicit weights
    min_char_coverage=20,  # ensure every character appears >= N times (0 = off)
)

The split is stratified: train and val share the same language mix and exact words do not leak from train into val. A balance report is printed before generation (per-language train/val counts, train/val overlap, distinct/rare characters, word-length statistics); silence it with print_balance_report=False.
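The two ideas above, an explicit per-language allocation and a split in which no word appears on both sides, can be sketched in a few lines. This is an illustration only: `allocate` and `split_words` are hypothetical names, and the largest-remainder rounding is one reasonable choice rather than the package's confirmed method.

```python
import random
from typing import Dict, List, Tuple


def allocate(num_images: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Split `num_images` across languages by weight (largest-remainder rounding)."""
    total = sum(weights.values())
    raw = {lang: num_images * w / total for lang, w in weights.items()}
    counts = {lang: int(value) for lang, value in raw.items()}
    leftover = num_images - sum(counts.values())
    # Hand the remaining images to the languages with the largest fractional part.
    by_remainder = sorted(raw, key=lambda lang: raw[lang] - counts[lang], reverse=True)
    for lang in by_remainder[:leftover]:
        counts[lang] += 1
    return counts


def split_words(words: List[str], val_percent: float, rng: random.Random) -> Tuple[List[str], List[str]]:
    """Split the distinct words of one language so train and val never share a word."""
    distinct = sorted(set(words))
    rng.shuffle(distinct)
    n_val = int(round(len(distinct) * val_percent))
    return distinct[n_val:], distinct[:n_val]
```

Equal weights correspond to a balanced mix, while weights proportional to each language's word count correspond to a proportional one. Applying `split_words` per language keeps the language mix identical in both splits.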

Detection datasets

Set task="detection" to generate document-like pages with a 4-point polygon for every word, ready for docTR detection training:

config = GenerationConfig(
    task="detection",
    output_dir="detection_dataset",
    num_images=5000,  # = number of pages
    languages=["en", "de"],  # words + fonts resolved automatically
    bg_image_dir="resources/background_images",
    output_jpeg=True,
)
SyntheticDatasetGenerator(config).generate_dataset()

Each split is written as images/ plus a labels.json in the exact docTR format (absolute pixel coordinates):

{
  "00000.jpg": {
    "img_dimensions": [1462, 1056],
    "img_hash": "<sha256 of the image>",
    "polygons": [[[x1, y1], [x2, y2], [x3, y3], [x4, y4]], ...]
  }
}
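As a quick sanity check of a labels.json in the layout shown above, the polygons can be reduced to axis-aligned boxes. This is a minimal sketch; `polygons_to_boxes` is a hypothetical helper and the sample entry is made up for illustration.

```python
import json
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def polygons_to_boxes(labels: Dict[str, dict]) -> Dict[str, List[Box]]:
    """Convert each 4-point word polygon into its enclosing axis-aligned box."""
    boxes: Dict[str, List[Box]] = {}
    for name, entry in labels.items():
        page: List[Box] = []
        for poly in entry["polygons"]:
            if len(poly) != 4:
                raise ValueError(f"{name}: expected 4 points, got {len(poly)}")
            xs = [point[0] for point in poly]
            ys = [point[1] for point in poly]
            page.append((min(xs), min(ys), max(xs), max(ys)))
        boxes[name] = page
    return boxes


# Made-up sample in the same shape as the labels.json shown above.
sample = json.loads(
    '{"00000.jpg": {"img_dimensions": [1462, 1056], "img_hash": "abc", '
    '"polygons": [[[10, 20], [110, 22], [109, 52], [9, 50]]]}}'
)
print(polygons_to_boxes(sample))
```

For a real dataset, load the file with `json.load` from the split directory instead of the inline sample.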

The detection path reuses the same fonts, ink styling, contrast, backgrounds and degradations as the recognition path, and lays words out in paragraph blocks with margins, line wrapping, occasional headings/indents, numbers and dates, and an optional small global page rotation (the polygons rotate with the page, giving rotated boxes usable with docTR's use_polygons=True). Tune layout with the det_* config fields (det_page_*_range, det_font_size_range, det_max_words_per_page, det_max_blocks, det_rotation_*, ...). Pages are filled top-to-bottom by the available vertical space, so word count varies naturally with font size.
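The remark that polygons rotate with the page comes down to applying the same rotation to every corner point. The sketch below shows that geometry; `rotate_polygon` is a hypothetical helper, and rotating about the page centre with this sign convention is an assumption, not a description of the package's code.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate_polygon(poly: List[Point], angle_deg: float, center: Point) -> List[Point]:
    """Rotate every corner of a polygon around `center` by `angle_deg` degrees."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated: List[Point] = []
    for x, y in poly:
        dx, dy = x - cx, y - cy
        # Standard 2D rotation; with image coordinates (y pointing down) a
        # positive angle appears clockwise on screen.
        rotated.append((cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t))
    return rotated
```

Because all four corners are transformed, the result is a true rotated quadrilateral rather than an enlarged axis-aligned box, which is what polygon-based detection training needs.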

Backgrounds for detection: only the words you place are labelled, so any text already printed in a background photo becomes an unlabelled false negative. det_plain_background_prob (0.4) mixes in clean generated paper; set it to 1.0 for all-paper pages, or point bg_image_dir at text-free textures (plain paper, surfaces, fabrics) only.

Non-Latin scripts work out of the box: words and fonts are resolved per language, complex scripts are shaped correctly (Arabic joining, Indic conjuncts), and right-to-left languages (Arabic, Hebrew, ...) are laid out right-to-left so pages read naturally. For example languages=["ar"], ["he"], ["zh"] or ["hi"].

Realism

Rendered crops are meant to match real captured documents rather than clean synthetic glyphs. The pipeline applies the following steps, all of which are configurable:

Supersampled rendering with high-quality downsampling for photographic anti-aliasing (supersample).
Background-aware ink: dark-on-light and light-on-dark text, a controllable (often deliberately low) contrast range, neutral or coloured ink, variable opacity, faux-bold and outlines.
Glyph-space augmentations before compositing (rotation, perspective, ink erosion) and image-space degradations after (Gaussian sensor noise, JPEG compression artifacts, blur, brightness/contrast jitter) - matching how a real capture degrades the whole frame.
Optional JPEG output (output_jpeg=True) to match real document captures.
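The background-aware ink item in the list above can be illustrated with a small luminance-based sketch. Everything here is an assumption for illustration: the function names, the 128 threshold and the contrast range are not taken from the package. Only the Rec. 601 luma weights are standard.

```python
import random
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]


def mean_luminance(pixels: Sequence[RGB]) -> float:
    """Average Rec. 601 luma of the background patch, on a 0..255 scale."""
    return sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels) / len(pixels)


def pick_ink_gray(bg_luma: float, rng: random.Random,
                  contrast_range: Tuple[float, float] = (40.0, 160.0)) -> int:
    """Choose a grey ink level: darker than a light background, lighter than a dark one."""
    contrast = rng.uniform(*contrast_range)  # low values give deliberately faint text
    if bg_luma >= 128:
        ink = bg_luma - contrast  # dark-on-light
    else:
        ink = bg_luma + contrast  # light-on-dark
    return int(min(255.0, max(0.0, ink)))
```

Sampling the contrast from a range, rather than always using black or white ink, is what produces the low-contrast samples mentioned above.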

Performance & memory

Font objects and decoded background images are cached, giving a large throughput improvement over re-loading them per sample. Memory stays bounded and tunable:

bg_cache_size (16): number of decoded backgrounds held in memory per worker. Lower it on memory-constrained machines or with many workers; raise it for more background variety.
bg_max_dimension (2000): downscales very large backgrounds on load so the cache stays light regardless of source resolution.
Caches are per worker process, so peak memory scales roughly with num_workers.
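A bounded cache of this kind can be sketched with a least-recently-used policy. `BoundedCache` is a hypothetical class written for illustration, and LRU eviction is an assumption; the package may evict differently.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedCache:
    """Keep at most `max_size` loaded items, evicting the least recently used one."""

    def __init__(self, loader: Callable[[Hashable], Any], max_size: int = 16) -> None:
        self._loader = loader
        self._max_size = max_size
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = self._loader(key)  # expensive step, e.g. decoding an image
        self._items[key] = value
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # drop the oldest entry
        return value

    def __len__(self) -> int:
        return len(self._items)
```

Since each worker process would hold its own instance, total memory is roughly the per-cache footprint multiplied by the number of workers, matching the note above.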

Configuration reference

All behaviour is controlled through GenerationConfig; see the dataclass docstring in generator/components/config.py for every field and its default.

Resources

fonts_v1: A collection of fonts used for text rendering can be downloaded from Fonts_v1.
background_images_v1: A collection of background images used for text rendering can be downloaded from Background_Images_v1.

Citation

If you wish to cite this project, please refer to the base project citation; feel free to use this BibTeX reference:

@misc{docTR-Synth-Generator,
    title={docTR-Synth-Generator: A tool to generate synthetic OCR text datasets - made for docTR},
    author={{Dittrich, Felix}},
    year={2026},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/felixdittrich92/docTR-Synth-Generator}}
}

The automatic word lists are derived from the FrequencyWords project (OpenSubtitles-based) and fonts from Google Fonts / Noto; please respect their respective licenses when redistributing generated datasets.

Development & tests

The test suite is fully offline - it builds a tiny in-memory font with fontTools and monkeypatches the network downloads, so no fonts or corpora are fetched while testing. Run it with:

make test      # pytest + coverage
make quality   # ruff + mypy
make style     # auto-format and fix

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create.

Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Add your Changes
Run the tests and quality checks (make test, make style and make quality)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)

License

Distributed under the Apache 2.0 License. See LICENSE for more information.

