A synthetic data generator for training OCR models
docTR-Synth-Generator
A tool to generate synthetic OCR datasets - made for docTR
Features
- Zero-config: generate a dataset with nothing but an output directory - real words, matching fonts and background images are downloaded automatically.
- Multilingual by language code: languages=["de", "ru", "ar", ...] resolves both the words and the fonts for each script (~85 languages), with correct complex-script shaping and right-to-left layout for Arabic/Hebrew.
- No more dropped words: any character a local font cannot render triggers an on-demand download of a font that can, instead of silently skipping the word.
- Realistic output: supersampled anti-aliasing, background-aware ink colour and contrast (dark-on-light and light-on-dark), faux-bold/outlines, and scanner/camera-style degradations (JPEG artifacts, sensor noise, blur).
- Controllable balancing: explicit per-language allocation, a stratified train/val split, optional character-coverage guarantees, and a balance report.
- Recognition and detection: produce word/line crops for recognition, or full document-like pages with per-word polygons for detection - both in the formats docTR's training references expect.
- Fast & memory-bounded: font objects and decoded backgrounds are cached, with a configurable cache size.
Quickstart (zero configuration)
You no longer need to provide a wordlist or a font directory. With nothing but an output directory and a count, the generator downloads real words for the requested language(s) and automatically fetches matching open-source fonts:
from generator import GenerationConfig, SyntheticDatasetGenerator
config = GenerationConfig(output_dir="output_dataset", num_images=1000) # English by default
SyntheticDatasetGenerator(config).generate_dataset()
Multilingual is a one-liner - a language code selects both its words and its script, so the correct fonts are pulled in for you:
config = GenerationConfig(
    output_dir="output_dataset",
    num_images=10000,
    languages=["en", "de", "ru", "el", "ar"],  # words + fonts resolved automatically
    bg_image_dir="resources/background_images",  # optional; blank backgrounds otherwise
)
SyntheticDatasetGenerator(config).generate_dataset()
The first run downloads word lists and fonts from public mirrors and caches them (corpus_cache_dir / font_cache_dir). Subsequent runs are offline. To run fully offline from the start, supply your own wordlist_path and font_dir.
Bring your own resources (classic usage)
Supplying a wordlist_path and/or font_dir still works and takes precedence
over the automatic downloads:
config = GenerationConfig(
    wordlist_path="resources/corpus/latin_ext_balanced_words.txt",
    font_dir="resources/font",  # e.g. the extracted fonts_v1 release
    bg_image_dir="resources/background_images",  # bundled with the repo
    output_dir="output_dataset",
    num_images=1000,
    val_percent=0.2,
    num_workers=6,
    # If a word contains characters none of your local fonts cover, download a
    # matching font instead of dropping the word (default: True):
    auto_download_fonts=True,
)
SyntheticDatasetGenerator(config).generate_dataset()
Automatic fonts
When no local font covers every character of a word, a matching open-source font
(from the Noto family, which aims to cover every script in Unicode) is downloaded,
verified for coverage and cached. This prevents words from being silently
skipped - the main cause of biased, Latin-only datasets. Disable with
auto_download_fonts=False.
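The fallback described above can be pictured with a small sketch. It is not the package's implementation: the coverage table, the font names and the pick_font / needs_download helpers are all hypothetical, and the real tool reads coverage from font files rather than from hand-written sets.

```python
from typing import Dict, Optional, Set

# Hypothetical coverage table: font name -> characters it can render.
LOCAL_FONTS: Dict[str, Set[str]] = {
    "LatinSans": set("abcdefghijklmnopqrstuvwxyzäöüß"),
    "CyrillicSans": set("абвгдежзийклмнопрстуфхцчшщъыьэюя"),
}


def pick_font(word: str, fonts: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first font covering every character of `word`, else None."""
    needed = set(word)
    for name, coverage in fonts.items():
        if needed <= coverage:  # every needed character is renderable
            return name
    return None


def needs_download(word: str, fonts: Dict[str, Set[str]]) -> bool:
    """True when no local font covers the word, i.e. a download would be triggered."""
    return pick_font(word, fonts) is None
```

A word is only rendered with a font that covers all of its characters; when needs_download returns True the word is kept and a covering font is fetched, rather than the word being dropped.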
Automatic words
When no wordlist_path is given, real frequency-ranked words for the requested
languages are downloaded (from the open FrequencyWords project, ~85 languages)
and cleaned (script filtering, length bounds, punctuation removal).
Two realism helpers are applied by default and can be tuned or disabled:
- casing_variant_prob (0.3): adds Title/UPPERCASE variants so the model sees capital letters (frequency lists are almost all lowercase).
- numeric_token_ratio (0.05): mixes in realistic numbers, dates, prices and codes - the kind of content real documents are full of.
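To make the two helpers concrete, here is a minimal sketch of what such a step could look like. The add_realism function, its token formats and the way the ratio is applied (relative to the original word count) are assumptions for illustration, not the package's actual logic.

```python
import random
from typing import List


def add_realism(words: List[str], casing_prob: float, numeric_ratio: float,
                rng: random.Random) -> List[str]:
    """Return the words plus casing variants and a share of numeric tokens."""
    out = list(words)
    for word in words:
        if rng.random() < casing_prob:
            # Add a Title-case or UPPERCASE variant of the lowercase word.
            out.append(rng.choice([word.title(), word.upper()]))
    # Assumption: numeric tokens are sized relative to the original word count.
    n_numeric = round(len(words) * numeric_ratio)
    for _ in range(n_numeric):
        kind = rng.choice(["number", "date", "price"])
        if kind == "number":
            out.append(str(rng.randint(0, 99999)))
        elif kind == "date":
            out.append(f"{rng.randint(1, 28):02d}.{rng.randint(1, 12):02d}.{rng.randint(1990, 2030)}")
        else:
            out.append(f"{rng.randint(0, 999)}.{rng.randint(0, 99):02d}")
    return out


words = ["haus", "baum", "straße", "wasser"] * 25  # 100 lowercase words
mixed = add_realism(words, casing_prob=0.3, numeric_ratio=0.05, rng=random.Random(0))
```

Setting either probability to 0 in this sketch leaves the corresponding kind of token out entirely.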
Automatic backgrounds
When no bg_image_dir is given, a curated set of background images is downloaded
and cached automatically (instead of producing blank backgrounds). Supplying your
own bg_image_dir takes precedence and skips the download entirely - exactly like
fonts and word lists. Disable with auto_download_backgrounds=False, point
background_cache_dir somewhere persistent, or pass a background_manifest_url
(a newline-separated list of filenames/URLs) to use a different collection.
Dataset balancing
For multilingual runs the language mix is explicit and controllable instead of being dominated by whichever language has the most words:
config = GenerationConfig(
    output_dir="output_dataset",
    num_images=30000,
    languages=["en", "de", "ru"],
    language_balance="balanced",  # "balanced" (default) or "proportional"
    # language_weights={"en": 0.6, "de": 0.3, "ru": 0.1},  # or set explicit weights
    min_char_coverage=20,  # ensure every character appears >= N times (0 = off)
)
The split is stratified: train and val share the same language mix and exact
words do not leak from train into val. A balance report is printed before
generation (per-language train/val counts, train/val overlap, distinct/rare
characters, word-length statistics); silence it with
print_balance_report=False.
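As a rough illustration of the two ideas above (explicit allocation and a leak-free stratified split), consider the sketch below. The allocate and stratified_split helpers are hypothetical and simplified; the package's own balancing may round, weight or split differently.

```python
import random
from typing import Dict, List, Optional, Tuple


def allocate(num_images: int, languages: List[str],
             weights: Optional[Dict[str, float]] = None) -> Dict[str, int]:
    """Split num_images across languages; equal shares unless weights are given."""
    if weights is None:
        weights = {lang: 1.0 for lang in languages}
    total = sum(weights[lang] for lang in languages)
    counts = {lang: int(num_images * weights[lang] / total) for lang in languages}
    # Hand the rounding remainder to the first languages so the sum is exact.
    remainder = num_images - sum(counts.values())
    for lang in languages[:remainder]:
        counts[lang] += 1
    return counts


def stratified_split(words_by_lang: Dict[str, List[str]], val_percent: float,
                     rng: random.Random) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split each language separately, on distinct words, so no word is in both splits."""
    train: Dict[str, List[str]] = {}
    val: Dict[str, List[str]] = {}
    for lang, words in words_by_lang.items():
        unique = sorted(set(words))
        rng.shuffle(unique)
        n_val = int(len(unique) * val_percent)
        val[lang], train[lang] = unique[:n_val], unique[n_val:]
    return train, val
```

Because the split is done per language and on distinct words, both splits keep the same language mix and an exact word can never appear in train and val at once.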
Vocabulary coverage (recognition)
A recognition model is trained against a fixed character set (docTR's VOCABS).
Real frequency corpora rarely contain every character of that set - rare
accented capitals (ẞ), currency signs, some punctuation - so a model trained
only on downloaded words never sees them. With ensure_vocab_coverage=True
(the default), each language is mapped to its docTR vocab and extra word-like
tokens are synthesised so every renderable vocab character appears in both the
train and val splits:
config = GenerationConfig(
    output_dir="dataset",
    num_images=50000,
    languages=["de"],  # mapped to the "german" vocab automatically
    ensure_vocab_coverage=True,  # default
    vocab_coverage_min_count=3,  # each vocab char appears in >= N train samples
)
- target_vocab overrides the per-language mapping - pass a VOCABS key (e.g. "german") or a literal string of characters to cover. It also enables coverage when you supply your own wordlist_path.
- Coverage is enforced after the train/val split, so a rare character can never land in only one split. This makes num_images a floor: a small, bounded number of coverage samples (proportional to the vocab size, not the dataset) is appended on top.
- Languages with no fixed small vocab (CJK) are skipped automatically, and very large scripts (thousands of CJK ideographs / Hangul syllables) are left to the real corpus rather than synthesised.
- Every synthesised token stays within a single script (a Hebrew character is only ever placed in a Hebrew token, etc.), so each renders with one font. When several languages are generated together, coverage is computed over the union of their vocabs but tokens are never mixed across scripts.
- Coverage prefers repeating real corpus words that contain a rare character, so diacritic combinations are linguistically attested. Synthesis is only a fallback for characters absent from the corpus (rare punctuation, currency, capitals in a lower-cased corpus, or marks like Hebrew niqqud that real text omits) - and even then a combining mark is inserted into a real same-script word after a base letter, never rendered alone on a dotted circle.
- A character that no available font can render (e.g. ฿ inside a Latin-script vocab) is the one case that cannot be covered - that is a font limitation, reported in the log, not a logic gap.
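The coverage check itself is easy to picture. The sketch below counts, for each target character, how many samples contain it and reports the shortfall against a minimum; missing_coverage and the example vocab slice are hypothetical and stand in for whatever the package does internally.

```python
from collections import Counter
from typing import Dict, Iterable


def missing_coverage(words: Iterable[str], vocab: str, min_count: int) -> Dict[str, int]:
    """Map each under-covered vocab character to how many more samples it needs."""
    # Count the number of words containing each character (not raw occurrences).
    counts = Counter(ch for word in words for ch in set(word))
    return {ch: min_count - counts[ch] for ch in vocab if counts[ch] < min_count}


train = ["straße", "über", "haus", "maß", "groß"]
vocab = "aäößẞ€"  # hypothetical slice of a target vocab
gaps = missing_coverage(train, vocab, min_count=3)
```

Running the same check on the train and val splits separately mirrors the per-split guarantee described above: the shortfall of each split is what coverage samples would have to fill.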
Detection datasets
Set task="detection" to generate document-like pages with a 4-point
polygon for every word, ready for
docTR detection training:
config = GenerationConfig(
    task="detection",
    output_dir="detection_dataset",
    num_images=5000,  # = number of pages
    languages=["en", "de"],  # words + fonts resolved automatically
    bg_image_dir="resources/background_images",
    output_jpeg=True,
)
SyntheticDatasetGenerator(config).generate_dataset()
Each split is written as images/ plus a labels.json in the exact docTR
format (absolute pixel coordinates):
{
    "00000.jpg": {
        "img_dimensions": [1462, 1056],
        "img_hash": "<sha256 of the image>",
        "polygons": [[[x1, y1], [x2, y2], [x3, y3], [x4, y4]], ...]
    }
}
It reuses the same fonts, ink styling, contrast, backgrounds and degradations as the recognition path. Pages are filled top-to-bottom by the available vertical space (word count varies naturally with font size), and words are recycled as needed so a page always fills regardless of how many candidate words it is given.
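A quick structural check of a generated labels.json can catch format problems before training. The sketch below only verifies the shape shown above (two image dimensions, four [x, y] points per polygon); check_labels is a hypothetical helper, and the inline sample stands in for a real file you would read with json.load.

```python
import json
from typing import Any, Dict


def check_labels(labels: Dict[str, Any]) -> int:
    """Validate the label structure and return the total number of polygons."""
    total = 0
    for name, entry in labels.items():
        if len(entry["img_dimensions"]) != 2:
            raise ValueError(f"{name}: img_dimensions must have two values")
        for poly in entry["polygons"]:
            if len(poly) != 4:
                raise ValueError(f"{name}: expected 4 points per polygon")
            if any(len(pt) != 2 for pt in poly):
                raise ValueError(f"{name}: every point must be [x, y]")
        total += len(entry["polygons"])
    return total


# Inline sample with made-up coordinates; a real run would load labels.json instead.
sample = json.loads("""
{"00000.jpg": {"img_dimensions": [1462, 1056],
               "img_hash": "placeholder",
               "polygons": [[[10, 20], [110, 20], [110, 50], [10, 50]]]}}
""")
n_polygons = check_labels(sample)
```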
Real-world layouts
To better mimic real documents, the layout is chosen per page via det_layout:
- "paragraph" - multi-block running text with headings and indents.
- "newspaper" - a full-width masthead with a double rule and a dateline, then several narrow columns separated by vertical rules, each with article headlines, bylines and small, tightly-leaded body text (~500-1100 words on an A4-ish page). Tune density with det_newspaper_columns_range (default (3, 6), clamped to the page width), det_newspaper_font_size_range (default (9, 15)) and det_newspaper_line_spacing_range (default (1.05, 1.2)).
- "form" - a title with a header rule, then Label: / value rows with either underlines or boxed fields, shaded section-header bars, and occasional checkbox rows.
- "id_card" - a card with a coloured issuing-authority header band (emblem + light title text), a photo placeholder, labelled field rows, a signature line and MRZ-style lines. Mirrors fully for right-to-left scripts.
- "mixed" (default) - a weighted blend of the above; tune via det_layout_weights (e.g. {"paragraph": 0.4, "newspaper": 0.25, "form": 0.2, "id_card": 0.15}).
Forms and ID cards always render on clean generated paper. All layouts emit the
same per-word polygons, and the optional small global page rotation
(det_rotation_*) rotates the polygons with the page for use with docTR's
use_polygons=True. Other layout knobs: det_page_*_range, det_font_size_range,
det_max_blocks, det_margin_ratio, det_heading_prob.
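The "mixed" setting amounts to a weighted draw per page. A minimal sketch of such a draw, using only the standard library, is shown below; pick_layout is a hypothetical helper and the weights are simply the example values quoted above.

```python
import random
from collections import Counter
from typing import Dict

layout_weights = {"paragraph": 0.4, "newspaper": 0.25, "form": 0.2, "id_card": 0.15}


def pick_layout(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one layout name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(42)
# Tally the layouts chosen for 10,000 simulated pages.
counts = Counter(pick_layout(layout_weights, rng) for _ in range(10_000))
```

Over many pages the tallies approach the configured shares, so raising one weight shifts the page mix without excluding the other layouts.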
Backgrounds for detection: only the words you place are labelled, so any text already printed in a background photo becomes an unlabelled false negative. det_plain_background_prob (0.4) mixes in clean generated paper; set it to 1.0 for all-paper pages, or point bg_image_dir at text-free textures (plain paper, surfaces, fabrics) only.
Non-Latin scripts work out of the box: words and fonts are resolved per language,
complex scripts are shaped correctly (Arabic joining, Indic conjuncts), and
right-to-left languages (Arabic, Hebrew, ...) are laid out right-to-left so pages
read naturally. For example languages=["ar"], ["he"], ["zh"] or ["hi"].
Plug into docTR training (on-the-fly, in-RAM)
You can skip writing a dataset to disk entirely and feed freshly synthesised
samples straight into docTR's training scripts. generator/doctr_dataset.py
provides PyTorch Dataset wrappers that generate one sample per
__getitem__, matching docTR's dataset contract - (image_tensor, target) per
sample plus a static collate_fn - so they drop into the existing DataLoader
in references/detection/train.py and references/recognition/train.py.
Targets are identical to docTR's own datasets, so the model transforms and loss
treat them the same: recognition yields the label string; detection yields
{CLASS_NAME: geoms} with absolute-pixel polygons (N, 4, 2) when
use_polygons=True else straight boxes (N, 4) as [xmin, ymin, xmax, ymax].
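The relationship between the two detection target shapes is simple: a straight box is the axis-aligned bounding box of a word's four polygon points. The sketch below shows that conversion on plain lists; polygon_to_box is a hypothetical helper, and whether the wrappers compute boxes exactly this way is an assumption.

```python
from typing import List, Sequence

Polygon = Sequence[Sequence[float]]  # four [x, y] points in absolute pixels


def polygon_to_box(poly: Polygon) -> List[float]:
    """Return the axis-aligned box [xmin, ymin, xmax, ymax] enclosing a polygon."""
    xs = [pt[0] for pt in poly]
    ys = [pt[1] for pt in poly]
    return [min(xs), min(ys), max(xs), max(ys)]


def polygons_to_boxes(polys: Sequence[Polygon]) -> List[List[float]]:
    """Convert N polygons of shape (N, 4, 2) into N boxes of shape (N, 4)."""
    return [polygon_to_box(p) for p in polys]
```

For a slightly rotated word the box is larger than the polygon, which is why polygons are the better fit when page rotation is enabled.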
Detection - in references/detection/train.py, replace the
DetectionDataset(...) construction (keep the DataLoader lines):
from generator.components import GenerationConfig
from generator.doctr_dataset import build_detection_datasets, synth_worker_init_fn
cfg = GenerationConfig(
    task="detection",
    languages=["en", "de"],
    num_images=50_000,  # POOL size (word variety + vocab coverage)
    auto_download_backgrounds=True,
)
train_set, val_set = build_detection_datasets(
    cfg,
    train_samples=args.epochs and 20_000,  # virtual epoch length (len(dataset))
    val_samples=2_000,
    use_polygons=args.rotation,  # straight boxes unless --rotation
    sample_transforms=batch_transforms,  # the script's existing transforms
)
Recognition - in references/recognition/train.py, replace the
RecognitionDataset(...) construction:
from generator.components import GenerationConfig
from generator.doctr_dataset import build_recognition_datasets, synth_worker_init_fn
cfg = GenerationConfig(task="recognition", languages=["en", "de"], num_images=100_000)
train_set, val_set = build_recognition_datasets(
    cfg,
    train_samples=50_000,
    val_samples=5_000,
    img_transforms=img_transforms,  # the script's existing resize/aug
)
The DataLoader lines stay as they are - just keep
collate_fn=train_set.collate_fn and add worker_init_fn=synth_worker_init_fn
so every worker gets an independent RNG stream:
train_loader = DataLoader(
    train_set,
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=True,
    num_workers=args.workers,
    pin_memory=torch.cuda.is_available(),
    collate_fn=train_set.collate_fn,
    worker_init_fn=synth_worker_init_fn,
)
Notes:
- Pool size vs epoch length. config.num_images sizes the word pool (variety and per-split vocab coverage); train_samples / val_samples set the virtual epoch length (len(dataset)). Samples are generated fresh, so the epoch length is just how many iterations you want per epoch.
- Seeding. The train set draws a fresh random sample on every access (new data every epoch - the whole point of on-the-fly); the val set is a reproducible fixed virtual set (seeded per index) so metrics stay comparable.
- Coverage carries over. The recognition pools come from the same balancing and per-split character-coverage pipeline as the offline generator, so sampling from them covers the target vocab.
- One-time setup. Corpora, fonts and backgrounds are downloaded/resolved once when the datasets are built (in the parent process), not per worker.
- Requires PyTorch in your training environment (pip install python-doctr). Importing the rest of this package never requires torch.
For lower-level control you can use SyntheticDetectionDataset / SyntheticRecognitionDataset directly instead of the build_* factories.
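The difference between the train and val behaviour in the notes above can be shown with a toy, torch-free dataset. VirtualWordDataset is a hypothetical stand-in that returns words instead of rendered images, and the seed mixing constant is arbitrary; the real wrappers may derive their per-index seeds differently.

```python
import random
from typing import List, Optional


class VirtualWordDataset:
    """Toy stand-in for an on-the-fly dataset: samples words from a pool."""

    def __init__(self, pool: List[str], length: int, seed: Optional[int] = None):
        self.pool = pool
        self.length = length  # virtual epoch length, independent of the pool size
        self.seed = seed      # None -> fresh sample on every access (train)

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < self.length:
            raise IndexError(index)
        if self.seed is None:
            rng = random.Random()  # OS-seeded: a new draw on each access
        else:
            # Seeded per index: the same index always yields the same sample.
            rng = random.Random(self.seed * 1_000_003 + index)
        return rng.choice(self.pool)


pool = ["haus", "baum", "straße", "wasser", "licht"]
train_ds = VirtualWordDataset(pool, length=20_000)  # fresh data every epoch
val_ds = VirtualWordDataset(pool, length=50, seed=7)  # fixed virtual set
```

Reading val_ds twice gives identical samples, so validation metrics stay comparable across epochs, while train_ds returns a new draw for the same index each time.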
Realism
Rendered crops are meant to match real captured documents rather than clean synthetic glyphs. The pipeline applies the following, all configurable:
- Supersampled rendering with high-quality downsampling for photographic anti-aliasing (supersample).
- Background-aware ink: dark-on-light and light-on-dark text, a controllable (often deliberately low) contrast range, neutral or coloured ink, variable opacity, faux-bold and outlines.
- Glyph-space augmentations before compositing (rotation, perspective, ink erosion) and image-space degradations after (Gaussian sensor noise, JPEG compression artifacts, blur, brightness/contrast jitter) - matching how a real capture degrades the whole frame.
- Optional JPEG output (output_jpeg=True) to match real document captures.
Performance & memory
Font objects and decoded background images are cached, giving a large throughput improvement over re-loading them per sample. Memory stays bounded and tunable:
- bg_cache_size (16): number of decoded backgrounds held in memory per worker. Lower it on memory-constrained machines or with many workers; raise it for more background variety.
- bg_max_dimension (2000) downscales very large backgrounds on load so the cache stays light regardless of source resolution.
- Caches are per worker process, so peak memory scales roughly with num_workers.
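A size-bounded cache of this kind is commonly built as a least-recently-used (LRU) map. The sketch below shows the idea with the standard library; BoundedCache and its loader callback are hypothetical, and the package's actual eviction policy is not stated here, so treat LRU as an assumption.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedCache:
    """Keep at most max_size loaded items, evicting the least recently used."""

    def __init__(self, max_size: int, loader: Callable[[Hashable], Any]):
        self._max_size = max_size
        self._loader = loader
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = self._loader(key)         # e.g. decode an image from disk
        self._items[key] = value
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # drop the least recently used
        return value

    def __len__(self) -> int:
        return len(self._items)


loads = []  # records which keys were actually loaded


def fake_decode(path):
    loads.append(path)
    return f"decoded:{path}"


cache = BoundedCache(max_size=2, loader=fake_decode)
for path in ["a", "b", "a", "c", "b"]:
    cache.get(path)
```

Since each worker process would hold its own cache, total memory in this model is roughly the per-item size times max_size times the number of workers.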
Configuration reference
All behaviour is controlled through GenerationConfig; see the dataclass
docstring in generator/components/config.py for every field and its default.
Resources
- fonts_v1: A collection of fonts used for text rendering can be downloaded from Fonts_v1.
- background_images_v1: A collection of background images used for text rendering can be downloaded from Background_Images_v1.
Citation
If you wish to cite this project, please refer to the base project citation; feel free to use this BibTeX reference:
@misc{docTR-Synth-Generator,
    title={docTR-Synth-Generator: A tool to generate synthetic OCR text datasets - made for docTR},
    author={{Dittrich, Felix}},
    year={2026},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/felixdittrich92/docTR-Synth-Generator}}
}
The automatic word lists are derived from the FrequencyWords project (OpenSubtitles-based) and fonts from Google Fonts / Noto; please respect their respective licenses when redistributing generated datasets.
Development & tests
The test suite is fully offline - it builds a tiny in-memory font with
fontTools and monkeypatches the network downloads, so no fonts or corpora are
fetched while testing. Run it with:
make test # pytest + coverage
make quality # ruff + mypy
make style # auto-format and fix
Contributing
Contributions are what make the open-source community such an amazing place to learn, inspire, and create.
Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Add your Changes
- Run the tests and quality checks (make test and make style and make quality)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
License
Distributed under the Apache 2.0 License. See LICENSE for more information.