Easily create synthetic data for HTR and OCR
old-doc
Easily create synthetic data for HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition).
Description
old-doc is a Python package designed to generate synthetic data for training and testing HTR and OCR models. This tool streamlines the process of creating diverse datasets for improving text recognition systems, allowing users to generate custom manuscript-like pages with various text styles, layouts, and effects.
Installation
You can install old-doc using pip:
pip install old-doc
Note: old-doc requires Python 3.8 or later.
Features
- Generate synthetic handwritten text images
- Create synthetic printed document images
- Customize text content, fonts, layouts, and degradation effects
- Support for curved text, drop caps, and marginalia
- Export data in image format and ALTO XML for HTR and OCR tasks
Usage
Here's an example of how to use old-doc to create a sample manuscript page:
from old_doc import TextBlock, Column, Row, Page

# Create text blocks
title = TextBlock("Simple Document", block_type="heading", font_size=40, font_color=(100, 0, 0))
content = TextBlock(
    "This is a sample text for our document. " * 5,
    font_size=16,
    font_color=(0, 0, 0),
    curve_amount=0.1,  # Slight curve to the text
    word_spacing=10,
)

# Create layout
header_row = Row([Column([title], width=800)], height=60)
content_row = Row([Column([content], width=800)], height=400)

# Create page
page = Page(
    [header_row, content_row],
    cell_padding=20,
    background_color=(250, 240, 230),  # Light parchment color
)

# Generate the page
image, alto = page.generate()

# Save the results
image.save("example.png")
page.save_alto_xml("example.alto.xml")

# Display the image (optional, requires matplotlib)
page.visualize_results()
This example builds a page with a heading row and a content row of slightly curved text on a light parchment background. It then generates the page, saves both the image and the ALTO XML output, and finally displays the result. Drop caps and marginalia are supported by the package but are not used in this example.
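Once a page has been saved, the ALTO XML can be read back to pair each text line with its bounding box, which is the usual ground truth for HTR and OCR training. The following is a minimal sketch using only the Python standard library. It assumes the file follows the standard ALTO layout (TextLine elements with HPOS, VPOS, WIDTH, and HEIGHT attributes, containing String elements with a CONTENT attribute); the exact structure written by old-doc is not documented here, so treat this as a starting point. The names local_name, read_alto_lines, and SAMPLE_ALTO are hypothetical helpers for illustration and are not part of old-doc.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def read_alto_lines(xml_text):
    """Return one (text, (hpos, vpos, width, height)) tuple per TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Collect the words of this line in document order.
        words = [
            child.get("CONTENT", "")
            for child in element.iter()
            if local_name(child.tag) == "String"
        ]
        # Missing coordinates default to 0 rather than raising.
        box = tuple(
            float(element.get(key, 0))
            for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        lines.append((" ".join(words), box))
    return lines


# Hand-written sample for illustration only; this is NOT actual old-doc output.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine HPOS="20" VPOS="30" WIDTH="300" HEIGHT="40">
            <String CONTENT="Simple"/>
            <SP/>
            <String CONTENT="Document"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

for text, box in read_alto_lines(SAMPLE_ALTO):
    print(text, box)
```

To try this on the file saved in the example above, read example.alto.xml as text and pass its contents to read_alto_lines. Matching on local element names keeps the sketch independent of the ALTO schema version declared in the namespace.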