old-doc

Easily create synthetic data for HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition).

Description

old-doc is a Python package designed to generate synthetic data for training and testing HTR and OCR models. This tool streamlines the process of creating diverse datasets for improving text recognition systems, allowing users to generate custom manuscript-like pages with various text styles, layouts, and effects.

Installation

You can install old-doc using pip:

pip install old-doc

Note: old-doc requires Python 3.8 or later.
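A script that depends on old-doc can check the interpreter version before importing the package. The following is only a sketch: the helper name meets_minimum is hypothetical and not part of old-doc, and the 3.8 floor is taken from the note above.

```python
import sys

# Minimum interpreter version stated in the installation note.
MIN_PYTHON = (3, 8)


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor, ...) tuple satisfies the minimum.

    Hypothetical helper for illustration; it is not provided by old-doc.
    With no argument, the running interpreter is checked.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor) so patch levels do not matter.
    return tuple(version_info[:2]) >= tuple(minimum)
```

Calling meets_minimum() with no arguments checks the running interpreter, so a script can exit with a clear message instead of failing later on an import.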

Features

  • Generate synthetic handwritten text images
  • Create synthetic printed document images
  • Customize text content, fonts, layouts, and degradation effects
  • Support for curved text, drop caps, and marginalia
  • Export data in image format and ALTO XML for HTR and OCR tasks
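An ALTO XML export can be read back with the Python standard library, for example to pair each text line with its bounding box when building ground truth. The sketch below makes several assumptions: the sample document is hand-written for illustration and is not output from old-doc, the helper names are hypothetical, and because the ALTO version that old-doc emits is not stated here, the parser ignores XML namespaces. The element and attribute names (TextLine, String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) come from the ALTO standard.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO fragment for illustration only; NOT output from old-doc.
SAMPLE_ALTO = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1" WIDTH="800" HEIGHT="460">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1" HPOS="20" VPOS="20" WIDTH="300" HEIGHT="40">
            <String CONTENT="Simple" HPOS="20" VPOS="20" WIDTH="140" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Document" HPOS="170" VPOS="20" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_text_lines(alto_xml):
    """Return a list of (text, (hpos, vpos, width, height)) per TextLine."""
    root = ET.fromstring(alto_xml.encode("utf-8"))
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Join the CONTENT of each String child; SP (space) elements are skipped.
        words = [child.get("CONTENT", "") for child in element
                 if local_name(child.tag) == "String"]
        box = tuple(float(element.get(key, 0))
                    for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        lines.append((" ".join(words), box))
    return lines
```

The same function can be applied to the contents of a saved .alto.xml file, provided the file follows the usual TextLine and String layout; that should be checked against the actual output.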

Usage

Here's an example of how to use old-doc to create a sample manuscript page:

from old_doc import TextBlock, Column, Row, Page

# Create text blocks
title = TextBlock("Simple Document", block_type="heading", font_size=40, font_color=(100, 0, 0))
content = TextBlock(
    "This is a sample text for our document. " * 5,
    font_size=16,
    font_color=(0, 0, 0),
    curve_amount=0.1,  # Slight curve to the text
    word_spacing=10,
)

# Create layout
header_row = Row([Column([title], width=800)], height=60)
content_row = Row([Column([content], width=800)], height=400)

# Create page
page = Page([header_row, content_row], 
            cell_padding=20, 
            background_color=(250, 240, 230))  # Light parchment color

# Generate the page
image, alto = page.generate()

# Save the results
image.save("example.png")
page.save_alto_xml("example.alto.xml")

# Display the image (optional, requires matplotlib)
page.visualize_results()

This example creates a page with a heading and a block of body text rendered with a slight curve, on a light parchment background. It then generates the page, saves the image and the ALTO XML output, and displays the result.
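A training set usually needs many pages with varied styling rather than a single one. The sketch below draws reproducible style variations using only the keyword names that appear in the example above (font_size, font_color, curve_amount, word_spacing). The helper names and all value ranges are assumptions chosen for illustration, not recommendations from old-doc.

```python
import random


def sample_block_style(rng):
    """Draw one set of TextBlock keyword arguments.

    Only keyword names from the usage example are used; the ranges are
    illustrative assumptions, not values recommended by old-doc.
    """
    ink = rng.randint(0, 60)  # dark ink with slight variation
    return {
        "font_size": rng.randint(14, 20),
        "font_color": (ink, ink, ink),
        "curve_amount": round(rng.uniform(0.0, 0.2), 3),
        "word_spacing": rng.randint(6, 14),
    }


def sample_styles(count, seed=0):
    """Return `count` style dictionaries, reproducible for a given seed."""
    rng = random.Random(seed)
    return [sample_block_style(rng) for _ in range(count)]
```

Each dictionary can be unpacked into the constructor from the example, as in TextBlock(text, **style), and the resulting page generated and saved in a loop with a distinct file name per page.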

Download files

Source Distribution

old_doc-0.0.3.tar.gz (6.4 kB)

Uploaded Source

Built Distribution

old_doc-0.0.3-py3-none-any.whl (13.5 kB)

Uploaded Python 3

File details

Details for the file old_doc-0.0.3.tar.gz.

File metadata

  • Download URL: old_doc-0.0.3.tar.gz
  • Size: 6.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for old_doc-0.0.3.tar.gz
Algorithm    Hash digest
SHA256       2df8f6cd2a252b66ec4ae4efd27d3572de3541d1110ba811c7757e361b04820d
MD5          eb3fc411719ca33aa0b78c2b31cca04e
BLAKE2b-256  4c371c93ab377ab4a61a0508fca9537699e1faffba75206a080b7cc11bdc58fa
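
A downloaded file can be checked against the SHA256 digest listed above with the standard library's hashlib module. This is a minimal sketch: the helper names sha256_of and matches are hypothetical, and the file is assumed to have been downloaded already.

```python
import hashlib

# SHA256 digest listed above for old_doc-0.0.3.tar.gz.
EXPECTED_SHA256 = "2df8f6cd2a252b66ec4ae4efd27d3572de3541d1110ba811c7757e361b04820d"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches("old_doc-0.0.3.tar.gz", EXPECTED_SHA256) should return True for an intact download placed in the current directory.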

File details

Details for the file old_doc-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: old_doc-0.0.3-py3-none-any.whl
  • Size: 13.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for old_doc-0.0.3-py3-none-any.whl
Algorithm    Hash digest
SHA256       8a76dd90a137f3e1b7aea2a668e533b236e4b8c82621ed4ce9eb35bdbb5b1c09
MD5          3316191a37801b071ba4e913df474581
BLAKE2b-256  33984eb7716f96c70903901d172bd749a602d8fa894d3203fd1b9b5c96327d06
