Easily create synthetic data for HTR and OCR

old-doc

Easily create synthetic data for HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition).

Description

old-doc is a Python package designed to generate synthetic data for training and testing HTR and OCR models. This tool streamlines the process of creating diverse datasets for improving text recognition systems, allowing users to generate custom manuscript-like pages with various text styles, layouts, and effects.

Installation

You can install old-doc using pip:

pip install old-doc

Note: old-doc requires Python 3.8 or later.
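A quick way to confirm the environment before running the examples is sketched below. It uses only the Python standard library; the helper names `python_is_supported` and `installed_version` are illustrative and are not part of old-doc.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 8)  # minimum version stated in the note above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the stated minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name: str = "old-doc") -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("old-doc version:", installed_version() or "not installed")
```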

Features

  • Generate synthetic handwritten text images
  • Create synthetic printed document images
  • Customize text content, fonts, layouts, and degradation effects
  • Support for curved text, drop caps, and marginalia
  • Export data in image format and ALTO XML for HTR and OCR tasks
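For readers new to ALTO XML, the sketch below shows one way to read line text and bounding boxes from an ALTO document with the standard library. The sample document is hand-written for illustration and is not actual old-doc output. The sketch assumes the exported file uses the standard ALTO elements (`TextLine`, `String`) and attributes (`CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`). The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not actual old-doc output.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page WIDTH="800" HEIGHT="460">
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine ID="line_1" HPOS="20" VPOS="20" WIDTH="300" HEIGHT="40">
            <String CONTENT="Simple" HPOS="20" VPOS="20" WIDTH="140" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Document" HPOS="170" VPOS="20" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def read_lines(alto_xml: str):
    """Return (text, (hpos, vpos, width, height)) for every TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Join the CONTENT of each String child to rebuild the line text.
        words = [child.get("CONTENT", "") for child in elem
                 if local_name(child.tag) == "String"]
        box = tuple(float(elem.get(key, 0))
                    for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        lines.append((" ".join(words), box))
    return lines
```

Matching on the local tag name rather than a fixed namespace keeps the reader independent of the ALTO schema version. To inspect a saved file, pass its contents to `read_lines`, for example after reading it with `pathlib.Path(...).read_text(encoding="utf-8")`.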

Usage

Here's an example of how to use old-doc to create a sample manuscript page:

from old_doc import TextBlock, Column, Row, Page

# Create text blocks
title = TextBlock("Simple Document", block_type="heading", font_size=40, font_color=(100, 0, 0))
content = TextBlock("This is a sample text for our document. " * 5,
                    font_size=16, font_color=(0, 0, 0),
                    curve_amount=0.1,  # Slight curve to the text
                    word_spacing=10)

# Create layout
header_row = Row([Column([title], width=800)], height=60)
content_row = Row([Column([content], width=800)], height=400)

# Create page
page = Page([header_row, content_row], 
            cell_padding=20, 
            background_color=(250, 240, 230))  # Light parchment color

# Generate the page
image, alto = page.generate()

# Save the results
image.save("example.png")
page.save_alto_xml("example.alto.xml")

# Display the image (optional, requires matplotlib)
page.visualize_results()

This example builds a page with a dark red heading and a slightly curved block of body text on a parchment-colored background. It generates the page image and its ALTO XML, saves both to disk, and then displays the result.
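To produce a varied dataset rather than a single page, the same calls can be wrapped in a loop that randomizes the styling. The following is a minimal sketch. It uses only the old-doc calls shown in the example above. The helper names and the parameter ranges are assumptions chosen for illustration, not values recommended by the package.

```python
import random


def sample_style(rng: random.Random) -> dict:
    """Draw one random set of keyword arguments for a content TextBlock."""
    ink = rng.randint(0, 60)  # dark ink, from black to dark grey
    return {
        "font_size": rng.randint(14, 20),
        "font_color": (ink, ink, ink),
        "curve_amount": round(rng.uniform(0.0, 0.2), 2),
        "word_spacing": rng.randint(6, 14),
    }


def render_sample(text: str, style: dict, stem: str) -> None:
    """Render one page and save '<stem>.png' and '<stem>.alto.xml'."""
    # Imported lazily so the sampling code works without old-doc installed.
    from old_doc import TextBlock, Column, Row, Page

    content = TextBlock(text, **style)
    page = Page([Row([Column([content], width=800)], height=400)],
                cell_padding=20,
                background_color=(250, 240, 230))
    image, _alto = page.generate()
    image.save(f"{stem}.png")
    page.save_alto_xml(f"{stem}.alto.xml")


def build_dataset(count: int, seed: int = 0) -> None:
    """Render `count` pages with reproducible random styling."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    text = "This is a sample text for our document. " * 5
    for index in range(count):
        render_sample(text, sample_style(rng), f"sample_{index:03d}")
```

Calling `build_dataset(10)` would write `sample_000.png` through `sample_009.png` with matching ALTO files, assuming the chosen ranges are accepted by `TextBlock`.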
