Generate large amounts of image-based PDF test data for file-based OCR and Document Management Solutions.

ocrtestdata

ocrtestdata is a utility for generating large volumes of image‑based PDF files designed to test OCR (Optical Character Recognition) systems under real‑world conditions. Whether you’re building an OCR pipeline or validating the resilience of an existing solution, this simple tool helps you simulate demanding workloads.

  • Can be used for tests such as:
    • Load tests → Tests that measure system performance and scalability under realistic or increasing load.
    • Stress tests → Tests that push the system beyond its normal limits to evaluate stability and fault tolerance.
    • Performance tests → General tests that assess speed, throughput, and response times.
    • Endurance tests → Long-duration tests that run the system under sustained load to detect memory leaks or stability issues.
  • Pages are images created with Pillow; text is generated with Faker; QR codes are embedded as images.

Features

  • Multi-page PDFs where each page is an image.
  • The language of the generated text can be set with the -l/--locale option.
  • Pages are either text pages or QR pages (every 3rd page if --qr is provided).
  • The QR pages can be used to simulate separator sheets.
  • Atomic write: PDFs are created in a temporary directory and then copied to the destination.
  • Batch generation: Create large numbers of PDF files with unique filenames in batches.
  • Batch rules: if -b > 10, duplicates are created from a generated set.
  • Run duration option to stop after a total elapsed time or run forever.
  • Clean shutdown on Ctrl+C with statistics.
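
The "atomic write" feature above can be illustrated with a short sketch. It is not ocrtestdata's actual code: the function name write_via_temp_dir and the placeholder bytes are hypothetical, and the sketch only assumes the behaviour stated in the list (create the file in a temporary directory, then copy it to the destination).

```python
import shutil
import tempfile
from pathlib import Path


def write_via_temp_dir(data: bytes, destination: Path) -> Path:
    """Stage a file in a temporary directory, then copy it to `destination`.

    Hypothetical helper that mirrors the "create in a temporary directory,
    then copy" behaviour described above; not taken from ocrtestdata.
    """
    destination.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        staged = Path(tmp) / destination.name
        # The slow, incremental work happens outside the destination folder.
        staged.write_bytes(data)
        # Only the finished file is copied into the destination folder.
        shutil.copy2(staged, destination)
    return destination
```

Calling write_via_temp_dir(b"%PDF-1.4 placeholder", Path("out") / "sample.pdf") would leave only the finished file in ./out. The point of staging is that a folder watched by an OCR service is far less likely to pick up a file that is still being rendered.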

Installation

You can install ocrtestdata directly from PyPI using pip:

pip install ocrtestdata

CLI

ocrtestdata --help

usage: ocrtestdata [-h] [-b B] [-t T] [-r RUN_DURATION] [-l LOCALE] [-p P]
                   [--qr QR] [--dpi DPI] [-o OUTPUT]

Generate image-based PDF test data for OCR testing.

options:
  -h, --help            show this help message and exit
  -b B                  number of PDFs to create in a batch (default: 1)
  -t T                  timer in seconds between batches (default: 0). Minimum
                        one batch even if 0.
  -r RUN_DURATION, --run-duration RUN_DURATION
                        total run duration limit in seconds; 0 means no duration
                        limit (default: 0)
  -l LOCALE, --locale LOCALE
                        locale for Faker (default: system locale)
  -p P                  number of pages per PDF (default: 10)
  --qr QR               If provided, every 3rd page will be a QR page
                        containing this text
  --dpi DPI             DPI for page images (default: 300)
  -o OUTPUT, --output OUTPUT
                        output directory for PDFs (default: current working
                        directory)
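
How -p and --qr combine into a page layout can be sketched as follows. This is an illustration only: the function name page_kinds is hypothetical, and the sketch assumes that "every 3rd page" means pages 3, 6, 9, ... counted from 1, which the help text does not spell out.

```python
from typing import List, Optional


def page_kinds(pages: int = 10, qr_text: Optional[str] = None) -> List[str]:
    """Return "text" or "qr" for each page of one PDF.

    Assumption: QR pages fall on page numbers divisible by 3 (1-based)
    and appear only when a QR text is given, as with --qr.
    """
    kinds = []
    for number in range(1, pages + 1):
        is_qr = qr_text is not None and number % 3 == 0
        kinds.append("qr" if is_qr else "text")
    return kinds


# Default page count (10) with a QR text: pages 3, 6 and 9 are QR pages.
print(page_kinds(10, "SEPARATOR"))
```

Under this assumption a 50-page PDF would contain 16 QR pages and 34 text pages, which gives a rough idea of how many separator sheets a downstream system should detect per file.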

Examples

Create one PDF with default settings in the current directory:

ocrtestdata 

Create a batch of 30 PDF files every 10 seconds for a total of 20 minutes (1200 seconds), writing to the ./out directory. Each PDF has 50 pages of French text at 150 DPI, and every 3rd page is a QR page containing "SEPARATOR":

ocrtestdata -b 30 -p 50 -l fr_FR --dpi 150 --qr "SEPARATOR" -t 10 -r 1200 -o ./out
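
When the tool is driven from a test harness, the same command line can be assembled in Python. The sketch below uses only the flags listed in the help output; the function build_command and its parameter names are hypothetical and not part of ocrtestdata.

```python
import shlex
from typing import List, Optional


def build_command(
    batch: int = 1,
    pages: int = 10,
    dpi: int = 300,
    timer: int = 0,
    run_duration: int = 0,
    locale: Optional[str] = None,
    qr: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an ocrtestdata argument list (hypothetical helper)."""
    cmd = ["ocrtestdata", "-b", str(batch), "-p", str(pages)]
    if locale is not None:
        cmd += ["-l", locale]
    cmd += ["--dpi", str(dpi)]
    if qr is not None:
        cmd += ["--qr", qr]
    cmd += ["-t", str(timer), "-r", str(run_duration)]
    if output is not None:
        cmd += ["-o", output]
    return cmd


cmd = build_command(batch=30, pages=50, dpi=150, timer=10,
                    run_duration=1200, locale="fr_FR",
                    qr="SEPARATOR", output="./out")
# Show the command as a shell-quoted string before running it.
print(shlex.join(cmd))
```

With ocrtestdata installed, the resulting list could be passed to subprocess.run(cmd, check=True). Passing a list instead of a single string avoids shell quoting problems when the QR text contains spaces.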

Download files

Source Distribution

ocrtestdata-0.1.0.tar.gz (357.8 kB)

Uploaded Source

Built Distribution

ocrtestdata-0.1.0-py3-none-any.whl (359.3 kB)

Uploaded Python 3

File details

Details for the file ocrtestdata-0.1.0.tar.gz.

File metadata

  • Download URL: ocrtestdata-0.1.0.tar.gz
  • Upload date:
  • Size: 357.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for ocrtestdata-0.1.0.tar.gz
Algorithm     Hash digest
SHA256        a194aaa604cb4225aac19dcec9b1ec0b030960d87a4cff97df628624ee603357
MD5           68674ae14972199cb101786b9149abcd
BLAKE2b-256   c00044aa7741891e23538f465bb2807f4db0e8b673a88b365b60b2652b7a45bf

File details

Details for the file ocrtestdata-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ocrtestdata-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 359.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for ocrtestdata-0.1.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        d72c539c50d8e550339f77835363fcd6423bca958b20b9b50bfb91e2ddd45716
MD5           e4ea53ed6021cb178384078b54a33ce9
BLAKE2b-256   4eac4b9eb8faff81d20931ed6bd3008f12876efce60f8e8a1dda757d4cdc3bda
