NoisOCR

Tools to simulate post-OCR noisy texts.

Features:

  • Sliding window;
  • Sliding window with hyphenation;
  • Simulate text errors;
  • Simulate text annotations.

Install

pip install noisocr

Sliding window:

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window(text, max_window_size)

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing', 
#   ...
#   'type and scrambled it to make a type specimen', 
#   'book.'
# ]
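
To make the windowing behaviour concrete, here is a minimal pure-Python sketch of a word-boundary sliding window. It is an illustration only, not NoisOCR's implementation: the name `word_windows` is hypothetical, and the greedy packing rule is an assumption inferred from the example output above.

```python
def word_windows(text: str, max_window_size: int) -> list[str]:
    """Greedily pack whole words into windows of at most max_window_size characters.

    Hypothetical helper for illustration; not part of the noisocr package.
    """
    windows: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else f"{current} {word}"
        if len(candidate) <= max_window_size or not current:
            # Either the word fits, or it is longer than a whole window
            # and has to occupy a window on its own.
            current = candidate
        else:
            # The word does not fit: close the current window and start a new one.
            windows.append(current)
            current = word
    if current:
        windows.append(current)
    return windows


sample = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
for window in word_windows(sample, 50):
    print(repr(window))
```

Because words are never split, a window can end well short of the limit when the next word is long.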

Sliding window with hyphenation:

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

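# Note: as published, this call is identical to the plain sliding-window
# example above; the hyphenating variant presumably uses a different
# function or argument, so check the package for the exact name.
windows = noisocr.sliding_window(text, max_window_size)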

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing ',
#   'typesetting industry. Lorem Ipsum has been the in-', 
#   ...
#   'scrambled it to make a type specimen book.'
# ]
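
The idea of breaking a word at the window edge can be sketched as follows. This is a naive illustration, not NoisOCR's implementation: `hyphenated_windows` and `min_fragment` are hypothetical names, and the split point is chosen purely by available space. Linguistically correct breaks such as `in-` / `dustry's` need a language-specific hyphenation dictionary (for example the Pyphen library).

```python
def hyphenated_windows(text: str, max_window_size: int, min_fragment: int = 2) -> list[str]:
    """Fill each window as far as possible, splitting a word with '-' at the edge.

    Hypothetical helper: the split ignores syllables and only guarantees that
    both fragments have at least min_fragment characters.
    """
    windows: list[str] = []
    current = ""
    for word in text.split():
        while word:
            prefix = f"{current} " if current else ""
            if len(prefix) + len(word) <= max_window_size:
                # The whole word fits in the current window.
                current = prefix + word
                word = ""
                continue
            # Space left for a fragment once one character is reserved for '-'.
            room = max_window_size - len(prefix) - 1
            if room >= min_fragment and len(word) - room >= min_fragment:
                windows.append(f"{prefix}{word[:room]}-")
                word = word[room:]
                current = ""
            elif current:
                # No sensible split: close the window and retry on an empty one.
                windows.append(current)
                current = ""
            else:
                # Over-long word that cannot be split sensibly: emit it whole.
                windows.append(word)
                word = ""
    if current:
        windows.append(current)
    return windows


print(hyphenated_windows("has been the industry's standard", 15))
```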

Simulate text errors:

import noisocr

text = "Hello, world!"
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Output: Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Output: fllo,w0rlr!
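
For readers curious how such noise can be produced, the sketch below applies one random character-level edit per iteration. It is an assumption-laden illustration rather than NoisOCR's method: the function `add_noise`, its `seed` parameter, and the choice of four edit operations are all hypothetical.

```python
import random
import string
from typing import Optional


def add_noise(text: str, iterations: int, seed: Optional[int] = None) -> str:
    """Apply `iterations` random character edits to `text`.

    Hypothetical helper, not part of noisocr. Each iteration substitutes,
    deletes, inserts, or swaps a character at a random position.
    """
    rng = random.Random(seed)  # local generator so a seed gives repeatable output
    chars = list(text)
    for _ in range(iterations):
        if not chars:
            break
        pos = rng.randrange(len(chars))
        op = rng.choice(("substitute", "delete", "insert", "swap"))
        if op == "substitute":
            chars[pos] = rng.choice(string.ascii_lowercase)
        elif op == "delete":
            del chars[pos]
        elif op == "insert":
            chars.insert(pos, rng.choice(string.ascii_lowercase))
        elif pos + 1 < len(chars):
            # Swap with the next character; skipped at the last position.
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)


print(add_noise("Hello, world!", iterations=2, seed=0))
```

Passing a fixed seed makes the corrupted text reproducible, which helps when building evaluation sets.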

Simulate text annotations:

  • By default, the annotations found in the BRESSAY dataset are used, but you can define which annotation types to simulate. For annotations with internal text, use the pattern ##--text--##.

import noisocr

text = "Hello, world!"
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, ##--world!--##
text_with_annotation = noisocr.simulate_annotation(text, 0.01)
# Output: Hello, world!
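
The ##--text--## pattern with a per-word probability can be sketched in a few lines. This is only an illustration under stated assumptions: `annotate`, `marker`, and `seed` are hypothetical names, and the real package may select and replace words differently (for example with placeholder text such as xxx).

```python
import random
from typing import Optional


def annotate(text: str, probability: float, marker: str = "##",
             seed: Optional[int] = None) -> str:
    """Wrap each word in marker--word--marker with the given probability.

    Hypothetical helper, not part of noisocr.
    """
    rng = random.Random(seed)
    return " ".join(
        f"{marker}--{word}--{marker}" if rng.random() < probability else word
        for word in text.split()
    )


print(annotate("Hello, world!", probability=0.5, seed=1))
```

With a probability of 0 the text comes back unchanged, and with 1 every word is wrapped.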

This version

0.2

Download files

Source Distribution

noisocr-0.2.tar.gz (3.5 kB view details)

Uploaded Source

Built Distribution

noisocr-0.2-py3-none-any.whl (3.8 kB view details)

Uploaded Python 3

File details

Details for the file noisocr-0.2.tar.gz.

File metadata

  • Download URL: noisocr-0.2.tar.gz
  • Upload date:
  • Size: 3.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for noisocr-0.2.tar.gz
Algorithm Hash digest
SHA256 879410c6e670d27c80d31e9966dee77cd3d8d48e20d451160309cfaab602eecd
MD5 1e767473597ad055cd8af3ea42c4433a
BLAKE2b-256 5de4f64a436144a0a1bfba8b8f80715550e4b7140e29074b8b74d45e1f69df12

File details

Details for the file noisocr-0.2-py3-none-any.whl.

File metadata

  • Download URL: noisocr-0.2-py3-none-any.whl
  • Upload date:
  • Size: 3.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for noisocr-0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 48fdd99e492695db1f0fb6d58a3405af5fb9d693f651ccd9fef0772b6636151a
MD5 79d732ed1f7809dbdb2bf9e5d64a8090
BLAKE2b-256 32ae5d6c16528943a2f97ce45f738a23775b88b1c5ff4b9265cb3b7b43531ed2
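
The SHA256 digests listed above can be checked against a downloaded file with Python's standard `hashlib` module. The sketch below is a generic example; the helper name `sha256_matches` and the assumption that the file sits in the current directory are illustrative only.

```python
import hashlib


def sha256_matches(path: str, expected_hex: str, chunk_size: int = 65536) -> bool:
    """Return True if the SHA256 digest of the file at `path` equals `expected_hex`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Example call (assumes the wheel was downloaded to the current directory):
# sha256_matches(
#     "noisocr-0.2-py3-none-any.whl",
#     "48fdd99e492695db1f0fb6d58a3405af5fb9d693f651ccd9fef0772b6636151a",
# )
```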
