NoisOCR
Tools to simulate post-OCR noisy texts.
Features:
- Sliding window;
- Sliding window with hyphenation;
- Simulate text errors;
- Simulate text annotations.
Install
pip install noisocr
Sliding window:
import noisocr
text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50
windows = noisocr.sliding_window(text, max_window_size)
# Output:
# [
# 'Lorem Ipsum is simply dummy text of the printing',
# ...
# 'type and scrambled it to make a type specimen',
# 'book.'
# ]
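The output above is consistent with greedy word packing: words are appended to the current window until the next word would push it past the size limit. The following is a minimal standalone sketch of that idea, not the library's implementation; the function name greedy_windows is hypothetical and the real noisocr.sliding_window may behave differently in edge cases.

```python
from typing import List


def greedy_windows(text: str, max_window_size: int) -> List[str]:
    """Pack whole words into windows of at most max_window_size characters.

    Hypothetical helper for illustration only; not part of noisocr.
    """
    windows: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_window_size or not current:
            # The word fits, or it is a single over-long word that starts a window.
            current = candidate
        else:
            # The word does not fit: close the current window and start a new one.
            windows.append(current)
            current = word
    if current:
        windows.append(current)
    return windows


sample = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
for window in greedy_windows(sample, 50):
    print(repr(window))
```

Words are never split in this sketch, so a window can be noticeably shorter than the limit when the next word is long.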
Sliding window with hyphenation:
- See the PyHyphen package (https://pypi.org/project/PyHyphen) for the list of supported languages.
import noisocr
text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50
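# Note: this call is identical to the plain sliding-window example, although the
# output below shows hyphenated breaks; check the package for the hyphenating variant.
windows = noisocr.sliding_window(text, max_window_size)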
# Output:
# [
# 'Lorem Ipsum is simply dummy text of the printing ',
# 'typesetting industry. Lorem Ipsum has been the in-',
# ...
# 'scrambled it to make a type specimen book.'
# ]
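The hyphenated variant fills each window more tightly by breaking the word that does not fit at a syllable boundary and marking the break with a trailing hyphen. Below is a rough sketch of that technique under stated assumptions: hyphenated_windows, toy_syllables, and TOY_SYLLABLES are hypothetical names, and the tiny lookup table stands in for a real dictionary-based hyphenator such as PyHyphen. It is not the library's code.

```python
from collections import deque
from typing import Callable, List

# Hypothetical stand-in for a real hyphenation dictionary.
TOY_SYLLABLES = {"industry's": ["in", "dus", "try's"]}


def toy_syllables(word: str) -> List[str]:
    """Return the syllables of a word, or the word itself if it is unknown."""
    return TOY_SYLLABLES.get(word, [word])


def hyphenated_windows(
    text: str,
    max_window_size: int,
    syllables: Callable[[str], List[str]],
) -> List[str]:
    """Pack words into windows, splitting a word at a syllable when it overflows."""
    windows: List[str] = []
    current = ""
    queue = deque(text.split())
    while queue:
        word = queue.popleft()
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_window_size:
            current = candidate
            continue
        # The word overflows: find the longest syllable prefix that still fits
        # once the trailing hyphen is counted.
        parts = syllables(word)
        prefix = current + " " if current else ""
        split_at = 0
        for i in range(1, len(parts)):
            head = "".join(parts[:i]) + "-"
            if len(prefix + head) <= max_window_size:
                split_at = i
        if split_at:
            windows.append(prefix + "".join(parts[:split_at]) + "-")
            queue.appendleft("".join(parts[split_at:]))  # remainder opens the next window
            current = ""
        elif current:
            windows.append(current)
            queue.appendleft(word)  # retry the word in an empty window
            current = ""
        else:
            windows.append(word)  # unsplittable over-long word stands alone
    if current:
        windows.append(current)
    return windows


sample = "typesetting industry. Lorem Ipsum has been the industry's standard dummy text"
for window in hyphenated_windows(sample, 50, toy_syllables):
    print(repr(window))
```

Passing the syllable splitter as a parameter keeps the packing logic independent of any particular hyphenation dictionary or language.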
Simulate text errors:
- See the typo package (https://pypi.org/project/typo) for all possible error types.
import noisocr
text = "Hello, world!"
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Output: Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Output: fllo,w0rlr!
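To make the effect of the second argument concrete, here is a small standalone sketch that applies a fixed number of random character-level edits (swap, delete, insert, replace). It is an illustration only: inject_errors, its rounds and seed parameters, and the choice of edit operations are assumptions, and it does not use the typo package that the library relies on.

```python
import random
import string
from typing import Optional


def inject_errors(text: str, rounds: int, seed: Optional[int] = None) -> str:
    """Apply `rounds` random character-level edits to `text`.

    Hypothetical helper for illustration only; not part of noisocr.
    """
    rng = random.Random(seed)  # a fixed seed makes the noise reproducible
    chars = list(text)
    for _ in range(rounds):
        if not chars:
            break
        pos = rng.randrange(len(chars))
        op = rng.choice(("swap", "delete", "insert", "replace"))
        if op == "swap" and len(chars) > 1:
            # Exchange the character with a neighbour (the left one at the end).
            other = pos + 1 if pos + 1 < len(chars) else pos - 1
            chars[pos], chars[other] = chars[other], chars[pos]
        elif op == "delete":
            del chars[pos]
        elif op == "insert":
            chars.insert(pos, rng.choice(string.ascii_lowercase))
        else:
            # "replace", or a swap requested on a one-character text.
            chars[pos] = rng.choice(string.ascii_lowercase)
    return "".join(chars)


for rounds in (1, 2, 5):
    print(inject_errors("Hello, world!", rounds, seed=rounds))
```

Each round changes the length by at most one character, so more rounds yield progressively noisier text, as in the outputs above.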
Simulate text annotations:
- By default, the annotations found in the BRESSAY dataset are used, but you can define which types of annotations to simulate. For annotations with internal text, use the pattern ##--text--##.
import noisocr
text = "Hello, world!"
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, ##--world!--##
text_with_annotation = noisocr.simulate_annotation(text, 0.01)
# Output: Hello, world!
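One way to read the outputs above is that each word is replaced by an annotation with the given probability, and that a template containing the placeholder text wraps the original word. The sketch below follows that reading; it is an assumption, not the library's implementation. The name annotate_words, the per-word sampling, and the seed parameter are all hypothetical, and the two templates are simply the ones that appear in the example outputs.

```python
import random
from typing import Optional, Sequence


def annotate_words(
    text: str,
    probability: float,
    templates: Sequence[str] = ("##--text--##", "$$--xxx--$$"),
    seed: Optional[int] = None,
) -> str:
    """Replace each word by a randomly chosen annotation with the given probability.

    Hypothetical helper for illustration only; not part of noisocr.
    """
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < probability:
            template = rng.choice(templates)
            # Templates with the "text" placeholder keep the word inside the
            # annotation; other templates replace the word entirely.
            out.append(template.replace("text", word))
        else:
            out.append(word)
    return " ".join(out)


print(annotate_words("Hello, world!", probability=0.5, seed=0))
print(annotate_words("Hello, world!", probability=0.0))
```

With a probability of 0.0 the text is returned unchanged, and with 1.0 every word is annotated.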