
A contextual spellchecker for OCR output


OCRfixr

OVERVIEW

This project aims to automate the boring work of manually correcting OCR output from Distributed Proofreaders' book digitization projects.

CORRECTING MISREADS

OCR engines sometimes confuse similar-looking characters when scanning a book. For example, "l" and "1" look alike, so the OCR may misread the word "learn" as "1earn".

As written in book:

"The birds flevv south"

Corrected text:

"The birds flew south"

How OCRfixr Works:

OCRfixr fixes misreads by checking (1) possible spelling corrections against (2) the local context of the word. For example, here is how OCRfixr would evaluate the following OCR mistake:

As written in book:

"Days there were when small trade came to the stoie. Then the young clerk read."

Method                 Plausible Replacements
Spellcheck (TextBlob)  stone, store, stoke, stove, stowe, stole, soie
Context (BERT)         market, shop, town, city, store, table, village, door, light, markets, surface, place, window, docks, area

Since "store" is both a plausible spellcheck replacement and a reasonable fit for the context of the sentence, OCRfixr updates the word.

Corrected text:

"Days there were when small trade came to the store. Then the young clerk read."
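The matching step in this example can be reproduced by hand. The snippet below is only an illustration built from the two candidate lists shown above; it is not OCRfixr's internal code, and the variable names are made up for this sketch.

```python
# Candidate lists copied from the "stoie" example above.
spellcheck_candidates = ["stone", "store", "stoke", "stove", "stowe", "stole", "soie"]
context_candidates = [
    "market", "shop", "town", "city", "store", "table", "village", "door",
    "light", "markets", "surface", "place", "window", "docks", "area",
]

# Keep only the spellcheck candidates that the context model also proposed.
context_set = set(context_candidates)
overlap = [word for word in spellcheck_candidates if word in context_set]

print(overlap)  # ['store']
```

Exactly one word appears in both lists, which is why "stoie" is replaced with "store".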

Using OCRfixr

The package can be installed using pip.

pip install OCRfixr

By default, OCRfixr returns only the corrected string, with all changes incorporated:

>>> from ocrfixr import spellcheck

>>> text = "The birds flevv south"
>>> spellcheck(text).replace()
'The birds flew south'

Use return_fixes to also return a dictionary of all corrections made to the text:

>>> spellcheck(text, return_fixes = "T").replace()
['The birds flew south', {'flevv': 'flew'}]

Use full_results_by_paragraph for longer texts, to break out the text and its associated changes by paragraph:

>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
>>> spellcheck(text, full_results_by_paragraph = "T").replace()
[['The birds flew down\n', {'flevv': 'flew'}],
 [' south, but were quickly apprehended\n', {'wefe': 'were'}],
 [' by border patrol agents', {}]]

Otherwise, the full text (+ any changes) will be returned in a single object:

>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
>>> spellcheck(text, return_fixes = "T").replace()
['The birds flew down\n south, but were quickly apprehended\n by border patrol agents',
 {'flevv': 'flew', 'wefe': 'were'}]

(Note: OCRfixr resets its BERT context window at the start of each new paragraph, so splitting by paragraph may be a useful debugging feature.)
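The two output shapes shown above carry the same information. As a rough sketch, the per-paragraph result can be folded back into the single-object form with a small helper. merge_paragraph_results is a hypothetical name used only for this illustration; it is not part of OCRfixr, and the input is the example output copied from above rather than a live call.

```python
def merge_paragraph_results(results):
    """Fold [[paragraph_text, fixes], ...] into [full_text, all_fixes]."""
    # Paragraph strings keep their trailing newlines, so plain joining restores the text.
    full_text = "".join(paragraph for paragraph, _ in results)
    all_fixes = {}
    for _, fixes in results:
        # If the same misread appears in several paragraphs, the later entry wins.
        all_fixes.update(fixes)
    return [full_text, all_fixes]


# Output of the full_results_by_paragraph example, copied from the text above.
by_paragraph = [
    ["The birds flew down\n", {"flevv": "flew"}],
    [" south, but were quickly apprehended\n", {"wefe": "were"}],
    [" by border patrol agents", {}],
]

merged = merge_paragraph_results(by_paragraph)
print(merged)
```

Going the other way (from the merged form to per-paragraph results) is not possible without re-running the spellcheck, so the per-paragraph form is the one to keep when tracing where a change came from.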

Avoiding "Damn You, Autocorrect!"

By design, OCRfixr is change-averse:

  • If spellcheck/context do not line up, no update is made.
  • Likewise, if more than one word lines up for both spellcheck and context, no update is made.
  • Only the top 15 context suggestions are considered, to limit low-probability matches.
  • Proper nouns (anything starting with a capital letter) are not evaluated for spelling.

Word context is drawn from all sentences in the current paragraph, to maximize available information, while also not bogging down the BERT model.
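The change-averse rules listed above can be summarized as a small decision function. This is a minimal sketch under stated assumptions, not OCRfixr's actual implementation: choose_replacement and its parameters are hypothetical names, and the candidate lists are assumed to arrive already ranked (best first) from the spellchecker and the context model.

```python
def choose_replacement(word, spellcheck_candidates, context_candidates, top_k=15):
    """Return the single agreed-upon replacement, or None to leave the word alone."""
    # Rule: proper nouns (anything starting with a capital letter) are not evaluated.
    if word[:1].isupper():
        return None

    # Rule: only the top-k context suggestions are considered.
    top_context = set(context_candidates[:top_k])

    # Words proposed by both the spellchecker and the context model.
    matches = {c for c in spellcheck_candidates if c in top_context}

    # Rule: update only when exactly one word lines up; zero or several means no change.
    if len(matches) == 1:
        return next(iter(matches))
    return None


spell = ["stone", "store", "stoke", "stove", "stowe", "stole", "soie"]
context = ["market", "shop", "town", "city", "store", "table", "village", "door",
           "light", "markets", "surface", "place", "window", "docks", "area"]

print(choose_replacement("stoie", spell, context))             # store
print(choose_replacement("Stoie", spell, context))             # None (treated as a proper noun)
print(choose_replacement("stoie", spell, context + ["stove"])) # store ("stove" falls outside the top 15)
```

Returning None in every doubtful case reflects the design choice described above: a missed correction is preferred over a wrong one.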

Credits

TextBlob powers spellcheck suggestions, and transformers does the heavy lifting for BERT context modelling. All book data comes from Distributed Proofreaders. Support them here: https://www.pgdp.net/c/

