Post-OCR correction in 3 stages.

postocr-3stages – A Python package for post-OCR correction in 3 stages

🚧 Careful! Work in progress! 🚧

This is the home of the postocr-3stages Python package. It aims to offer a simple solution for post-OCR correction with a friendly, scikit-learn-like API.

📦 Installation

For now, most of the module can be installed with pip. The exception is our prototype for fast computation of edits between OCR and corrected strings, which has to be built and installed manually. The installation steps are:

# create a virtual env "myvenv" and activate it
python -m venv myvenv 
. myvenv/bin/activate

# install postocr package
pip install postocr-3stages

# manually download and install our isri-tools package, with Python bindings of the original ISRI code
# - dependency
pip install pybind11
# - download the archive and uncompress
wget https://github.com/soduco/paper-ner-bench-das22/archive/refs/heads/main.zip
unzip main.zip
# - build and install
cd paper-ner-bench-das22-main/src/ocr/
pip install .

🚀 Quick start

This package tries to mimic the interface of sklearn for easier use and compatibility.

Here is a minimal example showing how the library works. In this example, the model learns to transform the string "x" into "y". Pipeline.score() computes the Character Error Rate (CER): the lower, the better.

  • "OCR" (x in the code): original transcription from some OCR system, to be corrected
  • "Gold" (y in the code): target transcription

import pandas as pd

from postocr_3stages.error_detector import ErrorDetector
from postocr_3stages.controller import LengthController
from postocr_3stages.NMT_corrector import NMTCorrector
from postocr_3stages.pipeline import Pipeline

x = pd.Series(["x"] * 100)
y = pd.Series(["y"] * 100)

pipeline = Pipeline(
    error_detector=ErrorDetector(),
    nmt_corrector=NMTCorrector(train_steps=10),
    controller=LengthController(),
)

pipeline.fit(x, y)
pipeline.score(x, y)  # CER, expected to be 0 on this toy task
pipeline.predict(pd.Series(["x"]))  # expected to predict "y"

For more examples, see our demo notebook.
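
To make the CER figure concrete, here is a minimal pure-Python sketch of one common way to compute it. It is not the package's implementation: the helper names `levenshtein` and `cer` are hypothetical, and normalizing the total edit distance by the total number of gold characters is an assumption, since the exact definition used by score() is not stated here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # deleting the first i characters of a
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(predictions: list, golds: list) -> float:
    """Corpus-level CER (assumed definition): total edits / total gold characters."""
    total_edits = sum(levenshtein(p, g) for p, g in zip(predictions, golds))
    total_chars = sum(len(g) for g in golds)
    return total_edits / total_chars if total_chars else 0.0


# Toy check mirroring the quick start: a perfect prediction gives a CER of 0.
print(cer(["y"], ["y"]))
```

With this definition, a prediction identical to the gold text scores 0, and the value can exceed 1 when the prediction is much longer than the gold text.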

🛠️ Prepare your data

For both training and inference, data should be passed as a pandas Series of raw text. Here is a minimal example of what such a Series could look like:

import pandas as pd

X = pd.Series(["hello i'm some ocred text"])
y = pd.Series(["hello i'm the correct text"])

🔧 API

The module is composed of four objects, each exposing similar methods (fit(), predict() and score()) but with different parameters.

postocr_3stages.Pipeline: High-level utility

This is the end-to-end utility for direct training and correction.

Methods:

  • fit(ocr: Pandas.Series[str], gold: Pandas.Series[str]) -> postocr_3stages.Pipeline: Train the complete pipeline using pairs of (OCR, Gold) string samples.
  • score(ocr: Pandas.Series[str], gold: Pandas.Series[str]) -> float: Compute the CER of the complete pipeline using pairs of (OCR, Gold) string samples.
  • predict(ocr: Pandas.Series[str]) -> Pandas.Series[str]: Predict the corrected version of each string from the input.
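
The sketch below illustrates how the three stages described in this README could fit together at prediction time: detect, correct only the strings flagged as erroneous, then verify each correction. It is an assumption about the control flow, not the actual Pipeline code. The function `three_stage_correct` and its callable parameters are hypothetical stand-ins for the trained detector, corrector and verifier.

```python
from typing import Callable, List


def three_stage_correct(
    ocr_strings: List[str],
    is_correct: Callable[[str], bool],             # stand-in for the error detector
    correct: Callable[[str], str],                 # stand-in for the error corrector
    keep_correction: Callable[[str, str], bool],   # stand-in for the verifier
) -> List[str]:
    """Apply detection, correction and verification to each OCR string."""
    outputs = []
    for text in ocr_strings:
        if is_correct(text):
            # Stage 1: strings judged correct are passed through unchanged.
            outputs.append(text)
            continue
        # Stage 2: only strings judged erroneous are sent to the corrector.
        candidate = correct(text)
        # Stage 3: keep the correction only if the verifier accepts it.
        outputs.append(candidate if keep_correction(text, candidate) else text)
    return outputs


# Toy stand-ins: "0" marks an OCR error, and corrections must preserve length.
result = three_stage_correct(
    ["hell0 w0rld", "fine text"],
    is_correct=lambda s: "0" not in s,
    correct=lambda s: s.replace("0", "o"),
    keep_correction=lambda ocr, cand: len(cand) == len(ocr),
)
print(result)
```

Falling back to the original OCR string when a correction is rejected is a design choice of this sketch; it keeps a poor correction from making the text worse.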

postocr_3stages.Detector: Error Detector

Direct access to the error detection module. Warning: it classifies whole sequences and does not return the exact position of errors. A string is either correct or erroneous.

Methods:

  • fit(ocr: Pandas.Series[str], gold: Pandas.Series[bool]) -> postocr_3stages.Detector: Train the error detector using pairs of (OCR, correct?) samples.
  • score(ocr: Pandas.Series[str], gold: Pandas.Series[bool]) -> float: Compute the accuracy of the detector.
  • predict(ocr: Pandas.Series[str]) -> Pandas.Series[bool]: Predict whether each string sample is correct or not (contains errors).
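
Because the detector is trained on boolean targets rather than on gold strings, the labels have to be derived from the (OCR, Gold) pairs first. A minimal sketch, assuming that True means "the OCR string is already correct" (the label polarity is an assumption) and using a hypothetical helper name `detector_labels`:

```python
from typing import List


def detector_labels(ocr_strings: List[str], gold_strings: List[str]) -> List[bool]:
    """Label each OCR string True if it matches its gold transcription exactly."""
    if len(ocr_strings) != len(gold_strings):
        raise ValueError("OCR and gold sequences must have the same length")
    return [ocr == gold for ocr, gold in zip(ocr_strings, gold_strings)]


ocr = ["hello i'm some ocred text", "already fine"]
gold = ["hello i'm the correct text", "already fine"]
labels = detector_labels(ocr, gold)
print(labels)
```

The resulting list can be wrapped in a pandas Series before being passed to fit().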

postocr_3stages.Corrector: Error Corrector

Direct access to the error correction module. It tries to transform each string given as input to reduce the number of errors it contains. It should be called on erroneous strings only.

Methods:

  • fit(ocr: Pandas.Series[str], gold: Pandas.Series[str]) -> postocr_3stages.Corrector: Train the corrector using pairs of (OCR, Gold) string samples.
  • score(ocr: Pandas.Series[str], gold: Pandas.Series[str]) -> float: Compute the CER of the corrector using pairs of (OCR, Gold) string samples.
  • predict(ocr: Pandas.Series[str]) -> Pandas.Series[str]: Predict the corrected version of each string from the input.

postocr_3stages.Verifier: Correction Verifier

Direct access to the correction verification module. This module takes as input a numpy.ndarray containing the indicators on which the decision is based (in our case, insertion and deletion rates) and predicts whether the correction which led to these indicators should be kept or discarded.

Methods:

  • fit(X: np.ndarray[float], y: np.ndarray[bool]) -> postocr_3stages.Verifier: Train the verifier using pairs of (features, target) samples.
  • score(X: np.ndarray[float], y: np.ndarray[bool]) -> float: Compute the accuracy of the verifier.
  • predict(X: np.ndarray[float]) -> np.ndarray[bool]: Predict whether the corrections which led to the indicators passed as input should be kept or discarded.
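
As an illustration of the kind of indicators the verifier consumes, the sketch below derives an insertion rate and a deletion rate from an (OCR, corrected) pair using the standard library's difflib. This is only an approximation under stated assumptions: the package relies on its ISRI bindings for edit computation, difflib does not guarantee a minimal edit script, and both the helper name `insertion_deletion_rates` and the normalization by OCR length are hypothetical choices.

```python
from difflib import SequenceMatcher
from typing import Tuple


def insertion_deletion_rates(ocr: str, corrected: str) -> Tuple[float, float]:
    """Return (insertion rate, deletion rate) of the correction, relative to the OCR length."""
    matcher = SequenceMatcher(None, ocr, corrected, autojunk=False)
    inserted = deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            inserted += j2 - j1
        elif tag == "delete":
            deleted += i2 - i1
        elif tag == "replace":
            # A replacement removes OCR characters and adds new ones.
            deleted += i2 - i1
            inserted += j2 - j1
    length = max(len(ocr), 1)  # avoid dividing by zero on empty input
    return inserted / length, deleted / length


# One feature row per (OCR, corrected) pair.
features = [
    insertion_deletion_rates("abc", "abcd"),
    insertion_deletion_rates("abcd", "abc"),
]
print(features)
```

Rows built this way can be stacked into a two-column float array and passed as X to fit() or predict(). A correction with very high rates rewrites most of the string, which is the kind of case a verifier may learn to discard.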

🐛 🐞 🦗 🪳 Bugs

There are some known (and many unknown) problems in this work-in-progress implementation.

  • Our Python bindings of the original ISRI tools may have some memory leaks.

📝 TODO

  • Make pip install work out of the box
  • Add options for GPU/CPU training
  • Implement loading/saving properly
  • Offer some choice for the correction model (try a transformer model)
  • Offer some choice for the verification model (add a simple edit length filter)
  • Add some support for self-supervised pretraining
