Post-OCR correction in 3 stages.
postocr-3stages: A Python package for post-OCR correction in 3 stages
🚧 Careful! Work in progress! 🚧
This is the home of the postocr-3stages Python package.
It aims to offer a simple solution for post-OCR correction with a friendly, scikit-learn-like API.
📦 Installation
For now, most of the module can be installed with pip. The exception is our prototype for fast computation of edits between OCR and corrected strings, which has to be built and installed manually. The installation steps are:
# create a virtual env "myvenv" and activate it
python -m venv myvenv
. myvenv/bin/activate
# install postocr package
pip install postocr-3stages
# manually download and install our ISRI tools package, with Python bindings of the original ISRI code
# - dependency
pip install pybind11
# - download the archive and uncompress
wget https://github.com/soduco/paper-ner-bench-das22/archive/refs/heads/main.zip
unzip main.zip
# - build and install
cd paper-ner-bench-das22-main/src/ocr/
pip install .
🚀 Quick start
This package tries to mimic the interface of sklearn for easier use and compatibility.
Here is a minimal example to understand how the library works. In this example we try to make the model learn to transform the string "x" into "y".
The .score functions compute the Character Error Rate (CER): the lower, the better.
- "OCR": original transcription from some OCR system, to be corrected
- "Gold": target transcription
import pandas as pd
from postocr_3stages.error_detector import ErrorDetector
from postocr_3stages.controller import LengthController
from postocr_3stages.NMT_corrector import NMTCorrector
from postocr_3stages.pipeline import Pipeline
x = pd.Series(["x"] * 100)
y = pd.Series(["y"] * 100)
pipeline = Pipeline(
error_detector=ErrorDetector(),
nmt_corrector=NMTCorrector(train_steps=10),
controller=LengthController(),
)
pipeline.fit(x, y)
pipeline.score(x, y)  # CER; ideally 0 on this toy example
pipeline.predict(pd.Series(["x"]))  # expected to predict "y"
For more examples, see our demo notebook.
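To make the CER reported by .score more concrete, here is a minimal, dependency-free sketch of one common definition: the total character-level edit distance divided by the total length of the gold strings. The helper names `levenshtein` and `cer` are illustrative and not part of the package, and the package may normalize differently (for example by averaging per sample).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(predictions, references) -> float:
    """Corpus-level CER (assumed definition): total edits / total reference characters."""
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


print(cer(["y"], ["y"]))  # 0.0: a perfect prediction
print(cer(["x"], ["y"]))  # 1.0: one substitution over one reference character
```

With this definition, a CER of 0 means every predicted string matches its gold string exactly, and values above 1 are possible when predictions are much longer than the references.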
🛠️ Prepare your data
For both training and inference, data should be passed as a pandas Series of raw text. Here is a minimal example of what such Series could look like:
import pandas as pd
X = pd.Series(["hello i'm some ocred text"])
y = pd.Series(["hello i'm the correct text"])
🔧 API
The module is composed of 4 objects, each exposing similar methods (fit(), predict() and score()) but with different parameters.
postocr_3stages.Pipeline: High-level utility
This is the end-to-end utility for direct training and correction.
Methods:
- fit(ocr: Pandas.Series[str], gold: Pandas.Series[str]) -> postocr_3stages.Pipeline: Train the complete pipeline using pairs of (OCR, Gold) string samples.
- score(ocr: Pandas.Series[str], gold: Pandas.Series[str]) -> float: Compute the CER of the complete pipeline using pairs of (OCR, Gold) string samples.
- predict(ocr: Pandas.Series[str]) -> Pandas.Series[str]: Predict the corrected version of each string from the input.
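As a rough mental model of how the three stages can fit together at prediction time, the sketch below chains a detector, a corrector and a verifier as plain callables. This is only one plausible reading of the descriptions in this README, not the package's actual implementation; `correct_in_three_stages` and the three toy functions are hypothetical stand-ins.

```python
from typing import Callable, List


def correct_in_three_stages(
    ocr_texts: List[str],
    is_erroneous: Callable[[str], bool],
    correct: Callable[[str], str],
    keep_correction: Callable[[str, str], bool],
) -> List[str]:
    """Detect erroneous strings, correct them, and keep only verified corrections."""
    outputs = []
    for text in ocr_texts:
        # Stage 1: strings judged correct are passed through untouched.
        if not is_erroneous(text):
            outputs.append(text)
            continue
        # Stage 2: the corrector only sees strings flagged as erroneous.
        candidate = correct(text)
        # Stage 3: a rejected correction falls back to the original OCR string.
        outputs.append(candidate if keep_correction(text, candidate) else text)
    return outputs


# Toy stand-ins (assumptions for illustration only).
def toy_detector(text: str) -> bool:
    return "0" in text


def toy_corrector(text: str) -> str:
    return text.replace("0", "o")


def toy_verifier(original: str, candidate: str) -> bool:
    return len(candidate) == len(original)


result = correct_in_three_stages(
    ["hell0 w0rld", "clean text"], toy_detector, toy_corrector, toy_verifier
)
print(result)  # ['hello world', 'clean text']
```

The fallback in the last stage is the point of verification: a correction that looks suspicious is discarded, so the output is never worse than the OCR input for that string.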
postocr_3stages.Detector: Error Detector
Direct access to the error detection module. Warning: it classifies whole sequences and does not return the exact position of errors. A string is either correct or erroneous.
Methods:
- fit(ocr: Pandas.Series[str], gold: Pandas.Series[bool]) -> postocr_3stages.Detector: Train the detector using pairs of (OCR, correct?) samples.
- score(ocr: Pandas.Series[str], gold: Pandas.Series[bool]) -> float: Compute the accuracy of the detector.
- predict(ocr: Pandas.Series[str]) -> Pandas.Series[bool]: Predict whether each string sample is correct or not (contains errors).
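The detector is trained on boolean labels rather than on gold strings. A minimal sketch of deriving such labels from (OCR, Gold) pairs is shown below. It assumes that True means "the string is correct", which the "(OCR, correct?)" wording suggests but does not confirm; `make_detector_labels` is a hypothetical helper, not part of the package.

```python
from typing import List


def make_detector_labels(ocr_texts: List[str], gold_texts: List[str]) -> List[bool]:
    """Label each OCR string True if it already equals its gold transcription."""
    if len(ocr_texts) != len(gold_texts):
        raise ValueError("ocr_texts and gold_texts must have the same length")
    # Assumption: True = correct, False = contains at least one error.
    return [ocr == gold for ocr, gold in zip(ocr_texts, gold_texts)]


labels = make_detector_labels(
    ["hello world", "he1lo world"],
    ["hello world", "hello world"],
)
print(labels)  # [True, False]
```

The resulting list can be wrapped in a pandas Series before being passed to fit() or score().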
postocr_3stages.Corrector: Error Corrector
Direct access to the error correction module. It tries to transform each input string to reduce the number of errors it contains. It should be called on erroneous strings only.
Methods:
- fit(ocr: Pandas.Series[str], gold: Pandas.Series[str]) -> postocr_3stages.Corrector: Train the corrector using pairs of (OCR, Gold) string samples.
- score(ocr: Pandas.Series[str], gold: Pandas.Series[str]) -> float: Compute the CER of the corrector using pairs of (OCR, Gold) string samples.
- predict(ocr: Pandas.Series[str]) -> Pandas.Series[str]: Predict the corrected version of each string from the input.
postocr_3stages.Verifier: Correction Verifier
Direct access to the correction verification module. This module takes as input a numpy.ndarray containing the indicators on which the decision is based (in our case insertion and deletion rates) and predicts whether the correction which led to these indicators should be kept or discarded.
Methods:
- fit(X: np.ndarray[float], y: np.ndarray[bool]) -> postocr_3stages.Verifier: Train the verifier using pairs of (features, target) samples.
- score(X: np.ndarray[float], y: np.ndarray[bool]) -> float: Compute the accuracy of the verifier.
- predict(X: np.ndarray[float]) -> np.ndarray[bool]: Predict whether the corrections which led to the indicators passed as input should be kept or discarded.
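To illustrate what insertion and deletion rates could look like, the sketch below derives them from an (OCR, corrected) pair with the standard library's difflib. This is an approximation for illustration only: the package relies on its own ISRI-based edit computation, and the feature order, the handling of replacements and the normalization by OCR length are all assumptions. `insertion_deletion_rates` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import Tuple


def insertion_deletion_rates(ocr: str, corrected: str) -> Tuple[float, float]:
    """Return (insertion rate, deletion rate) of the correction, relative to the OCR length."""
    matcher = SequenceMatcher(None, ocr, corrected, autojunk=False)
    inserted = deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            inserted += j2 - j1
        elif tag == "delete":
            deleted += i2 - i1
        elif tag == "replace":
            # Assumption: a replacement counts as deleting the old span and inserting the new one.
            deleted += i2 - i1
            inserted += j2 - j1
    length = max(len(ocr), 1)  # avoid dividing by zero on empty OCR strings
    return inserted / length, deleted / length


pairs = [("abcd", "abc"), ("abc", "abc")]
features = [list(insertion_deletion_rates(ocr, corrected)) for ocr, corrected in pairs]
print(features)  # [[0.0, 0.25], [0.0, 0.0]]
```

A nested list like `features` can be converted with `np.asarray(features, dtype=float)` to obtain an array of shape (n_samples, 2). A correction with unusually high rates rewrites much of the string, which is the kind of signal a verifier can use to discard it.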
🐛 🐞 🦗 🪳 Bugs
There are some known (and many unknown) problems in this work-in-progress implementation.
- Our Python bindings of the original ISRI tools may have some memory leaks.
📝 TODO
- Make pip install work out of the box
- Add options for GPU/CPU training
- Implement loading/saving properly
- Offer some choice for the correction model (try a transformer model)
- Offer some choice for the verification model (add a simple edit length filter)
- Add some support for self-supervised pretraining