Text Quality

A package to determine the quality of a digitized text, from a handwritten script or scanned print (HTR/OCR output).

The current pipeline is tuned on (historic) Dutch and will not perform well on other languages. However, the underlying model has been used for other (Germanic) languages and can be adapted to texts in other languages and from other time periods.

Examples

Good quality (not necessarily perfect):

Van
Malacca den 29 maart 1.
door zoo veel ruijmer handen te hebben,
[…]
Siac van waar op den 5=e deeser,
na onse verschijde adhortaties, is over
eeen gekomen
zoo meede van Siac

Bad quality:

uijtkoops --
winst suijverevense versis
e ee
,, 19
1 oe
na aftrek van
5 p:s C: Commiss:s
t 1a per 't geheel t p=s lb. off @'t geheeke
[…]

What's Missing

Pipelines for languages other than historic Dutch
An automatic training procedure for creating and updating pipelines
Additional features, such as publication year

See this notebook for a semi-automated pipeline creation process.

How to use text_quality

After installation, use the classify_text_quality.py script to classify PageXML or plain text files. For instance, to classify all *.xml files in the pages/ directory, use the --glob argument:

classify_text_quality.py --glob "pages/*.xml" --output classifications.csv --output-scores

For each input file, the script writes one line in CSV format that contains the classification result, which is one of:

Good quality
Medium quality
Bad quality
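The command above can also be assembled from Python, for example when classification is one step in a larger workflow. The following is a minimal sketch that uses only the flags documented in this README; the helper names build_command and run_classification are hypothetical and not part of the text_quality package.

```python
import shutil
import subprocess

SCRIPT = "classify_text_quality.py"


def build_command(pattern: str, output: str, with_scores: bool = False) -> list[str]:
    """Assemble the argument list for the script from documented flags only."""
    command = [SCRIPT, "--glob", pattern, "--output", output]
    if with_scores:
        command.append("--output-scores")
    return command


def run_classification(pattern: str, output: str, with_scores: bool = False) -> None:
    """Run the script; raises CalledProcessError on a non-zero exit status."""
    if shutil.which(SCRIPT) is None:
        raise FileNotFoundError(f"{SCRIPT} not found on PATH; is text-quality installed?")
    # No shell is involved, so the pattern reaches the script unexpanded,
    # just like the quoted pattern on the command line.
    subprocess.run(build_command(pattern, output, with_scores), check=True)


# Example call (requires the package to be installed):
# run_classification("pages/*.xml", "classifications.csv", with_scores=True)
```

Keeping the command construction separate from the execution makes it possible to inspect or log the exact arguments before anything is run.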

All supported parameters:

$ classify_text_quality.py --help
usage: Classify the quality of a (digitized) text. [-h] [--input [FILE ...]] [--pagexml [FILE ...]] [--pagexml-glob PATTERN] [--output FILE] [--output-scores]

options:
  -h, --help            show this help message and exit
  --output FILE, -o FILE
                        Output file; defaults to stdout.
  --output-scores       Output scores and text statistics.

Input:
  --input [FILE ...], -i [FILE ...]
                        Plain text file(s) to classify. Use '-' for stdin.
  --pagexml [FILE ...]  Input file(s) in PageXML format.
  --pagexml-glob PATTERN, --glob PATTERN
                        A pattern to find a set of PageXML files, e.g. 'pagexml/*.xml'.
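According to the help text, --input accepts '-' for stdin and the output defaults to stdout. Based on that, a text held in memory could be classified without writing it to disk first. This is a sketch under those assumptions; the function name classify_string is hypothetical, and the exact layout of the returned CSV line is not specified here.

```python
import subprocess

# '-' tells --input to read from stdin, as stated in the help text.
STDIN_COMMAND = ["classify_text_quality.py", "--input", "-"]


def classify_string(text: str) -> str:
    """Send `text` to the script via stdin and return what it writes to stdout."""
    result = subprocess.run(
        STDIN_COMMAND,
        input=text,           # passed to the script's stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError on failure
    )
    return result.stdout
```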

Notes

The pipeline might emit warnings like this:

UserWarning: X does not have valid feature names, but MLPClassifier was fitted with feature names

This is due to the internals of the Scikit-Learn Pipeline object, and can safely be ignored.
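If the warning clutters your logs and the pipeline runs inside your own Python process, it can be filtered with the standard warnings module. The sketch below silences only messages that start with the text quoted above, so other warnings stay visible; the function name ignore_feature_name_warning is hypothetical.

```python
import warnings


def ignore_feature_name_warning() -> None:
    """Suppress only the 'valid feature names' UserWarning quoted above."""
    warnings.filterwarnings(
        "ignore",
        # The message is a regular expression matched against the start
        # of the warning text.
        message="X does not have valid feature names",
        category=UserWarning,
    )
```

Call the function once before running the pipeline. It has no effect on warnings emitted by a separate process, such as the command-line script started from a shell.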

The dependencies are pinned to specific versions. While this prevents implicit updates, even patch-level updates of required libraries, it also avoids misleading warnings emitted by varying Scikit-Learn versions. You can change the pinned requirements manually if you are aware of these issues.
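Before changing a pinned requirement, it can help to see which versions are currently installed. The following sketch uses importlib.metadata from the standard library; the helper name installed_version is hypothetical, and the distribution names are taken from this README (text-quality) and from the library mentioned above (scikit-learn).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


for name in ("text-quality", "scikit-learn"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```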

The project setup is documented in project_setup.md.

Installation

To install the text_quality package:

pip install -U text-quality

Alternatively, install the package from the GitHub repository:

git clone https://github.com/LAHTeR/htr-quality-classifier.git
cd htr-quality-classifier
python3 -m pip install -U .

Documentation

Readthedocs

Contributing

If you want to contribute to the development of text_quality, have a look at the contribution guidelines.

Credits

Logic and implementation are based on Nautilus-OCR.

This package was created with Cookiecutter and the NLeSC/python-template.

Release history

0.3.1: Nov 16, 2023
0.3.0: Nov 16, 2023
0.2.2: Jul 27, 2023
0.2.0: Jul 27, 2023 (this version)
0.1.5: Apr 14, 2023

Download files

Source Distribution

text_quality-0.2.0.tar.gz (2.5 MB)

Uploaded Jul 27, 2023 Source

Built Distribution

text_quality-0.2.0-py3-none-any.whl (2.5 MB)

Uploaded Jul 27, 2023 Python 3

File details

Details for the file text_quality-0.2.0.tar.gz.

File metadata

Download URL: text_quality-0.2.0.tar.gz
Upload date: Jul 27, 2023
Size: 2.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for text_quality-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`4a7aa26547ee1ffabd88e24440f6d71be9c521f1ae243690104b61bd34555d93`
MD5	`ca0f46a80c5e8675e9865117d6d6e8bf`
BLAKE2b-256	`9385125e57a12815457db8001a6899e8b04bbb58e5ee839ddbc6dbb84e62852c`

File details

Details for the file text_quality-0.2.0-py3-none-any.whl.

File metadata

Download URL: text_quality-0.2.0-py3-none-any.whl
Upload date: Jul 27, 2023
Size: 2.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for text_quality-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7ce4a245218e3c83f3f1fb662c7abf29764f55db517b7f526e8cf8c016ed65b0`
MD5	`6ba5741921baa18700dd7ff77601e535`
BLAKE2b-256	`8fda03c10ddadd63f2fc6fe3a4843ecc5164f6d3c160edd45274c1165ef72b54`
