A package to determine the quality of a digitized text, from a handwritten script or scanned print (HTR/OCR output).

Project description

Text Quality

This package classifies the text quality of a digitized page.

Usage

After installation, use the classify_text_quality.py script to classify PageXML or plain text files. For instance, if you want to classify all *.xml files in the pages/ directory, use the --glob argument:

classify_text_quality.py --glob "pages/*.xml" --output classifications.csv --output-scores

For each input file, the script writes one line in CSV format containing the classification result, which is one of:

  1. Good quality
  2. Medium quality
  3. Bad quality

All supported parameters:

classify_text_quality.py --help
usage: Classify the quality of a (digitized) text. [-h] [--input [FILE ...]] [--pagexml [FILE ...]] [--pagexml-glob PATTERN] [--output FILE] [--output-scores]

options:
  -h, --help            show this help message and exit
  --output FILE, -o FILE
                        Output file; defaults to stdout.
  --output-scores       Output scores and text statistics.

Input:
  --input [FILE ...], -i [FILE ...]
                        Plain text file(s) to classify. Use '-' for stdin.
  --pagexml [FILE ...]  Input file(s) in PageXML format.
  --pagexml-glob PATTERN, --glob PATTERN
                        A pattern to find a set of PageXML files, e.g. 'pagexml/*.xml'.
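When the classifier is one step in a larger Python workflow, the command line shown above can be assembled and launched with the standard library. The following is a minimal sketch, assuming `classify_text_quality.py` is on the `PATH` after installation. The helper names `build_command` and `run_classifier` are hypothetical and not part of the package; only the `--glob`, `--output`, and `--output-scores` flags from the help output are used.

```python
import subprocess


def build_command(pattern, output=None, with_scores=False):
    """Assemble the argument list for classify_text_quality.py (hypothetical helper)."""
    command = ["classify_text_quality.py", "--glob", pattern]
    if output is not None:
        # Without --output the script writes to stdout, per its help text.
        command += ["--output", str(output)]
    if with_scores:
        command.append("--output-scores")
    return command


def run_classifier(pattern, output=None, with_scores=False):
    """Run the script; raises CalledProcessError on a non-zero exit status."""
    # The pattern is passed as a single argument, so no shell expands it
    # and the script resolves the glob itself.
    return subprocess.run(build_command(pattern, output, with_scores), check=True)
```

Calling `run_classifier("pages/*.xml", output="classifications.csv", with_scores=True)` would then correspond to the example command in the Usage section.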

Notes

The pipeline might emit warnings like this:

UserWarning: X does not have valid feature names, but MLPClassifier was fitted with feature names

This is due to the internals of the Scikit-Learn Pipeline object, and can safely be ignored.
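If the warning clutters the logs of your own Python process, it can be filtered by its message text so that other warnings stay visible. This is a sketch using the standard `warnings` module; the function name `silence_feature_name_warning` is hypothetical, and the filter only affects the interpreter in which it is called, so it assumes you invoke the classification code from your own Python program rather than through the command-line script.

```python
import warnings


def silence_feature_name_warning():
    """Ignore only the 'valid feature names' UserWarning quoted above."""
    warnings.filterwarnings(
        "ignore",
        # The message argument is a regular expression that must match
        # the start of the warning text.
        message=r"X does not have valid feature names",
        category=UserWarning,
    )
```

Filtering by message rather than ignoring every `UserWarning` keeps unrelated warnings from Scikit-Learn or other libraries visible.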

The dependencies are pinned to specific versions. While this prevents implicit updates, even patch-level updates of required libraries, it also prevents misleading warnings emitted by varying Scikit-Learn versions. Hence, the pinned requirements can be changed manually if you are aware of these issues.
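Before changing a pin, it can help to see which versions are actually installed in the current environment. The sketch below uses `importlib.metadata` from the standard library; the helper names are hypothetical, and the distribution names `text-quality` and `scikit-learn` are the names assumed to be used on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def report_versions(distributions=("text-quality", "scikit-learn")):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in distributions}
```

Printing `report_versions()` before and after editing the requirements shows whether the environment really changed.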

How to use text_quality

The project setup is documented in project_setup.md.

Installation

To install the text_quality package:

pip install text-quality

Alternatively, install the package from the GitHub repository:

git clone https://github.com/LAHTeR/htr-quality-classifier.git
cd htr-quality-classifier
python3 -m pip install .

Documentation

Readthedocs

Contributing

If you want to contribute to the development of text_quality, have a look at the contribution guidelines.

Credits

Logic and implementation are based on Nautilus-OCR.

This package was created with Cookiecutter and the NLeSC/python-template.

Download files

Source Distribution

text_quality-0.1.5.tar.gz (2.5 MB)

Uploaded Source

Built Distribution

text_quality-0.1.5-py3-none-any.whl (2.5 MB)

Uploaded Python 3

File details

Details for the file text_quality-0.1.5.tar.gz.

File metadata

  • Download URL: text_quality-0.1.5.tar.gz
  • Upload date:
  • Size: 2.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.3

File hashes

Hashes for text_quality-0.1.5.tar.gz
Algorithm Hash digest
SHA256 4ec2dea1bbe309290ff3b1e7565c2e16ddb767d3a7ebb582c43d5d1e8fde0bb7
MD5 02660c2773f18b9fe07056a47f6e0842
BLAKE2b-256 f2394804cfa2c86ba37b4e4f82071c8795aec55da2975e8dc7750ceb34522d2f
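A downloaded archive can be checked against the SHA256 digest listed above with the standard `hashlib` module. This is a minimal sketch; the function names `sha256_of` and `matches_sha256` are hypothetical, and the file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `matches_sha256("text_quality-0.1.5.tar.gz", "4ec2dea1bbe309290ff3b1e7565c2e16ddb767d3a7ebb582c43d5d1e8fde0bb7")` should return `True` for an intact download.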

File details

Details for the file text_quality-0.1.5-py3-none-any.whl.

File hashes

Hashes for text_quality-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 edb6d259256a0da008a86e9fa3c0a2e9ce2a5c5366c552a99357b3974813e6ce
MD5 cf7d7b5493974ae2e6abcc2edcb97776
BLAKE2b-256 7cb3af75b56dd3d1a2bdeb41b8177bf8656e1ab5daa8cbb524e774588cefd886
