[CHaracter Ocr COordination for MUFI iN texts] A simple script to maintain a reasonable training set of HTR/OCR characters
Choco-Mufin
Tools for normalizing the use of some characters and checking file consistency. Mainly aimed at dealing with overly diverse ways of transcribing medieval data (allographetic and graphematic, for example) while keeping information such as abbreviations, hence MUFI.
Install
pip install chocomufin
Commands
The workflow is generally the following: you generate a conversion table (choco-mufin generate table.csv your-files.xml), then use this table to either control (choco-mufin control table.csv your-files.xml) or convert them (choco-mufin convert table.csv your-files.xml).
Conversion will automatically add a suffix, which you can define with --suffix.
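To give an idea of what the generate step produces, here is a minimal Python sketch that lists every distinct character of a text together with its Unicode name and code point. It is an illustration only, not chocomufin's implementation: the function names `generate_rows` and `to_csv` are hypothetical, the `normalized` column simply defaults to the character itself, and the tool-specific `mufidecode` column is left out.

```python
import csv
import io
import unicodedata


def generate_rows(text):
    """Yield one (char, name, normalized, codepoint) row per distinct character."""
    for char in sorted(set(text)):
        if char.isspace():
            continue  # whitespace is not worth a table row in this sketch
        # Characters without a Unicode name (e.g. private-use code points)
        # get a placeholder; this placeholder is an assumption of the sketch.
        name = unicodedata.name(char, "[UNKNOWN NAME]")
        # By default a character is normalized to itself; a human editor
        # would then adjust the third column by hand.
        yield char, name, char, format(ord(char), "04X")


def to_csv(text):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["char", "name", "normalized", "codepoint"])
    writer.writerows(generate_rows(text))
    return buffer.getvalue()


print(to_csv("AB A"))
```

The resulting table is meant to be edited by hand before it is used for control or conversion.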
Example table of conversion
char,name,normalized,codepoint,mufidecode
ī,LATIN SMALL LETTER I WITH MACRON,ĩ,012B,i
ı,LATIN SMALL LETTER DOTLESS I,i,0131,i
ﬀ,LATIN SMALL LIGATURE FF,ff,FB00,ff
A,LATIN CAPITAL LETTER A,A,0041,A
B,LATIN CAPITAL LETTER B,B,0042,B
C,LATIN CAPITAL LETTER C,C,0043,C
D,LATIN CAPITAL LETTER D,D,0044,D
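Since the `name` and `codepoint` columns are derived from `char`, a hand-edited table can be checked against Python's `unicodedata` module. The sketch below is not part of chocomufin; `check_rows` is a hypothetical helper, and it assumes each `char` cell holds a single code point after NFC normalization.

```python
import csv
import io
import unicodedata

# A few rows in the same format as the example table.
TABLE = """char,name,normalized,codepoint,mufidecode
ī,LATIN SMALL LETTER I WITH MACRON,ĩ,012B,i
ı,LATIN SMALL LETTER DOTLESS I,i,0131,i
A,LATIN CAPITAL LETTER A,A,0041,A
"""


def check_rows(table_text):
    """Return a list of (char, problem) tuples for inconsistent rows."""
    problems = []
    for row in csv.DictReader(io.StringIO(table_text)):
        # NFC so that a decomposed "i + combining macron" counts as one character.
        char = unicodedata.normalize("NFC", row["char"])
        if len(char) != 1:
            problems.append((char, "char is not a single code point"))
            continue
        expected = format(ord(char), "04X")
        if row["codepoint"].upper() != expected:
            problems.append((char, f"codepoint should be {expected}"))
        # Private-use characters have no Unicode name; name() then returns "".
        if unicodedata.name(char, "") != row["name"]:
            problems.append((char, "name does not match Unicode name"))
    return problems


print(check_rows(TABLE))
```

Rows for private-use characters will be reported as name mismatches by this check, because the Unicode database has no names for them.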
As table:
char | name | normalized | codepoint | mufidecode |
---|---|---|---|---|
ī | LATIN SMALL LETTER I WITH MACRON | ĩ | 012B | i |
ı | LATIN SMALL LETTER DOTLESS I | i | 0131 | i |
ﬀ | LATIN SMALL LIGATURE FF | ff | FB00 | ff |
A | LATIN CAPITAL LETTER A | A | 0041 | A |
B | LATIN CAPITAL LETTER B | B | 0042 | B |
C | LATIN CAPITAL LETTER C | C | 0043 | C |
D | LATIN CAPITAL LETTER D | D | 0044 | D |
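The idea behind the control and convert steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the way chocomufin itself works: `load_mapping`, `unknown_chars` and `convert` are hypothetical names, the input is treated as plain text rather than XML, and replacement is done one code point at a time.

```python
import csv
import io

# A few rows in the same format as the example table.
TABLE = """char,name,normalized,codepoint,mufidecode
ı,LATIN SMALL LETTER DOTLESS I,i,0131,i
A,LATIN CAPITAL LETTER A,A,0041,A
B,LATIN CAPITAL LETTER B,B,0042,B
"""


def load_mapping(table_text):
    """Map each `char` to its `normalized` replacement."""
    reader = csv.DictReader(io.StringIO(table_text))
    return {row["char"]: row["normalized"] for row in reader}


def unknown_chars(text, mapping):
    """Control: list the non-whitespace characters missing from the table."""
    return sorted({c for c in text if c not in mapping and not c.isspace()})


def convert(text, mapping):
    """Convert: replace every known character and keep the others untouched."""
    return "".join(mapping.get(c, c) for c in text)


mapping = load_mapping(TABLE)
print(convert("Bıs", mapping))
print(unknown_chars("Bıs", mapping))
```

A per-code-point lookup like this one does not handle sequences such as a base letter followed by a combining mark; a fuller version would have to match those as a unit.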