A set of utilities for handling alphabets of corpora and OCR/HTR datasets

PyLeLemmatize

Getting started

License: MIT

A framework for assisting with transliteration and character sets in Python.

PyLeLemmatize is a Python package for lemmatizing characters. It provides a simple and efficient way to reduce large character sets to simpler ones.

Installation

Install from PyPI

To install pylelemmatize from PyPI:

pip install pylelemmatize

To install for development, see the Development Installation section.

Python Usage

Simple letter lemmatization

from pylelemmatize import charsets, llemmatize

greek_poly_string = "Καὶ ὅτε ἤνοιξεν τὴν σφραγῖδα τὴν ἑβδόμην, ἐγένετο σιγὴ ἐν τῷ οὐρανῷ ὡς ἡμιώριον."

print(f"Polytonic   : {greek_poly_string}")
print(f"Modern Greek: {llemmatize(greek_poly_string, charsets.iso_8859_7)}")
print(f"ASCII       : {llemmatize(greek_poly_string, charsets.ascii)}")

Output:

Polytonic   : Καὶ ὅτε ἤνοιξεν τὴν σφραγῖδα τὴν ἑβδόμην, ἐγένετο σιγὴ ἐν τῷ οὐρανῷ ὡς ἡμιώριον.
Modern Greek: Καί ότε ήνοιξεν τήν σφραγίδα τήν έβδόμην, έγένετο σιγή έν τώ ούρανώ ώς ήμιώριον.
ASCII       : Kai ote enoixen ten spragida ten ebdomen, egeneto sige en to ourano os emiorion.

Efficient letter lemmatization

Creating an automatic llemmatizer is expensive: O(|input_alphabet| × |output_alphabet|). Once created, llemmatizers are equally fast to apply regardless of their alphabet sizes. The following IPython code snippet demonstrates the cost of creating versus applying llemmatizers.

from pylelemmatize import charsets, llemmatizer

greek_poly_string = "Καὶ ὅτε ἤνοιξεν τὴν σφραγῖδα τὴν ἑβδόμην, ἐγένετο σιγὴ ἐν τῷ οὐρανῷ ὡς ἡμιώριον."

print("Creating autoaligned llemmatizers O(|src_alphabet|x|dst_alphabet|)")
print("Medium llemmatizer: |34|x|186|")
%timeit polytonic2modern_greek = llemmatizer(greek_poly_string, charsets.iso_8859_7)
polytonic2modern_greek = llemmatizer(greek_poly_string, charsets.iso_8859_7)

print("Large llemmatizer: |100|x|3549|")
%timeit mes2ascii = llemmatizer(charsets.mes3a, charsets.ascii)
mes2ascii = llemmatizer(charsets.mes3a, charsets.ascii)

print("\nApplying the medium and large llemmatizers on strings:")
for inp_str in [greek_poly_string, greek_poly_string * 1000, greek_poly_string * 1000000]:
    modern_greek_str =  polytonic2modern_greek(inp_str)
    print(f"\nString size: {len(inp_str)}")
    %timeit modern_greek_str =  polytonic2modern_greek(inp_str)
    modern_greek_str =  polytonic2modern_greek(inp_str)
    %timeit modern_greek_str =  mes2ascii(inp_str)

Output:

Creating autoaligned llemmatizers O(|src_alphabet|x|dst_alphabet|)
Medium llemmatizer: |34|x|186|
1.97 s ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Large llemmatizer: |100|x|3549|
46.2 s ± 1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
Applying the medium and large llemmatizers on strings:

String size: 80
6.06 μs ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
5.94 μs ± 65 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

String size: 80000
361 μs ± 6.79 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
397 μs ± 3.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

String size: 80000000
499 ms ± 984 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
521 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
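
In the timings above, application cost tracks string length rather than alphabet size, which is the behaviour one would expect from a precomputed lookup table. The sketch below illustrates that general pattern using only the standard library. It is not pylelemmatize's implementation: `strip_marks`, `toy_cost` and `build_table` are hypothetical helpers, and the cost function is invented for the illustration.

```python
import unicodedata


def strip_marks(ch):
    """Return ch decomposed (NFD) with all combining marks removed."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_cost(src_ch, dst_ch):
    """Hypothetical cost: 0 if identical, 1 if equal without marks, else 2."""
    if src_ch == dst_ch:
        return 0
    if strip_marks(src_ch) == strip_marks(dst_ch):
        return 1
    return 2


def build_table(src_alphabet, dst_alphabet, unknown="?"):
    """Compare every source character with every destination character.

    This is the expensive O(|src| x |dst|) step, done once.
    """
    table = {}
    for s in src_alphabet:
        best = min(dst_alphabet, key=lambda d: toy_cost(s, d))
        table[ord(s)] = best if toy_cost(s, best) < 2 else unknown
    return table


# Source alphabet: a and e with grave, acute and circumflex accents.
src = "a\u00e0\u00e1\u00e2e\u00e8\u00e9\u00ea"
table = build_table(src, "ae")

# Applying the table is one lookup per character, whatever the alphabet sizes.
# Characters outside the source alphabet (d, j) pass through unchanged.
result = "d\u00e9j\u00e0".translate(table)
print(result)  # deja
```

Building the table once and reusing it is what makes the split between `llemmatizer` (create) and the returned callable (apply) worthwhile for large inputs.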

Command Line Invocation

Evaluate Merges

ll_evaluate_merges -h # get help string with the cli interface
ll_evaluate_merges -corpus_glob  './sample_data/wienocist_charter_1/wienocist_charter_1*'

Note that the merge CER is not symmetric: the order of the characters in each merge pair matters.

# The following gives a CER of 0.0591
ll_evaluate_merges -corpus_glob  './sample_data/wienocist_charter_1/wienocist_charter_1*' -merges '[("I", "J"), ("i", "j")]'
# While the following gives a CER of 0.0007
ll_evaluate_merges -corpus_glob  './sample_data/wienocist_charter_1/wienocist_charter_1*' -merges '[("J", "I"), ("j", "i")]'
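
One plausible reading of the asymmetry above is that a merge pair replaces its first character with its second, so the error depends on how frequent the replaced character is in the corpus. The sketch below works under that assumption and measures error as the fraction of characters changed. `merge_cer` is a hypothetical helper, not the computation performed by `ll_evaluate_merges`.

```python
def merge_cer(text, merges):
    """Fraction of characters changed when each (src, dst) pair maps src to dst.

    Assumption: merges are pure one-to-one substitutions, so counting the
    changed positions is enough and no edit-distance alignment is needed.
    """
    if not text:
        return 0.0
    table = {ord(src): dst for src, dst in merges}
    merged = text.translate(table)
    changed = sum(a != b for a, b in zip(text, merged))
    return changed / len(text)


sample = "iiij"  # 'i' is three times as frequent as 'j'
cer_i_to_j = merge_cer(sample, [("i", "j")])
cer_j_to_i = merge_cer(sample, [("j", "i")])
print(cer_i_to_j)  # 0.75
print(cer_j_to_i)  # 0.25
```

Replacing the frequent character costs more than replacing the rare one, which matches the direction of the two CER values reported above.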

Extract corpus alphabet

ll_extract_corpus_alphabet -h # get help string with the cli interface
ll_extract_corpus_alphabet -corpus_glob './sample_data/wienocist_charter_1/wienocist_charter_1*'
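
As a rough picture of what extracting a corpus alphabet involves, the sketch below counts every character found in a set of texts and derives the sorted alphabet. It uses only the standard library; `read_corpus` and `count_characters` are hypothetical helpers, and the real command may report its result differently.

```python
import glob
from collections import Counter


def read_corpus(corpus_glob):
    """Yield the contents of every file matching the glob (UTF-8 assumed)."""
    for path in sorted(glob.glob(corpus_glob)):
        with open(path, encoding="utf-8") as handle:
            yield handle.read()


def count_characters(texts):
    """Count how often each character occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts


# In-memory texts stand in for read_corpus("some/glob/*") here.
counts = count_characters(["abba", "cat"])
alphabet = "".join(sorted(counts))
print(alphabet)               # abct
print(counts.most_common(1))  # [('a', 3)]
```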

Test corpus on alphabets

ll_test_corpus_on_alphabets -h # get help string with the cli interface
ll_test_corpus_on_alphabets -corpus_glob './sample_data/wienocist_charter_1/wienocist_charter_1*' -alphabets 'bmp_mufi,ascii,mes1,iso8859_2' -verbose
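
Testing a corpus against an alphabet can be thought of as asking how much of the text the alphabet can represent. The sketch below computes that coverage for a single alphabet. `alphabet_coverage` is a hypothetical helper, and `ascii_alphabet` is built from the standard `string` module rather than from pylelemmatize's `charsets`; the statistics printed by the actual command are not shown in this document and may differ.

```python
import string


def alphabet_coverage(text, alphabet):
    """Return (fraction of characters covered, sorted missing characters)."""
    if not text:
        return 1.0, ""
    allowed = set(alphabet)
    missing = {ch for ch in text if ch not in allowed}
    covered = sum(ch in allowed for ch in text)
    return covered / len(text), "".join(sorted(missing))


# Stand-in for an ASCII alphabet: letters, digits, punctuation and space.
ascii_alphabet = string.ascii_letters + string.digits + string.punctuation + " "

# "naive cafe" written with i-diaeresis (U+00EF) and e-acute (U+00E9).
ratio, missing = alphabet_coverage("na\u00efve caf\u00e9", ascii_alphabet)
print(ratio)         # 0.8
print(len(missing))  # 2
```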

Demapping

Setup

mkdir -p tmp/models
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O ./tmp/tinyshakespeare.txt
cat ./tmp/tinyshakespeare.txt |shuf  --random-source ./tmp/tinyshakespeare.txt > ./tmp/tinyshakespeare_shuf.txt
head -n 1000 ./tmp/tinyshakespeare_shuf.txt > ./tmp/tinyshakespeare_test.txt
tail -n +1001 ./tmp/tinyshakespeare_shuf.txt > ./tmp/tinyshakespeare_exper.txt
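
The shell commands above shuffle the corpus reproducibly, hold out the first 1000 lines for testing, and keep the rest for experiments. For readers without `shuf`, the same idea can be sketched in Python. `split_lines` and its seed are assumptions, and the resulting line order will not match the one produced by the shell pipeline.

```python
import random


def split_lines(lines, n_test, seed=0):
    """Shuffle lines reproducibly, then split into (test, experiment) parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded, so reruns give the same split
    return shuffled[:n_test], shuffled[n_test:]


# Small in-memory example; for the real corpus, pass the lines of the file
# and n_test=1000.
lines = [f"line {i}" for i in range(10)]
test_part, exper_part = split_lines(lines, n_test=3)
print(len(test_part), len(exper_part))  # 3 7
```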

Train a demapper

A GPU is used automatically if one is available.

ll_train_one2one -corpus_files ./tmp/tinyshakespeare_exper.txt -output_model_path ./tmp/models/toy_model.pt -input_alphabet '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!",.:;? ' -output_alphabet '0abcdfghjklmnpqrstvwxz.' -nb_epochs 3

If an existing model has been trained for fewer than nb_epochs epochs, training resumes from where it stopped.

ll_train_one2one -corpus_files ./tmp/tinyshakespeare_exper.txt -output_model_path ./tmp/models/toy_model.pt -input_alphabet '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!",.:;? ' -output_alphabet '0abcdfghjklmnpqrstvwxz.' -nb_epochs 50

Use a demapper

A demapper can be used on streams or files.

echo 'a da nat knaw what ta saa. bat a knaw what ta thank.' |ll_infer_one2one -model_path ./tmp/models/toy_model.pt

Output:

I do not know what to say, but a know what to think,
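
The input string in the example above looks like ordinary English after a lossy mapping: lowercased, with every vowel collapsed to "a" and punctuation collapsed to ".". The sketch below reproduces that forward mapping by hand, to show what the trained demapper is asked to invert. `toy_map` is a hypothetical helper written to mirror this one example; the mapping the tools derive from `-input_alphabet` and `-output_alphabet` may differ.

```python
def toy_map(text):
    """Hand-written lossy mapping: lowercase, vowels to 'a', digits to '0',
    punctuation to '.'. An assumption for illustration only."""
    table = {}
    for ch in "eiouy":
        table[ord(ch)] = "a"
    for ch in "123456789":
        table[ord(ch)] = "0"
    for ch in '!",:;?':
        table[ord(ch)] = "."
    return text.lower().translate(table)


mapped = toy_map("I do not know what to say, but I know what to think.")
print(mapped)  # a da nat knaw what ta saa. bat a knaw what ta thank.
```

Many different sentences collapse to the same mapped string, so the inverse is ambiguous. That is why recovering the original requires a trained model rather than a lookup table, and why the model's output above is close to the original sentence but not identical.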

Development

Development Installation

To extend pylelemmatize, install it from GitHub.

git clone git@github.com:anguelos/pylelemmatize.git
cd pylelemmatize
pip install -r requirements
pip install -r ./docs/requirements.txt
pip install -e .

This will install pylelemmatize on your system in development mode.

Testing

Running the unit tests

pytest --cov ./src/pylelemmatize/ ./test/pytest/

Running shell script tests

This runs every bash script with -h, essentially checking syntax and imports.

./test/test_shell_scripts.sh
