A set of utilities for handling alphabets of corpora and OCR/HTR datasets
Getting started
A framework for assisting with transliteration and character sets in Python.
PyLeLemmatize is a Python package for lemmatizing characters. It provides a simple and efficient way to reduce large character sets to simpler ones.
Installation
Install from PyPI
To install PyLeLemmatize from PyPI:
pip install pylelemmatize
For a development installation, see the Development Installation section below.
Python Usage
Simple letter lemmatization
import pylelemmatize as ll
greek_poly_string = "Καὶ ὅτε ἤνοιξεν τὴν σφραγῖδα τὴν ἑβδόμην, ἐγένετο σιγὴ ἐν τῷ οὐρανῷ ὡς ἡμιώριον."
print(f"Polytonic : {greek_poly_string}")
print(f"Modern Greek: {ll.llemmatize(greek_poly_string, ll.charsets.iso_8859_7)}")
print(f"ASCII : {ll.llemmatize(greek_poly_string, ll.charsets.ascii)}")
Output:
Polytonic : Καὶ ὅτε ἤνοιξεν τὴν σφραγῖδα τὴν ἑβδόμην, ἐγένετο σιγὴ ἐν τῷ οὐρανῷ ὡς ἡμιώριον.
Modern Greek: Καί ότε ήνοιξεν τήν σφραγίδα τήν έβδόμην, έγένετο σιγή έν τώ ούρανώ ώς ήμιώριον.
ASCII : Kai ote enoixen ten spragida ten ebdomen, egeneto sige en to ourano os emiorion.
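To make the idea of reducing a character set concrete, here is a minimal standard-library sketch. It is not how pylelemmatize aligns alphabets: it only strips combining marks through Unicode decomposition, so it cannot produce the accented monotonic output shown above. The names `reduce_to_alphabet` and `unknown` are hypothetical and exist only for this illustration.

```python
import unicodedata


def reduce_to_alphabet(text, alphabet, unknown="?"):
    """Map every character of text onto alphabet (illustrative sketch only).

    Characters already in the alphabet are kept. Otherwise the character is
    decomposed (NFD), its combining marks are dropped, and the remaining base
    letter is used if the alphabet contains it. Anything else becomes unknown.
    """
    allowed = set(alphabet)
    out = []
    for ch in text:
        if ch in allowed:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        if base and all(c in allowed for c in base):
            out.append(base)
        else:
            out.append(unknown)
    return "".join(out)


# Accented Latin letters reduced to plain lowercase ASCII letters and space.
print(reduce_to_alphabet("naïve café", "abcdefghijklmnopqrstuvwxyz "))

# Polytonic Greek reduced to unaccented Greek letters.
greek = "αβγδεζηθικλμνξοπρστυφχψωςΚ "
print(reduce_to_alphabet("Καὶ ὅτε ἤνοιξεν", greek))
```

A rule-based reduction like this only works when the target letter is the decomposed base of the source letter, which is one reason an automatic alignment between arbitrary alphabets is useful.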
Efficient letter lemmatization
Creating automatic llemmatizers is expensive: O(|input_alphabet| x |output_alphabet|). Once created, they run equally fast regardless of the alphabet sizes. The following IPython snippet demonstrates the cost of creating versus applying llemmatizers.
import pylelemmatize as ll
greek_poly_string = "Καὶ ὅτε ἤνοιξεν τὴν σφραγῖδα τὴν ἑβδόμην, ἐγένετο σιγὴ ἐν τῷ οὐρανῷ ὡς ἡμιώριον."
print("Creating autoaligned llemmatizers O(|src_alphabet|x|dst_alphabet|)")
print("Medium llemmatizer: |34|x|186|")
%timeit polytonic2modern_greek = ll.llemmatizer(greek_poly_string, ll.charsets.iso_8859_7)
polytonic2modern_greek = ll.llemmatizer(greek_poly_string, ll.charsets.iso_8859_7)
print("Large llemmatizer: |100|x|3549|")
%timeit mes2ascii = ll.llemmatizer(ll.charsets.mes3a, ll.charsets.ascii)
mes2ascii = ll.llemmatizer(ll.charsets.mes3a, ll.charsets.ascii)
print("\nApplying the medium and large llemmatizers on strings:")
for inp_str in [greek_poly_string, greek_poly_string * 1000, greek_poly_string * 1000000]:
    modern_greek_str = polytonic2modern_greek(inp_str)
    print(f"\nString size: {len(inp_str)}")
    %timeit modern_greek_str = polytonic2modern_greek(inp_str)
    modern_greek_str = polytonic2modern_greek(inp_str)
    %timeit modern_greek_str = mes2ascii(inp_str)
Output:
Creating autoaligned llemmatizers O(|src_alphabet|x|dst_alphabet|)
Medium llemmatizer: |34|x|186|
1.97 s ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Large llemmatizer: |100|x|3549|
46.2 s ± 1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Applying the medium and large llemmatizers on strings:
String size: 80
6.06 μs ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
5.94 μs ± 65 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
String size: 80000
361 μs ± 6.79 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
397 μs ± 3.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
String size: 80000000
499 ms ± 984 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
521 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
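The pattern in these timings (costly construction, cheap application that depends only on string length) is what one would expect from a precomputed lookup table. The sketch below illustrates that pattern with the standard library. It is an assumption that pylelemmatize works along these lines, and `char_cost` is a hypothetical stand-in for whatever similarity measure the package actually uses.

```python
import unicodedata


def base_letter(ch):
    """Return ch with its combining marks removed (via NFD decomposition)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def char_cost(src, dst):
    """Hypothetical dissimilarity between two characters (lower is closer)."""
    if src == dst:
        return 0
    if base_letter(src) == base_letter(dst):
        return 1
    if base_letter(src).casefold() == base_letter(dst).casefold():
        return 2
    return 3


def build_table(src_alphabet, dst_alphabet):
    """Compare every source character with every destination character.

    This is the O(|src_alphabet| x |dst_alphabet|) step.
    """
    table = {}
    for src in set(src_alphabet):
        table[ord(src)] = min(dst_alphabet, key=lambda dst: char_cost(src, dst))
    return table


# Built once (expensive for large alphabets) ...
table = build_table("àáâéèêÉ", "aeE")

# ... then applied with one dictionary lookup per character, whatever the
# alphabet sizes were. Characters missing from the table pass through unchanged.
print("écrit à la main".translate(table))
```

Because `str.translate` performs a single lookup per character, the running time grows with the length of the input string and not with the size of either alphabet, which matches the behaviour in the measurements above.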
PHOC string embedding
Pyramidal Histogram of Characters (PHOC) embeddings are implemented as a PyTorch layer.
import torch
import pylelemmatize as ll
phoc = ll.PHOC()
print(torch.norm(phoc("hello")-phoc("hell")))
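For readers unfamiliar with PHOC, the following pure-Python sketch shows the usual formulation: at each pyramid level the word is split into equal regions, and each region records which alphabet characters fall inside it. This is not the package's `PHOC` layer; the function name `phoc_vector`, the default alphabet, and the default levels are assumptions made for illustration.

```python
def phoc_vector(word, alphabet="abcdefghijklmnopqrstuvwxyz", levels=(1, 2, 3)):
    """Binary PHOC of word as a flat list of 0/1 values (illustrative sketch).

    Character k of an n-letter word occupies [k/n, (k+1)/n]. It is assigned to
    a region if at least half of that interval lies inside the region. All
    coordinates are scaled by n * level so that only integers are compared.
    """
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            region_start, region_end = region * n, (region + 1) * n
            present = set()
            for k, ch in enumerate(word):
                char_start, char_end = k * level, (k + 1) * level
                overlap = min(char_end, region_end) - max(char_start, region_start)
                if 2 * overlap >= level:  # at least 50% of the character
                    present.add(ch)
            vector.extend(1 if a in present else 0 for a in alphabet)
    return vector


def hamming(u, v):
    """Number of positions where two equally long vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Similar spellings give nearby vectors.
print(hamming(phoc_vector("hello"), phoc_vector("hell")))
print(hamming(phoc_vector("hello"), phoc_vector("world")))
```

The vector length is `len(alphabet) * sum(levels)`, so every word maps to a fixed-size representation regardless of its length.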
Command Line Invocation
Demapping
Training and using RNNs that reverse character mappings can be done on the CLI without any code editing.
Setup
mkdir -p tmp/models
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O ./tmp/tinyshakespeare.txt
cat ./tmp/tinyshakespeare.txt |shuf --random-source ./tmp/tinyshakespeare.txt > ./tmp/tinyshakespeare_shuf.txt
head -n 1000 ./tmp/tinyshakespeare_shuf.txt > ./tmp/tinyshakespeare_test.txt
tail -n +1001 ./tmp/tinyshakespeare_shuf.txt > ./tmp/tinyshakespeare_exper.txt
Train a demapper
A GPU is used automatically if one is available.
ll_train_one2one -corpus_files ./tmp/tinyshakespeare_exper.txt -output_model_path ./tmp/models/toy_model.pt -input_alphabet '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!",.:;? ' -output_alphabet '0abcdfghjklmnpqrstvwxz.' -nb_epochs 3
If an existing model has been trained for fewer than nb_epochs epochs, training resumes.
ll_train_one2one -corpus_files ./tmp/tinyshakespeare_exper.txt -output_model_path ./tmp/models/toy_model.pt -input_alphabet '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!",.:;? ' -output_alphabet '0abcdfghjklmnpqrstvwxz.' -nb_epochs 50
Use a demapper
A demapper can be used on streams or files.
echo 'a da nat knaw what ta saa. bat a knaw what ta thank.' |ll_infer_one2one -model_path ./tmp/models/toy_model.pt
Output:
I do not know what to say, but a know what to think,
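To show what the demapper has to undo, the sketch below builds a lossy forward mapping in plain Python. The mapping itself is an assumption inferred from the example input and the output alphabet above (vowels and y collapse to a, case is dropped, digits become 0, punctuation becomes a period); the names `build_lossy_table` and `LOSSY` are hypothetical and are not part of the package.

```python
import string

# Hypothetical lossy mapping, inferred from the example above.
VOWELS = "aeiouy"
PUNCTUATION = '!",.:;?'


def build_lossy_table():
    """Build a many-to-one translation table for str.translate."""
    table = {}
    for ch in string.ascii_lowercase:
        reduced = "a" if ch in VOWELS else ch
        table[ord(ch)] = reduced          # lowercase letter
        table[ord(ch.upper())] = reduced  # uppercase letter loses its case
    for ch in string.digits:
        table[ord(ch)] = "0"
    for ch in PUNCTUATION:
        table[ord(ch)] = "."
    return table


LOSSY = build_lossy_table()

original = "I do not know what to say, but I know what to think."
print(original.translate(LOSSY))
# a da nat knaw what ta saa. bat a knaw what ta thank.
```

Because several input symbols share one output symbol, the mapping cannot be inverted character by character. Recovering the original text requires context, which is why a trained sequence model is used for the reverse direction.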
Evaluate Merges
Evaluate the character error rate (CER) introduced by merging multiple symbols into a single one.
ll_evaluate_merges -h # get help string with the cli interface
ll_evaluate_merges -corpus_glob './sample_data/wienocist_charter_1/wienocist_charter_1*' -merges '[("u", "v"), ("U", "V")]'
Note that the merge CER is not symmetric at all!
# The following gives a CER of 0.0591
ll_evaluate_merges -corpus_glob './sample_data/wienocist_charter_1/wienocist_charter_1*' -merges '[("I", "J"), ("i", "j")]'
# While the following gives a CER of 0.0007
ll_evaluate_merges -corpus_glob './sample_data/wienocist_charter_1/wienocist_charter_1*' -merges '[("J", "I"), ("j", "i")]'
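The asymmetry follows from symbol frequencies, as the small sketch below illustrates. It assumes that each pair reads (replaced symbol, replacement), which is consistent with the figures above but is not confirmed here, and it is not the tool's implementation. `merge_cer` is a hypothetical helper.

```python
def merge_cer(text, merges):
    """Fraction of characters changed when each (source, replacement) pair is applied.

    A merge only substitutes characters, so the edit distance between the
    original and the merged text equals the number of changed positions.
    """
    if not text:
        return 0.0
    table = {ord(source): replacement for source, replacement in merges}
    merged = text.translate(table)
    changed = sum(a != b for a, b in zip(text, merged))
    return changed / len(text)


sample = "iiij"  # 'i' is three times as frequent as 'j'
print(merge_cer(sample, [("i", "j")]))  # replaces the frequent symbol: 0.75
print(merge_cer(sample, [("j", "i")]))  # replaces the rare symbol: 0.25
```

Replacing a frequent symbol with a rare one alters many characters, while the opposite direction alters few, so the direction of a merge matters.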
Extract corpus alphabet
ll_extract_corpus_alphabet -h # get help string with the cli interface
ll_extract_corpus_alphabet -corpus_glob './sample_data/wienocist_charter_1/wienocist_charter_1*'
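Conceptually, extracting a corpus alphabet amounts to counting the distinct characters across all matching files. The sketch below shows that idea with the standard library; it makes no claim about the tool's output format or options, assumes UTF-8 text files, and uses the hypothetical helper names `count_characters` and `corpus_alphabet`.

```python
import glob
from collections import Counter


def count_characters(texts):
    """Count how often each character occurs across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts


def corpus_alphabet(corpus_glob, encoding="utf-8"):
    """Return the sorted alphabet of all files matching corpus_glob."""
    texts = []
    for path in sorted(glob.glob(corpus_glob)):
        with open(path, encoding=encoding) as handle:
            texts.append(handle.read())
    return "".join(sorted(count_characters(texts)))


counts = count_characters(["abba", "cab"])
print("".join(sorted(counts)))  # abc
print(counts.most_common(2))
```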
Test corpus on alphabets
ll_test_corpus_on_alphabets -h # get help string with the cli interface
ll_test_corpus_on_alphabets -corpus_glob './sample_data/wienocist_charter_1/wienocist_charter_1*' -alphabets 'mufibmp,ascii,mes1,iso_8859_2' -verbose
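One way to think about testing a corpus on an alphabet is as a coverage measurement: the share of corpus characters that the alphabet can represent. The sketch below approximates this for alphabets that happen to be Python codecs, such as ascii or iso8859_2. Alphabets such as mufibmp or mes1 are not Python codecs and are outside its scope. The helper name `coverage` is hypothetical, and the tool's actual metric may differ.

```python
def coverage(text, encoding):
    """Fraction of characters in text that the given Python codec can encode."""
    if not text:
        return 1.0
    encodable = 0
    for ch in text:
        try:
            ch.encode(encoding)
            encodable += 1
        except UnicodeEncodeError:
            pass  # character is outside the alphabet
    return encodable / len(text)


sample = "naïve"
for name in ["ascii", "latin-1"]:
    print(f"{name}: {coverage(sample, name):.2f}")
```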
Development
Development Installation
To extend pylelemmatize, install it from GitHub.
git clone git@github.com:anguelos/pylelemmatize.git
cd pylelemmatize
pip install -r requirements
pip install -r ./docs/requirements.txt
pip install -e .
This will install pylelemmatize on your system in development mode.
Testing
Running the unit tests
pytest --cov ./src/pylelemmatize/ ./test/pytest/
Running shell script tests
This runs all bash scripts with -h, essentially checking syntax and imports.
./test/test_shell_scripts.sh