Olchiki Unicode Normalization Toolkit
Project description
olunicodenormalizer
Word-level Unicode normalization for Ol Chiki (ᱚᱞ-ᱪᱦᱤᱠᱤ) text.
install
pip install olunicodenormalizer
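To confirm that the install succeeded, the installed version can be read back with the Python standard library. This is a minimal sketch: the helper name `installed_version` is hypothetical and is not part of olunicodenormalizer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string if the package is installed, otherwise None.
    print("olunicodenormalizer:", installed_version("olunicodenormalizer"))
```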
usage
initialization and cleaning
# import
from olunicodenormalizer import Normalizer
from pprint import pprint
# initialize the normalizer
bnorm = Normalizer()
# normalize a single word; the call returns a dict
word = 'ᱡᱚᱦᱟᱨ'
result = bnorm(word)
print(f"Non-norm:{word}; Norm:{result['normalized']}")
print("--------------------------------------------------")
pprint(result)
output
Non-norm:ᱡᱚᱦᱟᱨ; Norm:ᱡᱚᱦᱟᱨ
--------------------------------------------------
{'given': 'ᱡᱚᱦᱟᱨ', 'normalized': 'ᱡᱚᱦᱟᱨ', 'ops': []}
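As the output above shows, the result is a dict with the keys `given`, `normalized` and `ops`. The following sketch shows one way to summarize such a result. The helper name `summarize` is hypothetical, and the sample dict is copied from the output above so the snippet runs without the package installed. The structure of individual `ops` entries is not documented here, so the helper only counts them.

```python
def summarize(result):
    """Describe a result dict of the shape {'given', 'normalized', 'ops'}."""
    given = result["given"]
    normalized = result["normalized"]
    if normalized is None:
        return f"{given!r}: rejected (normalized is None)"
    if normalized == given:
        return f"{given!r}: unchanged"
    return f"{given!r} -> {normalized!r} after {len(result['ops'])} operation(s)"


# Sample copied from the output above; no package import is needed to run this.
sample = {"given": "ᱡᱚᱦᱟᱨ", "normalized": "ᱡᱚᱦᱟᱨ", "ops": []}
print(summarize(sample))
```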
# initialize without english (default)
norm = Normalizer()
print("without english:", norm("ASD123")["normalized"])
# --> 'normalized' is None
# initialize with english allowed
norm = Normalizer(allow_english=True)
print("with english:", norm("ASD123")["normalized"])
output
without english: None
with english: ASD123
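Because the normalizer is called on one word at a time and can return `None` under `normalized`, longer text has to be processed word by word and rejected words handled explicitly. The sketch below shows one possible approach. The names `normalize_text` and `fake_normalizer` are hypothetical, splitting on whitespace is an assumption, and the stub only mimics the dict shape and the `None` behavior shown above, not the library's actual rules.

```python
def normalize_text(text, normalizer):
    """Normalize whitespace-separated words, dropping words the normalizer rejects."""
    kept = []
    for word in text.split():
        normalized = normalizer(word)["normalized"]
        if normalized is not None:
            kept.append(normalized)
    return " ".join(kept)


def fake_normalizer(word):
    """Hypothetical stand-in: rejects pure-ASCII words, passes others through."""
    normalized = None if word.isascii() else word
    return {"given": word, "normalized": normalized, "ops": []}


print(normalize_text("ᱡᱚᱦᱟᱨ ASD123 ᱡᱚᱦᱟᱨ", fake_normalizer))
```

With the package installed, an instance such as `Normalizer()` can be passed in place of the stub.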
Change Log
0.0.5 (9/03/2022)
- added details for execution map
- checkop typo correction
0.0.6 (9/03/2022)
- broken diacritics op addition
0.0.7 (11/03/2022)
- assamese replacement
- word op and unicode op mapping
- modifier list modification
- doc string for call and initialization
- verbosity removal
- typo correction for operation
- unit test updates
- 'এ' replacement correction
- NonGylphUnicodes
- Legacy symbols option
- legacy mapper added
- added bn:bd declaration
0.0.8 (14/03/2022)
- MultipleConsonantDiacritics handling change
- to+hosonto correction
- invalid hosonto correction
0.0.9 (15/04/2022)
- base normalizer
- language class
- olchiki extension
- complex root normalization
0.0.10 (15/04/2022)
- added conjuncts
- exception for english words
0.0.11 (15/04/2022)
- fixed no space char issue for olchiki
0.0.12 (26/04/2022)
- fixed consonants orders
0.0.13 (26/04/2022)
- fixed non char followed by diacritics
0.0.14 (01/05/2022)
- word based normalization
- encoding fix
0.0.15 (02/05/2022)
- import correction
0.0.16 (02/05/2022)
- local variable issue
0.0.17 (17/05/2022)
- nukta mod break
0.0.18 (08/06/2022)
- no space chars fix
0.0.19 (15/06/2022)
- no space chars further fix
- base_olchiki_compose to avoid false op flags
- added foreign conjuncts
0.0.20 (01/08/2022)
- এ্যা replacement correction
0.0.21 (01/08/2022)
- "য","ব" + hosonto combination correction
- added 'ব্ল্য' in conjuncts
0.0.22 (22/08/2022)
- \u200d combination limiting
0.0.23 (23/08/2022)
- \u200d condition change
0.0.24 (26/08/2022)
- \u200d error handling
0.0.25 (10/09/22)
- removed unnecessary operations: fixRefOrder,fixOrdersForCC
- added conjuncts: 'র্ন্ত','ঠ্য','ভ্ল'
0.1.0 (20/10/22)
- added indic parser
- fixed language class
0.1.1 (21/10/22)
- added nukta and diacritic maps for indics
- cleaned conjuncts for now
- fixed issues with no-space and connector
0.1.2 (10/12/22)
- allow halant ending for indic language except olchiki
0.1.3 (10/12/22)
- broken char break cases for halant
0.1.4 (01/01/23)
- added sylhetinagri
0.1.5 (01/01/23)
- cleaned panjabi double quotes in diac map
1.0.0 (26/08/23)
- added olchiki punctuations
Source Distribution
olunicodenormalizer-1.0.0.tar.gz (19.9 kB)
Built Distribution
olunicodenormalizer-1.0.0-py3-none-any.whl (18.1 kB)
File details
Details for the file olunicodenormalizer-1.0.0.tar.gz.
File metadata
- Download URL: olunicodenormalizer-1.0.0.tar.gz
- Upload date:
- Size: 19.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | de9eb75611d7315f58af3401d7c12f1b9387e9cb699d5a4ebeab90b11b1bd8ab
MD5 | 4b06157b41c7c12385554e9c5ca5e89b
BLAKE2b-256 | 9c782214a475860a52c6f50e492dc7ed6ddfe993f6a31766b339399e961fa2d5
File details
Details for the file olunicodenormalizer-1.0.0-py3-none-any.whl.
File metadata
- Download URL: olunicodenormalizer-1.0.0-py3-none-any.whl
- Upload date:
- Size: 18.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9fbf46c8cd3d1d4f8beaf78aff0a0c5eea1732174f4c0d8c9d891384da0c0126
MD5 | 91b5a26f5175b996a9ecd29a90357193
BLAKE2b-256 | 93f0464ee6d8c35dc4b7ab1928c999b9f918fc233fda44e6c6d85a92c6f548c1