Olchiki Unicode Normalization Toolkit

Project description

olunicodenormalizer

Unicode normalization for words written in the Ol Chiki (ᱚᱞ-ᱪᱦᱤᱠᱤ) script

install

pip install olunicodenormalizer

usage

initialization and cleaning

# import
from olunicodenormalizer import Normalizer 
from pprint import pprint
# initialize
bnorm=Normalizer()
# normalize
word = 'ᱡᱚᱦᱟᱨ'
result=bnorm(word)
print(f"Non-norm:{word}; Norm:{result['normalized']}")
print("--------------------------------------------------")
pprint(result)

output

Non-norm:ᱡᱚᱦᱟᱨ; Norm:ᱡᱚᱦᱟᱨ
--------------------------------------------------
{'given': 'ᱡᱚᱦᱟᱨ', 'normalized': 'ᱡᱚᱦᱟᱨ', 'ops': []}
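
The result shown above is a dictionary with the keys 'given', 'normalized' and 'ops'. As a minimal sketch, a small helper can summarize such a result. The name summarize_result is hypothetical and not part of the package; the only assumption is the dictionary shape printed above, plus the fact (shown in the English example) that 'normalized' can be None.

```python
def summarize_result(result):
    """Summarize a result dict that has 'given', 'normalized' and 'ops' keys."""
    given = result["given"]
    normalized = result["normalized"]
    ops = result.get("ops") or []
    return {
        # the normalizer returned None for this word
        "rejected": normalized is None,
        # the normalized form differs from the input
        "changed": normalized is not None and normalized != given,
        # number of recorded operations
        "op_count": len(ops),
    }

# This dictionary is copied from the output above.
summary = summarize_result({'given': 'ᱡᱚᱦᱟᱨ', 'normalized': 'ᱡᱚᱦᱟᱨ', 'ops': []})
print(summary)
```

For the word above, nothing was changed and no operations were recorded, so every field reports the "no change" case.
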
handling english text

# initialize without English support (default):
# English input is normalized to None
norm=Normalizer()
print("without english:",norm("ASD123")["normalized"])
# initialize with English support: English input is kept
norm=Normalizer(allow_english=True)
print("with english:",norm("ASD123")["normalized"])

output

without english: None
with english: ASD123
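
The normalizer works on single words and, as the output above shows, may return None for a word it does not accept. A minimal sketch of applying it to a whole line of text follows. The helper normalize_text is hypothetical and not part of the package; it assumes only that the normalizer is callable on a word and returns a dictionary with a 'normalized' key, and that splitting on whitespace is an acceptable way to find words.

```python
def normalize_text(text, normalizer):
    """Normalize whitespace-separated words, dropping words that come back as None."""
    kept = []
    for word in text.split():
        normalized = normalizer(word)["normalized"]
        if normalized is not None:
            kept.append(normalized)
    return " ".join(kept)

# Intended usage with the package (left commented out so the sketch runs without it):
# from olunicodenormalizer import Normalizer
# print(normalize_text('ᱡᱚᱦᱟᱨ ASD123', Normalizer()))
```

Dropping rejected words is a design choice of this sketch; a caller that needs to keep track of them could collect them in a second list instead.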

Change Log

0.0.5 (9/03/2022)

  • added details for execution map
  • checkop typo correction

0.0.6 (9/03/2022)

  • broken diacritics op addition

0.0.7 (11/03/2022)

  • Assamese replacement
  • word op and unicode op mapping
  • modifier list modification
  • doc string for call and initialization
  • verbosity removal
  • typo correction for operation
  • unit test updates
  • 'এ' replacement correction
  • NonGylphUnicodes
  • Legacy symbols option
  • legacy mapper added
  • added bn:bd declaration

0.0.8 (14/03/2022)

  • MultipleConsonantDiacritics handling change
  • to+hosonto correction
  • invalid hosonto correction

0.0.9 (15/04/2022)

  • base normalizer
  • language class
  • olchiki extension
  • complex root normalization

0.0.10 (15/04/2022)

  • added conjuncts
  • exception for english words

0.0.11 (15/04/2022)

  • fixed no space char issue for olchiki

0.0.12 (26/04/2022)

  • fixed consonants orders

0.0.13 (26/04/2022)

  • fixed non char followed by diacritics

0.0.14 (01/05/2022)

  • word based normalization
  • encoding fix

0.0.15 (02/05/2022)

  • import correction

0.0.16 (02/05/2022)

  • local variable issue

0.0.17 (17/05/2022)

  • nukta mod break

0.0.18 (08/06/2022)

  • no space chars fix

0.0.19 (15/06/2022)

  • no space chars further fix
  • base_olchiki_compose to avoid false op flags
  • added foreign conjuncts

0.0.20 (01/08/2022)

  • এ্যা replacement correction

0.0.21 (01/08/2022)

  • "য","ব" + hosonto combination correction
  • added 'ব্ল্য' in conjuncts

0.0.22 (22/08/2022)

  • \u200d combination limiting

0.0.23 (23/08/2022)

  • \u200d condition change

0.0.24 (26/08/2022)

  • \u200d error handling

0.0.25 (10/09/22)

  • removed unnecessary operations: fixRefOrder,fixOrdersForCC
  • added conjuncts: 'র্ন্ত','ঠ্য','ভ্ল'

0.1.0 (20/10/22)

  • added indic parser
  • fixed language class

0.1.1 (21/10/22)

  • added nukta and diacritic maps for indics
  • cleaned conjuncts for now
  • fixed issues with no-space and connector

0.1.2 (10/12/22)

  • allow halant ending for indic language except olchiki

0.1.3 (10/12/22)

  • broken char break cases for halant

0.1.4 (01/01/23)

  • added sylhetinagri

0.1.5 (01/01/23)

  • cleaned panjabi double quotes in diac map

0.0.1 (26/08/23)

  • added olchiki punctuations

Download files

Source Distribution

olunicodenormalizer-1.0.0.tar.gz (19.9 kB)

Built Distribution

olunicodenormalizer-1.0.0-py3-none-any.whl (18.1 kB, Python 3)