Bangla Unicode Normalization Toolkit
bnUnicodeNormalizer
Bangla Unicode Normalization
install
pip install bnunicodenormalizer
usage
# import
from bnunicodenormalizer import Normalizer
# initialize
bnorm=Normalizer()
# normalize
word='াআমাকো'
print(f"Non-norm:{word}; Norm:{bnorm(word)}")
Non-norm:াআমাকো; Norm:আমাকো
Cases
In all examples (a) is the non-normalized form and (b) is the normalized form
- Broken vowel and consonant diacritics
# Example-1:
(a)'আরো'==(b)'আরো' -> False
(a) breaks as:['আ', 'র', 'ে', 'া']
(b) breaks as:['আ', 'র', 'ো']
# Example-2:
(a)পৌঁছে==(b)পৌঁছে -> False
(a) breaks as:['প', 'ে', 'ৗ', 'ঁ', 'ছ', 'ে']
(b) breaks as:['প', 'ৌ', 'ঁ', 'ছ', 'ে']
# Example-3:
(a)সংস্কৄতি==(b)সংস্কৃতি -> False
(a) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৄ', 'ত', 'ি']
(b) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৃ', 'ত', 'ি']
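The character lists in these examples are the individual code points of each string. The following is a minimal sketch that uses only Python's standard library, not the toolkit's own code, to produce such a list and to show why the two forms compare unequal. For the two-part vowel signs in Example-1 and Example-2, standard NFC composition happens to give form (b); Example-3 is a character substitution that NFC does not perform. The helper name `breakdown` is made up for this illustration.

```python
import unicodedata

def breakdown(word):
    """Return (character, code point, Unicode name) for each code point."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in word]

# Example-1, written with escapes so the code points are unambiguous.
broken = "\u0986\u09b0\u09c7\u09be"  # আ র ে া  (E sign followed by AA sign)
fixed = "\u0986\u09b0\u09cb"         # আ র ো    (single O sign)

print(broken == fixed)  # False: same rendering, different code points
for row in breakdown(broken):
    print(row)

# NFC composes the two-part vowel sign into the single precomposed sign.
print(unicodedata.normalize("NFC", broken) == fixed)
```

NFC is used here only to illustrate the two-part vowel case; it also rewrites other characters, so it is not a drop-in replacement for the normalizer.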
- Broken nukta unicode
# Example-1:
(a)কেন্দ্রীয়==(b)কেন্দ্রীয় -> False
(a) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য', '়']
(b) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য়']
# Example-2:
(a)রযে়ছে==(b)রয়েছে -> False
(a) breaks as:['র', 'য', 'ে', '়', 'ছ', 'ে']
(b) breaks as:['র', 'য়', 'ে', 'ছ', 'ে']
# Example-3:
(a)জ়ন্য==(b)জন্য -> False
(a) breaks as:['জ', '়', 'ন', '্', 'য']
(b) breaks as:['জ', 'ন', '্', 'য']
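The three nukta examples can be reproduced with a small rule, sketched below under assumptions: a nukta is merged into the nearest preceding base letter if that letter has a precomposed nukta form (য, ড, ঢ), looking back past any vowel signs, and is dropped otherwise. The names `fix_nukta`, `NUKTA_MAP` and `VOWEL_SIGNS` are hypothetical, and this is not necessarily how the toolkit implements the step.

```python
NUKTA = "\u09bc"
# Base letters that have a precomposed nukta form: য→য়, ড→ড়, ঢ→ঢ়
NUKTA_MAP = {"\u09af": "\u09df", "\u09a1": "\u09dc", "\u09a2": "\u09dd"}
VOWEL_SIGNS = set("\u09be\u09bf\u09c0\u09c1\u09c2\u09c3\u09c4\u09c7\u09c8\u09cb\u09cc")

def fix_nukta(word):
    """Merge a nukta into its base letter, or drop it if no base accepts it."""
    out = []
    for ch in word:
        if ch != NUKTA:
            out.append(ch)
            continue
        # Look back past vowel signs to find the letter the nukta belongs to.
        i = len(out) - 1
        while i >= 0 and out[i] in VOWEL_SIGNS:
            i -= 1
        if i >= 0 and out[i] in NUKTA_MAP:
            out[i] = NUKTA_MAP[out[i]]
        # Otherwise the nukta is treated as invalid and skipped (assumption).
    return "".join(out)

# Example-2: র য ে ় ছ ে  ->  র য় ে ছ ে
print(fix_nukta("\u09b0\u09af\u09c7\u09bc\u099b\u09c7") == "\u09b0\u09df\u09c7\u099b\u09c7")
```

Unicode NFC goes the other way for these three letters: it decomposes the precomposed forms into base letter plus nukta, so `unicodedata.normalize` cannot be used for this step.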
- Invalid hosontos that come before or after vowels and modifiers
# Example-1:
(a)দুই্টি==(b)দুইটি-->False
(a) breaks as ['দ', 'ু', 'ই', '্', 'ট', 'ি']
(b) breaks as ['দ', 'ু', 'ই', 'ট', 'ি']
# Example-2:
(a)এ্তে==(b)এতে-->False
(a) breaks as ['এ', '্', 'ত', 'ে']
(b) breaks as ['এ', 'ত', 'ে']
# Example-3:
(a)নেট্ওয়ার্ক==(b)নেটওয়ার্ক-->False
(a) breaks as ['ন', 'ে', 'ট', '্', 'ও', 'য়', 'া', 'র', '্', 'ক']
(b) breaks as ['ন', 'ে', 'ট', 'ও', 'য়', 'া', 'র', '্', 'ক']
# Example-4:
(a)এস্আই==(b)এসআই-->False
(a) breaks as ['এ', 'স', '্', 'আ', 'ই']
(b) breaks as ['এ', 'স', 'আ', 'ই']
- Invalid hosonto after a vowel diacritic
# Example-1:
(a)'চু্ক্তি'==(b)'চুক্তি' -> False
(a) breaks as:['চ', 'ু', '্', 'ক', '্', 'ত', 'ি']
(b) breaks as:['চ', 'ু','ক', '্', 'ত', 'ি']
# Example-2:
(a)'যু্ক্ত'==(b)'যুক্ত' -> False
(a) breaks as:['য', 'ু', '্', 'ক', '্', 'ত']
(b) breaks as:['য', 'ু', 'ক', '্', 'ত']
# Example-3:
(a)'কিছু্ই'==(b)'কিছুই' -> False
(a) breaks as:['ক', 'ি', 'ছ', 'ু', '্', 'ই']
(b) breaks as:['ক', 'ি', 'ছ', 'ু','ই']
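The hosonto examples in this case and the preceding one follow a common pattern: a hosonto is kept only when it joins two consonants. A minimal sketch of that idea, assuming a hosonto is invalid whenever the character before it is not a consonant or the character after it is an independent vowel, could look like this. The function names are hypothetical and the toolkit's actual rules may be broader.

```python
HOSONTO = "\u09cd"

def is_consonant(ch):
    # Bangla consonants: U+0995..U+09B9 plus the precomposed nukta letters.
    return "\u0995" <= ch <= "\u09b9" or ch in ("\u09dc", "\u09dd", "\u09df")

def is_independent_vowel(ch):
    # Independent vowels অ..ঔ occupy U+0985..U+0994.
    return "\u0985" <= ch <= "\u0994"

def drop_invalid_hosonto(word):
    """Remove a hosonto that does not sit between two consonants (assumed rule)."""
    out = []
    for i, ch in enumerate(word):
        if ch == HOSONTO:
            prev_is_consonant = bool(out) and is_consonant(out[-1])
            nxt = word[i + 1] if i + 1 < len(word) else ""
            if not prev_is_consonant or is_independent_vowel(nxt):
                continue  # skip the invalid hosonto
        out.append(ch)
    return "".join(out)

# চ ু ্ ক ্ ত ি  ->  চ ু ক ্ ত ি  (the hosonto in ক্ত is kept)
print(drop_invalid_hosonto("\u099a\u09c1\u09cd\u0995\u09cd\u09a4\u09bf") == "\u099a\u09c1\u0995\u09cd\u09a4\u09bf")
```

The sketch deliberately leaves a word-final hosonto after a consonant alone, since the examples above do not cover that situation.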
- 'ত' + hosonto written in place of 'ৎ'
# Example-1:
(a)বুত্পত্তি==(b)বুৎপত্তি-->False
(a) breaks as ['ব', 'ু', 'ত', '্', 'প', 'ত', '্', 'ত', 'ি']
(b) breaks as ['ব', 'ু', 'ৎ', 'প', 'ত', '্', 'ত', 'ি']
# Example-2:
(a)উত্স==(b)উৎস-->False
(a) breaks as ['উ', 'ত', '্', 'স']
(b) breaks as ['উ', 'ৎ', 'স']
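In both examples the sequence ত + hosonto becomes ৎ only where it does not form a regular conjunct; the ত্ত in Example-1 is left alone. A minimal sketch, assuming the set of consonants that may legitimately follow ত + hosonto is ত, থ, ন, ব, ম, য, র (an assumption based on the common ত-conjuncts, not taken from the toolkit), might be:

```python
TA = "\u09a4"          # ত
KHANDA_TA = "\u09ce"   # ৎ
HOSONTO = "\u09cd"
# Assumed list of consonants that form a valid conjunct after ত + hosonto.
ALLOWED_AFTER_TA = set("\u09a4\u09a5\u09a8\u09ac\u09ae\u09af\u09b0")

def fix_ta_hosonto(word):
    """Replace ত + hosonto with ৎ when the next letter is not an allowed conjunct."""
    out = []
    i = 0
    while i < len(word):
        if (
            word[i] == TA
            and i + 2 < len(word)
            and word[i + 1] == HOSONTO
            and word[i + 2] not in ALLOWED_AFTER_TA
        ):
            out.append(KHANDA_TA)
            i += 2  # consume ত and the hosonto
        else:
            out.append(word[i])
            i += 1
    return "".join(out)

# উ ত ্ স  ->  উ ৎ স
print(fix_ta_hosonto("\u0989\u09a4\u09cd\u09b8") == "\u0989\u09ce\u09b8")
```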
- Unwanted consecutive double diacritics
# Example-1:
(a)'যুুদ্ধ'==(b)'যুদ্ধ' -> False
(a) breaks as:['য', 'ু', 'ু', 'দ', '্', 'ধ']
(b) breaks as:['য', 'ু', 'দ', '্', 'ধ']
# Example-2:
(a)'দুুই'==(b)'দুই' -> False
(a) breaks as:['দ', 'ু', 'ু', 'ই']
(b) breaks as:['দ', 'ু', 'ই']
# Example-3:
(a)'প্রকৃৃতির'==(b)'প্রকৃতির' -> False
(a) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ৃ', 'ত', 'ি', 'র']
(b) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ত', 'ি', 'র']
# Example-4:
(a)আমাকোা==(b)'আমাকো'-> False
(a) breaks as:['আ', 'ম', 'া', 'ক', 'ে', 'া', 'া']
(b) breaks as:['আ', 'ম', 'া', 'ক', 'ো']
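These four examples can be reproduced by collapsing runs of the same vowel sign and then joining the two-part vowel signs. The sketch below is one possible way to do that with plain string operations; the name `collapse_repeated_signs` and the exact set of signs are assumptions for illustration, not the toolkit's implementation.

```python
# Bangla dependent vowel signs, including the AU length mark U+09D7.
VOWEL_SIGNS = set(
    "\u09be\u09bf\u09c0\u09c1\u09c2\u09c3\u09c4\u09c7\u09c8\u09cb\u09cc\u09d7"
)

def collapse_repeated_signs(word):
    """Drop a vowel sign that repeats the one just before it, then compose ো / ৌ."""
    out = []
    for ch in word:
        if ch in VOWEL_SIGNS and out and out[-1] == ch:
            continue  # same sign twice in a row: keep only the first
        out.append(ch)
    text = "".join(out)
    # Join the two-part forms: ে + া -> ো and ে + ৗ -> ৌ
    return text.replace("\u09c7\u09be", "\u09cb").replace("\u09c7\u09d7", "\u09cc")

# Example-4: আ ম া ক ে া া  ->  আ ম া ক ো
print(collapse_repeated_signs("\u0986\u09ae\u09be\u0995\u09c7\u09be\u09be") == "\u0986\u09ae\u09be\u0995\u09cb")
```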
- Vowels followed by vowel diacritics
# Example-1:
(a)উুলু==(b)উলু-->False
(a) breaks as ['উ', 'ু', 'ল', 'ু']
(b) breaks as ['উ', 'ল', 'ু']
# Example-2:
(a)আর্কিওোলজি==(b)আর্কিওলজি-->False
(a) breaks as ['আ', 'র', '্', 'ক', 'ি', 'ও', 'ো', 'ল', 'জ', 'ি']
(b) breaks as ['আ', 'র', '্', 'ক', 'ি', 'ও', 'ল', 'জ', 'ি']
Also normalizes 'এ' written in place of 'ত্র'
# Example-1:
(a)একএে==(b)একত্রে-->False
(a) breaks as ['এ', 'ক', 'এ', 'ে']
(b) breaks as ['এ', 'ক', 'ত', '্', 'র', 'ে']
# Example-2:
(a)একএ==(b)একত্র-->False
(a) breaks as ['এ', 'ক', 'এ']
(b) breaks as ['এ', 'ক', 'ত', '্', 'র']
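The plain case in this section, a vowel sign directly after an independent vowel, can be sketched as below. This is an illustration under assumptions, with hypothetical names, and it does not attempt the 'এ' for 'ত্র' replacement: applied to the 'এ' examples it would simply drop the vowel sign rather than substitute 'ত্র'.

```python
VOWEL_SIGNS = set(
    "\u09be\u09bf\u09c0\u09c1\u09c2\u09c3\u09c4\u09c7\u09c8\u09cb\u09cc\u09d7"
)

def is_independent_vowel(ch):
    # Independent vowels অ..ঔ occupy U+0985..U+0994.
    return "\u0985" <= ch <= "\u0994"

def drop_sign_after_vowel(word):
    """Remove a vowel sign that directly follows an independent vowel."""
    out = []
    for ch in word:
        if ch in VOWEL_SIGNS and out and is_independent_vowel(out[-1]):
            continue  # an independent vowel cannot carry a vowel sign
        out.append(ch)
    return "".join(out)

# উ ু ল ু  ->  উ ল ু
print(drop_sign_after_vowel("\u0989\u09c1\u09b2\u09c1") == "\u0989\u09b2\u09c1")
```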
- Repeated consonant diacritics (folas)
# Example-1:
(a)গ্র্রামকে==(b)গ্রামকে-->False
(a) breaks as ['গ', '্', 'র', '্', 'র', 'া', 'ম', 'ক', 'ে']
(b) breaks as ['গ', '্', 'র', 'া', 'ম', 'ক', 'ে']
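A repeated fola shows up as the same hosonto + consonant pair twice in a row. One way to collapse it is a regular expression with a back-reference, sketched here. The set of consonants treated as folas (ন, ব, ম, য, র, ল) and the names used are assumptions for this illustration.

```python
import re

# hosonto followed by a fola consonant, repeated one or more times
REPEATED_FOLA = re.compile(r"(\u09cd[\u09a8\u09ac\u09ae\u09af\u09b0\u09b2])\1+")

def collapse_repeated_folas(word):
    """Keep a single copy of a hosonto + consonant pair that repeats back to back."""
    return REPEATED_FOLA.sub(r"\1", word)

# গ ্ র ্ র া ম ক ে  ->  গ ্ র া ম ক ে
print(collapse_repeated_folas("\u0997\u09cd\u09b0\u09cd\u09b0\u09be\u09ae\u0995\u09c7") == "\u0997\u09cd\u09b0\u09be\u09ae\u0995\u09c7")
```

Doubled consonants such as ত্ত are not affected, because they contain only one hosonto.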
- Removes invalid starts and ends
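The usage example at the top, where a leading 'া' is removed, is an instance of this case. A rough sketch follows, assuming that a word may not start with a vowel sign, modifier, nukta or hosonto, and may not end with a hosonto. Both sets and the function name are guesses made for illustration; the toolkit may define them differently.

```python
# Assumed characters that cannot begin a word: vowel signs, ঁ ং ঃ, nukta, hosonto.
INVALID_STARTS = set(
    "\u09be\u09bf\u09c0\u09c1\u09c2\u09c3\u09c4\u09c7\u09c8\u09cb\u09cc\u09d7"
    "\u0981\u0982\u0983\u09bc\u09cd"
)
# Assumed characters that cannot end a word: only the hosonto.
INVALID_ENDS = {"\u09cd"}

def strip_invalid_edges(word):
    """Trim characters that cannot start or end a word (assumed sets)."""
    start, end = 0, len(word)
    while start < end and word[start] in INVALID_STARTS:
        start += 1
    while end > start and word[end - 1] in INVALID_ENDS:
        end -= 1
    return word[start:end]

# া আ ম া ক ো  ->  আ ম া ক ো
print(strip_invalid_edges("\u09be\u0986\u09ae\u09be\u0995\u09cb") == "\u0986\u09ae\u09be\u0995\u09cb")
```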
Change Log
0.0.1 (15/02/2022)
- First Release