Grapheme Parser for Indic languages
indicparser
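indicparser splits text in Indic languages into graphemes, or into their components (grapheme roots, consonant diacritics and vowel diacritics).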
Installation
pip install indicparser
Usage
- initializing the parser
from indicparser import graphemeParser
gp=graphemeParser("bangla")
- extracting graphemes
text="  শাটিকাপ   মার"
graphemes=gp.process(text)
print("Graphemes:",graphemes)
Graphemes: [' ', ' ', 'শা', 'টি', 'কা', 'প', ' ', ' ', ' ', 'মা', 'র']
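To make the idea of a grapheme concrete, here is a minimal sketch that groups Bangla code points into clusters using only the standard library. It is not how indicparser is implemented; `naive_graphemes` is a hypothetical helper, and the rule it applies (attach combining marks to the previous cluster, and join a letter that follows a hasanta) is a simplifying assumption.

```python
import unicodedata

VIRAMA = "\u09cd"  # Bangla hasanta (assumption: Bangla text only)

def naive_graphemes(text):
    """Group code points into rough grapheme clusters (illustrative only)."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")  # vowel signs, hasanta, modifiers
        # A letter right after a hasanta continues a conjunct.
        joins_conjunct = (
            bool(clusters) and clusters[-1].endswith(VIRAMA) and category == "Lo"
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(naive_graphemes("শাটিকাপ মার"))
# ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
```

Every space stays a separate item here, which mirrors the default output shown above.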
- extracting grapheme root, consonant diacritics and vowel diacritics
comps=gp.process(text,return_graphemes=False)
print("Components:",comps)
Components: [' ', ' ', 'শ', 'া', 'ট', 'ি', 'ক', 'া', 'প', ' ', ' ', ' ', 'ম', 'া', 'র']
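The component view can be approximated by peeling trailing marks off each grapheme. The sketch below is an assumption-laden illustration, not the library's logic: `split_components` is a hypothetical helper, it keeps conjuncts together as the root, and it does not separate consonant diacritics the way indicparser may.

```python
import unicodedata

VIRAMA = "\u09cd"  # Bangla hasanta (assumption: Bangla text only)

def split_components(grapheme):
    """Split one grapheme into [root, mark, ...] (illustrative only)."""
    root = grapheme
    marks = []
    # Peel trailing combining marks, but leave a hasanta inside the root.
    while (
        len(root) > 1
        and unicodedata.category(root[-1]) in ("Mn", "Mc")
        and root[-1] != VIRAMA
    ):
        marks.insert(0, root[-1])
        root = root[:-1]
    return [root] + marks

graphemes = ["শা", "টি", "কা", "প"]
components = [part for g in graphemes for part in split_components(g)]
print(components)
# ['শ', 'া', 'ট', 'ি', 'ক', 'া', 'প']
```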
- extracting graphemes while merging consecutive spaces and stripping leading and trailing spaces
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes (space corrected):",graphemes)
Graphemes (space corrected): ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
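The effect of `merge_spaces=True` on a grapheme list can be reproduced with a few lines of plain Python. This is a sketch of the behaviour visible in the outputs above, with a hypothetical helper name; the library's internal handling may differ.

```python
def merge_space_items(graphemes):
    """Collapse runs of ' ' into one and drop leading/trailing spaces."""
    merged = []
    for g in graphemes:
        # Skip a space at the start or directly after another space.
        if g == " " and (not merged or merged[-1] == " "):
            continue
        merged.append(g)
    # At most one trailing space can remain at this point.
    if merged and merged[-1] == " ":
        merged.pop()
    return merged

raw = [" ", " ", "শা", "টি", "কা", "প", " ", " ", " ", "মা", "র"]
print(merge_space_items(raw))
# ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
```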
- numbers, punctuation and English text are also handled by default
text="এটাকি 2441139 ? না ভাই wrong number"
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes:",graphemes)
Graphemes: ['এ', 'টা', 'কি', ' ', '2', '4', '4', '1', '1', '3', '9', ' ', '?', ' ', 'না', ' ', 'ভা', 'ই', ' ', 'w', 'r', 'o', 'n', 'g', ' ', 'n', 'u', 'm', 'b', 'e', 'r']
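Because mixed output like this contains Bangla graphemes, digits, punctuation and Latin letters side by side, it can help to label each item before further processing. The following sketch uses only the standard library; `classify` and its label names are hypothetical and not part of indicparser, and it assumes non-empty Bangla or Latin input.

```python
import unicodedata

def classify(grapheme):
    """Label a non-empty grapheme by its first code point (illustrative only)."""
    first = grapheme[0]
    if first.isspace():
        return "space"
    if first.isdigit():
        return "digit"
    if unicodedata.category(first).startswith("P"):
        return "punctuation"
    name = unicodedata.name(first, "")
    if "BENGALI" in name:
        return "bangla"
    if "LATIN" in name:
        return "latin"
    return "other"

for item in ["এ", "টা", " ", "2", "?", "w"]:
    print(repr(item), classify(item))
```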
- available languages
from indicparser import languages
languages.keys()
dict_keys(['bangla', 'malyalam', 'tamil', 'gujrati', 'panjabi', 'odiya', 'hindi'])
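Several of these keys use spellings that differ from the more common ones ('malyalam', 'gujrati', 'panjabi', 'odiya'), so a small lookup step can prevent a failed initialization. The sketch below hard-codes the keys printed above and uses `difflib` from the standard library; `resolve_language` is a hypothetical helper, not part of indicparser.

```python
import difflib

# Keys copied from the languages.keys() output; this list is an assumption
# and may differ in other versions of the package.
AVAILABLE = ["bangla", "malyalam", "tamil", "gujrati", "panjabi", "odiya", "hindi"]

def resolve_language(name, available=AVAILABLE):
    """Return the closest available key, or raise ValueError if none is close."""
    key = name.strip().lower()
    if key in available:
        return key
    matches = difflib.get_close_matches(key, available, n=1, cutoff=0.6)
    if not matches:
        raise ValueError(f"unsupported language: {name!r}")
    return matches[0]

print(resolve_language("malayalam"))  # malyalam
print(resolve_language("Punjabi"))    # panjabi
```

The returned key could then be passed to `graphemeParser(...)`.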
- malformed text detection examples
gp.process("পাশ্র্বের")
Malformed text-পাশ্র্বের possible text:পার্শ্বের
gp=graphemeParser("panjabi")
gp.process("ਕੋਲਡਡਿੰ੍ਰਕਸ")
Malformed text-ਕੋਲਡਡਿੰ੍ਰਕਸ possible text:ਕੋਲਡਡਿ੍ਰੰਕਸ
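Malformed and corrected strings often look nearly identical on screen, so listing their code points makes the difference visible. The sketch below only inspects the two Bangla strings from the example; it does not reproduce the library's detection rule, and both helper names are hypothetical.

```python
import unicodedata

def describe(text):
    """List each code point with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]

def first_difference(a, b):
    """Return (index, name_in_a, name_in_b) at the first mismatch, else None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, unicodedata.name(x, "UNKNOWN"), unicodedata.name(y, "UNKNOWN")
    return None

malformed = "পাশ্র্বের"
suggested = "পার্শ্বের"

print(first_difference(malformed, suggested))
# Same code points in a different order means the text was only reordered.
print(sorted(malformed) == sorted(suggested))
```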
ABOUT
- Authors: Bengali.AI
- Cite the Bengali.AI multipurpose grapheme dataset paper:
@inproceedings{alam2021large,
title={A large multi-target dataset of common bengali handwritten graphemes},
author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
booktitle={International Conference on Document Analysis and Recognition},
pages={383--398},
year={2021},
organization={Springer}
}
Change Log
0.0.1 (12/02/2022)
- First Release
0.0.2 (12/02/2022)
- Basic Documentation
- Modifier removal
- space correction
- text mode parser
0.0.3 (14/02/2022)
- Connector ending
- Exception case for component construction in bangla
- Added test
0.0.4 (14/02/2022)
- pip test stable
- added malformed word detection