Grapheme Parser for Indic languages

indicparser

Installation

pip install indicparser

Usage

  • initializing the parser
from indicparser import graphemeParser
gp=graphemeParser("bangla")
  • extracting graphemes
text="  শাটিকাপ   মার"
graphemes=gp.process(text)
print("Graphemes:",graphemes)

Graphemes: [' ', ' ', 'শা', 'টি', 'কা', 'প', ' ', ' ', ' ', 'মা', 'র']
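To make the idea of a grapheme concrete, here is a minimal standard-library sketch that groups each base character with the combining marks that follow it. It is not how indicparser is implemented: the function name `naive_clusters`, the hard-coded Bengali hasanta, and the Unicode-category rule are all assumptions made for illustration, and the real parser may treat conjuncts and modifiers differently.

```python
import unicodedata

# Assumption: only the Bengali hasanta (virama) is handled here.
VIRAMA = "\u09cd"


def naive_clusters(text):
    """Group each base character with the marks that follow it (illustrative only)."""
    clusters = []
    for ch in text:
        cat = unicodedata.category(ch)
        # Vowel signs and other diacritics are combining marks (Mn / Mc).
        is_mark = cat in ("Mn", "Mc")
        # A letter right after a hasanta continues the same conjunct.
        joins_conjunct = cat == "Lo" and bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or joins_conjunct) and not clusters[-1].isspace():
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(naive_clusters("শাটিকাপ"))  # ['শা', 'টি', 'কা', 'প']
```

For the simple sample text above this reproduces the grouping shown in the output, but it should be read as an explanation of the concept rather than a replacement for `gp.process`.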

  • extracting grapheme roots, consonant diacritics, and vowel diacritics
comps=gp.process(text,return_graphemes=False)
print("Components:",comps)

Components: [' ', ' ', 'শ', 'া', 'ট', 'ি', 'ক', 'া', 'প', ' ', ' ', ' ', 'ম', 'া', 'র']

  • extracting graphemes while merging consecutive spaces and stripping leading and trailing spaces
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes (space corrected):",graphemes)

Graphemes (space corrected): ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
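The effect of merge_spaces=True can be pictured as a post-processing step over the grapheme list. The sketch below is an assumption about the behaviour inferred from the two outputs above, not the package's actual code; the helper name `merge_space_tokens` is hypothetical.

```python
def merge_space_tokens(tokens):
    """Collapse runs of whitespace tokens and drop leading/trailing ones."""
    merged = []
    for tok in tokens:
        if tok.isspace():
            # Keep a single space, and only between two non-space tokens.
            if merged and not merged[-1].isspace():
                merged.append(" ")
        else:
            merged.append(tok)
    # A space appended before the end of input is trailing; remove it.
    if merged and merged[-1].isspace():
        merged.pop()
    return merged


raw = [" ", " ", "শা", "টি", "কা", "প", " ", " ", " ", "মা", "র"]
print(merge_space_tokens(raw))  # ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
```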

  • numbers, punctuation, and English text are also handled by default
text="এটাকি 2441139 ? না ভাই wrong number"
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes:",graphemes)

Graphemes: ['এ', 'টা', 'কি', ' ', '2', '4', '4', '1', '1', '3', '9', ' ', '?', ' ', 'না', ' ', 'ভা', 'ই', ' ', 'w', 'r', 'o', 'n', 'g', ' ', 'n', 'u', 'm', 'b', 'e', 'r']

  • available languages
from indicparser import languages
print(languages.keys())

dict_keys(['bangla', 'malyalam', 'tamil', 'gujrati', 'panjabi', 'odiya', 'hindi'])
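Note that several keys use spellings that differ from the more common English names (for example 'malyalam' and 'gujrati'), so a small lookup can help avoid a failed initialization. The helper below is a hypothetical convenience, not part of indicparser: the `SUPPORTED` tuple is copied from the output above, and the `ALIASES` table and `resolve_language` name are assumptions.

```python
# Keys copied from the languages.keys() output above.
SUPPORTED = ("bangla", "malyalam", "tamil", "gujrati", "panjabi", "odiya", "hindi")

# Hypothetical aliases mapping common spellings to the package's keys.
ALIASES = {
    "bengali": "bangla",
    "malayalam": "malyalam",
    "gujarati": "gujrati",
    "punjabi": "panjabi",
    "odia": "odiya",
    "oriya": "odiya",
}


def resolve_language(name):
    """Return the key expected by the parser, or raise ValueError."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    if key not in SUPPORTED:
        raise ValueError(f"unsupported language: {name!r}; choose from {SUPPORTED}")
    return key


print(resolve_language("Bengali"))  # bangla
```

The returned key could then be passed to graphemeParser, for example graphemeParser(resolve_language("Malayalam")).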

About

  • Authors: Bengali.AI

  • If you use this package, please cite the Bengali.AI multipurpose grapheme dataset paper:

@inproceedings{alam2021large,
  title={A large multi-target dataset of common bengali handwritten graphemes},
  author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={383--398},
  year={2021},
  organization={Springer}
}

Change Log

0.0.1 (12/02/2022)

  • First Release

0.0.2 (12/02/2022)

  • Basic Documentation
  • Modifier removal
  • Space correction
  • Text mode parser
