
Grapheme Parser for Indic languages


indicparser


Installation

pip install indicparser

Usage

  • initializing the parser
from indicparser import graphemeParser
gp=graphemeParser("bangla")
  • extracting graphemes
text="  শাটিকাপ   মার"
graphemes=gp.process(text)
print("Graphemes:",graphemes)

Graphemes: [' ', ' ', 'শা', 'টি', 'কা', 'প', ' ', ' ', ' ', 'মা', 'র']
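To make the grouping in the output above concrete, here is a minimal sketch of how code points might be clustered into graphemes. It is not indicparser's implementation: `cluster_sketch` and `VIRAMA` are hypothetical names, and the rule (attach combining marks to the preceding base, and attach a consonant that follows a hasanta) is a simplifying assumption that ignores many real-world cases.

```python
import unicodedata

VIRAMA = "\u09cd"  # Bengali hasanta, which joins consonants into a conjunct


def cluster_sketch(text):
    """Group code points into rough grapheme clusters (illustrative only)."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        # Dependent vowel signs and other combining marks are Mn or Mc.
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # A letter right after a hasanta continues the same conjunct.
        joins_conjunct = prev.endswith(VIRAMA) and unicodedata.category(ch) == "Lo"
        if prev and not prev.isspace() and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(cluster_sketch("শাটিকাপ"))  # ['শা', 'টি', 'কা', 'প']
```

The point of the sketch is only that a grapheme can span several code points, so `len(text)` and the number of graphemes generally differ.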

  • extracting graphemes while merging consecutive spaces and stripping leading and trailing spaces
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes (space corrected):",graphemes)

Graphemes (space corrected): ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
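The effect of `merge_spaces=True` on the two outputs above can be reproduced with a small post-processing function. This is a sketch of the observed behavior only, not the library's code; `merge_spaces_sketch` is a hypothetical name.

```python
def merge_spaces_sketch(graphemes):
    """Collapse runs of spaces and drop leading and trailing spaces."""
    merged = []
    for g in graphemes:
        if g.isspace():
            # Keep one space per run, and never start the list with a space.
            if merged and merged[-1] != " ":
                merged.append(" ")
        else:
            merged.append(g)
    # Drop a trailing space left over from the end of the input.
    if merged and merged[-1] == " ":
        merged.pop()
    return merged


raw = [' ', ' ', 'শা', 'টি', 'কা', 'প', ' ', ' ', ' ', 'মা', 'র']
print(merge_spaces_sketch(raw))  # ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
```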

  • numbers, punctuation, and English text are also handled by default
text="এটাকি 2441139 ? না ভাই wrong number"
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes:",graphemes)

Graphemes: ['এ', 'টা', 'কি', ' ', '2', '4', '4', '1', '1', '3', '9', ' ', '?', ' ', 'না', ' ', 'ভা', 'ই', ' ', 'w', 'r', 'o', 'n', 'g', ' ', 'n', 'u', 'm', 'b', 'e', 'r']

  • available languages
from indicparser import languages
languages.keys()

dict_keys(['bangla', 'malyalam', 'tamil', 'gujrati', 'panjabi', 'odiya', 'hindi'])
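Several of these keys use spellings that differ from the more common English names (for example `malyalam` and `gujrati`), so a lookup with the usual spelling would miss. The sketch below maps a user-supplied name to the closest key using the standard library's `difflib`. The `SUPPORTED` list is copied from the output above rather than read from the package, and `resolve_language` is a hypothetical helper, not part of indicparser.

```python
import difflib

# Copied from the dict_keys output above; an assumption, not read from the package.
SUPPORTED = ["bangla", "malyalam", "tamil", "gujrati", "panjabi", "odiya", "hindi"]


def resolve_language(name, supported=SUPPORTED):
    """Return the supported key that matches or most closely resembles name."""
    key = name.strip().lower()
    if key in supported:
        return key
    # get_close_matches returns the best matches above a similarity cutoff.
    close = difflib.get_close_matches(key, supported, n=1)
    if close:
        return close[0]
    raise ValueError(f"unsupported language: {name!r}")


print(resolve_language("Malayalam"))  # malyalam
```

The resolved key could then be passed to `graphemeParser`.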

Normalization

  • For best results, normalize the text before parsing
  • An example Bangla Unicode normalizer can be found here
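As a rough illustration of why normalization matters, the same visible Bangla text can be encoded with different code point sequences. The sketch below uses Python's standard `unicodedata.normalize` with the NFC form as a generic first step. This is an assumption for illustration: it is not the normalizer referenced above, and a language-specific normalizer may handle more cases than NFC does.

```python
import unicodedata


def nfc(text):
    """Apply Unicode NFC normalization (a generic step, not Bangla-specific)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0995\u09c7\u09be"  # ক followed by the two-part vowel sign (ে + া)
composed = "\u0995\u09cb"          # ক followed by the single vowel sign ো

print(decomposed == composed)            # False: different code points
print(nfc(decomposed) == nfc(composed))  # True: identical after NFC
```

Without such a step, a parser can see two different sequences for text that looks identical on screen.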

ABOUT

  • Authors: Bengali.AI

  • Cite the Bengali.AI multipurpose grapheme dataset paper:

@inproceedings{alam2021large,
  title={A large multi-target dataset of common bengali handwritten graphemes},
  author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={383--398},
  year={2021},
  organization={Springer}
}

Change Log

0.0.1 (12/02/2022)

  • First Release

0.0.2 (12/02/2022)

  • Basic Documentation
  • Modifier removal
  • space correction
  • text mode parser

0.0.3 (14/02/2022)

  • Connector ending
  • Exception case for component construction in bangla
  • Added test

0.0.4 (14/02/2022)

  • pip test stable
  • added malformed word detection

0.0.5 (19/02/2022)

  • encoding correction
  • no space char handling

0.0.6 (15/04/2022)

  • removed malformed word detection [not useful]
  • removed component calculation [not consistent]

0.0.7 (26/04/2022)

  • addition order correction


