Grapheme Parser for Indic languages

indicparser

Installation

pip install indicparser

Usage

  • initializing the parser
from indicparser import graphemeParser
gp=graphemeParser("bangla")
  • extracting graphemes
text="  শাটিকাপ   মার"
graphemes=gp.process(text)
print("Graphemes:",graphemes)

Graphemes: [' ', ' ', 'শা', 'টি', 'কা', 'প', ' ', ' ', ' ', 'মা', 'র']

  • extracting graphemes while merging consecutive spaces and stripping leading and trailing spaces
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes (space corrected):",graphemes)

Graphemes (space corrected): ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
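
The effect of merge_spaces=True on the output above can be reproduced on a plain token list. The following is a minimal sketch only: merge_space_tokens is a hypothetical helper written for illustration, not the library's implementation, and it assumes that only the ' ' token counts as a space.

```python
def merge_space_tokens(tokens):
    """Collapse runs of ' ' tokens into one and drop leading/trailing ones."""
    merged = []
    for tok in tokens:
        if tok == " ":
            # Skip a space at the very start or directly after another space.
            if not merged or merged[-1] == " ":
                continue
        merged.append(tok)
    # At most one space can remain at the end; drop it.
    if merged and merged[-1] == " ":
        merged.pop()
    return merged


# Token list taken from the first example output above.
raw = [" ", " ", "শা", "টি", "কা", "প", " ", " ", " ", "মা", "র"]
print(merge_space_tokens(raw))
```

A helper like this can be handy when a raw grapheme list has already been stored and the text cannot be parsed again.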

  • numbers, punctuation, and English text are also handled by default
text="এটাকি 2441139 ? না ভাই wrong number"
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes:",graphemes)

Graphemes: ['এ', 'টা', 'কি', ' ', '2', '4', '4', '1', '1', '3', '9', ' ', '?', ' ', 'না', ' ', 'ভা', 'ই', ' ', 'w', 'r', 'o', 'n', 'g', ' ', 'n', 'u', 'm', 'b', 'e', 'r']
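
To make the idea of a grapheme split concrete, here is a simplified, standard-library-only sketch. It is not how indicparser works internally (the package's rules are not described in this README): simple_graphemes and BENGALI_VIRAMA are names introduced here for illustration. The sketch assumes that combining marks attach to the preceding character and that a letter following the Bangla hasanta (U+09CD) continues a conjunct. Every other character, including spaces, digits, punctuation, and Latin letters, becomes its own item.

```python
import unicodedata

BENGALI_VIRAMA = "\u09cd"  # hasanta; assumed here to join consonants into a conjunct


def simple_graphemes(text):
    """Split text into approximate graphemes (illustrative, Bangla only)."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        # Vowel signs and other combining marks have categories Mn/Mc/Me.
        is_mark = category.startswith("M")
        # A letter right after a hasanta continues the current conjunct.
        joins_conjunct = (
            bool(clusters)
            and clusters[-1].endswith(BENGALI_VIRAMA)
            and category == "Lo"
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(simple_graphemes("এটাকি 2441139 ? না ভাই wrong number"))
```

The real parser handles more cases than this (the change log mentions modifier removal and connector handling), so treat the sketch as a mental model rather than a replacement.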

  • available languages
from indicparser import languages
languages.keys()

dict_keys(['bangla', 'malyalam', 'tamil', 'gujrati', 'panjabi', 'odiya', 'hindi'])
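
Several keys use spellings that are easy to mistype (for example 'malyalam' and 'gujrati'), so it can help to check a name before constructing the parser. The sketch below is an assumption-laden illustration: resolve_language and DOCUMENTED_LANGUAGES are hypothetical names, the key list is copied from the output above, and this README does not say how graphemeParser itself reacts to an unknown name.

```python
import difflib

# Keys copied from the dict_keys output above; in real use, pass
# list(languages) from indicparser instead of this hard-coded list.
DOCUMENTED_LANGUAGES = [
    "bangla", "malyalam", "tamil", "gujrati", "panjabi", "odiya", "hindi",
]


def resolve_language(name, available=DOCUMENTED_LANGUAGES):
    """Return the matching key, or raise ValueError with a close suggestion."""
    key = name.strip().lower()
    if key in available:
        return key
    # Suggest the most similar documented key, if any is close enough.
    close = difflib.get_close_matches(key, available, n=1, cutoff=0.6)
    hint = f" Did you mean '{close[0]}'?" if close else ""
    raise ValueError(f"Unsupported language '{name}'.{hint}")


print(resolve_language("Bangla"))
try:
    resolve_language("malayalam")
except ValueError as err:
    print(err)
```

The returned key can then be passed to graphemeParser as in the initialization example.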

Normalization

  • For best results, normalize the text before parsing
  • An example Bangla Unicode normalizer can be found here
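
As a small illustration of why normalization matters, the same visible Bangla syllable can be encoded in more than one way. The sketch below uses Python's standard unicodedata module and NFC, which is a generic Unicode normalization form and not the Bangla-specific normalizer referred to above. The helper name nfc is introduced here for illustration. NFC only unifies canonically equivalent sequences, so it should be seen as a baseline, not a full substitute for a dedicated normalizer.

```python
import unicodedata


def nfc(text):
    """Apply Unicode NFC normalization (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# The vowel sign O can be typed as one code point (U+09CB) or as the
# two-part sequence E + AA (U+09C7 U+09BE); both render the same.
decomposed = "\u0995\u09c7\u09be"
composed = "\u0995\u09cb"

print(decomposed == composed)       # the raw strings differ
print(nfc(decomposed) == composed)  # equal after normalization
```

Applying a step like this before calling gp.process means equivalent spellings are less likely to end up as different graphemes.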

ABOUT

  • Authors: Bengali.AI

  • Cite the Bengali.AI multipurpose grapheme dataset paper:

@inproceedings{alam2021large,
  title={A large multi-target dataset of common bengali handwritten graphemes},
  author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={383--398},
  year={2021},
  organization={Springer}
}

Change Log

0.0.1 (12/02/2022)

  • First Release

0.0.2 (12/02/2022)

  • Basic Documentation
  • Modifier removal
  • space correction
  • text mode parser

0.0.3 (14/02/2022)

  • Connector ending
  • Exception case for component construction in bangla
  • Added test

0.0.4 (14/02/2022)

  • pip test stable
  • added malformed word detection

0.0.5 (19/02/2022)

  • encoding correction
  • no space char handling

0.0.6 (15/04/2022)

  • removed malformed word detection [not useful]
  • removed component calculation [not consistent]

0.0.7 (26/04/2022)

  • addition order correction

0.0.8 (21/10/2022)

  • allow middle Connector
