Multilingual Partial Syllable Tokenization - A rule-based tokenization method designed to align with linguistic nuances while minimizing False Positive errors.

Project description

We would like to introduce Multilingual Partial Syllable Tokenization, a novel rule-based tokenization method that breaks text into partial rather than complete syllables. In our experiments it has proven useful for keyword detection, substantially reducing false-positive errors, and it has aided Burmese name recognition in a combined rule-based and machine-learning pipeline. Notably, the method is designed to align with the linguistic nuances of each language without requiring an exhaustive understanding of that specific language. It is now integrated with a frequency-based approach to generate tokens.
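To illustrate the idea (this is a minimal sketch of partial-syllable segmentation for Burmese, not the package's actual API), one simple rule is to start a new token before every base consonant unless it is stacked with the preceding one via the virama U+1039. Combining marks such as medials, vowel signs, and the asat stay attached to the consonant they follow, so tokens are sub-syllable units rather than complete syllables:

```python
import re

# Break before a base consonant (U+1000-U+1021) unless it is stacked,
# i.e. immediately preceded by the Burmese virama U+1039.
BREAK = re.compile(r"(?<!\u1039)([\u1000-\u1021])")

def partial_syllable_tokenize(text: str) -> list[str]:
    """Split Burmese text into partial syllables (hypothetical sketch)."""
    # Insert a space before each unstacked consonant, then split on whitespace.
    return BREAK.sub(r" \1", text).split()

print(partial_syllable_tokenize("မြန်မာ"))  # ['မြ', 'န်', 'မာ']
```

Note how the word "မြန်မာ" yields three partial syllables, whereas full syllable segmentation would produce two (မြန် and မာ): the final consonant with asat (န်) is kept as its own token instead of being merged into the preceding syllable.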

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

simbolotokenizer-0.1.0.tar.gz (3.0 kB)

Uploaded Source

Built Distribution

simbolotokenizer-0.1.0-py3-none-any.whl (1.5 kB)

Uploaded Python 3
