Multilingual Partial Syllable Tokenization - A rule-based tokenization method designed to align with linguistic nuances while minimizing False Positive errors.

Project description

We would like to introduce Multilingual Partial Syllable Tokenization, a novel rule-based tokenization method that breaks text into partial rather than complete syllables. In our experiments it has proven useful for keyword detection, substantially reducing false-positive errors, and it has aided Burmese name recognition in a combined rule-based and machine-learning pipeline. Notably, the method is designed to align with the linguistic nuances of each language without requiring an exhaustive understanding of that specific language. It is now integrated with a frequency-based approach to generate tokens.
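To illustrate the idea (this is a minimal sketch of partial-syllable segmentation for Burmese, not the package's actual API), one simple rule is to start a new token before every base consonant unless it is stacked with the preceding one via the virama U+1039. Combining marks such as medials, vowel signs, and the asat stay attached to the consonant they follow, so tokens are sub-syllable units rather than complete syllables:

```python
import re

# Break before a base consonant (U+1000-U+1021) unless it is stacked,
# i.e. immediately preceded by the Burmese virama U+1039.
BREAK = re.compile(r"(?<!\u1039)([\u1000-\u1021])")

def partial_syllable_tokenize(text: str) -> list[str]:
    """Split Burmese text into partial syllables (hypothetical sketch)."""
    # Insert a space before each unstacked consonant, then split on whitespace.
    return BREAK.sub(r" \1", text).split()

print(partial_syllable_tokenize("မြန်မာ"))  # ['မြ', 'န်', 'မာ']
```

Note how the word "မြန်မာ" yields three partial syllables, whereas full syllable segmentation would produce two (မြန် and မာ): the final consonant with asat (န်) is kept as its own token instead of being merged into the preceding syllable.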

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

simbolotokenizer-0.1.0.tar.gz (3.0 kB)

Uploaded Source

Built Distribution

simbolotokenizer-0.1.0-py3-none-any.whl (1.5 kB)

Uploaded Python 3
