A custom tokenizer for Swahili text using syllabic vocabulary with byte fallback.

Project description

Swahili Tokenization

  • I will update the readme.md with more information

Syllabic Tokenization with Byte Fallback

  • Syllabic tokenization with byte fallback allows foreign (non-Swahili) elements in the text to be tokenized.
  • Inspired by SentencePiece. A citation will be added.
  • The resulting vocabulary is small, approximately 1200 tokens.

Syllabic Tokenization

  • Kiswahili is a syllabic language.
  • Tokenizes a sentence into the 219 Kiswahili syllables.
  • I hypothesize that this will make the model syllable-aware. I will provide more information and references on the syllabic nature of the language later.
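
To make the idea concrete, here is a minimal sketch of splitting a word into syllables by greedy longest match. It is not the package's actual implementation or API: the `SYLLABLES` set is a tiny hypothetical inventory chosen for illustration (the real inventory is described above as 219 syllables), and `syllabify` is a made-up name. Greedy matching is also an assumption; the package may segment differently.

```python
# Hypothetical, tiny syllable inventory for illustration only.
# The real tokenizer is described as using 219 Kiswahili syllables.
SYLLABLES = {"ha", "ba", "ri", "ki", "swa", "hi", "li", "m", "tu"}
MAX_LEN = max(len(s) for s in SYLLABLES)


def syllabify(word):
    """Split a word into syllables by greedy longest match (a sketch)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for size in range(min(MAX_LEN, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in SYLLABLES:
                tokens.append(piece)
                i += size
                break
        else:
            # No syllable matched: emit the character unchanged.
            tokens.append(word[i])
            i += 1
    return tokens


print(syllabify("habari"))     # ['ha', 'ba', 'ri']
print(syllabify("kiswahili"))  # ['ki', 'swa', 'hi', 'li']
```

In this sketch, characters that match no syllable are passed through as single-character tokens; how the package handles them is covered in the Byte Fallback section.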

Byte Fallback

  • Items that do not appear as a syllable fall back to the UTF-8 representation of the character.
  • Allows tokenization of non-Swahili elements that appear in the sentence. A simple example is an English name such as john ('jo', ?).
  • Falls back to the unknown token when all else fails.
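
The fallback chain described above (syllable, then UTF-8 bytes, then unknown token) could be sketched as follows. Everything here is an assumption made for illustration: the `tokenize` function, the small `SYLLABLES` set, the SentencePiece-style `<0xNN>` byte token format, and the `<unk>` string are hypothetical and may not match what the package actually emits.

```python
# Hypothetical names and token formats; not the package's actual API.
SYLLABLES = {"jo", "ha", "ba", "ri"}
# One token per possible byte value, written SentencePiece-style.
BYTE_TOKENS = {f"<0x{b:02X}>" for b in range(256)}
UNK = "<unk>"


def tokenize(text, syllables=SYLLABLES, byte_tokens=BYTE_TOKENS):
    """Syllable match first, then UTF-8 bytes, then the unknown token."""
    max_len = max(len(s) for s in syllables)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in syllables:
                tokens.append(piece)
                i += size
                break
        else:
            # No syllable matched: fall back to the character's UTF-8 bytes.
            for b in text[i].encode("utf-8"):
                tok = f"<0x{b:02X}>"
                # Last resort: the unknown token if the byte token is missing.
                tokens.append(tok if tok in byte_tokens else UNK)
            i += 1
    return tokens


print(tokenize("john"))  # ['jo', '<0x68>', '<0x6E>']
print(tokenize("é"))     # ['<0xC3>', '<0xA9>']
```

With all 256 byte tokens in the vocabulary, any input can be represented, so the unknown token is only reached in this sketch when the byte vocabulary is incomplete.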

Example Usage:

I will add some examples

Download files

Source Distribution

silabi-tokenizer-0.1.0.tar.gz (5.8 kB view details)

Uploaded Source

Built Distribution

silabi_tokenizer-0.1.0-py3-none-any.whl (7.1 kB view details)

Uploaded Python 3

File details

Details for the file silabi-tokenizer-0.1.0.tar.gz.

File metadata

  • Download URL: silabi-tokenizer-0.1.0.tar.gz
  • Upload date:
  • Size: 5.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.6

File hashes

Hashes for silabi-tokenizer-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e1c77e8d2c97d6c33301165b143949026b86b1f327fe6d950a5fe93c9245a473
MD5 42c4da84c36ed71ffc76e0398f85d61c
BLAKE2b-256 ead3bffc0d8d90e52d48672a716983c22631a86d2a94d9bb152b6e47ce40bdda

File details

Details for the file silabi_tokenizer-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for silabi_tokenizer-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2a6f1dbd262afbd12347d82eefeee7dacfcc9483e544b6bf3ebc3bf88c0bf848
MD5 87b6b747418a4890562e57eebb90f8bf
BLAKE2b-256 53ec790a8e8389edd66eb32c1b3ea782a08db649cce3576f7b3275f3d0549efa
