A custom tokenizer for Swahili text using syllabic vocabulary with byte fallback.
Swahili Tokenization
- I will update the README with more information.
Syllabic Tokenization with Byte Fallback
- Syllabic tokenization with byte fallback allows foreign (non-Swahili) elements in the text to be tokenized.
- Inspired by SentencePiece. I will add a citation.
- The resultant vocabulary size is small, approximately 1200.
Syllabic Tokenization
- Kiswahili is a syllabic language.
- Splits a sentence into tokens drawn from the 219 Kiswahili syllables.
- I hypothesize that this will allow the model to be syllable-aware. I will provide more information about the syllabic structure of the language, with references, later.
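The syllable splitting described above could be sketched as a greedy longest-match scan. This is a minimal illustration, not the package's actual implementation: `TOY_SYLLABLES` and `syllabify` are hypothetical names, and the set is a small hand-picked subset rather than the full 219-syllable inventory.

```python
# Toy subset of Kiswahili syllables, for illustration only (assumption:
# the real inventory has 219 entries and may be matched differently).
TOY_SYLLABLES = {"ha", "ba", "ri", "m", "to", "a", "na", "so", "ma", "ki", "ta", "bu"}


def syllabify(word, syllables=TOY_SYLLABLES):
    """Split a lowercase word by greedy longest match.

    Returns a list of syllables, or None if some position matches no syllable.
    """
    max_len = max(len(s) for s in syllables)
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            candidate = word[i:i + size]
            if candidate in syllables:
                pieces.append(candidate)
                i += size
                break
        else:
            # No syllable starts here; a full tokenizer would fall back instead.
            return None
    return pieces


print(syllabify("habari"))   # ['ha', 'ba', 'ri']
print(syllabify("anasoma"))  # ['a', 'na', 'so', 'ma']
```

Greedy matching is chosen here only because it is short to write; it is not guaranteed to find a valid split whenever one exists.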
Byte Fallback
- Items that do not match a syllable fall back to the UTF-8 byte representation of the character.
- This allows tokenization of non-Swahili elements that appear in the sentence. A simple example is an English name such as john ('jo', ?).
- Falls back to the unknown token when everything else fails.
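The fallback chain above (syllable, then UTF-8 bytes, then unknown token) might look roughly like the sketch below. Everything named here is an assumption for illustration: `tokenize`, `TOY_SYLLABLES`, the `<unk>` string, and the SentencePiece-style `<0xNN>` spelling of byte tokens are not taken from the package itself.

```python
# Hypothetical toy vocabulary; the real syllable set and token spellings may differ.
TOY_SYLLABLES = {"jo", "ha", "ba", "ri", "na", "ni"}
BYTE_TOKENS = {f"<0x{b:02X}>" for b in range(256)}  # one token per possible byte
UNK = "<unk>"


def tokenize(text, syllables=TOY_SYLLABLES, byte_tokens=BYTE_TOKENS):
    """Greedy syllable matching with per-character UTF-8 byte fallback."""
    max_len = max(len(s) for s in syllables)
    tokens = []
    i = 0
    while i < len(text):
        # 1. Try to match a syllable, longest candidate first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in syllables:
                tokens.append(candidate)
                i += size
                break
        else:
            # 2. No syllable matched: emit one token per UTF-8 byte of the character.
            for byte in text[i].encode("utf-8"):
                piece = f"<0x{byte:02X}>"
                # 3. Last resort: unknown token if the byte token is not in the vocabulary.
                tokens.append(piece if piece in byte_tokens else UNK)
            i += 1
    return tokens


print(tokenize("john"))  # ['jo', '<0x68>', '<0x6E>']
```

With a full 256-entry byte vocabulary the unknown token is never produced in this sketch; it only appears if the byte vocabulary is restricted, for example `tokenize("x", byte_tokens=set())`.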
Example Usage:
I will add some examples
File details
Details for the file silabi-tokenizer-0.1.0.tar.gz.
File metadata
- Download URL: silabi-tokenizer-0.1.0.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e1c77e8d2c97d6c33301165b143949026b86b1f327fe6d950a5fe93c9245a473 |
| MD5 | 42c4da84c36ed71ffc76e0398f85d61c |
| BLAKE2b-256 | ead3bffc0d8d90e52d48672a716983c22631a86d2a94d9bb152b6e47ce40bdda |
File details
Details for the file silabi_tokenizer-0.1.0-py3-none-any.whl.
File metadata
- Download URL: silabi_tokenizer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a6f1dbd262afbd12347d82eefeee7dacfcc9483e544b6bf3ebc3bf88c0bf848 |
| MD5 | 87b6b747418a4890562e57eebb90f8bf |
| BLAKE2b-256 | 53ec790a8e8389edd66eb32c1b3ea782a08db649cce3576f7b3275f3d0549efa |