A custom tokenizer for Swahili text using syllabic vocabulary with byte fallback.
Swahili Syllabic Tokenizer
This repository hosts a custom tokenizer that splits Swahili text into syllables using a predefined syllabic vocabulary. The tokenizer is compatible with the Hugging Face transformers library, so it can be integrated into existing NLP pipelines and models.
Features
- Syllabic Tokenization: Tokenizes Swahili text into syllables based on a predefined syllabic vocabulary.
- Byte Fallback: Falls back to UTF-8 bytes for text that the syllabic vocabulary does not cover.
- Customizable: Can be extended and adapted for specific NLP tasks.
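To make the first two features concrete, here is a minimal sketch of syllabic tokenization with byte fallback. It is not the package's implementation: the tiny `SYLLABLES` set, the greedy longest-match strategy, the lowercasing step, the `<0xNN>` byte-token format, and the function names are all assumptions chosen for illustration.

```python
# Hypothetical toy vocabulary; the real package ships its own syllabic vocabulary.
SYLLABLES = {"a", "i", "m", "fa", "hi", "ma", "ni", "no", "wa", "ndi", "shi", " "}
MAX_LEN = max(len(s) for s in SYLLABLES)


def byte_token(value: int) -> str:
    """Render one byte as a fallback token such as '<0x2E>' (assumed format)."""
    return f"<0x{value:02X}>"


def tokenize(text: str) -> list[str]:
    """Greedy longest-match split into syllables, with UTF-8 byte fallback."""
    text = text.lower()  # assumption: the toy vocabulary is lowercase only
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in SYLLABLES:
                tokens.append(piece)
                i += size
                break
        else:
            # No syllable matched: emit one token per UTF-8 byte of this character.
            tokens.extend(byte_token(b) for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def detokenize(tokens: list[str]) -> str:
    """Rebuild text by turning byte tokens back into raw bytes."""
    out = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            out.append(int(tok[3:5], 16))
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8", errors="replace")


print(tokenize("Hii ni mfano."))
# ['hi', 'i', ' ', 'ni', ' ', 'm', 'fa', 'no', '<0x2E>']
```

In this sketch the period is not in the vocabulary, so it is emitted as the single byte token `<0x2E>`; a multi-byte character such as `é` would produce two byte tokens. Because fallback works at the byte level, decoding can always reconstruct the (lowercased) input.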
Usage
Installation
```bash
pip install -r requirements.txt
```
Example
```python
from hf_tokenizer import SilabiTokenizer

# Initialize the tokenizer
tokenizer = SilabiTokenizer()

# Encode a sample text
encoded_input = tokenizer("Hii ni mfano wa maandishi.")
print("Encoded Input:", encoded_input)

# Decode the token ids back to text
decoded_text = tokenizer.decode(encoded_input['input_ids'])
print("Decoded Text:", decoded_text)
```
Contributions
Contributions and suggestions are welcome! Feel free to open an issue or submit a pull request.
Download files
File details
Details for the file silabi-tokenizer-0.5.0.tar.gz.
File metadata
- Download URL: silabi-tokenizer-0.5.0.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 28bed2cdfe9acc9544a639a92762e66bc7b7eeffd493668ac5ba0078b35bcda5 |
| MD5 | 3f75cc425ac8d4efd3c1a68d06b7b1e4 |
| BLAKE2b-256 | fdc0ae08bdef7b61fdd9e8ef298810d360ccac6f97b00c0e3a2ae9adfeb6c3dc |
File details
Details for the file silabi_tokenizer-0.5.0-py3-none-any.whl.
File metadata
- Download URL: silabi_tokenizer-0.5.0-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9ebbeb55968abdd6f7bb3fcc410f5003eb1f6c1d41485da7ad6543f21409ee7b |
| MD5 | 2c5048ffa3e7c41dec053404b2f113a4 |
| BLAKE2b-256 | a17c6716d5c0b10d00d44c7a00bc812e48a8b51f954f49a21348ef7469af1743 |
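
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches_published_digest` are hypothetical, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example usage (assumes the wheel was downloaded to the current directory):
# ok = matches_published_digest(
#     "silabi_tokenizer-0.5.0-py3-none-any.whl",
#     "9ebbeb55968abdd6f7bb3fcc410f5003eb1f6c1d41485da7ad6543f21409ee7b",
# )
```

A mismatch means the file differs from the one that was uploaded and should not be installed.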