
A custom tokenizer for Swahili text using syllabic vocabulary with byte fallback.


Swahili Syllabic Tokenizer

This repository hosts a custom tokenizer for Swahili text, designed to tokenize text into syllables using a syllabic vocabulary. The tokenizer is compatible with the Hugging Face transformers library, making it easy to integrate into NLP pipelines and models.

Features

  • Syllabic Tokenization: Tokenizes Swahili text into syllables based on a predefined syllabic vocabulary.
  • Byte Fallback: Handles UTF-8 byte fallback for out-of-vocabulary tokens.
  • Customizable: Easily extendable and adaptable for specific NLP tasks.
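The byte-fallback behaviour can be illustrated with a toy sketch. Everything below is made up for illustration: the real SilabiTokenizer's syllable inventory, token ids, and matching strategy may differ.

```python
# Illustrative sketch of syllabic tokenization with UTF-8 byte fallback.
# TOY_VOCAB is a made-up mini-vocabulary, not the package's real one.
TOY_VOCAB = {"hi": 0, "i": 1, "ni": 2, "mfa": 3, "no": 4, " ": 5}
BYTE_OFFSET = len(TOY_VOCAB)  # byte tokens occupy ids BYTE_OFFSET .. BYTE_OFFSET + 255

def encode(text):
    """Greedy longest-match over the syllable vocab; any character that
    starts no known syllable falls back to its UTF-8 bytes, one token per byte."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[i:j]])
                i = j
                break
        else:  # no syllable matched: emit byte-fallback tokens for one character
            for b in text[i].encode("utf-8"):
                ids.append(BYTE_OFFSET + b)
            i += 1
    return ids

print(encode("hi ni"))  # every piece is in the toy vocab
print(encode("xi"))     # "x" is out-of-vocabulary, so it becomes a byte token
```

Because every possible byte has a fallback id, no input can produce an unknown token; out-of-vocabulary text degrades to longer (byte-level) sequences rather than failing.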

Usage

Installation

pip install silabi-tokenizer

Or, when working from a clone of the repository:

pip install -r requirements.txt

Example

from hf_tokenizer import SilabiTokenizer

# Initialize the tokenizer
tokenizer = SilabiTokenizer()
# Encode a sample Swahili text ("This is an example of text.")
encoded_input = tokenizer("Hii ni mfano wa maandishi.")

# Decode the token ids back to text
decoded_text = tokenizer.decode(encoded_input['input_ids'])

print("Encoded Input:", encoded_input)
print("Decoded Text:", decoded_text)

Contributions

Contributions and suggestions are welcome! Feel free to open an issue or submit a pull request.

Download files

Download the file for your platform.

Source Distribution

silabi-tokenizer-0.5.0.tar.gz (6.2 kB), uploaded as Source

Built Distribution


silabi_tokenizer-0.5.0-py3-none-any.whl (7.2 kB), uploaded for Python 3

File details

Details for the file silabi-tokenizer-0.5.0.tar.gz.

File metadata

  • File name: silabi-tokenizer-0.5.0.tar.gz
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.6

File hashes

Hashes for silabi-tokenizer-0.5.0.tar.gz:

  • SHA256: 28bed2cdfe9acc9544a639a92762e66bc7b7eeffd493668ac5ba0078b35bcda5
  • MD5: 3f75cc425ac8d4efd3c1a68d06b7b1e4
  • BLAKE2b-256: fdc0ae08bdef7b61fdd9e8ef298810d360ccac6f97b00c0e3a2ae9adfeb6c3dc

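The published digests can be used to verify a download before installing it. Below is a small hedged sketch using only Python's standard library; the file path is an assumption (wherever you saved the sdist), and the expected digest is the SHA256 value listed above.

```python
import hashlib

def sha256_of(path):
    """Return the hex SHA256 digest of a file, read in chunks so large
    archives do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Digest published for the sdist on this page.
EXPECTED = "28bed2cdfe9acc9544a639a92762e66bc7b7eeffd493668ac5ba0078b35bcda5"

# Hypothetical local path; uncomment after downloading the archive:
# assert sha256_of("silabi-tokenizer-0.5.0.tar.gz") == EXPECTED
```

Alternatively, pip's hash-checking mode (`pip install --require-hashes -r requirements.txt` with `--hash=sha256:...` pinned in the requirements file) performs the same verification automatically.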

File details

Details for the file silabi_tokenizer-0.5.0-py3-none-any.whl.

File hashes

Hashes for silabi_tokenizer-0.5.0-py3-none-any.whl:

  • SHA256: 9ebbeb55968abdd6f7bb3fcc410f5003eb1f6c1d41485da7ad6543f21409ee7b
  • MD5: 2c5048ffa3e7c41dec053404b2f113a4
  • BLAKE2b-256: a17c6716d5c0b10d00d44c7a00bc812e48a8b51f954f49a21348ef7469af1743

