
A custom tokenizer for Swahili text using syllabic vocabulary with byte fallback.

Project description

Swahili Syllabic Tokenizer

This repository hosts a custom tokenizer for Swahili text, designed to tokenize text into syllables using a syllabic vocabulary. The tokenizer is compatible with the Hugging Face transformers library, making it easy to integrate into NLP pipelines and models.

Features

  • Syllabic Tokenization: Tokenizes Swahili text into syllables based on a predefined syllabic vocabulary.
  • Byte Fallback: Handles UTF-8 byte fallback for out-of-vocabulary tokens.
  • Customizable: Easily extendable and adaptable for specific NLP tasks.
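To illustrate the idea behind syllabic tokenization with byte fallback, here is a simplified, self-contained sketch. The tiny vocabulary and the greedy longest-match strategy are illustrative assumptions only; they do not reflect the package's actual vocabulary or algorithm:

```python
# Toy vocabulary of Swahili syllables -- illustrative only, not the
# package's real syllabic vocabulary.
VOCAB = {"hi", "i", "ni", "m", "fa", "no", "wa", "ma", "a", "ndi", "shi"}

def tokenize_word(word, vocab=VOCAB):
    """Greedy longest-match syllable split; unmatched characters fall
    back to their UTF-8 bytes (SentencePiece-style <0xNN> tokens)."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Byte fallback: emit one token per UTF-8 byte of the character
            tokens.extend(f"<0x{b:02X}>" for b in word[i].encode("utf-8"))
            i += 1
    return tokens

print(tokenize_word("mfano"))  # ['m', 'fa', 'no']
print(tokenize_word("x"))      # ['<0x78>']
```

Because every character can be spelled out as UTF-8 bytes, this scheme never produces an unknown-token failure, which is the point of the byte-fallback feature.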

Usage

Installation

pip install silabi-tokenizer

Or, to install the dependencies from a clone of the repository:

pip install -r requirements.txt

Example

from hf_tokenizer import SilabiTokenizer

# Initialize the tokenizer
tokenizer = SilabiTokenizer()
# Encode a sample text
encoded_input = tokenizer("Hii ni mfano wa maandishi.")

# Decode the token ids back to text
decoded_text = tokenizer.decode(encoded_input['input_ids'])

print("Encoded Input:", encoded_input)
print("Decoded Text:", decoded_text)
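Decoding mirrors encoding: syllable tokens pass through unchanged, while byte-fallback tokens are reassembled into UTF-8 text. The sketch below assumes a SentencePiece-style `<0xNN>` token format for raw bytes; this is an illustrative assumption, not necessarily the package's internal representation:

```python
import re

# Matches a byte-fallback token such as "<0xC3>".
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")

def detokenize(tokens):
    """Reassemble tokens into text: byte tokens contribute raw UTF-8
    bytes, all other tokens contribute their own characters."""
    out = bytearray()
    for tok in tokens:
        m = BYTE_TOKEN.fullmatch(tok)
        if m:
            out.append(int(m.group(1), 16))   # raw UTF-8 byte
        else:
            out.extend(tok.encode("utf-8"))   # ordinary syllable token
    return out.decode("utf-8")

print(detokenize(["m", "fa", "no"]))       # mfano
print(detokenize(["<0xC3>", "<0xA9>"]))    # é
```

Accumulating bytes before decoding matters: a multi-byte character such as "é" arrives as several byte tokens, and only the full sequence is valid UTF-8.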

Contributions

Contributions and suggestions are welcome! Feel free to open an issue or submit a pull request.

Project details

Download files

Source Distribution

silabi-tokenizer-0.2.0.tar.gz (6.1 kB)

Built Distribution

silabi_tokenizer-0.2.0-py3-none-any.whl (7.2 kB)

File details

Details for the file silabi-tokenizer-0.2.0.tar.gz:

  • Size: 6.1 kB
  • Tags: Source
  • Uploaded via: twine/5.1.1 CPython/3.11.6
  • SHA256: 2b52424a69a608c3ebda1070a67682761a5c75026af0b2534db3121e4d16cbd8
  • MD5: fc4d4eedc5361f01f09e19b2ca46d013
  • BLAKE2b-256: 84f2d55d633823f565d88bdc3554eec97892f2c289c6673a627c21c3fd3a9612

Details for the file silabi_tokenizer-0.2.0-py3-none-any.whl:

  • Size: 7.2 kB
  • Tags: Python 3
  • SHA256: 3c3af9a136b9a82ef22b52917ad575b28fb98b0923db65d43309b12937d88bc0
  • MD5: 14f2bf974408e6a7ca5348ea27f9bdaa
  • BLAKE2b-256: 3293427c39d44b2266aa5f623debbac4285f6f328142da75feb61ba40cf907af
