
A custom tokenizer for Swahili text using syllabic vocabulary with byte fallback.

Project description

Swahili Syllabic Tokenizer

This repository hosts a custom tokenizer for Swahili text, designed to tokenize text into syllables using a syllabic vocabulary. The tokenizer is compatible with the Hugging Face transformers library, making it easy to integrate into NLP pipelines and models.

Features

  • Syllabic Tokenization: Tokenizes Swahili text into syllables based on a predefined syllabic vocabulary.
  • Byte Fallback: Handles UTF-8 byte fallback for out-of-vocabulary tokens.
  • Customizable: Easily extendable and adaptable for specific NLP tasks.
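The byte-fallback behaviour can be illustrated with a minimal sketch: a greedy longest-match syllabifier that emits `<0xNN>` byte tokens for anything outside the vocabulary. The toy vocabulary and the `<0xNN>` token format below are assumptions for illustration (the format follows the SentencePiece convention), not the package's actual internals:

```python
# Illustrative sketch only: greedy longest-match syllable tokenization
# with UTF-8 byte fallback. TOY_VOCAB is a hypothetical stand-in for the
# package's real syllabic vocabulary.

TOY_VOCAB = {"ha", "ba", "ri", "ya", "ko", "m", "a"}

def tokenize(text, vocab=TOY_VOCAB, max_len=3):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible syllable first.
        for size in range(max_len, 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Out-of-vocabulary character: fall back to its UTF-8 bytes,
            # rendered as <0xNN> tokens.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

print(tokenize("habari"))  # ['ha', 'ba', 'ri']
print(tokenize("haba!"))   # ['ha', 'ba', '<0x21>']
```

Because every character can be decomposed into its UTF-8 bytes, this scheme never produces an unknown-token; that is the point of byte fallback.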

Usage

Installation

pip install silabi-tokenizer

Example

from hf_tokenizer import SilabiTokenizer

# Initialize the tokenizer
tokenizer = SilabiTokenizer()
# Encode a sample text ("Hii ni mfano wa maandishi." ≈ "This is an example of text.")
encoded_input = tokenizer("Hii ni mfano wa maandishi.")

# Decode the token ids back to text
decoded_text = tokenizer.decode(encoded_input['input_ids'])

print("Encoded Input:", encoded_input)
print("Decoded Text:", decoded_text)

Contributions

Contributions and suggestions are welcome! Feel free to open an issue or submit a pull request.

Download files

Download the file for your platform.

Source Distribution

silabi-tokenizer-0.4.0.tar.gz (6.1 kB)


Built Distribution


silabi_tokenizer-0.4.0-py3-none-any.whl (7.2 kB)


File details

Details for the file silabi-tokenizer-0.4.0.tar.gz.

File metadata

  • Download URL: silabi-tokenizer-0.4.0.tar.gz
  • Upload date:
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.6

File hashes

Hashes for silabi-tokenizer-0.4.0.tar.gz

  • SHA256: 788e853861f9bdf1ac3f034c9b7d74ca2689c23078671df30465a9c5013d5bc4
  • MD5: 75c64e067f871087c71ca11b37bfa2cd
  • BLAKE2b-256: db0891108f1251a3a396579e7eb759818403ec5d2d7df2203bccefa6debdf3d9
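These digests can be checked against a downloaded file with Python's standard `hashlib` module. A minimal sketch (the file-reading step in the comment assumes the sdist has been downloaded to the working directory):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()

# To verify the downloaded sdist, hash its bytes and compare with the
# published digest:
#   with open("silabi-tokenizer-0.4.0.tar.gz", "rb") as f:
#       assert sha256_hex(f.read()) == (
#           "788e853861f9bdf1ac3f034c9b7d74ca2689c23078671df30465a9c5013d5bc4"
#       )

# Self-contained demonstration on a known input:
print(sha256_hex(b"hello"))
# 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```

pip performs the same comparison automatically when hashes are pinned in a requirements file.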


File details

Details for the file silabi_tokenizer-0.4.0-py3-none-any.whl.

File metadata

File hashes

Hashes for silabi_tokenizer-0.4.0-py3-none-any.whl

  • SHA256: ab7dba765e255cf5f913ad26e96a68e258f56f4ab9d6224fe845e4e9db666be5
  • MD5: 6301bd3fe3d3541059fc491b81f2e7b1
  • BLAKE2b-256: 23d71f1a598601655233316bb40c4d75b5534e2c9fdd28430147d78fcfc2c3e3

