Kinyarwanda tokenizer for encoding and decoding Kinyarwanda language text

Kin-Tokenizer

kin-tokenizer is a Python library for tokenizing Kinyarwanda text. It can both encode text into tokens and decode tokens back into text, and its pretrained vocabulary contains 20,257 entries.

Installation

You can install the package using pip:

pip install kin-tokenizer

Basic Usage

from kin_tokenizer import KinTokenizer  # Importing Tokenizer class

# Creating an instance of tokenizer
tokenizer = KinTokenizer()

# Loading the state of the tokenizer (pretrained tokenizer)
tokenizer.load()

# Encoding
text = "Nagiye gusura inshuti zanjye dusoma ibitabo"
tokens = tokenizer.encode(text)
print(tokens)

# Decoding
decoded_text = tokenizer.decode(tokens)
print(decoded_text)

# Printing the vocab size
print(tokenizer.vocab_size)

# Print vocabulary (first 1000 items)
count = 0
for k, v in tokenizer.vocab.items():
    print("{} : {}".format(k, v))
    count += 1
    if count >= 1000:
        break
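A quick way to sanity-check any tokenizer that exposes encode and decode, as in the usage above, is a round-trip test: encode a string, decode the result, and compare it with the original. The sketch below is illustrative only. ByteTokenizer is a hypothetical stand-in (one token per UTF-8 byte), not part of kin-tokenizer, and round_trips is a helper name invented for this example. Whether KinTokenizer reproduces every input exactly is not stated here, so treat a failed round trip as something to investigate rather than as a bug.

```python
from typing import List, Protocol


class SupportsEncodeDecode(Protocol):
    """Anything with encode/decode methods shaped like the usage above."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, tokens: List[int]) -> str: ...


class ByteTokenizer:
    """Hypothetical stand-in (NOT kin-tokenizer): one token per UTF-8 byte."""

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: List[int]) -> str:
        return bytes(tokens).decode("utf-8")


def round_trips(tokenizer: SupportsEncodeDecode, text: str) -> bool:
    """Return True if decoding the encoded text gives back the original."""
    return tokenizer.decode(tokenizer.encode(text)) == text


if __name__ == "__main__":
    sample = "Nagiye gusura inshuti zanjye dusoma ibitabo"
    print(round_trips(ByteTokenizer(), sample))
```

Because round_trips only relies on the two method names, the same call should work with a loaded KinTokenizer instance in place of ByteTokenizer.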

Training Your Own Tokenizer

You can also train your own tokenizer with the utils module, which provides two functions: train_kin_tokenizer, which trains a tokenizer on your text, and create_sequences, which builds sequences from your text once it has been encoded.

from kin_tokenizer import KinTokenizer
from kin_tokenizer.utils import train_kin_tokenizer, create_sequences

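# Training the tokenizer
# training_text and SAVE_PATH_ROOT are placeholders you must define yourself:
# the text to train on and the location where the tokenizer is saved
tokenizer = train_kin_tokenizer(training_text, vocab_size=512, save=True, tokenizer_path=SAVE_PATH_ROOT)

# Encoding the text, then creating sequences from the tokens
tokens = tokenizer.encode(training_text)
x_seq, y_seq = create_sequences(tokens, seq_len=128)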
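The exact output format of create_sequences is not described here. As a rough illustration of what building (x, y) sequences from a token list usually means for next-token prediction, the sketch below slides a window of seq_len tokens over the list and pairs each window with the same window shifted one position to the right. The function name make_shifted_sequences and the shift-by-one scheme are assumptions made for this example, not the library's implementation.

```python
from typing import List, Sequence, Tuple


def make_shifted_sequences(
    tokens: Sequence[int], seq_len: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Hypothetical helper: sliding input windows and targets shifted by one."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    x_seq: List[List[int]] = []
    y_seq: List[List[int]] = []
    # Each window needs seq_len inputs plus one extra token for the shifted target.
    for start in range(len(tokens) - seq_len):
        x_seq.append(list(tokens[start:start + seq_len]))
        y_seq.append(list(tokens[start + 1:start + seq_len + 1]))
    return x_seq, y_seq


if __name__ == "__main__":
    x, y = make_shifted_sequences([10, 11, 12, 13, 14], seq_len=3)
    print(x)  # [[10, 11, 12], [11, 12, 13]]
    print(y)  # [[11, 12, 13], [12, 13, 14]]
```

Under this scheme a list of n tokens yields n - seq_len pairs, so a text shorter than seq_len + 1 tokens produces no sequences at all.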

Contributing

The project is still being updated and contributions are welcome. You can contribute by:

  • Reporting bugs
  • Suggesting features
  • Writing or improving documentation
  • Submitting pull requests


Download files

Source Distribution

kin_tokenizer-3.2.tar.gz (708.6 kB)

Uploaded Source

Built Distribution

kin_tokenizer-3.2-py3-none-any.whl (717.9 kB)

Uploaded Python 3

File details

Details for the file kin_tokenizer-3.2.tar.gz.

File metadata

  • Download URL: kin_tokenizer-3.2.tar.gz
  • Size: 708.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.2

File hashes

Hashes for kin_tokenizer-3.2.tar.gz

  • SHA256: a1cda1d7a14cc04116d8020c3180bac3479720cc70c26afa10dc63eb937b8c56
  • MD5: e43e1a09e99f59b574deadfcadde58cc
  • BLAKE2b-256: 139a7019a0b606ec15dce0c96075bc2da67cbba576558bad233a037603298ad9

File details

Details for the file kin_tokenizer-3.2-py3-none-any.whl.

File metadata

  • Download URL: kin_tokenizer-3.2-py3-none-any.whl
  • Size: 717.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.2

File hashes

Hashes for kin_tokenizer-3.2-py3-none-any.whl

  • SHA256: a3f89a568f772620181746dbc8d81e16eb3983c7f5a580c573b37fd27442f98c
  • MD5: 03061d986236865b06e1bf8124faead3
  • BLAKE2b-256: 332920815cadec6fb1b96cef24f089c3e64fd3943d6db58e34f9952b0d413fac
