Kinyarwanda tokenizer for encoding and decoding Kinyarwanda language text

Kin-Tokenizer

kin-tokenizer is a Python library for tokenizing Kinyarwanda text. It can both encode and decode Kinyarwanda text, and its pretrained tokenizer has a vocabulary size of 20,257.

Installation

You can install the package using pip:

pip install kin-tokenizer

Basic Usage

from kin_tokenizer import KinTokenizer  # Importing Tokenizer class

# Creating an instance of tokenizer
tokenizer = KinTokenizer()

# Loading the state of the tokenizer (pretrained tokenizer)
tokenizer.load()

# Encoding
text = "Nagiye gusura inshuti zanjye dusoma ibitabo"
tokens = tokenizer.encode(text)
print(tokens)

# Decoding
decoded_text = tokenizer.decode(tokens)
print(decoded_text)

# Printing the vocab size
print(tokenizer.vocab_size)

# Print vocabulary (first 1000 items)
for count, (k, v) in enumerate(tokenizer.vocab.items()):
    if count >= 1000:  # stop once exactly 1000 items have been printed
        break
    print("{} : {}".format(k, v))
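Whether decoding always reproduces the original text exactly is not stated here, so a quick round-trip check can be useful before relying on the tokenizer. The sketch below assumes only the encode and decode methods shown above. The round_trip_ok helper and the ByteTokenizer class are hypothetical names introduced for illustration; ByteTokenizer is a stand-in so the example runs without the package installed, and it is not part of kin-tokenizer.

```python
def round_trip_ok(tokenizer, text):
    """Return True if decoding the encoded text gives back the input exactly."""
    return tokenizer.decode(tokenizer.encode(text)) == text


class ByteTokenizer:
    """Hypothetical stand-in with the same encode/decode methods (not part of kin-tokenizer)."""

    def encode(self, text):
        # One token id per UTF-8 byte
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        # Reassemble the bytes and decode them back to a string
        return bytes(tokens).decode("utf-8")


sample = "Nagiye gusura inshuti zanjye dusoma ibitabo"
print(round_trip_ok(ByteTokenizer(), sample))
```

To run the same check against the pretrained tokenizer, pass the loaded KinTokenizer instance in place of ByteTokenizer().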

Training Your Own Tokenizer

You can also train your own tokenizer using the utils module, which provides two functions: a training function and a function for creating sequences after encoding your text.

from kin_tokenizer import KinTokenizer
from kin_tokenizer.utils import train_kin_tokenizer, create_sequences

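# Training the tokenizer
# training_text (the Kinyarwanda text to train on) and SAVE_PATH_ROOT (the path
# under which the tokenizer is saved) are placeholders that you must define first.
tokenizer = train_kin_tokenizer(training_text, vocab_size=512, save=True, tokenizer_path=SAVE_PATH_ROOT)

# Creating sequences
# tokens is the list of token ids obtained by encoding your text
x_seq, y_seq = create_sequences(tokens, seq_len=128)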
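The exact output format of create_sequences is not documented here. As a rough illustration of what building sequences from token ids commonly means for next-token prediction, the sketch below uses a sliding window in which each target sequence is the input sequence shifted by one token. The make_sequences function is a hypothetical name, and this is an assumption about the general technique, not the library's implementation.

```python
def make_sequences(tokens, seq_len):
    """Build (input, target) windows; each target is its input shifted by one token.

    Hypothetical sketch only; kin_tokenizer.utils.create_sequences may differ.
    """
    x_seq, y_seq = [], []
    for start in range(len(tokens) - seq_len):
        x_seq.append(tokens[start:start + seq_len])
        y_seq.append(tokens[start + 1:start + seq_len + 1])
    return x_seq, y_seq


tokens = list(range(6))  # stand-in token ids: [0, 1, 2, 3, 4, 5]
x_seq, y_seq = make_sequences(tokens, seq_len=3)
print(x_seq)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(y_seq)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

With this windowing, a list of n token ids yields n - seq_len pairs, so the text must encode to more than seq_len tokens to produce any training examples.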

Contributing

The project is still being updated and contributions are welcome. You can contribute by:

  • Reporting bugs
  • Suggesting features
  • Writing or improving documentation
  • Submitting pull requests
