Kinyarwanda tokenizer for encoding and decoding Kinyarwanda language text

Kin-Tokenizer

kin-tokenizer is a Python library for tokenizing Kinyarwanda text. It can both encode and decode Kinyarwanda text, and it has a vocabulary size of 20,257.

Installation

You can install the package using pip:

pip install kin-tokenizer

Basic Usage

from kin_tokenizer import KinTokenizer  # Importing Tokenizer class

# Creating an instance of tokenizer
tokenizer = KinTokenizer()

# Loading the state of the tokenizer (pretrained tokenizer)
tokenizer.load()

# Encoding
text = "Nagiye gusura inshuti zanjye dusoma ibitabo"
tokens = tokenizer.encode(text)
print(tokens)

# Decoding
decoded_text = tokenizer.decode(tokens)
print(decoded_text)

# Printing the vocab size
print(tokenizer.vocab_size)

# Print the first 1000 vocabulary items
for count, (k, v) in enumerate(tokenizer.vocab.items()):
    if count >= 1000:
        break
    print("{} : {}".format(k, v))
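
A round-trip check is a quick way to confirm that encoding and decoding agree on your own text. The sketch below relies only on the encode and decode methods shown above. The names round_trip_ok and ByteTokenizer are hypothetical and not part of kin-tokenizer; the stand-in class exists only so the block runs without the package installed.

```python
class ByteTokenizer:
    """Hypothetical stand-in exposing encode/decode like the usage above."""

    def encode(self, text):
        # One token per UTF-8 byte of the input text
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        # Rebuild the byte string and decode it back to text
        return bytes(tokens).decode("utf-8")


def round_trip_ok(tokenizer, text):
    """Return True if decoding the encoded text gives back the original."""
    return tokenizer.decode(tokenizer.encode(text)) == text


if __name__ == "__main__":
    sample = "Nagiye gusura inshuti zanjye dusoma ibitabo"
    print(round_trip_ok(ByteTokenizer(), sample))
```

With the package installed, a loaded KinTokenizer instance could be passed in place of the stand-in. Whether the round trip is exact for the pretrained tokenizer depends on any normalization it applies, which is not described here.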

Training Your Own Tokenizer

You can also train your own tokenizer using the utils module. It provides two functions: train_kin_tokenizer, which trains a tokenizer on your text, and create_sequences, which creates sequences from your encoded text.

from kin_tokenizer import KinTokenizer
from kin_tokenizer.utils import train_kin_tokenizer, create_sequences

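# Training the tokenizer
# training_text and SAVE_PATH_ROOT are placeholders: set them to your
# training text and to the path where the tokenizer should be saved
tokenizer = train_kin_tokenizer(training_text, vocab_size=512, save=True, tokenizer_path=SAVE_PATH_ROOT)

# Encoding the text, then creating sequences from the tokens
tokens = tokenizer.encode(training_text)
x_seq, y_seq = create_sequences(tokens, seq_len=128)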
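
What create_sequences returns is not documented here beyond the x_seq, y_seq pair. A common convention in language-model training is to pair each window of seq_len tokens with the same window shifted one position to the right. The sketch below illustrates that convention only. The helper make_shifted_sequences is hypothetical and is not the library's implementation; the stride and edge handling of create_sequences may differ.

```python
def make_shifted_sequences(tokens, seq_len):
    """Build (input, target) windows where each target is the input shifted by one token."""
    x_seq, y_seq = [], []
    # Non-overlapping windows; stop while a full shifted target still fits
    for start in range(0, len(tokens) - seq_len, seq_len):
        x_seq.append(tokens[start:start + seq_len])
        y_seq.append(tokens[start + 1:start + seq_len + 1])
    return x_seq, y_seq


if __name__ == "__main__":
    x, y = make_shifted_sequences(list(range(10)), seq_len=4)
    print(x)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(y)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Under this assumption, each position in a target window holds the token that follows the corresponding input position, which is the usual setup for next-token prediction.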

Contributing

The project is still being updated and contributions are welcome. You can contribute by:

  • Reporting bugs
  • Suggesting features
  • Writing or improving documentation
  • Submitting pull requests
