Kinyarwanda tokenizer for encoding and decoding Kinyarwanda language text
Kin-Tokenizer
kin-tokenizer is a Python library for tokenizing Kinyarwanda text. It can both encode and decode Kinyarwanda text, and it has a vocabulary size of 20,257.
Installation
You can install the package using pip:
pip install kin-tokenizer
Basic Usage
from kin_tokenizer import KinTokenizer # Importing Tokenizer class
# Creating an instance of tokenizer
tokenizer = KinTokenizer()
# Loading the state of the tokenizer (pretrained tokenizer)
tokenizer.load()
# Encoding
text = "Nagiye gusura inshuti zanjye dusoma ibitabo"
tokens = tokenizer.encode(text)
print(tokens)
# Decoding
decoded_text = tokenizer.decode(tokens)
print(decoded_text)
# Printing the vocab size
print(tokenizer.vocab_size)
# Print vocabulary (first 1000 items)
count = 0
for k, v in tokenizer.vocab.items():
    print("{} : {}".format(k, v))
    count += 1
    if count >= 1000:  # stop after exactly 1000 items
        break
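A quick sanity check for any tokenizer is to confirm that decoding the encoded tokens returns the original text. The sketch below is hypothetical: `roundtrip_ok` is not part of kin-tokenizer, and `ByteTokenizer` is only a stand-in so the example runs without the package installed. It assumes nothing beyond the `encode` and `decode` methods shown above. Whether KinTokenizer reproduces every input exactly is not stated here, so treat the result as a check rather than a guarantee.

```python
class ByteTokenizer:
    """Stand-in tokenizer (NOT KinTokenizer): one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


def roundtrip_ok(tokenizer, text):
    """Hypothetical helper: True if decode(encode(text)) reproduces text."""
    return tokenizer.decode(tokenizer.encode(text)) == text


text = "Nagiye gusura inshuti zanjye dusoma ibitabo"
print(roundtrip_ok(ByteTokenizer(), text))
```

With the package installed, the same helper could be called on the loaded tokenizer instead, for example roundtrip_ok(tokenizer, text) after tokenizer.load().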
Training Your Own Tokenizer
You can also train your own tokenizer using the utils module, which provides two functions: a training function and a function for creating sequences after encoding your text.
from kin_tokenizer import KinTokenizer
from kin_tokenizer.utils import train_kin_tokenizer, create_sequences
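# Training the tokenizer
# training_text (your training corpus) and SAVE_PATH_ROOT (where the tokenizer is saved) must be defined before this call
tokenizer = train_kin_tokenizer(training_text, vocab_size=512, save=True, tokenizer_path=SAVE_PATH_ROOT)
# Creating sequences
# tokens is your encoded text, e.g. the output of tokenizer.encode(...)
x_seq, y_seq = create_sequences(tokens, seq_len=128)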
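This page does not describe what create_sequences returns beyond the x_seq, y_seq pair. A common convention when preparing data for language-model training is next-token prediction, where each target sequence is the input sequence shifted by one position. The sketch below illustrates that convention with a hypothetical make_sequences function. It is an assumption for illustration, not the library's implementation, and the real function may window or pad differently.

```python
def make_sequences(tokens, seq_len):
    """Hypothetical sketch: (input, target) windows for next-token prediction.

    Each target window is the input window shifted one token to the right.
    """
    x_seq, y_seq = [], []
    # Stop early enough that every target window has the full length.
    for start in range(len(tokens) - seq_len):
        x_seq.append(tokens[start:start + seq_len])
        y_seq.append(tokens[start + 1:start + seq_len + 1])
    return x_seq, y_seq


x_seq, y_seq = make_sequences([10, 11, 12, 13, 14], seq_len=3)
print(x_seq)  # [[10, 11, 12], [11, 12, 13]]
print(y_seq)  # [[11, 12, 13], [12, 13, 14]]
```

In this sketch, an input shorter than seq_len + 1 tokens yields no sequences at all, so very short texts would need to be concatenated or padded first.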
Contributing
The project is still being updated and contributions are welcome. You can contribute by:
- Reporting bugs
- Suggesting features
- Writing or improving documentation
- Submitting pull requests
File details
Details for the file kin_tokenizer-3.2.tar.gz.
File metadata
- Download URL: kin_tokenizer-3.2.tar.gz
- Upload date:
- Size: 708.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | a1cda1d7a14cc04116d8020c3180bac3479720cc70c26afa10dc63eb937b8c56
MD5 | e43e1a09e99f59b574deadfcadde58cc
BLAKE2b-256 | 139a7019a0b606ec15dce0c96075bc2da67cbba576558bad233a037603298ad9
File details
Details for the file kin_tokenizer-3.2-py3-none-any.whl.
File metadata
- Download URL: kin_tokenizer-3.2-py3-none-any.whl
- Upload date:
- Size: 717.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | a3f89a568f772620181746dbc8d81e16eb3983c7f5a580c573b37fd27442f98c
MD5 | 03061d986236865b06e1bf8124faead3
BLAKE2b-256 | 332920815cadec6fb1b96cef24f089c3e64fd3943d6db58e34f9952b0d413fac