ALTA tokenizer for encoding and decoding Kinyarwanda language text

ALTA Tokenizer

alta-tokenizer is a Python library for tokenizing Kinyarwanda text, based on the Byte Pair Encoding (BPE) algorithm. It can also tokenize other languages such as English or French, but with a lower compression rate, since the tokenizer was trained on Kinyarwanda text only. The library also provides a function for training your own custom tokenizer, covered in the Training Your Own Tokenizer section below, so you can train a tokenizer on a dataset in a different language.

It can both encode and decode Kinyarwanda text. The metrics used to evaluate this tokenizer are the compression rate and the ability to encode text and decode it back.

Compression rate is the ratio of the number of characters in the original text to the number of tokens in the encoded text.

Example:

For the sentence: "Nagiye gusura abanyeshuri."

  • The sentence has 26 characters.
  • Suppose the sentence is tokenized into the following tokens: [23, 45, 67, 89, 23, 123, 44, 22, 55, 22, 45].
  • The total number of tokens is 11.

$$ \text{Compression Rate} = \frac{26}{11} $$

So, the compression rate is approximately 2.36X, meaning each token covers about 2.36 characters of the original text on average.
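
The same calculation in code, using the made-up token IDs from the example:

    text = "Nagiye gusura abanyeshuri."
    tokens = [23, 45, 67, 89, 23, 123, 44, 22, 55, 22, 45]  # illustrative token IDs
    compression_rate = len(text) / len(tokens)  # 26 characters / 11 tokens
    print(f"Compression rate: {compression_rate:.2f}X")  # Compression rate: 2.36X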

Special Tokens

  • <|PAD|> is a special token for padding; it is represented by 0 in the vocab
  • <|EOS|> is a special token indicating the end of a sequence
  • <|BOS|> is a special token indicating the beginning of a sequence
  • <|SEP|> is a special token for separating two sequences
  • <|MASK|> is a special token for masking
  • <|UNK|> is a special token for unknown tokens
  • <|CLS|> is a special token for classification
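
If you need the numeric IDs of the other special tokens, one way is to scan the loaded vocabulary. This is a sketch under the assumption that tokenizer.vocab maps token IDs to token strings, as the usage example below suggests:

    from kin_tokenizer import KinTokenizer

    tokenizer = KinTokenizer()

    # Only <|PAD|> = 0 is documented explicitly; the rest are looked up here,
    # assuming the vocab values are the token strings themselves
    SPECIAL_TOKENS = {"<|PAD|>", "<|EOS|>", "<|BOS|>", "<|SEP|>", "<|MASK|>", "<|UNK|>", "<|CLS|>"}
    for token_id, token in tokenizer.vocab.items():
        if token in SPECIAL_TOKENS:
            print(token_id, ":", token)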

ALTA-Tokenizer 1.0

Version 1.0 features a vocabulary size of 32,256 with improved word and subword entries. It was trained on over 19 million Kinyarwanda characters and shows a significantly improved compression rate compared to the previous version.

ALTA-Tokenizer 2.0

Version 2.0 features a vocabulary size of 50,257 with further improved word and subword entries. Tokenization was improved by training on a larger corpus.

Installation

You can install the package using pip:

    pip install alta-tokenizer
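
Note that the package installs the kin_tokenizer module rather than alta_tokenizer. A quick way to check that the install worked:

    python -c "from kin_tokenizer import KinTokenizer; print(KinTokenizer)"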

Basic Usage

    from kin_tokenizer import KinTokenizer  # Importing the Tokenizer class
    from kin_tokenizer.utils import create_sequences, create_dataset

    # Creating an instance of the tokenizer
    tokenizer = KinTokenizer()

    # Encoding
    text = """
    Nuko Semugeshi akunda uwo mutwe w'abatwa, bitwaga Ishabi, awugira intore ze. Bukeye
    bataha biyereka ingabo; dore ko hambere nta mihamirizo yindi yabaga mu Rwanda; guhamiriza
    byaje vuba biturutse i Burundi. Ubwo bataha Semugeshi n'abatware be barabitegereza basanga
    ari abahanga bose, ariko Waga akaba umuhanga w'imena muri bo; nyamara muri ubwo buhanga
    bwe akagiramo intege nke ku mpamvu yo kunanuka, yari afite uruti ruke."""

    tokens = tokenizer.encode(text)
    print(tokens)

    # Decoding
    decoded_text = tokenizer.decode(tokens)
    print(decoded_text)

    # Calculating the compression rate
    text_len = len(text)
    tokens_len = len(tokens)
    compression_rate = text_len / tokens_len
    print(f"Compression rate: {compression_rate:.2f}X")

    # Creating a dataset for training your LLM (all values below are
    # placeholders: point them at your own corpus and output directory)
    text_file_path = "corpus.txt"   # placeholder path
    nbr_processes = 4               # placeholder value
    sequence_length = 512           # placeholder value
    destination_dir = "./dataset"   # placeholder path
    step_size = 512                 # placeholder value
    create_dataset(
        text_file_path=text_file_path,
        nbr_processes=nbr_processes,
        sequence_length=sequence_length,
        destination_dir=destination_dir,
        step_size=step_size,
    )

    # Printing the vocab size
    print(tokenizer.vocab_size)

    # Printing the first 100 vocabulary items
    count = 0
    for k, v in tokenizer.vocab.items():
        print("{} : {}".format(k, v))
        count += 1
        if count >= 100:
            break
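
Continuing from the example above (reusing tokenizer): since <|PAD|> has ID 0, a batch of encoded sequences can be padded to a common length with plain Python. This is a minimal sketch, not a library API:

    # Pad every sequence in a batch to the length of the longest one,
    # using the documented <|PAD|> ID (0)
    PAD_ID = 0
    sentences = ["Muraho!", "Nagiye gusura abanyeshuri."]
    batch = [tokenizer.encode(s) for s in sentences]
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
    print(padded)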

Training Your Own Tokenizer

You can also train your own tokenizer using the utils module, which provides two functions: one for training and one for creating sequences from encoded text. N.B.: whether your chosen vocab_size is actually reached depends on how much training data you use. vocab_size is a hyperparameter to adjust for better vocabulary entries, and the size and diversity of your dataset matter as well. The vocabulary is initialized with 256 entries: ID 0 for <|PAD|> and IDs 1-255 for the remaining single-byte characters.
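
To make the vocab_size arithmetic concrete: BPE starts from the 256 base entries, and each merge learned during training adds one new entry, so a target of 512 allows at most 256 merges (and fewer if the corpus is too small or uniform to support them):

    BASE_VOCAB = 256           # ID 0 for <|PAD|> plus IDs 1-255
    target_vocab_size = 512    # the vocab_size hyperparameter used below
    max_merges = target_vocab_size - BASE_VOCAB
    print(max_merges)          # at most 256 merges will be learned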

    from kin_tokenizer import KinTokenizer
    from kin_tokenizer.utils import train_kin_tokenizer

    # Training the tokenizer
    # (text is your training corpus as a string; DATA_ROOT is a placeholder
    # for the directory where the trained tokenizer will be saved)
    tokenizer = train_kin_tokenizer(text, vocab_size=512, save=True, tokenizer_path=DATA_ROOT, retrain=False)

    # Encoding text with your custom trained tokenizer
    tokens = tokenizer.encode(text)
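
Putting it together, a minimal end-to-end sketch (the corpus file name and save directory are hypothetical; replace them with your own):

    from kin_tokenizer.utils import train_kin_tokenizer

    # Hypothetical corpus file
    with open("my_corpus.txt", "r", encoding="utf-8") as f:
        text = f.read()

    tokenizer = train_kin_tokenizer(text, vocab_size=512, save=True,
                                    tokenizer_path="./tokenizer", retrain=False)

    # Check the round trip (decoding ability is one of the stated metrics)
    tokens = tokenizer.encode(text)
    print(tokenizer.decode(tokens) == text)
    print(f"Compression rate: {len(text) / len(tokens):.2f}X")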

Contributing

The project is still under development, and contributions are welcome. You can contribute by:

  • Reporting bugs
  • Suggesting features
  • Writing or improving documentation
  • Submitting pull requests
