
Amharic Language Tokenizers

This package contains a set of classes that encode Amharic-language sentences into tokens for use by language models. The tokenizers are trained on the Contemporary Amharic Corpus (CACO) dataset.

Installing

Pip installation

pip install -i https://test.pypi.org/simple/ amtokenizers==0.0.5
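After installing, a quick import check confirms the package is available. This one-liner only relies on the AmTokenizer class used in the samples below:

python -c "from amtokenizers import AmTokenizer; print(AmTokenizer)"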

Sample Code

Variable length

from amtokenizers import AmTokenizer

a = AmTokenizer(10000, 5, "byte_bpe")
encoded = a.encode("አበበ በሶ በላ።", return_tokens=False)
print("encoded", encoded.tokens)
# encoded ['<s>', 'áĬł', 'áīłáīł', 'ĠáīłáĪ¶', 'ĠáīłáĪĭ', 'áį', '¢', '</s>']
print("decoded:", a.decode(encoded.ids))
# decoded: <s>አበበ በሶ በላ።</s>
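To tokenize several sentences, the same encoder can simply be applied in a loop. The snippet below is a minimal sketch that reuses only the encode/decode calls and the ids attribute shown above; the second sentence is just an extra illustrative example.

sentences = ["አበበ በሶ በላ።", "ቡና ጠጣ።"]  # second sentence is a made-up example
for sentence in sentences:
    encoded = a.encode(sentence, return_tokens=False)
    print(sentence, "->", encoded.ids)
    print("decoded:", a.decode(encoded.ids))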

Fixed length

a = AmTokenizer(10000, 5, "byte_bpe", max_length=16)
encoded = a.encode("አበበ በሶ በላ።")
print("encoded", encoded.tokens())
# encoded ['<s>', 'áĬł', 'áīłáīł', 'ĠáīłáĪ¶', 'ĠáīłáĪĭ', 'áį', '¢', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
print(encoded.input_ids)
# [0, 337, 3251, 3598, 3486, 270, 100, 2, 1, 1, 1, 1, 1, 1, 1, 1]
print("decoded:", a.decode(encoded.input_ids))
# decoded: <s>አበበ በሶ በላ።</s><pad><pad><pad><pad><pad><pad><pad><pad>
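Because fixed-length output is padded with <pad> tokens, downstream models usually also need an attention mask that zeroes out the padding positions. The snippet below is a minimal sketch that builds one by hand, assuming the pad token id is 1 as the input_ids output above suggests; if the returned encoding already carries an attention mask, prefer that instead.

PAD_ID = 1  # assumed pad token id, based on the input_ids printed above
attention_mask = [0 if token_id == PAD_ID else 1 for token_id in encoded.input_ids]
print(attention_mask)
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]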

Disclaimer

This package is heavily inspired by Hugging Face's "How to train a new language model from scratch using Transformers and Tokenizers" tutorial.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

amtokenizers-0.0.10.tar.gz (2.8 kB)

Uploaded Source

Built Distribution

amtokenizers-0.0.10-py3-none-any.whl (8.4 MB)

Uploaded Python 3
