
Amharic language tokenizers

Project description


This package contains a set of classes that encode Amharic-language sentences into tokens usable by language models. The tokenizers are trained on the Contemporary Amharic Corpus (CACO) dataset.

Installing

Pip installation

pip install -i https://test.pypi.org/simple/ amtokenizers==0.0.10

Sample Code

Variable length

from amtokenizers import AmTokenizer

a = AmTokenizer(10000, 5, "byte_bpe")
encoded = a.encode("አበበ በሶ በላ።", return_tokens=False)
print("encoded", encoded.tokens)
# encoded ['<s>', 'áĬł', 'áīłáīł', 'ĠáīłáĪ¶', 'ĠáīłáĪĭ', 'áį', '¢', '</s>']
print("decoded:", a.decode(encoded.ids))
# decoded: <s>አበበ በሶ በላ።</s>

Fixed length

a = AmTokenizer(10000, 5, "byte_bpe", max_length=16)
encoded = a.encode("አበበ በሶ በላ።")
print("encoded", encoded.tokens())
# encoded ['<s>', 'áĬł', 'áīłáīł', 'ĠáīłáĪ¶', 'ĠáīłáĪĭ', 'áį', '¢', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
print(encoded.input_ids)
# [0, 337, 3251, 3598, 3486, 270, 100, 2, 1, 1, 1, 1, 1, 1, 1, 1]
print("decoded:", a.decode(encoded.input_ids))
# decoded: <s>አበበ በሶ በላ።</s><pad><pad><pad><pad><pad><pad><pad><pad>
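If you want the decoded text without trailing padding, one option (a plain-Python sketch, not part of the package API) is to drop the pad id, which is 1 in the output above, before calling decode:

```python
# input_ids as printed in the fixed-length example above
input_ids = [0, 337, 3251, 3598, 3486, 270, 100, 2, 1, 1, 1, 1, 1, 1, 1, 1]

PAD_ID = 1  # id of the '<pad>' token, as seen in the output above

# keep everything except padding; the result can be passed to a.decode(...)
trimmed = [tok_id for tok_id in input_ids if tok_id != PAD_ID]
print(trimmed)  # [0, 337, 3251, 3598, 3486, 270, 100, 2]
```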

Disclaimer

This package is heavily inspired by Hugging Face's tutorial, "How to train a new language model from scratch using Transformers and Tokenizers".
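For reference, training a byte-level BPE tokenizer like the one this package wraps follows the same pattern as that tutorial. Below is a minimal sketch using the Hugging Face `tokenizers` library; the tiny in-memory corpus, the `min_frequency=1` setting, and the special-token list are illustrative assumptions, not the package's actual training configuration (which trains on CACO):

```python
from tokenizers import ByteLevelBPETokenizer

# Tiny in-memory corpus for illustration only; the real package trains on CACO.
corpus = ["አበበ በሶ በላ።", "አበበ በሶ በላ።"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=10000,   # mirrors the first AmTokenizer argument above
    min_frequency=1,    # the samples above pass 5; 1 suits this tiny corpus
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

encoded = tokenizer.encode("አበበ በሶ በላ።")
print(encoded.tokens)  # byte-level surface forms, like the examples above
```

Note that the odd-looking tokens in the sample output above (e.g. 'áĬł') are expected: byte-level BPE represents raw bytes with printable stand-in characters, and decoding maps them back to the original Amharic text.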

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

amtokenizers-0.0.10.tar.gz (2.8 kB)

Uploaded Source

Built Distribution

amtokenizers-0.0.10-py3-none-any.whl (8.4 MB)

Uploaded Python 3

File details

Details for the file amtokenizers-0.0.10.tar.gz.

File metadata

  • Download URL: amtokenizers-0.0.10.tar.gz
  • Upload date:
  • Size: 2.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.5.0.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for amtokenizers-0.0.10.tar.gz
Algorithm Hash digest
SHA256 b2f4becb4e1ce9671787bbf8e898be3c987532c78ae531bbd80f5a4bd9c1bf9e
MD5 13c5349417c9559ae087aea2bbb5c1c5
BLAKE2b-256 5c1b7577a0dd7d7f87405ffec0dc75db6f0b64e9c86d192a60226527ab1cc19a

See more details on using hashes here.
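The SHA256 digest listed above can be checked locally after downloading. A minimal sketch using Python's standard `hashlib` (the filename is the one from this page; adjust the path to wherever you saved the file):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "b2f4becb4e1ce9671787bbf8e898be3c987532c78ae531bbd80f5a4bd9c1bf9e"
# print(sha256_of("amtokenizers-0.0.10.tar.gz") == expected)
```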

File details

Details for the file amtokenizers-0.0.10-py3-none-any.whl.

File metadata

  • Download URL: amtokenizers-0.0.10-py3-none-any.whl
  • Upload date:
  • Size: 8.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.5.0.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for amtokenizers-0.0.10-py3-none-any.whl
Algorithm Hash digest
SHA256 00a2595bbb16a0e00e9e1e7d7f22a0ebdb29d8088023b10d2bc82434f9e6787a
MD5 34000c96db6776d74d53bc4b58f9dc36
BLAKE2b-256 e69dd623e8f6f3de4374e86b2b76193c50380ca49fa39fea6874319c014a6f14

See more details on using hashes here.
