A simple tokenizer for Burmese text

Burmese Tokenizer

Simple, fast Burmese text tokenization. No fancy stuff, just gets the job done.

Install

pip install burmese-tokenizer

Quick Start

from burmese_tokenizer import BurmeseTokenizer

tokenizer = BurmeseTokenizer()
text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"

# tokenize
tokens = tokenizer.encode(text)
print(tokens["pieces"])
# ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']

# decode back
text = tokenizer.decode(tokens["pieces"])
print(text)
# မင်္ဂလာပါ။ နေကောင်းပါသလား။
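
If you want to check that encoding and decoding round-trip on your own text, a small helper like the one below is enough. This is a sketch, not part of the package: `roundtrips` is a hypothetical name, and it relies only on the `encode(text)["pieces"]` and `decode(pieces)` calls shown above.

```python
def roundtrips(tokenizer, text):
    """Return True if decoding the encoded pieces gives back the input text."""
    # Uses only the calls shown in Quick Start: encode(...)["pieces"] and decode(...).
    pieces = tokenizer.encode(text)["pieces"]
    return tokenizer.decode(pieces) == text


# Usage (needs burmese-tokenizer installed):
# from burmese_tokenizer import BurmeseTokenizer
# print(roundtrips(BurmeseTokenizer(), "မင်္ဂလာပါ။ နေကောင်းပါသလား။"))
```

Passing the tokenizer in as an argument keeps the helper independent of how the tokenizer was created.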

CLI

# tokenize
burmese-tokenizer "မင်္ဂလာပါ။"

# show details
burmese-tokenizer -v "မင်္ဂလာပါ။"

# decode tokens
burmese-tokenizer -d -t "▁မင်္ဂလာ,▁ပါ,။"
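
To call the decode command from a Python script, you can build the argument list from a list of pieces. This is a sketch that assumes only the `-d -t` form shown above, with pieces joined by commas; `decode_command` is a hypothetical helper, and how the CLI treats a piece that itself contains a comma is not covered here.

```python
import shlex


def decode_command(pieces):
    """Build the argv list for: burmese-tokenizer -d -t <comma-separated pieces>."""
    return ["burmese-tokenizer", "-d", "-t", ",".join(pieces)]


argv = decode_command(["▁မင်္ဂလာ", "▁ပါ", "။"])
print(shlex.join(argv))  # shell-quoted form of the command, for copy and paste

# To actually run it (needs the burmese-tokenizer CLI on PATH):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with Burmese text.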

API

  • encode(text) - tokenize text; the result's "pieces" entry holds the token strings
  • decode(pieces) - convert tokens back to text
  • decode_ids(ids) - convert ids to text
  • get_vocab_size() - vocabulary size
  • get_vocab() - full vocabulary
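
As one example of using encode on more than a single string, here is a sketch that counts how often each piece appears across several texts. `piece_counts` is a hypothetical helper, not part of the package, and it assumes only that `encode(text)["pieces"]` is a list of strings, as in Quick Start.

```python
from collections import Counter


def piece_counts(tokenizer, texts):
    """Count how often each piece appears across an iterable of texts."""
    counts = Counter()
    for text in texts:
        # encode(...)["pieces"] is the list of token strings shown in Quick Start.
        counts.update(tokenizer.encode(text)["pieces"])
    return counts


# Usage (needs burmese-tokenizer installed):
# from burmese_tokenizer import BurmeseTokenizer
# counts = piece_counts(BurmeseTokenizer(), ["မင်္ဂလာပါ။", "နေကောင်းပါသလား။"])
# print(counts.most_common(5))
```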

License

MIT - Do whatever you want with it.

Download files

Source Distribution

burmese_tokenizer-0.1.3.tar.gz (3.5 kB)

Uploaded Source

Built Distribution

burmese_tokenizer-0.1.3-py3-none-any.whl (5.0 kB)

Uploaded Python 3

File details

Details for the file burmese_tokenizer-0.1.3.tar.gz.

File metadata

  • Download URL: burmese_tokenizer-0.1.3.tar.gz
  • Size: 3.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.8.13

File hashes

Hashes for burmese_tokenizer-0.1.3.tar.gz
Algorithm Hash digest
SHA256 5d7096e0f67635c9d50e6d2d5851d196c981d49beca87e68472e3a89b68a0100
MD5 1c34ffeb7f84f7ab314831b97f6f6712
BLAKE2b-256 81833d329f08f486d6a8f332d99de232cca62ed25aab4e37bad1886096e7dd14

File details

Details for the file burmese_tokenizer-0.1.3-py3-none-any.whl.

File hashes

Hashes for burmese_tokenizer-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 a4d65ea467d57e7ea1d1a55ccd12e6de6c1c7d787f1a081fd5b336509bc04c68
MD5 f0306f83d6d6ef0f4714c97f04ae487b
BLAKE2b-256 0c49b48277f7b90e58198a82b01642fae29ec1ba7e8e33a2672a823d1a56667a
