A simple tokenizer for Burmese text

Burmese Tokenizer

Tokenize Burmese text like a pro. No fancy stuff, just gets the job done.

Quick Start

# Using pip
pip install burmese-tokenizer

# Using uv (faster)
uv add burmese-tokenizer

from burmese_tokenizer import BurmeseTokenizer

tokenizer = BurmeseTokenizer()
text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"

# Tokenize
result = tokenizer.encode(text)
print(result["pieces"])  # ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']

# Decode
decoded = tokenizer.decode(result["pieces"])
print(decoded)  # မင်္ဂလာပါ။ နေကောင်းပါသလား။
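
A round-trip check is a quick way to confirm that encode and decode agree on your own text. The sketch below uses only the two calls shown above and assumes that encode returns a dict with a "pieces" list, as in the example output. round_trips is a hypothetical helper name, not part of the package.

```python
def round_trips(tokenizer, text):
    """Return True if decoding the encoded pieces reproduces `text` exactly.

    `tokenizer` is any object exposing the encode/decode methods shown in
    the Quick Start, for example a BurmeseTokenizer instance.
    """
    # Assumption: encode() returns a dict with a "pieces" list, as in the README.
    pieces = tokenizer.encode(text)["pieces"]
    return tokenizer.decode(pieces) == text


# Usage with the real package (requires `pip install burmese-tokenizer`):
#   from burmese_tokenizer import BurmeseTokenizer
#   print(round_trips(BurmeseTokenizer(), "မင်္ဂလာပါ။"))
```

Because the helper only relies on encode and decode, it also works with a stand-in object in tests.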

CLI

# Tokenize
burmese-tokenizer "မင်္ဂလာပါ။"

# Verbose mode (shows all the details)
burmese-tokenizer -v "မင်္ဂလာပါ။"

# Decode tokens back to text
burmese-tokenizer -d -t "▁မင်္ဂလာ,▁ပါ,။"
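
To call the CLI from a script, the decode command above can be assembled as an argument list for subprocess. This is only a sketch: decode_command and run_decode are hypothetical helper names, and because the CLI's output format is not documented here, the output is returned as raw text.

```python
import shutil
import subprocess


def decode_command(pieces):
    """Build the argv list for: burmese-tokenizer -d -t "<comma-joined pieces>"."""
    # The CLI example separates tokens with commas, so a piece that itself
    # contains a comma cannot be passed this way.
    return ["burmese-tokenizer", "-d", "-t", ",".join(pieces)]


def run_decode(pieces):
    """Run the CLI and return its stdout, or None if the CLI is not on PATH."""
    if shutil.which("burmese-tokenizer") is None:
        return None
    result = subprocess.run(
        decode_command(pieces),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems with Burmese text and the ▁ marker.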

API

  • encode(text) - Chop text into tokens; returns a dict whose "pieces" entry is the token list
  • decode(pieces) - Glue tokens back together
  • decode_ids(ids) - Convert IDs back to text
  • get_vocab_size() - How many tokens we know
  • get_vocab() - The whole vocabulary
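
The vocabulary methods listed above can be combined into a small sanity check. The sketch below is an illustration under assumptions: vocab_summary is a hypothetical helper, and it assumes get_vocab() returns a sized collection (such as a list or dict), which this README does not specify.

```python
def vocab_summary(tokenizer):
    """Collect basic vocabulary facts from a tokenizer-like object."""
    vocab = tokenizer.get_vocab()
    return {
        "reported_size": tokenizer.get_vocab_size(),
        # Assumption: get_vocab() returns something len() works on.
        "listed_entries": len(vocab),
    }


# Usage with the real package (requires `pip install burmese-tokenizer`):
#   from burmese_tokenizer import BurmeseTokenizer
#   print(vocab_summary(BurmeseTokenizer()))
```

If the two numbers differ, check how get_vocab() represents its entries before relying on either value.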

Dev Setup

git clone git@github.com:Code-Yay-Mal/burmese_tokenizer.git
cd burmese_tokenizer
uv sync --dev
uv run pytest

License

MIT - do whatever you want with it.
