A simple tokenizer for Mon text

Project description

Mon Tokenizer

Tokenize Mon text like a pro. No fancy stuff, just gets the job done.

quick start

# using pip
pip install mon-tokenizer

# using uv
uv add mon-tokenizer

from mon_tokenizer import MonTokenizer

tokenizer = MonTokenizer()
text = "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# tokenize
result = tokenizer.encode(text)
print(result["pieces"])  # ['▁ဂွံ', 'အခေါင်', 'အရာ', 'မွဲ', 'သ္ဂောံ', 'ဒုင်စသိုင်', 'ကၠာ', 'ကၠာ', 'ရ', '။']
print(result["ids"])     # a list of token ids, e.g. [1234, 5678, ...]

# decode
decoded = tokenizer.decode(result["pieces"])
print(decoded)  # ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။

Tokenizer in Hugging Face Format

from transformers import AutoTokenizer

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# tokenize
text = "ပ္ဍဲအခိင်မာံနဲသဵု မဒှ်ဘဝကွးဘာတက္ကသိုလ်ဂှ် ပါလုပ်ချဳဓရာင်ကၠုင်"
tokens = tokenizer(text, return_tensors="pt")
input_ids = tokens["input_ids"][0]

print("token ids:", input_ids.tolist())
print("tokens:", tokenizer.convert_ids_to_tokens(input_ids))
print("decoded:", tokenizer.decode(input_ids, skip_special_tokens=True))

cli

# tokenize
mon-tokenizer "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# verbose output
mon-tokenizer -v "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# decode tokens
mon-tokenizer -d -t "▁ဂွံ,အခေါင်,အရာ,မွဲ,သ္ဂောံ,ဒုင်စသိုင်,ကၠာ,ကၠာ,ရ,။"

# interactive mode
mon-tokenizer
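
When the decode command is driven from a script, the comma-separated `-t` argument has to be assembled from a list of pieces. The sketch below builds that argument list using only the flags shown above. The helper name `build_decode_command` is hypothetical, and the comma check rests on the assumption that the CLI splits the `-t` value on commas.

```python
import shlex


def build_decode_command(pieces):
    """Return the argv list for: mon-tokenizer -d -t "<piece>,<piece>,..."."""
    for piece in pieces:
        if "," in piece:
            # Assumption: the CLI splits the -t value on commas, so a piece
            # that itself contains a comma cannot be passed this way.
            raise ValueError(f"piece contains a comma: {piece!r}")
    return ["mon-tokenizer", "-d", "-t", ",".join(pieces)]


pieces = ["▁ဂွံ", "အခေါင်", "အရာ", "မွဲ"]
command = build_decode_command(pieces)

# Show the command as it could be typed in a shell.
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command)` once the package is installed. Passing a list rather than a single string avoids shell quoting problems with Mon text.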

API

  • encode(text: str) → {"pieces": list, "ids": list, "text": str}
  • decode(pieces: list) → str
  • decode_ids(ids: list) → str
  • get_vocab_size() → int
  • get_vocab() → dict
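
A quick way to see how these methods fit together is a round-trip check. The sketch below uses only the method names and return shapes listed above. The helper `roundtrip_report` is hypothetical. Two things are assumptions rather than documented behavior: that ids fall in the range `0 .. get_vocab_size() - 1`, and that decoding reproduces the input exactly for every text.

```python
def roundtrip_report(tokenizer, text):
    """Encode text, then decode it both from pieces and from ids."""
    result = tokenizer.encode(text)
    pieces, ids = result["pieces"], result["ids"]
    vocab_size = tokenizer.get_vocab_size()
    return {
        "num_pieces": len(pieces),
        # True when decode(pieces) gives back the original text.
        "pieces_roundtrip": tokenizer.decode(pieces) == text,
        # True when decode_ids(ids) gives back the original text.
        "ids_roundtrip": tokenizer.decode_ids(ids) == text,
        # Assumption: valid ids lie in [0, vocab_size).
        "ids_in_range": all(0 <= i < vocab_size for i in ids),
    }


if __name__ == "__main__":
    try:
        from mon_tokenizer import MonTokenizer
    except ImportError:
        print("mon-tokenizer is not installed; run: pip install mon-tokenizer")
    else:
        sample = "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"
        print(roundtrip_report(MonTokenizer(), sample))
```

A `False` in either round-trip field does not necessarily indicate a bug. Tokenizers often normalize whitespace, so the decoded text may differ from the input in harmless ways.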

Dev Setup

git clone git@github.com:Code-Yay-Mal/mon_tokenizer.git
cd mon_tokenizer
uv sync --dev
uv run pytest

# Release workflow
uv version --bump patch
git add pyproject.toml
git commit -m "bump version"
git tag v0.1.5
git push origin main --tags

License

MIT - do whatever you want with it.

Download files

Source Distribution

mon_tokenizer-0.2.1.tar.gz (331.2 kB)

Uploaded Source

Built Distribution

mon_tokenizer-0.2.1-py3-none-any.whl (332.1 kB)

Uploaded Python 3

File details

Details for the file mon_tokenizer-0.2.1.tar.gz.

File metadata

  • Download URL: mon_tokenizer-0.2.1.tar.gz
  • Upload date:
  • Size: 331.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for mon_tokenizer-0.2.1.tar.gz
Algorithm Hash digest
SHA256 15ce8d6e219bde3ea5cdc677f5d24e20c0a8c1564f5663c3340c928ecc04770f
MD5 0ff9863fb1c50fb56f8b0ab2a64980d3
BLAKE2b-256 4095dcea4d9834648d7520cc64394b62e40080512c411824845a064c0d921f44

File details

Details for the file mon_tokenizer-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: mon_tokenizer-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 332.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for mon_tokenizer-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 cf64a86fe147e9c9631b7960bbe527d1ec7790ec84555e1ac842f061aeeb5d47
MD5 03ccb717533f975415daaca04125c494
BLAKE2b-256 4a8c1a0e6c06a0866faae00ad73a4b4b6f6e19f14d4403b42bdf5e36dcdd78a2
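
The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib`. The helper names `sha256_of_file` and `matches_published_sha256` are hypothetical, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha256(path, expected_hex):
    """Compare a file's digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the wheel was downloaded to the current directory:
# matches_published_sha256(
#     "mon_tokenizer-0.2.1-py3-none-any.whl",
#     "cf64a86fe147e9c9631b7960bbe527d1ec7790ec84555e1ac842f061aeeb5d47",
# )
```

pip can perform the same check itself when a requirements file pins hashes with `--hash=sha256:...` and is installed with `--require-hashes`.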
