A simple tokenizer for Mon text
Mon Tokenizer
Tokenize Mon text like a pro. No fancy stuff, just gets the job done.
Performance
Trained on 41.4M Mon-related characters (within a 92.8M total character / 176.7M byte raw corpus).
| Metric | Result |
|---|---|
| Vocabulary size | 32,000 |
| Avg compression | 5.22 chars/token |
| Round-trip accuracy | 100% |
| Byte-fallback rate | 0.00% |
| Model size | 977 KB |
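The README does not say how these figures were measured. As a rough sketch, assuming "chars/token" means total characters divided by total tokens and "round-trip accuracy" means the share of texts for which decode(encode(text)) returns the exact input, the two metrics could be computed like this. The helper names and the toy chunk tokenizer are hypothetical stand-ins, not part of mon-tokenizer.

```python
from typing import Callable, Iterable, Sequence


def chars_per_token(
    texts: Iterable[str],
    encode: Callable[[str], Sequence[str]],
) -> float:
    """Total characters divided by total tokens over all texts.

    Assumption: len(text) counts Unicode code points, which may differ
    from what a reader perceives as one character in Mon script.
    """
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("no tokens produced; cannot compute compression")
    return total_chars / total_tokens


def round_trip_accuracy(
    texts: Iterable[str],
    encode: Callable[[str], Sequence[str]],
    decode: Callable[[Sequence[str]], str],
) -> float:
    """Fraction of texts that come back unchanged from decode(encode(text))."""
    items = list(texts)
    if not items:
        raise ValueError("need at least one text")
    exact = sum(1 for text in items if decode(encode(text)) == text)
    return exact / len(items)


# Toy stand-in (NOT the Mon model): split text into 2-character chunks.
def toy_encode(text: str) -> list:
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def toy_decode(pieces: Sequence[str]) -> str:
    return "".join(pieces)


sample = ["abcd", "ef"]
print(chars_per_token(sample, toy_encode))                  # 2.0
print(round_trip_accuracy(sample, toy_encode, toy_decode))  # 1.0
```

To try it with the real tokenizer, something like `encode=lambda t: tokenizer.encode(t)["pieces"]` and `decode=tokenizer.decode` should fit, based on the quick start example.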
Quick Start
# using pip
pip install mon-tokenizer
# using uv
uv add mon-tokenizer
from mon_tokenizer import MonTokenizer
tokenizer = MonTokenizer()
text = "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"
# tokenize
result = tokenizer.encode(text)
print(result["pieces"]) # ['▁ဂွံ', 'အခေါင်', 'အရာ', 'မွဲ', 'သ္ဂောံ', 'ဒုင်စသိုင်', 'ကၠာ', 'ကၠာ', 'ရ', '။']
print(result["ids"]) # [1234, 5678, ...]
# decode
decoded = tokenizer.decode(result["pieces"])
print(decoded) # ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။
Tokenizer in Hugging Face Format
from transformers import AutoTokenizer
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")
# tokenize
text = "ပ္ဍဲအခိင်မာံနဲသဵု မဒှ်ဘဝကွးဘာတက္ကသိုလ်ဂှ် ပါလုပ်ချဳဓရာင်ကၠုင်"
tokens = tokenizer(text, return_tensors="pt")
input_ids = tokens["input_ids"][0]
print("token ids:", input_ids.tolist())
print("tokens:", tokenizer.convert_ids_to_tokens(input_ids))
print("decoded:", tokenizer.decode(input_ids, skip_special_tokens=True))
CLI
# tokenize
mon-tokenizer "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"
# verbose output
mon-tokenizer -v "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"
# decode tokens
mon-tokenizer -d -t "▁ဂွံ,အခေါင်,အရာ,မွဲ,သ္ဂောံ,ဒုင်စသိုင်,ကၠာ,ကၠာ,ရ,။"
# interactive mode
mon-tokenizer
API
- `encode(text: str)` → `{"pieces": list, "ids": list, "text": str}`
- `decode(pieces: list)` → `str`
- `decode_ids(ids: list)` → `str`
- `get_vocab_size()` → `int`
- `get_vocab()` → `dict`
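A quick way to sanity-check any object exposing the methods listed above is to confirm that both decode paths agree. This is only a sketch: `TokenizerLike`, `check_consistency`, and `CharTokenizer` are hypothetical names, and the checks assume that pieces and ids have equal length and that ids fall in the range 0 to `get_vocab_size() - 1`, which the README does not state.

```python
from typing import Protocol


class TokenizerLike(Protocol):
    """Structural type mirroring the method list above (hypothetical name)."""

    def encode(self, text: str) -> dict: ...
    def decode(self, pieces: list) -> str: ...
    def decode_ids(self, ids: list) -> str: ...
    def get_vocab_size(self) -> int: ...
    def get_vocab(self) -> dict: ...


def check_consistency(tokenizer: TokenizerLike, text: str) -> bool:
    """True if pieces and ids line up and both decode paths give the same text.

    Assumptions (not confirmed by the README): one id per piece, and every
    id lies in [0, get_vocab_size()).
    """
    result = tokenizer.encode(text)
    pieces, ids = result["pieces"], result["ids"]
    vocab_size = tokenizer.get_vocab_size()
    return (
        len(pieces) == len(ids)
        and all(0 <= i < vocab_size for i in ids)
        and tokenizer.decode(pieces) == tokenizer.decode_ids(ids)
    )


class CharTokenizer:
    """Toy stand-in, NOT MonTokenizer: one piece per character."""

    def __init__(self, alphabet: str) -> None:
        self._vocab = {ch: i for i, ch in enumerate(alphabet)}
        self._inverse = {i: ch for ch, i in self._vocab.items()}

    def encode(self, text: str) -> dict:
        pieces = list(text)
        ids = [self._vocab[p] for p in pieces]
        return {"pieces": pieces, "ids": ids, "text": text}

    def decode(self, pieces: list) -> str:
        return "".join(pieces)

    def decode_ids(self, ids: list) -> str:
        return "".join(self._inverse[i] for i in ids)

    def get_vocab_size(self) -> int:
        return len(self._vocab)

    def get_vocab(self) -> dict:
        # Assumed shape for the toy: piece -> id.
        return dict(self._vocab)


toy = CharTokenizer("abc")
print(check_consistency(toy, "cab"))  # True
```

Passing a `MonTokenizer()` instance in place of `toy` should work if the real methods behave as the list describes.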
Dev Setup
git clone git@github.com:Code-Yay-Mal/mon_tokenizer.git
cd mon_tokenizer
uv sync --dev
uv run pytest
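# Release workflow
# bump the patch version in pyproject.toml
uv version --bump patch
git add pyproject.toml
git commit -m "bump version"
# tag the release; v0.1.5 is an example and should match the bumped version
git tag v0.1.5
git push origin main --tags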
Resources
License
MIT - do whatever you want with it.