A simple tokenizer for Mon text
Mon Tokenizer
Tokenize Mon text like a pro. No fancy stuff, just gets the job done.
Quick Start
# Using pip
pip install mon-tokenizer
# Using uv (faster)
uv add mon-tokenizer
from mon_tokenizer import MonTokenizer
tokenizer = MonTokenizer()
text = "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"
# Tokenize
result = tokenizer.encode(text)
print(result["pieces"]) # ['▁ဂွံ', 'အခေါင်', 'အရာ', 'မွဲ', 'သ္ဂောံ', 'ဒုင်စသိုင်', 'ကၠာ', 'ကၠာ', 'ရ', '။']
# Decode
decoded = tokenizer.decode(result["pieces"])
print(decoded) # ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။
CLI
# Tokenize
mon-tokenizer "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"
# Verbose mode (shows all the details)
mon-tokenizer -v "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"
# Decode tokens back to text
mon-tokenizer -d -t "▁ဂွံ,အခေါင်,အရာ,မွဲ,သ္ဂောံ,ဒုင်စသိုင်,ကၠာ,ကၠာ,ရ,။"
API
- encode(text) - Chop text into tokens
- decode(pieces) - Glue tokens back together
- decode_ids(ids) - Convert IDs back to text
- get_vocab_size() - How many tokens we know
- get_vocab() - The whole vocabulary
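As a rough illustration of how `encode` and `decode` fit together, the sketch below wraps a round trip in a small check. It assumes only what Quick Start shows: `encode` returns a dict with a `"pieces"` list, and `decode` accepts that list. `roundtrip_ok`, `PieceTokenizer`, and `StubTokenizer` are hypothetical names for illustration; the stub stands in for `MonTokenizer` so the snippet runs without the package installed.

```python
from typing import Protocol


class PieceTokenizer(Protocol):
    """Minimal interface assumed from the Quick Start example."""

    def encode(self, text: str) -> dict: ...
    def decode(self, pieces: list[str]) -> str: ...


def roundtrip_ok(tokenizer: PieceTokenizer, text: str) -> bool:
    """Return True if decoding the encoded pieces reproduces the input."""
    pieces = tokenizer.encode(text)["pieces"]
    return tokenizer.decode(pieces) == text


class StubTokenizer:
    """Hypothetical stand-in: splits on spaces, marks word starts with '▁'."""

    def encode(self, text: str) -> dict:
        return {"pieces": ["▁" + word for word in text.split(" ")]}

    def decode(self, pieces: list[str]) -> str:
        # Turn each '▁' marker back into a space, then drop the leading one.
        return "".join(pieces).replace("▁", " ")[1:]


if __name__ == "__main__":
    print(roundtrip_ok(StubTokenizer(), "hello world"))
```

With the package installed, passing `MonTokenizer()` in place of the stub should work the same way, provided its interface matches the assumption above.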
Dev Setup
git clone git@github.com:janakhpon/mon_tokenizer.git
cd mon_tokenizer
uv sync --dev
uv run pytest
# Release workflow
uv version --bump patch
git add pyproject.toml
git commit -m "v0.1.1"
git tag v0.1.1
git push origin main --tags
License
MIT - do whatever you want with it.
Download files
Source Distribution
mon_tokenizer-0.1.0.tar.gz (194.6 kB)
Built Distribution
mon_tokenizer-0.1.0-py3-none-any.whl (193.1 kB)
File details
Details for the file mon_tokenizer-0.1.0.tar.gz.
File metadata
- Download URL: mon_tokenizer-0.1.0.tar.gz
- Upload date:
- Size: 194.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8668651d3021a48eaacaa1d5b8c9f5634f8d0477122c31098df549d3e47cb9b7 |
| MD5 | 1a2ee9a921655ae8a43ce385ecf7f11f |
| BLAKE2b-256 | 2a4f53b1658d43aa9d1452ade284682296872a5234023e78dad90eacd3b72a66 |
File details
Details for the file mon_tokenizer-0.1.0-py3-none-any.whl.
File metadata
- Download URL: mon_tokenizer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 193.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 05ff48dd0899b978f03c7542829a26cb1e75d1b58766e1b1124e268306473a63 |
| MD5 | 2c96acb630d102d68979a6786b9d10fb |
| BLAKE2b-256 | 22a175afa0565b638ac67dd36212cc09d0ca206448f16a66b50af574f10c4c6a |