
A simple tokenizer for Mon text


Mon Tokenizer

Tokenize Mon text like a pro. No fancy stuff, just gets the job done.

Quick Start

# Using pip
pip install mon-tokenizer

# Using uv (faster)
uv add mon-tokenizer

from mon_tokenizer import MonTokenizer

tokenizer = MonTokenizer()
text = "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Tokenize
result = tokenizer.encode(text)
print(result["pieces"])  # ['▁ဂွံ', 'အခေါင်', 'အရာ', 'မွဲ', 'သ္ဂောံ', 'ဒုင်စသိုင်', 'ကၠာ', 'ကၠာ', 'ရ', '။']

# Decode
decoded = tokenizer.decode(result["pieces"])
print(decoded)  # ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။
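
The output above suggests that decoding the pieces from encode gives back the original string. Here is a minimal sketch of that round-trip check. The names round_trip_ok and StubTokenizer are hypothetical and not part of the package; the stub only mimics the encode/decode shape shown above so the snippet runs without mon-tokenizer installed.

```python
class StubTokenizer:
    """Hypothetical stand-in that mimics the encode/decode shape shown above."""

    def encode(self, text):
        # One piece per character, returned under the "pieces" key.
        return {"pieces": list(text)}

    def decode(self, pieces):
        # Concatenate the pieces back into a single string.
        return "".join(pieces)


def round_trip_ok(tokenizer, text):
    """Return True if decode(encode(text)["pieces"]) reproduces text."""
    result = tokenizer.encode(text)
    return tokenizer.decode(result["pieces"]) == text


# With the real package installed, pass MonTokenizer() instead of the stub.
print(round_trip_ok(StubTokenizer(), "ဂွံအခေါင်"))
```

A check like this is handy in a test suite: feed it a handful of real sentences and fail loudly if any of them does not survive the trip.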

CLI

# Tokenize
mon-tokenizer "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Verbose mode (shows all the details)
mon-tokenizer -v "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Decode tokens back to text
mon-tokenizer -d -t "▁ဂွံ,အခေါင်,အရာ,မွဲ,သ္ဂောံ,ဒုင်စသိုင်,ကၠာ,ကၠာ,ရ,။"
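
If you want to drive the decode command from Python, something like the sketch below might do. It only uses the -d and -t flags shown above and joins the pieces with commas, as in the example. The helper names build_decode_args and decode_via_cli are hypothetical, and it is an assumption that the CLI prints the decoded text to stdout.

```python
import shutil
import subprocess


def build_decode_args(pieces):
    """Build the argv for `mon-tokenizer -d -t`, joining pieces with commas."""
    return ["mon-tokenizer", "-d", "-t", ",".join(pieces)]


def decode_via_cli(pieces):
    """Run the CLI if it is on PATH; return its stdout, or None if it is missing."""
    if shutil.which("mon-tokenizer") is None:
        return None
    # Assumption: the decoded text is written to stdout.
    done = subprocess.run(
        build_decode_args(pieces), capture_output=True, text=True, check=True
    )
    return done.stdout.strip()


print(build_decode_args(["▁ဂွံ", "အခေါင်"]))
```

Passing the arguments as a list avoids shell quoting problems with Mon script. Since the pieces are comma-separated, a piece that itself contains a comma would probably be ambiguous.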

API

  • encode(text) - Chop text into tokens; returns a dict with the token strings under "pieces"
  • decode(pieces) - Glue a list of token strings back into text
  • decode_ids(ids) - Convert token IDs back to text
  • get_vocab_size() - How many tokens the tokenizer knows
  • get_vocab() - The whole vocabulary
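
The leading "▁" on the first piece in the Quick Start output looks like the SentencePiece word-boundary marker. Assuming that convention (the README does not confirm it), gluing pieces back together can be pictured roughly as below. glue_pieces is a hypothetical helper for illustration, not the package's decode.

```python
WORD_BOUNDARY = "\u2581"  # "▁", assumed to be a SentencePiece-style marker


def glue_pieces(pieces):
    """Concatenate pieces, turn boundary markers into spaces, drop the leading one."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").lstrip(" ")


print(glue_pieces(["▁ab", "cd", "▁ef"]))  # abcd ef
```

For real work, stick with decode(pieces): it knows the actual model and any special tokens.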

Dev Setup

git clone git@github.com:janakhpon/mon_tokenizer.git
cd mon_tokenizer
uv sync --dev
uv run pytest

# Release workflow
uv version --bump patch
git add pyproject.toml
git commit -m "v0.1.1"
git tag v0.1.1
git push origin main --tags

License

MIT - do whatever you want with it.


