A simple tokenizer for Mon text

Mon Tokenizer

Tokenize Mon text like a pro. No fancy stuff, just gets the job done.

Quick Start

# Using pip
pip install mon-tokenizer

# Using uv (faster)
uv add mon-tokenizer

from mon_tokenizer import MonTokenizer

tokenizer = MonTokenizer()
text = "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Tokenize
result = tokenizer.encode(text)
print(result["pieces"])  # ['▁ဂွံ', 'အခေါင်', 'အရာ', 'မွဲ', 'သ္ဂောံ', 'ဒုင်စသိုင်', 'ကၠာ', 'ကၠာ', 'ရ', '။']

# Decode
decoded = tokenizer.decode(result["pieces"])
print(decoded)  # ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။

CLI

# Tokenize
mon-tokenizer "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Verbose mode (shows all the details)
mon-tokenizer -v "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Decode tokens back to text
mon-tokenizer -d -t "▁ဂွံ,အခေါင်,အရာ,မွဲ,သ္ဂောံ,ဒုင်စသိုင်,ကၠာ,ကၠာ,ရ,။"

API

  • encode(text) - Chop text into tokens; the result's "pieces" entry holds the token strings
  • decode(pieces) - Glue a list of token strings back into text
  • decode_ids(ids) - Convert token IDs back to text
  • get_vocab_size() - How many tokens are in the vocabulary
  • get_vocab() - The whole vocabulary
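
A quick way to sanity-check encode and decode together is a round-trip test. The sketch below assumes only what Quick Start shows: encode(text) returns something with a "pieces" entry, and decode(pieces) returns a string. StandInTokenizer and roundtrip_ok are hypothetical names made up for this example so it runs without the package installed; the stand-in just splits on whitespace and is nothing like the real tokenizer.

```python
class StandInTokenizer:
    """Hypothetical stand-in with the same encode/decode shape as Quick Start.

    It splits on whitespace; the real MonTokenizer uses its own vocabulary.
    """

    def encode(self, text):
        # Prefix each word with the "▁" (U+2581) boundary marker.
        return {"pieces": ["\u2581" + word for word in text.split()]}

    def decode(self, pieces):
        # Join, turn markers back into spaces, trim the leading space.
        return "".join(pieces).replace("\u2581", " ").strip()


def roundtrip_ok(tokenizer, text):
    """Return True if decode(encode(text)) gives back the original text."""
    pieces = tokenizer.encode(text)["pieces"]
    return tokenizer.decode(pieces) == text


tok = StandInTokenizer()
print(roundtrip_ok(tok, "hello world"))  # True for the stand-in
```

With the package installed, pass MonTokenizer() in place of StandInTokenizer() and feed it your own Mon text. Whether every input round-trips exactly is something to check rather than assume.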

Dev Setup

git clone git@github.com:janakhpon/mon_tokenizer.git
cd mon_tokenizer
uv sync --dev
uv run pytest

# Release workflow (swap v0.1.1 below for whatever version the bump produced)
uv version --bump patch
git add pyproject.toml
git commit -m "v0.1.1"
git tag v0.1.1
git push origin main --tags

License

MIT - do whatever you want with it.
