A simple tokenizer for Burmese text

Burmese Tokenizer

Tokenize Burmese text like a pro. No fancy stuff, just gets the job done.

Quick Start

# Using pip
pip install burmese-tokenizer

# Using uv (faster)
uv add burmese-tokenizer

from burmese_tokenizer import BurmeseTokenizer

tokenizer = BurmeseTokenizer()
text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"

# Tokenize
result = tokenizer.encode(text)
print(result["pieces"])  # ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']

# Decode
decoded = tokenizer.decode(result["pieces"])
print(decoded)  # မင်္ဂလာပါ။ နေကောင်းပါသလား။
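
Some of the pieces printed above start with `▁` (U+2581). That looks like a SentencePiece-style word-boundary marker, but this README does not say so, so treat it as an assumption. A minimal sketch for stripping it when you only want the surface tokens (the helper name `strip_marker` is made up for illustration and is not part of the package):

```python
# Assumption: "▁" (U+2581) is a boundary marker, not part of the token text.
MARKER = "\u2581"


def strip_marker(pieces):
    """Remove leading markers; drop pieces that consisted only of the marker."""
    cleaned = [p.lstrip(MARKER) for p in pieces]
    return [p for p in cleaned if p]


# Pieces copied from the Quick Start output.
pieces = ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']
print(strip_marker(pieces))
```

Keep the original pieces around if you plan to call `decode` later, since the stripped list no longer carries the boundary information.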

CLI

# Tokenize
burmese-tokenizer "မင်္ဂလာပါ။"

# Verbose mode (shows all the details)
burmese-tokenizer -v "မင်္ဂလာပါ။"

# Decode tokens back to text
burmese-tokenizer -d -t "▁မင်္ဂလာ,▁ပါ,။"
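
If you drive the CLI from a script, the decode call above can be assembled from a list of pieces. This is only a sketch: the `-d` and `-t` flags and the comma-separated format come from the example above, while the helper name `build_decode_command` is hypothetical. How the CLI treats a piece that itself contains a comma is not documented here.

```python
import shlex


def build_decode_command(pieces):
    """Build the argv list for `burmese-tokenizer -d -t <comma-separated pieces>`."""
    return ["burmese-tokenizer", "-d", "-t", ",".join(pieces)]


argv = build_decode_command(["▁မင်္ဂလာ", "▁ပါ", "။"])
print(shlex.join(argv))  # shell-quoted form, handy for logging

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(argv, check=True)
```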

API

  • encode(text) - Chop text into tokens
  • decode(pieces) - Glue tokens back together
  • decode_ids(ids) - Convert IDs back to text
  • get_vocab_size() - How many tokens we know
  • get_vocab() - The whole vocabulary
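
A quick sanity check built on `encode` and `decode` is a round trip: encode some text, decode the pieces, and compare. The sketch below relies only on what the Quick Start shows (`encode` returns a mapping with a `"pieces"` key, `decode` accepts those pieces). `roundtrip_ok` and `StubTokenizer` are hypothetical names; the stub is there only so the snippet runs without the package installed and does not reflect how the real tokenizer splits text.

```python
def roundtrip_ok(tokenizer, text):
    """Return True if decode(encode(text)["pieces"]) reproduces the text."""
    result = tokenizer.encode(text)
    return tokenizer.decode(result["pieces"]) == text


class StubTokenizer:
    """Hypothetical stand-in, NOT the real BurmeseTokenizer."""

    def encode(self, text):
        # Split into single code points; the real tokenizer does something smarter.
        return {"pieces": list(text)}

    def decode(self, pieces):
        return "".join(pieces)


print(roundtrip_ok(StubTokenizer(), "မင်္ဂလာပါ။"))

# With the package installed, swap in the real class:
# from burmese_tokenizer import BurmeseTokenizer
# print(roundtrip_ok(BurmeseTokenizer(), "မင်္ဂလာပါ။"))
```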

Dev Setup

git clone git@github.com:Code-Yay-Mal/burmese_tokenizer.git
cd burmese_tokenizer
uv sync --dev
uv run pytest

uv build
uv build --no-sources
# make sure you have a .pypirc configured
uv run twine upload dist/*  # or: uv publish

# bump version
uv version --bump patch
uv version --short

# or publish with gh-action
git tag v0.1.2 
git push origin v0.1.2 

# if something goes wrong, delete the tag locally and remotely, then start over
git tag -d v0.1.2 && git push origin :refs/tags/v0.1.2

License

MIT - do whatever you want with it.

Download files

Source Distribution

burmese_tokenizer-0.1.2.tar.gz (3.8 kB)

Uploaded Source

Built Distribution

burmese_tokenizer-0.1.2-py3-none-any.whl (5.2 kB)

Uploaded Python 3

File details

Details for the file burmese_tokenizer-0.1.2.tar.gz.

File metadata

  • Download URL: burmese_tokenizer-0.1.2.tar.gz
  • Upload date:
  • Size: 3.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.8.13

File hashes

Hashes for burmese_tokenizer-0.1.2.tar.gz
Algorithm Hash digest
SHA256 fc38ec28e8722e23ea8a24e728cd5b327a07c5b2ee490d73b6c3fedc018dbb97
MD5 5a5aed05940f387253bb958df70c5a40
BLAKE2b-256 14266bb9852e13aec329a3d7a96b427a50beb044df412607d4936088a5eff678

File details

Details for the file burmese_tokenizer-0.1.2-py3-none-any.whl.

File hashes

Hashes for burmese_tokenizer-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 bcc356f296ba258efdc9b76d9126252c0b8fc15ced377ed0aa688be09f48c31d
MD5 0a9372a3001252db37a2b7366a50be8e
BLAKE2b-256 a719ef79ac440bbc3dc880b727e12ec965940d003e9783843e7c3fdc657a628a
