ViLLM Fast Tokenizer — Rust backend with embedded SentencePiece for Vietnamese-English code-switching

Project description

villm-tok-fast

Rust-backed fast tokenizer for Vietnamese-English code-switching, used in the ViLLM project.

Installation

pip install villm-tok-fast

Usage

from villm_tok_fast import create_fast_tokenizer
from transformers import PreTrainedTokenizerFast

# Load from HuggingFace Hub (auto-downloads vocab files)
hf_tokenizer = create_fast_tokenizer("vlinhd11/villm-tokenizer")

# Or load from local directory
hf_tokenizer = create_fast_tokenizer("./path/to/villm-tokenizer")

# Tokenize
enc = hf_tokenizer("Học sinh giỏi tiếng Việt và học lập trình Python")
# tokens: ['Học_sinh', 'giỏi', 'tiếng_Việt', 'và_học', 'lập_trình', '[VI→EN]', '▁Python']

# Batch encode
batch = hf_tokenizer(["Câu thứ nhất", "Câu thứ hai"], padding=True, truncation=True)

Features

  • Language detection: per-word VI/EN/Num/Code/Punct classification
  • Vietnamese Viterbi: merges frequent syllable bigrams into compound tokens (học_sinh, Việt_Nam)
  • English subword segmentation: embedded SentencePiece trie for English OOV words (▁Python, ▁programming)
  • Code-switch markers: optional [VI→EN] / [EN→VI] at language boundaries
  • Byte fallback: <0xNN> tokens for unknown characters
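To make the language detection, code-switch marker, and byte fallback features more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the package's Rust implementation: the function names, the diacritic heuristic, and the tiny TOY_VI_SYLLABLES lexicon are all hypothetical.

```python
import re
import unicodedata

# Hypothetical toy lexicon for Vietnamese syllables written without diacritics.
TOY_VI_SYLLABLES = {"sinh", "hai", "tin", "xe"}
# Assumption: these characters inside a word suggest source code.
CODE_CHARS = set("_(){}[]=<>")


def classify_word(word: str) -> str:
    """Heuristically label one whitespace-separated word."""
    if re.fullmatch(r"\d+(?:[.,]\d+)*", word):
        return "NUM"
    if all(unicodedata.category(ch).startswith("P") for ch in word):
        return "PUNCT"
    if any(ch in CODE_CHARS for ch in word):
        return "CODE"
    # After NFD, Vietnamese tone and vowel marks become combining marks (Mn).
    # Limitation: this also flags accented loanwords such as "café".
    decomposed = unicodedata.normalize("NFD", word)
    has_mark = any(unicodedata.category(ch) == "Mn" for ch in decomposed)
    lowered = word.lower()
    if has_mark or "đ" in lowered or lowered in TOY_VI_SYLLABLES:
        return "VI"
    return "EN"


def insert_switch_markers(words):
    """Insert [VI→EN] / [EN→VI] where the language of consecutive words changes."""
    out = []
    prev_lang = None
    for word in words:
        lang = classify_word(word)
        if lang in ("VI", "EN"):  # numbers, code, and punctuation do not switch
            if prev_lang is not None and lang != prev_lang:
                out.append(f"[{prev_lang}→{lang}]")
            prev_lang = lang
        out.append(word)
    return out


def byte_fallback(text: str):
    """Represent text as one <0xNN> token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


words = "Học sinh giỏi tiếng Việt và học lập trình Python".split()
print(insert_switch_markers(words)[-3:])  # ['trình', '[VI→EN]', 'Python']
print(byte_fallback("€"))                 # ['<0xE2>', '<0x82>', '<0xAC>']
```

A production detector would likely rely on a full syllable lexicon rather than diacritics alone, since many Vietnamese syllables carry no marks at all.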

Performance

About 4x faster than the pure-Python equivalent (~55k texts/sec vs. ~14k texts/sec).
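Throughput figures like these depend on hardware and corpus, so it can be useful to measure them locally. The sketch below is one simple way to do that; the helper name texts_per_second is hypothetical, and str.split stands in for a real tokenizer so the snippet runs without any dependencies.

```python
import time


def texts_per_second(tokenize, texts, repeats=3):
    """Return the best observed throughput of `tokenize` over `texts`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(texts) / max(best, 1e-9)


corpus = ["Học sinh giỏi tiếng Việt và học lập trình Python"] * 1000
rate = texts_per_second(str.split, corpus)  # stand-in tokenizer
print(f"{rate:,.0f} texts/sec")
```

To compare backends, pass each tokenizer as the callable (for example, a lambda that calls the object returned by create_fast_tokenizer) and run both over the same corpus.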

Requirements

  • Python 3.8+
  • transformers >= 4.30.0

How it works

This package provides a Rust implementation of the ViLLM tokenizer as a PyO3 native extension. It exposes a Python class compatible with HuggingFace PreTrainedTokenizerFast via the tokenizer_object= parameter.

The Rust core handles:

  1. Word-level language detection
  2. Viterbi dynamic programming for Vietnamese compound formation
  3. SentencePiece trie with whole-word preference for English subword segmentation
  4. Code-switch marker insertion
  5. Byte fallback for OOV characters
  6. Decoding with smart punctuation join
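Step 2 above can be sketched as a small dynamic program. This is a minimal illustration, assuming each token is either a single syllable or a two-syllable compound scored by a bigram log-probability table; the function name, the scores, and the flat single-syllable cost are hypothetical and are not taken from the Rust core.

```python
import math


def viterbi_compounds(syllables, bigram_logp, unigram_logp=-10.0):
    """Segment syllables into singles and bigram compounds with maximal total score."""
    n = len(syllables)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the first i syllables
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        # Option 1: the last token is a single syllable.
        score = best[end - 1] + unigram_logp
        if score > best[end]:
            best[end], back[end] = score, end - 1
        # Option 2: the last token is a known two-syllable compound.
        if end >= 2:
            pair = (syllables[end - 2].lower(), syllables[end - 1].lower())
            if pair in bigram_logp:
                score = best[end - 2] + bigram_logp[pair]
                if score > best[end]:
                    best[end], back[end] = score, end - 2
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append("_".join(syllables[start:end]))
        end = start
    return tokens[::-1]


# Hypothetical log-probabilities for frequent bigrams.
bigrams = {
    ("học", "sinh"): -4.0,
    ("sinh", "giỏi"): -9.0,
    ("tiếng", "việt"): -5.0,
}
print(viterbi_compounds("Học sinh giỏi tiếng Việt".split(), bigrams))
# ['Học_sinh', 'giỏi', 'tiếng_Việt']
```

Because the search scores whole segmentations, the overlapping candidates "học sinh" and "sinh giỏi" are resolved by total score rather than by a greedy left-to-right merge.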

License

MIT

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

villm_tok_fast-0.2.1-cp38-abi3-win_amd64.whl (873.0 kB)

Uploaded for CPython 3.8+, Windows x86-64

File details

Details for the file villm_tok_fast-0.2.1-cp38-abi3-win_amd64.whl.

File hashes

Hashes for villm_tok_fast-0.2.1-cp38-abi3-win_amd64.whl

Algorithm     Hash digest
SHA256        289d6d86546503dd885bcd0826ec20211db38e666fdb49aea87aa83ca7f4cd1e
MD5           6bb89408b807eb63f87f83758f707744
BLAKE2b-256   bcd2c057b3d759aabaa40c09130550c7b1fd30033b117f86545d8aca33661d43
