ViLLM Fast Tokenizer — Rust backend with embedded SentencePiece for Vietnamese-English code-switching

villm-tok-fast

Rust-backed fast tokenizer for Vietnamese-English code-switching, used in the ViLLM project.

Installation

pip install villm-tok-fast

Usage

from villm_tok_fast import create_fast_tokenizer
from transformers import PreTrainedTokenizerFast

# Load from HuggingFace Hub (auto-downloads vocab files)
hf_tokenizer = create_fast_tokenizer("vlinhd11/villm-tokenizer")

# Or load from local directory
hf_tokenizer = create_fast_tokenizer("./path/to/villm-tokenizer")

# Tokenize
enc = hf_tokenizer("Học sinh giỏi tiếng Việt và học lập trình Python")
# tokens: ['Học_sinh', 'giỏi', 'tiếng_Việt', 'và_học', 'lập_trình', '[VI→EN]', '▁Python']

# Batch encode
batch = hf_tokenizer(["Câu thứ nhất", "Câu thứ hai"], padding=True, truncation=True)

Features

  • Language detection — per-word Vi/EN/Num/Code/Punct classification
  • Vietnamese Viterbi — merges frequent syllable bigrams into compound tokens (học_sinh, Việt_Nam)
  • English subword — embedded SentencePiece trie for English OOV words (▁Python, ▁programming)
  • Code-switch markers — optional [VI→EN] / [EN→VI] at language boundaries
  • Byte fallback (<0xNN>) for unknown characters
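
The first and fourth features above can be illustrated with a minimal pure-Python sketch. This is not the package's Rust implementation: the classification rules, the function names classify_word and insert_markers, and the vi_syllables parameter are all assumptions made for illustration. The sketch treats a word as Vietnamese if it carries a diacritic, contains "đ", or appears in a caller-supplied set of unaccented Vietnamese syllables, and it emits a marker only when the language switches between Vietnamese and English.

```python
import re
import unicodedata

def classify_word(word, vi_syllables=frozenset()):
    """Label one whitespace-separated word as VI, EN, NUM, CODE, or PUNCT (heuristic)."""
    lower = word.lower()
    if re.fullmatch(r"[+-]?\d+(?:[.,]\d+)*", word):
        return "NUM"
    if word and all(unicodedata.category(ch).startswith("P") for ch in word):
        return "PUNCT"
    if re.search(r"[_=(){}\[\]<>]", word):
        return "CODE"
    # NFD splits accented letters into a base letter plus combining marks,
    # so any combining mark signals a Vietnamese diacritic in this sketch.
    decomposed = unicodedata.normalize("NFD", lower)
    has_mark = any(unicodedata.combining(ch) for ch in decomposed)
    if has_mark or "đ" in lower or lower in vi_syllables:
        return "VI"
    return "EN"

def insert_markers(words, vi_syllables=frozenset()):
    """Insert [VI→EN] / [EN→VI] wherever the language changes between VI and EN."""
    out = []
    prev_lang = None  # last VI/EN label seen; NUM, CODE, and PUNCT do not change it
    for word in words:
        lang = classify_word(word, vi_syllables)
        if lang in ("VI", "EN"):
            if prev_lang is not None and lang != prev_lang:
                out.append(f"[{prev_lang}→{lang}]")
            prev_lang = lang
        out.append(word)
    return out

print(insert_markers(["I", "love", "phở"]))
# ['I', 'love', '[EN→VI]', 'phở']
```

Unaccented Vietnamese syllables such as "sinh" are indistinguishable from English by diacritics alone, which is why the sketch accepts an explicit syllable set; a real classifier would likely rely on a full syllable lexicon.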

Performance

~4x faster than the pure Python equivalent (~55k texts/sec vs ~14k).
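
The figures above are stated without hardware or corpus details, so it may be worth measuring throughput on your own data. The helper below is a hypothetical sketch (texts_per_second is not part of this package); it times any callable that accepts a single string and reports the best of several runs.

```python
import time
from typing import Callable, Sequence

def texts_per_second(tokenize: Callable[[str], object],
                     texts: Sequence[str],
                     repeats: int = 3) -> float:
    """Return the best-of-`repeats` throughput of `tokenize` over `texts`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(texts) / max(best, 1e-9)

# Stand-in tokenizer for demonstration; pass the real tokenizer callable instead.
rate = texts_per_second(str.split, ["Câu thứ nhất"] * 1000)
print(f"{rate:,.0f} texts/sec")
```

Dividing the rate of the Rust-backed tokenizer by the rate of the pure Python one gives the speedup; with the numbers quoted above, 55,000 / 14,000 is roughly 3.9.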

Requirements

  • Python 3.8+
  • transformers >= 4.30.0

How it works

This package provides a Rust implementation of the ViLLM tokenizer as a PyO3 native extension. It exposes a Python class compatible with HuggingFace PreTrainedTokenizerFast via the tokenizer_object= parameter.

The Rust core handles:

  1. Word-level language detection
  2. Viterbi dynamic programming for Vietnamese compound formation
  3. SentencePiece trie with whole-word preference for English subword tokenization
  4. Code-switch marker insertion
  5. Byte fallback for OOV characters
  6. Decoding with smart punctuation join
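
Step 2 in the list above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the Rust core: the function name viterbi_merge, the bigram_score table, and the scoring scheme (a fixed score per standalone syllable, a table score per merged pair) are hypothetical. The dynamic program chooses, for each prefix of the syllable sequence, whether the last token is a single syllable or a merged bigram, then backtracks to recover the best segmentation.

```python
from typing import Dict, List, Tuple

def viterbi_merge(syllables: List[str],
                  bigram_score: Dict[Tuple[str, str], float],
                  unigram_score: float = 0.0) -> List[str]:
    """Segment syllables into single or two-syllable tokens with maximum total score."""
    n = len(syllables)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for the first i syllables
    back = [0] * (n + 1)              # back[i]: length (1 or 2) of the last token
    best[0] = 0.0
    for i in range(1, n + 1):
        # Option 1: syllable i stands alone.
        best[i] = best[i - 1] + unigram_score
        back[i] = 1
        # Option 2: merge syllables i-1 and i if the pair has a score.
        if i >= 2:
            pair = (syllables[i - 2].lower(), syllables[i - 1].lower())
            if pair in bigram_score:
                candidate = best[i - 2] + bigram_score[pair]
                if candidate > best[i]:
                    best[i] = candidate
                    back[i] = 2
    # Backtrack from the end to recover the chosen tokens.
    tokens = []
    i = n
    while i > 0:
        step = back[i]
        tokens.append("_".join(syllables[i - step:i]))
        i -= step
    tokens.reverse()
    return tokens

scores = {("học", "sinh"): 2.0, ("tiếng", "việt"): 1.5, ("sinh", "giỏi"): 0.5}
print(viterbi_merge("học sinh giỏi tiếng Việt".split(), scores))
# ['học_sinh', 'giỏi', 'tiếng_Việt']
```

Unlike a greedy left-to-right merge, the dynamic program can skip an early low-scoring pair in favor of a later higher-scoring one that overlaps it, for example preferring a, b_c over a_b, c when the pair (b, c) scores higher.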

License

MIT

Download files

No source distribution is available for this release.

Built Distribution

villm_tok_fast-0.1.0-cp38-abi3-win_amd64.whl (864.6 kB)

Uploaded for CPython 3.8+, Windows x86-64

File hashes

Hashes for villm_tok_fast-0.1.0-cp38-abi3-win_amd64.whl
Algorithm    Hash digest
SHA256       d6b9de0c79c57b9973bb1f615ea2203c87bd22ddb6b13714b358c7dd83485ee5
MD5          b7e4d82fc6c27b44b561a2d2ef0d932d
BLAKE2b-256  3de4ab8b785cca0f51fda32abf3eb537902b7ec7ba8031ae14c66445e311b1c9