ViLLM Fast Tokenizer — Rust backend with embedded SentencePiece for Vietnamese-English code-switching

Project description

villm-tok-fast

Rust-backed fast tokenizer for Vietnamese-English code-switching, used in the viLLM project.

Installation

pip install villm-tok-fast

Usage

from villm_tok_fast import create_fast_tokenizer
from transformers import PreTrainedTokenizerFast

# Load from HuggingFace Hub (auto-downloads vocab files)
hf_tokenizer = create_fast_tokenizer("vlinhd11/villm-tokenizer")

# Or load from local directory
hf_tokenizer = create_fast_tokenizer("./path/to/villm-tokenizer")

# Tokenize
enc = hf_tokenizer("Học sinh giỏi tiếng Việt và học lập trình Python")
# tokens: ['Học_sinh', 'giỏi', 'tiếng_Việt', 'và_học', 'lập_trình', '[VI→EN]', '▁Python']

# Batch encode
batch = hf_tokenizer(["Câu thứ nhất", "Câu thứ hai"], padding=True, truncation=True)

Features

  • Language detection — per-word Vi/EN/Num/Code/Punct classification
  • Vietnamese Viterbi — merges frequent syllable bigrams into compound tokens (học_sinh, Việt_Nam)
  • English subword — embedded SentencePiece trie for English OOV words (▁Python, ▁programming)
  • Code-switch markers — optional [VI→EN] / [EN→VI] at language boundaries
  • Byte fallback (<0xNN>) for unknown characters
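The byte-fallback feature can be illustrated with a minimal Python sketch. It assumes the `<0xNN>` tokens carry one UTF-8 byte each, written as two uppercase hex digits (the SentencePiece convention); the helper names `byte_fallback` and `decode_byte_tokens` are hypothetical and are not part of the villm-tok-fast API.

```python
from typing import List


def byte_fallback(text: str) -> List[str]:
    """Encode every UTF-8 byte of `text` as a '<0xNN>' token (assumed format)."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def decode_byte_tokens(tokens: List[str]) -> str:
    """Invert byte_fallback: read the two hex digits of each token, then decode UTF-8."""
    raw = bytes(int(tok[3:-1], 16) for tok in tokens)
    # errors="replace" keeps decoding robust if a byte sequence is incomplete.
    return raw.decode("utf-8", errors="replace")


tokens = byte_fallback("€")
print(tokens)                      # ['<0xE2>', '<0x82>', '<0xAC>']
print(decode_byte_tokens(tokens))  # €
```

Because every character has a UTF-8 byte sequence, this scheme lets any input be encoded without an unknown-token placeholder.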

Performance

About 4x faster than the pure-Python equivalent (~55k texts/sec vs. ~14k texts/sec).
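To check throughput on your own hardware, a small timing harness like the one below may help. It is only a sketch: `texts_per_second` is a hypothetical helper, not part of this package, and `str.split` stands in for a real tokenizer so that the example runs without any dependencies.

```python
import time
from typing import Callable, Sequence


def texts_per_second(tokenize: Callable[[str], object],
                     texts: Sequence[str],
                     repeats: int = 3) -> float:
    """Return the best observed throughput (texts/sec) over `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very coarse clocks.
    return len(texts) / best if best > 0 else float("inf")


if __name__ == "__main__":
    sample = ["Học sinh giỏi tiếng Việt và học lập trình Python"] * 10_000
    # str.split is a placeholder; pass your tokenizer callable instead.
    print(f"{texts_per_second(str.split, sample):,.0f} texts/sec")
```

Taking the best of several passes reduces noise from warm-up and background load; results will still vary with text length and machine.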

Requirements

  • Python 3.8+
  • transformers >= 4.30.0

How it works

This package provides a Rust implementation of the ViLLM tokenizer as a PyO3 native extension. It exposes a Python class compatible with HuggingFace PreTrainedTokenizerFast via the tokenizer_object= parameter.

The Rust core handles:

  1. Word-level language detection
  2. Viterbi dynamic programming for Vietnamese compound formation
  3. SentencePiece trie with whole-word preference for English subword segmentation
  4. Code-switch marker insertion
  5. Byte fallback for OOV characters
  6. Decoding with smart punctuation join
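Step 2 above can be sketched in Python as a small dynamic program. This is an illustration under stated assumptions, not the Rust implementation: the function `merge_compounds`, the bigram score table, and the flat `unigram_score` are all hypothetical, and the real scoring model is not described here.

```python
from typing import Dict, List, Tuple


def merge_compounds(syllables: List[str],
                    bigram_score: Dict[Tuple[str, str], float],
                    unigram_score: float = 0.0) -> List[str]:
    """Pick the highest-scoring split of `syllables` into 1- or 2-syllable tokens."""
    n = len(syllables)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for syllables[:i]
    back = [0] * (n + 1)              # back[i]: length (1 or 2) of the last token
    best[0] = 0.0
    for i in range(1, n + 1):
        # Option 1: the last token is a single syllable.
        best[i] = best[i - 1] + unigram_score
        back[i] = 1
        # Option 2: the last token is a known two-syllable compound.
        if i >= 2:
            pair = (syllables[i - 2].lower(), syllables[i - 1].lower())
            if pair in bigram_score:
                candidate = best[i - 2] + bigram_score[pair]
                if candidate > best[i]:
                    best[i], back[i] = candidate, 2
    # Walk the back-pointers from the end to recover the chosen tokens.
    tokens: List[str] = []
    i = n
    while i > 0:
        size = back[i]
        tokens.append("_".join(syllables[i - size:i]))
        i -= size
    return tokens[::-1]


scores = {("học", "sinh"): 2.0, ("sinh", "giỏi"): 1.0, ("tiếng", "việt"): 1.5}
print(merge_compounds("Học sinh giỏi tiếng Việt".split(), scores))
# ['Học_sinh', 'giỏi', 'tiếng_Việt']
```

The overlapping candidates "học sinh" and "sinh giỏi" show why a global search is useful: the program compares whole segmentations rather than committing to the first bigram it sees.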

License

MIT

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

villm_tok_fast-0.2.0-cp38-abi3-win_amd64.whl (871.6 kB)

Platform: CPython 3.8+, Windows x86-64

File details

Details for the file villm_tok_fast-0.2.0-cp38-abi3-win_amd64.whl.

File hashes

Hashes for villm_tok_fast-0.2.0-cp38-abi3-win_amd64.whl

Algorithm     Hash digest
SHA256        02556861ddce68e2d552ca19fe0003b5aa219b9ab3cdb132bf4d6de80de32410
MD5           0866e594f62989bec4c5894b17d7de65
BLAKE2b-256   4196e49881061594f57113f0e18a1dcdae8073574a827947d5838f683d354c5e
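A downloaded wheel can be checked against the SHA256 digest listed above. The sketch below uses only the standard library; `sha256_of` and `verify_wheel` are hypothetical helper names, and the local file path is whatever location you saved the wheel to.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for villm_tok_fast-0.2.0-cp38-abi3-win_amd64.whl
EXPECTED_SHA256 = "02556861ddce68e2d552ca19fe0003b5aa219b9ab3cdb132bf4d6de80de32410"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest matches `expected` (case-insensitive)."""
    return sha256_of(Path(path)) == expected.lower()
```

For example, `verify_wheel("villm_tok_fast-0.2.0-cp38-abi3-win_amd64.whl")` should return True for an intact download of that file.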
