ViLLM Fast Tokenizer — Rust backend with embedded SentencePiece for Vietnamese-English code-switching
villm-tok-fast
Rust-backed fast tokenizer for Vietnamese-English code-switching, used in the viLLM project.
Installation
pip install villm-tok-fast
Usage
from villm_tok_fast import create_fast_tokenizer
from transformers import PreTrainedTokenizerFast
# Load from HuggingFace Hub (auto-downloads vocab files)
hf_tokenizer = create_fast_tokenizer("vlinhd11/villm-tokenizer")
# Or load from local directory
hf_tokenizer = create_fast_tokenizer("./path/to/villm-tokenizer")
# Tokenize
enc = hf_tokenizer("Học sinh giỏi tiếng Việt và học lập trình Python")
# tokens: ['Học_sinh', 'giỏi', 'tiếng_Việt', 'và_học', 'lập_trình', '[VI→EN]', '▁Python']
# Batch encode
batch = hf_tokenizer(["Câu thứ nhất", "Câu thứ hai"], padding=True, truncation=True)
Features
- Language detection: per-word VI/EN/Num/Code/Punct classification
- Vietnamese Viterbi: merges frequent syllable bigrams into compound tokens (học_sinh, Việt_Nam)
- English subword: embedded SentencePiece trie for English OOV words (▁Python, ▁programming)
- Code-switch markers: optional [VI→EN] / [EN→VI] at language boundaries
- Byte fallback: <0xNN> for unknown characters
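To make the language detection, code-switch markers, and byte fallback features concrete, here is a minimal pure-Python sketch. It is not the package's implementation: the label names, the diacritic-based heuristic, and the helper names `classify_word`, `insert_markers`, and `byte_fallback` are all assumptions made for illustration, and the uppercase hex in `<0xNN>` follows the usual SentencePiece convention rather than anything confirmed for this package.

```python
import re
import unicodedata

# Hypothetical label names; the real classifier may use different ones.
VI, EN, NUM, PUNCT = "vi", "en", "num", "punct"

def classify_word(word: str) -> str:
    """Toy heuristic: treat any word carrying diacritics (or 'đ') as Vietnamese.

    Unaccented Vietnamese syllables such as 'sinh' would be misclassified as
    English here; a real detector needs a lexicon or statistics.
    """
    if re.fullmatch(r"\d+([.,]\d+)*", word):
        return NUM
    if all(unicodedata.category(ch).startswith("P") for ch in word):
        return PUNCT
    lowered = word.lower()
    if "đ" in lowered:
        return VI
    # NFD separates base letters from combining tone and vowel marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    if any(unicodedata.combining(ch) for ch in decomposed):
        return VI
    return EN

def insert_markers(words):
    """Insert [VI→EN] / [EN→VI] wherever the language label flips."""
    out, prev = [], None
    for word in words:
        label = classify_word(word)
        if label in (VI, EN):
            if prev is not None and label != prev:
                out.append(f"[{prev.upper()}→{label.upper()}]")
            prev = label
        out.append(word)
    return out

def byte_fallback(ch: str):
    """Represent an unknown character as one <0xNN> token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

print(insert_markers("Tôi học lập trình Python".split()))
print(byte_fallback("€"))
```

Numbers and punctuation do not reset the tracked language in this sketch, so a marker is only emitted when a Vietnamese word is followed (possibly after digits or punctuation) by an English one, or the reverse.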
Performance
About 4x faster than the pure-Python equivalent: roughly 55k texts/sec versus roughly 14k.
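Throughput figures like these depend on hardware and input length, so it can be useful to measure them locally. The helper below is a small sketch of one way to do that; `texts_per_second` is a hypothetical name, not part of this package, and it accepts any callable that tokenizes a single string.

```python
import time

def texts_per_second(tokenize, texts, repeats=3):
    """Return the best observed throughput over `repeats` timed passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(texts) / max(best, 1e-12)

# Stand-in tokenizer for demonstration; pass the real tokenizer callable instead.
sample = ["Câu thứ nhất", "Câu thứ hai"] * 500
print(f"{texts_per_second(str.split, sample):,.0f} texts/sec")
```

Taking the best of several passes reduces noise from warm-up and background load; it measures one-text-at-a-time calls, so batch encoding may give different numbers.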
Requirements
- Python 3.8+
- transformers >= 4.30.0
How it works
This package provides a Rust implementation of the ViLLM tokenizer as a PyO3 native extension.
It exposes a Python class compatible with HuggingFace PreTrainedTokenizerFast via the
tokenizer_object= parameter.
The Rust core handles:
- Word-level language detection
- Viterbi dynamic programming for Vietnamese compound formation
- SentencePiece trie with whole-word preference for English subword tokenization
- Code-switch marker insertion
- Byte fallback for OOV characters
- Decoding with smart punctuation join
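The Viterbi step for Vietnamese compound formation can be illustrated with a short dynamic-programming sketch in Python. This is an assumption about the general technique, not the Rust core's actual code: the scores in `BIGRAM_LOGP`, the flat `UNIGRAM_LOGP`, and the function name `viterbi_merge` are invented for the example, and a real model would read its scores from the trained vocabulary.

```python
import math

# Hypothetical log-probability scores for frequent syllable bigrams.
BIGRAM_LOGP = {
    ("học", "sinh"): -2.0,
    ("tiếng", "việt"): -2.5,
    ("lập", "trình"): -2.2,
}
UNIGRAM_LOGP = -4.0  # flat score assumed for any single syllable

def viterbi_merge(syllables):
    """Pick the highest-scoring split into single syllables and known bigrams."""
    n = len(syllables)
    best = [-math.inf] * (n + 1)  # best[i]: best score covering syllables[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0
    for i in range(1, n + 1):
        # Option 1: the last token is a single syllable.
        score = best[i - 1] + UNIGRAM_LOGP
        if score > best[i]:
            best[i], back[i] = score, i - 1
        # Option 2: the last token is a known two-syllable compound.
        if i >= 2:
            key = (syllables[i - 2].lower(), syllables[i - 1].lower())
            if key in BIGRAM_LOGP:
                score = best[i - 2] + BIGRAM_LOGP[key]
                if score > best[i]:
                    best[i], back[i] = score, i - 2
    # Walk the back-pointers from the end to recover the tokens.
    tokens, i = [], n
    while i > 0:
        j = back[i]
        tokens.append("_".join(syllables[j:i]))
        i = j
    return tokens[::-1]

print(viterbi_merge("Học sinh giỏi tiếng Việt".split()))
# ['Học_sinh', 'giỏi', 'tiếng_Việt']
```

Because a merged bigram scores higher than two separate syllables in this toy table, the dynamic program prefers compounds whenever they exist, while still choosing the globally best segmentation when candidate bigrams overlap.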
License
MIT