BPE & Unigram trainer for Shredword tokenizer
ShredWord
ShredWord is a high-performance tokenizer training library supporting Byte-Pair Encoding (BPE) and Unigram Language Model algorithms. Designed for fast, efficient, and flexible text processing, it provides vocabulary training functionalities backed by a C/C++ core with a Python interface for seamless integration into machine learning workflows.
Features
- Multiple Tokenization Algorithms: Supports both BPE and Unigram training methods for flexible vocabulary generation
- Efficient Tokenization: Uses optimized algorithms and data structures (heaps, hash maps, tries) for fast training on large corpora
- Customizable Vocabulary: Allows users to define target vocabulary size, character coverage, and algorithm-specific parameters
- Save and Load Models: Supports saving and loading trained tokenizers for reuse across projects
- Python Integration: Provides a clean Python interface for seamless integration into NLP pipelines
- C/C++ CLI: Includes command-line interface for direct training without Python dependencies
How It Works
Byte-Pair Encoding (BPE)
BPE is a subword tokenization algorithm that compresses a dataset by iteratively merging the most frequent pairs of characters or subwords into new tokens. This process continues until a predefined vocabulary size is reached.
Key steps:
- Initialize the vocabulary with all unique characters in the dataset
- Count the frequency of character pairs
- Merge the most frequent pair into a new token
- Repeat until the target vocabulary size is achieved
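The steps above can be sketched in a few lines of plain Python. This is a toy illustration of the BPE merge loop, not ShredWord's optimized C/C++ implementation; the function name and the `{word: frequency}` input format are chosen for the example:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: words is {word: frequency}; returns the merge list."""
    # Initialize: represent each word as a tuple of symbols (characters).
    corpus = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in corpus.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the most frequent pair everywhere it occurs.
        new_corpus = {}
        for syms, freq in corpus.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + freq
        corpus = new_corpus
    return merges
```

Running this on a small corpus such as `{"low": 5, "lower": 2, "lowest": 2}` first merges `("l", "o")`, then `("lo", "w")`, exactly the iterative pair-merging described above.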
Unigram Language Model
Unigram is a probabilistic subword tokenization algorithm that starts with a large seed vocabulary and iteratively prunes tokens with the lowest likelihood scores using the Expectation-Maximization (EM) algorithm.
Key steps:
- Generate a large initial vocabulary from all possible substrings
- Compute likelihood scores for each subword in the corpus
- Update subword probabilities using EM iterations
- Prune lowest-scoring subwords until target vocabulary size is reached
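A minimal sketch of these steps, using a simplified hard-EM variant (Viterbi re-counting instead of full expectation sums) for readability. This is an illustration only; ShredWord's actual Unigram trainer, parameter names, and pruning criterion may differ:

```python
import math
from collections import Counter

def viterbi(word, logp):
    """Most probable segmentation of `word` under token log-probabilities."""
    n = len(word)
    best = [0.0] + [-math.inf] * n   # best[i]: score of word[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i], back[i] = best[j] + logp[piece], j
    pieces, i = [], n
    while i > 0:                     # walk backpointers to recover pieces
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

def unigram_train(words, target_size, max_len=4, rounds=3):
    """Toy Unigram trainer: words is {word: frequency}; returns log-probs."""
    # 1. Seed vocabulary: every substring up to max_len, scored by count.
    counts = Counter()
    for w, f in words.items():
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                counts[w[i:j]] += f
    for _ in range(rounds):
        total = sum(counts.values())
        logp = {t: math.log(c / total) for t, c in counts.items()}
        # 2-3. Hard-EM step: re-count tokens from the best segmentations.
        counts = Counter()
        for w, f in words.items():
            for piece in viterbi(w, logp):
                counts[piece] += f
        # 4. Prune the lowest-count multi-character pieces toward the target.
        prunable = sorted((t for t in counts if len(t) > 1), key=counts.get)
        while len(counts) > target_size and prunable:
            del counts[prunable.pop(0)]
    total = sum(counts.values())
    return {t: math.log(c / total) for t, c in counts.items()}
```

After training, `viterbi` with the returned log-probabilities segments new text; the real algorithm additionally uses soft expectations and a loss-based pruning score rather than raw counts.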
ShredWord implements both algorithms efficiently in C/C++, exposing training and vocabulary management methods through Python.
Installation
Prerequisites
- Python 3.11+
- GCC or a compatible compiler (for building from source)
Steps
Install the Python package from PyPI.org:
pip install shredword-trainer
Usage
Below are examples demonstrating how to use ShredWord for training tokenizers with both BPE and Unigram algorithms.
BPE Trainer
from shredword.trainer import BPETrainer
trainer = BPETrainer(
    vocab_size=8192,
    unk_id=0,
    character_coverage=0.995,
    min_pair_freq=2000
)
trainer.load_corpus("data/corpus.txt")
trainer.train()
trainer.save("model/bpe.model", "model/bpe.vocab")
trainer.destroy()
Unigram Trainer
Note: Unigram implementation is currently under development and may not be fully functional.
from shredword.trainer import UnigramTrainer
trainer = UnigramTrainer(
    vocab_size=32000,
    character_coverage=0.9995,
    max_sentencepiece_length=16,
    seed_size=1000000
)
trainer.load_corpus("data/corpus.txt")
trainer.train(num_iterations=10)
trainer.save("model/unigram.vocab")
trainer.destroy()
Context Manager Pattern
from shredword.trainer import BPETrainer
with BPETrainer(vocab_size=16000) as trainer:
    trainer.load_corpus("data/corpus.txt")
    trainer.train()
    trainer.save("model/bpe.model", "model/bpe.vocab")
Multiple Corpus Training
from shredword.trainer import BPETrainer
trainer = BPETrainer(vocab_size=25000)
corpus_files = ["data/corpus1.txt", "data/corpus2.txt", "data/corpus3.txt"]
for corpus_file in corpus_files:
    trainer.load_corpus(corpus_file)
trainer.train()
trainer.save("model/multi.model", "model/multi.vocab")
trainer.destroy()
API Overview
BPETrainer
Constructor Parameters
- vocab_size (int): Target vocabulary size. Default: 8192
- unk_id (int): ID for unknown tokens. Default: 0
- character_coverage (float): Character coverage ratio (0.0-1.0). Default: 0.995
- min_pair_freq (int): Minimum frequency for pair merging. Default: 2000
Methods
- load_corpus(path): Load a training corpus from a text file
- train(): Train the BPE model on the loaded corpus
- save(model_path, vocab_path): Save the trained model and vocabulary
- destroy(): Release trainer resources
UnigramTrainer
Constructor Parameters
- vocab_size (int): Target vocabulary size. Default: 32000
- character_coverage (float): Character coverage ratio (0.0-1.0). Default: 0.9995
- max_sentencepiece_length (int): Maximum length of sentence pieces. Default: 16
- seed_size (int): Initial seed vocabulary size. Default: 1000000
Methods
- load_corpus(path): Load a training corpus from a text file
- train(num_iterations): Train the Unigram model using the EM algorithm
- save(vocab_path): Save the trained vocabulary
- destroy(): Release trainer resources
C/C++ CLI Usage
ShredWord also provides a command-line interface for training directly without Python.
Compilation
Windows:
g++ -o trainer.exe trainer.cpp bpe/bpe.cpp bpe/histogram.cpp bpe/hash.cpp bpe/heap.cpp unigram/unigram.cpp unigram/heap.cpp unigram/cache.cpp unigram/hashmap.cpp unigram/subword.cpp trie.cpp -I. -std=c++11
Linux:
g++ -o trainer trainer.cpp bpe/bpe.cpp bpe/histogram.cpp bpe/hash.cpp bpe/heap.cpp unigram/unigram.cpp unigram/heap.cpp unigram/cache.cpp unigram/hashmap.cpp unigram/subword.cpp trie.cpp -I. -std=c++11
Training with CLI
BPE:
trainer.exe input=corpus.txt model_type=bpe output_model=model.bin output_vocab=vocab.txt vocab_size=32000
Unigram:
trainer.exe input=corpus.txt model_type=unigram output_model=model.bin output_vocab=vocab.txt vocab_size=32000 num_iterations=10
Advanced Features
Error Handling
from shredword.trainer import BPETrainer
trainer = BPETrainer(vocab_size=10000)
try:
    trainer.load_corpus("corpus.txt")
    trainer.train()
    trainer.save("model.model", "vocab.vocab")
except RuntimeError as e:
    print(f"Training error: {e}")
except IOError as e:
    print(f"File error: {e}")
finally:
    trainer.destroy()
Resource Management
Always call destroy() to properly clean up resources, or use the context manager pattern for automatic cleanup:
with BPETrainer(vocab_size=16000) as trainer:
    trainer.load_corpus("data.txt")
    trainer.train()
    trainer.save("model.model", "vocab.vocab")
Configuration Guidelines
Vocabulary Size
- Small models: 8,192 - 16,384
- Medium models: 32,000 - 50,000
- Large models: 50,000 - 100,000
Character Coverage
- English: 0.995 - 0.999
- Multilingual: 0.9995 - 1.0
Minimum Pair Frequency (BPE)
- Small corpus: 100 - 1,000
- Large corpus: 2,000 - 10,000
EM Iterations (Unigram)
- Quick training: 5 - 8
- Standard training: 10 - 15
- High quality: 15 - 20
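The guidelines above can be collected into a small helper that picks constructor keyword arguments by corpus size and language mix. The function name and thresholds here are illustrative assumptions, not part of the ShredWord API:

```python
def suggest_bpe_params(corpus_tokens, multilingual=False):
    """Hypothetical helper mapping the configuration guidelines above
    to concrete BPETrainer keyword arguments. Thresholds are examples."""
    return {
        # Small/medium split at roughly 10M tokens (illustrative cutoff).
        "vocab_size": 16384 if corpus_tokens < 10_000_000 else 50000,
        # Higher coverage for multilingual corpora, per the guidelines.
        "character_coverage": 0.9995 if multilingual else 0.995,
        # Lower merge-frequency floor for small corpora.
        "min_pair_freq": 500 if corpus_tokens < 1_000_000 else 2000,
    }
```

The returned dict can then be splatted into the constructor, e.g. `BPETrainer(**suggest_bpe_params(500_000))`.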
File Formats
Input Corpus
- Plain text files
- UTF-8 encoding recommended
- One sentence per line (typical)
Output Files
- Model file (.model/.bin): Contains merge operations (BPE) or metadata (Unigram)
- Vocabulary file (.vocab/.txt): Contains vocabulary mapping
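For inspecting a vocabulary file, a parser like the following can be useful. Note that the line layout assumed here (one token per line, optional tab-separated score, in the SentencePiece style) is an assumption for illustration; ShredWord's exact .vocab format is not specified above:

```python
import io

def load_vocab(stream):
    """Parse a vocab file into {token: (line_index, score)}.
    ASSUMED layout: one token per line, optional tab-separated score
    (SentencePiece-style); not a documented ShredWord format."""
    vocab = {}
    for idx, line in enumerate(stream):
        parts = line.rstrip("\n").split("\t")
        score = float(parts[1]) if len(parts) > 1 else 0.0
        vocab[parts[0]] = (idx, score)
    return vocab

# Example with an in-memory file standing in for model/bpe.vocab:
vocab = load_vocab(io.StringIO("<unk>\t0\nthe\t-2.5\n"))
```

Under this assumed layout, the token's line index doubles as its ID.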
Documentation
For detailed documentation on both the BPE and Unigram trainers, including API references, configuration parameters, and troubleshooting guides, refer to the project documentation.
Known Limitations
- Unigram implementation is currently under development and may not function as expected
- Maximum corpus size may be limited by available system memory
- CLI interface has a maximum text limit for Unigram training
Project Information
A project by Shivendra