
BPE & Unigram trainer for the ShredWord tokenizer


ShredWord

ShredWord is a high-performance tokenizer training library supporting Byte-Pair Encoding (BPE) and Unigram Language Model algorithms. It provides fast, flexible vocabulary training backed by a C/C++ core, with a Python interface for seamless integration into machine learning workflows.

Features

  1. Multiple Tokenization Algorithms: Supports both BPE and Unigram training methods for flexible vocabulary generation
  2. Efficient Tokenization: Uses optimized algorithms to compress text into short token sequences with a compact subword vocabulary
  3. Customizable Vocabulary: Allows users to define target vocabulary size, character coverage, and algorithm-specific parameters
  4. Save and Load Models: Supports saving and loading trained tokenizers for reuse across projects
  5. Python Integration: Provides a clean Python interface for seamless integration into NLP pipelines
  6. C/C++ CLI: Includes command-line interface for direct training without Python dependencies

How It Works

Byte-Pair Encoding (BPE)

BPE is a subword tokenization algorithm that compresses a dataset by iteratively merging the most frequent adjacent pairs of characters or subwords into new tokens. This process continues until a predefined vocabulary size is reached.

Key steps:

  1. Initialize the vocabulary with all unique characters in the dataset
  2. Count the frequency of character pairs
  3. Merge the most frequent pair into a new token
  4. Repeat until the target vocabulary size is achieved
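
To make these steps concrete, here is a minimal, self-contained Python sketch of the merge loop. It is an illustration only, not ShredWord's actual implementation (the real C/C++ core is heavily optimized); the function and variable names are invented for the example.

from collections import Counter

def train_bpe(words, target_vocab_size):
  # words: dict mapping a word (as a tuple of symbols) -> corpus frequency
  vocab = {sym for word in words for sym in word}      # step 1
  merges = []
  while len(vocab) < target_vocab_size:
    pairs = Counter()                                  # step 2
    for word, freq in words.items():
      for pair in zip(word, word[1:]):
        pairs[pair] += freq
    if not pairs:
      break
    (a, b), _ = pairs.most_common(1)[0]                # step 3
    merges.append((a, b))
    vocab.add(a + b)
    merged = Counter()                                 # step 4: apply the merge
    for word, freq in words.items():
      out, i = [], 0
      while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
          out.append(a + b)
          i += 2
        else:
          out.append(word[i])
          i += 1
      merged[tuple(out)] += freq
    words = merged
  return vocab, merges

For example, train_bpe({tuple("low"): 5, tuple("lowest"): 2}, 8) starts from the characters {l, o, w, e, s, t} and first merges an adjacent pair such as ("l", "o") into the new token "lo".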

Unigram Language Model

Unigram is a probabilistic subword tokenization algorithm that starts with a large seed vocabulary and iteratively prunes tokens with the lowest likelihood scores using the Expectation-Maximization (EM) algorithm.

Key steps:

  1. Generate a large initial vocabulary from all possible substrings
  2. Compute likelihood scores for each subword in the corpus
  3. Update subword probabilities using EM iterations
  4. Prune lowest-scoring subwords until target vocabulary size is reached
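
The following toy Python sketch shows the overall shape of this loop. It is greatly simplified relative to a real Unigram trainer (and to ShredWord's in-development implementation): it uses hard Viterbi counts in place of full expected counts from forward-backward EM, and plain frequency pruning in place of likelihood-loss pruning. All names here are invented for illustration.

import math
from collections import Counter

def viterbi_segment(word, logp, max_len=16):
  # best-scoring segmentation of `word` under current piece log-probabilities
  best = [0.0] + [-math.inf] * len(word)
  back = [0] * (len(word) + 1)
  for end in range(1, len(word) + 1):
    for start in range(max(0, end - max_len), end):
      piece = word[start:end]
      if piece in logp and best[start] + logp[piece] > best[end]:
        best[end] = best[start] + logp[piece]
        back[end] = start
  pieces, end = [], len(word)
  while end > 0:
    pieces.append(word[back[end]:end])
    end = back[end]
  return list(reversed(pieces))

def train_unigram(words, target_vocab_size, num_iterations=10, max_len=16):
  # step 1: seed the vocabulary with all substrings up to max_len
  counts = Counter()
  for word, freq in words.items():
    for i in range(len(word)):
      for j in range(i + 1, min(i + 1 + max_len, len(word) + 1)):
        counts[word[i:j]] += freq
  chars = {ch for word in words for ch in word}
  for _ in range(num_iterations):
    total = sum(counts.values())
    logp = {p: math.log(c / total) for p, c in counts.items()}
    # steps 2-3: re-estimate counts along each word's best segmentation
    new_counts = Counter()
    for word, freq in words.items():
      for piece in viterbi_segment(word, logp, max_len):
        new_counts[piece] += freq
    # step 4: keep the highest-count pieces, always retaining single characters
    keep = set(sorted(new_counts, key=new_counts.get,
                      reverse=True)[:target_vocab_size])
    counts = Counter({p: c for p, c in new_counts.items()
                      if p in keep or len(p) == 1})
    for ch in chars:  # guarantee every character stays representable
      counts[ch] = max(counts[ch], 1)
  return counts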

ShredWord implements both algorithms efficiently in C/C++, exposing training and vocabulary management methods through Python.

Installation

Prerequisites

  • Python 3.11+
  • GCC or a compatible compiler (for building from source)

Steps

Install the Python package from PyPI:

pip install shredword-trainer

Usage

Below are examples demonstrating how to use ShredWord for training tokenizers with both BPE and Unigram algorithms.

BPE Trainer

from shredword.trainer import BPETrainer

trainer = BPETrainer(
  vocab_size=8192,
  unk_id=0,
  character_coverage=0.995,
  min_pair_freq=2000
)

trainer.load_corpus("data/corpus.txt")
trainer.train()
trainer.save("model/bpe.model", "model/bpe.vocab")
trainer.destroy()

Unigram Trainer

Note: Unigram implementation is currently under development and may not be fully functional.

from shredword.trainer import UnigramTrainer

trainer = UnigramTrainer(
  vocab_size=32000,
  character_coverage=0.9995,
  max_sentencepiece_length=16,
  seed_size=1000000
)

trainer.load_corpus("data/corpus.txt")
trainer.train(num_iterations=10)
trainer.save("model/unigram.vocab")
trainer.destroy()

Context Manager Pattern

from shredword.trainer import BPETrainer

with BPETrainer(vocab_size=16000) as trainer:
  trainer.load_corpus("data/corpus.txt")
  trainer.train()
  trainer.save("model/bpe.model", "model/bpe.vocab")

Multiple Corpus Training

from shredword.trainer import BPETrainer

trainer = BPETrainer(vocab_size=25000)

corpus_files = ["data/corpus1.txt", "data/corpus2.txt", "data/corpus3.txt"]
for corpus_file in corpus_files:
  trainer.load_corpus(corpus_file)

trainer.train()
trainer.save("model/multi.model", "model/multi.vocab")
trainer.destroy()

API Overview

BPETrainer

Constructor Parameters

  • vocab_size (int): Target vocabulary size. Default: 8192
  • unk_id (int): ID for unknown tokens. Default: 0
  • character_coverage (float): Character coverage ratio (0.0-1.0). Default: 0.995
  • min_pair_freq (int): Minimum frequency for pair merging. Default: 2000

Methods

  • load_corpus(path): Load training corpus from a text file
  • train(): Train the BPE model on loaded corpus
  • save(model_path, vocab_path): Save trained model and vocabulary
  • destroy(): Release trainer resources

UnigramTrainer

Constructor Parameters

  • vocab_size (int): Target vocabulary size. Default: 32000
  • character_coverage (float): Character coverage ratio (0.0-1.0). Default: 0.9995
  • max_sentencepiece_length (int): Maximum length of sentence pieces. Default: 16
  • seed_size (int): Initial seed vocabulary size. Default: 1000000

Methods

  • load_corpus(path): Load training corpus from a text file
  • train(num_iterations): Train the Unigram model using EM algorithm
  • save(vocab_path): Save trained vocabulary
  • destroy(): Release trainer resources

C/C++ CLI Usage

ShredWord also provides a command-line interface for training directly without Python.

Compilation

Windows:

g++ -o trainer.exe trainer.cpp bpe/bpe.cpp bpe/histogram.cpp bpe/hash.cpp bpe/heap.cpp unigram/unigram.cpp unigram/heap.cpp unigram/cache.cpp unigram/hashmap.cpp unigram/subword.cpp trie.cpp -I. -std=c++11

Linux:

g++ -o trainer trainer.cpp bpe/bpe.cpp bpe/histogram.cpp bpe/hash.cpp bpe/heap.cpp unigram/unigram.cpp unigram/heap.cpp unigram/cache.cpp unigram/hashmap.cpp unigram/subword.cpp trie.cpp -I. -std=c++11

Training with CLI

BPE:

trainer.exe input=corpus.txt model_type=bpe output_model=model.bin output_vocab=vocab.txt vocab_size=32000

Unigram:

trainer.exe input=corpus.txt model_type=unigram output_model=model.bin output_vocab=vocab.txt vocab_size=32000 num_iterations=10
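
If you want to drive the CLI from Python, for example to sweep vocabulary sizes, a minimal sketch using the key=value flags shown above (the binary name follows the compilation step; on Linux, invoke ./trainer instead of trainer.exe):

import subprocess

# run one BPE training job; flags mirror the CLI examples above
subprocess.run(
  ["trainer.exe",
   "input=corpus.txt",
   "model_type=bpe",
   "output_model=model.bin",
   "output_vocab=vocab.txt",
   "vocab_size=32000"],
  check=True,  # raise if the trainer exits with an error
)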

Advanced Features

Error Handling

from shredword.trainer import BPETrainer

trainer = None
try:
  trainer = BPETrainer(vocab_size=10000)
  trainer.load_corpus("corpus.txt")
  trainer.train()
  trainer.save("model.model", "vocab.vocab")
except RuntimeError as e:
  print(f"Training error: {e}")
except IOError as e:
  print(f"File error: {e}")
finally:
  if trainer is not None:  # the constructor itself may have raised
    trainer.destroy()

Resource Management

Always call destroy() to properly clean up resources, or use the context manager pattern for automatic cleanup:

with BPETrainer(vocab_size=16000) as trainer:
  trainer.load_corpus("data.txt")
  trainer.train()
  trainer.save("model.model", "vocab.vocab")

Configuration Guidelines

Vocabulary Size

  • Small models: 8,192 - 16,384
  • Medium models: 32,000 - 50,000
  • Large models: 50,000 - 100,000

Character Coverage

  • English: 0.995 - 0.999
  • Multilingual: 0.9995 - 1.0

Minimum Pair Frequency (BPE)

  • Small corpus: 100 - 1,000
  • Large corpus: 2,000 - 10,000

EM Iterations (Unigram)

  • Quick training: 5 - 8
  • Standard training: 10 - 15
  • High quality: 15 - 20
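
Putting these guidelines together, a medium multilingual setup might look like the following. The parameter names are the documented ones from the API overview above; the specific values are illustrative, not prescriptive.

from shredword.trainer import BPETrainer, UnigramTrainer

# medium multilingual BPE model trained on a large corpus
bpe = BPETrainer(
  vocab_size=32000,           # medium-model range
  character_coverage=0.9995,  # multilingual coverage
  min_pair_freq=2000          # large-corpus range
)

# matching Unigram model; pair with train(num_iterations=10)
# for the standard-training range
unigram = UnigramTrainer(
  vocab_size=32000,
  character_coverage=0.9995
)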

File Formats

Input Corpus

  • Plain text files
  • UTF-8 encoding recommended
  • One sentence per line (typical); see the snippet below
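
Preparing such a corpus is straightforward; a minimal snippet (the sentences and path are placeholders):

# write a UTF-8 corpus with one sentence per line
sentences = ["Hello world.", "Tokenizers split text into subwords."]
with open("data/corpus.txt", "w", encoding="utf-8") as f:
  f.write("\n".join(sentences) + "\n")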

Output Files

  • Model file (.model/.bin): Contains merge operations (BPE) or metadata (Unigram)
  • Vocabulary file (.vocab/.txt): Contains the vocabulary mapping; a hypothetical loader is sketched below
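
The exact on-disk layout of the vocabulary file is not specified here. Assuming a plain-text file with one token per line and the line number as its ID (a common convention, not a confirmed detail of ShredWord's format), a loader could look like:

def load_vocab(path):
  # hypothetical reader: one token per line, ID = line number;
  # adjust if the file also carries scores or frequencies
  with open(path, encoding="utf-8") as f:
    return {line.rstrip("\n"): idx for idx, line in enumerate(f)}

vocab = load_vocab("model/bpe.vocab")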

Documentation

For detailed documentation on both the BPE and Unigram trainers, including API references, configuration parameters, and troubleshooting guides, refer to the project's documentation.

Known Limitations

  • Unigram implementation is currently under development and may not function as expected
  • Maximum corpus size may be limited by available system memory
  • The CLI has a maximum input-text limit for Unigram training

Project Information

A project by Shivendra
