# FIBpeTokenizer 🚀
A blazing fast Byte Pair Encoding (BPE) tokenizer library written in Rust with Python bindings.
## Features ✨

- 🔥 Blazing Fast: Written in Rust with parallel processing support
- 🐍 Python Support: Use it in Python via PyO3 bindings
- 🎯 Flexible Pre-tokenization: Choose between whitespace- or punctuation-based splitting
- 🔖 Special Token Handling: Built-in support for special tokens like `<pad>`, `<mask>`, etc.
- 💾 Save/Load Models: Train once, reuse anywhere
- 🔧 Customizable: Configure vocabulary size, special tokens, and more
## Installation

### Rust

Add this to your `Cargo.toml`:

```toml
[dependencies]
fibpetokenizer = "0.1.0"
```

### Python

```bash
pip install fibpetokenizer
```
## Quick Start

### Rust Usage

```rust
use fibpetokenizer::{BpeTokenizer, PreTokenization, SpecialTokenRemovalMethod};

fn main() {
    // Define special tokens
    let special_tokens = vec![
        "<pad>".to_string(),
        "<mask>".to_string(),
        "<unk>".to_string(),
    ];

    // Create and train tokenizer
    let mut tokenizer = BpeTokenizer::new(
        "corpus.txt",                           // Input text file
        10000,                                  // Target vocabulary size
        PreTokenization::Punctuation,           // Pre-tokenization strategy
        special_tokens,                         // Special tokens
        SpecialTokenRemovalMethod::AhoCorasick, // Special token removal method
        true,                                   // Save model after training
        Some("output_dir"),                     // Output directory
    );

    // Train the tokenizer
    tokenizer.train().unwrap();

    // Encode text
    let text = "Hello, world! This is a test.";
    let encoder = tokenizer.encode(text).unwrap();
    println!("Tokens: {:?}", encoder.tokens);
    println!("Token IDs: {:?}", encoder.ids);
    println!("Token Types: {:?}", encoder.token_types);

    // Decode back to text
    let decoded = tokenizer.decode(&encoder.ids).unwrap();
    println!("Decoded: {}", decoded);

    // Load a pretrained tokenizer
    let loaded_tokenizer = BpeTokenizer::new_from_pretrained("output_dir");
}
```
### Python Usage

```python
from fibpetokenizer import (
    BpeTokenizer,
    PreTokenization,
    SpecialTokenRemovalMethod,
)

# Define special tokens
special_tokens = ["<pad>", "<mask>", "<unk>"]

# Create tokenizer
tokenizer = BpeTokenizer(
    input_path="corpus.txt",
    target_vocab_size=10000,
    pretokenization_type=PreTokenization.punctuation(),
    special_tokens=special_tokens,
    special_token_removal_method=SpecialTokenRemovalMethod.aho_corasick(),
    save_model=True,
    output_dir="output_dir",
)

# Train the tokenizer
tokenizer.train()

# Encode text
text = "Hello, world! This is a test."
encoder = tokenizer.encode(text)
print("Tokens:", encoder.tokens)
print("Token IDs:", encoder.ids)
print("Token Types:", encoder.token_types)

# Decode back to text
decoded = tokenizer.decode(encoder.ids)
print("Decoded:", decoded)

# Load a pretrained tokenizer
loaded_tokenizer = BpeTokenizer.from_pretrained("output_dir")
```
## API Reference

### BpeTokenizer

The main tokenizer class.

#### Constructor

```rust
BpeTokenizer::new(
    input_path: &str,
    target_vocab_size: usize,
    pretokenization_type: PreTokenization,
    special_tokens: Vec<String>,
    special_token_removal_method: SpecialTokenRemovalMethod,
    save_model: bool,
    output_dir: Option<&str>
) -> Self
```
#### Methods

- `train(&mut self) -> Result<(), TokenizerError>`: Train the tokenizer on the corpus
- `encode(&self, text: &str) -> Result<Encoder, TokenizerError>`: Encode text into tokens and IDs
- `decode(&self, ids: &Vec<u32>) -> Result<String, TokenizerError>`: Decode token IDs back to text
- `new_from_pretrained(files_path: &str) -> Self`: Load a pretrained tokenizer
- `get_id_by_token(&self, token: String) -> Result<u32, TokenizerError>`: Get the ID for a token
- `get_token_by_id(&self, id: u32) -> Result<String, TokenizerError>`: Get the token for an ID
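The two vocabulary lookup methods are inverses of each other. A minimal sketch, continuing the Quick Start example and assuming the token `hello` actually ended up in the learned vocabulary:

```rust
// Hypothetical continuation of the Quick Start example; assumes
// "hello" exists in the trained vocabulary.
let id = tokenizer.get_id_by_token("hello".to_string()).unwrap();
let token = tokenizer.get_token_by_id(id).unwrap();
assert_eq!(token, "hello"); // round-trips through the vocabulary
```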
### PreTokenization

Pre-tokenization strategies:

- `PreTokenization::Whitespace`: Split on whitespace
- `PreTokenization::Punctuation`: Split on whitespace and punctuation
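For example, on the input `Hello, world!`, whitespace splitting should yield the chunks `Hello,` and `world!`, while punctuation-based splitting would additionally separate the comma and exclamation mark into their own chunks (the exact boundaries depend on the implementation).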
### SpecialTokenRemovalMethod

Methods for removing special tokens from the training corpus:

- `SpecialTokenRemovalMethod::Simple`: Simple string replacement
- `SpecialTokenRemovalMethod::AhoCorasick`: Fast multi-pattern search using the Aho-Corasick algorithm
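The Aho-Corasick variant scans the corpus once for all special tokens at the same time, instead of making one replacement pass per token. A minimal sketch of the idea using the `aho-corasick` crate (version 1.x assumed; the library's internal implementation may differ):

```rust
use aho_corasick::AhoCorasick;

/// Remove every occurrence of the given special tokens in a single pass.
fn strip_special_tokens(text: &str, special_tokens: &[&str]) -> String {
    // Build one automaton over all patterns.
    let ac = AhoCorasick::new(special_tokens).unwrap();
    // Replace each match with the empty string.
    let replacements = vec![""; special_tokens.len()];
    ac.replace_all(text, &replacements)
}

fn main() {
    let cleaned = strip_special_tokens(
        "<pad>Hello <mask>world!",
        &["<pad>", "<mask>", "<unk>"],
    );
    assert_eq!(cleaned, "Hello world!");
}
```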
### Encoder

The result of encoding text.

#### Fields

- `original_text: String`: The original input text
- `tokens: Vec<String>`: The tokenized representation
- `ids: Vec<u32>`: Token IDs
- `token_types: Vec<TokenType>`: The type of each token (WORD, SUBWORD, or SPECIALTOKEN)
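Since `tokens` and `token_types` are parallel vectors, they can be zipped together. A hypothetical continuation of the Quick Start example, assuming `TokenType` implements `Debug`:

```rust
// Pair each token with its type; assumes TokenType: Debug.
for (token, ty) in encoder.tokens.iter().zip(encoder.token_types.iter()) {
    println!("{token}: {ty:?}"); // prints WORD, SUBWORD, or SPECIALTOKEN per token
}
```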
#### Methods

- `get_token_type(&self, token: &str) -> Result<TokenType, TokenizerError>`: Get the type of a specific token
## How It Works

BPE (Byte Pair Encoding) is a data compression technique that iteratively merges the most frequent pair of bytes (or characters) in a sequence. This library:

1. Pre-tokenizes the input text based on the selected strategy
2. Builds an initial vocabulary from individual characters
3. Iteratively merges the most frequent adjacent token pairs
4. Stops when the target vocabulary size is reached
5. Saves the vocabulary, merge rules, and configuration for later use
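A stripped-down sketch of steps 2-4 (the core merge loop) is shown below. This is illustrative only, not the library's actual implementation, which adds parallelism, special-token handling, and persistence:

```rust
use std::collections::HashMap;

/// Find the most frequent adjacent token pair across all words,
/// or None if no word has two or more tokens left.
fn most_frequent_pair(words: &[Vec<String>]) -> Option<(String, String)> {
    let mut counts: HashMap<(String, String), usize> = HashMap::new();
    for word in words {
        for pair in word.windows(2) {
            *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
        }
    }
    counts.into_iter().max_by_key(|&(_, count)| count).map(|(pair, _)| pair)
}

/// Replace every occurrence of `pair` in each word with the merged token.
fn apply_merge(words: &mut [Vec<String>], pair: &(String, String)) {
    let merged = format!("{}{}", pair.0, pair.1);
    for word in words.iter_mut() {
        let mut out = Vec::with_capacity(word.len());
        let mut i = 0;
        while i < word.len() {
            if i + 1 < word.len() && word[i] == pair.0 && word[i + 1] == pair.1 {
                out.push(merged.clone());
                i += 2; // consume both halves of the merged pair
            } else {
                out.push(word[i].clone());
                i += 1;
            }
        }
        *word = out;
    }
}

fn main() {
    // Step 2: initial vocabulary = individual characters of each word.
    let mut words: Vec<Vec<String>> = ["low", "lower", "lowest"]
        .iter()
        .map(|w| w.chars().map(|c| c.to_string()).collect())
        .collect();

    // Steps 3-4: merge until a budget is exhausted (a real trainer
    // stops once the target vocabulary size is reached).
    let merge_budget = 3;
    for _ in 0..merge_budget {
        match most_frequent_pair(&words) {
            Some(pair) => apply_merge(&mut words, &pair),
            None => break,
        }
    }
    println!("{:?}", words); // [["low"], ["lowe", "r"], ["lowe", "s", "t"]]
}
```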
## Performance

- Parallel processing using Rayon for fast training
- Efficient special-token removal using the Aho-Corasick algorithm
- Optimized data structures for merge operations
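Pair counting, for instance, parallelizes naturally: each word's pairs can be counted independently and the partial maps merged. A hedged sketch of that pattern with Rayon (not the library's actual code):

```rust
use rayon::prelude::*;
use std::collections::HashMap;

/// Count adjacent token pairs across all words in parallel,
/// merging per-thread partial counts with `reduce`.
fn count_pairs(words: &[Vec<String>]) -> HashMap<(String, String), usize> {
    words
        .par_iter()
        .map(|word| {
            // Count pairs within one word on whichever thread picks it up.
            let mut local: HashMap<(String, String), usize> = HashMap::new();
            for pair in word.windows(2) {
                *local.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
            }
            local
        })
        .reduce(HashMap::new, |mut a, b| {
            // Fold partial maps together.
            for (k, v) in b {
                *a.entry(k).or_insert(0) += v;
            }
            a
        })
}

fn main() {
    let words = vec![
        vec!["l".to_string(), "o".to_string(), "w".to_string()],
        vec!["l".to_string(), "o".to_string(), "w".to_string(), "e".to_string(), "r".to_string()],
    ];
    let counts = count_pairs(&words);
    println!("{:?}", counts); // ("l","o") and ("o","w") each appear twice
}
```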
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under either of:

- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)

at your option.

## Credits

Developed with ❤️ using Rust and PyO3.