
FIBpeTokenizer 🚀

A blazing fast Byte Pair Encoding (BPE) tokenizer library written in Rust with Python bindings.

Features ✨

  • 🔥 Blazing Fast: Written in Rust with parallel processing support
  • 🐍 Python Support: Use it in Python via PyO3 bindings
  • 🎯 Flexible Pre-tokenization: Choose between whitespace or punctuation-based splitting
  • 🔖 Special Token Handling: Built-in support for special tokens like <pad>, <mask>, etc.
  • 💾 Save/Load Models: Train once, reuse anywhere
  • 🔧 Customizable: Configure vocabulary size, special tokens, and more

Installation

Rust

Add this to your Cargo.toml:

[dependencies]
fibpetokenizer = "0.1.0"

Python

pip install fibpetokenizer

Quick Start

Rust Usage

use fibpetokenizer::{BpeTokenizer, PreTokenization, SpecialTokenRemovalMethod};

fn main() {
    // Define special tokens
    let special_tokens = vec![
        "<pad>".to_string(),
        "<mask>".to_string(),
        "<unk>".to_string()
    ];

    // Create and train tokenizer
    let mut tokenizer = BpeTokenizer::new(
        "corpus.txt",                           // Input text file
        10000,                                   // Target vocabulary size
        PreTokenization::Punctuation,           // Pre-tokenization strategy
        special_tokens,                          // Special tokens
        SpecialTokenRemovalMethod::AhoCorasick, // Special token removal method
        true,                                    // Save model after training
        Some("output_dir")                      // Output directory
    );

    // Train the tokenizer
    tokenizer.train().unwrap();

    // Encode text
    let text = "Hello, world! This is a test.";
    let encoder = tokenizer.encode(text).unwrap();
    
    println!("Tokens: {:?}", encoder.tokens);
    println!("Token IDs: {:?}", encoder.ids);
    println!("Token Types: {:?}", encoder.token_types);

    // Decode back to text
    let decoded = tokenizer.decode(&encoder.ids).unwrap();
    println!("Decoded: {}", decoded);

    // Load a pretrained tokenizer
    let loaded_tokenizer = BpeTokenizer::new_from_pretrained("output_dir");
}

Python Usage

from fibpetokenizer import (
    BpeTokenizer,
    PreTokenization,
    SpecialTokenRemovalMethod
)

# Define special tokens
special_tokens = ["<pad>", "<mask>", "<unk>"]

# Create tokenizer
tokenizer = BpeTokenizer(
    input_path="corpus.txt",
    target_vocab_size=10000,
    pretokenization_type=PreTokenization.punctuation(),
    special_tokens=special_tokens,
    special_token_removal_method=SpecialTokenRemovalMethod.aho_corasick(),
    save_model=True,
    output_dir="output_dir"
)

# Train the tokenizer
tokenizer.train()

# Encode text
text = "Hello, world! This is a test."
encoder = tokenizer.encode(text)

print("Tokens:", encoder.tokens)
print("Token IDs:", encoder.ids)
print("Token Types:", encoder.token_types)

# Decode back to text
decoded = tokenizer.decode(encoder.ids)
print("Decoded:", decoded)

# Load a pretrained tokenizer
loaded_tokenizer = BpeTokenizer.from_pretrained("output_dir")

API Reference

BpeTokenizer

The main tokenizer class.

Constructor

BpeTokenizer::new(
    input_path: &str,
    target_vocab_size: usize,
    pretokenization_type: PreTokenization,
    special_tokens: Vec<String>,
    special_token_removal_method: SpecialTokenRemovalMethod,
    save_model: bool,
    output_dir: Option<&str>
) -> Self

Methods

  • train(&mut self) -> Result<(), TokenizerError>: Train the tokenizer on the corpus
  • encode(&self, text: &str) -> Result<Encoder, TokenizerError>: Encode text into tokens and IDs
  • decode(&self, ids: &Vec<u32>) -> Result<String, TokenizerError>: Decode token IDs back to text
  • new_from_pretrained(files_path: &str) -> Self: Load a pretrained tokenizer
  • get_id_by_token(&self, token: String) -> Result<u32, TokenizerError>: Get ID for a token
  • get_token_by_id(&self, id: u32) -> Result<String, TokenizerError>: Get token for an ID

PreTokenization

Pre-tokenization strategies:

  • PreTokenization::Whitespace: Split on whitespace
  • PreTokenization::Punctuation: Split on whitespace and punctuation
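
To illustrate the difference between the two strategies, here is a minimal Python sketch (not the library's actual implementation, which is written in Rust):

```python
import re

def pretokenize_whitespace(text):
    # Whitespace strategy: split on runs of whitespace only;
    # punctuation stays attached to the adjacent word.
    return text.split()

def pretokenize_punctuation(text):
    # Punctuation strategy: split on whitespace AND treat each
    # punctuation mark as its own pre-token.
    return re.findall(r"\w+|[^\w\s]", text)

print(pretokenize_whitespace("Hello, world!"))   # ['Hello,', 'world!']
print(pretokenize_punctuation("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

With the whitespace strategy, `Hello,` and `Hello` would be distinct pre-tokens; the punctuation strategy keeps word and punctuation vocabulary separate.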

SpecialTokenRemovalMethod

Methods for removing special tokens from the training corpus:

  • SpecialTokenRemovalMethod::Simple: Simple string replacement
  • SpecialTokenRemovalMethod::AhoCorasick: Fast multi-pattern search using Aho-Corasick algorithm
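
Conceptually, the Simple method scans the corpus once per special token, while Aho-Corasick matches every token in a single pass over the text. The following Python sketch illustrates the trade-off, using a regex alternation as a stand-in for the single-pass automaton (the library itself uses the Aho-Corasick algorithm in Rust):

```python
import re

special_tokens = ["<pad>", "<mask>", "<unk>"]

def remove_simple(text, tokens):
    # "Simple": one sequential replacement pass per special token,
    # so the cost grows with the number of tokens.
    for tok in tokens:
        text = text.replace(tok, "")
    return text

def remove_single_pass(text, tokens):
    # Single-pass multi-pattern removal; an Aho-Corasick automaton
    # achieves the same result in time linear in the text length,
    # largely independent of the number of patterns.
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    return pattern.sub("", text)

corpus = "a <pad> b <mask> c <unk>"
assert remove_simple(corpus, special_tokens) == remove_single_pass(corpus, special_tokens)
```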

Encoder

The result of encoding text.

Fields

  • original_text: String: The original input text
  • tokens: Vec<String>: The tokenized representation
  • ids: Vec<u32>: Token IDs
  • token_types: Vec<TokenType>: Type of each token (WORD, SUBWORD, or SPECIALTOKEN)

Methods

  • get_token_type(&self, token: &str) -> Result<TokenType, TokenizerError>: Get the type of a specific token

How It Works

BPE (Byte Pair Encoding) originated as a data compression technique that repeatedly replaces the most frequent pair of adjacent bytes (or characters) with a new symbol; applied to text, the merged symbols form a subword vocabulary. This library:

  1. Pre-tokenizes the input text based on the selected strategy
  2. Builds an initial vocabulary from individual characters
  3. Iteratively merges the most frequent adjacent token pairs
  4. Stops when the target vocabulary size is reached
  5. Saves the vocabulary, merge rules, and configuration for later use
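
The merge loop in steps 2-4 can be sketched in a few lines of Python (a toy trainer over pre-token counts, not the library's parallel Rust implementation):

```python
from collections import Counter

def train_bpe(words, target_vocab_size):
    """Toy BPE trainer; `words` maps each pre-token to its corpus count."""
    # Step 2: the initial vocabulary is the set of individual characters.
    splits = {w: list(w) for w in words}
    vocab = {c for w in words for c in w}
    merges = []
    # Steps 3-4: merge the most frequent adjacent pair until target size.
    while len(vocab) < target_vocab_size:
        pairs = Counter()
        for w, freq in words.items():
            symbols = splits[w]
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Apply the merge to every pre-token's current split.
        for w, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[w] = merged
    return vocab, merges

# Classic example corpus: the first merges learned are ("e","s") and ("es","t").
vocab, merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 15)
```

At encoding time, the saved merge rules are replayed in order on each pre-token, which is why step 5 persists the merges alongside the vocabulary.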

Performance

  • Parallel processing using Rayon for fast training
  • Efficient special token removal using Aho-Corasick algorithm
  • Optimized data structures for merge operations

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under either of:

at your option.

Credits

Developed with ❤️ using Rust and PyO3.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fibpetokenizer-0.1.0.tar.gz (26.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fibpetokenizer-0.1.0-cp312-cp312-win_amd64.whl (423.2 kB view details)

Uploaded: CPython 3.12, Windows x86-64

File details

Details for the file fibpetokenizer-0.1.0.tar.gz.

File metadata

  • Download URL: fibpetokenizer-0.1.0.tar.gz
  • Upload date:
  • Size: 26.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.6

File hashes

Hashes for fibpetokenizer-0.1.0.tar.gz:

  • SHA256: 5214e2c2b016bb156791f27794c2be256210cdfe90bbe1d132b0137a7cbc8800
  • MD5: 1a38995ca2792d7678e0b0206e5a09ed
  • BLAKE2b-256: 84c98fd0d7221e9345a7c1bd20e1bd168a3aabae1b7fdc515f111b377ff86afe

See more details on using hashes here.

File details

Details for the file fibpetokenizer-0.1.0-cp312-cp312-win_amd64.whl.

File hashes

Hashes for fibpetokenizer-0.1.0-cp312-cp312-win_amd64.whl:

  • SHA256: c6461977f48efe405edd04a2f9761720165ae53d743b55a2ebde4984a8b19872
  • MD5: f4d4bee640ea8043584566b2070d64e2
  • BLAKE2b-256: 2f511a39d31a9cbbe05e630909a886c111b62eff902a7c9691557d94c69f59db

See more details on using hashes here.
