
FIBpeTokenizer 🚀

A blazing fast Byte Pair Encoding (BPE) tokenizer library written in Rust with Python bindings.

Features ✨

  • 🔥 Blazing Fast: Written in Rust with parallel processing support
  • 🐍 Python Support: Use it in Python via PyO3 bindings
  • 🎯 Flexible Pre-tokenization: Choose between whitespace or punctuation-based splitting
  • 🔖 Special Token Handling: Built-in support for special tokens like <pad>, <mask>, etc.
  • 💾 Save/Load Models: Train once, reuse anywhere
  • 🔧 Customizable: Configure vocabulary size, special tokens, and more

Installation

Rust

Add this to your Cargo.toml:

[dependencies]
fibpetokenizer = "0.1.0"

Python

pip install fibpetokenizer

Quick Start

Rust Usage

use fibpetokenizer::{BpeTokenizer, PreTokenization, SpecialTokenRemovalMethod};

fn main() {
    // Define special tokens
    let special_tokens = vec![
        "<pad>".to_string(),
        "<mask>".to_string(),
        "<unk>".to_string()
    ];

    // Create and train tokenizer
    let mut tokenizer = BpeTokenizer::new(
        "corpus.txt",                           // Input text file
        10000,                                   // Target vocabulary size
        PreTokenization::Punctuation,           // Pre-tokenization strategy
        special_tokens,                          // Special tokens
        SpecialTokenRemovalMethod::AhoCorasick, // Special token removal method
        true,                                    // Save model after training
        Some("output_dir")                      // Output directory
    );

    // Train the tokenizer
    tokenizer.train().unwrap();

    // Encode text
    let text = "Hello, world! This is a test.";
    let encoder = tokenizer.encode(text).unwrap();
    
    println!("Tokens: {:?}", encoder.tokens);
    println!("Token IDs: {:?}", encoder.ids);
    println!("Token Types: {:?}", encoder.token_types);

    // Decode back to text
    let decoded = tokenizer.decode(&encoder.ids).unwrap();
    println!("Decoded: {}", decoded);

    // Load a pretrained tokenizer
    let loaded_tokenizer = BpeTokenizer::new_from_pretrained("output_dir");
}

Python Usage

from fibpetokenizer import (
    BpeTokenizer,
    PreTokenization,
    SpecialTokenRemovalMethod
)

# Define special tokens
special_tokens = ["<pad>", "<mask>", "<unk>"]

# Create tokenizer
tokenizer = BpeTokenizer(
    input_path="corpus.txt",
    target_vocab_size=10000,
    pretokenization_type=PreTokenization.punctuation(),
    special_tokens=special_tokens,
    special_token_removal_method=SpecialTokenRemovalMethod.aho_corasick(),
    save_model=True,
    output_dir="output_dir"
)

# Train the tokenizer
tokenizer.train()

# Encode text
text = "Hello, world! This is a test."
encoder = tokenizer.encode(text)

print("Tokens:", encoder.tokens)
print("Token IDs:", encoder.ids)
print("Token Types:", encoder.token_types)

# Decode back to text
decoded = tokenizer.decode(encoder.ids)
print("Decoded:", decoded)

# Load a pretrained tokenizer
loaded_tokenizer = BpeTokenizer.from_pretrained("output_dir")

API Reference

BpeTokenizer

The main tokenizer class.

Constructor

BpeTokenizer::new(
    input_path: &str,
    target_vocab_size: usize,
    pretokenization_type: PreTokenization,
    special_tokens: Vec<String>,
    special_token_removal_method: SpecialTokenRemovalMethod,
    save_model: bool,
    output_dir: Option<&str>
) -> Self

Methods

  • train(&mut self) -> Result<(), TokenizerError>: Train the tokenizer on the corpus
  • encode(&self, text: &str) -> Result<Encoder, TokenizerError>: Encode text into tokens and IDs
  • decode(&self, ids: &Vec<u32>) -> Result<String, TokenizerError>: Decode token IDs back to text
  • new_from_pretrained(files_path: &str) -> Self: Load a pretrained tokenizer
  • get_id_by_token(&self, token: String) -> Result<u32, TokenizerError>: Get ID for a token
  • get_token_by_id(&self, id: u32) -> Result<String, TokenizerError>: Get token for an ID

PreTokenization

Pre-tokenization strategies:

  • PreTokenization::Whitespace: Split on whitespace
  • PreTokenization::Punctuation: Split on whitespace and punctuation
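
To illustrate the difference between the two strategies, here is a minimal Python sketch (not the library's actual implementation, which is written in Rust):

```python
import re

def pretokenize_whitespace(text):
    # Whitespace strategy: split on runs of whitespace only;
    # punctuation stays attached to the adjacent word.
    return text.split()

def pretokenize_punctuation(text):
    # Punctuation strategy: split on whitespace AND treat each
    # punctuation mark as its own pre-token.
    return re.findall(r"\w+|[^\w\s]", text)

print(pretokenize_whitespace("Hello, world!"))   # ['Hello,', 'world!']
print(pretokenize_punctuation("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

With the whitespace strategy, `Hello,` and `Hello` would be distinct pre-tokens; the punctuation strategy keeps word and punctuation vocabulary separate.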

SpecialTokenRemovalMethod

Methods for removing special tokens from the training corpus:

  • SpecialTokenRemovalMethod::Simple: Simple string replacement
  • SpecialTokenRemovalMethod::AhoCorasick: Fast multi-pattern search using Aho-Corasick algorithm
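
Conceptually, the Simple method scans the corpus once per special token, while Aho-Corasick matches every token in a single pass over the text. The following Python sketch illustrates the trade-off, using a regex alternation as a stand-in for the single-pass automaton (the library itself uses the Aho-Corasick algorithm in Rust):

```python
import re

special_tokens = ["<pad>", "<mask>", "<unk>"]

def remove_simple(text, tokens):
    # "Simple": one sequential replacement pass per special token,
    # so the cost grows with the number of tokens.
    for tok in tokens:
        text = text.replace(tok, "")
    return text

def remove_single_pass(text, tokens):
    # Single-pass multi-pattern removal; an Aho-Corasick automaton
    # achieves the same result in time linear in the text length,
    # largely independent of the number of patterns.
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    return pattern.sub("", text)

corpus = "a <pad> b <mask> c <unk>"
assert remove_simple(corpus, special_tokens) == remove_single_pass(corpus, special_tokens)
```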

Encoder

The result of encoding text.

Fields

  • original_text: String: The original input text
  • tokens: Vec<String>: The tokenized representation
  • ids: Vec<u32>: Token IDs
  • token_types: Vec<TokenType>: Type of each token (WORD, SUBWORD, or SPECIALTOKEN)

Methods

  • get_token_type(&self, token: &str) -> Result<TokenType, TokenizerError>: Get the type of a specific token

How It Works

BPE (Byte Pair Encoding) originated as a data compression technique that repeatedly replaces the most frequent pair of adjacent bytes (or characters) with a new symbol; applied to text, the merged symbols form a subword vocabulary. This library:

  1. Pre-tokenizes the input text based on the selected strategy
  2. Builds an initial vocabulary from individual characters
  3. Iteratively merges the most frequent adjacent token pairs
  4. Stops when the target vocabulary size is reached
  5. Saves the vocabulary, merge rules, and configuration for later use
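
The merge loop in steps 2-4 can be sketched in a few lines of Python (a toy trainer over pre-token counts, not the library's parallel Rust implementation):

```python
from collections import Counter

def train_bpe(words, target_vocab_size):
    """Toy BPE trainer; `words` maps each pre-token to its corpus count."""
    # Step 2: the initial vocabulary is the set of individual characters.
    splits = {w: list(w) for w in words}
    vocab = {c for w in words for c in w}
    merges = []
    # Steps 3-4: merge the most frequent adjacent pair until target size.
    while len(vocab) < target_vocab_size:
        pairs = Counter()
        for w, freq in words.items():
            symbols = splits[w]
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Apply the merge to every pre-token's current split.
        for w, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[w] = merged
    return vocab, merges

# Classic example corpus: the first merges learned are ("e","s") and ("es","t").
vocab, merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 15)
```

At encoding time, the saved merge rules are replayed in order on each pre-token, which is why step 5 persists the merges alongside the vocabulary.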

Performance

  • Parallel processing using Rayon for fast training
  • Efficient special token removal using Aho-Corasick algorithm
  • Optimized data structures for merge operations

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under either of:

at your option.

Credits

Developed with ❤️ using Rust and PyO3.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fibpetokenizer-0.1.0.tar.gz (26.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fibpetokenizer-0.1.0-cp312-cp312-win_amd64.whl (423.2 kB view details)

Uploaded: CPython 3.12, Windows x86-64

File details

Details for the file fibpetokenizer-0.1.0.tar.gz.

File metadata

  • Download URL: fibpetokenizer-0.1.0.tar.gz
  • Upload date:
  • Size: 26.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.6

File hashes

Hashes for fibpetokenizer-0.1.0.tar.gz:

  • SHA256: 5214e2c2b016bb156791f27794c2be256210cdfe90bbe1d132b0137a7cbc8800
  • MD5: 1a38995ca2792d7678e0b0206e5a09ed
  • BLAKE2b-256: 84c98fd0d7221e9345a7c1bd20e1bd168a3aabae1b7fdc515f111b377ff86afe

See more details on using hashes here.

File details

Details for the file fibpetokenizer-0.1.0-cp312-cp312-win_amd64.whl.

File hashes

Hashes for fibpetokenizer-0.1.0-cp312-cp312-win_amd64.whl:

  • SHA256: c6461977f48efe405edd04a2f9761720165ae53d743b55a2ebde4984a8b19872
  • MD5: f4d4bee640ea8043584566b2070d64e2
  • BLAKE2b-256: 2f511a39d31a9cbbe05e630909a886c111b62eff902a7c9691557d94c69f59db

See more details on using hashes here.
