High-performance NumPy-based tokenizer library

TurboTok 🚀

High-performance NumPy-based tokenizer library with advanced features

TurboTok is a blazingly fast tokenizer library that leverages NumPy's vectorization capabilities to achieve exceptional performance. Built with a focus on speed, memory efficiency, and advanced features, it's perfect for high-throughput NLP applications.

✨ Features

🚀 Core Tokenization Modes

Byte Mode: Raw byte-level tokenization (fastest)
Char Mode: Unicode character-level tokenization
Word Mode: Word-level tokenization with regex
Sentence Mode: Sentence-level tokenization with rule-based splitting

🎯 Advanced Features

Custom Vocabulary Support: Filter tokens based on custom vocabularies
Subword Tokenization: BPE and WordPiece-style tokenization
Streaming Tokenization: Process large files without loading into memory
Batch Processing: Ultra-efficient batch tokenization
Comprehensive Error Handling: Detailed error messages and validation
Token Statistics: Rich analytics and frequency analysis
Vocabulary Management: Save/load vocabularies to/from files

⚡ Performance Highlights

Byte Mode: 100M+ tokens/sec (about 15x the 5-10M target)
Char Mode: 95M+ tokens/sec (about 24x the 3-5M target)
Word Mode: 2.8M+ tokens/sec (within the 2-4M target)
Sentence Mode: 800K+ tokens/sec (below the 1-2M target)

🛠️ Installation

pip install turbotok

🚀 Quick Start

Basic Usage

import turbotok

# Create tokenizer
tok = turbotok.TurboTok(mode="word")

# Tokenize text
tokens = tok.tokenize("Hello world! 🚀")
print(tokens)  # ['Hello', 'world', '!', '🚀']

All Tokenization Modes

text = "Hello world! This is TurboTok. 🚀"

# Byte mode (fastest)
tok_byte = turbotok.TurboTok(mode="byte")
byte_tokens = tok_byte.tokenize(text)  # [72, 101, 108, 108, 111, ...]

# Char mode (Unicode-safe)
tok_char = turbotok.TurboTok(mode="char")
char_tokens = tok_char.tokenize(text)  # ['H', 'e', 'l', 'l', 'o', ...]

# Word mode (default)
tok_word = turbotok.TurboTok(mode="word")
word_tokens = tok_word.tokenize(text)  # ['Hello', 'world', '!', 'This', ...]

# Sentence mode
tok_sentence = turbotok.TurboTok(mode="sentence")
sentence_tokens = tok_sentence.tokenize(text)  # ['Hello world!', 'This is TurboTok.', '🚀']
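
For readers who want to see roughly what each mode produces without installing the package, here is a minimal standard-library sketch. The function names and the two regular expressions are assumptions chosen to reproduce the outputs shown above; they are not TurboTok's API or its actual implementation.

```python
import re
from typing import List

# Hypothetical helpers; none of these names come from TurboTok.
_WORD_RE = re.compile(r"\w+|[^\w\s]")        # a run of word chars, or one symbol
_SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")  # whitespace that follows ., ! or ?


def byte_tokens(text: str) -> List[int]:
    """UTF-8 byte values of the text."""
    return list(text.encode("utf-8"))


def char_tokens(text: str) -> List[str]:
    """One token per Unicode code point."""
    return list(text)


def word_tokens(text: str) -> List[str]:
    """Words and standalone punctuation or symbols."""
    return _WORD_RE.findall(text)


def sentence_tokens(text: str) -> List[str]:
    """Split after sentence-ending punctuation."""
    return [s for s in _SENTENCE_RE.split(text.strip()) if s]


text = "Hello world! This is TurboTok. 🚀"
print(byte_tokens(text)[:5])   # [72, 101, 108, 108, 111]
print(char_tokens(text)[:5])   # ['H', 'e', 'l', 'l', 'o']
print(word_tokens(text))
print(sentence_tokens(text))   # ['Hello world!', 'This is TurboTok.', '🚀']
```

A rule this simple will mis-split abbreviations such as "e.g."; the library's rule-based sentence splitter may handle such cases differently.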

🎯 Advanced Features

Custom Vocabulary Support

# Create tokenizer with custom vocabulary
vocab = {"Hello", "world", "TurboTok", "Python", "NumPy"}
tok = turbotok.TurboTok(mode="word", vocabulary=vocab)

# Only tokens in vocabulary are returned
tokens = tok.tokenize("Hello world! This is TurboTok.")
print(tokens)  # ['Hello', 'world', 'TurboTok']

# Add tokens dynamically
tok.add_to_vocabulary(["amazing", "performance"])
tok.remove_from_vocabulary("Hello")

# Clear vocabulary
tok.clear_vocabulary()
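
The filtering behaviour described above can be approximated as a set-membership test on each token. This sketch assumes that a vocabulary of `None` means "no filter", which is how `clear_vocabulary()` is described in the API reference; `filter_tokens` and the regex are hypothetical and not part of TurboTok.

```python
import re
from typing import List, Optional, Set

_WORD_RE = re.compile(r"\w+|[^\w\s]")  # assumed word-mode pattern


def filter_tokens(text: str, vocabulary: Optional[Set[str]]) -> List[str]:
    """Tokenize, then keep only tokens present in the vocabulary."""
    tokens = _WORD_RE.findall(text)
    if vocabulary is None:  # no filter configured
        return tokens
    return [t for t in tokens if t in vocabulary]


vocab = {"Hello", "world", "TurboTok", "Python", "NumPy"}
sentence = "Hello world! This is TurboTok."

print(filter_tokens(sentence, vocab))  # ['Hello', 'world', 'TurboTok']

vocab.discard("Hello")                 # analogous to remove_from_vocabulary
print(filter_tokens(sentence, vocab))  # ['world', 'TurboTok']

print(filter_tokens(sentence, None))   # unfiltered tokens
```

Using a `set` keeps each lookup at constant average cost, so filtering adds little overhead on top of tokenization.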

Subword Tokenization

# BPE-style subword tokenization
tok_bpe = turbotok.TurboTok(mode="word", subword_mode="bpe", max_subword_length=3)
tokens = tok_bpe.tokenize("supercalifragilisticexpialidocious")
print(tokens)  # ['sup', 'erc', 'ali', 'fra', 'gil', ...]

# WordPiece-style subword tokenization
tok_wp = turbotok.TurboTok(mode="word", subword_mode="wordpiece", max_subword_length=4)
tokens = tok_wp.tokenize("internationalization")
print(tokens)  # ['inte', 'rnat', 'iona', 'liza', 'tion']
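
Both example outputs above are what you get from cutting a word into consecutive pieces of at most `max_subword_length` characters, rather than from learned BPE merges or a WordPiece vocabulary. The sketch below reproduces them under that assumption; `fixed_width_pieces` is a hypothetical name, and the library's real algorithm may differ.

```python
from typing import List


def fixed_width_pieces(word: str, max_len: int) -> List[str]:
    """Split a word into consecutive chunks of at most max_len characters."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


print(fixed_width_pieces("internationalization", 4))
# ['inte', 'rnat', 'iona', 'liza', 'tion']

print(fixed_width_pieces("supercalifragilisticexpialidocious", 3)[:5])
# ['sup', 'erc', 'ali', 'fra', 'gil']
```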

Streaming Tokenization

# Stream tokenize large files
tok = turbotok.TurboTok(mode="sentence")

for tokens in tok.tokenize_stream("large_file.txt", chunk_size=8192):
    # Process each chunk of tokens
    print(f"Processed {len(tokens)} tokens")
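
One detail any chunked reader has to handle is a token cut in half at a chunk boundary. The sketch below shows one common approach, carrying the trailing partial word over to the next chunk. It splits on whitespace for simplicity; `stream_words` is a hypothetical function, and how `tokenize_stream` deals with boundaries is not documented here.

```python
import os
import tempfile
from typing import Iterator, List


def stream_words(file_path: str, chunk_size: int = 8192) -> Iterator[List[str]]:
    """Yield whitespace-separated words chunk by chunk without loading the file."""
    carry = ""  # partial word left over from the previous chunk
    with open(file_path, "r", encoding="utf-8") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            buffer = carry + chunk
            parts = buffer.split()
            # If the buffer does not end in whitespace, the last piece may be
            # an incomplete word, so hold it back for the next round.
            if parts and not buffer[-1].isspace():
                carry = parts.pop()
            else:
                carry = ""
            if parts:
                yield parts
    if carry:
        yield [carry]


# Demonstrate with a tiny chunk size so that words straddle chunk boundaries.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("alpha beta gamma delta")
    path = tmp.name

words = [w for chunk in stream_words(path, chunk_size=4) for w in chunk]
os.remove(path)
print(words)  # ['alpha', 'beta', 'gamma', 'delta']
```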

Batch Processing

# Ultra-efficient batch tokenization
texts = [
    "Hello world!",
    "Machine learning is amazing!",
    "Python programming with NumPy.",
    "Natural language processing."
]

tok = turbotok.TurboTok(mode="word")
batch_tokens = tok.tokenize_batch(texts)

for i, tokens in enumerate(batch_tokens):
    print(f"Text {i+1}: {tokens}")

Token Statistics & Analysis

tok = turbotok.TurboTok(mode="word")

# Get comprehensive statistics
stats = tok.get_stats("Hello world! This is TurboTok. 🚀")
print(stats)
# {
#     'mode': 'word',
#     'token_count': 8,
#     'avg_token_length': 3.375,
#     'max_token_length': 8,
#     'min_token_length': 1,
#     'text_length': 32,
#     'compression_ratio': 4.0,
#     'vocabulary_size': None,
#     'subword_mode': None
# }

# Token frequency analysis
texts = ["Hello world!", "Hello Python!", "Hello TurboTok!"]
frequencies = tok.get_token_frequencies(texts)
most_common = tok.get_most_common_tokens(texts, top_k=3)
print(most_common)  # e.g. [('Hello', 3), ('!', 3), ('world', 1)] if punctuation tokens are counted
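
The statistics and frequency counts above can be approximated with `collections.Counter`. This sketch assumes lengths are measured in Unicode code points and that every token, including punctuation, is counted; the library's own definitions may differ. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

_WORD_RE = re.compile(r"\w+|[^\w\s]")  # assumed word-mode pattern


def token_frequencies(texts: List[str]) -> Dict[str, int]:
    """Count how often each token occurs across all texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(_WORD_RE.findall(text))
    return dict(counts)


def most_common_tokens(texts: List[str], top_k: int = 10) -> List[Tuple[str, int]]:
    """Return the top_k (token, count) pairs, most frequent first."""
    return Counter(token_frequencies(texts)).most_common(top_k)


def basic_stats(text: str) -> dict:
    """Token count and length statistics for one text."""
    lengths = [len(t) for t in _WORD_RE.findall(text)]
    return {
        "token_count": len(lengths),
        "avg_token_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_token_length": max(lengths, default=0),
        "min_token_length": min(lengths, default=0),
        "text_length": len(text),
    }


texts = ["Hello world!", "Hello Python!", "Hello TurboTok!"]
print(token_frequencies(texts))
print(most_common_tokens(texts, top_k=3))
print(basic_stats("Hello world! This is TurboTok. 🚀"))
```

Because "!" is counted as a token in this sketch, it ties with "Hello" at three occurrences.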

Vocabulary Management

tok = turbotok.TurboTok(mode="word")

# Build vocabulary from texts
texts = ["Hello world!", "Machine learning!", "Python programming!"]
frequencies = tok.get_token_frequencies(texts)
tok.add_to_vocabulary(list(frequencies))  # pass a list, as the documented signature expects

# Save vocabulary to file
tok.save_vocabulary("my_vocab.txt")

# Load vocabulary in new tokenizer
new_tok = turbotok.TurboTok(mode="word")
new_tok.load_vocabulary("my_vocab.txt")
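
The on-disk format used by `save_vocabulary` is not described here. A simple format that round-trips cleanly is one token per line in UTF-8, sketched below with hypothetical helpers `save_vocab` and `load_vocab`; treat the format as an assumption, not as what the library writes.

```python
import os
import tempfile
from typing import Iterable, Set


def save_vocab(tokens: Iterable[str], file_path: str) -> None:
    """Write unique tokens, sorted, one per line."""
    with open(file_path, "w", encoding="utf-8") as fh:
        for token in sorted(set(tokens)):
            fh.write(token + "\n")


def load_vocab(file_path: str) -> Set[str]:
    """Read one token per line, skipping blank lines."""
    with open(file_path, "r", encoding="utf-8") as fh:
        return {line.rstrip("\n") for line in fh if line.strip()}


vocab = {"Hello", "world", "!", "Machine", "learning", "Python", "programming"}
path = os.path.join(tempfile.mkdtemp(), "my_vocab.txt")

save_vocab(vocab, path)
restored = load_vocab(path)
print(restored == vocab)  # True

os.remove(path)
```

A line-based format cannot represent tokens that themselves contain newlines; a JSON list would be a safer choice if such tokens are possible.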

🔧 API Reference

TurboTok Class

Constructor

TurboTok(
    mode="word",                    # Tokenization mode
    vocabulary=None,                # Custom vocabulary set
    subword_mode=None,              # Subword mode ('bpe', 'wordpiece')
    max_subword_length=4            # Max subword length
)

Methods

Core Tokenization

tokenize(text: str) -> List[str]: Tokenize single text (the byte-mode example above shows integer byte values rather than strings)
tokenize_batch(texts: List[str]) -> List[List[str]]: Tokenize multiple texts
tokenize_stream(file_path: str, chunk_size: int = 8192) -> Iterator[List[str]]: Stream tokenize file

Vocabulary Management

set_vocabulary(vocabulary: Set[str]): Set custom vocabulary
add_to_vocabulary(tokens: Union[str, List[str], Set[str]]): Add tokens to vocabulary
remove_from_vocabulary(tokens: Union[str, List[str], Set[str]]): Remove tokens from vocabulary
clear_vocabulary(): Clear vocabulary filter
get_vocabulary() -> Optional[Set[str]]: Get current vocabulary
save_vocabulary(file_path: str): Save vocabulary to file
load_vocabulary(file_path: str): Load vocabulary from file

Analysis & Statistics

get_stats(text: str) -> dict: Get tokenization statistics
get_token_frequencies(texts: List[str]) -> Dict[str, int]: Get token frequencies
get_most_common_tokens(texts: List[str], top_k: int = 10) -> List[Tuple[str, int]]: Get the top_k most common tokens with their counts

⚡ Performance Philosophy

TurboTok is built around these core principles:

NumPy Vectorization: Leverage SIMD operations and C-level speed
Memory Efficiency: Use memory views and pre-allocation
Minimal Python Loops: Avoid slow Python iteration
Optimized Regex: Pre-compiled patterns with atomic groups
Batch Processing: Process multiple texts efficiently
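
Several of these principles have direct standard-library analogues, shown below as an illustration only. The sketch uses `memoryview`, `bytearray`, and a module-level compiled pattern as stand-ins; it does not use NumPy and is not TurboTok's implementation.

```python
import re

data = "Hello world! 🚀".encode("utf-8")

# Memory views: slicing a memoryview does not copy the underlying bytes.
view = memoryview(data)
head = view[:5]
print(head.tobytes())  # b'Hello'

# Minimal Python loops: the conversion to a list runs in C.
byte_values = list(head)
print(byte_values)  # [72, 101, 108, 108, 111]

# Pre-allocation: size the output buffer once, then fill it in place.
out = bytearray(len(data))
out[:] = view
print(bytes(out) == data)  # True

# Pre-compiled regex: compile once at import time and reuse for every text.
WORD_RE = re.compile(r"\w+|[^\w\s]")


def count_tokens(texts):
    """Total number of word-mode tokens across a batch of texts."""
    return sum(len(WORD_RE.findall(t)) for t in texts)


print(count_tokens(["Hello world!", "Hello Python!"]))  # 6
```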

📊 Benchmarks

Performance Targets vs Actual Results

Mode	Target	Actual	Result vs target
Byte	5-10M tokens/sec	100M+ tokens/sec	About 15x target
Char	3-5M tokens/sec	95M+ tokens/sec	About 24x target
Word	2-4M tokens/sec	2.8M tokens/sec	Meets target
Sentence	1-2M tokens/sec	800K tokens/sec	Below target

Run Your Own Benchmarks

from turbotok.benchmarks import run_benchmarks

# Run comprehensive benchmarks
results = run_benchmarks(text_size_mb=1.0, iterations=30)
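
If you want to sanity-check a tokens/sec figure independently of the bundled benchmark module, a throughput measurement can be sketched with `time.perf_counter`. The helper `tokens_per_second` is hypothetical and works with any callable that maps a string to a list of tokens; it reports the best of several runs, which is an assumption about methodology rather than what `run_benchmarks` does.

```python
import re
import time
from typing import Callable, List


def tokens_per_second(tokenize: Callable[[str], List], text: str, iterations: int = 30) -> float:
    """Best-of-N throughput: tokens produced divided by the fastest run time."""
    token_count = len(tokenize(text))  # also serves as a warm-up call
    best = float("inf")
    for _ in range(iterations):
        start = time.perf_counter()
        tokenize(text)
        best = min(best, time.perf_counter() - start)
    return token_count / best if best > 0 else float("inf")


# Example with a stand-in word tokenizer; swap in your own callable.
word_re = re.compile(r"\w+|[^\w\s]")
sample = "Hello world! This is a benchmark sentence. " * 10_000

rate = tokens_per_second(word_re.findall, sample, iterations=5)
print(f"{rate:,.0f} tokens/sec")
```

Results depend heavily on hardware, input text, and Python version, so compare numbers only when they were measured on the same machine.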

🧪 Testing

Run the comprehensive test suite:

python -m pytest tests/

Or run tests with performance benchmarks:

python tests/test_core.py

📚 Examples

Check out the examples/ directory for detailed usage examples:

quickstart.py: Comprehensive feature demonstration
Advanced usage patterns and best practices

🤝 Contributing

We welcome contributions! Please see our contributing guidelines for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with NumPy for exceptional performance
Inspired by modern tokenizer libraries
Designed for high-throughput NLP applications

TurboTok: Where speed meets simplicity! 🚀

Release history

0.2.0 (this version): Aug 17, 2025
0.1.0: Aug 17, 2025

Download files

Source Distribution

turbotok-0.2.0.tar.gz (19.9 kB)

Uploaded Aug 17, 2025 Source

Built Distribution

turbotok-0.2.0-py3-none-any.whl (12.2 kB)

Uploaded Aug 17, 2025 Python 3

File details

Details for the file turbotok-0.2.0.tar.gz.

File metadata

Download URL: turbotok-0.2.0.tar.gz
Upload date: Aug 17, 2025
Size: 19.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for turbotok-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`64695774f6d37d96e64a98d5ed5813b0933e133d254b0494fbfc9fd7740667c6`
MD5	`38e753153b695dda0fe1f83162c35276`
BLAKE2b-256	`43e147f14a1a843a50e859e077558236e7ee8e6e78fa315c7f7e8a7a7045490e`

File details

Details for the file turbotok-0.2.0-py3-none-any.whl.

File metadata

Download URL: turbotok-0.2.0-py3-none-any.whl
Upload date: Aug 17, 2025
Size: 12.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for turbotok-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`10c96adabce87e0170cb9041082091cd946270c5a1607acd143b0179809bf810`
MD5	`efebe87b20f991555fc5384441503a13`
BLAKE2b-256	`f6a88daef0a34bb4bed1a4ced0febd214584d306f4d2285363faf7db66425f2e`
