High-performance NumPy-based tokenizer library
Project description
TurboTok 🚀
High-performance NumPy-based tokenizer library with advanced features
TurboTok is a blazingly fast tokenizer library that leverages NumPy's vectorization capabilities to achieve exceptional performance. Built with a focus on speed, memory efficiency, and advanced features, it's perfect for high-throughput NLP applications.
✨ Features
🚀 Core Tokenization Modes
- Byte Mode: Raw byte-level tokenization (fastest)
- Char Mode: Unicode character-level tokenization
- Word Mode: Word-level tokenization with regex
- Sentence Mode: Sentence-level tokenization with rule-based splitting
🎯 Advanced Features
- Custom Vocabulary Support: Filter tokens based on custom vocabularies
- Subword Tokenization: BPE and WordPiece-style tokenization
- Streaming Tokenization: Process large files without loading into memory
- Batch Processing: Ultra-efficient batch tokenization
- Comprehensive Error Handling: Detailed error messages and validation
- Token Statistics: Rich analytics and frequency analysis
- Vocabulary Management: Save/load vocabularies to/from files
⚡ Performance Highlights
- Byte Mode: 100M+ tokens/sec (15x faster than target!)
- Char Mode: 95M+ tokens/sec (24x faster than target!)
- Word Mode: 2.8M+ tokens/sec (meets target)
- Sentence Mode: 800K+ tokens/sec (good baseline)
🛠️ Installation
pip install turbotok
🚀 Quick Start
Basic Usage
import turbotok
# Create tokenizer
tok = turbotok.TurboTok(mode="word")
# Tokenize text
tokens = tok.tokenize("Hello world! 🚀")
print(tokens) # ['Hello', 'world', '!', '🚀']
All Tokenization Modes
text = "Hello world! This is TurboTok. 🚀"
# Byte mode (fastest)
tok_byte = turbotok.TurboTok(mode="byte")
byte_tokens = tok_byte.tokenize(text) # [72, 101, 108, 108, 111, ...]
# Char mode (Unicode-safe)
tok_char = turbotok.TurboTok(mode="char")
char_tokens = tok_char.tokenize(text) # ['H', 'e', 'l', 'l', 'o', ...]
# Word mode (default)
tok_word = turbotok.TurboTok(mode="word")
word_tokens = tok_word.tokenize(text) # ['Hello', 'world', '!', 'This', ...]
# Sentence mode
tok_sentence = turbotok.TurboTok(mode="sentence")
sentence_tokens = tok_sentence.tokenize(text) # ['Hello world!', 'This is TurboTok.', '🚀']
🎯 Advanced Features
Custom Vocabulary Support
# Create tokenizer with custom vocabulary
vocab = {"Hello", "world", "TurboTok", "Python", "NumPy"}
tok = turbotok.TurboTok(mode="word", vocabulary=vocab)
# Only tokens in vocabulary are returned
tokens = tok.tokenize("Hello world! This is TurboTok.")
print(tokens) # ['Hello', 'world', 'TurboTok']
# Add tokens dynamically
tok.add_to_vocabulary(["amazing", "performance"])
tok.remove_from_vocabulary("Hello")
# Clear vocabulary
tok.clear_vocabulary()
Subword Tokenization
# BPE-style subword tokenization
tok_bpe = turbotok.TurboTok(mode="word", subword_mode="bpe", max_subword_length=3)
tokens = tok_bpe.tokenize("supercalifragilisticexpialidocious")
print(tokens) # ['sup', 'erc', 'ali', 'fra', 'gil', ...]
# WordPiece-style subword tokenization
tok_wp = turbotok.TurboTok(mode="word", subword_mode="wordpiece", max_subword_length=4)
tokens = tok_wp.tokenize("internationalization")
print(tokens) # ['inte', 'rnat', 'iona', 'liza', 'tion']
Streaming Tokenization
# Stream tokenize large files
tok = turbotok.TurboTok(mode="sentence")
for tokens in tok.tokenize_stream("large_file.txt", chunk_size=8192):
# Process each chunk of tokens
print(f"Processed {len(tokens)} tokens")
Batch Processing
# Ultra-efficient batch tokenization
texts = [
"Hello world!",
"Machine learning is amazing!",
"Python programming with NumPy.",
"Natural language processing."
]
tok = turbotok.TurboTok(mode="word")
batch_tokens = tok.tokenize_batch(texts)
for i, tokens in enumerate(batch_tokens):
print(f"Text {i+1}: {tokens}")
Token Statistics & Analysis
tok = turbotok.TurboTok(mode="word")
# Get comprehensive statistics
stats = tok.get_stats("Hello world! This is TurboTok. 🚀")
print(stats)
# {
# 'mode': 'word',
# 'token_count': 8,
# 'avg_token_length': 4.25,
# 'max_token_length': 7,
# 'min_token_length': 1,
# 'text_length': 34,
# 'compression_ratio': 4.25,
# 'vocabulary_size': None,
# 'subword_mode': None
# }
# Token frequency analysis
texts = ["Hello world!", "Hello Python!", "Hello TurboTok!"]
frequencies = tok.get_token_frequencies(texts)
most_common = tok.get_most_common_tokens(texts, top_k=3)
print(most_common) # [('Hello', 3), ('world', 1), ('Python', 1)]
Vocabulary Management
tok = turbotok.TurboTok(mode="word")
# Build vocabulary from texts
texts = ["Hello world!", "Machine learning!", "Python programming!"]
frequencies = tok.get_token_frequencies(texts)
tok.add_to_vocabulary(frequencies.keys())
# Save vocabulary to file
tok.save_vocabulary("my_vocab.txt")
# Load vocabulary in new tokenizer
new_tok = turbotok.TurboTok(mode="word")
new_tok.load_vocabulary("my_vocab.txt")
🔧 API Reference
TurboTok Class
Constructor
TurboTok(
mode="word", # Tokenization mode
vocabulary=None, # Custom vocabulary set
subword_mode=None, # Subword mode ('bpe', 'wordpiece')
max_subword_length=4 # Max subword length
)
Methods
Core Tokenization
tokenize(text: str) -> List[str]: Tokenize single texttokenize_batch(texts: List[str]) -> List[List[str]]: Tokenize multiple textstokenize_stream(file_path: str, chunk_size: int = 8192) -> Iterator[List[str]]: Stream tokenize file
Vocabulary Management
set_vocabulary(vocabulary: Set[str]): Set custom vocabularyadd_to_vocabulary(tokens: Union[str, List[str], Set[str]]): Add tokens to vocabularyremove_from_vocabulary(tokens: Union[str, List[str], Set[str]]): Remove tokens from vocabularyclear_vocabulary(): Clear vocabulary filterget_vocabulary() -> Optional[Set[str]]: Get current vocabularysave_vocabulary(file_path: str): Save vocabulary to fileload_vocabulary(file_path: str): Load vocabulary from file
Analysis & Statistics
get_stats(text: str) -> dict: Get tokenization statisticsget_token_frequencies(texts: List[str]) -> Dict[str, int]: Get token frequenciesget_most_common_tokens(texts: List[str], top_k: int = 10) -> List[tuple]: Get most common tokens
⚡ Performance Philosophy
TurboTok is built around these core principles:
- NumPy Vectorization: Leverage SIMD operations and C-level speed
- Memory Efficiency: Use memory views and pre-allocation
- Minimal Python Loops: Avoid slow Python iteration
- Optimized Regex: Pre-compiled patterns with atomic groups
- Batch Processing: Process multiple texts efficiently
📊 Benchmarks
Performance Targets vs Actual Results
| Mode | Target | Actual | Performance |
|---|---|---|---|
| Byte | 5-10M tokens/sec | 100M+ tokens/sec | 15x faster |
| Char | 3-5M tokens/sec | 95M+ tokens/sec | 24x faster |
| Word | 2-4M tokens/sec | 2.8M tokens/sec | Meets target |
| Sentence | 1-2M tokens/sec | 800K tokens/sec | Good baseline |
Run Your Own Benchmarks
from turbotok.benchmarks import run_benchmarks
# Run comprehensive benchmarks
results = run_benchmarks(text_size_mb=1.0, iterations=30)
🧪 Testing
Run the comprehensive test suite:
python -m pytest tests/
Or run tests with performance benchmarks:
python tests/test_core.py
📚 Examples
Check out the examples/ directory for detailed usage examples:
quickstart.py: Comprehensive feature demonstration- Advanced usage patterns and best practices
🤝 Contributing
We welcome contributions! Please see our contributing guidelines for details.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built with NumPy for exceptional performance
- Inspired by modern tokenizer libraries
- Designed for high-throughput NLP applications
TurboTok: Where speed meets simplicity! 🚀
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file turbotok-0.2.0.tar.gz.
File metadata
- Download URL: turbotok-0.2.0.tar.gz
- Upload date:
- Size: 19.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
64695774f6d37d96e64a98d5ed5813b0933e133d254b0494fbfc9fd7740667c6
|
|
| MD5 |
38e753153b695dda0fe1f83162c35276
|
|
| BLAKE2b-256 |
43e147f14a1a843a50e859e077558236e7ee8e6e78fa315c7f7e8a7a7045490e
|
File details
Details for the file turbotok-0.2.0-py3-none-any.whl.
File metadata
- Download URL: turbotok-0.2.0-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
10c96adabce87e0170cb9041082091cd946270c5a1607acd143b0179809bf810
|
|
| MD5 |
efebe87b20f991555fc5384441503a13
|
|
| BLAKE2b-256 |
f6a88daef0a34bb4bed1a4ced0febd214584d306f4d2285363faf7db66425f2e
|