TurboTok 🚀
High-performance NumPy-based tokenizer library
TurboTok is a blazingly fast tokenizer built with pure Python + NumPy vectorization. It exploits SIMD operations under the hood and minimizes Python loops for maximum performance.
Features
- ⚡ Ultra-fast: targets 1-10M tokens/sec depending on mode
- 🧠 NumPy vectorization: SIMD operations for maximum speed
- 🎯 Multiple modes: byte, char, word, and sentence tokenization
- 🐍 Pure Python: No external dependencies beyond NumPy
- 🌍 Unicode support: Full Unicode character handling
- 📦 Batch processing: Efficient tokenization of multiple texts
- 📊 Performance stats: Built-in benchmarking and statistics
Installation
pip install turbotok
For development:
pip install "turbotok[dev]"
Quick Start
import turbotok
# Create tokenizer with desired mode
tok = turbotok.TurboTok(mode="word")
# Tokenize text
tokens = tok.tokenize("Hello world! 🚀 TurboTok is blazingly fast!")
print(tokens)
# Output: ['Hello', 'world', '!', '🚀', 'TurboTok', 'is', 'blazingly', 'fast', '!']
# Get statistics
stats = tok.get_stats("Hello world! 🚀")
print(stats)
# Output: {'mode': 'word', 'token_count': 4, 'avg_token_length': 3.0, ...}
Tokenization Modes
1. Byte Mode (Fastest)
Raw byte-level tokenization using NumPy vectorization:
tok = turbotok.TurboTok(mode="byte")
tokens = tok.tokenize("Hello! 🚀")
print(tokens)
# Output: [72, 101, 108, 108, 111, 33, 32, 240, 159, 154, 128]
# Target throughput: 5-10M tokens/sec
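As a rough illustration of what byte-level tokenization with `np.frombuffer` can look like, here is a minimal sketch. The helper name `byte_tokenize` is hypothetical and is not part of TurboTok's API; the library's internals may differ.

```python
from typing import List

import numpy as np


def byte_tokenize(text: str) -> List[int]:
    """Return the UTF-8 bytes of `text` as ints (hypothetical helper)."""
    data = text.encode("utf-8")
    if not data:
        return []
    # np.frombuffer views the bytes object as a uint8 array without copying.
    return np.frombuffer(data, dtype=np.uint8).tolist()


print(byte_tokenize("Hello! 🚀"))
# [72, 101, 108, 108, 111, 33, 32, 240, 159, 154, 128]
```

The rocket emoji accounts for the last four values because it occupies four bytes in UTF-8.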
2. Char Mode
Unicode character-level tokenization:
tok = turbotok.TurboTok(mode="char")
tokens = tok.tokenize("Hello! 🚀")
print(tokens)
# Output: ['H', 'e', 'l', 'l', 'o', '!', ' ', '🚀']
# Target throughput: 3-5M tokens/sec
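One way to split text into code points without a Python loop is to encode it as UTF-32 and view the buffer as one-character strings. This is a sketch under that assumption, not necessarily how TurboTok does it; `char_tokenize` is a hypothetical name.

```python
from typing import List

import numpy as np


def char_tokenize(text: str) -> List[str]:
    """Split `text` into Unicode code points (hypothetical helper)."""
    if not text:
        return []
    # UTF-32 stores every code point in exactly 4 bytes, which matches the
    # in-memory layout of NumPy's little-endian '<U1' dtype.
    data = text.encode("utf-32-le")
    return np.frombuffer(data, dtype="<U1").tolist()


print(char_tokenize("Hello! 🚀"))
# ['H', 'e', 'l', 'l', 'o', '!', ' ', '🚀']
```

Two caveats apply to this approach: tokens are code points rather than grapheme clusters, so an emoji built from several code points is split, and NumPy's fixed-width string dtype drops trailing NUL characters, so a U+0000 code point comes back as an empty string.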
3. Word Mode (Default)
Word-level tokenization with regex:
tok = turbotok.TurboTok(mode="word")
tokens = tok.tokenize("Hello world! 🚀")
print(tokens)
# Output: ['Hello', 'world', '!', '🚀']
# Target throughput: 2-4M tokens/sec
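The output above is consistent with a pattern that matches either a run of word characters or a single non-space symbol. The pattern below is an assumption chosen to reproduce that output; TurboTok's actual regex is not documented here.

```python
import re
from typing import List

# Assumed pattern: a run of word characters, or one character that is
# neither a word character nor whitespace (punctuation, emoji, ...).
WORD_PATTERN = re.compile(r"\w+|[^\w\s]")


def word_tokenize(text: str) -> List[str]:
    """Split `text` into words and standalone symbols (hypothetical helper)."""
    return WORD_PATTERN.findall(text)


print(word_tokenize("Hello world! 🚀"))
# ['Hello', 'world', '!', '🚀']
```

With this pattern, contractions such as "don't" are split into three tokens, which may or may not match the library's behavior.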
4. Sentence Mode
Sentence-level tokenization:
tok = turbotok.TurboTok(mode="sentence")
tokens = tok.tokenize("Hello world! How are you? I am fine.")
print(tokens)
# Output: ['Hello world!', 'How are you?', 'I am fine.']
# Target throughput: 1-2M tokens/sec
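A simple rule that reproduces the output above is to split on whitespace that follows sentence-ending punctuation. The pattern and the name `sentence_tokenize` are assumptions for illustration only.

```python
import re
from typing import List

# Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentence_tokenize(text: str) -> List[str]:
    """Split `text` into sentences (hypothetical helper)."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


print(sentence_tokenize("Hello world! How are you? I am fine."))
# ['Hello world!', 'How are you?', 'I am fine.']
```

A rule this simple also splits after abbreviations such as "Dr.", so real-world text usually needs extra handling.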
Batch Processing
Tokenize multiple texts efficiently:
texts = ["Hello world!", "TurboTok 🚀 rocks!", "Fast tokenization!"]
tok = turbotok.TurboTok(mode="word")
# Batch tokenization
batch_tokens = tok.tokenize_batch(texts)
print(batch_tokens)
# Output: [['Hello', 'world', '!'], ['TurboTok', '🚀', 'rocks', '!'], ['Fast', 'tokenization', '!']]
Performance Benchmarks
Run the built-in benchmark suite:
from turbotok.benchmarks import run_benchmarks
results = run_benchmarks(text_size_mb=1.0, iterations=50)
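The internals of `run_benchmarks` are not described here. To sanity-check a tokens/sec figure independently, a throughput measurement can be sketched with `time.perf_counter`. The function `measure_throughput` and the stand-in regex tokenizer below are assumptions, not part of TurboTok.

```python
import re
import time
from typing import Callable, Sequence


def measure_throughput(tokenize: Callable[[str], Sequence],
                       text: str, iterations: int = 5) -> float:
    """Return tokens per second, using the fastest of `iterations` runs."""
    best = float("inf")
    token_count = 0
    for _ in range(iterations):
        start = time.perf_counter()
        tokens = tokenize(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        token_count = len(tokens)
    return token_count / best if best > 0 else float("inf")


# Stand-in tokenizer; swap in tok.tokenize to measure TurboTok itself.
sample = "Hello world! TurboTok is fast. " * 10_000
rate = measure_throughput(re.compile(r"\w+|[^\w\s]").findall, sample)
print(f"{rate:,.0f} tokens/sec")
```

Taking the fastest run reduces noise from other processes; results still vary with hardware and input text.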
Target Performance Goals
| Mode | Target (tokens/sec) | Status |
|---|---|---|
| Byte | 5-10M | 🚀 |
| Char | 3-5M | 🚀 |
| Word | 2-4M | 🚀 |
| Sentence | 1-2M | 🚀 |
API Reference
TurboTok Class
Constructor
TurboTok(mode: str = "word")
Parameters:
mode (str): Tokenization mode ("byte", "char", "word", "sentence")
Methods
tokenize(text: str) -> List[Union[int, str]]
Tokenize a single text string.
Parameters:
text (str): Input text to tokenize
Returns:
List[Union[int, str]]: List of tokens (bytes as ints for byte mode, strings otherwise)
tokenize_batch(texts: List[str]) -> List[List[Union[int, str]]]
Tokenize multiple texts efficiently.
Parameters:
texts (List[str]): List of input texts
Returns:
List[List[Union[int, str]]]: List of token lists
get_stats(text: str) -> dict
Get tokenization statistics.
Parameters:
text (str): Input text
Returns:
dict: Statistics including token count, average length, compression ratio, etc.
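The exact contents of the returned dict are not specified beyond the keys shown in the Quick Start. The sketch below shows how such statistics could be computed for word mode. The `compression_ratio` key and its definition (input characters per token) are assumptions, as is the regex standing in for the tokenizer.

```python
import re
from typing import Dict, Union


def sketch_stats(text: str) -> Dict[str, Union[str, int, float]]:
    """Compute simple token statistics for word mode (hypothetical helper)."""
    tokens = re.findall(r"\w+|[^\w\s]", text)  # stand-in word tokenizer
    count = len(tokens)
    total_length = sum(len(t) for t in tokens)
    return {
        "mode": "word",
        "token_count": count,
        "avg_token_length": total_length / count if count else 0.0,
        # Assumed definition: input characters per token.
        "compression_ratio": len(text) / count if count else 0.0,
    }


print(sketch_stats("Hello world! 🚀"))
```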
Performance Optimizations
1. NumPy Vectorization
- Byte mode uses np.frombuffer() for C-level speed
- No Python loops in critical paths
- SIMD operations under the hood
2. Pre-compiled Regex
- Word and sentence patterns compiled once at initialization
- Avoids repeated regex compilation overhead
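The compile-once idea can be sketched as a small class that builds its pattern in `__init__` and reuses it on every call. `RegexTokenizer` and both patterns are hypothetical stand-ins, not TurboTok's actual class or regexes.

```python
import re
from typing import List


class RegexTokenizer:
    """Hypothetical stand-in that compiles its pattern once per instance."""

    _PATTERNS = {
        "word": r"\w+|[^\w\s]",
        "sentence": r"[^.!?]+[.!?]*",
    }

    def __init__(self, mode: str = "word") -> None:
        if mode not in self._PATTERNS:
            raise ValueError(f"unsupported mode: {mode!r}")
        # Compiled here, so tokenize() never pays the compilation cost.
        self._pattern = re.compile(self._PATTERNS[mode])

    def tokenize(self, text: str) -> List[str]:
        matches = (m.strip() for m in self._pattern.findall(text))
        return [m for m in matches if m]


tok = RegexTokenizer("sentence")
print(tok.tokenize("Hi there. Bye!"))
# ['Hi there.', 'Bye!']
```

Python's `re` module keeps its own cache of recently compiled patterns, so the saving comes mainly from skipping the cache lookup on each call.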
3. Memory Views
- Uses np.frombuffer instead of string iteration
- Direct memory access for maximum performance
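The "direct memory access" point can be seen in a short experiment: an array created with `np.frombuffer` shares memory with its source buffer instead of copying it. This demonstrates NumPy behavior in general and says nothing specific about TurboTok's code.

```python
import numpy as np

# A bytearray is mutable, so changes to it show up in the array view.
buf = bytearray("Hello".encode("utf-8"))
view = np.frombuffer(buf, dtype=np.uint8)

buf[0] = ord("J")
print(view[0])             # 74: the array sees the change, no copy was made

# A bytes object is immutable, so the resulting array is read-only.
ro = np.frombuffer(b"Hello", dtype=np.uint8)
print(ro.flags.writeable)  # False
```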
4. Batch Processing
- Vectorized operations for multiple texts
- Reduced function call overhead
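One possible way to reduce per-text overhead in byte mode is to join all encoded texts, create a single array, and split it at the cumulative lengths. This is a sketch of that idea only; `byte_tokenize_batch` is a hypothetical name and TurboTok's batch path may work differently.

```python
from typing import List

import numpy as np


def byte_tokenize_batch(texts: List[str]) -> List[List[int]]:
    """Byte-tokenize several texts with a single frombuffer call."""
    encoded = [t.encode("utf-8") for t in texts]
    if not encoded:
        return []
    flat = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    # Cumulative lengths mark where each text ends in the flat array;
    # the final one is dropped because np.split adds the last chunk itself.
    boundaries = np.cumsum([len(e) for e in encoded])[:-1]
    return [chunk.tolist() for chunk in np.split(flat, boundaries)]


print(byte_tokenize_batch(["Hi", "🚀", "ok!"]))
# [[72, 105], [240, 159, 154, 128], [111, 107, 33]]
```

The list comprehensions still loop once per text, but the per-byte work stays inside NumPy.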
Development
Running Tests
pytest tests/
Running Benchmarks
python -m turbotok.benchmarks
Code Quality
# Format code
black turbotok/ tests/
# Lint code
flake8 turbotok/ tests/
# Type checking
mypy turbotok/
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
License
MIT License - see LICENSE file for details.
Performance Philosophy
TurboTok follows these performance principles:
- Exploit NumPy vectorization - SIMD under the hood
- Minimize Python loops - They kill speed
- Use memory views - np.frombuffer, np.char ops
- Apply math-like thinking - Treat text as arrays, not strings
- Pre-compile patterns - Avoid repeated regex compilation
- Batch operations - Process multiple texts efficiently
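As an illustration of treating text as arrays, the sketch below finds whitespace-separated tokens by locating boundaries in a byte array with vectorized comparisons. It is an example of the principle, not TurboTok's implementation, and `whitespace_tokens` is a hypothetical name.

```python
from typing import List

import numpy as np

# ASCII whitespace bytes: space, tab, newline, carriage return.
_SPACE_BYTES = np.frombuffer(b" \t\n\r", dtype=np.uint8)


def whitespace_tokens(text: str) -> List[str]:
    """Split on ASCII whitespace by finding boundaries in a byte array."""
    if not text:
        return []
    raw = text.encode("utf-8")
    data = np.frombuffer(raw, dtype=np.uint8)
    is_space = np.isin(data, _SPACE_BYTES)
    # Pad with True so tokens touching either end still get a boundary.
    padded = np.concatenate(([True], is_space, [True]))
    change = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(change == -1).tolist()  # space -> non-space
    ends = np.flatnonzero(change == 1).tolist()     # non-space -> space
    # UTF-8 multi-byte sequences never contain ASCII whitespace bytes,
    # so every slice decodes cleanly.
    return [raw[s:e].decode("utf-8") for s, e in zip(starts, ends)]


print(whitespace_tokens("Hello  world! 🚀"))
# ['Hello', 'world!', '🚀']
```

Boundary detection runs entirely in NumPy; only the final slicing and decoding loops once per token.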
Roadmap
- Parallel processing with multiprocessing
- Numba JIT compilation for even more speed
- Custom vocabulary support
- Subword tokenization modes
- Streaming tokenization for large files
- Integration with popular NLP frameworks
TurboTok - Because speed matters! 🚀⚡