TurboTok 🚀

High-performance NumPy-based tokenizer library

TurboTok is a blazingly fast tokenizer written in pure Python on top of NumPy. It keeps Python-level loops out of the hot paths and relies on NumPy's vectorized routines, which can use SIMD instructions where the platform supports them.

Features

  • Ultra-fast: 1-10M tokens/sec depending on mode
  • 🧠 NumPy vectorization: SIMD operations for maximum speed
  • 🎯 Multiple modes: byte, char, word, and sentence tokenization
  • 🐍 Pure Python: No external dependencies beyond NumPy
  • 🌍 Unicode support: Full Unicode character handling
  • 📦 Batch processing: Efficient tokenization of multiple texts
  • 📊 Performance stats: Built-in benchmarking and statistics

Installation

pip install turbotok

For development:

# Quote the extras so shells such as zsh do not expand the brackets
pip install "turbotok[dev]"

Quick Start

import turbotok

# Create tokenizer with desired mode
tok = turbotok.TurboTok(mode="word")

# Tokenize text
tokens = tok.tokenize("Hello world! 🚀 TurboTok is blazingly fast!")
print(tokens)
# Output: ['Hello', 'world', '!', '🚀', 'TurboTok', 'is', 'blazingly', 'fast', '!']

# Get statistics
stats = tok.get_stats("Hello world! 🚀")
print(stats)
# Output: {'mode': 'word', 'token_count': 4, 'avg_token_length': 3.0, ...}

Tokenization Modes

1. Byte Mode (Fastest)

Raw byte-level tokenization using NumPy vectorization:

tok = turbotok.TurboTok(mode="byte")
tokens = tok.tokenize("Hello! 🚀")
print(tokens)
# Output: [72, 101, 108, 108, 111, 33, 32, 240, 159, 154, 128]
# Performance: 5-10M tokens/sec
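
To make the byte-mode output above concrete, here is a minimal sketch of byte-level tokenization with np.frombuffer. The helper name byte_tokens and the choice of UTF-8 are assumptions made for illustration; this is not necessarily how TurboTok implements the mode internally.

```python
from typing import List

import numpy as np


def byte_tokens(text: str) -> List[int]:
    """Hypothetical helper: return the UTF-8 bytes of `text` as ints."""
    data = text.encode("utf-8")
    # Interpret the encoded bytes as a uint8 array without copying them.
    arr = np.frombuffer(data, dtype=np.uint8)
    # Convert to plain Python ints to match the list output shown above.
    return arr.tolist()


print(byte_tokens("Hello! 🚀"))
```

The rocket emoji accounts for the last four values because it takes four bytes in UTF-8.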

2. Char Mode

Unicode character-level tokenization:

tok = turbotok.TurboTok(mode="char")
tokens = tok.tokenize("Hello! 🚀")
print(tokens)
# Output: ['H', 'e', 'l', 'l', 'o', '!', ' ', '🚀']
# Performance: 3-5M tokens/sec
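
One way to get character tokens through an array, sketched here under assumptions, is to encode the text as UTF-32 so that every code point occupies exactly four bytes. The helper name char_tokens is hypothetical, and TurboTok may well take a different route.

```python
from typing import List

import numpy as np


def char_tokens(text: str) -> List[str]:
    """Hypothetical helper: split `text` into Unicode code points."""
    # UTF-32 is fixed width, so the buffer maps one-to-one to code points.
    codepoints = np.frombuffer(text.encode("utf-32-le"), dtype="<u4")
    # Turn each code point back into a one-character string.
    return [chr(cp) for cp in codepoints.tolist()]


print(char_tokens("Hello! 🚀"))
```

This splits on code points, not grapheme clusters, so an emoji built from several code points (for example with a skin-tone modifier) would come out as several tokens.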

3. Word Mode (Default)

Word-level tokenization with regex:

tok = turbotok.TurboTok(mode="word")
tokens = tok.tokenize("Hello world! 🚀")
print(tokens)
# Output: ['Hello', 'world', '!', '🚀']
# Performance: 2-4M tokens/sec
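
The exact regex TurboTok uses is not shown here. As a sketch, the assumed pattern below (runs of word characters, or any single character that is neither a word character nor whitespace) reproduces the outputs listed in this README. WORD_PATTERN and word_tokens are illustrative names.

```python
import re
from typing import List

# Assumed pattern, compiled once: a run of word characters, or one symbol.
WORD_PATTERN = re.compile(r"\w+|[^\w\s]")


def word_tokens(text: str) -> List[str]:
    """Hypothetical helper: split `text` into words and standalone symbols."""
    return WORD_PATTERN.findall(text)


print(word_tokens("Hello world! 🚀"))
```

With this pattern punctuation and emoji become their own tokens, while whitespace is dropped.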

4. Sentence Mode

Sentence-level tokenization:

tok = turbotok.TurboTok(mode="sentence")
tokens = tok.tokenize("Hello world! How are you? I am fine.")
print(tokens)
# Output: ['Hello world!', 'How are you?', 'I am fine.']
# Performance: 1-2M tokens/sec
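
A simple way to approximate this behavior, assuming sentences end in ".", "!" or "?" followed by whitespace, is a lookbehind split. The pattern and the helper name sentence_tokens are assumptions, not TurboTok's documented rule.

```python
import re
from typing import List

# Assumed boundary: whitespace that directly follows ., ! or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentence_tokens(text: str) -> List[str]:
    """Hypothetical helper: split `text` at sentence-ending punctuation."""
    parts = SENTENCE_BOUNDARY.split(text.strip())
    # Drop empty strings, which appear when the input itself is empty.
    return [part for part in parts if part]


print(sentence_tokens("Hello world! How are you? I am fine."))
```

A rule this simple also splits after abbreviations such as "Dr.", so it is best treated as a baseline.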

Batch Processing

Tokenize multiple texts efficiently:

texts = ["Hello world!", "TurboTok 🚀 rocks!", "Fast tokenization!"]
tok = turbotok.TurboTok(mode="word")

# Batch tokenization
batch_tokens = tok.tokenize_batch(texts)
print(batch_tokens)
# Output: [['Hello', 'world', '!'], ['TurboTok', '🚀', 'rocks', '!'], ['Fast', 'tokenization', '!']]

Performance Benchmarks

Run the built-in benchmark suite:

from turbotok.benchmarks import run_benchmarks

results = run_benchmarks(text_size_mb=1.0, iterations=50)
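
If you want to cross-check a tokens/sec figure by hand, a small timing loop is enough. The sketch below is independent of turbotok.benchmarks; tokens_per_second is a hypothetical helper, and a plain regex stands in for the tokenizer so the snippet runs on its own.

```python
import re
import time
from typing import Callable, Sequence

# Stand-in tokenizer so the example is self-contained (assumed pattern).
PATTERN = re.compile(r"\w+|[^\w\s]")


def tokens_per_second(tokenize: Callable[[str], Sequence], text: str,
                      iterations: int = 50) -> float:
    """Time repeated calls to `tokenize(text)` and return tokens per second."""
    # Warm-up call; it also tells us how many tokens one pass produces.
    n_tokens = len(tokenize(text))
    start = time.perf_counter()
    for _ in range(iterations):
        tokenize(text)
    elapsed = time.perf_counter() - start
    return n_tokens * iterations / elapsed


sample = "Hello world! TurboTok is fast. " * 1000
rate = tokens_per_second(PATTERN.findall, sample)
print(f"{rate:,.0f} tokens/sec")
```

Any callable that takes a string and returns a sequence can be passed in, for example tok.tokenize from the Quick Start. Results vary with hardware, input text, and Python version.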

Target Performance Goals

Mode       Target (tokens/sec)   Status
Byte       5-10M                 🚀
Char       3-5M                  🚀
Word       2-4M                  🚀
Sentence   1-2M                  🚀

API Reference

TurboTok Class

Constructor

TurboTok(mode: str = "word")

Parameters:

  • mode (str): Tokenization mode ("byte", "char", "word", "sentence")

Methods

tokenize(text: str) -> List[Union[int, str]]

Tokenize a single text string.

Parameters:

  • text (str): Input text to tokenize

Returns:

  • List[Union[int, str]]: List of tokens (bytes as ints for byte mode, strings otherwise)

tokenize_batch(texts: List[str]) -> List[List[Union[int, str]]]

Tokenize multiple texts efficiently.

Parameters:

  • texts (List[str]): List of input texts

Returns:

  • List[List[Union[int, str]]]: List of token lists

get_stats(text: str) -> dict

Get tokenization statistics.

Parameters:

  • text (str): Input text

Returns:

  • dict: Statistics including token count, average length, compression ratio, etc.
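
The definitions behind these statistics are not spelled out above, so the following is only a sketch of how such a dictionary could be computed for string tokens. The function name sketch_stats is hypothetical, and the compression_ratio key and its definition (input characters per token) are assumptions.

```python
from typing import Dict, List, Union


def sketch_stats(text: str, tokens: List[str],
                 mode: str) -> Dict[str, Union[str, int, float]]:
    """Hypothetical helper: summarize a list of string tokens."""
    token_count = len(tokens)
    total_length = sum(len(token) for token in tokens)
    return {
        "mode": mode,
        "token_count": token_count,
        # Mean token length in characters; 0.0 for an empty token list.
        "avg_token_length": total_length / token_count if token_count else 0.0,
        # Assumed definition: input characters per token.
        "compression_ratio": len(text) / token_count if token_count else 0.0,
    }


stats = sketch_stats("Hello world!", ["Hello", "world", "!"], mode="word")
print(stats)
```

Here the input has 12 characters and 3 tokens, so the assumed compression ratio is 4.0.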

Performance Optimizations

1. NumPy Vectorization

  • Byte mode uses np.frombuffer() for C-level speed
  • No Python loops in critical paths
  • NumPy's compiled loops can use SIMD instructions where the CPU supports them

2. Pre-compiled Regex

  • Word and sentence patterns compiled once at initialization
  • Avoids repeated regex compilation overhead

3. Memory Views

  • Uses np.frombuffer instead of string iteration
  • Direct memory access for maximum performance
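
As a small illustration of what "direct memory access" buys, the sketch below views encoded text as an array and finds every space with one vectorized comparison instead of a character-by-character loop. The variable names are illustrative only.

```python
import numpy as np

data = "Hello world! How are you?".encode("utf-8")

# np.frombuffer wraps the existing bytes; no copy is made.
arr = np.frombuffer(data, dtype=np.uint8)

# Because bytes objects are immutable, the resulting view is read-only.
print(arr.flags.writeable)  # False

# One vectorized comparison replaces a Python loop over characters.
is_space = arr == ord(" ")
space_count = int(np.count_nonzero(is_space))
boundaries = np.flatnonzero(is_space).tolist()

print(space_count, boundaries)
```

The boundary offsets are byte positions, so with non-ASCII input they index into the encoded bytes, not into the original string.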

4. Batch Processing

  • Vectorized operations for multiple texts
  • Reduced function call overhead
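
One way to batch byte-level tokenization, sketched here under assumptions, is to join all encoded texts, build a single array, and split it at the cumulative lengths. The helper name byte_tokens_batch is hypothetical, and tokenize_batch may be implemented differently.

```python
from typing import List

import numpy as np


def byte_tokens_batch(texts: List[str]) -> List[List[int]]:
    """Hypothetical helper: byte-tokenize many texts using one array."""
    if not texts:
        return []
    encoded = [text.encode("utf-8") for text in texts]
    lengths = [len(chunk) for chunk in encoded]
    # One buffer and one array for the whole batch.
    flat = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    # Split points are the running totals, excluding the final one.
    offsets = np.cumsum(lengths)[:-1]
    return [part.tolist() for part in np.split(flat, offsets)]


print(byte_tokens_batch(["Hi", "ok!"]))
```

The per-text work that remains in Python is encoding and the final list conversion; the array is created only once for the batch.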

Development

Running Tests

pytest tests/

Running Benchmarks

python -m turbotok.benchmarks

Code Quality

# Format code
black turbotok/ tests/

# Lint code
flake8 turbotok/ tests/

# Type checking
mypy turbotok/

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

Performance Philosophy

TurboTok follows these performance principles:

  1. Exploit NumPy vectorization - SIMD under the hood
  2. Minimize Python loops - They kill speed
  3. Use memory views - np.frombuffer, np.char ops
  4. Apply math-like thinking - Treat text as arrays, not strings
  5. Pre-compile patterns - Avoid repeated regex compilation
  6. Batch operations - Process multiple texts efficiently

Roadmap

  • Parallel processing with multiprocessing
  • Numba JIT compilation for even more speed
  • Custom vocabulary support
  • Subword tokenization modes
  • Streaming tokenization for large files
  • Integration with popular NLP frameworks

TurboTok - Because speed matters! 🚀⚡
