TurboTok 🚀

High-performance NumPy-based tokenizer library

TurboTok is a fast tokenizer written in pure Python on top of NumPy. It relies on NumPy's vectorized, compiled routines (which can use SIMD instructions where the NumPy build and hardware support them) and keeps Python-level loops out of the hot paths.

Features

⚡ Ultra-fast: targets 1-10M tokens/sec depending on mode
🧠 NumPy vectorization: SIMD operations for maximum speed
🎯 Multiple modes: byte, char, word, and sentence tokenization
🐍 Pure Python: No external dependencies beyond NumPy
🌍 Unicode support: Full Unicode character handling
📦 Batch processing: Efficient tokenization of multiple texts
📊 Performance stats: Built-in benchmarking and statistics

Installation

pip install turbotok

For development:

pip install "turbotok[dev]"

Quick Start

import turbotok

# Create tokenizer with desired mode
tok = turbotok.TurboTok(mode="word")

# Tokenize text
tokens = tok.tokenize("Hello world! 🚀 TurboTok is blazingly fast!")
print(tokens)
# Output: ['Hello', 'world', '!', '🚀', 'TurboTok', 'is', 'blazingly', 'fast', '!']

# Get statistics
stats = tok.get_stats("Hello world! 🚀")
print(stats)
# Output: {'mode': 'word', 'token_count': 4, 'avg_token_length': 3.0, ...}

Tokenization Modes

1. Byte Mode (Fastest)

Raw byte-level tokenization using NumPy vectorization:

tok = turbotok.TurboTok(mode="byte")
tokens = tok.tokenize("Hello! 🚀")
print(tokens)
# Output: [72, 101, 108, 108, 111, 33, 32, 240, 159, 154, 128]
# Target performance: 5-10M tokens/sec

2. Char Mode

Unicode character-level tokenization:

tok = turbotok.TurboTok(mode="char")
tokens = tok.tokenize("Hello! 🚀")
print(tokens)
# Output: ['H', 'e', 'l', 'l', 'o', '!', ' ', '🚀']
# Target performance: 3-5M tokens/sec

3. Word Mode (Default)

Word-level tokenization with regex:

tok = turbotok.TurboTok(mode="word")
tokens = tok.tokenize("Hello world! 🚀")
print(tokens)
# Output: ['Hello', 'world', '!', '🚀']
# Target performance: 2-4M tokens/sec

4. Sentence Mode

Sentence-level tokenization:

tok = turbotok.TurboTok(mode="sentence")
tokens = tok.tokenize("Hello world! How are you? I am fine.")
print(tokens)
# Output: ['Hello world!', 'How are you?', 'I am fine.']
# Target performance: 1-2M tokens/sec

Batch Processing

Tokenize multiple texts efficiently:

texts = ["Hello world!", "TurboTok 🚀 rocks!", "Fast tokenization!"]
tok = turbotok.TurboTok(mode="word")

# Batch tokenization
batch_tokens = tok.tokenize_batch(texts)
print(batch_tokens)
# Output: [['Hello', 'world', '!'], ['TurboTok', '🚀', 'rocks', '!'], ['Fast', 'tokenization', '!']]

Performance Benchmarks

Run the built-in benchmark suite:

from turbotok.benchmarks import run_benchmarks

results = run_benchmarks(text_size_mb=1.0, iterations=50)

Target Performance Goals

Mode	Target (tokens/sec)	Status
Byte	5-10M	🚀
Char	3-5M	🚀
Word	2-4M	🚀
Sentence	1-2M	🚀
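
The targets above are stated in tokens per second. A minimal way to measure that figure for any tokenizer callable is sketched below. `tokens_per_second` is a hypothetical helper, not part of TurboTok, and `str.split` stands in for a real tokenizer so the snippet runs without the package installed.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(
    tokenize: Callable[[str], Sequence],
    text: str,
    iterations: int = 50,
) -> float:
    """Hypothetical helper: average throughput of `tokenize` over `iterations` runs."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(iterations):
        total_tokens += len(tokenize(text))
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return total_tokens / elapsed if elapsed > 0 else float("inf")


# str.split is only a stand-in; pass tok.tokenize to time a real tokenizer.
rate = tokens_per_second(str.split, "Hello world! " * 10_000)
print(f"{rate:,.0f} tokens/sec")
```

Results depend heavily on the machine, the input text, and the Python and NumPy versions, so numbers from different setups are not directly comparable.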

API Reference

TurboTok Class

Constructor

TurboTok(mode: str = "word")

Parameters:

mode (str): Tokenization mode ("byte", "char", "word", "sentence")

Methods

`tokenize(text: str) -> List[Union[int, str]]`

Tokenize a single text string.

Parameters:

text (str): Input text to tokenize

Returns:

List[Union[int, str]]: List of tokens (bytes as ints for byte mode, strings otherwise)

`tokenize_batch(texts: List[str]) -> List[List[Union[int, str]]]`

Tokenize multiple texts efficiently.

Parameters:

texts (List[str]): List of input texts

Returns:

List[List[Union[int, str]]]: List of token lists

`get_stats(text: str) -> dict`

Get tokenization statistics.

Parameters:

text (str): Input text

Returns:

dict: Statistics including token count, average length, compression ratio, etc.
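
To make the returned fields concrete, here is a sketch of how the token count and average length could be derived from a token list. `token_stats` is a hypothetical function and the dictionary keys are taken from the Quick Start output. The source does not define the compression ratio, so it is left out here.

```python
from typing import Dict, List, Union


def token_stats(tokens: List[Union[int, str]], mode: str) -> Dict[str, object]:
    """Hypothetical sketch: basic statistics for an already tokenized text."""
    count = len(tokens)
    # Byte-mode tokens are ints, so each one counts as length 1.
    total_length = sum(1 if isinstance(t, int) else len(t) for t in tokens)
    return {
        "mode": mode,
        "token_count": count,
        "avg_token_length": total_length / count if count else 0.0,
    }


stats = token_stats(["Hello", "world", "!", "🚀"], mode="word")
print(stats)
# {'mode': 'word', 'token_count': 4, 'avg_token_length': 3.0}
```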

Performance Optimizations

1. NumPy Vectorization

Byte mode uses np.frombuffer() for C-level speed
No Python loops in critical paths
SIMD operations under the hood
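
As an illustration of the first point, byte-level tokenization can be reduced to one `np.frombuffer` call over the UTF-8 encoding of the text. This is a sketch under that assumption, not TurboTok's actual source, and `byte_tokens` is a hypothetical name.

```python
from typing import List

import numpy as np


def byte_tokens(text: str) -> List[int]:
    """Hypothetical sketch: UTF-8 bytes of `text` as a list of ints."""
    data = text.encode("utf-8")
    # frombuffer wraps the bytes object in a read-only uint8 array without copying.
    arr = np.frombuffer(data, dtype=np.uint8)
    # tolist() converts to Python ints inside NumPy's C code.
    return arr.tolist()


print(byte_tokens("Hello! 🚀"))
# [72, 101, 108, 108, 111, 33, 32, 240, 159, 154, 128]
```

The rocket emoji takes four bytes in UTF-8, which is why the eight-character input yields eleven tokens.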

2. Pre-compiled Regex

Word and sentence patterns compiled once at initialization
Avoids repeated regex compilation overhead
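
The compile-once pattern could look like the sketch below. The class name and both regular expressions are assumptions chosen to reproduce the example outputs in this README. TurboTok's real patterns are not shown in the source.

```python
import re
from typing import List


class RegexTokenizer:
    """Hypothetical sketch of word and sentence tokenization with pre-compiled patterns."""

    def __init__(self) -> None:
        # Compiled once here, then reused by every call.
        # Word pattern: runs of word characters, or any single non-space symbol.
        self._word = re.compile(r"\w+|[^\w\s]")
        # Sentence pattern: text up to closing punctuation, or up to end of input.
        self._sentence = re.compile(r"[^.!?]+(?:[.!?]+|$)")

    def words(self, text: str) -> List[str]:
        return self._word.findall(text)

    def sentences(self, text: str) -> List[str]:
        stripped = (s.strip() for s in self._sentence.findall(text))
        return [s for s in stripped if s]


rt = RegexTokenizer()
print(rt.words("Hello world! 🚀"))
# ['Hello', 'world', '!', '🚀']
print(rt.sentences("Hello world! How are you? I am fine."))
# ['Hello world!', 'How are you?', 'I am fine.']
```

A punctuation-based sentence pattern like this one will split on abbreviations such as "e.g.", so it is only a starting point.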

3. Memory Views

Uses np.frombuffer instead of string iteration
Direct memory access for maximum performance
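
One way to apply the same idea to characters is to encode the text as fixed-width UTF-32 and reinterpret the buffer as an array of code points. The source does not say how char mode is implemented, so treat this as an assumed approach. `char_tokens` is a hypothetical name.

```python
from typing import List

import numpy as np


def char_tokens(text: str) -> List[str]:
    """Hypothetical sketch: split `text` into characters without a Python-level loop."""
    # UTF-32-LE uses exactly 4 bytes per code point and writes no byte-order mark.
    data = text.encode("utf-32-le")
    codes = np.frombuffer(data, dtype="<u4")  # one uint32 per code point
    # Reinterpret each 4-byte code point as a one-character NumPy string.
    return codes.view("<U1").tolist()


print(char_tokens("Hello! 🚀"))
# ['H', 'e', 'l', 'l', 'o', '!', ' ', '🚀']
```

This splits on code points, not on user-perceived characters, so a sequence such as a base letter plus a combining accent comes back as two tokens. NumPy also drops trailing NUL characters from its string type, so a `"\x00"` in the input would come back as an empty string.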

4. Batch Processing

Vectorized operations for multiple texts
Reduced function call overhead
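
A possible shape for batched byte tokenization is to join all encoded texts into one buffer, wrap it once, and cut it at the cumulative lengths. This is an assumption about how batching might be done, not a description of `tokenize_batch`. `byte_tokens_batch` is a hypothetical name.

```python
from typing import List

import numpy as np


def byte_tokens_batch(texts: List[str]) -> List[List[int]]:
    """Hypothetical sketch: byte-tokenize many texts through one shared buffer."""
    if not texts:
        return []
    encoded = [t.encode("utf-8") for t in texts]
    lengths = np.fromiter((len(b) for b in encoded), dtype=np.int64, count=len(encoded))
    # A single frombuffer call covers the whole batch.
    flat = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    # Split points are the cumulative lengths, excluding the final total.
    pieces = np.split(flat, np.cumsum(lengths)[:-1])
    return [p.tolist() for p in pieces]


print(byte_tokens_batch(["Hi", "", "🚀"]))
# [[72, 105], [], [240, 159, 154, 128]]
```

The encoding step and the final conversion to lists still loop in Python, so the gain comes from sharing one buffer and one array, not from removing loops entirely.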

Development

Running Tests

pytest tests/

Running Benchmarks

python -m turbotok.benchmarks

Code Quality

# Format code
black turbotok/ tests/

# Lint code
flake8 turbotok/ tests/

# Type checking
mypy turbotok/

Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request

License

MIT License - see LICENSE file for details.

Performance Philosophy

TurboTok follows these performance principles:

Exploit NumPy vectorization - SIMD under the hood
Minimize Python loops - They kill speed
Use memory views - np.frombuffer, np.char ops
Apply math-like thinking - Treat text as arrays, not strings
Pre-compile patterns - Avoid repeated regex compilation
Batch operations - Process multiple texts efficiently

Roadmap

Parallel processing with multiprocessing
Numba JIT compilation for even more speed
Custom vocabulary support
Subword tokenization modes
Streaming tokenization for large files
Integration with popular NLP frameworks

TurboTok - Because speed matters! 🚀⚡

Release history

0.2.0 (Aug 17, 2025)
0.1.0 (Aug 17, 2025, this version)

Download files

Source Distribution

turbotok-0.1.0.tar.gz (14.3 kB view details)

Uploaded Aug 17, 2025 Source

Built Distribution

turbotok-0.1.0-py3-none-any.whl (9.2 kB view details)

Uploaded Aug 17, 2025 Python 3

File details

Details for the file turbotok-0.1.0.tar.gz.

File metadata

Download URL: turbotok-0.1.0.tar.gz
Upload date: Aug 17, 2025
Size: 14.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for turbotok-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0fc413fab1c90bf7a194a42429fabd954148fd1619a301977a74764738c9ddde`
MD5	`b96d9341a6e14badc997b0739a92c05b`
BLAKE2b-256	`e5033d20e0dd7ee2562be38dd18eaeda98ce1c6e71e50001b812cae211322364`

File details

Details for the file turbotok-0.1.0-py3-none-any.whl.

File metadata

Download URL: turbotok-0.1.0-py3-none-any.whl
Upload date: Aug 17, 2025
Size: 9.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for turbotok-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`22a7196bf6cb7314149e3beff8c1a14a1512836b42cfd50e3c00b6bb76274226`
MD5	`91e89643fc2187f6d207a1154dc86d31`
BLAKE2b-256	`2aad7efc4672d7cd3436d6ce8fd23a7249388a8b3821d99c7e0c8c5c6207234a`
