TurboTok 🚀

High-performance NumPy-based tokenizer library with advanced features

License: MIT

TurboTok is a tokenizer library that uses NumPy's vectorized operations for speed. It is designed for throughput and memory efficiency, and it targets high-throughput NLP applications.

✨ Features

🚀 Core Tokenization Modes

  • Byte Mode: Raw byte-level tokenization (fastest)
  • Char Mode: Unicode character-level tokenization
  • Word Mode: Word-level tokenization with regex
  • Sentence Mode: Sentence-level tokenization with rule-based splitting

🎯 Advanced Features

  • Custom Vocabulary Support: Filter tokens based on custom vocabularies
  • Subword Tokenization: BPE and WordPiece-style tokenization
  • Streaming Tokenization: Process large files without loading into memory
  • Batch Processing: Ultra-efficient batch tokenization
  • Comprehensive Error Handling: Detailed error messages and validation
  • Token Statistics: Rich analytics and frequency analysis
  • Vocabulary Management: Save/load vocabularies to/from files

Performance Highlights

  • Byte Mode: 100M+ tokens/sec (about 15x the 5-10M target)
  • Char Mode: 95M+ tokens/sec (about 24x the 3-5M target)
  • Word Mode: 2.8M+ tokens/sec (within the 2-4M target range)
  • Sentence Mode: 800K+ tokens/sec (below the 1-2M target)

🛠️ Installation

pip install turbotok

🚀 Quick Start

Basic Usage

import turbotok

# Create tokenizer
tok = turbotok.TurboTok(mode="word")

# Tokenize text
tokens = tok.tokenize("Hello world! 🚀")
print(tokens)  # ['Hello', 'world', '!', '🚀']

All Tokenization Modes

text = "Hello world! This is TurboTok. 🚀"

# Byte mode (fastest)
tok_byte = turbotok.TurboTok(mode="byte")
byte_tokens = tok_byte.tokenize(text)  # [72, 101, 108, 108, 111, ...]

# Char mode (Unicode-safe)
tok_char = turbotok.TurboTok(mode="char")
char_tokens = tok_char.tokenize(text)  # ['H', 'e', 'l', 'l', 'o', ...]

# Word mode (default)
tok_word = turbotok.TurboTok(mode="word")
word_tokens = tok_word.tokenize(text)  # ['Hello', 'world', '!', 'This', ...]

# Sentence mode
tok_sentence = turbotok.TurboTok(mode="sentence")
sentence_tokens = tok_sentence.tokenize(text)  # ['Hello world!', 'This is TurboTok.', '🚀']
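
For intuition about what each mode produces, the following is a rough standard-library approximation. It is a sketch and not TurboTok's implementation: the helper name `approx_tokenize` and both regular expressions are assumptions, chosen only so that the results match the example outputs shown above.

```python
import re
from typing import List, Union

# Hypothetical patterns, chosen only to mirror the example outputs above.
WORD_RE = re.compile(r"\w+|[^\w\s]")        # a run of word characters, or one symbol
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")  # whitespace that follows ., ! or ?


def approx_tokenize(text: str, mode: str = "word") -> Union[List[int], List[str]]:
    """Approximate the four modes with plain Python (illustrative only)."""
    if mode == "byte":
        return list(text.encode("utf-8"))   # integer byte values
    if mode == "char":
        return list(text)                   # one entry per Unicode code point
    if mode == "word":
        return WORD_RE.findall(text)
    if mode == "sentence":
        return [s for s in SENTENCE_RE.split(text) if s]
    raise ValueError(f"unknown mode: {mode!r}")


text = "Hello world! This is TurboTok. 🚀"
byte_tokens = approx_tokenize(text, "byte")
char_tokens = approx_tokenize(text, "char")
word_tokens = approx_tokenize(text, "word")
sentence_tokens = approx_tokenize(text, "sentence")
```

Note the difference between the byte and char results for the emoji: it is a single entry in `char_tokens` but four entries in `byte_tokens`, because its UTF-8 encoding is four bytes long.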

🎯 Advanced Features

Custom Vocabulary Support

# Create tokenizer with custom vocabulary
vocab = {"Hello", "world", "TurboTok", "Python", "NumPy"}
tok = turbotok.TurboTok(mode="word", vocabulary=vocab)

# Only tokens in vocabulary are returned
tokens = tok.tokenize("Hello world! This is TurboTok.")
print(tokens)  # ['Hello', 'world', 'TurboTok']

# Add tokens dynamically
tok.add_to_vocabulary(["amazing", "performance"])
tok.remove_from_vocabulary("Hello")

# Clear vocabulary
tok.clear_vocabulary()
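
The filtering behaviour described above can be pictured as a set-membership test applied after tokenization. The sketch below is an assumption about the semantics, not TurboTok's code: `filter_by_vocabulary` is a hypothetical helper, and treating `None` as "no filter" is inferred from `clear_vocabulary()` being described as clearing the filter.

```python
from typing import Iterable, List, Optional, Set


def filter_by_vocabulary(tokens: Iterable[str],
                         vocabulary: Optional[Set[str]]) -> List[str]:
    """Keep only tokens present in the vocabulary; None means no filtering."""
    if vocabulary is None:
        return list(tokens)
    # Set membership is O(1) on average, so filtering is linear in the token count.
    return [token for token in tokens if token in vocabulary]


vocab = {"Hello", "world", "TurboTok", "Python", "NumPy"}
tokens = ["Hello", "world", "!", "This", "is", "TurboTok", "."]

kept = filter_by_vocabulary(tokens, vocab)
unfiltered = filter_by_vocabulary(tokens, None)
```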

Subword Tokenization

# BPE-style subword tokenization
tok_bpe = turbotok.TurboTok(mode="word", subword_mode="bpe", max_subword_length=3)
tokens = tok_bpe.tokenize("supercalifragilisticexpialidocious")
print(tokens)  # ['sup', 'erc', 'ali', 'fra', 'gil', ...]

# WordPiece-style subword tokenization
tok_wp = turbotok.TurboTok(mode="word", subword_mode="wordpiece", max_subword_length=4)
tokens = tok_wp.tokenize("internationalization")
print(tokens)  # ['inte', 'rnat', 'iona', 'liza', 'tion']
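
Both example outputs above are consistent with slicing a word into consecutive pieces of at most `max_subword_length` characters. Whether TurboTok does exactly this is not stated in this README, so treat the sketch below as an assumption that reproduces the outputs shown; `fixed_width_pieces` is a hypothetical name. It does not perform the learned merges of a trained BPE or WordPiece model.

```python
from typing import List


def fixed_width_pieces(word: str, max_subword_length: int) -> List[str]:
    """Split a word into consecutive slices of at most max_subword_length chars."""
    if max_subword_length < 1:
        raise ValueError("max_subword_length must be at least 1")
    return [
        word[i:i + max_subword_length]
        for i in range(0, len(word), max_subword_length)
    ]


bpe_like = fixed_width_pieces("supercalifragilisticexpialidocious", 3)
wordpiece_like = fixed_width_pieces("internationalization", 4)
```

Joining the pieces back together always recovers the original word, which is a convenient sanity check for any chunking scheme of this kind.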

Streaming Tokenization

# Stream tokenize large files
tok = turbotok.TurboTok(mode="sentence")

for tokens in tok.tokenize_stream("large_file.txt", chunk_size=8192):
    # Process each chunk of tokens
    print(f"Processed {len(tokens)} tokens")
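
One detail that chunked reading has to handle is a token that straddles two chunks. The sketch below shows a common approach under stated assumptions: it works on word tokens instead of sentences, `stream_words` is a hypothetical helper, and it takes an open text stream so the example runs without a file on disk. It is not how `tokenize_stream` is implemented.

```python
import io
import re
from typing import Iterator, List, TextIO

WORD_RE = re.compile(r"\w+|[^\w\s]")  # hypothetical word pattern


def stream_words(stream: TextIO, chunk_size: int = 8192) -> Iterator[List[str]]:
    """Yield word tokens chunk by chunk, carrying over a possibly cut-off word."""
    carry = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer = carry + chunk
        # Everything after the last space or newline may be an incomplete word.
        cut = max(buffer.rfind(" "), buffer.rfind("\n"))
        if cut == -1:
            carry = buffer  # no boundary yet, keep accumulating
            continue
        carry = buffer[cut + 1:]
        tokens = WORD_RE.findall(buffer[:cut])
        if tokens:
            yield tokens
    if carry:
        tokens = WORD_RE.findall(carry)  # flush whatever remains at end of input
        if tokens:
            yield tokens


demo = io.StringIO("alpha beta gamma delta")
chunks = list(stream_words(demo, chunk_size=4))
flat = [token for chunk in chunks for token in chunk]
```

With a deliberately tiny `chunk_size` of 4, every word in the demo input is split across reads, yet each one comes out whole because the trailing fragment is held back until its boundary arrives.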

Batch Processing

# Ultra-efficient batch tokenization
texts = [
    "Hello world!",
    "Machine learning is amazing!",
    "Python programming with NumPy.",
    "Natural language processing."
]

tok = turbotok.TurboTok(mode="word")
batch_tokens = tok.tokenize_batch(texts)

for i, tokens in enumerate(batch_tokens):
    print(f"Text {i+1}: {tokens}")

Token Statistics & Analysis

tok = turbotok.TurboTok(mode="word")

# Get comprehensive statistics
stats = tok.get_stats("Hello world! This is TurboTok. 🚀")
print(stats)
# Example output (illustrative values):
# {
#     'mode': 'word',
#     'token_count': 8,
#     'avg_token_length': 4.25,
#     'max_token_length': 7,
#     'min_token_length': 1,
#     'text_length': 34,
#     'compression_ratio': 4.25,
#     'vocabulary_size': None,
#     'subword_mode': None
# }

# Token frequency analysis
texts = ["Hello world!", "Hello Python!", "Hello TurboTok!"]
frequencies = tok.get_token_frequencies(texts)
most_common = tok.get_most_common_tokens(texts, top_k=3)
print(most_common)  # [('Hello', 3), ('world', 1), ('Python', 1)]
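
Frequency analysis of this kind amounts to counting tokens across all texts. The sketch below uses `collections.Counter` from the standard library; the function names and the `\w+` pattern are assumptions, and punctuation is deliberately left uncounted here so that the counts line up with the example output above.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

WORD_ONLY_RE = re.compile(r"\w+")  # assumption: punctuation is not counted


def token_frequencies(texts: List[str]) -> Dict[str, int]:
    """Count how often each word token occurs across all texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(WORD_ONLY_RE.findall(text))
    return dict(counts)


def most_common_tokens(texts: List[str], top_k: int = 10) -> List[Tuple[str, int]]:
    """Return the top_k (token, count) pairs, highest count first."""
    return Counter(token_frequencies(texts)).most_common(top_k)


texts = ["Hello world!", "Hello Python!", "Hello TurboTok!"]
frequencies = token_frequencies(texts)
top = most_common_tokens(texts, top_k=3)
```

If punctuation were counted as tokens, `'!'` would tie with `'Hello'` at three occurrences, so the choice of tokenization rule directly affects which tokens rank highest.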

Vocabulary Management

tok = turbotok.TurboTok(mode="word")

# Build vocabulary from texts
texts = ["Hello world!", "Machine learning!", "Python programming!"]
frequencies = tok.get_token_frequencies(texts)
# Pass a list, matching the documented parameter types of add_to_vocabulary
tok.add_to_vocabulary(list(frequencies.keys()))

# Save vocabulary to file
tok.save_vocabulary("my_vocab.txt")

# Load vocabulary in new tokenizer
new_tok = turbotok.TurboTok(mode="word")
new_tok.load_vocabulary("my_vocab.txt")
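
The on-disk vocabulary format is not described in this README. As an illustration of the save/load round trip only, the sketch below assumes a UTF-8 text file with one token per line; `save_vocab` and `load_vocab` are hypothetical helpers, not TurboTok functions.

```python
import os
import tempfile
from typing import Iterable, Set


def save_vocab(vocabulary: Iterable[str], file_path: str) -> None:
    """Write one token per line, sorted so the file is stable across runs."""
    with open(file_path, "w", encoding="utf-8") as fh:
        for token in sorted(set(vocabulary)):
            fh.write(token + "\n")


def load_vocab(file_path: str) -> Set[str]:
    """Read tokens back, skipping blank lines."""
    with open(file_path, "r", encoding="utf-8") as fh:
        return {line.rstrip("\n") for line in fh if line.strip()}


vocab = {"Hello", "world", "Python"}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_vocab.txt")
    save_vocab(vocab, path)
    restored = load_vocab(path)
```

A line-based format like this cannot represent tokens that contain a newline; a format such as JSON would be needed for that case.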

🔧 API Reference

TurboTok Class

Constructor

TurboTok(
    mode="word",                    # Tokenization mode
    vocabulary=None,                # Custom vocabulary set
    subword_mode=None,              # Subword mode ('bpe', 'wordpiece')
    max_subword_length=4            # Max subword length
)

Methods

Core Tokenization

  • tokenize(text: str) -> List[str]: Tokenize a single text (in byte mode, the example above shows integer byte values rather than strings)
  • tokenize_batch(texts: List[str]) -> List[List[str]]: Tokenize multiple texts
  • tokenize_stream(file_path: str, chunk_size: int = 8192) -> Iterator[List[str]]: Stream tokenize file

Vocabulary Management

  • set_vocabulary(vocabulary: Set[str]): Set custom vocabulary
  • add_to_vocabulary(tokens: Union[str, List[str], Set[str]]): Add tokens to vocabulary
  • remove_from_vocabulary(tokens: Union[str, List[str], Set[str]]): Remove tokens from vocabulary
  • clear_vocabulary(): Clear vocabulary filter
  • get_vocabulary() -> Optional[Set[str]]: Get current vocabulary
  • save_vocabulary(file_path: str): Save vocabulary to file
  • load_vocabulary(file_path: str): Load vocabulary from file

Analysis & Statistics

  • get_stats(text: str) -> dict: Get tokenization statistics
  • get_token_frequencies(texts: List[str]) -> Dict[str, int]: Get token frequencies
  • get_most_common_tokens(texts: List[str], top_k: int = 10) -> List[tuple]: Get most common tokens

⚡ Performance Philosophy

TurboTok is built around these core principles:

  1. NumPy Vectorization: Leverage SIMD operations and C-level speed
  2. Memory Efficiency: Use memory views and pre-allocation
  3. Minimal Python Loops: Avoid slow Python iteration
  4. Optimized Regex: Pre-compiled patterns with atomic groups
  5. Batch Processing: Process multiple texts efficiently
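
To make the first three principles concrete, here is a small NumPy sketch that finds whitespace-delimited token boundaries with array operations instead of a per-character Python loop. It is an illustration of the general technique under simplifying assumptions (only the ASCII space counts as a separator) and is not taken from TurboTok's source.

```python
import numpy as np

text = "Hello world! This is TurboTok."
encoded = text.encode("utf-8")

# View the encoded bytes as a uint8 array without copying them.
data = np.frombuffer(encoded, dtype=np.uint8)

# One vectorized comparison marks every space in the buffer.
is_space = data == ord(" ")

# A token starts at a non-space byte whose predecessor is a space (or at index 0),
# and ends at a non-space byte whose successor is a space (or at the last index).
prev_is_space = np.concatenate(([True], is_space[:-1]))
next_is_space = np.concatenate((is_space[1:], [True]))
starts = np.flatnonzero(~is_space & prev_is_space)
ends = np.flatnonzero(~is_space & next_is_space) + 1

# Only the final slicing step iterates in Python, once per token instead of once per byte.
tokens = [encoded[s:e].decode("utf-8") for s, e in zip(starts, ends)]
```

The boundary search runs in compiled code over the whole buffer, which is where most of the speedup over a character-by-character loop comes from.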

📊 Benchmarks

Performance Targets vs Actual Results

Mode     | Target           | Actual           | Performance
Byte     | 5-10M tokens/sec | 100M+ tokens/sec | 15x faster
Char     | 3-5M tokens/sec  | 95M+ tokens/sec  | 24x faster
Word     | 2-4M tokens/sec  | 2.8M tokens/sec  | Meets target
Sentence | 1-2M tokens/sec  | 800K tokens/sec  | Below target

Run Your Own Benchmarks

from turbotok.benchmarks import run_benchmarks

# Run comprehensive benchmarks
results = run_benchmarks(text_size_mb=1.0, iterations=30)
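
For context on how a tokens/sec figure can be measured, the sketch below times any tokenizer callable with `time.perf_counter`. It is a generic measurement harness, not the internals of `run_benchmarks`; `tokens_per_second` and the stand-in regex tokenizer are assumptions made for the example.

```python
import re
import time
from typing import Callable, List


def tokens_per_second(tokenize: Callable[[str], List[str]],
                      text: str,
                      iterations: int = 30) -> float:
    """Return total tokens produced divided by total wall-clock seconds."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(iterations):
        total_tokens += len(tokenize(text))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


# Stand-in tokenizer so the example runs without TurboTok installed.
regex_tokenize = re.compile(r"\w+|[^\w\s]").findall

sample = "Hello world! This is a benchmark sentence. " * 1000
rate = tokens_per_second(regex_tokenize, sample, iterations=5)
print(f"{rate:,.0f} tokens/sec")
```

Results depend heavily on hardware, input text, and Python version, so compare numbers only when they were measured on the same machine with the same input.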

🧪 Testing

Run the comprehensive test suite:

python -m pytest tests/

Or run tests with performance benchmarks:

python tests/test_core.py

📚 Examples

Check out the examples/ directory for detailed usage examples:

  • quickstart.py: Comprehensive feature demonstration
  • Advanced usage patterns and best practices

🤝 Contributing

We welcome contributions! Please see our contributing guidelines for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with NumPy for exceptional performance
  • Inspired by modern tokenizer libraries
  • Designed for high-throughput NLP applications

TurboTok: Where speed meets simplicity! 🚀
