Tokenize2
A byte-level BPE tokenizer for efficient text processing

Project description
Tokenize2 is an improved byte-level BPE tokenizer, inspired by models such as GPT-3, designed to split text efficiently into subword units. Because it operates on raw bytes, it handles non-ASCII characters robustly, and it supports special tokens for flexible vocabulary management.
Features
- Byte-level tokenization for handling a wide range of characters
- Special tokens (such as <PAD> and <UNK>) for flexible token management
- Efficient BPE merges for subword tokenization
- Suitable for natural language processing and text generation tasks
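Tokenize2's own API is not shown on this page, but the core technique the features above describe, byte-level BPE, can be sketched in plain Python. The function names below are illustrative, not Tokenize2's actual interface: training starts from the raw UTF-8 bytes of the text (so any character is representable) and repeatedly merges the most frequent adjacent pair into a new token id.

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count adjacent id pairs and return the most common one.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the single id `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    # Byte-level: the base vocabulary is the 256 byte values,
    # so non-ASCII text needs no special casing.
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pair = most_frequent_pair(ids)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return ids, merges

ids, merges = train_bpe("low lower lowest", 3)
```

After a few merges the repeated prefix "low" collapses into a single id, which is why BPE compresses text with recurring subwords so effectively.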
Installation
You can install Tokenize2 via pip:
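```shell
pip install Tokenize2
```

The distribution name matches the wheel listed in the download section, Tokenize2-2.0.3-py3-none-any.whl.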
Download files
Source Distribution
tokenize2-2.0.3.tar.gz
(3.9 kB)
Built Distribution
Tokenize2-2.0.3-py3-none-any.whl
Hashes for Tokenize2-2.0.3-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 6f58195d600e503d295bdbf650ce2b58cee4de0e7bd3ebc773ffa0350e492ecc
MD5 | ca993e1526c2092b85dadef7a43745e3
BLAKE2b-256 | b4a0dba0a280c36708e5c1870dc1d8cb6b8eccb31287a9fadbcbc4ccc73a1276