A byte-level BPE tokenizer for efficient text processing
Project description
Tokenize2
Tokenize2 is an improved byte-level BPE tokenizer, inspired by models like GPT-3, designed for efficient tokenization of text into subword units. It supports special tokens and byte-level text handling for robust tokenization, including for non-ASCII characters.
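To make the idea concrete, here is a minimal sketch of the general byte-level BPE technique the description refers to: start from the raw UTF-8 bytes of the text (so every character, ASCII or not, is representable) and repeatedly merge the most frequent adjacent token pair. This illustrates the algorithm only; it is not Tokenize2's actual implementation or API.

```python
from collections import Counter

def byte_level_bpe(text: str, num_merges: int) -> list:
    """Sketch of byte-level BPE: begin with single UTF-8 bytes and
    greedily merge the most frequent adjacent pair of tokens."""
    # Every string decomposes into bytes, so no character is ever "unknown".
    tokens = [(b,) for b in text.encode("utf-8")]
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # concatenate the two byte tuples
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

Because the starting alphabet is the 256 possible byte values, the fallback for rare or non-ASCII input is always well-defined: the text simply stays split into smaller byte-level pieces.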
Features
- Byte-level tokenization for handling a wide range of characters
- Special tokens (like <PAD>, <UNK>) for flexible token management
- Supports efficient BPE merges for subword tokenization
- Suitable for natural language processing and text generation tasks
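One way special tokens are typically combined with byte-level encoding is to split them out of the input first, so they map to reserved ids and are never broken into bytes. The sketch below is a hypothetical illustration of that pattern; the `SPECIALS` id layout and `encode_with_specials` function are assumptions for this example, not Tokenize2's actual interface.

```python
import re

# Hypothetical vocabulary layout: special tokens take ids above the
# 256 ids reserved for raw byte values (an assumption for illustration).
SPECIALS = {"<PAD>": 256, "<UNK>": 257}

def encode_with_specials(text: str) -> list:
    """Split out special tokens first, then fall back to raw UTF-8 bytes."""
    pattern = "(" + "|".join(re.escape(s) for s in SPECIALS) + ")"
    ids = []
    for piece in re.split(pattern, text):  # capturing group keeps the specials
        if piece in SPECIALS:
            ids.append(SPECIALS[piece])
        else:
            ids.extend(piece.encode("utf-8"))  # byte-level fallback
    return ids
```

Splitting before encoding guarantees that a literal "<PAD>" in user text and the special token are handled by one consistent rule rather than colliding in the byte stream.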
Installation
You can install Tokenize2 via pip:
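Assuming the distribution name matches the files listed below (`tokenize2`), the install command would be:

```shell
pip install tokenize2
```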
Download files
Download the file for your platform.
Source Distribution
tokenize2-2.0.3.tar.gz (3.9 kB)
Built Distribution
Tokenize2-2.0.3-py3-none-any.whl (4.2 kB)
File details
Details for the file tokenize2-2.0.3.tar.gz.
File metadata
- Download URL: tokenize2-2.0.3.tar.gz
- Upload date:
- Size: 3.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 61c5730bc4d897cb55dae1c9a00b05ffc66d7c794623f75f33d6e69f7c7a6d86
MD5 | 5ae284fa012d2385861c04a229e1eee5
BLAKE2b-256 | 231e19f970ac89f7b6dcbafa5aed933331fd2a57cb989c9ecc7f756f15d3df40
File details
Details for the file Tokenize2-2.0.3-py3-none-any.whl.
File metadata
- Download URL: Tokenize2-2.0.3-py3-none-any.whl
- Upload date:
- Size: 4.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6f58195d600e503d295bdbf650ce2b58cee4de0e7bd3ebc773ffa0350e492ecc
MD5 | ca993e1526c2092b85dadef7a43745e3
BLAKE2b-256 | b4a0dba0a280c36708e5c1870dc1d8cb6b8eccb31287a9fadbcbc4ccc73a1276