A byte-level BPE tokenizer for efficient text processing

Project description

Tokenize2

Tokenize2 is an improved byte-level BPE tokenizer, inspired by the tokenizers used in models such as GPT-3 and designed to split text efficiently into subword units. It supports special tokens and byte-level text handling, so tokenization stays robust even for non-ASCII text.
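As a rough illustration of the byte-level idea (plain Python, not Tokenize2's API), any string can be reduced to its UTF-8 bytes, so even non-ASCII text maps onto a fixed base alphabet of 256 symbols and nothing is ever out of vocabulary:

    # Byte-level handling in a nutshell (illustrative, not Tokenize2's code):
    # every character, ASCII or not, decomposes into UTF-8 bytes 0-255,
    # so the base vocabulary never meets an unknown symbol.
    text = "héllo 世界"
    byte_ids = list(text.encode("utf-8"))
    print(byte_ids)                          # [104, 195, 169, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]
    print(bytes(byte_ids).decode("utf-8"))   # round-trips back to "héllo 世界"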

Features

  • Byte-level tokenization for handling a wide range of characters
  • Special tokens (like <PAD>, <UNK>) for flexible token management
  • Efficient BPE merges for subword tokenization (see the sketch after this list)
  • Suitable for natural language processing and text generation tasks
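
The merge step behind BPE is conceptually simple: repeatedly find the most frequent adjacent pair of tokens and fuse it into a new token. The sketch below is a generic byte-level BPE training loop in plain Python, shown only to illustrate the technique; it is not taken from Tokenize2's source, and the function names are made up for the example.

    from collections import Counter

    def most_frequent_pair(ids):
        """Return the most frequent adjacent pair of token ids, or None."""
        pairs = Counter(zip(ids, ids[1:]))
        return pairs.most_common(1)[0][0] if pairs else None

    def merge_pair(ids, pair, new_id):
        """Replace every occurrence of `pair` with the single token `new_id`."""
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        return out

    # Generic byte-level BPE training loop (illustrative only):
    ids = list("low lower lowest".encode("utf-8"))
    merges = {}                # (pair) -> new token id
    next_id = 256              # ids 0-255 are reserved for raw bytes
    for _ in range(10):        # the number of merges is a tuning knob
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    print(len(merges), "merges learned;", len(ids), "tokens remain")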

Installation

You can install Tokenize2 via pip:
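
    pip install tokenize2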

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tokenize2-2.0.3.tar.gz (3.9 kB)

Built Distribution

Tokenize2-2.0.3-py3-none-any.whl (4.2 kB)
