A byte-level BPE tokenizer for efficient text processing

Project description

Tokenize2

Tokenize2 is an improved byte-level byte-pair encoding (BPE) tokenizer, inspired by the tokenizers used with models like GPT-3, that splits text into subword units. It supports special tokens and operates on UTF-8 bytes rather than characters, which lets it tokenize non-ASCII text robustly.
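To illustrate what byte-level handling means in general, the sketch below maps text to its UTF-8 byte values and back. It is not Tokenize2's API: the function names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical and exist only for this example.

```python
# Illustrative sketch only; these helpers are hypothetical, not part of Tokenize2.

def text_to_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values (0-255) that make up the text."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(byte_ids: list[int]) -> str:
    """Rebuild the text from its UTF-8 byte values."""
    return bytes(byte_ids).decode("utf-8")


if __name__ == "__main__":
    sample = "café 😀"
    ids = text_to_byte_ids(sample)
    # "é" becomes two bytes and the emoji becomes four, so every character
    # is representable with a base vocabulary of only 256 byte values.
    print(ids)  # [99, 97, 102, 195, 169, 32, 240, 159, 152, 128]
    print(byte_ids_to_text(ids) == sample)  # True
```

Because every string reduces to values in the range 0 to 255, a byte-level tokenizer never meets a character outside its base vocabulary.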

Features

  • Byte-level tokenization for handling a wide range of characters
  • Special tokens (like <PAD>, <UNK>) for flexible token management
  • Efficient BPE merges for subword tokenization
  • Suitable for natural language processing and text generation tasks
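The features above can be sketched as a small, generic byte-level BPE routine. This is a minimal illustration under stated assumptions, not Tokenize2's implementation or API: the names `train_bpe`, `encode`, `decode`, and `merge_pair`, and the choice to reserve ids 0 and 1 for `<PAD>` and `<UNK>`, are all hypothetical.

```python
from collections import Counter

# Assumption for this sketch: special tokens take the lowest ids,
# raw bytes follow, and learned merges come after the 256 byte ids.
SPECIAL_TOKENS = ["<PAD>", "<UNK>"]
BYTE_OFFSET = len(SPECIAL_TOKENS)


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    merges = {}
    next_id = BYTE_OFFSET + 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so further merges would not compress
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Convert text to token ids, applying merges in the order they were learned."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand merged tokens back to bytes; special and unknown ids are skipped."""
    expand = {new_id: pair for pair, new_id in merges.items()}
    stack = list(reversed(ids))
    out = bytearray()
    while stack:
        tok = stack.pop()
        if tok in expand:
            left, right = expand[tok]
            stack.append(right)
            stack.append(left)
        elif BYTE_OFFSET <= tok < BYTE_OFFSET + 256:
            out.append(tok - BYTE_OFFSET)
    return out.decode("utf-8", errors="replace")


if __name__ == "__main__":
    merges = train_bpe("low lower lowest", num_merges=3)
    ids = encode("low low", merges)
    print(ids, decode(ids, merges))
```

Applying the merges in training order at encode time keeps encoding deterministic, and because the base vocabulary is raw bytes, text that was never seen during training still round-trips through `encode` and `decode`.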

Installation

You can install Tokenize2 via pip:

    pip install tokenize2


Download files

Source Distribution

tokenize2-2.0.3.tar.gz (3.9 kB)

Uploaded Source

Built Distribution

Tokenize2-2.0.3-py3-none-any.whl (4.2 kB)

Uploaded Python 3

File details

Details for the file tokenize2-2.0.3.tar.gz.

File metadata

  • Download URL: tokenize2-2.0.3.tar.gz
  • Upload date:
  • Size: 3.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for tokenize2-2.0.3.tar.gz
Algorithm Hash digest
SHA256 61c5730bc4d897cb55dae1c9a00b05ffc66d7c794623f75f33d6e69f7c7a6d86
MD5 5ae284fa012d2385861c04a229e1eee5
BLAKE2b-256 231e19f970ac89f7b6dcbafa5aed933331fd2a57cb989c9ecc7f756f15d3df40
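A downloaded file can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a minimal sketch: the helper names `sha256_of_file` and `verify_download` are hypothetical, and the local path passed to them is whatever location you saved the file to.

```python
import hashlib

# SHA256 digest listed above for tokenize2-2.0.3.tar.gz.
EXPECTED_SHA256 = "61c5730bc4d897cb55dae1c9a00b05ffc66d7c794623f75f33d6e69f7c7a6d86"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` matches the expected digest."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    import sys

    # Usage: python verify.py path/to/tokenize2-2.0.3.tar.gz
    if len(sys.argv) == 2:
        print("OK" if verify_download(sys.argv[1]) else "MISMATCH")
```

A mismatch means the file differs from the one that was uploaded and should not be installed.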


File details

Details for the file Tokenize2-2.0.3-py3-none-any.whl.

File metadata

  • Download URL: Tokenize2-2.0.3-py3-none-any.whl
  • Upload date:
  • Size: 4.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for Tokenize2-2.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 6f58195d600e503d295bdbf650ce2b58cee4de0e7bd3ebc773ffa0350e492ecc
MD5 ca993e1526c2092b85dadef7a43745e3
BLAKE2b-256 b4a0dba0a280c36708e5c1870dc1d8cb6b8eccb31287a9fadbcbc4ccc73a1276
