A byte-level BPE tokenizer for efficient text processing

Project description

Tokenize2

Tokenize2 is an improved byte-level BPE tokenizer, inspired by the tokenizers used in models such as GPT-3 and designed to split text efficiently into subword units. It supports special tokens and byte-level text handling, so tokenization stays robust even for non-ASCII text.
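As a rough illustration of the byte-level idea (plain Python, not Tokenize2's API), any string can be reduced to its UTF-8 bytes, so even non-ASCII text maps onto a fixed base alphabet of 256 symbols and nothing is ever out of vocabulary:

    # Byte-level handling in a nutshell (illustrative, not Tokenize2's code):
    # every character, ASCII or not, decomposes into UTF-8 bytes 0-255,
    # so the base vocabulary never meets an unknown symbol.
    text = "héllo 世界"
    byte_ids = list(text.encode("utf-8"))
    print(byte_ids)                          # [104, 195, 169, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]
    print(bytes(byte_ids).decode("utf-8"))   # round-trips back to "héllo 世界"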

Features

  • Byte-level tokenization for handling a wide range of characters
  • Special tokens (like <PAD>, <UNK>) for flexible token management
  • Efficient BPE merges for subword tokenization (see the sketch after this list)
  • Suitable for natural language processing and text generation tasks
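
The merge step behind BPE is conceptually simple: repeatedly find the most frequent adjacent pair of tokens and fuse it into a new token. The sketch below is a generic byte-level BPE training loop in plain Python, shown only to illustrate the technique; it is not taken from Tokenize2's source, and the function names are made up for the example.

    from collections import Counter

    def most_frequent_pair(ids):
        """Return the most frequent adjacent pair of token ids, or None."""
        pairs = Counter(zip(ids, ids[1:]))
        return pairs.most_common(1)[0][0] if pairs else None

    def merge_pair(ids, pair, new_id):
        """Replace every occurrence of `pair` with the single token `new_id`."""
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        return out

    # Generic byte-level BPE training loop (illustrative only):
    ids = list("low lower lowest".encode("utf-8"))
    merges = {}                # (pair) -> new token id
    next_id = 256              # ids 0-255 are reserved for raw bytes
    for _ in range(10):        # the number of merges is a tuning knob
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    print(len(merges), "merges learned;", len(ids), "tokens remain")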

Installation

You can install Tokenize2 via pip:
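
    pip install tokenize2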

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tokenize2-2.0.3.tar.gz (3.9 kB)

Built Distribution

Tokenize2-2.0.3-py3-none-any.whl (4.2 kB)
