BPE & Unigram trainers for Shredword tokenizer

Project description

ShredWord

ShredWord is a byte-pair-encoding (BPE) based tokenizer trainer designed for fast, efficient, and flexible text processing and vocabulary training. It offers training and text-normalization functionality and is backed by a C/C++ core with a Python interface for easy integration into machine-learning workflows.

Note: the Unigram trainer is currently broken; I haven't been able to fix it yet.

Features

  1. Efficient Tokenization: Utilizes BPE for compressing text data and reducing the vocabulary size, making it well-suited for NLP tasks.
  2. Customizable Vocabulary: Allows users to define the target vocabulary size during training.
  3. Save and Load Models: Supports saving and loading trained tokenizers for reuse.
  4. Python Integration: Provides a Python interface for seamless integration and usability.

How It Works

Byte-Pair Encoding (BPE)

BPE is a subword tokenization algorithm that compresses a dataset by merging the most frequent pairs of characters or subwords into new tokens. This process continues until a predefined vocabulary size is reached.

Key steps:

  1. Initialize the vocabulary with all unique characters in the dataset.
  2. Count the frequency of character pairs.
  3. Merge the most frequent pair into a new token.
  4. Repeat until the target vocabulary size is achieved.
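The steps above can be sketched in pure Python on a toy corpus. This is an illustration of the algorithm only, not ShredWord's C/C++ implementation; `get_pair_counts` and `merge_pair` are hypothetical helpers, and the real trainer runs to a target vocabulary size rather than a fixed number of merges.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # fuse the pair into one token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word initialized as its characters (step 1).
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
merges = []
for _ in range(3):  # three merge steps instead of a full vocab-size target
    counts = get_pair_counts(words)      # step 2: count pairs
    best = max(counts, key=counts.get)   # step 3: pick the most frequent pair
    merges.append(best)
    words = merge_pair(words, best)      # step 4: repeat
# merges now records ('l','o'), then ('lo','w'), then ('low','e')
```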

ShredWord implements this process efficiently in C/C++, exposing training, encoding, and decoding methods through Python.
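At encoding time, the learned merge rules are replayed in training order over the input text. The sketch below shows that idea in pure Python; `encode` is a hypothetical helper for illustration, not ShredWord's actual API, and the merge list is the toy result of a few training steps.

```python
def encode(text, merges):
    """Greedily apply learned merges, earliest-learned first."""
    symbols = list(text)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("l", "o"), ("lo", "w"), ("low", "e")]
tokens = encode("lowest", merges)  # ['lowe', 's', 't']
```

Decoding is simply concatenation of the subword strings: `"".join(tokens)` recovers the original text.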

Installation

Prerequisites

  • Python 3.11+
  • GCC or a compatible compiler (for compiling the C/C++ code)

Steps

  1. Install the Python package from PyPI.org:

    pip install shredword-trainer
    

Usage

Below is a simple example demonstrating how to use ShredWord for training, encoding, and decoding text.

Example

from shredword.trainer import BPETrainer

# Create a trainer targeting a 500-token vocabulary with a minimum pair frequency of 1000
trainer = BPETrainer(target_vocab_size=500, min_pair_freq=1000)
trainer.load_corpus("test data/final.txt")
trainer.train()
# Save the learned merge rules and vocabulary separately
trainer.save("model/merges_1k.model", "model/vocab_1k.vocab")

API Overview

Core Methods

  • train(text, vocab_size): Train a tokenizer on the input text to a specified vocabulary size.
  • save(file_path): Save the trained tokenizer to a file.

Properties

  • merges: View or set the merge rules for tokenization.
  • vocab: Access the vocabulary as a dictionary of token IDs to strings.
  • pattern: View or set the regular expression pattern used for token splitting.
  • special_tokens: View or set special tokens used by the tokenizer.

Advanced Features

Saving and Loading

Trained tokenizers can be saved to a file and reloaded for use in future tasks. The saved model includes merge rules and any special tokens or patterns defined during training.

# Save the trained model
tokenizer.save("vocab/trained_vocab.model")

# Load the model
tokenizer.load("vocab/trained_vocab.model")

Customization

Users can define special tokens or modify the merge rules and pattern directly using the provided properties.

# Set special tokens
special_tokens = [("<PAD>", 0), ("<UNK>", 1)]
tokenizer.special_tokens = special_tokens

# Update merge rules
merges = [(101, 32, 256), (32, 116, 257)]
tokenizer.merges = merges

a project by Shivendra


Download files

Download the file for your platform.

Source Distribution

shredword_trainer-0.0.2.tar.gz (44.6 kB)

Uploaded: Source

Built Distribution


shredword_trainer-0.0.2-cp313-cp313-win_amd64.whl (61.0 kB)

Uploaded: CPython 3.13, Windows x86-64

File details

Details for the file shredword_trainer-0.0.2.tar.gz.

File metadata

  • Download URL: shredword_trainer-0.0.2.tar.gz
  • Upload date:
  • Size: 44.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for shredword_trainer-0.0.2.tar.gz

  • SHA256: c06d6589c29364c9c4aff2169cad229412b7bfa9cb5a2b64096114131bc27337
  • MD5: e32b6e537787cca6fef8dc43853b8fcf
  • BLAKE2b-256: 1e7bade94183ffbd3788d7339aa03b647516defff3493f6f5a5aa77f8fb3dd4a


File details

Details for the file shredword_trainer-0.0.2-cp313-cp313-win_amd64.whl.

File hashes

Hashes for shredword_trainer-0.0.2-cp313-cp313-win_amd64.whl

  • SHA256: 86cd439b3391c80b825e151b32cf244c0c3e1b827976ff3cdd7bc0007c8c28a7
  • MD5: 9f8d6420cd0254ca2646d5b601dae7be
  • BLAKE2b-256: d1df59c4f3d1a3919d463f18d36a417d47ec9ad161892a4f2f42e2cff037a2d2

