IncliToken is a from-scratch implementation of a Byte Pair Encoding (BPE) tokenizer.

IncliToken

A simple Byte Pair Encoding (BPE) tokenizer implementation from scratch in Python.

Installation

uv add inclitoken

Or you can use pip:

pip install inclitoken

Usage

from inclitoken.tokenizer import BPETokenizer

# Initialize tokenizer
tokenizer = BPETokenizer()

# Train on your text
text = "Hello world! This is a simple example."
tokenizer.train(text, turns=100, verbose=False)

# Encode text to token IDs
ids = tokenizer.encode("Hello world!")
print(ids)

# Decode token IDs back to text
decoded = tokenizer.decode(ids)
print(decoded)

Features

  • Train custom BPE tokenizers on your text
  • Encode text into token IDs
  • Decode token IDs back into text
  • Track merge operations and vocabulary
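To make the features above concrete, here is a minimal sketch of byte-level BPE in plain Python. It is not IncliToken's actual code: the helper names (train_bpe, merge, encode, decode) and the choice to start from raw UTF-8 bytes with new IDs from 256 upward are assumptions made for illustration only.

```python
from collections import Counter


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> dict[tuple[int, int], int]:
    """Learn up to `num_merges` merges; returns {pair: new_id} in learned order."""
    ids = list(text.encode("utf-8"))  # assumption: base vocabulary is bytes 0..255
    merges: dict[tuple[int, int], int] = {}
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply the learned merges, in the order they were learned, to new text."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand token IDs back to bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

In this sketch the merges dictionary doubles as the record of merge operations, and the vocabulary is rebuilt from it at decode time. A real tokenizer would typically cache the vocabulary and may split text before merging.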

Requirements

  • Python >= 3.14
  • tqdm

Author

Built by Adarsh Dubey

Download files

Source Distribution

inclitoken-0.1.0.tar.gz (3.2 kB view details)

Uploaded Source

Built Distribution

inclitoken-0.1.0-py3-none-any.whl (4.7 kB view details)

Uploaded Python 3

File details

Details for the file inclitoken-0.1.0.tar.gz.

File metadata

  • Download URL: inclitoken-0.1.0.tar.gz
  • Upload date:
  • Size: 3.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.7

File hashes

Hashes for inclitoken-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       044b61399c8c727726dc4e01c4b8a6cfe3a3dc9c95ec9bb582a9892c5842c070
MD5          96d37070845f8965826c12bfde5d8639
BLAKE2b-256  c659591c182d271f177727ef38993abbf6cfbe9bb0c12eac6d61dfae02c68582

File details

Details for the file inclitoken-0.1.0-py3-none-any.whl.

File hashes

Hashes for inclitoken-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       930b64db0ebe9087fd8d6713524de46b6e243c5f79cc1e6b51862a648ed45144
MD5          d890dca4085294fa73a0eb2ed8768709
BLAKE2b-256  c2fec9242cfe53715ea14a8dac9e6e53723143f0086633bb7803ff2fa9d52e7f
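The SHA256 digests listed above can be checked against a downloaded file with Python's standard hashlib module. The sketch below is one way to do that; the function names sha256_of and matches_published are hypothetical, and the file is assumed to have been downloaded to a local path already.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published("inclitoken-0.1.0.tar.gz", "044b61399c8c727726dc4e01c4b8a6cfe3a3dc9c95ec9bb582a9892c5842c070") should return True for an unmodified copy of the source distribution.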
