tiktoken-chatml

Adds support for the ChatML chat template to tiktoken tokenizers:

  • Remaps or removes OpenAI special tokens so that only the ChatML special tokens remain: <|im_start|> and <|im_end|>;
  • Keeps the original vocabulary size whenever possible;
  • Adds the apply_chat_template method known from Hugging Face (HF) tokenizers;
  • Maintains the full functionality of the tiktoken tokenizer.
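The remapping idea in the first two points can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's actual code: the function name remap_special_tokens, the choice to reuse the highest ids first, and the token ids in the usage lines are all hypothetical.

```python
# Hypothetical sketch: reuse existing special-token ids for the two ChatML
# tokens so that the largest token id (and so the vocabulary size) is unchanged
# whenever the original encoding has at least two special tokens.
CHATML_TOKENS = ("<|im_start|>", "<|im_end|>")


def remap_special_tokens(original: dict[str, int]) -> dict[str, int]:
    """Return a special-token table that contains only the ChatML tokens."""
    if not original:
        raise ValueError("expected at least one original special token")

    # Highest ids first, so the current maximum id stays occupied.
    reusable_ids = sorted(original.values(), reverse=True)
    next_new_id = reusable_ids[0] + 1

    remapped: dict[str, int] = {}
    for token in CHATML_TOKENS:
        if reusable_ids:
            remapped[token] = reusable_ids.pop(0)
        else:
            # Not enough slots to reuse: the vocabulary has to grow by one.
            remapped[token] = next_new_id
            next_new_id += 1
    return remapped


# Made-up ids, for illustration only.
many = {"<|endoftext|>": 1000, "<|fim_prefix|>": 1001, "<|endofprompt|>": 1002}
print(remap_special_tokens(many))

single = {"<|endoftext|>": 500}
print(remap_special_tokens(single))
```

In the first call the maximum id is unchanged, so the vocabulary size is preserved. In the second call only one slot can be reused, which is the kind of case the "whenever possible" qualifier leaves room for.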

Intended for training models from scratch. For the safety of your model, recheck all changes before use.

Installation

pip install tiktoken-chatml

Quickstart

import tiktoken_chatml

enc = tiktoken_chatml.get_encoding("cl100k_base-chatml")

output = enc.apply_chat_template(
    [
        {"role": "system", "content": "This is a system message."},
        {"role": "user", "content": "Hello!"},
    ],
    tokenize=False,
)
print(output)

Output:

<|im_start|>system
This is a system message.
<|im_end|>
<|im_start|>user
Hello!
<|im_end|>
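
For readers who want to see the layout without installing anything, the text printed above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes the format is exactly the one shown in the Output; the function name render_chatml is hypothetical, the trailing newline after the last <|im_end|> is an assumption, and the package's own template may differ in details.

```python
# Hypothetical re-implementation of the printed layout, for illustration only.
def render_chatml(messages: list[dict[str, str]]) -> str:
    """Render messages as: <|im_start|>role, content, <|im_end|>, one per line."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n"
            f"{message['content']}\n"
            "<|im_end|>\n"  # trailing newline is an assumption
        )
    return "".join(parts)


text = render_chatml(
    [
        {"role": "system", "content": "This is a system message."},
        {"role": "user", "content": "Hello!"},
    ]
)
print(text)
```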

Setting tokenize=True invokes tiktoken encoding.encode().

You can use this encoding as a drop-in replacement for a tiktoken encoding.

Supported encodings:

SUPPORTED_ENCODINGS = ["o200k_base-chatml", "cl100k_base-chatml", "gpt2-chatml"]
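
Each supported name is a tiktoken encoding name with a -chatml suffix. A small helper can validate a name before it is passed to get_encoding. The sketch below is not part of the package: to_chatml_name is a hypothetical function, and the list is copied from the line above.

```python
# Copied from the README; the helper below is hypothetical.
SUPPORTED_ENCODINGS = ["o200k_base-chatml", "cl100k_base-chatml", "gpt2-chatml"]


def to_chatml_name(name: str) -> str:
    """Append the -chatml suffix if it is missing and check the result."""
    candidate = name if name.endswith("-chatml") else f"{name}-chatml"
    if candidate not in SUPPORTED_ENCODINGS:
        raise ValueError(
            f"unsupported encoding {name!r}; choose one of {SUPPORTED_ENCODINGS}"
        )
    return candidate


print(to_chatml_name("cl100k_base"))
print(to_chatml_name("gpt2-chatml"))
```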

The eot_token is now <|im_end|>:

>>> enc.eot_token == enc.encode("<|im_end|>", allowed_special="all")[0]
True
