tiktoken-chatml
Adds support for the ChatML chat template to tiktoken tokenizers:
- Remap or remove the OpenAI special tokens so that only the ChatML special tokens <|im_start|> and <|im_end|> are supported;
- Always maintain the original vocabulary size if possible;
- Add the apply_chat_template method known from HF tokenizers;
- Maintain the full functionality of the tiktoken tokenizer.
Intended for training models from scratch. For your model's safety, recheck all changes before using.
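The remapping idea in the list above can be illustrated with a small, self-contained sketch. It is not taken from this package: the function name remap_special_tokens and the "reuse the highest ids first" strategy are assumptions made for illustration, and the cl100k_base ids in the example are included only as sample input.

```python
CHATML_TOKENS = ("<|im_start|>", "<|im_end|>")


def remap_special_tokens(special_tokens: dict[str, int], n_vocab: int) -> dict[str, int]:
    """Hypothetical strategy: give the ChatML tokens the ids of existing special tokens."""
    # Reuse the highest existing ids first, so the top id (which fixes the
    # vocabulary size) stays occupied after the old tokens are dropped.
    reusable = sorted(special_tokens.values(), reverse=True)
    next_new_id = n_vocab  # only used when there are too few slots to reuse
    remapped = {}
    for token in CHATML_TOKENS:
        if reusable:
            remapped[token] = reusable.pop(0)
        else:
            remapped[token] = next_new_id
            next_new_id += 1
    return remapped


# Sample input: special tokens as defined for cl100k_base (illustrative only).
cl100k_special = {
    "<|endoftext|>": 100257,
    "<|fim_prefix|>": 100258,
    "<|fim_middle|>": 100259,
    "<|fim_suffix|>": 100260,
    "<|endofprompt|>": 100276,
}
chatml_special = remap_special_tokens(cl100k_special, n_vocab=100277)
print(chatml_special)
```

With five reusable slots the vocabulary size is unchanged. With a single slot (as in gpt2, which has only <|endoftext|>) this sketch has to append one new id, which is one reading of the "if possible" caveat.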
Installation
pip install tiktoken-chatml
Quickstart
import tiktoken_chatml

enc = tiktoken_chatml.get_encoding("cl100k_base-chatml")
output = enc.apply_chat_template(
    [
        {"role": "system", "content": "This is a system message."},
        {"role": "user", "content": "Hello!"},
    ],
    tokenize=False,
)
print(output)
Output:
<|im_start|>system
This is a system message.
<|im_end|>
<|im_start|>user
Hello!
<|im_end|>
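For readers who want to see the shape of the template without installing anything, here is a minimal sketch that reproduces the line layout of the output above. The helper name render_chatml is hypothetical, and the exact whitespace the package emits around <|im_end|> is an assumption based only on the output as displayed here.

```python
def render_chatml(messages: list[dict[str, str]]) -> str:
    """Hypothetical helper: render messages in the ChatML layout shown above."""
    parts = []
    for message in messages:
        # Each turn opens with <|im_start|> immediately followed by the role,
        # then the content, then a closing <|im_end|> (assumed to sit on its own line).
        parts.append(f"<|im_start|>{message['role']}")
        parts.append(message["content"])
        parts.append("<|im_end|>")
    return "\n".join(parts)


messages = [
    {"role": "system", "content": "This is a system message."},
    {"role": "user", "content": "Hello!"},
]
print(render_chatml(messages))
```

This is only a plain-string stand-in for tokenize=False; it does no tokenization and does not depend on tiktoken.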
Setting tokenize=True invokes tiktoken's encoding.encode().
You can use this encoding as a drop-in replacement for tiktoken.
Supported encodings:
SUPPORTED_ENCODINGS = ["o200k_base-chatml", "cl100k_base-chatml", "gpt2-chatml"]
The eot_token is now <|im_end|>:
>>> enc.eot_token == enc.encode("<|im_end|>", allowed_special="all")[0]
True