⏳ tiktoken
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("text-davinci-003")
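To make the term BPE concrete, here is a minimal pure-Python sketch of byte pair encoding over a single piece of text. It is an illustration only, not tiktoken's implementation; the `toy_ranks` vocabulary and the `bpe_encode` and `bpe_decode` helpers are hypothetical names invented for this example.

```python
from __future__ import annotations


def bpe_encode(mergeable_ranks: dict[bytes, int], data: bytes) -> list[int]:
    """Repeatedly merge the adjacent pair with the lowest rank until none is left."""
    parts = [bytes([b]) for b in data]
    while len(parts) > 1:
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no adjacent pair is in the vocabulary
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return [mergeable_ranks[p] for p in parts]


def bpe_decode(mergeable_ranks: dict[bytes, int], tokens: list[int]) -> bytes:
    """Map each token id back to its bytes and concatenate."""
    by_rank = {rank: tok for tok, rank in mergeable_ranks.items()}
    return b"".join(by_rank[t] for t in tokens)


# Hypothetical toy vocabulary: every single byte, plus three merges.
toy_ranks = {bytes([i]): i for i in range(256)}
toy_ranks[b"he"] = 256
toy_ranks[b"ll"] = 257
toy_ranks[b"hell"] = 258

tokens = bpe_encode(toy_ranks, b"hello")
print(tokens)                         # [258, 111]
print(bpe_decode(toy_ranks, tokens))  # b'hello'
```

Because every single byte is in the vocabulary, any input can be encoded, and decoding always reproduces the original bytes.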
The open source version of tiktoken can be installed from PyPI:
pip install tiktoken
The tokeniser API is documented in tiktoken/core.py.
Example code using tiktoken can be found in the OpenAI Cookbook.
Performance
tiktoken is between 3x and 6x faster than a comparable open source tokeniser.
Performance was measured on 1GB of text using the GPT-2 tokeniser, comparing GPT2TokenizerFast from
tokenizers==0.13.2 and transformers==4.24.0 against tiktoken==0.2.0.
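A rough way to run this kind of comparison on your own data is to time each tokeniser over the same text and compare throughput. The sketch below is an assumption about how one might measure it, not the benchmark script behind the numbers above; `throughput_mb_per_s` is a hypothetical helper, and `str.split` stands in for a real tokeniser so the example runs without extra dependencies.

```python
from __future__ import annotations

import time
from typing import Callable, Sequence


def throughput_mb_per_s(
    encode: Callable[[str], Sequence],
    docs: list[str],
    repeats: int = 3,
) -> float:
    """Return the best-of-`repeats` throughput in megabytes of UTF-8 text per second."""
    total_bytes = sum(len(d.encode("utf-8")) for d in docs)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for d in docs:
            encode(d)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return total_bytes / max(best, 1e-9) / 1e6


# Stand-in workload and stand-in tokeniser (whitespace splitting).
docs = ["hello world " * 1000] * 50
baseline = throughput_mb_per_s(str.split, docs)
print(f"{baseline:.1f} MB/s")
```

With tiktoken installed, passing `enc.encode` as the `encode` argument would measure that encoding on the same documents.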
Getting help
Please post questions in the issue tracker.
If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.
Extending tiktoken
You may wish to extend tiktoken to support new encodings. There are two ways to do this.
1. Create your Encoding object exactly the way you want and simply pass it around.
cl100k_base = tiktoken.get_encoding("cl100k_base")

# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)
2. Use the tiktoken_ext plugin mechanism to register your Encoding objects with tiktoken.
This is only useful if you need tiktoken.get_encoding to find your encoding; otherwise, prefer
option 1.
To do this, you'll need to create a namespace package under tiktoken_ext.
Lay out your project like this, making sure to omit the tiktoken_ext/__init__.py file:
my_tiktoken_extension
├── tiktoken_ext
│ └── my_encodings.py
└── setup.py
my_encodings.py should be a module that contains a variable named ENCODING_CONSTRUCTORS.
This is a dictionary from an encoding name to a function that takes no arguments and returns
arguments that can be passed to tiktoken.Encoding to construct that encoding. For an example, see
tiktoken_ext/openai_public.py. For precise details, see tiktoken/registry.py.
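As a rough sketch of what such a module might contain, the example below defines one constructor for a made-up encoding. The encoding name `my_toy_encoding`, the byte-level vocabulary, the regex and the special token are all assumptions chosen for illustration; the dictionary keys mirror the keyword arguments passed to tiktoken.Encoding in the option 1 example (`name`, `pat_str`, `mergeable_ranks`, `special_tokens`).

```python
# tiktoken_ext/my_encodings.py (hypothetical contents)


def my_toy_encoding() -> dict:
    """Take no arguments and return keyword arguments for tiktoken.Encoding."""
    # Toy vocabulary: one token per byte value, so any input can be encoded.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_toy_encoding",
        # Assumed splitting pattern: runs of whitespace or runs of non-whitespace.
        "pat_str": r"\s+|\S+",
        "mergeable_ranks": mergeable_ranks,
        # Special token ids must not collide with the ranks above.
        "special_tokens": {"<|endoftext|>": 256},
    }


# Maps each encoding name to its zero-argument constructor.
ENCODING_CONSTRUCTORS = {"my_toy_encoding": my_toy_encoding}
```

Once the package is installed, `tiktoken.get_encoding("my_toy_encoding")` should be able to find this constructor through the plugin mechanism described above.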
Your setup.py should look something like this:
from setuptools import setup, find_namespace_packages

setup(
    name="my_tiktoken_extension",
    packages=find_namespace_packages(include=['tiktoken_ext*']),
    install_requires=["tiktoken"],
    ...
)
Then simply pip install ./my_tiktoken_extension and you should be able to use your
custom encodings! Make sure not to use an editable install.
Download files
File details
Details for the file tiktoken-0.3.0.tar.gz.
File metadata
- Download URL: tiktoken-0.3.0.tar.gz
- Upload date:
- Size: 24.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2476a4f4d29293762dc3320d50d866202d7e1c562ac378a785dde51057dcef5e |
| MD5 | f818e4fadc69abd524e73c74d6f347ef |
| BLAKE2b-256 | 8d59dfafae6747926ac8200e303cd45bcf1c152ee569dfad64accb12ab7276e0 |
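The digests listed for each file can be checked against a downloaded copy with Python's standard `hashlib` module. The sketch below is one way to do it; `file_digests` is a hypothetical helper name, and the temporary file only stands in for a real download so the example runs on its own.

```python
from __future__ import annotations

import hashlib
import os
import tempfile


def file_digests(path: str, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Compute SHA256 and BLAKE2b-256 hex digests of a file, reading it in chunks."""
    sha256 = hashlib.sha256()
    blake2b_256 = hashlib.blake2b(digest_size=32)  # 32 bytes = 256 bits
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
            blake2b_256.update(chunk)
    return {"sha256": sha256.hexdigest(), "blake2b-256": blake2b_256.hexdigest()}


# Demonstration on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
digests = file_digests(tmp.name)
os.remove(tmp.name)
print(digests["sha256"])
```

For a real check, pass the path of the downloaded file and compare the `sha256` value with the SHA256 row listed for that file.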
File details
Details for the file tiktoken-0.3.0-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 581.1 kB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ff26fe25480a03fdd15de2dc6c33afab632d5c4deab33b054c42fa25fea98606 |
| MD5 | 66159816a09542866dac06b398fd1d51 |
| BLAKE2b-256 | c503e863c4f47fd1defdf14feaf96c0bfcd587cf073c59b177de6bd3da2f5caf |
File details
Details for the file tiktoken-0.3.0-cp311-cp311-musllinux_1_1_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp311-cp311-musllinux_1_1_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.11, musllinux: musl 1.1+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38f7c2c790cbc8f9122c8f2bcd543d385b8e5557becade29fae6de5ddba74085 |
| MD5 | 735c592b86e2e56f36155e99419cb47d |
| BLAKE2b-256 | e43b7830a6b687df5f69106ca54eba3a64146f574e2f22946e5a2c985104a317 |
File details
Details for the file tiktoken-0.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1eaeeb52a79af618eac095ca91f11ba96d8b18a7e3019ecfaaaa2691838392ba |
| MD5 | 468876a92a0d6abaa80a692dce4b7d0e |
| BLAKE2b-256 | 5c996044c5197ee462ff2f698c3a9f5cc97956126aca53d15cc9b2fe565fa0c4 |
File details
Details for the file tiktoken-0.3.0-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 702.4 kB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 002429fcd9b004cb3b3e859c5ebe9ea8d916a51378128ebedd2bf3bf6320401a |
| MD5 | 45605831bd1303aab2831c545b7b1b6e |
| BLAKE2b-256 | 82d4be453d5110d84291b9e312264f3e5109712748a4f21ac8cea8cda79a791b |
File details
Details for the file tiktoken-0.3.0-cp311-cp311-macosx_10_9_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp311-cp311-macosx_10_9_x86_64.whl
- Upload date:
- Size: 735.2 kB
- Tags: CPython 3.11, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9e04bdabb9ff19a8a0100dae665df8f23838593d6ab6490790fcfe199ee4a8e5 |
| MD5 | 5b343adaa84b01d1b13ac47f1051d475 |
| BLAKE2b-256 | 96b01241f4fc2c7b9827dba9506f73751104a252189c136b0e15cf595760c470 |
File details
Details for the file tiktoken-0.3.0-cp310-cp310-win_amd64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 581.1 kB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c57a1a61167525f4eca8c7b09ebb20e85a77f6de913a07eb6547acf60e9dbe7f |
| MD5 | 61a010dd9161259c6b5ebb396e55b78b |
| BLAKE2b-256 | 7a270826742ce2d59bfcf1e8361300f735ae0cb96611751ff9fb4ba5eb691068 |
File details
Details for the file tiktoken-0.3.0-cp310-cp310-musllinux_1_1_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp310-cp310-musllinux_1_1_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.10, musllinux: musl 1.1+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a62a2a5d29bfe93170e59533951c687f37c90a2610ba780bcca5eec8729468c9 |
| MD5 | 19ae5833169b46b21477a461c8bab5be |
| BLAKE2b-256 | 58d4095bc4f3586940019c524a9fbcfdee13cecf7d31584a6e29fdf0c13e096a |
File details
Details for the file tiktoken-0.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 98d91ea78a792c28664cbc5bee81440bb17393530279b71e3216de8a82253bc2 |
| MD5 | 892617dc1b648ab9b40db330a4c7c357 |
| BLAKE2b-256 | 353563acc50cf36ac0b77511ee8432ceadfc9e636d275cc9b2491eac9fb68a8a |
File details
Details for the file tiktoken-0.3.0-cp310-cp310-macosx_11_0_arm64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp310-cp310-macosx_11_0_arm64.whl
- Upload date:
- Size: 702.4 kB
- Tags: CPython 3.10, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ca6e972e74903c2dc36631f0061240972cdab99bd7b559555628a34e965484f2 |
| MD5 | d4d89fc6adaabaa63fd24e82602c0117 |
| BLAKE2b-256 | 2b0b06f9ef591571d0b3a2e2881ae12a4507896dd0f23c275c5e1a92460698cb |
File details
Details for the file tiktoken-0.3.0-cp310-cp310-macosx_10_9_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp310-cp310-macosx_10_9_x86_64.whl
- Upload date:
- Size: 735.2 kB
- Tags: CPython 3.10, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 15232605e9bc7c2dfa6f67f5608389f8caabca03ef577e5d01bc1c9c5c90e9df |
| MD5 | 848cfd607311edf97377cd98cc2039d1 |
| BLAKE2b-256 | 9db0610213638cfeead02f218ebe41d2b3e4c420b8b09b2d0bed7223917d2442 |
File details
Details for the file tiktoken-0.3.0-cp39-cp39-win_amd64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp39-cp39-win_amd64.whl
- Upload date:
- Size: 581.4 kB
- Tags: CPython 3.9, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 77650f9b4584fc26ba337c00e2d86f847acca1fa03ddf865fb1db935871c6f9e |
| MD5 | 9a989518e60332e8c4ca7c08f00f7199 |
| BLAKE2b-256 | 3ace0db5bb6561df72c9baae6b9d540cc13be804d6c3491775029721d8ba70bc |
File details
Details for the file tiktoken-0.3.0-cp39-cp39-musllinux_1_1_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp39-cp39-musllinux_1_1_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.9, musllinux: musl 1.1+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a5425409d8226e017d482120b070596450d05874e0c99c3f4e788ab9d91da64 |
| MD5 | b202f28d75ee0fd80cc857fa273c0934 |
| BLAKE2b-256 | 268340f8e0ee4e46be4cb146f4148691573a4f36e8e36b47c9c5d6121125ba5b |
File details
Details for the file tiktoken-0.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2cdf72ffe83237a485c7e8f9a609d7d17041e1d866b5c5c424e83c82897f8ea0 |
| MD5 | 1c20272c5615342d61018c6678a4f395 |
| BLAKE2b-256 | 1efd614defd7524433da0e64e737c713520bb27e353ce7689e545a939045862d |
File details
Details for the file tiktoken-0.3.0-cp39-cp39-macosx_11_0_arm64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp39-cp39-macosx_11_0_arm64.whl
- Upload date:
- Size: 702.9 kB
- Tags: CPython 3.9, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f9786a9f6242dd4f15be96a3b39d73428be847ee4a6ba196cae578bb9e9f76c |
| MD5 | 386f990928da807321cf11d6bc5a1b1d |
| BLAKE2b-256 | f29eb4a3a5bcb3b0964196756f5885a5dcad3070f59d4a1c5eef756282747ac8 |
File details
Details for the file tiktoken-0.3.0-cp39-cp39-macosx_10_9_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp39-cp39-macosx_10_9_x86_64.whl
- Upload date:
- Size: 735.4 kB
- Tags: CPython 3.9, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38f34b23122b1a66456c8db98657a3eb5de0c2c361f35bd85dd4565c3e98edd5 |
| MD5 | 085608efa0c802062714ceb1a5e08d18 |
| BLAKE2b-256 | 5c7603b8286cd264f9f5550229fe21f72abc89d431a9a3c887fc365763acc5a4 |
File details
Details for the file tiktoken-0.3.0-cp38-cp38-win_amd64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp38-cp38-win_amd64.whl
- Upload date:
- Size: 581.4 kB
- Tags: CPython 3.8, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 22fd239a81609614cbcff331b14046c3d45ee53c38864d448d7fdbb1bb8b9754 |
| MD5 | 04ded3692003efdfec34f8dbb36544a2 |
| BLAKE2b-256 | e1784fd783a87fcf51fa3f562a6f5d998dc382f774b804bbf202197553824309 |
File details
Details for the file tiktoken-0.3.0-cp38-cp38-musllinux_1_1_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp38-cp38-musllinux_1_1_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.8, musllinux: musl 1.1+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3605d349903749787bf7c50a294a80a675bd988d7f4a4d077a813c4e055f8f4a |
| MD5 | 95608c0462fde5c39bcffcc3d61f3352 |
| BLAKE2b-256 | b53092405b3bc079e8af025e0f693e36c119d3e3a1c6ec2ab610dea3fb9f3b4f |
File details
Details for the file tiktoken-0.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1fd12d235e57ddf0e5298aaa650b62f8d9b6269378bfe2e3e480bfe887f1ec21 |
| MD5 | 17b56931e0b24b749eccd628caf80b99 |
| BLAKE2b-256 | 20c08bff69962c32342bb1360396b99ccf9c6fa743f0b599077edd36d8539b40 |
File details
Details for the file tiktoken-0.3.0-cp38-cp38-macosx_11_0_arm64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp38-cp38-macosx_11_0_arm64.whl
- Upload date:
- Size: 702.7 kB
- Tags: CPython 3.8, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 77cc1c6cf80d2838131cad91dd0a1146e769b0797726591679feae9b20e3ebd6 |
| MD5 | fb923dfdeaff3ed9dca0cab87dfa72ec |
| BLAKE2b-256 | d5c1545e76108d1c876012ec896b70b5f59e95f927099e9b30a18dbcf263f33a |
File details
Details for the file tiktoken-0.3.0-cp38-cp38-macosx_10_9_x86_64.whl.
File metadata
- Download URL: tiktoken-0.3.0-cp38-cp38-macosx_10_9_x86_64.whl
- Upload date:
- Size: 734.9 kB
- Tags: CPython 3.8, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | abc660f7da3c8b47435009ea4c428a6dd6270727a4574ee8d31514290425689b |
| MD5 | 93543581cadc51e58d6494041d8dfd97 |
| BLAKE2b-256 | 46db0dfb9b31fa82c720077b2a9af34682f09459ad7848fdd7225efb7ea148c7 |