tiktoken-async is a fast BPE tokeniser for use with OpenAI's models, with added support for asynchronous processing.
⏳ tiktoken-async
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
import asyncio
import tiktoken_async
enc = asyncio.run(tiktoken_async.get_encoding("cl100k_base"))
assert enc.decode(enc.encode("hello world")) == "hello world"
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = asyncio.run(tiktoken_async.encoding_for_model("gpt-4"))
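The calls above use asyncio.run, which starts a fresh event loop and cannot be called from code that is already running inside one. In that situation the loader would be awaited instead. The following is a minimal sketch of that pattern, assuming get_encoding is a coroutine function (as the asyncio.run usage above suggests). The helper name count_tokens and the stand-in loader are hypothetical and exist only so the sketch runs without the library or network access.

```python
import asyncio


async def count_tokens(get_encoding, encoding_name, texts):
    """Await the encoding loader once, then count tokens for each text.

    `get_encoding` is any coroutine function returning an object with an
    `encode(str) -> list[int]` method, e.g. tiktoken_async.get_encoding.
    """
    enc = await get_encoding(encoding_name)
    return [len(enc.encode(text)) for text in texts]


# Stand-in encoding so the sketch runs offline; it is NOT part of tiktoken_async.
class _ByteEncoding:
    def encode(self, text):
        # One "token" per UTF-8 byte.
        return list(text.encode("utf-8"))


async def _fake_get_encoding(name):
    return _ByteEncoding()


counts = asyncio.run(count_tokens(_fake_get_encoding, "bytes", ["hello world", "hi"]))
print(counts)
```

With the real library installed, one would presumably pass tiktoken_async.get_encoding and "cl100k_base" in place of the stand-ins.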
The open source version of tiktoken-async can be installed from PyPI:
pip install tiktoken-async
The tokeniser API is documented in tiktoken_async/core.py.
Example code using tiktoken can be found in the OpenAI Cookbook.
Performance
tiktoken is between 3x and 6x faster than a comparable open source tokeniser.
Performance was measured on 1GB of text using the GPT-2 tokeniser, comparing against GPT2TokenizerFast from tokenizers==0.13.2 and transformers==4.24.0, with tiktoken==0.2.0.
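To check throughput on your own data, a small timing harness is usually enough. The sketch below is not the benchmark behind the numbers above; measure_throughput is a hypothetical helper, and the stand-in encoder simply converts text to UTF-8 bytes so the example runs without any tokeniser installed.

```python
import time


def measure_throughput(encode, text, repeats=3):
    """Return the best observed throughput of `encode` in bytes per second."""
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading on very coarse timers.
    return n_bytes / max(best, 1e-9)


sample = "hello world " * 10_000
# Stand-in encoder (UTF-8 bytes); swap in enc.encode from a loaded encoding.
throughput = measure_throughput(lambda s: list(s.encode("utf-8")), sample)
print(f"{throughput / 1e6:.1f} MB/s")
```

Taking the best of several runs reduces noise from warm-up and other processes; for a fair comparison, each tokeniser should be measured on the same text and machine.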
Getting help
Please post questions in the issue tracker.
If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.
Extending tiktoken
You may wish to extend tiktoken-async to support new encodings. There are two ways to do this.
1. Create your Encoding object exactly the way you want and simply pass it around.
import asyncio
import tiktoken_async
cl100k_base = asyncio.run(tiktoken_async.get_encoding("cl100k_base"))
# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken_async.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)
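When adding special tokens as in the example above, it can help to check that the new IDs do not clash with IDs already in use. The sketch below shows one way to do that in plain Python. The helper extend_special_tokens is hypothetical (it is not part of tiktoken_async), and the toy tables stand in for the real contents of cl100k_base.

```python
def extend_special_tokens(mergeable_ranks, special_tokens, extra_tokens):
    """Return special_tokens merged with extra_tokens, rejecting name or ID clashes."""
    used_ids = set(mergeable_ranks.values()) | set(special_tokens.values())
    merged = dict(special_tokens)
    for token, token_id in extra_tokens.items():
        if token in merged:
            raise ValueError(f"special token {token!r} already defined")
        if token_id in used_ids:
            raise ValueError(f"token id {token_id} already in use")
        merged[token] = token_id
        used_ids.add(token_id)
    return merged


# Toy data standing in for the real tables of an existing encoding.
ranks = {b"a": 0, b"b": 1, b"ab": 2}
base_special = {"<|endoftext|>": 3}
special = extend_special_tokens(
    ranks, base_special, {"<|im_start|>": 4, "<|im_end|>": 5}
)
print(special)
```

The returned dictionary could then be passed as the special_tokens argument when constructing the new Encoding.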
2. Use the tiktoken_async_ext plugin mechanism to register your Encoding objects with tiktoken_async.
This is only useful if you need tiktoken_async.get_encoding to find your encoding; otherwise prefer option 1.
To do this, you'll need to create a namespace package under tiktoken_async_ext.
Lay out your project like this, making sure to omit the tiktoken_async_ext/__init__.py file:
my_tiktoken_extension
├── tiktoken_async_ext
│   └── my_encodings.py
└── setup.py
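A stray __init__.py is an easy mistake to make with this layout, so a quick check before installing can save some debugging. The following sketch builds the layout in a temporary directory and validates it; check_extension_layout is a hypothetical helper, not something shipped with tiktoken_async.

```python
import tempfile
from pathlib import Path


def check_extension_layout(root, namespace="tiktoken_async_ext"):
    """Return a list of problems found in a plugin project's layout."""
    problems = []
    ns_dir = Path(root) / namespace
    if not ns_dir.is_dir():
        problems.append(f"missing directory: {namespace}/")
        return problems
    if (ns_dir / "__init__.py").exists():
        problems.append(f"{namespace}/__init__.py must not exist in a namespace package")
    if not any(p.name != "__init__.py" for p in ns_dir.glob("*.py")):
        problems.append(f"no encoding modules found in {namespace}/")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my_tiktoken_extension"
    (project / "tiktoken_async_ext").mkdir(parents=True)
    (project / "tiktoken_async_ext" / "my_encodings.py").write_text(
        "ENCODING_CONSTRUCTORS = {}\n"
    )
    (project / "setup.py").write_text("")
    good = check_extension_layout(project)  # layout as shown above
    (project / "tiktoken_async_ext" / "__init__.py").write_text("")
    bad = check_extension_layout(project)  # same layout with a stray __init__.py

print(good)
print(bad)
```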
my_encodings.py should be a module that contains a variable named ENCODING_CONSTRUCTORS.
This is a dictionary from an encoding name to a function that takes no arguments and returns arguments that can be passed to tiktoken_async.Encoding to construct that encoding. For an example, see tiktoken_async_ext/openai_public.py. For precise details, see tiktoken_async/registry.py.
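As an illustration of that description, a my_encodings.py module might look roughly like the sketch below. It assumes the constructor is a plain function returning the same keyword arguments used in the Encoding example earlier (name, pat_str, mergeable_ranks, special_tokens). The encoding name, split pattern, and token table are toy values invented for this example; check tiktoken_async/registry.py for what the registry actually expects.

```python
# tiktoken_async_ext/my_encodings.py (illustrative sketch)


def my_toy_encoding():
    """Return keyword arguments for a toy byte-level encoding."""
    # Every single byte is its own token: rank i for byte value i.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_toy_encoding",
        # Toy split pattern; real encodings use far more careful patterns.
        "pat_str": r"\s+|\w+|[^\s\w]+",
        "mergeable_ranks": mergeable_ranks,
        # Special token IDs start after the last mergeable rank.
        "special_tokens": {"<|endoftext|>": 256},
    }


# Maps each encoding name to its zero-argument constructor.
ENCODING_CONSTRUCTORS = {"my_toy_encoding": my_toy_encoding}
```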
Your setup.py should look something like this:
from setuptools import setup, find_namespace_packages
setup(
    name="my_tiktoken_extension",
    packages=find_namespace_packages(include=['tiktoken_async_ext*']),
    install_requires=["tiktoken_async"],
    ...
)
Then simply pip install ./my_tiktoken_extension and you should be able to use your custom encodings! Make sure not to use an editable install.