
tiktoken-async is a fast BPE tokeniser for use with OpenAI's models, with added support for asynchronous processing.


⏳ tiktoken-async

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

import asyncio
import tiktoken_async
enc = asyncio.run(tiktoken_async.get_encoding("cl100k_base"))
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = asyncio.run(tiktoken_async.encoding_for_model("gpt-4"))
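Since the calls above are driven with asyncio.run, they can presumably also be awaited from inside your own coroutines. The following is a minimal sketch of loading several encodings concurrently. The helper load_encodings and the stand-in fake_get_encoding are hypothetical names used for illustration. The loader is passed in as a parameter so the sketch runs without tiktoken-async installed or network access.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable


async def load_encodings(
    get_encoding: Callable[[str], Awaitable[Any]],
    names: Iterable[str],
) -> Dict[str, Any]:
    """Await several encoding loads concurrently and map name -> encoding."""
    names = list(names)
    # gather schedules all loads at once and returns results in input order.
    encodings = await asyncio.gather(*(get_encoding(n) for n in names))
    return dict(zip(names, encodings))


# Stand-in loader (hypothetical) so the sketch runs without tiktoken_async.
async def fake_get_encoding(name: str) -> str:
    await asyncio.sleep(0)
    return f"<encoding {name}>"


loaded = asyncio.run(load_encodings(fake_get_encoding, ["cl100k_base", "gpt2"]))
print(loaded)
```

With the real library, passing tiktoken_async.get_encoding in place of fake_get_encoding should behave the same way, assuming it is a coroutine function as the examples above suggest.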

The open source version of tiktoken-async can be installed from PyPI:

pip install tiktoken-async

The tokeniser API is documented in tiktoken_async/core.py.

Example code using tiktoken can be found in the OpenAI Cookbook.

Performance

tiktoken is between 3x and 6x faster than a comparable open source tokeniser:

[Image: performance benchmark chart]

Performance measured on 1GB of text using the GPT-2 tokeniser, using GPT2TokenizerFast from tokenizers==0.13.2, transformers==4.24.0 and tiktoken==0.2.0.

Getting help

Please post questions in the issue tracker.

If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.

Extending tiktoken

You may wish to extend tiktoken-async to support new encodings. There are two ways to do this.

1. Create your Encoding object exactly the way you want and simply pass it around.

import asyncio
import tiktoken_async

cl100k_base = asyncio.run(tiktoken_async.get_encoding("cl100k_base"))

# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken_async.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)
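When adding special tokens as above, it can help to check that the new names and token IDs do not clash with existing ones. The following is a minimal sketch of such a check, not part of the library. The helper extend_special_tokens is a hypothetical name, the token values are stand-ins for illustration, and the sketch assumes the mergeable ranks occupy the contiguous IDs 0 to n_mergeable - 1.

```python
from typing import Dict


def extend_special_tokens(
    base: Dict[str, int],
    extra: Dict[str, int],
    n_mergeable: int,
) -> Dict[str, int]:
    """Merge two special-token maps, rejecting clashing names or token IDs."""
    merged = dict(base)
    used_ids = set(base.values())
    for token, token_id in extra.items():
        if token in merged:
            raise ValueError(f"special token {token!r} is already defined")
        # Assumption: IDs below n_mergeable belong to ordinary (mergeable) tokens.
        if token_id < n_mergeable or token_id in used_ids:
            raise ValueError(f"token id {token_id} is already in use")
        merged[token] = token_id
        used_ids.add(token_id)
    return merged


# Stand-in values for illustration only; real values come from the base encoding.
base_special = {"<|endoftext|>": 100257}
special_tokens = extend_special_tokens(
    base_special,
    {"<|im_start|>": 100264, "<|im_end|>": 100265},
    n_mergeable=100256,
)
print(special_tokens)
```

The resulting dictionary could then be passed as the special_tokens argument shown above.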

2. Use the tiktoken_async_ext plugin mechanism to register your Encoding objects with tiktoken_async.

This is only useful if you need tiktoken_async.get_encoding to find your encoding, otherwise prefer option 1.

To do this, you'll need to create a namespace package under tiktoken_async_ext.

Lay out your project like this, making sure to omit the tiktoken_async_ext/__init__.py file:

my_tiktoken_extension
├── tiktoken_async_ext
│   └── my_encodings.py
└── setup.py

my_encodings.py should be a module that contains a variable named ENCODING_CONSTRUCTORS. This is a dictionary from an encoding name to a function that takes no arguments and returns arguments that can be passed to tiktoken_async.Encoding to construct that encoding. For an example, see tiktoken_async_ext/openai_public.py. For precise details, see tiktoken_async/registry.py.
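As a rough sketch of what my_encodings.py might contain, the module below defines one toy encoding. The keyword names mirror the tiktoken_async.Encoding call shown earlier. The encoding name, the split pattern and the byte-level ranks are hypothetical values chosen for illustration, not ones used by any real model.

```python
# Hypothetical contents of tiktoken_async_ext/my_encodings.py


def my_toy_encoding():
    """Return keyword arguments for a toy byte-level encoding (no merges)."""
    # Every single byte gets its own rank, so any input can be represented.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_toy_encoding",
        # Simplified split pattern (an assumption, not a real model's pattern).
        "pat_str": r"\S+|\s+",
        "mergeable_ranks": mergeable_ranks,
        # Special token IDs must not collide with the mergeable ranks.
        "special_tokens": {"<|endoftext|>": 256},
    }


# Maps each encoding name to a zero-argument constructor function.
ENCODING_CONSTRUCTORS = {"my_toy_encoding": my_toy_encoding}
```

Each constructor is a plain function rather than a prebuilt dictionary, so the arguments are only built when the encoding is actually requested.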

Your setup.py should look something like this:

from setuptools import setup, find_namespace_packages

setup(
    name="my_tiktoken_extension",
    packages=find_namespace_packages(include=['tiktoken_async_ext*']),
    install_requires=["tiktoken_async"],
    ...
)

Then simply pip install ./my_tiktoken_extension and you should be able to use your custom encodings! Make sure not to use an editable install.
