
tokun

to-kun took tokens to t-can

Current tokenizers have notorious issues that are bringing all the LLMs down. For example, I could not get ChatGPT to produce a decent catch-phrase (so you're stuck with mine!).

tokun is an NN model specialized in text tokenization. It produces 256-dimensional embedding vectors, each equivalent to 64 UTF-32-BE bytes.

I.e. each tokun embedding can be thought of as a token 16 characters long (64 bytes / 4 bytes per UTF-32 character).

But these vectors are more than opaque IDs: they retain meaningful information about their constituent parts.

The architecture, dataviz, ambition, and results are detailed in the articles listed below.

Installation

For now, the model is only available here, on GitHub.

You can simply clone the repository; among other things, this will download the model weights.

Usage

The model can be loaded from the exported weights:

import os
import tensorflow as tf

import tokun.meta # default values for the hyperparameters
import tokun.model # required to register the Keras classes
import tokun.pipeline # pre and post-processing

# META ########################################################################

ATTENTION = True
NORMALIZATION = True

N_TOKEN_DIM = [4, 4] # G, the size of the token groups, for each block

# DERIVED #####################################################################

LABEL = '8.5'
VERSION = tokun.meta.version(groups=N_TOKEN_DIM, attention=ATTENTION, normalization=NORMALIZATION)

# IMPORT ######################################################################

PATH_IMPORT = os.path.join('models/', *VERSION, '{}.keras'.format(LABEL))

MODEL = tf.keras.models.load_model(PATH_IMPORT)

Since it is small (between 1M and 2M parameters, depending on the exact configuration), the model can also be trained on Google Colab.
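
You can verify the parameter count of a loaded model with the standard Keras summary:

MODEL.summary() # lists the layers and the total number of parameters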

Tokenization

Once the model is loaded, a text string __s can be encoded with:

__x = tokun.pipeline.preprocess(text=__s, groups=N_TOKEN_DIM, flatten=True)
__e = MODEL._encoder(__x) # final embedding = input for another model
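
The resulting tensor holds one embedding per token group. As a quick sanity check (the exact shape depends on the input length and on N_TOKEN_DIM):

print(__e.shape) # one 256-dimensional vector per group of input bytes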

Detokenization

An embedding tensor __e (or prediction) can be reversed into Unicode text with:

__p = MODEL._decoder(__e)
__y = tokun.pipeline.postprocess(__p)
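
Putting both halves together, here is a minimal round trip, as a sketch built only from the calls above (the sample string is arbitrary, and the decoded text may carry trailing padding):

__s = 'tokenization is dead, long live tokenization' # arbitrary sample text
__x = tokun.pipeline.preprocess(text=__s, groups=N_TOKEN_DIM, flatten=True)
__e = MODEL._encoder(__x) # one 256-dimensional embedding per token group
__p = MODEL._decoder(__e) # byte-level predictions
__y = tokun.pipeline.postprocess(__p) # back to a Unicode string
print(__s in __y) # ideally True: the original text, possibly followed by padding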

Resources

Models

The most common variations have been trained and exported to the models subdirectory.

The main variant of the model is tokun-16.

Its hyper-parameters are:

ATTENTION = True # whether the blocks include an attention layer
NORMALIZATION = True # whether the blocks include a normalization layer

N_TOKEN_DIM = [4, 4, 4] # G, the size of the token groups, for each block
N_ENCODING_DIM = 256 # U, the dimension of a single byte as a one-hot vector
N_EMBEDDING_DIM = N_ENCODING_DIM # E, the dimension of each group
N_LATENT_DIM = N_EMBEDDING_DIM # L, the dimension of the resulting embedding
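
These group sizes multiply into the byte span of a single token, which is how tokun-16 gets its name. As a quick check:

import math

bytes_per_token = math.prod(N_TOKEN_DIM) # 4 * 4 * 4 = 64 UTF-32-BE bytes
chars_per_token = bytes_per_token // 4 # 64 / 4 = 16 characters per embedding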

Notebooks

Final model:

Older / simpler model iterations:

Articles

TODO

See TODO.

Credits

This project was inspired by a video from Andrej Karpathy, "Let's build the GPT tokenizer".

License

Licensed under the AGPLv3.
