NN model specialized in text tokenization.
Project description
tokun
to-kun
took tokens to t-can
Current tokenizers have notorious issues that are bringing all the LLMs down. For example I could not get ChatGPT to produce a decent catch-phrase (so you're stuck with mine!).
tokun
is a NN model specialized in text tokenization.
It produces 256-embedding vectors equivalent to 64 UTF-32-BE bytes.
IE each tokun
embedding can be thought of as a token of length 16 characters.
But these vectors are more than basic IDs, they keep meaningful information on their constituting parts.
The architecture, dataviz, ambition and results are detailed in the articles.
Installation
For now, the model is only available here, on Github.
You can simply clone the repository: in particular, it will download the weights.
Usage
The model can be loaded from the exported weights:
import os
import tensorflow as tf
import tokun.meta # default values for the hyper parameters
import tokun.model # required to register the Keras classes
import tokun.pipeline # pre and post-processing
# META ########################################################################
ATTENTION = True
NORMALIZATION = True
N_TOKEN_DIM = [4, 4] # G, for each block
# DERIVED #####################################################################
LABEL = '8.5'
VERSION = tokun.meta.version(groups=N_TOKEN_DIM, attention=ATTENTION, normalization=NORMALIZATION)
# IMPORT ######################################################################
PATH_IMPORT = os.path.join('models/', *VERSION, '{}.keras'.format(LABEL))
MODEL = tf.keras.models.load_model(PATH_IMPORT)
Since it is small (between 1 and 2M parameters depending on the exact configuration), the model can also be trained on Google Colab.
Tokenization
Once the model loaded, a text strings __s
can be encoded with:
__x = tokun.pipeline.preprocess(text=__s, groups=N_TOKEN_DIM, flatten=True)
__e = MODEL._encoder(__x) # final embedding = input for another model
Detokenization
An embedding tensor __e
(or prediction) can be reversed into Unicode text with:
__p = MODEL._decoder(__e)
__y = tokun.pipeline.postprocess(__p)
Resources
Models
The most common variations have been trained and exported to the models subdirectory.
The main variant of the model is tokun-16
.
Its hyper-parameters are:
ATTENTION = True # whether the blocks include an attention layer
NORMALIZATION = True # whether the blocks include a normalization layer
N_TOKEN_DIM = [4, 4, 4] # G, the size of the token groups, for each block
N_ENCODING_DIM = 256 # U, then dimension of a single byte as a one-hot vector
N_EMBEDDING_DIM = N_ENCODING_DIM # E, the dimension of each group
N_LATENT_DIM = N_EMBEDDING_DIM # L, the dimension of the resulting embedding
Notebooks
Final model:
Older / simpler model iterations:
Articles
TODO
See TODO.
Credits
This project was inspired by a video from Andrej Karpathy, "Let's build the GPT tokenizer".
License
Licensed under the aGPLv3.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.