tokun
to-kun
took tokens to t-can
Current tokenizers have notorious issues that drag all LLMs down. For example, I could not get ChatGPT to produce a decent catch-phrase (so you're stuck with mine!).
tokun is a NN model specialized in text tokenization.
It produces 256-dimensional embedding vectors, each equivalent to 64 UTF-32-BE bytes.
I.e. each tokun embedding can be thought of as a token 16 characters long.
But these vectors are more than opaque IDs: they keep meaningful information on their constituent parts.
The architecture, dataviz, ambition and results are detailed in the articles.
Features
The model produces vector embeddings that can be directly ingested by another model.
Regular tokens are unrelated IDs, while tokun embeddings have the following properties:
- international: tokun performs evenly across the whole Unicode space
- compression: the sequence length is divided by 16
- embeddings: the output vectors have a dimension of only 256
- lossless: embeddings store all the information, down to the byte level
- built-ins: Unicode has built-in special tokens, no need for <|im_start|>
- meaningful: embeddings are natively related to each other through their parts
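The "built-ins" point refers to the control characters that Unicode already reserves. A minimal sketch of using them as sequence delimiters (the choice of STX/ETX here is illustrative, not something tokun prescribes):

```python
# Unicode reserves control characters that can play the role of special tokens,
# so markers like <|im_start|> need no dedicated vocabulary entries.
STX = '\u0002'  # start of text
ETX = '\u0003'  # end of text

sample = STX + 'Hello world' + ETX
# The delimiters encode like any other character: 4 UTF-32-BE bytes each.
assert len(sample.encode('utf-32-be')) == 4 * len(sample)
```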
Installation
For now, the model is only available here, on GitHub.
You can simply clone the repository; in particular, cloning will download the exported weights.
Usage
The model can be loaded from the exported weights:
import os
import tensorflow as tf
import tokun.meta # default values for the hyper parameters
import tokun.model # required to register the Keras classes
import tokun.pipeline # pre and post-processing
# META ########################################################################
ATTENTION = True
NORMALIZATION = True
N_TOKEN_DIM = [4, 4] # G, for each block
# DERIVED #####################################################################
LABEL = '8.5'
VERSION = tokun.meta.version(groups=N_TOKEN_DIM, attention=ATTENTION, normalization=NORMALIZATION)
# IMPORT ######################################################################
PATH_IMPORT = os.path.join('models/', *VERSION, '{}.keras'.format(LABEL))
MODEL = tf.keras.models.load_model(PATH_IMPORT)
Since it is small (between 1 and 2M parameters, depending on the exact configuration), the model can also be trained on Google Colab.
Tokenization
Once the model is loaded, a text string __s can be encoded with:
__x = tokun.pipeline.preprocess(text=__s, groups=N_TOKEN_DIM, flatten=True)
__e = MODEL._encoder(__x) # final embedding = input for another model
Detokenization
An embedding tensor __e (or prediction) can be reversed into Unicode text with:
__p = MODEL._decoder(__e)
__y = tokun.pipeline.postprocess(__p)
Resources
Models
The most common variations have been trained and exported to the models subdirectory.
The main variant of the model is tokun-16.
Its hyper-parameters are:
ATTENTION = True # whether the blocks include an attention layer
NORMALIZATION = True # whether the blocks include a normalization layer
N_TOKEN_DIM = [4, 4, 4] # G, the size of the token groups, for each block
N_ENCODING_DIM = 256 # U, the dimension of a single byte as a one-hot vector
N_EMBEDDING_DIM = N_ENCODING_DIM # E, the dimension of each group
N_LATENT_DIM = N_EMBEDDING_DIM # L, the dimension of the resulting embedding
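As a sanity check on these hyper-parameters: the overall span of one embedding is the product of the group sizes (in UTF-32-BE bytes), which converts to characters at 4 bytes apiece. A back-of-the-envelope calculation, using only the values listed above:

```python
import math

N_TOKEN_DIM = [4, 4, 4]  # G, the size of the token groups, for each block

# Each block compresses its input by its group size, so the total span
# of one embedding is the product of the group sizes, in UTF-32-BE bytes.
bytes_per_token = math.prod(N_TOKEN_DIM)
assert bytes_per_token == 64

# At 4 bytes per character, that is 16 characters: hence the name tokun-16.
chars_per_token = bytes_per_token // 4
assert chars_per_token == 16
```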
Notebooks
Final model:
Older / simpler model iterations:
Articles
TODO
See TODO.
Credits
This project was inspired by a video from Andrej Karpathy, "Let's build the GPT tokenizer".
License
Licensed under the AGPLv3.