tokun
to-kun took tokens to t-can
Current tokenizers have notorious issues that are bringing all the LLMs down. For example, I could not get ChatGPT to produce a decent catch-phrase (so you're stuck with mine!).
tokun is an NN model specialized in text tokenization.
It produces 256-dimensional embedding vectors with a 1:1 match to 64 UTF-32 bytes, i.e. each tokun embedding can be thought of as a token 16 characters long.
Yet these vectors retain meaningful information about their constituent parts.
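The byte arithmetic is easy to verify in plain Python (standard library only; picking the big-endian codec here is my choice, to avoid the byte-order mark Python prepends for bare "utf-32"):

text = "hello world 🌍!!!"                                # 16 characters, emoji included
data = text.encode("utf-32-be")                          # UTF-32 is fixed-width: 4 bytes per character
print(len(text), "characters ->", len(data), "bytes")    # 16 characters -> 64 bytes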
Installation
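Assuming the package name matches this PyPI listing, a standard pip install tokun should pull in the library; the subsections below presumably correspond to loading the weights via Hugging Face or from the raw weight files.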
Using HuggingFace
From The Weights
Usage
Tokenization
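As a rough illustration only (the function below and its names are hypothetical, not tokun's actual API; it merely follows the description above: UTF-32 bytes, one-hot encoded over 256 values, grouped 64 bytes per token, with null-byte padding assumed):

import numpy as np

def preprocess(text: str, group: int = 4, depth: int = 3) -> np.ndarray:
    data = list(text.encode("utf-32-be"))         # 4 bytes per character
    width = group ** depth                        # 4^3 = 64 bytes per embedding
    data += [0] * (-len(data) % width)            # pad to a whole number of tokens (null bytes assumed)
    onehot = np.eye(256, dtype=np.float32)[data]  # one byte -> one one-hot row of dimension 256
    return onehot.reshape(-1, width, 256)         # (tokens, 64 bytes, 256)

print(preprocess("tokun took tokens to t-can!").shape)    # (2, 64, 256): two 16-character tokens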
External
Internal
Fine-Tuning
Resources
Models
The main variant of the model is tokun-16. Its hyper-parameters are:
ATTENTION = True # whether the blocks include an attention layer
NORMALIZATION = True # whether the blocks include a normalization layer
N_DEPTH = 3 # D, the number of successive token groupings
N_TOKEN_DIM = 4 # G, the size of a single token group, also the factor of compression
N_ENCODING_DIM = 256 # U, the dimension of a single byte as a one-hot vector
N_EMBEDDING_DIM = N_ENCODING_DIM # E, the dimension of each group
N_LATENT_DIM = N_EMBEDDING_DIM # L, the dimension of the resulting embedding
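These values tie back to the 16-character tokens from the introduction: the overall compression is just G compounded D times (reusing the constants defined above):

bytes_per_embedding = N_TOKEN_DIM ** N_DEPTH      # 4^3 = 64 UTF-32 bytes per embedding
chars_per_embedding = bytes_per_embedding // 4    # 4 bytes per UTF-32 character
print(bytes_per_embedding, chars_per_embedding)   # 64 16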
Notebooks
tokun-1: File / Colab / Kaggle / Hugging Face
tokun-4: File / Colab / Kaggle / Hugging Face
tokun-16: File / Colab / Kaggle / Hugging Face
Articles
TODO
See TODO.
Credits
This project was inspired by a video from Andrej Karpathy, "Let's build the GPT tokenizer".
License
Licensed under the AGPLv3.