
tokun

to-kun took tokens to t-can

Current tokenizers have notorious issues that hold all LLMs back. For example, I could not get ChatGPT to produce a decent catch-phrase (so you're stuck with mine!).

tokun is an NN model specialized in text tokenization. It produces 256-dimensional embedding vectors, each in a 1:1 match with 64 UTF-32 bytes.

I.e. each tokun embedding can be thought of as a token 16 characters long, while still keeping meaningful information on its constituent parts.
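
For intuition, the arithmetic behind those figures can be written out in Python; the constants below mirror the hyper-parameters listed under Models:

# one embedding covers G ** D bytes (see the hyper-parameters under Models)
TOKEN_BYTES = 4 ** 3                          # G = 4 bytes per group, D = 3 groupings => 64 bytes
BYTES_PER_CHAR = 4                            # UTF-32 encodes every code point on 4 bytes
TOKEN_CHARS = TOKEN_BYTES // BYTES_PER_CHAR   # 64 // 4 = 16 characters per embedding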

Installation
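
The package is published on PyPI under the name tokun and built for Python 3, so a plain pip install should work:

pip install tokun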

Using HuggingFace

From The Weights
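
A minimal sketch of restoring the model from saved weights, assuming they are exported as a Keras model file; the file name below is hypothetical, substitute the actual asset from the release:

import keras

# hypothetical file name: replace with the weights file shipped with the release
model = keras.models.load_model('tokun-16.keras', compile=False)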

Usage

Tokenization

External

Internal

Fine-Tuning

Resources

Models

The main variant of the model is tokun-16.

Its hyper-parameters are:

ATTENTION = True # whether the blocks include an attention layer
NORMALIZATION = True # whether the blocks include a normalization layer

N_DEPTH = 3 # D, the number of successive token groupings
N_TOKEN_DIM = 4 # G, the size of a single token group, also the factor of compression
N_ENCODING_DIM = 256 # U, the dimension of a single byte as a one-hot vector
N_EMBEDDING_DIM = N_ENCODING_DIM # E, the dimension of each group
N_LATENT_DIM = N_EMBEDDING_DIM # L, the dimension of the resulting embedding
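
To make the shape bookkeeping concrete, here is how a single 64-byte (16-character) token flows through the D successive groupings; this is plain arithmetic on the constants above, not the actual model code:

# a 16-character token starts as 64 one-hot byte vectors of dimension U = 256
shape = (N_TOKEN_DIM ** N_DEPTH, N_ENCODING_DIM)   # (64, 256)
for _ in range(N_DEPTH):
    # each grouping merges G = 4 consecutive vectors into one embedding of dimension E = 256
    shape = (shape[0] // N_TOKEN_DIM, N_EMBEDDING_DIM)
# shape == (1, 256): one embedding of dimension L = 256 per 16-character token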

Notebooks

Articles

TODO

See TODO.

Credits

This project was inspired by a video from Andrej Karpathy, "Let's build the GPT tokenizer".

License

Licensed under the AGPLv3.
