tokun
to-kun took tokens to t-can
Current tokenizers have notorious issues that are bringing all the LLMs down. For example, I could not get ChatGPT to produce a decent catch-phrase (so you're stuck with mine!).
tokun is an NN model specialized in text tokenization.
It produces 256-dimensional embedding vectors with a 1:1 match to 64 UTF-32 bytes, i.e. each tokun embedding can be thought of as a token 16 characters long.
Yet these vectors retain meaningful information about their constituent parts.
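The byte arithmetic is easy to verify in plain Python (standard library only; picking the big-endian codec here is my choice, to avoid the byte-order mark Python prepends for bare "utf-32"):

text = "hello world 🌍!!!"                                # 16 characters, emoji included
data = text.encode("utf-32-be")                          # UTF-32 is fixed-width: 4 bytes per character
print(len(text), "characters ->", len(data), "bytes")    # 16 characters -> 64 bytes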
Installation
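Assuming the package name matches this PyPI listing, a standard pip install tokun should pull in the library; the subsections below presumably correspond to loading the weights via Hugging Face or from the raw weight files.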
Using HuggingFace
From The Weights
Usage
Tokenization
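As a rough illustration only (the function below and its names are hypothetical, not tokun's actual API; it merely follows the description above: UTF-32 bytes, one-hot encoded over 256 values, grouped 64 bytes per token, with null-byte padding assumed):

import numpy as np

def preprocess(text: str, group: int = 4, depth: int = 3) -> np.ndarray:
    data = list(text.encode("utf-32-be"))         # 4 bytes per character
    width = group ** depth                        # 4^3 = 64 bytes per embedding
    data += [0] * (-len(data) % width)            # pad to a whole number of tokens (null bytes assumed)
    onehot = np.eye(256, dtype=np.float32)[data]  # one byte -> one one-hot row of dimension 256
    return onehot.reshape(-1, width, 256)         # (tokens, 64 bytes, 256)

print(preprocess("tokun took tokens to t-can!").shape)    # (2, 64, 256): two 16-character tokens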
External
Internal
Fine-Tuning
Resources
Models
The main variant of the model is tokun-16. Its hyper-parameters are:
ATTENTION = True # whether the blocks include an attention layer
NORMALIZATION = True # whether the blocks include a normalization layer
N_DEPTH = 3 # D, the number of successive token groupings
N_TOKEN_DIM = 4 # G, the size of a single token group, also the factor of compression
N_ENCODING_DIM = 256 # U, the dimension of a single byte as a one-hot vector
N_EMBEDDING_DIM = N_ENCODING_DIM # E, the dimension of each group
N_LATENT_DIM = N_EMBEDDING_DIM # L, the dimension of the resulting embedding
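These values tie back to the 16-character tokens from the introduction: the overall compression is just G compounded D times (reusing the constants defined above):

bytes_per_embedding = N_TOKEN_DIM ** N_DEPTH      # 4^3 = 64 UTF-32 bytes per embedding
chars_per_embedding = bytes_per_embedding // 4    # 4 bytes per UTF-32 character
print(bytes_per_embedding, chars_per_embedding)   # 64 16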
Notebooks
tokun-1: File / Colab / Kaggle / Hugging Face
tokun-4: File / Colab / Kaggle / Hugging Face
tokun-16: File / Colab / Kaggle / Hugging Face
Articles
TODO
See TODO.
Credits
This project was inspired by a video from Andrej Karpathy, "Let's build the GPT tokenizer".
License
Licensed under the AGPLv3.