Byte-level structured token representations for parameter-efficient language models. Reference implementation.
Project description
Kronecker Embeddings
Byte-level structured token representations for parameter-efficient language models. Reference implementation.
Status: Accompanying paper submitted to arXiv; ID in process. The badge and citation will be updated once the ID is assigned.
A deterministic byte-level factorization that replaces nn.Embedding with a
fixed codec + a single trainable projection. Parameters scale with the
codec output dimension D = char_dim × pos_dim, not with vocabulary
size. At the paper's 124M GPT-2 setting (V=50,257, d_model=768, D=4096
via pos_dim=16), the stock nn.Embedding(50,257, 768) has 38.6 M
trainable params; the Kronecker projection Linear(4096, 768) has 3.1 M
— a ~12.3× input-side reduction (~91% saving), matching paper §6.9
Table 11. The saving grows with vocab and flattens with D, so it grows
as you scale up.
Quickstart
pip install kronecker-embeddings
import torch
from transformers import AutoTokenizer
from kronecker_embeddings import KroneckerEmbedding
tok = AutoTokenizer.from_pretrained("gpt2")
emb = KroneckerEmbedding(
vocab_size=tok.vocab_size,
d_model=512,
tokenizer=tok, # buffers built once at construction
pos_dim=16, # D = char_dim * pos_dim = 256 * 16 = 4096 (paper §6.9 setting)
)
ids = tok("hello world", return_tensors="pt").input_ids
out = emb(ids) # (1, L, 512) — matches nn.Embedding.forward
That's it. Use emb anywhere you'd use nn.Embedding.
What this is
Each token has a UTF-8 byte sequence — "hello" is 5 bytes, "नमस्ते" is 18.
The Kronecker codec maps that byte sequence to a fixed D = char_dim × pos_dim
dimensional vector deterministically (no learned parameters):
kappa(b) = (1/√L) · vec( Σ_{p=1..L} c_{b_p} ⊗ p_p )
where c_v is the v-th basis vector in R^256 (one-hot for byte value v) and
p_p is the p-th basis vector in R^pos_dim (one-hot for byte position p).
The result is z-normalized per-token, then passed through a single learned
linear projection Linear(D, d_model).
Architecture. Fixed codec + one nn.Linear(D, d_model, bias=False),
exactly as described in paper §3. No GELU, no hidden layer, no residual,
no norm beyond the codec's z-norm.
Note: paper §3.4 mentions an optional self-entropy regularizer on the projection's row distribution, used in the 138M companion run (paper §6.7). It is a production add-on, not part of the canonical method, and is not shipped in this reference package.
The codec is content-aware by byte: tokens that share UTF-8 byte prefixes share codec output positions, giving the embedding a built-in orthographic structure (the §6.1–6.4 result). Tokenizers that fragment language at the BPE level (Hindi, code, rare languages) get the same byte locality as English at no extra vocab cost.
Why use this
- Parameter efficiency. Embedding params scale with
D(pos_dim=16→ D=4096 in the paper's 124M experiments;pos_dim=32→ D=8192 in production), not vocab size. The bigger your tokenizer, the bigger the win. - Orthographic structure built in. Spelling variants, plural/singular
pairs, and case variants land near each other in input-embedding space
by construction. See
probes/cross_model_probe.pyfor the full §6.1–6.4 numbers across 6 public LMs. - Forced-OOV at inference. Any byte sequence — including strings that
don't have a single vocab token — produces a valid embedding. See
examples/03_inference_with_oov.py. - Drop-in compatibility. Same
forward(input_ids) -> (..., d_model)contract asnn.Embedding. State dict round-trips with a singleprojection.weighttensor; the byte buffers are reconstructed from the tokenizer at load time. - Deployment. No vocab-resize migration when adding/removing tokens —
the codec handles arbitrary byte sequences. The byte buffer is ~0.9 MB
for a 50K-vocab tokenizer at
pos_dim=16, ~4.5 MB for a 131K-vocab tokenizer atpos_dim=32.
Operational variants — mode="dynamic" vs mode="cached"
| Mode | Memory | Forward compute | Use when |
|---|---|---|---|
"dynamic" (default) |
V × (pos_dim + 2) bytes (~0.9 MB at V=50K, pos_dim=16; ~4.5 MB at V=131K production, pos_dim=32) |
~1-4 ms / micro-batch (scatter_add) | Frontier-scale training or memory-tight inference. The default. |
"cached" |
V × D × 4 bytes (gigabytes at frontier scale) |
~0 (just index_select) |
Small vocab (V ≲ 50K) or you have RAM to burn. |
Both produce bit-identical outputs (verified by tests/test_embedding.py).
Switch via the mode= kwarg:
emb = KroneckerEmbedding(..., mode="cached")
Examples
examples/01_minimal_swap.py — drop-in
replacement for nn.Embedding inside a toy encoder.
examples/02_nanogpt_integration.py —
patching nanoGPT's transformer.wte to use Kronecker. (A full training
fork lives at the forthcoming kronecker-nanogpt repo.)
examples/03_inference_with_oov.py —
forced-OOV inference: override a position's bytes with a string that has
no single-token equivalent.
Probes
The four probe scripts from the paper:
probes/cross_model_probe.py— §6.1–6.4 embedding-geometry probe over arbitrary HF models. No checkpoint needed.probes/layered_probe.py— §6.8 layered probe. Requires trained checkpoints (seeprobes/README.md).probes/robustness_probe.py— §6.11 110-prompt spelling-robustness probe.probes/generation_probe.py— §6.12 20-prompt generation probe with forced-OOV substitution.
Reference outputs from the paper's runs are in
probes/data/ — reviewers can inspect them directly without
re-running.
Citation
@article{shravan2026kronecker,
title={Kronecker Embeddings: Byte-Level Structured Token Representations
for Parameter-Efficient Language Models},
author={Shravan, Rohan},
journal={arXiv preprint arXiv:ARXIV_ID_PLACEHOLDER},
year={2026},
url={https://github.com/theschoolofai/kronecker-embeddings}
}
(The arXiv id will be replaced upon acceptance.)
See also
kronecker-nanogpt(forthcoming) — the full Kronecker-vs-stock training fork of karpathy/nanoGPT, including matched 124M checkpoints that drop into the probes here.
License
Apache 2.0 — see LICENSE. Copyright 2026 The School of AI.
Contact
- Issues / PRs: github.com/theschoolofai/kronecker-embeddings/issues
- Email:
rshravan@theschoolofai.in
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file kronecker_embeddings-0.1.0.tar.gz.
File metadata
- Download URL: kronecker_embeddings-0.1.0.tar.gz
- Upload date:
- Size: 28.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8273b966069e3b91b20563d30c94d9e102bae91ab09c7ea42b79b535cc077bd7
|
|
| MD5 |
d2dd9b74e5bae88ec3e5c5a52eac837e
|
|
| BLAKE2b-256 |
9bf2bcdd0581cd1ab0190b38bf6cdafd2ded51164efa00bc90828bae677c9d4c
|
File details
Details for the file kronecker_embeddings-0.1.0-py3-none-any.whl.
File metadata
- Download URL: kronecker_embeddings-0.1.0-py3-none-any.whl
- Upload date:
- Size: 22.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9f3c46ec48eddad2d897a600808b4d6534c31650ca2eb7136464dae629c724f3
|
|
| MD5 |
14bb42f103e20e396c490ee1075e0d26
|
|
| BLAKE2b-256 |
b8f13de00968d65c8ffc4f838fcd7e26c9b052e6c8a24539d1b605adb6c409fc
|