
Measure layer-wise token embedding cosine similarity, assessing the severity of embedding condensation. Concept from [ICML 2026] Dispersion loss counteracts embedding condensation and improves generalization in small language models.


LM-Dispersion


This is the author's repository for the ICML 2026 paper
Dispersion loss counteracts embedding condensation and improves generalization in small language models.

The official version is hosted at the Lab GitHub repo.

You are encouraged to read the illustrated walkthrough of the paper on the project website.


A 5-minute intro to this paper

This paper presents an observation-driven improvement to language model training.

We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models. We then design a training objective called dispersion loss to counteract the effect.

Feature 1: Larger model, less condensation.
Within the same model family, smaller models exhibit more severe embedding condensation, with token embeddings collapsing toward near-parallel directions, while larger models resist this collapse.
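As a rough illustration of how "near-parallel directions" can be quantified, here is a minimal sketch that computes the mean pairwise cosine similarity of one layer's token embeddings. The function name `mean_pairwise_cosine` and the synthetic data are assumptions made for illustration; the released package may compute its statistic differently.

```python
import numpy as np


def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all distinct token pairs.

    embeddings: array of shape (num_tokens, hidden_dim) for a single layer.
    This is a hypothetical helper, not the package's implementation.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    cossim = unit @ unit.T                           # (num_tokens, num_tokens)
    n = cossim.shape[0]
    off_diagonal = cossim[~np.eye(n, dtype=bool)]    # drop self-similarities
    return float(off_diagonal.mean())


rng = np.random.default_rng(0)
hidden_dim, num_tokens = 64, 32

# Dispersed: independent Gaussian directions, nearly orthogonal on average.
dispersed = rng.standard_normal((num_tokens, hidden_dim))

# Condensed: one shared direction plus small per-token noise (a narrow cone).
shared = rng.standard_normal(hidden_dim)
condensed = shared + 0.1 * rng.standard_normal((num_tokens, hidden_dim))

print(f"dispersed: {mean_pairwise_cosine(dispersed):.3f}")
print(f"condensed: {mean_pairwise_cosine(condensed):.3f}")
```

A value near 1 indicates that the token embeddings point in almost the same direction, while a value near 0 indicates that they are spread out. With a transformers model, per-layer inputs for such a function could come from the `hidden_states` returned when the model is called with `output_hidden_states=True`.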

This effect is also robust to the choice of input dataset.

Feature 2: Reproducible when controlling for confounders.
To isolate the effect of model size from other confounding factors, we conduct a controlled experiment where we pre-train GPT2-like models, varying only the MLP dimension while keeping all other components fixed, including the number of layers, embedding dimension, dataset, and training settings. The same phenomenon is observed.
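The controlled setup can be pictured as a sweep over a single field of the model configuration. The sketch below is an illustration only: the class `GPT2LikeConfig`, its defaults (taken from the standard GPT-2 small architecture), and the swept MLP widths are assumptions, not the settings used in the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GPT2LikeConfig:
    # Hypothetical config; defaults mirror the standard GPT-2 small layout.
    n_layer: int = 12
    n_embd: int = 768
    mlp_dim: int = 3072
    vocab_size: int = 50257
    n_positions: int = 1024


def count_parameters(cfg: GPT2LikeConfig) -> int:
    """Parameter count of a GPT-2 style model with tied output embeddings."""
    d, f = cfg.n_embd, cfg.mlp_dim
    embeddings = cfg.vocab_size * d + cfg.n_positions * d  # token + position
    attention = (d * 3 * d + 3 * d) + (d * d + d)          # QKV + output projection
    mlp = (d * f + f) + (f * d + d)                        # up + down projection
    layer_norms = 2 * 2 * d                                # two LayerNorms per block
    final_layer_norm = 2 * d
    per_block = attention + mlp + layer_norms
    return embeddings + cfg.n_layer * per_block + final_layer_norm


base = GPT2LikeConfig()
# Vary only the MLP dimension; every other field stays fixed.
sweep = [replace(base, mlp_dim=m) for m in (768, 1536, 3072, 6144)]
for cfg in sweep:
    print(cfg.mlp_dim, count_parameters(cfg))
```

Because only `mlp_dim` changes, any difference in condensation across the sweep can be attributed to model size through the MLP width rather than to depth, embedding dimension, or data.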

Feature 3: Condensation occurs early on.
The embedding condensation phenomenon emerges at model initialization and is gradually mitigated, not exacerbated, by pre-training.

Feature 4: Distillation is not a solution.
Knowledge distillation from a larger model does not transfer the desired resistance to embedding condensation.

Dispersion loss
Embedding condensation reduces the expressivity of Transformers by collapsing token embedding vectors into narrow cones, under-utilizing the representation space. We hypothesize that by dispersing embeddings during training, smaller models can achieve representational qualities more similar to larger models, thus narrowing the performance gap without increasing the number of parameters.

Our dispersion loss is inspired by the "Diffuse and Disperse" paper with practical modifications.
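To give a feel for what a dispersion-style regularizer does, here is a minimal sketch of a generic repulsion term: the log-mean-exp of pairwise cosine similarities among token embeddings. This is not the loss used in the paper, whose practical modifications are not described here; the function name, the choice of cosine similarity, and the temperature value are all assumptions.

```python
import numpy as np


def dispersion_loss(embeddings: np.ndarray, temperature: float = 0.5) -> float:
    """Generic repulsion term: log-mean-exp of pairwise cosine similarity.

    embeddings: array of shape (num_tokens, hidden_dim).
    Illustrative only; not the paper's exact objective.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    cossim = unit @ unit.T
    n = cossim.shape[0]
    pairs = cossim[~np.eye(n, dtype=bool)] / temperature  # distinct pairs only
    # Numerically stable log(mean(exp(pairs))).
    peak = pairs.max()
    return float(peak + np.log(np.mean(np.exp(pairs - peak))))


# Four identical vectors (fully condensed) versus four orthogonal vectors.
collapsed = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), (4, 1))
orthogonal = np.eye(4)

print("collapsed: ", dispersion_loss(collapsed))
print("orthogonal:", dispersion_loss(orthogonal))
```

The term is larger when embeddings are near-parallel and smaller when they are spread out, so adding it to the language modeling loss with a small weight would push token embeddings apart during training.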

Dispersion loss counteracts the embedding condensation effect during mid-training and pre-training. A qualitative result is shown below, while more quantitative results can be found in the paper.

Disclaimers and future directions

Please see our project website for disclaimers and some future directions we suggest.

[New] PyPI support: embedding condensation

We have packaged the computation and visualization of embedding condensation as a PyPI package!

  1. Install or upgrade the package.
pip install embedding-condensation --upgrade
  2. Use it by passing in a transformers model and tokenizer, as shown in the example below.
  • max_length determines the number of tokens in the context.
  • dataset currently supports [wikipedia, pubmed, imdb, squad].
  • min_word_count and max_word_count guide the text parser when it grabs a random passage from the dataset corpus.
  • If you have a specific text corpus, you can pass it in using the texts argument (expected format is Sequence[str]). This bypasses dataset, min_word_count and max_word_count.
from transformers import AutoModel, AutoTokenizer
from embedding_condensation import measure_embedding_condensation

# Load any Hugging Face transformers model and its matching tokenizer.
model = AutoModel.from_pretrained("gpt2")
model.eval()  # inference mode: disables dropout
tokenizer = AutoTokenizer.from_pretrained("gpt2")

result = measure_embedding_condensation(
    model,
    tokenizer,
    repetitions=10,
    max_length=512,         # number of tokens in the context
    dataset="wikipedia",    # one of: wikipedia, pubmed, imdb, squad
    min_word_count=1024,    # word-count bounds used by the text parser
    max_word_count=1280,
    plot=True,
    show_progress=True,
    save_path="./test_embedding_condensation.png",
)
print(result.cossim_by_layer.shape)
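For a custom corpus, the texts argument described above can be used in place of dataset. The sketch below is a hedged illustration: `split_into_passages` and `measure_on_custom_texts` are hypothetical helpers, and whether the remaining keyword arguments have defaults is an assumption (if they do not, pass them explicitly as in the example above).

```python
from typing import List, Sequence


def split_into_passages(document: str, words_per_passage: int) -> List[str]:
    """Split one long string into consecutive passages of at most N words."""
    words = document.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


def measure_on_custom_texts(texts: Sequence[str]):
    # Imports are local so the helper above works without these packages.
    from transformers import AutoModel, AutoTokenizer
    from embedding_condensation import measure_embedding_condensation

    model = AutoModel.from_pretrained("gpt2")
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # Per the description above, texts bypasses dataset, min_word_count
    # and max_word_count. Omitting the other keywords assumes they have defaults.
    return measure_embedding_condensation(
        model,
        tokenizer,
        texts=texts,
        max_length=512,
    )


passages = split_into_passages("one two three four five six seven", 3)
print(passages)
```

Calling `measure_on_custom_texts(passages)` would then run the measurement on your own passages; it downloads the gpt2 weights on first use.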

Citation

@inproceedings{liu2026dispersion,
  title={Dispersion loss counteracts embedding condensation and improves generalization in small language models},
  author={Liu, Chen and Sun, Xingzhi and Xiao, Xi and Van Tassel, Alexandre and Xu, Ke and Reimann, Kristof and Liao, Danqi and Gerstein, Mark and Wang, Tianyang and Wang, Xiao and Krishnaswamy, Smita},
  booktitle={International Conference on Machine Learning},
  year={2026},
  organization={PMLR}
}

Acknowledgements

  1. This work was initially motivated by the paper "A mathematical perspective on Transformers". We started this project in early April 2025, after watching a talk on that paper.
  2. The design of the dispersion loss was largely inspired by Runqian and Kaiming's paper "Diffuse and Disperse: Image Generation with Representation Regularization".
