Measure layer-wise token embedding cosine similarity, assessing the severity of embedding condensation. Concept from [ICML 2026] Dispersion loss counteracts embedding condensation and improves generalization in small language models.

LM-Dispersion

This is the author's repository for the ICML 2026 paper
Dispersion loss counteracts embedding condensation and improves generalization in small language models.

The official version is hosted at the Lab GitHub repo.

You are encouraged to read the illustrated walkthrough of the paper on the project website.


A 5-minute intro to this paper

This paper presents an observation-driven improvement to language model training.

We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models. We then design a training objective called dispersion loss to counteract the effect.
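As a rough illustration of how condensation can be quantified, the sketch below computes the mean pairwise cosine similarity among a set of token embeddings. It assumes plain NumPy arrays and synthetic data; the function name `mean_pairwise_cosine` is hypothetical and is not the package's API. Values near 1 indicate embeddings pointing in nearly parallel directions, and values near 0 indicate well-spread embeddings.

```python
import numpy as np


def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all distinct token pairs.

    embeddings: array of shape (num_tokens, dim), one row per token.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sim = unit @ unit.T
    n = sim.shape[0]
    # Exclude the diagonal: each token's similarity with itself is 1.
    off_diag_sum = sim.sum() - np.trace(sim)
    return float(off_diag_sum / (n * (n - 1)))


rng = np.random.default_rng(0)

# Synthetic "dispersed" embeddings: independent random directions.
dispersed = rng.standard_normal((64, 128))

# Synthetic "condensed" embeddings: one shared direction plus small noise.
shared = rng.standard_normal(128)
condensed = shared + 0.1 * rng.standard_normal((64, 128))

dispersed_score = mean_pairwise_cosine(dispersed)
condensed_score = mean_pairwise_cosine(condensed)
print(f"dispersed: {dispersed_score:.3f}, condensed: {condensed_score:.3f}")
```

Applying the same statistic to the hidden states of each layer would give a layer-wise profile, which is the kind of measurement this project describes.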

Feature 1: Larger model, less condensation.
Within the same model family, smaller models exhibit more severe embedding condensation, with token embeddings collapsing toward near-parallel directions, while larger models resist this collapse.

This effect is also robust to the choice of input dataset.

Feature 2: Reproducible when controlling for confounders.
To isolate the effect of model size from other confounding factors, we conduct a controlled experiment where we pre-train GPT2-like models, varying only the MLP dimension while keeping all other components fixed, including the number of layers, embedding dimension, dataset, and training settings. The same phenomenon is observed.
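The controlled setup can be pictured as a family of configurations that differ in a single field. The sketch below uses a hypothetical `TinyGPTConfig` dataclass with illustrative sizes (they are assumptions, not the values used in the paper) and standard parameter-count formulas for a Transformer block, to show that only the MLP parameter count changes across variants.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TinyGPTConfig:
    # Hypothetical config; field names and default values are illustrative only.
    n_layer: int = 12
    n_embd: int = 768
    n_head: int = 12
    mlp_dim: int = 3072


def attention_params(cfg: TinyGPTConfig) -> int:
    d = cfg.n_embd
    # Q, K, V and output projections: each a d x d weight plus a bias of size d.
    return cfg.n_layer * 4 * (d * d + d)


def mlp_params(cfg: TinyGPTConfig) -> int:
    d, m = cfg.n_embd, cfg.mlp_dim
    # Up-projection (d x m weight, m bias) and down-projection (m x d weight, d bias).
    return cfg.n_layer * (d * m + m + m * d + d)


base = TinyGPTConfig()
# Vary only the MLP dimension; every other field is copied unchanged.
variants = [replace(base, mlp_dim=m) for m in (768, 1536, 3072)]

for cfg in variants:
    print(cfg.mlp_dim, attention_params(cfg), mlp_params(cfg))
```

Because `replace` copies every other field, any difference in behavior between variants can be attributed to the MLP width rather than to depth or embedding size.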

Feature 3: Condensation occurs early on.
The embedding condensation phenomenon emerges at model initialization and is gradually mitigated, not exacerbated, by pre-training.

Feature 4: Distillation is not a solution.
Knowledge distillation from a larger model does not transfer the desired resistance to embedding condensation.

Dispersion loss
Embedding condensation reduces the expressivity of Transformers by collapsing token embedding vectors into narrow cones, under-utilizing the representation space. We hypothesize that by dispersing embeddings during training, smaller models can achieve representational qualities more similar to larger models, thus narrowing the performance gap without increasing the number of parameters.

Our dispersion loss is inspired by the "Diffuse and Disperse" paper with practical modifications.
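To give a sense of what a dispersion-style objective looks like, the sketch below implements an InfoNCE-style repulsion term over cosine similarities: the log of the mean of exp(similarity / tau) across distinct token pairs. This is an assumption about the general shape of such a loss, not the exact objective from the paper, which includes practical modifications not described here. The name `dispersive_loss` and the temperature `tau` are hypothetical, and NumPy is used for illustration only; an actual training loop would need a differentiable framework.

```python
import numpy as np


def dispersive_loss(z: np.ndarray, tau: float = 0.5) -> float:
    """InfoNCE-style repulsion: log-mean-exp of pairwise cosine similarity / tau.

    z: array of shape (num_tokens, dim). Lower values mean more dispersed embeddings.
    """
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    unit = z / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    n = sim.shape[0]
    # Keep only distinct pairs (drop self-similarity on the diagonal).
    off_diag = ~np.eye(n, dtype=bool)
    logits = sim[off_diag] / tau
    # Numerically stable log-mean-exp.
    peak = logits.max()
    return float(peak + np.log(np.mean(np.exp(logits - peak))))


rng = np.random.default_rng(0)
dispersed = rng.standard_normal((32, 64))
condensed = rng.standard_normal(64) + 0.1 * rng.standard_normal((32, 64))

loss_dispersed = dispersive_loss(dispersed)
loss_condensed = dispersive_loss(condensed)
print(f"dispersed: {loss_dispersed:.3f}, condensed: {loss_condensed:.3f}")
```

Minimizing a term like this alongside the language modeling loss pushes embeddings apart, since near-parallel pairs contribute the largest values.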

Dispersion loss counteracts the embedding condensation effect during mid-training and pre-training. A qualitative result is shown below, while more quantitative results can be found in the paper.

Disclaimers and future directions

Please see our project website for disclaimers and some future directions we suggest.

[News] PyPI support: embedding condensation

We have packaged the computation and visualization of embedding condensation as a PyPI package!

  1. Install or upgrade the package.
pip install embedding-condensation --upgrade
  2. Use it by passing in a transformers model and tokenizer, as shown in the example below.
  • max_length determines the number of tokens in the context.
  • dataset currently supports [wikipedia, pubmed, imdb, squad].
  • min_word_count and max_word_count guide the text parser when it grabs a random part of the dataset corpus.
  • If you have a specific text corpus, you can pass it in using the texts argument (expected format is Sequence[str]). This bypasses dataset, min_word_count, and max_word_count.
from transformers import AutoModel, AutoTokenizer
from embedding_condensation import measure_embedding_condensation

# Load a pre-trained model in evaluation mode, along with its tokenizer.
model = AutoModel.from_pretrained("gpt2")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")

result = measure_embedding_condensation(
    model,
    tokenizer,
    repetitions=10,
    max_length=512,
    dataset="wikipedia",
    min_word_count=1024,
    max_word_count=1280,
    plot=True,
    show_progress=True,
    save_path="./test_embedding_condensation.png",
)
print(result.cossim_by_layer.shape)
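
For a custom corpus, the texts argument described above can be used instead of dataset. The helper below is a minimal sketch: `measure_on_custom_texts` is a hypothetical wrapper, and it assumes the remaining arguments can be left at their defaults when texts is provided, which has not been verified against the package.

```python
from typing import Sequence


def measure_on_custom_texts(model_name: str, texts: Sequence[str]):
    """Hypothetical helper: measure condensation on user-supplied texts."""
    # Imports live inside the function so that defining it has no heavy dependencies.
    from transformers import AutoModel, AutoTokenizer
    from embedding_condensation import measure_embedding_condensation

    model = AutoModel.from_pretrained(model_name)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Passing texts bypasses dataset, min_word_count and max_word_count.
    return measure_embedding_condensation(
        model,
        tokenizer,
        texts=texts,
        max_length=512,
        plot=False,
    )
```

A call such as measure_on_custom_texts("gpt2", my_texts), where my_texts is a list of strings, would then return the same kind of result object as the example above.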

Citation

@inproceedings{liu2026dispersion,
  title={Dispersion loss counteracts embedding condensation and improves generalization in small language models},
  author={Liu, Chen and Sun, Xingzhi and Xiao, Xi and Van Tassel, Alexandre and Xu, Ke and Reimann, Kristof and Liao, Danqi and Gerstein, Mark and Wang, Tianyang and Wang, Xiao and Krishnaswamy, Smita},
  booktitle={International conference on machine learning},
  year={2026},
  organization={PMLR}
}

Acknowledgements

  1. This work was initially motivated by the paper "A mathematical perspective on Transformers". We started this project in early April 2025, after watching a talk on that paper.
  2. The design of the dispersion loss was largely inspired by Runqian and Kaiming's paper "Diffuse and Disperse: Image Generation with Representation Regularization".
