UNTOKEN

Token compression for LLM prompts via a learned token selector.

UNTOKEN is an experimental architecture demonstrating adversarial autoencoder-based token importance scoring. Given N tokens, it returns a subsequence of ~0.3N tokens. The model shipped here (pacifio/untoken-v1) is trained at small scale as a proof of concept — the architecture is the contribution, not the weights.

Install

pip install untoken

Requires Python 3.10+ and PyTorch 2.1+. Works on CPU and GPU.

Context Window

The model processes up to 480 tokens per chunk (DistilBERT's 512-token limit minus special tokens). At ~20 tokens per average English sentence, that is roughly 20–24 sentences per chunk. Longer inputs are automatically split at sentence boundaries and compressed independently — no truncation occurs.

For best results, keep individual inputs under ~20 sentences. The model was trained at small scale and performs most reliably on short, self-contained passages.

Usage

from untoken import Untoken

ut = Untoken("pacifio/untoken-v1")

texts = [
    "The quick brown fox jumps over the lazy dog and then runs away into the forest.",
    "Scientists discovered a new species of deep-sea fish off the coast of Japan.",
    "The meeting was postponed due to a scheduling conflict with the board of directors.",
    "She completed the marathon in under four hours despite the difficult weather conditions.",
    "The server returned a 503 error after the deployment failed during the migration step.",
]

for text in texts:
    compressed, stats = ut.compress(text, ratio=0.4, return_stats=True)
    print(f"{text[:50]!r}...")
    print(f"  -> {compressed!r}")
    print(f"  -> {stats['original_tokens']}{stats['compressed_tokens']} tokens ({stats['savings_pct']}% savings)\n")

"""
'The quick brown fox jumps over the lazy dog and th'...
  -> 'the quick brown fox jumps over dog'
  -> 19 → 9 tokens (52.6% savings)

'Scientists discovered a new species of deep-sea fi'...
  -> 'scientists discovered a new species of sea'
  -> 18 → 9 tokens (50.0% savings)

'The meeting was postponed due to a scheduling conf'...
  -> 'the meeting was postponed due scheduling'
  -> 17 → 8 tokens (52.9% savings)

'She completed the marathon in under four hours des'...
  -> 'she completed the marathon in hours'
  -> 16 → 8 tokens (50.0% savings)

'The server returned a 503 error after the deployme'...
  -> 'the server returned a 503 the'
  -> 18 → 9 tokens (50.0% savings)
"""

Note on v1 weights: The current model was trained on a small dataset and exhibits a known failure mode — it assigns high importance to frequent function words (determiners, auxiliaries) rather than content words. This is a training data scale issue, not an architectural one. The v1 checkpoint demonstrates that the full pipeline runs end-to-end. Improving selection quality requires more training data and longer adversarial fine-tuning.

Adjustable Ratio

compressed = ut.compress(text, ratio=0.5)  # keep 50%
compressed = ut.compress(text, ratio=0.2)  # keep 20%

No retraining required — ratio is applied at inference via top-k selection.
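
Conceptually, the selection step is just a top-k over per-token importance scores. A minimal sketch of that idea (illustrative only; select_tokens is a hypothetical helper, not part of the package API):

import torch

def select_tokens(token_ids: torch.Tensor, scores: torch.Tensor, ratio: float) -> torch.Tensor:
    # Keep the top round(ratio * N) tokens by importance score,
    # then restore the original token order.
    # Hypothetical sketch -- not UNTOKEN's actual implementation.
    k = max(1, int(round(ratio * token_ids.numel())))
    keep = torch.topk(scores, k).indices
    keep = torch.sort(keep).values
    return token_ids[keep]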

CLI

untoken --model pacifio/untoken-v1 --input prompt.txt --ratio 0.3

Long Documents

Inputs exceeding 480 tokens are automatically chunked at sentence boundaries.

with open("document.txt") as f:
    text = f.read()

compressed = ut.compress(text, ratio=0.3)
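
The chunking happens inside the library, but the idea is greedy packing of whole sentences up to the token budget. A rough sketch (hypothetical; assumes a Hugging Face tokenizer):

import re

def chunk_by_sentence(text, tokenizer, max_tokens=480):
    # Greedily pack whole sentences into chunks of at most max_tokens.
    # Illustrative only -- the package performs its own chunking.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, length = [], [], 0
    for sent in sentences:
        n = len(tokenizer.encode(sent, add_special_tokens=False))
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sent)
        length += n
    if current:
        chunks.append(" ".join(current))
    return chunks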

Evaluation (CNN/DailyMail, n=200, ratio=0.3)

Method             Cosine Sim   ROUGE-L   Compression Ratio
UNTOKEN            0.878        0.459     0.304
Random drop        0.723        0.429     0.303
Stopword removal   0.933        0.824     0.761

+15.5pp cosine similarity over random drop at equivalent compression ratio.
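
Cosine similarity here compares an embedding of the compressed text against an embedding of the original. The exact embedder used for evaluation is not stated; one typical way to compute the metric, assuming a sentence-transformers model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Embedder choice is an assumption; the evaluation does not name its model.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(original, compressed):
    a, b = embedder.encode([original, compressed], convert_to_tensor=True)
    return cos_sim(a, b).item()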

Architecture

The shipped artifact is a single ~300MB model:

  • Encoder: DistilBERT-base-uncased (66M parameters)
  • Importance head: Linear(768→256) → GELU → Dropout → Linear(256→1) → Sigmoid
  • Selection: hard top-k over importance scores, preserving original token order
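
The importance head is small enough to sketch directly in PyTorch (a reconstruction from the description above, not the shipped code):

import torch
import torch.nn as nn

class ImportanceHead(nn.Module):
    # Per-token importance scorer, reconstructed from the spec above.
    # The dropout rate is an assumption; the original value is unstated.
    def __init__(self, hidden=768, proj=256, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, proj),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(proj, 1),
            nn.Sigmoid(),
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, 768) from DistilBERT
        return self.net(hidden_states).squeeze(-1)  # (batch, seq_len) scores in [0, 1]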

Training is a three-phase adversarial autoencoder:

  1. Supervised warm-up — importance head trained on (original, compressed) pairs from MeetingBank
  2. Adversarial fine-tuning — full generator trained against a discriminator on CNN/DailyMail
  3. Hardening — Gumbel-softmax replaced with straight-through estimation to close the train/test gap
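
The hardening phase swaps the relaxed Gumbel-softmax mask for a straight-through estimator, so the forward pass uses the same hard top-k as inference while gradients still flow to the scorer. A generic sketch of that trick (details of the actual training code may differ):

import torch

def straight_through_topk(scores, k):
    # Forward pass: hard 0/1 mask over the top-k scores.
    # Backward pass: gradients flow through `scores` unchanged.
    # Generic STE sketch -- not UNTOKEN's training code verbatim.
    hard = torch.zeros_like(scores)
    hard.scatter_(-1, torch.topk(scores, k, dim=-1).indices, 1.0)
    return hard + scores - scores.detach()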

The reconstructor and discriminator are training-only and are not shipped.

See ARCHITECTURE.md for full details.

Performance

Primary metric — ROUGE-L:

Target ratio   UNTOKEN v2   LLMLingua-2   Random drop   Actual ratio (UNTOKEN / LLMLingua-2)
0.2            0.331        0.279         0.308         0.205 / 0.172
0.3            0.455        0.406         0.430         0.305 / 0.262
0.4            0.558        0.518         0.539         0.404 / 0.353
0.5            0.650        0.618         0.635         0.505 / 0.448

UNTOKEN v2 leads on ROUGE-L at every compression ratio tested. The gap over LLMLingua-2 is 4–5pp at the lower ratios, narrowing to about 3pp at 0.5. UNTOKEN also consistently outperforms random drop, the baseline that requires no learning at all — confirming the model performs meaningful token selection rather than noise.

Model Size

Model                                  Parameters   Relative size
LLMLingua-2 (XLM-RoBERTa-large)        ~560M        8.4× larger
LLMLingua-2 (BERT-base-multilingual)   ~179M        2.7× larger
UNTOKEN v2                             66.56M       1× (baseline)

Training Data

v2 was trained on 7 datasets across diverse domains:

Dataset         Domain                Supervision type                      ~Records
MeetingBank     Meeting transcripts   Paired (summary)                      20K
CNN/DailyMail   News articles         Unlabeled                             300K
XSum            BBC news              Paired (summary)                      200K
DialogSum       Conversation          Paired (summary)                      14K
BillSum         Legislation           Paired (summary)                      23K
BookSum         Long-form books       Paired (summary)                      12K
GSM8K           Math reasoning        Unlabeled (discriminator real pool)   8K

See report.md for more details.

Model

  • pacifio/untoken-v1 — trained on MeetingBank + CNN/DailyMail at small scale.
  • pacifio/untoken-v2 — trained on the more diverse seven-dataset mix listed above.

License

MIT
