
UNTOKEN

Token compression for LLM prompts via a learned token selector.

UNTOKEN is an experimental architecture demonstrating adversarial autoencoder-based token importance scoring. Given N tokens, it returns a subsequence of ~0.3N tokens. The model shipped here (pacifio/untoken-v1) is trained at small scale as a proof of concept — the architecture is the contribution, not the weights.

Install

pip install untoken

Requires Python 3.10+ and PyTorch 2.1+. Works on CPU and GPU.

Context Window

The model processes up to 480 tokens per chunk (DistilBERT's 512-token limit minus special tokens). At ~20 tokens per average English sentence, that is roughly 20–24 sentences per chunk. Longer inputs are automatically split at sentence boundaries and compressed independently — no truncation occurs.

For best results, keep individual inputs under ~20 sentences. The model was trained at small scale and performs most reliably on short, self-contained passages.
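The chunking behavior can be sketched roughly as follows. This is an illustration only, not untoken's internals: `chunk_sentences` is a hypothetical helper, and whitespace word counts stand in for real DistilBERT tokenizer counts.

```python
import re

CHUNK_BUDGET = 480  # DistilBERT's 512-token limit minus special tokens

def chunk_sentences(text, budget=CHUNK_BUDGET):
    """Greedily pack whole sentences into chunks of at most `budget` tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude stand-in for a real tokenizer count
        if current and current_len + n > budget:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because packing happens at sentence boundaries, no sentence is ever split mid-way and no text is dropped.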

Usage

from untoken import Untoken

ut = Untoken("pacifio/untoken-v1")

texts = [
    "The quick brown fox jumps over the lazy dog and then runs away into the forest.",
    "Scientists discovered a new species of deep-sea fish off the coast of Japan.",
    "The meeting was postponed due to a scheduling conflict with the board of directors.",
    "She completed the marathon in under four hours despite the difficult weather conditions.",
    "The server returned a 503 error after the deployment failed during the migration step.",
]

for text in texts:
    compressed, stats = ut.compress(text, ratio=0.4, return_stats=True)
    print(f"{text[:50]!r}...")
    print(f"  -> {compressed!r}")
    print(f"  -> {stats['original_tokens']} → {stats['compressed_tokens']} tokens ({stats['savings_pct']}% savings)\n")

"""
'The quick brown fox jumps over the lazy dog and th'...
  -> 'the quick brown fox jumps over dog'
  -> 19 → 9 tokens (52.6% savings)

'Scientists discovered a new species of deep-sea fi'...
  -> 'scientists discovered a new species of sea'
  -> 18 → 9 tokens (50.0% savings)

'The meeting was postponed due to a scheduling conf'...
  -> 'the meeting was postponed due scheduling'
  -> 17 → 8 tokens (52.9% savings)

'She completed the marathon in under four hours des'...
  -> 'she completed the marathon in hours'
  -> 16 → 8 tokens (50.0% savings)

'The server returned a 503 error after the deployme'...
  -> 'the server returned a 503 the'
  -> 18 → 9 tokens (50.0% savings)
"""

Note on v1 weights: The current model was trained on a small dataset and exhibits a known failure mode — it assigns high importance to frequent function words (determiners, auxiliaries) rather than content words. This is a training data scale issue, not an architectural one. The v1 checkpoint demonstrates that the full pipeline runs end-to-end. Improving selection quality requires more training data and longer adversarial fine-tuning.

Adjustable Ratio

compressed = ut.compress(text, ratio=0.5)  # keep 50%
compressed = ut.compress(text, ratio=0.2)  # keep 20%

No retraining required — ratio is applied at inference via top-k selection.
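Inference-time top-k selection can be sketched as follows (illustrative only; `select_top_k` is a hypothetical helper, not part of the untoken API):

```python
import math

def select_top_k(tokens, scores, ratio):
    """Keep the ceil(ratio * n) highest-scoring tokens, in original order."""
    k = max(1, math.ceil(ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore original token order
    return [tokens[i] for i in kept]
```

Since the ratio only changes k, any value in (0, 1] can be used against the same trained scorer.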

CLI

untoken --model pacifio/untoken-v1 --input prompt.txt --ratio 0.3

Long Documents

Inputs exceeding 480 tokens are automatically chunked at sentence boundaries.

with open("document.txt") as f:
    text = f.read()

compressed = ut.compress(text, ratio=0.3)

Evaluation (CNN/DailyMail, n=200, ratio=0.3)

Method            Cosine sim  ROUGE-L  Compression ratio
UNTOKEN           0.878       0.459    0.304
Random drop       0.723       0.429    0.303
Stopword removal  0.933       0.824    0.761

+15.5pp cosine similarity over random drop at equivalent compression ratio.
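As a reminder of the metric itself (how the original and compressed texts are embedded is not specified here, so this shows only the cosine computation on two generic embedding vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```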

Architecture

The shipped artifact is a single ~300MB model:

  • Encoder: DistilBERT-base-uncased (66M parameters)
  • Importance head: Linear(768→256) → GELU → Dropout → Linear(256→1) → Sigmoid
  • Selection: hard top-k over importance scores, preserving original token order
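The importance head as described maps each 768-dimensional encoder state to a score in [0, 1]. A minimal PyTorch sketch (the dropout probability is an assumption; it is not stated here):

```python
import torch.nn as nn

# Linear(768→256) → GELU → Dropout → Linear(256→1) → Sigmoid
importance_head = nn.Sequential(
    nn.Linear(768, 256),
    nn.GELU(),
    nn.Dropout(p=0.1),  # assumed value; not specified in this release
    nn.Linear(256, 1),
    nn.Sigmoid(),
)
```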

Training is a three-phase adversarial autoencoder:

  1. Supervised warm-up — importance head trained on (original, compressed) pairs from MeetingBank
  2. Adversarial fine-tuning — full generator trained against a discriminator on CNN/DailyMail
  3. Hardening — Gumbel-softmax replaced with straight-through estimation to close the train/test gap
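A generic sketch of a straight-through top-k estimator, the technique phase 3 switches to (function name and details are illustrative, not the project's actual code): the forward pass emits a hard 0/1 mask, while gradients flow through the soft scores.

```python
import torch

def straight_through_topk(scores, k):
    """Hard top-k mask in the forward pass; gradients flow via the soft scores."""
    topk = torch.topk(scores, k).indices
    hard = torch.zeros_like(scores)
    hard[topk] = 1.0
    soft = torch.sigmoid(scores)
    # Forward value is exactly `hard` (soft - soft.detach() is zero-valued),
    # but autograd differentiates through `soft`.
    return hard + (soft - soft.detach())
```

Training with the same hard selection used at inference is what closes the train/test gap mentioned above.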

The reconstructor and discriminator are training-only and are not shipped.

See ARCHITECTURE.md for full details.

Performance

Primary metric — ROUGE-L:

Target ratio  UNTOKEN v2  LLMLingua-2  Random drop  Actual ratio (UNTOKEN / LLMLingua-2)
0.2           0.331       0.279        0.308        0.205 / 0.172
0.3           0.455       0.406        0.430        0.305 / 0.262
0.4           0.558       0.518        0.539        0.404 / 0.353
0.5           0.650       0.618        0.635        0.505 / 0.448

UNTOKEN v2 leads on ROUGE-L at every compression ratio tested. The gap over LLMLingua-2 is 4–5pp at low ratios, narrowing to 3pp at 0.5. UNTOKEN also consistently outperforms random drop, the baseline that requires zero learning, confirming that the model performs meaningful token selection rather than noise.

Model Size

Model                                 Parameters  Relative size
LLMLingua-2 (XLM-RoBERTa-large)       ~560M       8.4× larger
LLMLingua-2 (BERT-base-multilingual)  ~179M       2.7× larger
UNTOKEN v2                            66.56M      1× (baseline)

Training Data

v2 was trained on 7 datasets across diverse domains:

Dataset        Domain               Supervision type                     ~Records
MeetingBank    Meeting transcripts  Paired (summary)                     20K
CNN/DailyMail  News articles        Unlabeled                            300K
XSum           BBC news             Paired (summary)                     200K
DialogSum      Conversation         Paired (summary)                     14K
BillSum        Legislation          Paired (summary)                     23K
BookSum        Long-form books      Paired (summary)                     12K
GSM8K          Math reasoning       Unlabeled (discriminator real pool)  8K

See report.md for more details.

Model

  • pacifio/untoken-v1 — trained on MeetingBank + CNN/DailyMail at small scale
  • pacifio/untoken-v2 — trained on the more diverse seven-dataset mix described above

License

MIT
