UNTOKEN

Token compression for LLM prompts via a learned token selector.

UNTOKEN is an experimental architecture demonstrating adversarial autoencoder-based token importance scoring. Given N tokens, it returns a subsequence of ~0.3N tokens. The model shipped here (pacifio/untoken-v1) is trained at small scale as a proof of concept — the architecture is the contribution, not the weights.

Install

pip install untoken

Requires Python 3.10+ and PyTorch 2.1+. Works on CPU and GPU.

Context Window

The model processes up to 480 tokens per chunk (DistilBERT's 512-token limit minus special tokens). At ~20 tokens per average English sentence, that is roughly 20–24 sentences per chunk. Longer inputs are automatically split at sentence boundaries and compressed independently — no truncation occurs.

For best results, keep individual inputs under ~20 sentences. The model was trained at small scale and performs most reliably on short, self-contained passages.

Usage

from untoken import Untoken

ut = Untoken("pacifio/untoken-v1")

texts = [
    "The quick brown fox jumps over the lazy dog and then runs away into the forest.",
    "Scientists discovered a new species of deep-sea fish off the coast of Japan.",
    "The meeting was postponed due to a scheduling conflict with the board of directors.",
    "She completed the marathon in under four hours despite the difficult weather conditions.",
    "The server returned a 503 error after the deployment failed during the migration step.",
]

for text in texts:
    compressed, stats = ut.compress(text, ratio=0.4, return_stats=True)
    print(f"{text[:50]!r}...")
    print(f"  -> {compressed!r}")
    print(f"  -> {stats['original_tokens']}{stats['compressed_tokens']} tokens ({stats['savings_pct']}% savings)\n")

"""
'The quick brown fox jumps over the lazy dog and th'...
  -> 'the quick brown fox jumps over dog'
  -> 19 → 9 tokens (52.6% savings)

'Scientists discovered a new species of deep-sea fi'...
  -> 'scientists discovered a new species of sea'
  -> 18 → 9 tokens (50.0% savings)

'The meeting was postponed due to a scheduling conf'...
  -> 'the meeting was postponed due scheduling'
  -> 17 → 8 tokens (52.9% savings)

'She completed the marathon in under four hours des'...
  -> 'she completed the marathon in hours'
  -> 16 → 8 tokens (50.0% savings)

'The server returned a 503 error after the deployme'...
  -> 'the server returned a 503 the'
  -> 18 → 9 tokens (50.0% savings)
"""

Note on v1 weights: The current model was trained on a small dataset and exhibits a known failure mode — it assigns high importance to frequent function words (determiners, auxiliaries) rather than content words. This is a training data scale issue, not an architectural one. The v1 checkpoint demonstrates that the full pipeline runs end-to-end. Improving selection quality requires more training data and longer adversarial fine-tuning.

Adjustable Ratio

compressed = ut.compress(text, ratio=0.5)  # keep 50%
compressed = ut.compress(text, ratio=0.2)  # keep 20%

No retraining required — ratio is applied at inference via top-k selection.
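
Conceptually, the selection step is just a top-k over per-token importance scores. A minimal sketch of that idea (illustrative only; select_tokens is a hypothetical helper, not part of the package API):

import torch

def select_tokens(token_ids: torch.Tensor, scores: torch.Tensor, ratio: float) -> torch.Tensor:
    # Keep the top round(ratio * N) tokens by importance score,
    # then restore the original token order.
    # Hypothetical sketch -- not UNTOKEN's actual implementation.
    k = max(1, int(round(ratio * token_ids.numel())))
    keep = torch.topk(scores, k).indices
    keep = torch.sort(keep).values
    return token_ids[keep]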

CLI

untoken --model pacifio/untoken-v1 --input prompt.txt --ratio 0.3

Long Documents

Inputs exceeding 480 tokens are automatically chunked at sentence boundaries.

with open("document.txt") as f:
    text = f.read()

compressed = ut.compress(text, ratio=0.3)
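
The chunking happens inside the library, but the idea is greedy packing of whole sentences up to the token budget. A rough sketch (hypothetical; assumes a Hugging Face tokenizer):

import re

def chunk_by_sentence(text, tokenizer, max_tokens=480):
    # Greedily pack whole sentences into chunks of at most max_tokens.
    # Illustrative only -- the package performs its own chunking.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, length = [], [], 0
    for sent in sentences:
        n = len(tokenizer.encode(sent, add_special_tokens=False))
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sent)
        length += n
    if current:
        chunks.append(" ".join(current))
    return chunks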

Evaluation (CNN/DailyMail, n=200, ratio=0.3)

Method             Cosine Sim   ROUGE-L   Compression Ratio
UNTOKEN            0.878        0.459     0.304
Random drop        0.723        0.429     0.303
Stopword removal   0.933        0.824     0.761

+15.5pp cosine similarity over random drop at equivalent compression ratio.
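
Cosine similarity here compares an embedding of the compressed text against an embedding of the original. The exact embedder used for evaluation is not stated; one typical way to compute the metric, assuming a sentence-transformers model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Embedder choice is an assumption; the evaluation does not name its model.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(original, compressed):
    a, b = embedder.encode([original, compressed], convert_to_tensor=True)
    return cos_sim(a, b).item()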

Architecture

The shipped artifact is a single ~300MB model:

  • Encoder: DistilBERT-base-uncased (66M parameters)
  • Importance head: Linear(768→256) → GELU → Dropout → Linear(256→1) → Sigmoid
  • Selection: hard top-k over importance scores, preserving original token order
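
The importance head is small enough to sketch directly in PyTorch (a reconstruction from the description above, not the shipped code):

import torch
import torch.nn as nn

class ImportanceHead(nn.Module):
    # Per-token importance scorer, reconstructed from the spec above.
    # The dropout rate is an assumption; the original value is unstated.
    def __init__(self, hidden=768, proj=256, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, proj),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(proj, 1),
            nn.Sigmoid(),
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, 768) from DistilBERT
        return self.net(hidden_states).squeeze(-1)  # (batch, seq_len) scores in [0, 1]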

Training is a three-phase adversarial autoencoder:

  1. Supervised warm-up — importance head trained on (original, compressed) pairs from MeetingBank
  2. Adversarial fine-tuning — full generator trained against a discriminator on CNN/DailyMail
  3. Hardening — Gumbel-softmax replaced with straight-through estimation to close the train/test gap
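
The hardening phase swaps the relaxed Gumbel-softmax mask for a straight-through estimator, so the forward pass uses the same hard top-k as inference while gradients still flow to the scorer. A generic sketch of that trick (details of the actual training code may differ):

import torch

def straight_through_topk(scores, k):
    # Forward pass: hard 0/1 mask over the top-k scores.
    # Backward pass: gradients flow through `scores` unchanged.
    # Generic STE sketch -- not UNTOKEN's training code verbatim.
    hard = torch.zeros_like(scores)
    hard.scatter_(-1, torch.topk(scores, k, dim=-1).indices, 1.0)
    return hard + scores - scores.detach()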

The reconstructor and discriminator are training-only and are not shipped.

See ARCHITECTURE.md for full details.

Performance

Primary metric — ROUGE-L:

Target ratio   UNTOKEN v2   LLMLingua-2   Random drop   Actual ratio (UNTOKEN / LLMLingua-2)
0.2            0.331        0.279         0.308         0.205 / 0.172
0.3            0.455        0.406         0.430         0.305 / 0.262
0.4            0.558        0.518         0.539         0.404 / 0.353
0.5            0.650        0.618         0.635         0.505 / 0.448

UNTOKEN v2 leads on ROUGE-L at every compression ratio tested. The gap over LLMLingua-2 is 4–5pp at the lower ratios, narrowing to about 3pp at 0.5. UNTOKEN also consistently outperforms random drop, the baseline that requires no learning at all — confirming the model performs meaningful token selection rather than noise.

Model Size

Model                                  Parameters   Relative size
LLMLingua-2 (XLM-RoBERTa-large)        ~560M        8.4× larger
LLMLingua-2 (BERT-base-multilingual)   ~179M        2.7× larger
UNTOKEN v2                             66.56M       1× (baseline)

Training Data

v2 was trained on 7 datasets across diverse domains:

Dataset         Domain                Supervision type                      ~Records
MeetingBank     Meeting transcripts   Paired (summary)                      20K
CNN/DailyMail   News articles         Unlabeled                             300K
XSum            BBC news              Paired (summary)                      200K
DialogSum       Conversation          Paired (summary)                      14K
BillSum         Legislation           Paired (summary)                      23K
BookSum         Long-form books       Paired (summary)                      12K
GSM8K           Math reasoning        Unlabeled (discriminator real pool)   8K

See report.md for more details.

Model

  • pacifio/untoken-v1 — trained on MeetingBank + CNN/DailyMail at small scale.
  • pacifio/untoken-v2 — trained on the more diverse seven-dataset mix listed above.

License

MIT
