Ukrainian text optimization for LLMs: fewer tokens, better comprehension

dormouse

Ukrainian text optimizer for LLMs — fewer tokens, better comprehension.

Normalizes surzhyk, slang, profanity, and fillers, then maps the text to compact English for cloud LLMs such as Claude and GPT. This saves 60-73% of input tokens, and GPT response quality rose from 67% to 100% in the tests reported under Results.

Results

Tested on 53,351 texts (Telegram corpus + books), 12 IT prompts across 4 GPT models:

Metric	Value
Token savings (cloud)	73%
Token savings (without seq2seq)	49%
Lexicon coverage	88%
Seq2seq exact match	98.2%
GPT response quality (original UA)	67%
GPT response quality (squeezed EN)	100%
Quality preservation	150% (squeezed > original)

Original UA:  "блін продакшн впав після деплою, що робити першим"
Squeezed EN:  "damn production crashed after deploy, what do first"
Tokens:       45 → 12 (-73%)
GPT accuracy: 67% → 100%

How it works

graph LR
    A[UA text<br/>surzhyk, slang] --> B[crack_open<br/>normalize]
    B --> C[compress<br/>remove fillers]
    C --> D[map_to_en<br/>lexicon + seq2seq]
    D --> E[EN compressed<br/>for LLM]

    style A fill:#fdd,stroke:#c33
    style E fill:#dfd,stroke:#3a3

Layer	What it does	How
crack_open	surzhyk, slang, profanity → standard UA	360 rules + pymorphy3 lemmatization
compress	remove fillers, intensifiers, noise	rule-based pattern matching
map_to_en	UA → compact English	47K lexicon + seq2seq (28K expression pairs)
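As a rough illustration of how the three layers compose, here is a toy sketch. It is not dormouse's implementation: the rule tables are tiny hypothetical stand-ins for the real 360 rules and 47K-entry lexicon, there is no pymorphy3 lemmatization or seq2seq step, and the function names simply mirror the layer names in the table above.

```python
import re

# Hypothetical miniature tables, for illustration only.
SURZHYK_RULES = {"шо": "що", "ваще": "взагалі", "канєшно": "звичайно"}
FILLERS = {"ну", "типу", "короче"}
LEXICON = {
    "що": "what", "робити": "do", "першим": "first",
    "продакшн": "production", "впав": "crashed",
    "після": "after", "деплою": "deploy",
}

# Words or single punctuation marks; \w matches Cyrillic in Python 3.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def crack_open(tokens):
    """Layer 1: replace non-standard forms with standard Ukrainian."""
    return [SURZHYK_RULES.get(t, t) for t in tokens]


def compress(tokens):
    """Layer 2: drop filler words."""
    return [t for t in tokens if t not in FILLERS]


def map_to_en(tokens):
    """Layer 3: word-by-word lexicon lookup; unknown tokens pass through."""
    return [LEXICON.get(t, t) for t in tokens]


def toy_squeeze(text):
    tokens = TOKEN_RE.findall(text.lower())
    for layer in (crack_open, compress, map_to_en):
        tokens = layer(tokens)
    # Re-join and remove the space that joining puts before punctuation.
    return re.sub(r"\s+([,.!?])", r"\1", " ".join(tokens))


print(toy_squeeze("ну продакшн впав після деплою, шо робити першим"))
# → production crashed after deploy, what do first
```

Keeping each layer as a list-to-list function makes the order explicit: normalization has to run first so that the filler and lexicon lookups see standard forms.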

Install

pip install dormouse-ua

Everything works out of the box — lexicon (47K entries), seq2seq model (28K expression pairs), and vocab files are bundled in the package.

Embedding-based search (stir/mumble/sip) additionally needs PyTorch:

pip install "dormouse-ua[ml]"    # + torch, sentence-transformers
pip install "dormouse-ua[all]"   # everything

Quick start

from dormouse import squeeze

# Normalize only (layers 1+2)
squeeze("шо там по баґу, пофікси плз")
# → "що там по помилці, виправ"

# Cloud mode — compress for Claude/GPT (layers 1+2+3)
squeeze("ваще нормально, канєшно зробимо", target="cloud")
# → "generally ok, sure do"
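If squeezing is an optional optimization in your application, you may want calls to keep working when the package is not installed. The wrapper below is a minimal sketch of that idea. The name squeeze_or_passthrough is hypothetical and not part of dormouse; it relies only on the squeeze(text) and squeeze(text, target=...) calls shown above.

```python
from typing import Optional


def squeeze_or_passthrough(text: str, target: Optional[str] = None) -> str:
    """Squeeze `text` if dormouse is importable, otherwise return it unchanged."""
    try:
        from dormouse import squeeze
    except ImportError:
        # Package missing: fall back to the original text.
        return text
    if target is None:
        return squeeze(text)
    return squeeze(text, target=target)


prompt = squeeze_or_passthrough("шо там по баґу, пофікси плз", target="cloud")
print(prompt)
```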

SDK Middleware (drop-in)

from openai import OpenAI
from dormouse import DormouseClient

client = DormouseClient(OpenAI())  # or Anthropic()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "шо там по деплою, він ваще не робе"}],
)
# Prompt: squeeze → EN → GPT → unsqueeze → Ukrainian response
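One way to picture the first step of that flow is a function that rewrites the message list before it is forwarded. This is a sketch under assumptions, not DormouseClient's actual code: the helper name squeeze_user_messages is hypothetical, and it assumes that only user-role messages with plain string content are squeezed while system and assistant messages pass through untouched.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def squeeze_user_messages(
    messages: List[Message], squeeze_fn: Callable[[str], str]
) -> List[Message]:
    """Return a copy of `messages` with user content passed through squeeze_fn."""
    out = []
    for msg in messages:
        if msg.get("role") == "user" and isinstance(msg.get("content"), str):
            out.append({**msg, "content": squeeze_fn(msg["content"])})
        else:
            # Leave system/assistant messages as they are (copied, not shared).
            out.append(dict(msg))
    return out


# Stand-in for a real squeeze function, so the sketch runs without dormouse.
def fake_squeeze(text: str) -> str:
    return text.replace("шо", "що")


original = [
    {"role": "system", "content": "You are a DevOps assistant."},
    {"role": "user", "content": "шо там по деплою"},
]
print(squeeze_user_messages(original, fake_squeeze))
```

Returning a new list rather than mutating the input keeps the caller's conversation history in its original form, which matters if the unsqueezed text is shown back to the user.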

Semantic search

from dormouse import stir, mumble, sip

stir("report.pdf")                                    # index
results = mumble("холодні закуски")                   # search by meaning
topics = sip("data.xlsx", topics=["HR", "finance"])   # classify
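Search "by meaning" generally works by comparing embedding vectors rather than matching keywords. The sketch below ranks chunks by cosine similarity to a query vector. It is an illustration only: this README does not state which similarity metric mumble uses, the 3-dimensional vectors are made up, and a real setup would get vectors from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunks, k=2):
    """Return the texts of the k chunks most similar to query_vec."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]


# Made-up embeddings: (chunk text, vector).
chunks = [
    ("cold appetizers", [0.9, 0.1, 0.0]),
    ("tax report", [0.0, 0.2, 0.9]),
    ("salads", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], chunks))
# → ['cold appetizers', 'salads']
```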

CLI

dormouse squeeze "шо там по баґу" -t cloud
dormouse stir book.pdf
dormouse mumble "головний герой"

Comparison with alternatives

Head-to-head: dormouse vs LLMLingua (same 20 prompts, GPT-4.1-nano judge)

Method	Tokens	Savings	Quality
Original UA	1,312	—	4.65/5
dormouse	620	53%	4.50/5
LLMLingua (on UA)	1,182	10%	4.60/5
dormouse + LLMLingua	595	55%	4.60/5

LLMLingua achieves only 10% savings on Ukrainian — its GPT-2 perplexity model doesn't understand Cyrillic. dormouse gives 5x more compression on the same texts.
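The savings column can be recomputed from the token counts in the table as 1 - tokens / baseline. The short check below does that and also derives the "5x" ratio; it uses only the numbers reported above.

```python
BASELINE_TOKENS = 1312  # "Original UA" row

token_counts = {
    "dormouse": 620,
    "LLMLingua (on UA)": 1182,
    "dormouse + LLMLingua": 595,
}


def savings_pct(tokens, baseline=BASELINE_TOKENS):
    """Percentage of tokens saved relative to the baseline."""
    return 100.0 * (1 - tokens / baseline)


for name, tokens in token_counts.items():
    print(f"{name}: {savings_pct(tokens):.0f}%")

# How many times more dormouse saves than LLMLingua on the same prompts.
ratio = savings_pct(620) / savings_pct(1182)
print(f"ratio: {ratio:.1f}x")
```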

Why dormouse is different

The problem: Ukrainian Cyrillic costs 2-4x more tokens than equivalent English text in GPT-4/Claude.

Tool	Ukrainian	Savings (on UA texts)	Approach
dormouse	native	53% (tested)	normalize + compress + translate
LLMLingua	no	10% (tested)	ML perplexity pruning
Selective Context	no	~10-15%*	self-information filtering
token-reducer	no	~10-15%*	6-stage pipeline

*Estimated — these tools use similar English-trained models, expected to perform comparably to LLMLingua on Cyrillic.

The compression tools above assume the input is already English. dormouse addresses the problem one step earlier: it transforms expensive Ukrainian (3-4 tokens/word) into cheap English (1-1.5 tokens/word) while preserving meaning. No other tool in this comparison specifically targets Ukrainian.
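The per-word figures give a quick back-of-the-envelope range for what translation alone can save. The sketch assumes the word count stays the same after translation, which is a simplification, since the compress layer also removes words.

```python
UA_TOKENS_PER_WORD = (3.0, 4.0)   # range quoted above for Ukrainian
EN_TOKENS_PER_WORD = (1.0, 1.5)   # range quoted above for English


def savings_range(ua, en):
    """Return (worst, best) fractional token savings for equal word counts."""
    worst = 1 - max(en) / min(ua)   # most expensive EN vs cheapest UA
    best = 1 - min(en) / max(ua)    # cheapest EN vs most expensive UA
    return worst, best


worst, best = savings_range(UA_TOKENS_PER_WORD, EN_TOKENS_PER_WORD)
print(f"{worst:.0%} to {best:.0%}")
# → 50% to 75%
```

That 50-75% range brackets the 53% and 73% savings reported elsewhere in this README.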

Use cases

Cost reduction: Ukrainian Cyrillic encodes into 2-4x more tokens than equivalent English. dormouse saves 60-73% of input tokens.

Chatbots & support: Users write in surzhyk or slang. dormouse normalizes the text before it reaches the LLM, so GPT gives concrete answers instead of generic responses.

RAG & document search: Users search in slang, while documents are written in literary language. dormouse normalizes both sides, so search matches by meaning.

AI agents: Long chains of actions eat the context window. At 73% compression the same text takes about 27% of the tokens, so roughly 3.7x as much history fits in the same window.

Batch processing: Running 10K comments through GPT for sentiment analysis? Squeeze first to make it cheaper and faster.

Local search & classification (no API needed): stir/mumble/sip work fully offline. Index PDF/Excel/TXT, search by meaning, and classify by topic, all on CPU with local embeddings (MiniLM-L12-v2). No cloud, no keys, no cost.
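For the AI agents use case, the effect of compression on context capacity can be worked out directly: if text shrinks by a fraction c, the same window holds 1 / (1 - c) times as much original text. The 128,000-token window below is a hypothetical figure chosen for illustration, not one taken from this README.

```python
def capacity_multiplier(compression: float) -> float:
    """How many times more original text fits after compressing by `compression`."""
    if not 0 <= compression < 1:
        raise ValueError("compression must be in [0, 1)")
    return 1 / (1 - compression)


def effective_window(window_tokens: int, compression: float) -> int:
    """Original-text tokens that fit in a window once the text is compressed."""
    return int(window_tokens * capacity_multiplier(compression))


print(f"{capacity_multiplier(0.73):.1f}x")   # → 3.7x
print(effective_window(128_000, 0.73))       # hypothetical 128K-token window
```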

Eval details

Full evaluation ran for 4 days on 53,351 texts:

Corpus: 53,351 texts (Telegram + books)
Squeeze speed: 606 texts/sec (normalization)
Seq2seq model: 7.3M params, 28K expression pairs
Stir/mumble: 8,441 chunks indexed, search ~600ms
Sip classification: 99% of texts classified (8 topics)

Quality preservation (100 real prompts, automated scoring 1-5)

Model	UA score	Squeezed EN	Preservation
GPT-4.1	4.79	4.86	102%
GPT-4.1-mini	4.71	4.68	99%
GPT-4o-mini	4.61	4.60	100%
GPT-4.1-nano	4.58	4.56	100%
GPT-5.5	4.00	4.00	100%
Gemini 2.0 Flash	4.11	4.10	100%

Squeeze preserves 99-102% quality across all tested models. GPT-4.1 actually performs better on squeezed text.

Note on GPT-5.5 scores: GPT-5.5 shows lower absolute scores (4.0 vs 4.79 for GPT-4.1) — this is an artifact of our heuristic judge (length + structure based). GPT-5.5 produces shorter, more precise answers that score lower on this metric. A proper LLM-judge eval would likely show higher scores. Preservation ratio (100%) is the meaningful metric here.

HF Inference API (open-weight models)

Model	UA score	Squeezed EN	Delta
Qwen2.5-72B	4.9/5	4.5/5	-0.4
Qwen2.5-7B	4.4/5	3.6/5	-0.8
Llama-3.2-1B	2.7/5	2.8/5	+0.1

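For the Qwen2.5 models tested here, squeezed EN scored lower than native Ukrainian (-0.4 for 72B, -0.8 for 7B), while Llama-3.2-1B was roughly unchanged (+0.1). With models like these, use brew() with native Ukrainian instead of squeezing.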

Architecture

src/dormouse/
├── optimizer.py       — squeeze() main pipeline
├── rule_engine.py     — normalization (360 rules + pymorphy3)
├── compressor.py      — filler/noise removal
├── mapper.py          — UA→EN via lexicon + lemma + transliteration
├── seq2seq.py         — expression translator (GRU encoder-decoder)
├── teapot.py          — stir/mumble/sip/brew (search + LLM)
├── embedder.py        — sentence-transformers wrapper
├── middleware.py      — OpenAI/Anthropic SDK proxy
├── cli.py             — Click CLI
├── assets.py          — bundled data + lazy download fallback
└── data/              — lexicon.db, seq2seq model, vocab, rules

Development

git clone https://github.com/ChuprinaDaria/dormouse
cd dormouse
pip install -e ".[dev,morph]"
DORMOUSE_DATA_DIR=./data pytest tests/ -v

License

MIT

Built by Daria Chuprina because she can 👾.

Lazysoft | LinkedIn | dchuprina@lazysoft.pl
