Ukrainian text optimization for LLMs: fewer tokens, better comprehension
# dormouse

Ukrainian text optimizer for LLMs — fewer tokens, better comprehension.
Normalizes surzhyk, slang, fillers, and maps to English for cloud LLMs. Saves 60-73% tokens while improving response quality.
UA: Optimizes Ukrainian texts for LLMs. Normalizes surzhyk, slang, and profanity, then compresses into English for Claude/GPT. Saves 60-73% of tokens; response quality rises from 67% to 100%.
## Results
Tested on 53,351 texts (Telegram corpus + books), 12 IT prompts across 4 GPT models:
| Metric | Value |
|---|---|
| Token savings (cloud) | 73% |
| Token savings (without seq2seq) | 49% |
| Lexicon coverage | 88% |
| Seq2seq exact match | 98.2% |
| GPT response quality (original UA) | 67% |
| GPT response quality (squeezed EN) | 100% |
| Quality preservation | 150% (squeezed > original) |
```text
Original UA: "блін продакшн впав після деплою, що робити першим"
Squeezed EN: "damn production crashed after deploy, what do first"
Tokens: 45 → 12 (-73%)
GPT accuracy: 67% → 100%
```
## How it works
```mermaid
graph LR
A[UA text<br/>surzhyk, slang] --> B[crack_open<br/>normalize]
B --> C[compress<br/>remove fillers]
C --> D[map_to_en<br/>lexicon + seq2seq]
D --> E[EN compressed<br/>for LLM]
style A fill:#fdd,stroke:#c33
style E fill:#dfd,stroke:#3a3
```
| Layer | What it does | How |
|---|---|---|
| crack_open | surzhyk, slang, profanity → standard UA | 360 rules + pymorphy3 lemmatization |
| compress | remove fillers, intensifiers, noise | rule-based pattern matching |
| map_to_en | UA → compact English | 47K lexicon + seq2seq (28K expression pairs) |
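The three-layer flow can be sketched in miniature. Everything below is a toy illustration: the dictionaries are invented stand-ins, not dormouse's actual 360-rule set, 47K lexicon, or seq2seq model.

```python
# Toy three-layer pipeline illustrating the crack_open -> compress -> map_to_en
# flow. All rules and dictionary entries here are invented examples.

CRACK_OPEN = {"шо": "що", "пофікси": "виправ"}      # surzhyk/slang -> standard UA
FILLERS = {"блін", "ваще", "канєшно"}               # noise dropped by compress
LEXICON = {"що": "what", "там": "there", "по": "about", "помилці": "bug"}

def crack_open(text: str) -> str:
    return " ".join(CRACK_OPEN.get(w, w) for w in text.split())

def compress(text: str) -> str:
    return " ".join(w for w in text.split() if w not in FILLERS)

def map_to_en(text: str) -> str:
    return " ".join(LEXICON.get(w, w) for w in text.split())

def squeeze_sketch(text: str) -> str:
    # Layers applied in order, exactly as in the diagram above.
    return map_to_en(compress(crack_open(text)))
```

The real pipeline additionally lemmatizes with pymorphy3 before rule matching and falls back to a seq2seq model for multi-word expressions, but the layering is the same.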
## Install

```bash
pip install dormouse-ua

# With morphological analysis (recommended)
pip install dormouse-ua[morph]

# Everything
pip install dormouse-ua[all]
```
## Quick start

```python
from dormouse import squeeze

# Normalize only (layers 1+2)
squeeze("шо там по баґу, пофікси плз")
# → "що там по помилці, виправ"

# Cloud mode — compress for Claude/GPT (layers 1+2+3)
squeeze("ваще нормально, канєшно зробимо", target="cloud")
# → "generally ok, sure do"
```
## SDK Middleware (drop-in)

```python
from openai import OpenAI
from dormouse import DormouseClient

client = DormouseClient(OpenAI())  # or Anthropic()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "шо там по деплою, він ваще не робе"}],
)
# Prompt path: squeeze → EN → GPT → unsqueeze → Ukrainian response
```
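The proxy pattern such a middleware implies can be sketched generically. This is illustrative only: `SqueezingProxy` and its internals are invented here and are not DormouseClient's real code.

```python
# Illustrative SDK proxy: squeeze outgoing user messages, forward the call to
# the wrapped client, then translate the reply back. squeeze_fn/unsqueeze_fn
# stand in for dormouse's real squeeze/unsqueeze functions.

class SqueezingProxy:
    def __init__(self, client, squeeze_fn, unsqueeze_fn):
        self._client = client
        self._squeeze = squeeze_fn
        self._unsqueeze = unsqueeze_fn

    def create(self, *, model, messages, **kwargs):
        # Compress only user-authored content; system/assistant turns pass through.
        squeezed = [
            {**m, "content": self._squeeze(m["content"])} if m["role"] == "user" else m
            for m in messages
        ]
        response = self._client.create(model=model, messages=squeezed, **kwargs)
        response["content"] = self._unsqueeze(response["content"])
        return response
```

The point of the pattern is that the calling code keeps the familiar `create(model=..., messages=...)` shape while compression happens transparently on both sides of the API call.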
## Semantic search

```python
from dormouse import stir, mumble, sip

stir("report.pdf")                                   # index
results = mumble("холодні закуски")                  # search by meaning
topics = sip("data.xlsx", topics=["HR", "finance"])  # classify
```
## CLI

```bash
dormouse squeeze "шо там по баґу" -t cloud
dormouse stir book.pdf
dormouse mumble "головний герой"
```
## Comparison with alternatives
Head-to-head: dormouse vs LLMLingua (same 20 prompts, GPT-4.1-nano judge)
| Method | Tokens | Savings | Quality |
|---|---|---|---|
| Original UA | 1,312 | — | 4.65/5 |
| dormouse | 620 | 53% | 4.50/5 |
| LLMLingua (on UA) | 1,182 | 10% | 4.60/5 |
| dormouse + LLMLingua | 595 | 55% | 4.60/5 |
LLMLingua achieves only 10% savings on Ukrainian — its GPT-2 perplexity model doesn't understand Cyrillic. dormouse gives 5x more compression on the same texts.
## Why dormouse is different
The problem: Ukrainian Cyrillic costs 3-4x more tokens than equivalent English text in GPT-4/Claude.
| Tool | Ukrainian | Savings (on UA texts) | Approach |
|---|---|---|---|
| dormouse | native | 53% (tested) | normalize + compress + translate |
| LLMLingua | no | 10% (tested) | ML perplexity pruning |
| Selective Context | no | ~10-15%* | self-information filtering |
| token-reducer | no | ~10-15%* | 6-stage pipeline |
*Estimated — these tools use similar English-trained models, expected to perform comparably to LLMLingua on Cyrillic.
All existing compression tools work on already English text. dormouse solves the problem one level earlier — transforms expensive Ukrainian (3-4 tokens/word) into cheap English (1-1.5 tokens/word) while preserving meaning. No other tool specifically optimizes Ukrainian for LLMs.
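The cost gap is visible even at the byte level: each Cyrillic letter is two bytes in UTF-8, and BPE tokenizers trained mostly on English cover Cyrillic byte sequences poorly, so token counts track bytes far more closely for Ukrainian than for English. A rough byte-count proxy (not real tokenizer output; the sample phrases are arbitrary):

```python
# UTF-8 byte counts as a crude proxy for BPE token cost: Cyrillic letters take
# 2 bytes each, and English-trained BPE merges rarely span them, so Ukrainian
# text expands at both the byte and the token level.

ua = "продакшн впав після деплою"
en = "production crashed after deploy"

ua_bytes_per_char = len(ua.encode("utf-8")) / len(ua)
en_bytes_per_char = len(en.encode("utf-8")) / len(en)
```

The full 3-4x token ratio quoted above comes from tokenizer vocabulary coverage, not bytes alone, but the byte expansion already shows why Cyrillic input is structurally more expensive.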
## Use cases
- **Cost reduction** — Ukrainian Cyrillic encodes into 2-4x more tokens than equivalent English. dormouse saves 60-73% on input tokens.
- **Chatbots & support** — Users write in surzhyk/slang; dormouse normalizes before the LLM, so GPT gives concrete answers instead of generic responses.
- **RAG & document search** — Users search in slang while documents are in literary language. dormouse normalizes both sides, so search matches by meaning.
- **AI agents** — Long chains of actions eat the context window. 73% compression means 73% more "memory" for the agent.
- **Batch processing** — Running 10K comments through GPT for sentiment analysis? Squeeze first: cheaper and faster.
- **Local search & classification (no API needed)** — stir/mumble/sip work fully offline. Index PDF/Excel/TXT, search by meaning, classify by topic, all on CPU with local embeddings (MiniLM-L12-v2). No cloud, no keys, no cost.
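Stripped to its core, the stir/mumble flow is standard embedding search: embed chunks once at index time, embed the query, rank by cosine similarity. The sketch below uses hand-made 2-D vectors as stand-ins for real MiniLM-L12-v2 embeddings.

```python
# Minimal embedding search: cosine similarity over precomputed chunk vectors.
# Vectors are toy stand-ins for sentence-transformer output.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, chunk_vecs, k=2):
    # Rank chunk indices by similarity to the query, best first.
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(query, chunk_vecs[i]),
                   reverse=True)
    return order[:k]

chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # indexed document chunks
query = [1.0, 0.2]                             # "search by meaning" query
ranking = top_k(query, chunks)                 # most similar chunk indices
```

Because similarity is computed on dense vectors rather than surface words, a slang query and a literary-language document can still land near each other in embedding space, which is what makes the normalize-both-sides approach work for RAG.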
## Eval details
Full evaluation ran for 4 days on 53,351 texts:
- Corpus: 53,351 texts (Telegram + books)
- Squeeze speed: 606 texts/sec (normalization)
- Seq2seq model: 7.3M params, 28K expression pairs
- Stir/mumble: 8,441 chunks indexed, search ~600 ms
- Sip classification: 99% of texts classified (8 topics)
### Quality preservation (100 real prompts, automated scoring 1-5)
| Model | UA score | Squeezed EN | Preservation |
|---|---|---|---|
| GPT-4.1 | 4.79 | 4.86 | 102% |
| GPT-4.1-mini | 4.71 | 4.68 | 99% |
| GPT-4o-mini | 4.61 | 4.60 | 100% |
| GPT-4.1-nano | 4.58 | 4.56 | 100% |
| GPT-5.5 | 4.00 | 4.00 | 100% |
| Gemini 2.0 Flash | 4.11 | 4.10 | 100% |
Squeeze preserves 99-102% quality across all tested models. GPT-4.1 actually performs better on squeezed text.
Note on GPT-5.5 scores: GPT-5.5 shows lower absolute scores (4.0 vs 4.79 for GPT-4.1) — this is an artifact of our heuristic judge (length + structure based). GPT-5.5 produces shorter, more precise answers that score lower on this metric. A proper LLM-judge eval would likely show higher scores. Preservation ratio (100%) is the meaningful metric here.
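For reference, the preservation column is simply the squeezed-EN score divided by the original-UA score, expressed as a rounded percentage (the table's values were presumably computed on unrounded scores):

```python
# Preservation ratio as used in the table: squeezed-EN score over
# original-UA score, as a rounded whole percent.
def preservation(ua_score: float, en_score: float) -> int:
    return round(100 * en_score / ua_score)

gpt4o_mini = preservation(4.61, 4.60)  # GPT-4o-mini row from the table
```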
### HF Inference API (small models)
| Model | UA score | Squeezed EN | Delta |
|---|---|---|---|
| Qwen2.5-72B | 4.9/5 | 4.5/5 | -0.4 |
| Qwen2.5-7B | 4.4/5 | 3.6/5 | -0.8 |
| Llama-3.2-1B | 2.7/5 | 2.8/5 | +0.1 |
For small models (<7B), use `brew()` with native Ukrainian — they understand UA better than squeezed EN.
## Architecture

```text
src/dormouse/
├── optimizer.py    — squeeze() main pipeline
├── rule_engine.py  — normalization (360 rules + pymorphy3)
├── compressor.py   — filler/noise removal
├── mapper.py       — UA→EN via lexicon + lemma + transliteration
├── seq2seq.py      — expression translator (GRU encoder-decoder)
├── teapot.py       — stir/mumble/sip/brew (search + LLM)
├── embedder.py     — sentence-transformers wrapper
├── middleware.py   — OpenAI/Anthropic SDK proxy
├── cli.py          — Click CLI
└── assets.py       — lazy download of models/data
```
## Development

```bash
git clone https://github.com/ChuprinaDaria/dormouse
cd dormouse
pip install -e ".[dev,morph]"
DORMOUSE_DATA_DIR=./data pytest tests/ -v
```
## License
MIT
Built by Daria Chuprina because she can 👾.