Skip to main content

Оптимізація українських текстів для LLM: менше токенів, краще розуміння

Project description

dormouse

PyPI Python License CI HuggingFace

Ukrainian text optimizer for LLMs — fewer tokens, better comprehension.

Normalizes surzhyk, slang, fillers, and maps to English for cloud LLMs. Saves 60-73% tokens while improving response quality.

UA: Оптимізація українських текстів для LLM. Нормалізує суржик, сленг, мат — і стискає в англійську для Claude/GPT. Економія 60-73% токенів, якість відповідей зростає зі 67% до 100%.

Results

Tested on 53,351 texts (Telegram corpus + books), 12 IT prompts across 4 GPT models:

Metric Value
Token savings (cloud) 73%
Token savings (without seq2seq) 49%
Lexicon coverage 88%
Seq2seq exact match 98.2%
GPT response quality (original UA) 67%
GPT response quality (squeezed EN) 100%
Quality preservation 150% (squeezed > original)
Original UA:  "блін продакшн впав після деплою, що робити першим"
Squeezed EN:  "damn production crashed after deploy, what do first"
Tokens:       45 → 12 (-73%)
GPT accuracy: 67% → 100%

How it works

graph LR
    A[UA text<br/>surzhyk, slang] --> B[crack_open<br/>normalize]
    B --> C[compress<br/>remove fillers]
    C --> D[map_to_en<br/>lexicon + seq2seq]
    D --> E[EN compressed<br/>for LLM]

    style A fill:#fdd,stroke:#c33
    style E fill:#dfd,stroke:#3a3
Layer What it does How
crack_open surzhyk, slang, profanity → standard UA 360 rules + pymorphy3 lemmatization
compress remove fillers, intensifiers, noise rule-based pattern matching
map_to_en UA → compact English 47K lexicon + seq2seq (28K expression pairs)

Install

pip install dormouse-ua

# With morphological analysis (recommended)
pip install dormouse-ua[morph]

# Everything
pip install dormouse-ua[all]

Quick start

from dormouse import squeeze

# Normalize only (layers 1+2)
squeeze("шо там по баґу, пофікси плз")
# → "що там по помилці, виправ"

# Cloud mode — compress for Claude/GPT (layers 1+2+3)
squeeze("ваще нормально, канєшно зробимо", target="cloud")
# → "generally ok, sure do"

SDK Middleware (drop-in)

from openai import OpenAI
from dormouse import DormouseClient

client = DormouseClient(OpenAI())  # or Anthropic()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "шо там по деплою, він ваще не робе"}],
)
# Prompt: squeeze → EN → GPT → unsqueeze → Ukrainian response

Semantic search

from dormouse import stir, mumble, sip

stir("report.pdf")                                    # index
results = mumble("холодні закуски")                   # search by meaning
topics = sip("data.xlsx", topics=["HR", "finance"])   # classify

CLI

dormouse squeeze "шо там по баґу" -t cloud
dormouse stir book.pdf
dormouse mumble "головний герой"

Comparison with alternatives

Tool Ukrainian Token savings Approach Quality impact
dormouse native 73% normalize + compress + translate +50% quality
LLMLingua no up to 20x ML perplexity pruning (GPT-2/LLaMA) -5-15%
Selective Context no 40-50% self-information filtering -10-20%
token-reducer no 50-75% 6-stage pipeline, AST for code neutral
shrink-prompt no 30-70% domain-specific rules (<20ms) neutral
Google Translate → EN partial 30-40% full translation variable

Why dormouse is different:

The problem: Ukrainian Cyrillic costs 3-4x more tokens than equivalent English text in GPT-4/Claude.

All existing tools (LLMLingua, Selective Context, token-reducer) compress already English text by removing information. dormouse solves the problem one level earlier — transforms expensive Ukrainian (3-4 tokens/word) into cheap English (1-1.5 tokens/word) while preserving all meaning.

No other tool specifically optimizes Ukrainian for LLMs.

Use cases

Cost reduction — Ukrainian Cyrillic encodes into 2-4x more tokens than equivalent English. dormouse saves 60-73% on input tokens.

Chatbots & support — Users write in surzhyk/slang, dormouse normalizes before LLM, GPT gives concrete answers instead of generic responses.

RAG & document search — User searches in slang, documents are in literary language. dormouse normalizes both sides → finds by meaning.

AI agents — Long chains of actions eat context window. 73% compression = 73% more "memory" for the agent.

Batch processing — 10K comments through GPT for sentiment analysis. Squeeze first → cheaper and faster.

Local search & classification (no API needed)stir/mumble/sip work fully offline. Index PDF/Excel/TXT, search by meaning, classify by topics — all on CPU with local embeddings (MiniLM-L12-v2). No cloud, no keys, no cost.

Eval details

Full evaluation ran for 4 days on 53,351 texts:

Corpus: 53,351 texts (Telegram + books)
Squeeze speed: 606 texts/sec (normalization)
Seq2seq model: 7.3M params, 28K expression pairs
Stir/mumble: 8,441 chunks indexed, search ~600ms
Sip classification: 99% texts classified (8 topics)

Quality preservation (100 real prompts, automated scoring 1-5)

Model UA score Squeezed EN Preservation
GPT-4.1 4.79 4.86 102%
GPT-4.1-mini 4.71 4.68 99%
GPT-4o-mini 4.61 4.60 100%
GPT-4.1-nano 4.58 4.56 100%
GPT-5.5 4.00 4.00 100%
Gemini 2.0 Flash 4.11 4.10 100%

Squeeze preserves 99-102% quality across all tested models. GPT-4.1 actually performs better on squeezed text.

Note on GPT-5.5 scores: GPT-5.5 shows lower absolute scores (4.0 vs 4.79 for GPT-4.1) — this is an artifact of our heuristic judge (length + structure based). GPT-5.5 produces shorter, more precise answers that score lower on this metric. A proper LLM-judge eval would likely show higher scores. Preservation ratio (100%) is the meaningful metric here.

HF Inference API (small models)

Model UA score Squeezed EN Delta
Qwen2.5-72B 4.9/5 4.5/5 -0.4
Qwen2.5-7B 4.4/5 3.6/5 -0.8
Llama-3.2-1B 2.7/5 2.8/5 +0.1

For small models (<7B), use brew() with native Ukrainian — they understand UA better than squeezed EN.

Architecture

src/dormouse/
├── optimizer.py       — squeeze() main pipeline
├── rule_engine.py     — normalization (360 rules + pymorphy3)
├── compressor.py      — filler/noise removal
├── mapper.py          — UA→EN via lexicon + lemma + transliteration
├── seq2seq.py         — expression translator (GRU encoder-decoder)
├── teapot.py          — stir/mumble/sip/brew (search + LLM)
├── embedder.py        — sentence-transformers wrapper
├── middleware.py      — OpenAI/Anthropic SDK proxy
├── cli.py             — Click CLI
└── assets.py          — lazy download of models/data

Development

git clone https://github.com/ChuprinaDaria/dormouse
cd dormouse
pip install -e ".[dev,morph]"
DORMOUSE_DATA_DIR=./data pytest tests/ -v

License

MIT


Built by Daria Chuprina because she can 👾.

Lazysoft | LinkedIn | dchuprina@lazysoft.pl

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dormouse_ua-0.3.4.tar.gz (82.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dormouse_ua-0.3.4-py3-none-any.whl (65.2 kB view details)

Uploaded Python 3

File details

Details for the file dormouse_ua-0.3.4.tar.gz.

File metadata

  • Download URL: dormouse_ua-0.3.4.tar.gz
  • Upload date:
  • Size: 82.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for dormouse_ua-0.3.4.tar.gz
Algorithm Hash digest
SHA256 4eb26a57ebea76822b63653b86cf38eabbd1c88c2e0e65f3e56aae039a640627
MD5 5a5ccd4e5fdcf4eaa03f49c211e847df
BLAKE2b-256 f02b2adc98ad868c5999611ab8d9bcb80a123e245aa6ab987179d17ce5d45327

See more details on using hashes here.

File details

Details for the file dormouse_ua-0.3.4-py3-none-any.whl.

File metadata

  • Download URL: dormouse_ua-0.3.4-py3-none-any.whl
  • Upload date:
  • Size: 65.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for dormouse_ua-0.3.4-py3-none-any.whl
Algorithm Hash digest
SHA256 dd29ecb905286b395bb64ab1d21b641b88c18ceddb4adc41714f1a0fada4b3a5
MD5 7b8be9594232eacc9a415d1feb8949f8
BLAKE2b-256 c55b692013fbca1cae1471d62a8d78a49c4884468cee41437b7f4ad16bfe3c93

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page