Russian text normalization for TTS: numbers, dates, currency, units, fractions, case agreement, plus an uncertainty router - one file, regex only, no dependencies
rutextnorm — Russian text normalization for TTS
Turn written Russian into something a TTS model can say: numbers, dates, money, units, fractions, times, abbreviations, symbols, and mixed Latin/Cyrillic — all spelled out, in agreement, in words.
"В 2024 году инфляция составила 7,5%, а доходы выросли на 3 млрд руб."
↓ normalize_russian()
"В две тысячи двадцать четвёртом году инфляция составила семь целых
и пять десятых процента, а доходы выросли на три миллиарда рублей"
- One file, zero dependencies, no network, no ML. Pure `re` + lookup tables. Deterministic: same input → same output. ~0.17 ms/sentence (~375k chars/s).
- Knows when it might be wrong. `flag_uncertain()` returns the spans the rules can't resolve from the text, so you can route just those to a slower, stronger method (a neural normalizer or LLM) and trust the fast path everywhere else.
- Built for TTS, not for a benchmark. Where speakability and a corpus's written form disagree, it favours what the synthesizer should pronounce (see Design choices & gotchas).
Install
pip install rutextnorm
The PyPI name and the import name are the same:
from rutextnorm import normalize_russian
Or vendor the single file — copy rutextnorm.py straight into your project
(e.g. into a TTS repo's text/ folder). Nothing else is required. When vendored,
the import follows wherever you put it:
from text.rutextnorm import normalize_russian
Requires Python ≥ 3.8.
Quick start
from rutextnorm import normalize_russian
text = """У меня есть $1234 и 5678 рублей. Кроме того, я должен 90.50€ и взял в долг 4321 GBP.
В моём кошельке было 876 UAH и 543.21 RUB, а также я нашёл 20 центов."""
print(normalize_russian(text))
У меня есть тысяча двести тридцать четыре доллара и пять тысяч шестьсот семьдесят восемь рублей. Кроме того, я должен девяносто евро пятьдесят евроцентов и взял в долг четыре тысячи триста двадцать один фунт.
В моём кошельке было восемьсот семьдесят шесть гривен и пятьсот сорок три рубля двадцать одна копейка, а также я нашёл двадцать центов.
Command-line filter
echo "цена 1 500 руб." | python3 -m rutextnorm # installed
echo "цена 1 500 руб." | python3 rutextnorm.py # vendored
# -> цена тысяча пятьсот рублей
Use cases
- TTS front-end. Run text through `normalize_russian` before your G2P / acoustic model so the synthesizer never has to guess how to read `7,5%` or `$3 млрд`.
- Hybrid pipeline. Use `flag_uncertain` as a router: the rules handle the ~90% of text they're confident about instantly; only flagged spans go to an expensive neural normalizer (RUNorm) or an LLM. You pay for the slow path only where it actually helps.
- Corpus preprocessing. Normalize a training/eval corpus deterministically and reproducibly, with no model weights or API calls in the loop.
- Drop-in CLI filter in shell pipelines.
What it normalizes
| Category | Input → output |
|---|---|
| Cardinals (any size) | 1 234 567 → «один миллион двести тридцать четыре тысячи пятьсот шестьдесят семь» |
| Ordinals (suffix / Roman) | 1-й → «первый», XIX → «девятнадцатый» |
| Dates | 05.08.2008 → «пятое августа две тысячи восьмого года», 2008 г. → «две тысячи восьмой год» |
| Times | 06:06 → «шесть часов шесть минут», 1:15 → «час пятнадцать минут», 2PM → «два часа дня» |
| Money | 543.21 RUB → «пятьсот сорок три рубля двадцать одна копейка», $1 млрд → «один миллиард долларов» |
| Units (count agreement) | 5 кг → «пять килограммов», 90 км/ч → «девяносто километров в час», 7 км. → «семь километров» |
| Multipliers | 5 млн → «пять миллионов», 24,9 млрд руб. → «двадцать четыре целых и девять десятых миллиарда рублей» |
| Decimals & percent | 1,2 → «одна целая и две десятых», 50% → «пятьдесят процентов», 938,00 → «девятьсот тридцать восемь» |
| Fractions | 2/3 → «две третьих», 1/2 → «одна вторая», ½ → «одна вторая» |
| Context-governed case | около 500 км → «около пятисот километров», с 500 рублями → «с пятьюстами рублями» |
| Trigger nouns | 2 место → «второе место», 5 этаж → «пятый этаж» |
| Compound adjectives | 25-этажный → «двадцатипятиэтажный» |
| Abbreviations | и т.д. → «и так далее», ст. 158 → «статья сто пятьдесят восемь» |
| Acronyms | СССР → «эс эс эс эр» (vowel-less spelled out), НАТО kept as a word |
| Latin / mixed | Google → «гугл», GPS → «джи пи эс», example.com → «ексампле точка ком» |
| Symbols | & → «и», ² → «в квадрате», °C, №, Greek letters |
| ё restoration | еще → «ещё» (unambiguous words only) |
Vocabularies are embedded in the single file. The abbreviation and unit inventories are informed by NVIDIA NeMo-text-processing (Apache-2.0); only single-sense entries are kept and the spoken forms were rewritten and checked by hand.
Knowing when to defer: flag_uncertain
normalize_russian always returns its best guess. flag_uncertain(text) tells you
where that guess rests on information the text doesn't contain — so a caller can
escalate those spans (or the whole sentence) to a stronger method and trust the
rest.
from rutextnorm import flag_uncertain
text = "Доктор Smith открыл том XIV на с. 42."
for start, end, original, reason in flag_uncertain(text):
    print(f"{original!r:14}{reason}")
'Smith'       foreign word (transliteration is approximate)
'XIV'         Roman numeral (case defaults to nominative)
'с.'          ambiguous abbreviation (секунда / страница / село / с (предлог))
It reads the input only (never a reference), runs in ~0.02 ms/sentence, and detects five structural ambiguities:
| Detector | Why it's uncertain |
|---|---|
| Foreign words | transliteration is approximate; exact pronunciation needs G2P |
| Multi-sense abbreviations (г. в. с. кв. …) | several expansions; only context disambiguates |
| Roman numerals | case is context-dependent; read in the nominative by default |
| Four-digit year-or-cardinal | 1998 could be a year or a count |
| Bare numbers with no cue | grammatical case / cardinal-vs-ordinal undetermined |
Each span carries a reason, so a cost-sensitive caller can ignore the reason
types it doesn't care about (e.g. trust foreign-word transliteration and drop those
flags). A minimal router:
from rutextnorm import flag_uncertain, normalize_russian

def normalize_or_escalate(text, escalate):
    spans = flag_uncertain(text)
    if spans:
        return escalate(text)        # neural model / LLM
    return normalize_russian(text)   # fast path
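The reason-based filtering described above could be sketched roughly as follows. This is an illustration, not part of rutextnorm: `drop_ignored`, `route`, and the idea of matching reasons by prefix are assumptions, the reason strings are copied from the sample output above, and the `normalize` and `escalate` callables are stand-ins for `normalize_russian` and your own slow path.

```python
from typing import Callable, List, Sequence, Tuple

# (start, end, original, reason), the tuple shape flag_uncertain documents.
Span = Tuple[int, int, str, str]


def drop_ignored(spans: Sequence[Span], ignored_prefixes: Sequence[str]) -> List[Span]:
    """Keep only spans whose reason does not start with an ignored prefix."""
    ignored = tuple(ignored_prefixes)
    return [span for span in spans if not span[3].startswith(ignored)]


def route(text: str, spans: Sequence[Span],
          normalize: Callable[[str], str], escalate: Callable[[str], str],
          ignored_prefixes: Sequence[str] = ("foreign word",)) -> str:
    """Escalate only when a span with a reason the caller cares about remains."""
    if drop_ignored(spans, ignored_prefixes):
        return escalate(text)
    return normalize(text)


# Hypothetical spans, shaped like the sample output above.
text = "Доктор Smith открыл том XIV на с. 42."
spans = [
    (7, 12, "Smith", "foreign word (transliteration is approximate)"),
    (24, 27, "XIV", "Roman numeral (case defaults to nominative)"),
]
kept = drop_ignored(spans, ["foreign word"])
print([original for _, _, original, _ in kept])  # ['XIV']
```

Matching on a prefix is fragile if the reason wording changes between releases, so a production router would probably pin the package version or compare against the exact strings it has seen.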
Metrics
Measured against ru_2026.csv — the Google/Kaggle Russian normalization gold
(ru_train.csv,
10.6M tokens) with its dataset artifacts removed (per-letter spelling markers,
sil tokens). Comparison is ё-insensitive (the module keeps ё, the gold drops it)
and space-folded (the gold space-separates transliterated foreign words, e.g.
т и б е р и у с, where the module writes the joined word).
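The two folding steps can be pictured with a small comparison key. This is only a guess at the idea: the actual harness lives on a dev branch and may differ, the names `comparison_key` and `same_reading` are hypothetical, and "space-folded" is assumed here to mean that all whitespace is ignored.

```python
def comparison_key(spoken: str) -> str:
    """Fold the two differences the comparison is described as ignoring."""
    folded = spoken.replace("ё", "е").replace("Ё", "Е")  # ё-insensitive
    return "".join(folded.split())                        # drop all whitespace


def same_reading(predicted: str, gold: str) -> bool:
    """Exact match after folding ё/е and whitespace."""
    return comparison_key(predicted) == comparison_key(gold)


print(same_reading("тибериус", "т и б е р и у с"))  # True
print(same_reading("ещё", "еще"))                    # True
print(same_reading("два", "две"))                    # False
```

Dropping every space also hides word-boundary differences in multi-word readings, which is acceptable for a regression guard but worth keeping in mind when reading the scores.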
acc = exact match; rej = fraction flag_uncertain escalates; trusted =
accuracy on the non-escalated part — the number a hybrid pipeline actually ships.
| Class | Share | acc | rej | trusted | Residual is… |
|---|---|---|---|---|---|
| PLAIN | 70% | 95.1% | 7% | 99.5% | foreign words (gold spells per-letter) |
| PUNCT | 21% | 100% | 0% | 100% | — |
| CARDINAL | 2.6% | 77.2% | 97% | 67.9% | oblique case of bare numbers (needs context) |
| DATE | 1.7% | 86.1% | 47% | 94.5% | year case; ambiguous day case |
| LETTERS | 1.8% | 23.6% | 28% | 32.7% | acronyms read as words, not bare letters (deliberate) |
| VERBATIM | 1.5% | 95.5% | 0% | 95.5% | symbol / Greek map |
| ORDINAL | 0.4% | 40.4% | 67% | 88.3% | bare-number ordinals (need context) |
| MEASURE | 0.4% | 59.5% | 12% | 63.4% | oblique case agreement |
| MONEY | <0.1% | 45.9% | 37% | 52.9% | case agreement; долларов США artifact |
| DECIMAL | <0.1% | 58.6% | 3% | 59.6% | oblique case agreement |
| FRACTION | <0.1% | 77.9% | 98% | 100% | context-dependent case |
| TIME | <0.1% | 87.6% | 5% | 90.5% | oblique case; HH:MM:SS kept by gold |
| Overall | 100% | 93.7% | 9.1% | 98.2% | |
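To make the three columns concrete, here is a minimal sketch of how acc, rej and trusted could be computed from per-token results. It is not the project's evaluation harness: the `Row` type, the `router_metrics` name and the toy data are all invented for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Row(NamedTuple):
    predicted: str
    gold: str
    flagged: bool  # True if the token would be escalated


def router_metrics(rows: Iterable[Row]) -> Dict[str, Optional[float]]:
    rows = list(rows)
    if not rows:
        raise ValueError("need at least one row")
    kept = [r for r in rows if not r.flagged]
    correct = sum(r.predicted == r.gold for r in rows)
    kept_correct = sum(r.predicted == r.gold for r in kept)
    return {
        "acc": correct / len(rows),                  # exact match overall
        "rej": (len(rows) - len(kept)) / len(rows),  # share escalated
        # Undefined when everything is escalated; reported as None here.
        "trusted": kept_correct / len(kept) if kept else None,
    }


# Made-up toy data, only to show the arithmetic.
toy = [
    Row("два", "два", False),
    Row("пять", "пяти", True),
    Row("три", "три", False),
    Row("год", "года", False),
]
print(router_metrics(toy))
```

On the toy data one of the two errors is flagged, so acc is 2/4, rej is 1/4 and trusted is 2/3.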
Reading the router story: escalating the 9.1% of tokens flag_uncertain
marks lifts the trusted accuracy from 93.7% to 98.2%, catching ~75% of all
errors. Measured per sentence (the router's real setting, with full context)
the figures are 93.8% / 97.9% trusted at 8.5% escalation.
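The "~75% of all errors" figure can be checked from the token-level Overall row. The arithmetic below assumes that trusted accuracy applies to exactly the non-escalated share of tokens; the variable names are illustrative.

```python
acc = 0.937      # overall exact-match accuracy
rej = 0.091      # share of tokens escalated
trusted = 0.982  # accuracy on the non-escalated tokens

all_errors = 1 - acc                      # errors as a share of all tokens
errors_left = (1 - rej) * (1 - trusted)   # errors that stay on the fast path
caught_share = (all_errors - errors_left) / all_errors

print(f"errors overall:         {all_errors:.1%}")
print(f"errors left unflagged:  {errors_left:.1%}")
print(f"share of errors caught: {caught_share:.0%}")
```

With the rounded table values this gives about 6.3% errors overall, about 1.6% left on the fast path, and roughly 74% of errors caught, consistent with the stated ~75%.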
The remaining error is dominated by two things rules can't fix without a token
classifier or sentence context — grammatical case of bare numbers and a few
deliberate divergences (next section) — both of which flag_uncertain is
designed to route away. The benchmark is a regression guard, not a target.
The evaluation harnesses (`eval_reject.py` token-level, `eval_reject_sent.py` sentence-level), the regression tests (`test_russian.py`), the extension eval set and the dataset-cleaning script live on dev branches (`ru-2.0-alpha`); this branch ships only the module. To reproduce: `python3 eval_reject.py ru_2026.csv`.
Design choices & gotchas
These are intentional. Where a corpus's written form and a synthesizer's spoken needs disagree, the module picks speech.
- Feed it whole sentences, not pre-split tokens. The context rules (case after a preposition, `год` after a year, a unit after a number) only fire when the surrounding words are present. Normalizing isolated tokens silently disables them.
- ё is kept in the output (`нашёл`, `ещё`) because it carries pronunciation. If you diff against a corpus that writes only `е`, compare ё/е-insensitively.
- A bare number's case defaults to nominative. `5 километров`, not `пяти километрах`: the rules can't know the governing case without a cue in the text. `flag_uncertain` marks these; give context or route them.
- Dates read the day in the genitive and the year in the nominative by default (`13 сентября` → «тринадцатого сентября», `2008 г.` → «две тысячи восьмой год»). Both are the citation-form defaults; the actual case is context-dependent.
- Foreign words are transliterated as one word (`Google` → «гугл»), not spelled by English letter names. Good enough for most TTS; `flag_uncertain` flags them if you need exact G2P.
- Cyrillic acronyms use a vowel heuristic: vowel-less → letter-by-letter (`СССР` → «эс эс эс эр»), pronounceable → kept (`НАТО`). Exceptions like `США` (spelled out despite vowels) need a pronunciation lexicon and aren't bundled.
- Multi-sense abbreviations are left untouched (`кв.`, `г.`, `т.` standing alone) because they have several expansions. `flag_uncertain` marks them.
- Phone/ISBN numbers are read as plain cardinals (not segmented), and `HH:MM:SS` times are expanded.
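The vowel heuristic for Cyrillic acronyms can be sketched in a few lines. This is not the module's code: `read_acronym` is a hypothetical name, and the letter-name table is a deliberately partial list of the usual Russian consonant names, not the module's embedded vocabulary.

```python
VOWELS = set("аеёиоуыэюя")

# Usual Russian letter names for consonants; deliberately partial.
LETTER_NAMES = {
    "б": "бэ", "в": "вэ", "г": "гэ", "д": "дэ", "ж": "жэ", "з": "зэ",
    "к": "ка", "л": "эль", "м": "эм", "н": "эн", "п": "пэ", "р": "эр",
    "с": "эс", "т": "тэ", "ф": "эф", "х": "ха", "ц": "цэ", "ч": "че",
    "ш": "ша", "щ": "ща",
}


def read_acronym(token: str) -> str:
    """Spell a vowel-less acronym letter by letter; keep anything else."""
    letters = token.lower()
    if any(ch in VOWELS for ch in letters):
        return token  # pronounceable: keep as a word
    if not all(ch in LETTER_NAMES for ch in letters):
        return token  # letter without a name in the table: leave untouched
    return " ".join(LETTER_NAMES[ch] for ch in letters)


print(read_acronym("СССР"))  # эс эс эс эр
print(read_acronym("НАТО"))  # НАТО
print(read_acronym("США"))   # США
```

The last call shows why `США` is listed as an exception: it contains a vowel, so a pure vowel test keeps it as a word even though speakers spell it out.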
Known limitations (need sentence context or a classifier — out of scope)
- Grammatical case agreement of a bare number (`500 км` → «пятисот километров»).
- Disambiguating a bare number as cardinal vs. ordinal vs. year.
- Telephone / ISBN segmentation and full URL G2P.
- Context-dependent abbreviations (`г.` → год/город, `кв.` → квартира/квартал).
- Acronyms read as letters despite vowels (`США`).
For these, the intended pattern is flag_uncertain → escalate to a neural
normalizer or LLM.
API
normalize_russian(text: str) -> str
Normalize a string (sentence, paragraph, or document). Idempotent on already-spoken text.
flag_uncertain(text: str) -> list[tuple[int, int, str, str]]
Return (start, end, original, reason) spans where the normalization is an
unverifiable guess. Empty list = high confidence in the whole string. Offsets index
the input.
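Because the offsets index the input, the spans can be used to mark up or extract the flagged text directly. The helper below is a hypothetical illustration (`mark_spans` is not part of the package) and assumes that offsets are half-open Python slice indices and that spans do not overlap; the spans are hand-written to match the earlier example sentence.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str, str]  # (start, end, original, reason)


def mark_spans(text: str, spans: Sequence[Span]) -> str:
    """Wrap each flagged span in [[...]], assuming spans do not overlap."""
    pieces: List[str] = []
    cursor = 0
    for start, end, _original, _reason in sorted(spans):
        pieces.append(text[cursor:start])            # untouched text before the span
        pieces.append("[[" + text[start:end] + "]]")
        cursor = end
    pieces.append(text[cursor:])                     # tail after the last span
    return "".join(pieces)


text = "Доктор Smith открыл том XIV на с. 42."
spans = [
    (24, 27, "XIV", "Roman numeral (case defaults to nominative)"),
    (7, 12, "Smith", "foreign word (transliteration is approximate)"),
]
print(mark_spans(text, spans))
# Доктор [[Smith]] открыл том [[XIV]] на с. 42.
```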
Contributing
Found a case it reads wrong? PRs and issues welcome — please include the input, the
current output, and the form a Russian TTS should say. Behavioural changes should
come with a regression test (test_russian.py on the ru-2.0-alpha branch).
If you improve the solution, please contribute the fix back here too.
License
MIT (see LICENSE). The embedded abbreviation/unit inventories are informed by NVIDIA NeMo-text-processing (Apache-2.0); spoken forms were rewritten by hand.