Russian text normalization for TTS: numbers, dates, currency, units, fractions, case agreement, plus an uncertainty router - one file, regex only, no dependencies

rutextnorm — Russian text normalization for TTS

Turn written Russian into something a TTS model can say: numbers, dates, money, units, fractions, times, abbreviations, symbols, and mixed Latin/Cyrillic — all spelled out, in agreement, in words.

"В 2024 году инфляция составила 7,5%, а доходы выросли на 3 млрд руб."
        ↓ normalize_russian()
"В две тысячи двадцать четвёртом году инфляция составила семь целых
 и пять десятых процента, а доходы выросли на три миллиарда рублей"

One file, zero dependencies, no network, no ML. Pure re + lookup tables. Deterministic: same input → same output. ~0.17 ms/sentence (~375k chars/s).
Knows when it might be wrong. flag_uncertain() returns the spans the rules can't resolve from the text, so you can route just those to a slower, stronger method (a neural normalizer or LLM) and trust the fast path everywhere else.
Built for TTS, not for a benchmark. Where speakability and a corpus's written form disagree, it favours what the synthesizer should pronounce (see Design choices & gotchas).
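
The throughput figure above is easy to re-check on your own text. The following is a minimal timing sketch, assuming only that the normalizer is a `str -> str` callable; `benchmark` and the stand-in `fake_normalize` are hypothetical names for illustration and are not part of rutextnorm.

```python
import time
from typing import Callable, Dict, List


def benchmark(normalize: Callable[[str], str], sentences: List[str], repeats: int = 5) -> Dict[str, float]:
    """Time `normalize` over `sentences`; report the best of `repeats` runs."""
    if not sentences:
        raise ValueError("need at least one sentence")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            normalize(sentence)
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)  # guard against a zero reading on coarse clocks
    total_chars = sum(len(s) for s in sentences)
    return {
        "ms_per_sentence": 1000.0 * best / len(sentences),
        "chars_per_second": total_chars / best,
    }


# Stand-in so the sketch runs without the package installed; swap in
# rutextnorm.normalize_russian to measure the real normalizer.
def fake_normalize(text: str) -> str:
    return text.replace("%", " процентов")


stats = benchmark(fake_normalize, ["Инфляция составила 7,5%."] * 1000)
print(stats)
```

As a sanity check on the quoted numbers: 0.17 ms/sentence at 375k chars/s corresponds to sentences of roughly 64 characters.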

Install

pip install rutextnorm

The PyPI name and the import name are the same:

from rutextnorm import normalize_russian

Or vendor the single file — copy rutextnorm.py straight into your project (e.g. into a TTS repo's text/ folder). Nothing else is required. When vendored, the import follows wherever you put it:

from text.rutextnorm import normalize_russian

Requires Python ≥ 3.8.

Quick start

from rutextnorm import normalize_russian

text = """У меня есть $1234 и 5678 рублей. Кроме того, я должен 90.50€ и взял в долг 4321 GBP.
В моём кошельке было 876 UAH и 543.21 RUB, а также я нашёл 20 центов."""

print(normalize_russian(text))

У меня есть тысяча двести тридцать четыре доллара и пять тысяч шестьсот семьдесят восемь рублей. Кроме того, я должен девяносто евро пятьдесят евроцентов и взял в долг четыре тысячи триста двадцать один фунт.
В моём кошельке было восемьсот семьдесят шесть гривен и пятьсот сорок три рубля двадцать одна копейка, а также я нашёл двадцать центов.

Command-line filter

echo "цена 1 500 руб." | python3 -m rutextnorm        # installed
echo "цена 1 500 руб." | python3 rutextnorm.py        # vendored
# -> цена тысяча пятьсот рублей

Use cases

TTS front-end. Run text through normalize_russian before your G2P / acoustic model so the synthesizer never has to guess how to read 7,5% or $3 млрд.
Hybrid pipeline. Use flag_uncertain as a router: the rules handle the ~90% of text they're confident about instantly; only flagged spans go to an expensive neural normalizer (RUNorm) or an LLM. You pay for the slow path only where it actually helps.
Corpus preprocessing. Normalize a training/eval corpus deterministically and reproducibly, with no model weights or API calls in the loop.
Drop-in CLI filter in shell pipelines.
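
For the corpus-preprocessing case, a line-by-line streaming loop is usually enough. The sketch below assumes one sentence per line and takes the normalizer as a parameter; `normalize_lines` and `fake_normalize` are hypothetical helpers, not part of the package.

```python
import io
from typing import Callable, Iterable, Iterator


def normalize_lines(lines: Iterable[str], normalize: Callable[[str], str]) -> Iterator[str]:
    """Yield each line normalized; blank lines pass through untouched."""
    for line in lines:
        text = line.rstrip("\n")
        yield normalize(text) if text.strip() else text


# Hypothetical stand-in; replace with rutextnorm.normalize_russian.
def fake_normalize(text: str) -> str:
    return text.replace("&", "и")


corpus = io.StringIO("чай & кофе\n\nхлеб & соль\n")
result = list(normalize_lines(corpus, fake_normalize))
print(result)
```

Because it is a generator, the same function works on an open file handle without loading the corpus into memory.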

What it normalizes

Category	Input → output
Cardinals (any size)	`1 234 567` → «один миллион двести тридцать четыре тысячи пятьсот шестьдесят семь»
Ordinals (suffix / Roman)	`1-й` → «первый», `XIX` → «девятнадцатый»
Dates	`05.08.2008` → «пятое августа две тысячи восьмого года», `2008 г.` → «две тысячи восьмой год»
Times	`06:06` → «шесть часов шесть минут», `1:15` → «час пятнадцать минут», `2PM` → «два часа дня»
Money	`543.21 RUB` → «пятьсот сорок три рубля двадцать одна копейка», `$1 млрд` → «один миллиард долларов»
Units (count agreement)	`5 кг` → «пять килограммов», `90 км/ч` → «девяносто километров в час», `7 км.` → «семь километров»
Multipliers	`5 млн` → «пять миллионов», `24,9 млрд руб.` → «двадцать четыре целых и девять десятых миллиарда рублей»
Decimals & percent	`1,2` → «одна целая и две десятых», `50%` → «пятьдесят процентов», `938,00` → «девятьсот тридцать восемь»
Fractions	`2/3` → «две третьих», `1/2` → «одна вторая», `½` → «одна вторая»
Context-governed case	`около 500 км` → «около пятисот километров», `с 500 рублями` → «с пятьюстами рублями»
Trigger nouns	`2 место` → «второе место», `5 этаж` → «пятый этаж»
Compound adjectives	`25-этажный` → «двадцатипятиэтажный»
Abbreviations	`и т.д.` → «и так далее», `ст. 158` → «статья сто пятьдесят восемь»
Acronyms	`СССР` → «эс эс эс эр» (vowel-less spelled out), `НАТО` kept as a word
Latin / mixed	`Google` → «гугл», `GPS` → «джи пи эс», `example.com` → «ексампле точка ком»
Symbols	`&` → «и», `²` → «в квадрате», `°C`, `№`, Greek letters
ё restoration	`еще` → «ещё» (unambiguous words only)

Vocabularies are embedded in the single file. The abbreviation and unit inventories are informed by NVIDIA NeMo-text-processing (Apache-2.0); only single-sense entries are kept and the spoken forms were rewritten and checked by hand.
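
The acronym row in the table describes a vowel heuristic. A rough sketch of that idea follows; the letter-name table and the function `read_acronym` are assumptions made for illustration, and the module's own tables and rules may differ.

```python
# Hypothetical letter-name table for Cyrillic consonants (an assumption of
# this sketch; not copied from rutextnorm).
LETTER_NAMES = {
    "б": "бэ", "в": "вэ", "г": "гэ", "д": "дэ", "ж": "жэ", "з": "зэ",
    "к": "ка", "л": "эл", "м": "эм", "н": "эн", "п": "пэ", "р": "эр",
    "с": "эс", "т": "тэ", "ф": "эф", "х": "ха", "ц": "цэ", "ч": "че",
    "ш": "ша", "щ": "ща",
}
VOWELS = set("аеёиоуыэюя")


def read_acronym(token: str) -> str:
    """Spell a vowel-less all-caps Cyrillic token letter by letter; keep the rest."""
    lowered = token.lower()
    is_acronym = token.isupper() and len(token) > 1
    if not is_acronym or any(ch in VOWELS for ch in lowered):
        return token  # pronounceable, or not an acronym: leave as a word
    if not all(ch in LETTER_NAMES for ch in lowered):
        return token  # letters outside the table: do not guess
    return " ".join(LETTER_NAMES[ch] for ch in lowered)


print(read_acronym("СССР"))  # эс эс эс эр
print(read_acronym("НАТО"))  # НАТО
print(read_acronym("США"))   # США
```

The last call shows the weak spot of any vowel-based rule: США contains a vowel, so the heuristic keeps it as a word even though it is normally spelled out.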

Knowing when to defer: `flag_uncertain`

normalize_russian always returns its best guess. flag_uncertain(text) tells you where that guess rests on information the text doesn't contain — so a caller can escalate those spans (or the whole sentence) to a stronger method and trust the rest.

from rutextnorm import flag_uncertain

text = "Доктор Smith открыл том XIV на с. 42."
for start, end, original, reason in flag_uncertain(text):
    print(f"{original!r:14}{reason}")

'Smith'       foreign word (transliteration is approximate)
'XIV'         Roman numeral (case defaults to nominative)
'с.'          ambiguous abbreviation (секунда / страница / село / с (предлог))

It reads the input only (never a reference), runs in ~0.02 ms/sentence, and detects five structural ambiguities:

Detector	Why it's uncertain
Foreign words	transliteration is approximate; exact pronunciation needs G2P
Multi-sense abbreviations (`г.` `в.` `с.` `кв.` …)	several expansions; only context disambiguates
Roman numerals	case is context-dependent; read in the nominative by default
Four-digit year-or-cardinal	`1998` could be a year or a count
Bare numbers with no cue	grammatical case / cardinal-vs-ordinal undetermined

Each span carries a reason, so a cost-sensitive caller can ignore the reason types it doesn't care about (e.g. trust foreign-word transliteration and drop those flags). A minimal router:

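from rutextnorm import flag_uncertain, normalize_russian

def normalize_or_escalate(text, escalate):
    """Escalate the whole text if any span is flagged; otherwise use the rules."""
    spans = flag_uncertain(text)
    if spans:
        return escalate(text)          # neural model / LLM
    return normalize_russian(text)     # fast path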
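
A cost-sensitive variant can drop the reason types it is willing to trust before deciding. The sketch below is one possible shape, not the package's API: `route`, `fake_flag`, `fast` and `slow` are hypothetical, and matching reasons by string prefix assumes the reason texts look like the ones printed in the example above.

```python
from typing import Callable, Iterable, List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, original, reason)


def route(
    text: str,
    normalize: Callable[[str], str],
    flag: Callable[[str], List[Span]],
    escalate: Callable[[str], str],
    ignored_prefixes: Iterable[str] = ("foreign word",),
) -> str:
    """Escalate only when a flagged span has a reason the caller cares about."""
    ignored = tuple(ignored_prefixes)
    relevant = [span for span in flag(text) if not span[3].startswith(ignored)]
    return escalate(text) if relevant else normalize(text)


# Stand-ins so the sketch runs on its own; in real use pass
# rutextnorm.normalize_russian and rutextnorm.flag_uncertain.
def fake_flag(text: str) -> List[Span]:
    known = (
        ("Smith", "foreign word (transliteration is approximate)"),
        ("XIV", "Roman numeral (case defaults to nominative)"),
    )
    spans = []
    for word, reason in known:
        start = text.find(word)
        if start != -1:
            spans.append((start, start + len(word), word, reason))
    return spans


def fast(text: str) -> str:
    return "fast:" + text


def slow(text: str) -> str:
    return "slow:" + text


print(route("Доктор Smith дома", fast, fake_flag, slow))  # only an ignored reason
print(route("том XIV", fast, fake_flag, slow))            # a reason that matters
```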

Metrics

Measured against ru_2026.csv — the Google/Kaggle Russian normalization gold (ru_train.csv, 10.6M tokens) with its dataset artifacts removed (per-letter spelling markers, sil tokens). Comparison is ё-insensitive (the module keeps ё, the gold drops it) and space-folded (the gold space-separates transliterated foreign words, e.g. т и б е р и у с, where the module writes the joined word).

acc = exact match; rej = fraction flag_uncertain escalates; trusted = accuracy on the non-escalated part — the number a hybrid pipeline actually ships.

Class	Share	acc	rej	trusted	Residual is…
PLAIN	70%	95.1%	7%	99.5%	foreign words (gold spells per-letter)
PUNCT	21%	100%	0%	100%	—
CARDINAL	2.6%	77.2%	97%	67.9%	oblique case of bare numbers (needs context)
DATE	1.7%	86.1%	47%	94.5%	year case; ambiguous day case
LETTERS	1.8%	23.6%	28%	32.7%	acronyms read as words, not bare letters (deliberate)
VERBATIM	1.5%	95.5%	0%	95.5%	symbol / Greek map
ORDINAL	0.4%	40.4%	67%	88.3%	bare-number ordinals (need context)
MEASURE	0.4%	59.5%	12%	63.4%	oblique case agreement
MONEY	<0.1%	45.9%	37%	52.9%	case agreement; `долларов США` artifact
DECIMAL	<0.1%	58.6%	3%	59.6%	oblique case agreement
FRACTION	<0.1%	77.9%	98%	100%	context-dependent case
TIME	<0.1%	87.6%	5%	90.5%	oblique case; `HH:MM:SS` kept by gold
Overall	100%	93.7%	9.1%	98.2%
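
The acc, rej and trusted columns can be computed from per-token records of the form (correct, escalated). The function below is a sketch of those definitions only; it is not the eval_reject.py harness, and the toy data is made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def router_metrics(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """records: (correct, escalated) per token. Returns acc, rej, trusted."""
    rows = list(records)
    if not rows:
        raise ValueError("no records")
    kept = [correct for correct, escalated in rows if not escalated]
    return {
        "acc": sum(correct for correct, _ in rows) / len(rows),
        "rej": (len(rows) - len(kept)) / len(rows),
        # If every token is escalated there is no trusted part to score.
        "trusted": sum(kept) / len(kept) if kept else float("nan"),
    }


# Toy data, not from the benchmark: 10 tokens, 2 wrong, 1 of the wrong ones flagged.
toy = [(True, False)] * 7 + [(True, True), (False, True), (False, False)]
metrics = router_metrics(toy)
print(metrics)
```

On the toy data this gives acc 0.8, rej 0.2 and trusted 0.875: one of the two errors was escalated, so the kept part has 7 correct tokens out of 8.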

Reading the router story: escalating the 9.1% of tokens that flag_uncertain marks lifts accuracy from 93.7% (everything on the fast path) to 98.2% on the trusted, non-escalated part, catching ~75% of all errors. Measured per sentence (the router's real setting, with full context), the figures are 93.8% overall and 97.9% trusted at 8.5% escalation.
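
The ~75% figure follows from the three overall numbers in the table. A short worked check, using only those values:

```python
overall_acc = 0.937   # accuracy with nothing escalated
rej = 0.091           # fraction of tokens escalated
trusted_acc = 0.982   # accuracy on the non-escalated part

errors_before = 1.0 - overall_acc                    # wrong tokens, as a share of all tokens
errors_shipped = (1.0 - rej) * (1.0 - trusted_acc)   # wrong tokens that stay on the fast path
caught = 1.0 - errors_shipped / errors_before        # share of errors that get escalated

print(f"errors before routing:  {errors_before:.3%}")
print(f"errors still shipped:   {errors_shipped:.3%}")
print(f"share of errors caught: {caught:.0%}")
```

This comes out at about 74%, in line with the rounded ~75%. "Caught" here means routed to the slow path; whether the slow path then fixes them depends on the escalation method.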

The remaining error is dominated by two things rules can't fix without a token classifier or sentence context — grammatical case of bare numbers and a few deliberate divergences (next section) — both of which flag_uncertain is designed to route away. The benchmark is a regression guard, not a target.

The evaluation harnesses (eval_reject.py token-level, eval_reject_sent.py sentence-level), the regression tests (test_russian.py), the extension eval set and the dataset-cleaning script live on dev branches (ru-2.0-alpha); this branch ships only the module. To reproduce: python3 eval_reject.py ru_2026.csv.

Design choices & gotchas

These are intentional. Where a corpus's written form and a synthesizer's spoken needs disagree, the module picks speech.

Feed it whole sentences, not pre-split tokens. The context rules (case after a preposition, год after a year, a unit after a number) only fire when the surrounding words are present. Normalizing isolated tokens silently disables them.
ё is kept in the output (нашёл, ещё) — it carries pronunciation. If you diff against a corpus that writes only е, compare ё/е-insensitively.
A bare number's case defaults to nominative. 5 километров, not пяти километрах — the rules can't know the governing case without a cue in the text. flag_uncertain marks these; give context or route them.
Dates read the day in the genitive and the year in the nominative by default (13 сентября → «тринадцатого сентября», 2008 г. → «две тысячи восьмой год»). Both are the citation-form defaults; the actual case is context-dependent.
Foreign words are transliterated as one word (Google → «гугл»), not spelled by English letter names. Good enough for most TTS; flag_uncertain flags them if you need exact G2P.
Cyrillic acronyms use a vowel heuristic: vowel-less → letter-by-letter (СССР → «эс эс эс эр»), pronounceable → kept (НАТО). Exceptions like США (spelled out despite vowels) need a pronunciation lexicon and aren't bundled.
Multi-sense abbreviations are left untouched (кв., г., т. standing alone) — they have several expansions. flag_uncertain marks them.
Phone/ISBN numbers are read as plain cardinals (not segmented), and HH:MM:SS times are expanded.
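
For the ё bullet above, a comparison helper can fold the differences away before diffing against a corpus. This is a sketch under two assumptions: ё is mapped to е, and "space-folded" (as in the Metrics section) is taken to mean dropping all whitespace. `fold` and `same_reading` are hypothetical names.

```python
def fold(text: str) -> str:
    """Map ё to е and drop all whitespace, so only the letters are compared."""
    folded = text.replace("ё", "е").replace("Ё", "Е")
    return "".join(folded.split())


def same_reading(ours: str, gold: str) -> bool:
    """True if two spoken forms match up to ё/е and spacing."""
    return fold(ours) == fold(gold)


print(same_reading("нашёл", "нашел"))               # True
print(same_reading("тибериус", "т и б е р и у с"))  # True
print(same_reading("пять", "пятый"))                # False
```

Dropping all whitespace is deliberately loose: it also equates forms that differ only in word boundaries, which is acceptable for a regression guard but not for a strict diff.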

Known limitations (need sentence context or a classifier — out of scope)

Grammatical case agreement of a bare number (500 км → «пятисот километров»).
Disambiguating a bare number as cardinal vs. ordinal vs. year.
Telephone / ISBN segmentation and full URL G2P.
Context-dependent abbreviations (г. → год/город, кв. → квартира/квартал).
Acronyms read as letters despite vowels (США).

For these, the intended pattern is flag_uncertain → escalate to a neural normalizer or LLM.

API

normalize_russian(text: str) -> str

Normalize a string (sentence, paragraph, or document). Idempotent on already-spoken text.

flag_uncertain(text: str) -> list[tuple[int, int, str, str]]

Return (start, end, original, reason) spans where the normalization is an unverifiable guess. Empty list = high confidence in the whole string. Offsets index the input.
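
Because the offsets index the input, the spans can be used to mark up or cut out the uncertain pieces of the original string. The sketch below assumes `end` is exclusive, as in Python slicing (the check against `original` would fail loudly if that assumption were wrong); `mark_spans` is a hypothetical helper and the spans are written by hand to mirror the earlier example.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, original, reason)


def mark_spans(text: str, spans: List[Span], open_mark: str = "[", close_mark: str = "]") -> str:
    """Wrap each flagged span of `text` in markers, using the input offsets."""
    out = []
    cursor = 0
    for start, end, original, _reason in sorted(spans):
        if start < cursor or text[start:end] != original:
            raise ValueError(f"bad or overlapping span: {(start, end, original)!r}")
        out.append(text[cursor:start])
        out.append(open_mark + original + close_mark)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Доктор Smith открыл том XIV на с. 42."
# Hand-written spans; in real use they would come from flag_uncertain(text).
spans = [
    (text.index("Smith"), text.index("Smith") + 5, "Smith", "foreign word"),
    (text.index("XIV"), text.index("XIV") + 3, "XIV", "Roman numeral"),
]
marked = mark_spans(text, spans)
print(marked)
# Доктор [Smith] открыл том [XIV] на с. 42.
```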

Contributing

Found a case it reads wrong? PRs and issues welcome — please include the input, the current output, and the form a Russian TTS should say. Behavioural changes should come with a regression test (test_russian.py on the ru-2.0-alpha branch).

If you improve the solution, please contribute the fix back here too.

License

MIT (see LICENSE). The embedded abbreviation/unit inventories are informed by NVIDIA NeMo-text-processing (Apache-2.0); spoken forms were rewritten by hand.

Release history

2.1.0 (this version): Jun 13, 2026
1.1.0: Jun 13, 2026
1.0.0: Jun 13, 2026

File details

Details for the file rutextnorm-2.1.0.tar.gz.

File metadata

Download URL: rutextnorm-2.1.0.tar.gz
Upload date: Jun 13, 2026
Size: 29.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for rutextnorm-2.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0a795603bde5e658929473b5f1ac0662d49a115fdb127b3312105df57bef8989`
MD5	`387fd1211a683bc6945fbcdd052988f7`
BLAKE2b-256	`2206b7ea1d16cde66eedbc61bac11ec7cd2d0b230cd179f3fdcea9a425c1368a`

File details

Details for the file rutextnorm-2.1.0-py3-none-any.whl.

File metadata

Download URL: rutextnorm-2.1.0-py3-none-any.whl
Upload date: Jun 13, 2026
Size: 28.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for rutextnorm-2.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3dd1b001981e4b3fab04c30386e58a807218a580c20d5fa9085b26ef8f70ba03`
MD5	`1deede163fc72d8ed8a10b9d91a633ce`
BLAKE2b-256	`a1a44d44cde792ff06f42e20ec29bbf790eb8048d0939c36d5c84cc47de52614`
