Russian text normalization for TTS: numbers, dates, currency, units, case agreement, plus an uncertainty router - one file, regex only, no dependencies

Russian text normalization for TTS

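Normalizes Russian text for speech synthesis: digits, dates, currency amounts, units and similar written forms are expanded into spoken words.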

Install: pip install rutextnorm, or just copy rutextnorm.py (a single self-contained file, no dependencies) into the text folder of your TTS system. It can also be used as a command-line filter: echo "цена 1 500 руб." | python3 rutextnorm.py.

The package name and the import name are the same (rutextnorm). When the module is vendored as a plain file, the import path follows wherever you put it (e.g. from text.rutextnorm import normalize_russian).

from rutextnorm import normalize_russian

complex_test_text = """У меня есть $1234 и 5678 рублей. Кроме того, я должен 90.50€ и взял в долг 4321 GBP.
В моем кошельке было 876 UAH и 543.21 RUB, а также я нашел 20 центов."""

normalized_text = normalize_russian(complex_test_text)
print(normalized_text)

Prints:

У меня есть тысяча двести тридцать четыре доллара и пять тысяч шестьсот семьдесят восемь рублей. Кроме того, я должен девяносто евро пятьдесят евроцентов и взял в долг четыре тысячи триста двадцать один фунт.
В моем кошельке было восемьсот семьдесят шесть гривен и пятьсот сорок три рубля двадцать одна копейка, а также я нашёл двадцать центов.

Knowing when to defer (`flag_uncertain`)

normalize_russian always returns its best guess. For production use you often want to know when that guess is unreliable — so you can route just those cases to a stronger (and slower) method such as an LLM or neural normalizer, and trust the fast rules everywhere else. flag_uncertain(text) returns the spans where the output rests on information the rules cannot recover:

from rutextnorm import normalize_russian, flag_uncertain

text = "В томе III книги Smithsonian на с. 42 есть 1998 интересных фактов."
spans = flag_uncertain(text)          # [(start, end, original, reason), ...]
if spans:
    # hand `text` (or just these spans) to the better method
    for start, end, original, reason in spans:
        print(f"{original!r}: {reason}")
# 'III': Roman numeral (case defaults to nominative)
# 'Smithsonian': foreign word (transliteration is approximate)
# 'с.': ambiguous abbreviation (секунда / страница / село / с (предлог))
# '1998': four-digit number (year or cardinal?)

It reads the input only (never the reference), runs in ~0.02 ms/sentence, and detects five structural ambiguities: foreign words, multi-sense abbreviations (г./в./с.…), Roman numerals, four-digit year-or-cardinal numbers, and context-free bare numbers. Measured on the Kaggle gold with sentence-level routing: escalating ~39% of sentences lifts the trusted fast-path from 91.5% to 98.5% token accuracy and routes 90% of all errors to the better method. Each span carries a reason, so a cost-sensitive caller can ignore reason types it does not care about (dropping the bare-number flags, for instance, cuts escalation to ~35% with almost the same trusted accuracy).
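The sentence-level routing described above might be wired up roughly as follows. This is a minimal sketch, not part of rutextnorm: `route`, `strong`, `ignored_prefixes` and the two stand-in callables are hypothetical names, and filtering by reason prefix assumes that reason strings start with a stable label, as they do in the example output (for instance "four-digit number"). The stand-ins exist only so the block runs without the package installed.

```python
import re
from typing import Callable, Iterable, List, Tuple

# (start, end, original, reason), the span shape shown in the example above.
Span = Tuple[int, int, str, str]


def route(
    text: str,
    normalize: Callable[[str], str],
    flag: Callable[[str], List[Span]],
    strong: Callable[[str], str],
    ignored_prefixes: Iterable[str] = (),
) -> Tuple[str, bool]:
    """Return (normalized_text, escalated) for one sentence.

    Assumption: a reason can be recognised by its leading label.
    """
    ignored = tuple(ignored_prefixes)
    # str.startswith(()) is False, so an empty ignore list keeps every span.
    spans = [s for s in flag(text) if not s[3].startswith(ignored)]
    if spans:
        return strong(text), True   # at least one flag left: escalate
    return normalize(text), False   # nothing flagged: trust the fast rules


# Stand-ins for demonstration only (NOT the real rutextnorm functions).
def demo_flag(text: str) -> List[Span]:
    # Flags every standalone four-digit number.
    return [
        (m.start(), m.end(), m.group(), "four-digit number (year or cardinal?)")
        for m in re.finditer(r"\b\d{4}\b", text)
    ]


def demo_fast(text: str) -> str:
    return "fast:" + text


def demo_strong(text: str) -> str:
    return "strong:" + text


print(route("В 1998 году", demo_fast, demo_flag, demo_strong))
print(route("Привет", demo_fast, demo_flag, demo_strong))
```

In real use, `normalize` and `flag` would be normalize_russian and flag_uncertain, and `strong` would wrap whatever slower method you escalate to. Passing `ignored_prefixes` corresponds to the cost-sensitive case where some reason types are not worth escalating.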

Implemented

Cyrillization of Latin-script words, such as "apple" -> "эппл".
Abbreviation expansion, such as "СССР" -> "эс эс эс эр".
Conversion of numbers of any size
Currency expansion
Phone number expansion
Dates: "1862 год", "12 февраля 2013", "05.08.2008" -> ordinal year/day reading
Ordinals with a suffix ("1-й" -> "первый") and Roman numerals ("XIX" -> "девятнадцатого")
Decimals: "1,2" -> "одна целая и две десятых"; percentages: "50%" -> "пятьдесят процентов"
Fractions: "2/3" -> "две третьих"
Clock times: "06:06" -> "шесть часов шесть минут"
Digit strings with a leading zero: "06" -> "ноль шесть"
Symbols / foreign letters by name: "&" -> "и", "²" -> "в квадрате", "°C", Greek
Space/NBSP-grouped thousands: "1 234 567" -> one number; negatives: "-5" -> "минус пять"
Quantity multipliers: "5 млн" -> "пять миллионов" (agrees with the number)
Units of measure: "5 кг" -> "пять килограммов", "90 км/ч" -> "...в час", "5 ГБ", "25°"
Textual abbreviations: "и т.д." -> "и так далее"
Acronyms: vowel-less spelled out ("СССР" -> "эс эс эс эр"), pronounceable kept as-is ("НАТО")
E-mail/URL spell-out: "example.com" -> "ексампле точка ком"
Dotted units ("82 т." -> "восемьдесят две тонны") and decimal counts ("1,5 км" -> "...километра")
Years with г./гг. and ranges: "2008 г." -> "две тысячи восьмой год", "1941—1945 гг.", "XIX–XX вв."
Context-governed case: "около 500 км" -> "около пятисот километров", "к 5" -> "к пяти", "с 500 рублями" -> "с пятьюстами рублями" (closed-class cardinal declension tables)
Ordinal trigger nouns: "2 место" -> "второе место", "5 этаж" -> "пятый этаж"
Compound number adjectives: "25-этажный" -> "двадцатипятиэтажный"
Times beyond HH:MM: "02:25:00", "2PM" -> "два часа дня"; scores: "3:1" -> "три один"
Versions/IP: "Python 3.11" -> "питон три точка одиннадцать", "192.168.1.1"
Structural refs before a number: "ст. 158" -> "статья сто пятьдесят восемь"; math: "2+2=4"
English word dictionary: "Google" -> "гугл"; Latin acronyms by English letter name: "GPS" -> "джи пи эс"
ё restoration (unambiguous words only): "еще" -> "ещё"; hashtags: "#новости" -> "хештег новости"
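To illustrate the "context-governed case" item, here is a minimal sketch of preposition-driven cardinal declension with a closed-class table. It is an illustration only, not the module's code: `CASE_BY_PREPOSITION`, `CARDINALS` and `decline_numbers` are hypothetical names, the table covers just the two numbers used in the examples above, and the preposition-to-case mapping is deliberately simplified (in real Russian "с" can also govern the genitive, which is presumably why the module also looks at the noun ending).

```python
import re

# Hypothetical miniature tables; the real tables live inside rutextnorm.py.
CASE_BY_PREPOSITION = {"около": "gen", "к": "dat", "с": "ins"}
CARDINALS = {
    5: {"nom": "пять", "gen": "пяти", "dat": "пяти", "ins": "пятью"},
    500: {"nom": "пятьсот", "gen": "пятисот", "dat": "пятистам", "ins": "пятьюстами"},
}

# Optional governing preposition, then a standalone run of digits.
_PATTERN = re.compile(r"(?<!\w)(?:(около|к|с)\s+)?(\d+)(?!\w)", re.IGNORECASE)


def decline_numbers(text: str) -> str:
    """Spell out known numbers in the case required by the preceding preposition."""

    def repl(m: "re.Match[str]") -> str:
        prep, digits = m.group(1), m.group(2)
        forms = CARDINALS.get(int(digits))
        if forms is None:
            return m.group(0)  # outside the toy table: leave untouched
        case = CASE_BY_PREPOSITION[prep.lower()] if prep else "nom"
        word = forms[case]
        return f"{prep} {word}" if prep else word

    return _PATTERN.sub(repl, text)


print(decline_numbers("около 500 км"))   # около пятисот км
print(decline_numbers("к 5"))            # к пяти
print(decline_numbers("с 500 рублями"))  # с пятьюстами рублями
```

The sketch leaves the unit ("км") untouched. Making the noun agree with the number, as the module does, would need a second table keyed by case and number.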

Notes:

The letter ё is kept in the output (it carries pronunciation for TTS).
Vocabularies are embedded in rutextnorm.py (single-file module). Abbreviations come from NVIDIA NeMo-text-processing (ru/whitelist.tsv, Apache-2.0); only single-sense entries are used.

Validation

Tested against the Google/Kaggle Russian text-normalization set (ru_train.csv, 10,574,516 tokens). Each token's input is normalized in isolation and compared to the gold output; "accuracy" is exact string match, compared ё/е-insensitively (the reference data writes only е, this script keeps ё). "Original" is the script before these changes.
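The metric described above (exact string match after folding ё to е) can be sketched as follows. This is not the actual eval_assess.py harness, which is not shown here; `fold_yo` and `token_accuracy` are hypothetical names used only for illustration.

```python
from typing import Iterable, Tuple

# Map ё/Ё to е/Е so a prediction that keeps ё still matches a gold string written with е.
_YO_FOLD = str.maketrans({"ё": "е", "Ё": "Е"})


def fold_yo(s: str) -> str:
    return s.translate(_YO_FOLD)


def token_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs that match exactly, ё/е-insensitively."""
    total = correct = 0
    for predicted, gold in pairs:
        total += 1
        correct += fold_yo(predicted) == fold_yo(gold)
    return correct / total if total else 0.0


pairs = [("ещё", "еще"), ("два", "три")]
print(token_accuracy(pairs))  # 0.5
```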

The evaluation harness (eval_assess.py, eval_extension.csv), regression tests and the dataset-cleaning script live on the ru-2.0-alpha branch; this branch ships only the module itself.

Domain (class)	Tokens	Original acc.	Current acc.	Notes
PLAIN	7,360,439	69.9%	92.5%	residual: Latin spelled per-letter in gold
PUNCT	2,288,640	100.0%	100.0%	passthrough
CARDINAL	272,442	51.2%	77.0%	residual: oblique case of bare numbers (no context in token)
LETTERS	189,528	0.8%	0.0%	not targeted (gold uses bare letters, worse for TTS)
DATE	185,961	0.0%	86.2%	residual: bare years, ambiguous day-case
VERBATIM	157,912	91.1%	95.7%	symbol / Greek map
ORDINAL	46,738	0.0%	40.6%	residual: bare-number ordinals (need context)
MEASURE	40,537	3.1%	50.8%	residual: oblique case agreement
TELEPHONE	10,088	0.3%	1.3%	not targeted (irregular ISBN grouping)
DECIMAL	7,299	6.1%	54.3%	residual: oblique case agreement
ELECTRONIC	5,832	2.6%	2.8%	not targeted (English G2P + markers)
MONEY	2,690	14.4%	34.0%	residual: case agreement, "долларов сэ ш а" artifact
FRACTION	2,460	0.0%	66.0%	residual: context-dependent case
DIGIT	2,012	0.0%	100.0%	leading-zero digit strings
TIME	1,949	0.0%	85.3%	residual: oblique case, timezone suffixes
Overall	10,574,570	73.0%	91.4%	exact-match token accuracy (incl. eval_extension rows)
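The Overall row is a token-weighted average of the per-class accuracies, not a simple mean over classes. As a rough consistency check, the sketch below recomputes it from the class rows of the table. The figures are copied from the table, `overall_accuracy` is a hypothetical helper name, and the eval_extension rows included in the published Overall line are not part of this calculation.

```python
# (class, tokens, current accuracy in %), copied from the table above.
ROWS = [
    ("PLAIN", 7_360_439, 92.5),
    ("PUNCT", 2_288_640, 100.0),
    ("CARDINAL", 272_442, 77.0),
    ("LETTERS", 189_528, 0.0),
    ("DATE", 185_961, 86.2),
    ("VERBATIM", 157_912, 95.7),
    ("ORDINAL", 46_738, 40.6),
    ("MEASURE", 40_537, 50.8),
    ("TELEPHONE", 10_088, 1.3),
    ("DECIMAL", 7_299, 54.3),
    ("ELECTRONIC", 5_832, 2.8),
    ("MONEY", 2_690, 34.0),
    ("FRACTION", 2_460, 66.0),
    ("DIGIT", 2_012, 100.0),
    ("TIME", 1_949, 85.3),
]


def overall_accuracy(rows) -> float:
    """Token-weighted mean accuracy in percent."""
    total_tokens = sum(tokens for _, tokens, _ in rows)
    weighted = sum(tokens * acc for _, tokens, acc in rows)
    return weighted / total_tokens


print(f"{overall_accuracy(ROWS):.1f}%")  # 91.4%
```

Because PLAIN and PUNCT together account for most tokens, the overall figure moves very little when small classes such as MONEY or TIME improve, so the per-class columns are the more informative ones for those domains.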

The remaining error is dominated by things rules cannot resolve without a token classifier or sentence context. Case agreement is now rule-handled when the context is inside the token (около 500 км -> около пятисот километров, governed by the preposition and the noun ending), but the gold set scores tokens in isolation, where a bare 500 км gives no case signal. The same applies to disambiguating a bare number as cardinal, ordinal or year. The rest comes from classes left untargeted on purpose (LETTERS, TELEPHONE, ELECTRONIC). The test set is treated as a regression guard, not a target: some choices (keeping ё, reading acronyms as words, nominative Roman numerals) favour TTS quality over this score.

Release history

2.1.0: Jun 13, 2026
1.1.0: Jun 13, 2026 (this version)
1.0.0: Jun 13, 2026

