Russian text normalization for TTS: numbers, dates, currency, units, case agreement - one file, regex only, no dependencies
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
Russian text normalization for TTS
Normalize Text in Russian.
Install with `pip install russian-tts-normalization`, or copy `russian.py`
(a single self-contained file with no dependencies) into the `text` folder of your
TTS system. It can also be used as a command-line filter:
`echo "цена 1 500 руб." | python3 russian.py`.
Note the name difference: the PyPI package is `russian-tts-normalization`,
but it installs a single top-level module named `russian`. The import name
matches the file name because the file is also designed to be vendored as
`russian.py` next to your TTS code, in which case the import becomes
`from text.russian import normalize_russian` or similar.
```python
from russian import normalize_russian

complex_test_text = """У меня есть $1234 и 5678 рублей. Кроме того, я должен 90.50€ и взял в долг 4321 GBP.
В моем кошельке было 876 UAH и 543.21 RUB, а также я нашел 20 центов."""
normalized_text = normalize_russian(complex_test_text)
print(normalized_text)
```
Prints:
У меня есть тысяча двести тридцать четыре доллара и пять тысяч шестьсот семьдесят восемь рублей. Кроме того, я должен девяносто евро пятьдесят евроцентов и взял в долг четыре тысячи триста двадцать один фунт.
В моем кошельке было восемьсот семьдесят шесть гривен и пятьсот сорок три рубля двадцать одна копейка, а также я нашёл двадцать центов.
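Because the import path depends on whether the module was installed from PyPI or vendored, calling code may want to try both locations. The following is a minimal sketch and not part of the package: `load_normalizer` is a hypothetical helper, and `text.russian` is only the example vendored path mentioned above, so adjust the candidates to your own layout.

```python
import importlib


def load_normalizer(candidates=("russian", "text.russian"),
                    attr="normalize_russian"):
    """Return the first importable `attr` found in `candidates`, or None.

    Hypothetical helper: the candidate module paths are assumptions and
    should point at wherever russian.py lives in your project.
    """
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Also covers ModuleNotFoundError, e.g. when no `text` package exists.
            continue
        func = getattr(module, attr, None)
        if callable(func):
            return func
    return None
```

Calling `load_normalizer()` returns the function if either location is importable and `None` otherwise, so the caller can decide whether a missing normalizer is fatal.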
Implemented
- Cyrillization of letters such as "apple" -> "эппл".
- Abbreviation expansion such as "СССР" -> "эс эс эс эр".
- Number conversion of any size
- Currency expansion
- Phone number expansion
- Dates: "1862 год", "12 февраля 2013", "05.08.2008" -> ordinal year/day reading
- Ordinals with a suffix ("1-й" -> "первый") and Roman numerals ("XIX" -> "девятнадцатого")
- Decimals: "1,2" -> "одна целая и две десятых"; percentages: "50%" -> "пятьдесят процентов"
- Fractions: "2/3" -> "две третьих"
- Clock times: "06:06" -> "шесть часов шесть минут"
- Digit strings with a leading zero: "06" -> "ноль шесть"
- Symbols / foreign letters by name: "&" -> "и", "²" -> "в квадрате", "°C", Greek
- Space/NBSP-grouped thousands: "1 234 567" -> one number; negatives: "-5" -> "минус пять"
- Quantity multipliers: "5 млн" -> "пять миллионов" (agrees with the number)
- Units of measure: "5 кг" -> "пять килограммов", "90 км/ч" -> "...в час", "5 ГБ", "25°"
- Textual abbreviations: "и т.д." -> "и так далее"
- Acronyms: vowel-less spelled out ("СССР" -> "эс эс эс эр"), pronounceable kept as-is ("НАТО")
- E-mail/URL spell-out: "example.com" -> "ексампле точка ком"
- Dotted units ("82 т." -> "восемьдесят две тонны") and decimal counts ("1,5 км" -> "...километра")
- Years with г./гг. and ranges: "2008 г." -> "две тысячи восьмой год", "1941—1945 гг.", "XIX–XX вв."
- Context-governed case: "около 500 км" -> "около пятисот километров", "к 5" -> "к пяти", "с 500 рублями" -> "с пятьюстами рублями" (closed-class cardinal declension tables)
- Ordinal trigger nouns: "2 место" -> "второе место", "5 этаж" -> "пятый этаж"
- Compound number adjectives: "25-этажный" -> "двадцатипятиэтажный"
- Times beyond HH:MM: "02:25:00", "2PM" -> "два часа дня"; scores: "3:1" -> "три один"
- Versions/IP: "Python 3.11" -> "питон три точка одиннадцать", "192.168.1.1"
- Structural refs before a number: "ст. 158" -> "статья сто пятьдесят восемь"; math: "2+2=4"
- English word dictionary: "Google" -> "гугл"; Latin acronyms by English letter name: "GPS" -> "джи пи эс"
- ё restoration (unambiguous words only): "еще" -> "ещё"; hashtags: "#новости" -> "хештег новости"
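Several items in the list above (currency, multipliers, units) rely on the noun agreeing with the number. As an illustration only, here is a sketch of the standard Russian plural-selection rule for a number in a nominative context; `plural_form` and the form tuples are hypothetical names, not taken from `russian.py`, and the oblique-case handling described above is out of scope here.

```python
def plural_form(n: int, forms: tuple) -> str:
    """Pick the noun form that agrees with the integer n (nominative context).

    forms = (after 1, after 2-4, after 5-20), e.g. ("рубль", "рубля", "рублей").
    """
    n = abs(n)
    # 11-14 always take the genitive plural, regardless of the last digit.
    if 11 <= n % 100 <= 14:
        return forms[2]
    last_digit = n % 10
    if last_digit == 1:
        return forms[0]
    if 2 <= last_digit <= 4:
        return forms[1]
    return forms[2]


RUBLE = ("рубль", "рубля", "рублей")
KOPECK = ("копейка", "копейки", "копеек")

for amount in (1, 543, 5678):
    print(amount, plural_form(amount, RUBLE))
```

The amounts 543 and 5678 are the ones from the usage example, where they are read as "рубля" and "рублей" respectively.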
Notes:
- The letter ё is kept in the output (it carries pronunciation for TTS).
- Vocabularies are embedded in `russian.py` (single-file module). Abbreviations come from NVIDIA NeMo-text-processing (`ru/whitelist.tsv`, Apache-2.0); only single-sense entries are used.
Validation
Tested against the Google/Kaggle Russian text-normalization set
(ru_train.csv, 10,574,516 tokens). Each token's input is normalized in
isolation and compared to the gold output; "accuracy" is exact string match,
compared ё/е-insensitively (the reference data writes only е, this script keeps ё).
"Original" is the script before these changes.
The evaluation harness (eval_assess.py, eval_extension.csv), regression
tests and the dataset-cleaning script live on the ru-2.0-alpha branch; this
branch ships only the module itself.
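The metric described above (exact string match per token, compared ё/е-insensitively) can be sketched as follows. This is not the project's `eval_assess.py`, which is not shown here; `fold_yo`, `exact_match_accuracy`, and the toy rows are assumptions made up for illustration.

```python
def fold_yo(text: str) -> str:
    """Map ё/Ё to е/Е so that the comparison ignores the distinction."""
    return text.replace("ё", "е").replace("Ё", "Е")


def exact_match_accuracy(rows, normalize) -> float:
    """Share of (input, gold) rows where normalize(input) equals gold.

    Each input is normalized in isolation, mirroring the per-token setup.
    """
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(fold_yo(normalize(source)) == fold_yo(gold)
               for source, gold in rows)
    return hits / len(rows)


# Toy stand-in for the real normalizer, used only to exercise the metric.
toy_outputs = {"еще": "ещё", "06": "ноль шесть", "5": "five"}
toy_rows = [("еще", "еще"), ("06", "ноль шесть"), ("5", "пять")]
print(exact_match_accuracy(toy_rows, lambda token: toy_outputs.get(token, token)))
```

With the real module, `normalize` would be `normalize_russian` and the rows would come from the token and gold columns of the dataset.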
| Domain (class) | Tokens | Original acc. | Current acc. | Notes |
|---|---|---|---|---|
| PLAIN | 7,360,439 | 69.9% | 92.5% | residual: Latin spelled per-letter in gold |
| PUNCT | 2,288,640 | 100.0% | 100.0% | passthrough |
| CARDINAL | 272,442 | 51.2% | 77.0% | residual: oblique case of bare numbers (no context in token) |
| LETTERS | 189,528 | 0.8% | 0.0% | not targeted (gold uses bare letters, worse for TTS) |
| DATE | 185,961 | 0.0% | 86.2% | residual: bare years, ambiguous day-case |
| VERBATIM | 157,912 | 91.1% | 95.7% | symbol / Greek map |
| ORDINAL | 46,738 | 0.0% | 40.6% | residual: bare-number ordinals (need context) |
| MEASURE | 40,537 | 3.1% | 50.8% | residual: oblique case agreement |
| TELEPHONE | 10,088 | 0.3% | 1.3% | not targeted (irregular ISBN grouping) |
| DECIMAL | 7,299 | 6.1% | 54.3% | residual: oblique case agreement |
| ELECTRONIC | 5,832 | 2.6% | 2.8% | not targeted (English G2P + markers) |
| MONEY | 2,690 | 14.4% | 34.0% | residual: case agreement, "долларов сэ ш а" artifact |
| FRACTION | 2,460 | 0.0% | 66.0% | residual: context-dependent case |
| DIGIT | 2,012 | 0.0% | 100.0% | leading-zero digit strings |
| TIME | 1,949 | 0.0% | 85.3% | residual: oblique case, timezone suffixes |
| Overall | 10,574,570 | 73.0% | 91.4% | exact-match token accuracy (incl. eval_extension rows) |
The remaining error is dominated by cases that rules cannot resolve without a token
classifier or sentence context. Case agreement is now rule-handled when the
context is inside the token (около 500 км -> около пятисот километров,
governed by the preposition and the noun ending), but the gold set scores tokens in
isolation, where a bare 500 км gives no case signal. The same applies to
disambiguating a bare number as cardinal, ordinal, or year. The rest comes from
classes left untargeted on purpose (LETTERS, TELEPHONE, ELECTRONIC). The test set
is treated as a regression guard, not a target: some choices (keeping ё, reading
acronyms as words, nominative Roman numerals) favour TTS quality over this score.
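To make the difference between in-token context and a bare number concrete, here is a deliberately tiny sketch of preposition-governed case selection. It is an assumption about how such a rule could look, not the module's implementation: the table names, the three prepositions, and the two-number declension table are all hypothetical, and mapping "с" to the instrumental is a simplification (it can also govern the genitive).

```python
import re

# Hypothetical closed-class tables; a real one would cover every cardinal.
CASE_BY_PREPOSITION = {"около": "gen", "к": "dat", "с": "ins"}
CARDINAL_FORMS = {
    5: {"nom": "пять", "gen": "пяти", "dat": "пяти", "ins": "пятью"},
    500: {"nom": "пятьсот", "gen": "пятисот",
          "dat": "пятистам", "ins": "пятьюстами"},
}

# Optional governing preposition (as a whole word), then a whole number.
_NUMBER = re.compile(r"(?:(?<!\w)(около|к|с)\s+)?(?<!\d)(\d+)(?!\d)",
                     re.IGNORECASE)


def _read(match):
    preposition, digits = match.group(1), match.group(2)
    forms = CARDINAL_FORMS.get(int(digits))
    if forms is None:
        return match.group(0)  # number not in the toy table: leave as-is
    if preposition is None:
        return forms["nom"]    # bare number: no case signal, use nominative
    case = CASE_BY_PREPOSITION[preposition.lower()]
    return f"{preposition} {forms[case]}"


def read_numbers(text: str) -> str:
    """Replace known numbers with words, declined after a known preposition."""
    return _NUMBER.sub(_read, text)


print(read_numbers("около 500"))
print(read_numbers("500"))
```

The bare `500` falls back to the nominative because nothing inside the token indicates a case, which is the situation the per-token gold set creates.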