Bulgarian text normalization for TTS: converts numbers, dates, currency, and abbreviations to spoken form
Bulgarian Text Normalizer for TTS
A comprehensive text normalization package that converts written Bulgarian text into its spoken form, designed as a preprocessing step for Text-to-Speech (TTS) systems.
Features
| Category | Examples |
|---|---|
| Numbers | 1500 → хиляда и петстотин |
| Dates | 15.02.2026 г. → петнадесети февруари две хиляди двадесет и шеста година |
| Time | 14:30 ч. → четиринадесет и тридесет часа |
| Currency | 99.99 лв. → деветдесет и девет лева и деветдесет и девет стотинки |
| Percentages | 15.5% → петнадесет цяло и пет десети процента |
| Ordinals | 21-ви → двадесет и първи |
| Abbreviations | бул. Витоша, гр. София → булевард Витоша, град София |
| Phone numbers | +359 888 123 456 → digit-by-digit reading |
| Roman numerals | век XXI → век двадесет и първи |
| Symbols | №10 → номер десет |
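The Roman numeral row implies a numeral-to-integer step before the number is read as an ordinal. The following is a minimal sketch of that step only; `roman_to_int` is a hypothetical helper written for illustration and is not taken from `bg_roman.py`.

```python
# Values of the individual Roman numeral symbols.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'XXI' to an integer (illustrative helper)."""
    total = 0
    largest_seen = 0
    # Walk right to left: a symbol smaller than one already seen to its
    # right is subtracted (the I in IV), otherwise it is added.
    for symbol in reversed(numeral.upper()):
        value = ROMAN_VALUES[symbol]
        if value < largest_seen:
            total -= value
        else:
            total += value
            largest_seen = value
    return total


print(roman_to_int("XXI"))  # 21
```

The sketch does not validate its input, so malformed numerals such as `IIII` are accepted; a production normalizer would likely also need to decide when a token like `I` is a numeral at all.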
Grammatical Correctness
- Gender agreement: Handles masculine/feminine/neuter (един/една/едно, два/две)
- Ordinal forms: Full gender-aware ordinals (първи/първа/първо)
- Year reading: Ordinal feminine form matching "година" (две хиляди двадесет и шеста)
- Space-separated thousands: 7 000 000 → седем милиона
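Two of the points above can be sketched in a few lines: a lookup for the gendered forms of 1 and 2, and a pre-pass that joins space-separated thousands so that later stages see a single number. The names `ONE`, `TWO`, and `join_spaced_thousands` are assumptions made for this illustration and may not match the package's internals.

```python
import re

# Gendered forms of 1 and 2 (masculine / feminine / neuter), as listed above.
ONE = {"m": "един", "f": "една", "n": "едно"}
TWO = {"m": "два", "f": "две", "n": "две"}

# One to three digits followed by one or more groups of " ddd",
# not embedded in a longer run of digits.
SPACED_NUMBER = re.compile(r"(?<!\d)\d{1,3}(?: \d{3})+(?!\d)")


def join_spaced_thousands(text: str) -> str:
    """Rewrite '7 000 000' as '7000000' so it can be read as one number."""
    return SPACED_NUMBER.sub(lambda m: m.group(0).replace(" ", ""), text)


print(join_spaced_thousands("Население: 7 000 000 души"))
# Население: 7000000 души
```

This pattern is deliberately naive: digit groups in a phone number would match it too, so a real pipeline would presumably handle phone numbers before this step.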
Usage
Quick usage
from bg_text_normalizer import normalize_text
result = normalize_text("На 15.02.2026 г. в 14:30 ч. цената е 1500.50 лв.")
# "На петнадесети февруари две хиляди двадесет и шеста година в четиринадесет
# и тридесет часа цената е хиляда и петстотин лева и петдесет стотинки."
Class-based usage
from bg_text_normalizer import BulgarianTextNormalizer
normalizer = BulgarianTextNormalizer(expand_abbrevs=True, verbose=False)
result = normalizer.normalize("бул. Витоша №10, гр. София")
# "булевард Витоша номер десет, град София"
Individual modules
from bg_text_normalizer.bg_numbers import number_to_words_cardinal, number_to_words_ordinal
from bg_text_normalizer.bg_dates import normalize_date
from bg_text_normalizer.bg_currency import normalize_currency
number_to_words_cardinal(2500, gender='m') # "две хиляди и петстотин"
number_to_words_ordinal(15, gender='m') # "петнадесети"
normalize_date(15, 2, 2026) # "петнадесети февруари две хиляди двадесет и шеста"
normalize_currency("99.99", "BGN") # "деветдесет и девет лева и деветдесет и девет стотинки"
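To make the placement of "и" and the role of the `gender` argument more concrete, here is a self-contained sketch of cardinal reading for 0 to 999 999. It is an independent illustration, not the code in `bg_numbers.py`; `cardinal_sketch` and its helpers are hypothetical names, and the conjunction rule used (insert "и" before the last spoken component) is an assumption that reproduces the examples shown above.

```python
UNITS = ["", "едно", "две", "три", "четири", "пет", "шест", "седем", "осем", "девет"]
# Only 1 and 2 change with grammatical gender.
GENDERED = {1: {"m": "един", "f": "една", "n": "едно"},
            2: {"m": "два", "f": "две", "n": "две"}}
TEENS = ["десет", "единадесет", "дванадесет", "тринадесет", "четиринадесет",
         "петнадесет", "шестнадесет", "седемнадесет", "осемнадесет", "деветнадесет"]
TENS = ["", "", "двадесет", "тридесет", "четиридесет", "петдесет",
        "шестдесет", "седемдесет", "осемдесет", "деветдесет"]
HUNDREDS = ["", "сто", "двеста", "триста", "четиристотин", "петстотин",
            "шестстотин", "седемстотин", "осемстотин", "деветстотин"]


def _parts_below_1000(n, gender):
    """Spoken components of 1..999, without any conjunction."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(HUNDREDS[hundreds])
    if 10 <= rest <= 19:
        parts.append(TEENS[rest - 10])
    else:
        tens, unit = divmod(rest, 10)
        if tens:
            parts.append(TENS[tens])
        if unit:
            parts.append(GENDERED.get(unit, {}).get(gender, UNITS[unit]))
    return parts


def _join(parts):
    """Join components, placing 'и' before the last one when there are several."""
    if len(parts) > 1:
        return " ".join(parts[:-1]) + " и " + parts[-1]
    return parts[0]


def cardinal_sketch(n, gender="n"):
    """Read 0..999999 as a Bulgarian cardinal (illustrative only)."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999999")
    if n == 0:
        return "нула"
    thousands, rest = divmod(n, 1000)
    chunks = []
    if thousands == 1:
        chunks.append("хиляда")
    elif thousands > 1:
        # "хиляда" is feminine, so the multiplier takes feminine forms.
        chunks.append(_join(_parts_below_1000(thousands, "f")) + " хиляди")
    if rest:
        rest_parts = _parts_below_1000(rest, gender)
        rest_text = _join(rest_parts)
        # A single trailing component is linked to the thousands with "и".
        if chunks and len(rest_parts) == 1:
            rest_text = "и " + rest_text
        chunks.append(rest_text)
    return " ".join(chunks)


print(cardinal_sketch(1500))       # хиляда и петстотин
print(cardinal_sketch(2500, "m"))  # две хиляди и петстотин
```

Millions, decimals, and ordinals are left out to keep the sketch short.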
Integration with TTS Training (Qwen3-TTS)
Use this normalizer as a preprocessing step when preparing your training data:
import json
from bg_text_normalizer import normalize_text
# Process your JSONL training data
with open('raw_data.jsonl', 'r', encoding='utf-8') as f_in, \
     open('normalized_data.jsonl', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        entry = json.loads(line)
        entry['text'] = normalize_text(entry['text'])
        f_out.write(json.dumps(entry, ensure_ascii=False) + '\n')
For inference (runtime TTS), add normalization before synthesis:
from bg_text_normalizer import normalize_text
def synthesize(text: str):
    normalized = normalize_text(text)
    # ... pass normalized text to TTS model
File Structure
bg-text-normalizer/
├── src/
│   └── bg_text_normalizer/
│       ├── __init__.py          # Package entry point
│       ├── bg_normalizer.py     # Main orchestrator
│       ├── bg_numbers.py        # Cardinal, ordinal, decimal numbers
│       ├── bg_dates.py          # Date normalization
│       ├── bg_time.py           # Time normalization
│       ├── bg_currency.py       # Currency (BGN, EUR, USD, GBP)
│       ├── bg_abbreviations.py  # 100+ Bulgarian abbreviations
│       ├── bg_phone.py          # Phone number reading
│       └── bg_roman.py          # Roman numeral conversion
├── test_normalizer.py           # Test suite
├── pyproject.toml
└── README.md
Adding Custom Abbreviations
Edit src/bg_text_normalizer/bg_abbreviations.py and add entries to the appropriate dictionary:
# In ADDRESS_ABBREVS, TITLE_ABBREVS, etc.
CUSTOM_ABBREVS = {
    'your_abbrev.': 'пълна форма',
}
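For readers who want to see how such a dictionary might be applied, here is a minimal, self-contained sketch of dictionary-driven expansion. It does not reproduce the package's matching logic: `expand_abbreviations` and `ABBREVS` are hypothetical names, the first two entries mirror the Features table, and the `ул.` entry is added purely as an illustration of a custom abbreviation.

```python
import re

# Hypothetical mapping for illustration; not the package's own dictionaries.
ABBREVS = {
    "бул.": "булевард",
    "гр.": "град",
}


def expand_abbreviations(text, abbrevs):
    """Replace each known abbreviation with its full form (case-sensitive)."""
    if not abbrevs:
        return text
    # Try longer keys first so that a longer abbreviation is not
    # shadowed by a shorter one that is a prefix of it.
    keys = sorted(abbrevs, key=len, reverse=True)
    # (?<!\w) keeps the match from starting in the middle of a word.
    pattern = re.compile(r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")")
    return pattern.sub(lambda m: abbrevs[m.group(0)], text)


# Extending the mapping with a custom entry.
custom = {**ABBREVS, "ул.": "улица"}

print(expand_abbreviations("бул. Витоша, гр. София", ABBREVS))
# булевард Витоша, град София
```

Compiling one alternation from all keys means the text is scanned once regardless of dictionary size, at the cost of rebuilding the pattern whenever the dictionary changes.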
Dependencies
None. The package is pure Python and requires no external dependencies.
File details
Details for the file bg_text_normalizer-1.0.0.tar.gz.
File metadata
- Download URL: bg_text_normalizer-1.0.0.tar.gz
- Upload date:
- Size: 19.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d3257d1b2adb2534ea784c797760c0836c79b6aecf3b7389ab1aa3cf23bc8e4 |
| MD5 | 66651d8cad5794db2c46d2aa62d33904 |
| BLAKE2b-256 | 2f5ae8bcfcaa86a377906dbeea07992d5856cfb64af6a21897d36f4132250cf1 |
File details
Details for the file bg_text_normalizer-1.0.0-py3-none-any.whl.
File metadata
- Download URL: bg_text_normalizer-1.0.0-py3-none-any.whl
- Upload date:
- Size: 21.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f9fec9324d96447b01eb21fcbe5abbe9a668048a6f14e271a8ac11c516bc84f2 |
| MD5 | 5724c4c238fe21f28014840de7410348 |
| BLAKE2b-256 | 9b072f54fc244e80c12f8486b204a1a7553988b2d3674c074d6c9318767c5702 |
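A downloaded file can be checked against the SHA256 digests listed for it using only the standard library. The sketch below is a generic example rather than part of the package; `sha256_of_file` is a hypothetical helper name, and the local path in the usage comment assumes the wheel was saved to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the wheel is in the current directory):
# expected = "f9fec9324d96447b01eb21fcbe5abbe9a668048a6f14e271a8ac11c516bc84f2"
# assert sha256_of_file("bg_text_normalizer-1.0.0-py3-none-any.whl") == expected
```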