Bulgarian text normalization for TTS — converts numbers, dates, currency, abbreviations to spoken form

Bulgarian Text Normalizer for TTS

A text normalization package that converts written Bulgarian text into its spoken form. It is designed as a preprocessing step for Text-to-Speech (TTS) systems.

Features

Category	Examples
Numbers	`1500` → `хиляда и петстотин`
Dates	`15.02.2026 г.` → `петнадесети февруари две хиляди двадесет и шеста година`
Time	`14:30 ч.` → `четиринадесет и тридесет часа`
Currency	`99.99 лв.` → `деветдесет и девет лева и деветдесет и девет стотинки`
Percentages	`15.5%` → `петнадесет цяло и пет десети процента`
Ordinals	`21-ви` → `двадесет и първи`
Abbreviations	`бул. Витоша, гр. София` → `булевард Витоша, град София`
Phone numbers	`+359 888 123 456` → digit-by-digit reading
Roman numerals	`век XXI` → `век двадесет и първи`
Symbols	`№10` → `номер десет`
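The Roman numerals row implies a Roman-to-integer step before the number is read as an ordinal. A minimal sketch of that step follows; it is illustrative only, and the names `ROMAN_VALUES` and `roman_to_int` are hypothetical rather than taken from the package's `bg_roman` module.

```python
# Illustrative only: not the package's bg_roman implementation.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'XXI' to an integer."""
    total = 0
    previous = 0
    # Walk right to left: a symbol smaller than the largest one seen so far
    # to its right is subtracted (the I in IV), otherwise it is added.
    for symbol in reversed(numeral.upper()):
        value = ROMAN_VALUES[symbol]  # raises KeyError on non-Roman characters
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total
```

The resulting integer could then be passed to an ordinal reader such as `number_to_words_ordinal`.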

Grammatical Correctness

Gender agreement: Handles masculine/feminine/neuter (един/една/едно, два/две)
Ordinal forms: Full gender-aware ordinals (първи/първа/първо)
Year reading: Ordinal feminine form matching "година" (две хиляди двадесет и шеста)
Space-separated thousands: 7 000 000 → седем милиона
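Two of the points above, gender agreement for small numerals and space-separated thousands, can be sketched as follows. This is a hedged illustration under assumed names (`GENDERED_FORMS`, `small_number_word`, `join_thousands`); the package's internal tables and patterns may be organised differently.

```python
import re

# Hypothetical lookup of the gendered forms listed above.
GENDERED_FORMS = {
    1: {"m": "един", "f": "една", "n": "едно"},
    2: {"m": "два", "f": "две", "n": "две"},
}


def small_number_word(value: int, gender: str = "m") -> str:
    """Return the word for 1 or 2 in the requested gender ('m', 'f' or 'n')."""
    return GENDERED_FORMS[value][gender]


# One to three digits followed by one or more groups of " ddd".
_GROUPED = re.compile(r"\b\d{1,3}(?: \d{3})+\b")


def join_thousands(text: str) -> str:
    """Rewrite '7 000 000' as '7000000' so it can be read as one number."""
    return _GROUPED.sub(lambda m: m.group(0).replace(" ", ""), text)
```

A pattern like this would also match the digit groups of a phone number, so a real pipeline would presumably need to handle phone numbers before this step.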

Usage

Quick usage

from bg_text_normalizer import normalize_text

result = normalize_text("На 15.02.2026 г. в 14:30 ч. цената е 1500.50 лв.")
# "На петнадесети февруари две хиляди двадесет и шеста година в четиринадесет
#  и тридесет часа цената е хиляда и петстотин лева и петдесет стотинки."

Class-based usage

from bg_text_normalizer import BulgarianTextNormalizer

normalizer = BulgarianTextNormalizer(expand_abbrevs=True, verbose=False)
result = normalizer.normalize("бул. Витоша №10, гр. София")
# "булевард Витоша номер десет, град София"

Individual modules

from bg_text_normalizer.bg_numbers import number_to_words_cardinal, number_to_words_ordinal
from bg_text_normalizer.bg_dates import normalize_date
from bg_text_normalizer.bg_currency import normalize_currency

number_to_words_cardinal(2500, gender='m')    # "две хиляди и петстотин"
number_to_words_ordinal(15, gender='m')       # "петнадесети"
normalize_date(15, 2, 2026)                   # "петнадесети февруари две хиляди двадесет и шеста"
normalize_currency("99.99", "BGN")            # "деветдесет и девет лева и деветдесет и девет стотинки"
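The currency examples read the fractional part as stotinki, which suggests the amount is first split into whole units and hundredths. A minimal sketch of that split follows, assuming non-negative amounts with a dot as the decimal separator; `split_amount` is a hypothetical helper, not part of the package's API. It uses `Decimal` rather than `float` to avoid binary rounding errors.

```python
from decimal import Decimal, ROUND_HALF_UP


def split_amount(amount: str) -> tuple[int, int]:
    """Split a non-negative amount string into (whole units, hundredths)."""
    # Round to two decimal places so "1500.5" is treated as 1500.50.
    value = Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    whole = int(value)  # drops the fractional part
    hundredths = int((value - whole) * 100)
    return whole, hundredths
```

Each half could then be converted separately, for example with `number_to_words_cardinal` and the appropriate gender for the currency unit.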

Integration with TTS Training (Qwen3-TTS)

Use this normalizer as a preprocessing step when preparing your training data:

import json
from bg_text_normalizer import normalize_text

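# Process your JSONL training data.
# Open both files as UTF-8 explicitly: the platform default encoding
# (e.g. on Windows) may not handle Cyrillic text.
with open('raw_data.jsonl', 'r', encoding='utf-8') as f_in, \
     open('normalized_data.jsonl', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        entry = json.loads(line)
        entry['text'] = normalize_text(entry['text'])
        f_out.write(json.dumps(entry, ensure_ascii=False) + '\n')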
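For larger datasets, the same loop can be wrapped in a reusable function. The sketch below is one possible shape, not part of the package: `normalize_jsonl` and its parameters are hypothetical, and the normalizer is passed in as a callable so that `normalize_text` (or any other function) can be supplied. It skips blank lines and leaves entries without the target field unchanged.

```python
import json
from typing import Callable


def normalize_jsonl(src_path: str, dst_path: str,
                    normalize: Callable[[str], str],
                    field: str = "text") -> int:
    """Normalize one field of every JSONL entry; return the number of entries written."""
    written = 0
    with open(src_path, "r", encoding="utf-8") as f_in, \
         open(dst_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            if not line.strip():
                continue  # skip blank lines instead of failing in json.loads
            entry = json.loads(line)
            if field in entry:
                entry[field] = normalize(entry[field])
            # ensure_ascii=False keeps Cyrillic readable in the output file
            f_out.write(json.dumps(entry, ensure_ascii=False) + "\n")
            written += 1
    return written
```

With the package installed, a call might look like `normalize_jsonl('raw_data.jsonl', 'normalized_data.jsonl', normalize_text)`.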

For inference (runtime TTS), add normalization before synthesis:

from bg_text_normalizer import normalize_text

def synthesize(text: str):
    normalized = normalize_text(text)
    # ... pass normalized text to TTS model

File Structure

bg-text-normalizer/
├── src/
│   └── bg_text_normalizer/
│       ├── __init__.py           # Package entry point
│       ├── bg_normalizer.py      # Main orchestrator
│       ├── bg_numbers.py         # Cardinal, ordinal, decimal numbers
│       ├── bg_dates.py           # Date normalization
│       ├── bg_time.py            # Time normalization
│       ├── bg_currency.py        # Currency (BGN, EUR, USD, GBP)
│       ├── bg_abbreviations.py   # 100+ Bulgarian abbreviations
│       ├── bg_phone.py           # Phone number reading
│       └── bg_roman.py           # Roman numeral conversion
├── test_normalizer.py            # Test suite
├── pyproject.toml
└── README.md

Adding Custom Abbreviations

Edit src/bg_text_normalizer/bg_abbreviations.py and add entries to the appropriate dictionary:

# In ADDRESS_ABBREVS, TITLE_ABBREVS, etc.
CUSTOM_ABBREVS = {
    'your_abbrev.': 'пълна форма',
}
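To illustrate how a dictionary of this shape might be applied to text, here is a minimal, self-contained sketch. It is an assumption about the general technique, not the package's actual matching logic; `expand_abbreviations` is a hypothetical helper and matching is case-sensitive.

```python
import re


def expand_abbreviations(text: str, abbrevs: dict[str, str]) -> str:
    """Replace each abbreviation key found in text with its full form."""
    if not abbrevs:
        return text
    # Longest keys first so a longer abbreviation wins over its prefix.
    keys = sorted(abbrevs, key=len, reverse=True)
    # (?<!\w) prevents matching an abbreviation in the middle of a word.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + ")"
    )
    return pattern.sub(lambda m: abbrevs[m.group(0)], text)
```

For example, `expand_abbreviations("бул. Витоша, гр. София", {"бул.": "булевард", "гр.": "град"})` yields the expansion shown in the Features table.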

Dependencies

None — pure Python, no external dependencies required.

Release history

1.1.0 (Feb 28, 2026)
1.0.0 (Feb 24, 2026), this version

Download files

Source Distribution

bg_text_normalizer-1.0.0.tar.gz (19.1 kB)

Uploaded Feb 24, 2026 Source

Built Distribution

bg_text_normalizer-1.0.0-py3-none-any.whl (21.0 kB)

Uploaded Feb 24, 2026 Python 3

File details

Details for the file bg_text_normalizer-1.0.0.tar.gz.

File metadata

Download URL: bg_text_normalizer-1.0.0.tar.gz
Upload date: Feb 24, 2026
Size: 19.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.3

File hashes

Hashes for bg_text_normalizer-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`7d3257d1b2adb2534ea784c797760c0836c79b6aecf3b7389ab1aa3cf23bc8e4`
MD5	`66651d8cad5794db2c46d2aa62d33904`
BLAKE2b-256	`2f5ae8bcfcaa86a377906dbeea07992d5856cfb64af6a21897d36f4132250cf1`

File details

Details for the file bg_text_normalizer-1.0.0-py3-none-any.whl.

File metadata

Download URL: bg_text_normalizer-1.0.0-py3-none-any.whl
Upload date: Feb 24, 2026
Size: 21.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.3

File hashes

Hashes for bg_text_normalizer-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f9fec9324d96447b01eb21fcbe5abbe9a668048a6f14e271a8ac11c516bc84f2`
MD5	`5724c4c238fe21f28014840de7410348`
BLAKE2b-256	`9b072f54fc244e80c12f8486b204a1a7553988b2d3674c074d6c9318767c5702`
