
Bulgarian text normalization for TTS: converts numbers, dates, currency, and abbreviations to their spoken form


Bulgarian Text Normalizer for TTS

A comprehensive text normalization package that converts written Bulgarian text into its spoken form, designed as a preprocessing step for Text-to-Speech (TTS) systems.

Features

Category          Example (written → spoken)
Numbers           1500 → хиляда и петстотин
Dates             15.02.2026 г. → петнадесети февруари две хиляди двадесет и шеста година
Time              14:30 ч. → четиринадесет и тридесет часа
Currency          99.99 лв. → деветдесет и девет лева и деветдесет и девет стотинки
Percentages       15.5% → петнадесет цяло и пет десети процента
Ordinals          21-ви → двадесет и първи
Abbreviations     бул. Витоша, гр. София → булевард Витоша, град София
Phone numbers     +359 888 123 456 → digit-by-digit reading
Roman numerals    век XXI → век двадесет и първи
Symbols           №10 → номер десет
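
The Roman numeral row implies an intermediate step that turns a numeral such as XXI into an integer before it is read as an ordinal. The following is a minimal sketch of that step using the standard right-to-left subtractive rule. The function name `roman_to_int` is hypothetical and is not taken from `bg_roman.py`, whose actual implementation may differ.

```python
# Illustrative only: not the package's own code.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a well-formed Roman numeral (e.g. 'XXI') to an integer."""
    total = 0
    largest_seen = 0
    # Walk right to left: a symbol smaller than one already seen to its
    # right is subtracted (the I in IV), otherwise it is added.
    for ch in reversed(numeral.upper()):
        value = ROMAN_VALUES[ch]  # raises KeyError for non-Roman characters
        if value < largest_seen:
            total -= value
        else:
            total += value
            largest_seen = value
    return total


print(roman_to_int("XXI"))  # 21
```

The resulting integer could then be passed to an ordinal reader to obtain a form like "двадесет и първи".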

Grammatical Correctness

  • Gender agreement: Handles masculine/feminine/neuter (един/една/едно, два/две)
  • Ordinal forms: Full gender-aware ordinals (първи/първа/първо)
  • Year reading: Ordinal feminine form matching "година" (две хиляди двадесет и шеста)
  • Space-separated thousands: 7 000 000 → седем милиона
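
One way to support space-separated thousands is to collapse the groups into a single digit string before number reading. The sketch below shows that idea with a regular expression. The helper name `join_thousands` and the decision to also accept a no-break space are assumptions for illustration, not the package's documented behavior.

```python
import re

# One to three digits followed by one or more groups of exactly three digits,
# each preceded by a single space or no-break space.
_GROUPED_NUMBER = re.compile(r"(?<!\d)\d{1,3}(?:[ \u00a0]\d{3})+(?!\d)")


def join_thousands(text: str) -> str:
    """Collapse '7 000 000' into '7000000' so it can be read as one number."""
    return _GROUPED_NUMBER.sub(
        lambda m: re.sub(r"[ \u00a0]", "", m.group(0)), text
    )


print(join_thousands("Населението е 7 000 000 души."))
# Населението е 7000000 души.
```

A pattern like this would also merge the digit groups of a phone number such as +359 888 123 456, so in a real pipeline phone numbers would presumably need to be handled before this step.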

Usage

Quick usage

from bg_text_normalizer import normalize_text

result = normalize_text("На 15.02.2026 г. в 14:30 ч. цената е 1500.50 лв.")
# "На петнадесети февруари две хиляди двадесет и шеста година в четиринадесет
#  и тридесет часа цената е хиляда и петстотин лева и петдесет стотинки."

Class-based usage

from bg_text_normalizer import BulgarianTextNormalizer

normalizer = BulgarianTextNormalizer(expand_abbrevs=True, verbose=False)
result = normalizer.normalize("бул. Витоша №10, гр. София")
# "булевард Витоша номер десет, град София"

Individual modules

from bg_text_normalizer.bg_numbers import number_to_words_cardinal, number_to_words_ordinal
from bg_text_normalizer.bg_dates import normalize_date
from bg_text_normalizer.bg_currency import normalize_currency

number_to_words_cardinal(2500, gender='m')    # "две хиляди и петстотин"
number_to_words_ordinal(15, gender='m')       # "петнадесети"
normalize_date(15, 2, 2026)                   # "петнадесети февруари две хиляди двадесет и шеста"
normalize_currency("99.99", "BGN")            # "деветдесет и девет лева и деветдесет и девет стотинки"

Integration with TTS Training (Qwen3-TTS)

Use this normalizer as a preprocessing step when preparing your training data:

import json
from bg_text_normalizer import normalize_text

# Process your JSONL training data (open as UTF-8 so Cyrillic text survives on every platform)
with open('raw_data.jsonl', 'r', encoding='utf-8') as f_in, \
     open('normalized_data.jsonl', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        entry = json.loads(line)
        entry['text'] = normalize_text(entry['text'])
        f_out.write(json.dumps(entry, ensure_ascii=False) + '\n')
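
The same loop can be wrapped in a reusable function. The sketch below is one possible shape: it takes the normalizer as a parameter (so `normalize_text` can be passed in), skips blank lines, and leaves records without a text field untouched. The name `normalize_jsonl` and the `text_key` parameter are hypothetical and not part of the package.

```python
import json
from typing import Callable


def normalize_jsonl(
    src_path: str,
    dst_path: str,
    normalize: Callable[[str], str],
    text_key: str = "text",
) -> int:
    """Normalize `text_key` in each JSONL record; return the number of records written."""
    written = 0
    with open(src_path, "r", encoding="utf-8") as f_in, \
         open(dst_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            if not line.strip():
                continue  # tolerate blank lines in the input
            entry = json.loads(line)
            if text_key in entry:
                entry[text_key] = normalize(entry[text_key])
            # ensure_ascii=False keeps Cyrillic readable instead of \u escapes
            f_out.write(json.dumps(entry, ensure_ascii=False) + "\n")
            written += 1
    return written
```

With the package installed, a call would look like `normalize_jsonl('raw_data.jsonl', 'normalized_data.jsonl', normalize_text)`.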

For inference (runtime TTS), add normalization before synthesis:

from bg_text_normalizer import normalize_text

def synthesize(text: str):
    normalized = normalize_text(text)
    # ... pass normalized text to TTS model
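
At inference time it may be preferable for synthesis to continue even if normalization fails on unusual input. The sketch below shows one defensive pattern: fall back to the raw text when the normalizer raises or returns an empty string. The wrapper name `safe_normalize` is hypothetical, and whether a fallback is appropriate depends on the TTS model; this is a design suggestion, not documented package behavior.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def safe_normalize(text: str, normalize: Callable[[str], str]) -> str:
    """Return normalize(text), or the raw text if normalization fails or yields nothing."""
    try:
        normalized = normalize(text)
    except Exception:
        # Log the traceback but keep the synthesis request alive.
        logger.exception("Normalization failed; falling back to raw text")
        return text
    return normalized if normalized.strip() else text
```

Inside `synthesize`, this would replace the direct call, for example `normalized = safe_normalize(text, normalize_text)`.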

File Structure

bg-text-normalizer/
├── src/
│   └── bg_text_normalizer/
│       ├── __init__.py           # Package entry point
│       ├── bg_normalizer.py      # Main orchestrator
│       ├── bg_numbers.py         # Cardinal, ordinal, decimal numbers
│       ├── bg_dates.py           # Date normalization
│       ├── bg_time.py            # Time normalization
│       ├── bg_currency.py        # Currency (BGN, EUR, USD, GBP)
│       ├── bg_abbreviations.py   # 100+ Bulgarian abbreviations
│       ├── bg_phone.py           # Phone number reading
│       └── bg_roman.py           # Roman numeral conversion
├── test_normalizer.py            # Test suite
├── pyproject.toml
└── README.md

Adding Custom Abbreviations

Edit src/bg_text_normalizer/bg_abbreviations.py and add entries to the appropriate dictionary:

# In ADDRESS_ABBREVS, TITLE_ABBREVS, etc.
CUSTOM_ABBREVS = {
    'your_abbrev.': 'пълна форма',
}
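
To show how a dictionary like this could be applied, here is a minimal sketch of dictionary-driven expansion with a regular expression. It is not the logic in `bg_abbreviations.py`; the function name `expand_abbreviations` and the sample table are assumptions, with the first two entries mirroring the examples in the feature table.

```python
import re

# Illustrative table; not the package's own dictionary.
CUSTOM_ABBREVS = {
    "бул.": "булевард",
    "гр.": "град",
    "ул.": "улица",
}


def expand_abbreviations(text: str, table: dict[str, str]) -> str:
    """Replace each abbreviation in `table` with its full form."""
    if not table:
        return text
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(table, key=len, reverse=True)
    # (?<!\w) stops "ул." from matching inside "бул.".
    pattern = re.compile(r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")")
    return pattern.sub(lambda m: table[m.group(0)], text)


print(expand_abbreviations("бул. Витоша, гр. София", CUSTOM_ABBREVS))
# булевард Витоша, град София
```

This sketch is case-sensitive, so a capitalized "Бул." at the start of a sentence would need its own entry or a case-insensitive match.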

Dependencies

None. The package is pure Python and requires no external dependencies.

