Bulgarian text normalization for TTS — converts numbers, dates, currency, abbreviations to spoken form

Bulgarian Text Normalizer for TTS

A comprehensive text normalization package that converts written Bulgarian text into its spoken form, designed as a preprocessing step for Text-to-Speech (TTS) systems.

Features

Category Examples
Numbers 1500 → хиляда и петстотин
Dates 15.02.2026 г. → петнадесети февруари две хиляди двадесет и шеста година
Time 14:30 ч. → четиринадесет и тридесет часа
Currency 99.99 лв. → деветдесет и девет лева и деветдесет и девет стотинки
Percentages 15.5% → петнадесет цяло и пет десети процента
Ordinals 21-ви → двадесет и първи
Abbreviations бул. Витоша, гр. София → булевард Витоша, град София
Phone numbers +359 888 123 456 → digit-by-digit reading
Roman numerals век XXI → век двадесет и първи
Symbols №10 → номер десет
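
The Roman numeral row can be illustrated with a small standalone sketch. This is not the package's `bg_roman` implementation, which the README does not show; `roman_to_int` is a hypothetical helper name. It only turns the numeral into an integer and leaves the ordinal wording to a function such as `number_to_words_ordinal`.

```python
# Hypothetical helper, not part of bg_text_normalizer's documented API.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'XXI' to an integer (no strict validation)."""
    total = 0
    previous = 0
    # Walk right to left: a symbol smaller than the largest one seen so far
    # (for example the I in XIV) is subtracted, everything else is added.
    for symbol in reversed(numeral.upper()):
        value = ROMAN_VALUES[symbol]  # raises KeyError on non-Roman characters
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


print(roman_to_int("XXI"))  # 21
print(roman_to_int("XIV"))  # 14
```

The sketch accepts malformed input such as "IIII" without complaint; a production converter would likely validate the numeral first.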

Grammatical Correctness

  • Gender agreement: Handles masculine/feminine/neuter (един/една/едно, два/две)
  • Ordinal forms: Full gender-aware ordinals (първи/първа/първо)
  • Year reading: Ordinal feminine form matching "година" (две хиляди двадесет и шеста)
  • Space-separated thousands: 7 000 000 → седем милиона
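
To make the gender agreement point concrete, here is a minimal standalone sketch covering only the numerals 1 and 2 named in the list above. It does not use the package: `small_cardinal` and `GENDERED_FORMS` are hypothetical names, and the `'f'` and `'n'` gender codes are an assumption modeled on the `gender='m'` argument shown in the usage examples.

```python
# Illustrative only: gendered forms for 1 and 2, the numerals listed above.
GENDERED_FORMS = {
    1: {"m": "един", "f": "една", "n": "едно"},
    2: {"m": "два", "f": "две", "n": "две"},
}


def small_cardinal(n: int, gender: str = "m") -> str:
    """Return the gendered spoken form of 1 or 2."""
    if n not in GENDERED_FORMS:
        raise ValueError(f"only 1 and 2 are covered in this sketch, got {n}")
    if gender not in ("m", "f", "n"):
        raise ValueError(f"gender must be 'm', 'f' or 'n', got {gender!r}")
    return GENDERED_FORMS[n][gender]


print(small_cardinal(1, "f"), "книга")  # една книга
print(small_cardinal(2, "m"), "стола")  # два стола
print(small_cardinal(2, "n"), "деца")   # две деца
```

For larger numbers only the final 1 or 2 changes form, which is why a cardinal function needs a gender argument at all.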

Usage

Quick usage

from bg_text_normalizer import normalize_text

result = normalize_text("На 15.02.2026 г. в 14:30 ч. цената е 1500.50 лв.")
# "На петнадесети февруари две хиляди двадесет и шеста година в четиринадесет
#  и тридесет часа цената е хиляда и петстотин лева и петдесет стотинки."

Class-based usage

from bg_text_normalizer import BulgarianTextNormalizer

normalizer = BulgarianTextNormalizer(expand_abbrevs=True, verbose=False)
result = normalizer.normalize("бул. Витоша №10, гр. София")
# "булевард Витоша номер десет, град София"

Individual modules

from bg_text_normalizer.bg_numbers import number_to_words_cardinal, number_to_words_ordinal
from bg_text_normalizer.bg_dates import normalize_date
from bg_text_normalizer.bg_currency import normalize_currency

number_to_words_cardinal(2500, gender='m')    # "две хиляди и петстотин"
number_to_words_ordinal(15, gender='m')       # "петнадесети"
normalize_date(15, 2, 2026)                   # "петнадесети февруари две хиляди двадесет и шеста"
normalize_currency("99.99", "BGN")            # "деветдесет и девет лева и деветдесет и девет стотинки"

Integration with TTS Training (Qwen3-TTS)

Use this normalizer as a preprocessing step when preparing your training data:

import json
from bg_text_normalizer import normalize_text

# Process your JSONL training data; UTF-8 is set explicitly because the text is Cyrillic
with open('raw_data.jsonl', 'r', encoding='utf-8') as f_in, \
     open('normalized_data.jsonl', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        entry = json.loads(line)
        entry['text'] = normalize_text(entry['text'])
        f_out.write(json.dumps(entry, ensure_ascii=False) + '\n')
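
A slightly more defensive variant of this loop might look like the sketch below. It is an illustration under assumptions, not part of the package: `normalize_jsonl` is a hypothetical helper, the normalizer is passed in as a plain callable so the block runs without the package installed, and the choice to skip blank lines and fail loudly on malformed JSON is one possible policy among several.

```python
import json
from typing import Callable


def normalize_jsonl(src: str, dst: str, normalize: Callable[[str], str],
                    field: str = "text") -> int:
    """Normalize `field` in every JSON line of `src` and write the result to `dst`.

    Returns the number of entries written. Blank lines are skipped; a
    malformed line raises ValueError naming the file and line number.
    """
    written = 0
    with open(src, encoding="utf-8") as f_in, \
         open(dst, "w", encoding="utf-8") as f_out:
        for line_no, line in enumerate(f_in, start=1):
            if not line.strip():
                continue  # tolerate empty lines, e.g. a trailing newline
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{src}:{line_no}: invalid JSON") from exc
            entry[field] = normalize(entry[field])
            # ensure_ascii=False keeps Cyrillic readable instead of \u escapes
            f_out.write(json.dumps(entry, ensure_ascii=False) + "\n")
            written += 1
    return written
```

With the package installed, a call could be `normalize_jsonl("raw_data.jsonl", "normalized_data.jsonl", normalize_text)`.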

For inference (runtime TTS), add normalization before synthesis:

from bg_text_normalizer import normalize_text

def synthesize(text: str):
    normalized = normalize_text(text)
    # ... pass normalized text to TTS model

File Structure

bg-text-normalizer/
├── src/
│   └── bg_text_normalizer/
│       ├── __init__.py           # Package entry point
│       ├── bg_normalizer.py      # Main orchestrator
│       ├── bg_numbers.py         # Cardinal, ordinal, decimal numbers
│       ├── bg_dates.py           # Date normalization
│       ├── bg_time.py            # Time normalization
│       ├── bg_currency.py        # Currency (BGN, EUR, USD, GBP)
│       ├── bg_abbreviations.py   # 100+ Bulgarian abbreviations
│       ├── bg_phone.py           # Phone number reading
│       └── bg_roman.py           # Roman numeral conversion
├── test_normalizer.py            # Test suite
├── pyproject.toml
└── README.md

Adding Custom Abbreviations

Edit src/bg_text_normalizer/bg_abbreviations.py and add entries to the appropriate dictionary:

# In ADDRESS_ABBREVS, TITLE_ABBREVS, etc.
CUSTOM_ABBREVS = {
    'your_abbrev.': 'пълна форма',
}
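
How such a dictionary might be applied to text is sketched below. The README does not show the package's own matching logic, so this is an assumption-laden illustration: `expand_abbreviations` is a hypothetical function, and the two entries reuse the expansions from the examples above.

```python
import re
from typing import Mapping

# Hypothetical custom entries; both expansions appear in the examples above.
CUSTOM_ABBREVS = {
    "бул.": "булевард",
    "гр.": "град",
}


def expand_abbreviations(text: str, abbrevs: Mapping[str, str]) -> str:
    """Replace each abbreviation that starts a word with its full form."""
    if not abbrevs:
        return text
    # Longest keys first, so a longer abbreviation wins over one that is its prefix.
    keys = sorted(abbrevs, key=len, reverse=True)
    # (?<!\w) prevents a match in the middle of a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + "|".join(map(re.escape, keys)) + r")")
    return pattern.sub(lambda m: abbrevs[m.group(0)], text)


print(expand_abbreviations("бул. Витоша, гр. София", CUSTOM_ABBREVS))
# булевард Витоша, град София
```

Matching here is case-sensitive, so a capitalized "Бул." at the start of a sentence would need its own entry or a case-insensitive lookup.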

Dependencies

None — pure Python, no external dependencies required.

Source Distribution

bg_text_normalizer-1.1.0.tar.gz (19.2 kB)

Uploaded Source

Built Distribution

bg_text_normalizer-1.1.0-py3-none-any.whl (21.3 kB)

Uploaded Python 3

File details

Details for the file bg_text_normalizer-1.1.0.tar.gz.

File metadata

  • Download URL: bg_text_normalizer-1.1.0.tar.gz
  • Upload date:
  • Size: 19.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bg_text_normalizer-1.1.0.tar.gz
Algorithm Hash digest
SHA256 990f8c5dd0cb05e5b06d6e97b09f593e5cb17c3638c38aae37b919a0235e6346
MD5 5af29cf37f8b08ad569c91fef0dffa82
BLAKE2b-256 eb82cac66a61767200b6461c551ec4bea8b933c5bb6b3203702e00c8686667f9

Provenance

The following attestation bundles were made for bg_text_normalizer-1.1.0.tar.gz:

Publisher: publish.yml on raditotev/bg-text-normalizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bg_text_normalizer-1.1.0-py3-none-any.whl.

File hashes

Hashes for bg_text_normalizer-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b8f4e035f23e603720d678159a27a2b67868ce83722e853a9f63d59690249651
MD5 f3e9732765cf058ef5a355547d6aab41
BLAKE2b-256 4807e752df07e0feab01f93c97237554df300c98d2180aa31a1a123355029f9b

Provenance

The following attestation bundles were made for bg_text_normalizer-1.1.0-py3-none-any.whl:

Publisher: publish.yml on raditotev/bg-text-normalizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
