Dhivehi text normalization for TTS frontends

dv-normalize

A Dhivehi text normalizer for TTS frontends. Converts numbers, dates, times, fractions, scores, abbreviations, money, percentages, and other non-Thaana input into spoken-form Dhivehi.

Status: v1.0 alpha (1.0.0a2). The new public API (normalize, Normalizer, NormalizerConfig) is the supported entry point. The legacy 0.1.x classes are still exported but emit a DeprecationWarning and will be removed in 2.0.

Installation

pip install dv-normalize

Quick start

from dv_normalize import normalize

normalize("ވަކި ލާރިން ވެސް 232.23 ލާރި ހޯދައެވެ")
# 'ވަކި ލާރިން ވެސް ދުއިސައްތަ ތިރީސް ދޭއް ޕޮއިންޓް ދޭއް ތިނެއް ލާރި ހޯދައޭ'

normalize("ޑރ. އިބްރާހިމް 14:30 ގައި އައި")
normalize("ކ.އަތޮޅު ވިލިނގިލިން 120 ކިލޯ މީޓަރު")
normalize("ފޭސް2ގެ")

For repeated use, hold onto a Normalizer instance:

from dv_normalize import Normalizer, NormalizerConfig

n = Normalizer(NormalizerConfig(keep_punctuation=False))
n("ހެލޯ، ދުނިޔެ")  # → 'ހެލޯ ދުނިޔެ'

What it handles

| Class | Example input | Example output |
| --- | --- | --- |
| Cardinal | 232 | ދުއިސައްތަ ތިރީސް ދޭއް |
| Comma-grouped | 104,880 | (single cardinal, not per-digit) |
| Per-digit | 9982711 | spelled digit-by-digit (7+ digit identifier) |
| Decimal | 232.23 | ދުއިސައްތަ ތިރީސް ދޭއް ޕޮއިންޓް ދޭއް ތިނެއް |
| Year | 2024 | ދެހާސް ސައުވީސް |
| Year range | 1982 - 2024 | … ން … އަށް |
| Time | 14:30 | ސާދަ ގަޑި ތިރީސް |
| Ordinal | 11ވަނަ | adnominal head form |
| Fraction | 1/2 | ދެބައިކުޅަ އެއްބައި |
| Mixed fraction | 1 1/2 | … އަދި … |
| Percent | 25% | ފަންސަވީސް ޕަސެންޓް |
| Oblique ref | 2024/3 | … ޚާއްސަ <denom-ordinal> |
| Score | 3-2, 0-0, 5-0 | compact draw / shutout forms |
| Money | 52 ރ. | Rufiyaa context-sensitive |
| Abbreviation | ޑރ., ހއ. | ޑޮކްޓަރު, ހާ އަލިފު |
| Compound abbrev | ސ.ޢ.ވ. | ޞައްލަﷲ ޢަލައިހި ވަސައްލަމް |
| Calendar marker | 2026 މ., 1447 ހ. | … މީލާދީ, … ހިޖުރީ |
| Sentence ending | ހޯދައެވެ | ހޯދައޭ (113 rules, context-sensitive) |

The classifier is priority-ranked, so more specific patterns (calendar markers, multi-letter compound abbreviations, year ranges) shadow the generic ones. Tokens that don't match any rule pass through unchanged.
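To make the priority ranking concrete, here is a minimal sketch of a first-match-wins classifier. It is not the package's implementation: the `Rule` type, the class labels, the priorities, and every regex are assumptions chosen to mirror a few rows of the table above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int        # lower number = tried earlier
    cls: str             # label attached to a matching token
    pattern: re.Pattern  # must match the whole token


# Hypothetical rule set; the real package's patterns and labels may differ.
RULES = sorted(
    [
        Rule(10, "year_range", re.compile(r"\d{4}\s*-\s*\d{4}")),
        Rule(20, "time", re.compile(r"\d{1,2}:\d{2}")),
        Rule(30, "percent", re.compile(r"\d+%")),
        Rule(40, "decimal", re.compile(r"\d+\.\d+")),
        Rule(50, "comma_grouped", re.compile(r"\d{1,3}(?:,\d{3})+")),
        Rule(60, "per_digit", re.compile(r"\d{7,}")),
        Rule(70, "year", re.compile(r"(?:19|20)\d{2}")),
        Rule(90, "cardinal", re.compile(r"\d+")),
    ],
    key=lambda rule: rule.priority,
)


def classify(token: str) -> str:
    """Return the label of the first rule, in priority order, that matches."""
    for rule in RULES:
        if rule.pattern.fullmatch(token):
            return rule.cls
    return "passthrough"  # no rule matched: the token is left unchanged


for sample in ["1982 - 2024", "9982711", "2024", "232", "ހެލޯ"]:
    print(sample, "->", classify(sample))
```

Because the generic `cardinal` rule sits last, a token such as `9982711` or `2024` is claimed by a more specific rule before the catch-all is ever tried, which is the shadowing behaviour described above.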

Configuration

NormalizerConfig(
    dialect="spoken",            # only option for now
    unknown_latin="passthrough", # "passthrough" | "drop" | "spell"
    decimal_separator="auto",    # "auto" | "dot" | "comma"
    time_system="auto",          # "auto" | "12" | "24"
    currency_default="MVR",
    keep_punctuation=True,
    diagnostic=False,
    strict=False,
)
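The listing above gives each option and, in its comments, the accepted string values. Whether NormalizerConfig itself rejects other values is not stated here, so the following is only an illustrative stand-in: `ConfigSketch` and `_ALLOWED` are hypothetical names, and the validation behaviour is an assumption about how such a config could be checked.

```python
from dataclasses import dataclass

# Allowed values copied from the comments in the listing above.
_ALLOWED = {
    "dialect": {"spoken"},
    "unknown_latin": {"passthrough", "drop", "spell"},
    "decimal_separator": {"auto", "dot", "comma"},
    "time_system": {"auto", "12", "24"},
}


@dataclass(frozen=True)
class ConfigSketch:
    """Illustrative stand-in, NOT the package's NormalizerConfig."""

    dialect: str = "spoken"
    unknown_latin: str = "passthrough"
    decimal_separator: str = "auto"
    time_system: str = "auto"
    currency_default: str = "MVR"
    keep_punctuation: bool = True
    diagnostic: bool = False
    strict: bool = False

    def __post_init__(self) -> None:
        # Reject option values outside the documented choices.
        for field_name, allowed in _ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"{field_name}={value!r} is not one of {sorted(allowed)}"
                )


print(ConfigSketch(unknown_latin="drop"))
```

Making the dataclass frozen means one config can be shared between normalizer instances without the risk that a caller mutates it afterwards.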

Diagnostic mode

Normalizer.trace(text) returns the classified token list instead of joined text. Useful for debugging which rule fired:

from dv_normalize import Normalizer

for tok in Normalizer().trace("ޑރ. އިބްރާހިމް 2024ގައި"):
    print(tok.cls, tok.text, tok.spoken, tok.fields)

Legacy API

The 0.1.x classes (DhivehiNumberConverter, DhivehiTimeConverter, DhivehiYearConverter, DhivehiTextProcessor) are still importable from dv_normalize but emit a DeprecationWarning. They are scheduled for removal in 2.0, so migrate to normalize() or Normalizer.
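When migrating, it can help to find the call sites that still go through deprecated code. The sketch below uses only the standard `warnings` module. `legacy_convert` is a hypothetical stand-in for a 0.1.x entry point, and `find_deprecated_calls` is an illustrative helper, not part of dv_normalize.

```python
import warnings


def legacy_convert(text: str) -> str:
    """Hypothetical stand-in for a deprecated 0.1.x entry point."""
    warnings.warn(
        "legacy_convert is deprecated; use normalize()",
        DeprecationWarning,
        stacklevel=2,
    )
    return text


def find_deprecated_calls(func, *args):
    """Call func and return the DeprecationWarning messages it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every DeprecationWarning, even ones already shown once.
        warnings.simplefilter("always", DeprecationWarning)
        func(*args)
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]


print(find_deprecated_calls(legacy_convert, "232"))
```

For a whole test suite, running Python with `-W error::DeprecationWarning` turns every such warning into an exception, so any remaining legacy call fails loudly.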

License

MIT — see LICENSE.
