Vietnamese Text Normalizer


A Python library for normalizing Vietnamese text, designed for Text-to-Speech (TTS) and Natural Language Processing (NLP) applications. Ported from nghitts.

Features

  • Number Conversion: Numbers to Vietnamese words (123 → một trăm hai mươi ba)
  • Date & Time: Full date/time conversion including date ranges (25-26/12 → hai mươi lăm đến hai mươi sáu tháng mười hai)
  • Currency: VND and USD amounts (50.000đ → năm mươi nghìn đồng)
  • Percentages: Including ranges (3-5% → ba đến năm phần trăm)
  • Year Ranges: 1873-1907 → một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy
  • Ordinals: thứ 2 → thứ hai
  • Phone Numbers: Digit-by-digit reading
  • Measurement Units: 120km/h → một trăm hai mươi ki-lô-mét trên giờ
  • Acronym Expansion: Dictionary-based (NASA → na-sa)
  • Non-Vietnamese Word Replacement: Dictionary-based (container → công-tê-nơ)
  • Rule-based Transliteration: Words NOT in dictionaries are automatically transliterated to Vietnamese phonetics (algorithm → a-go-rít)
  • Vietnamese Word Detection: Automatically detects Vietnamese words and skips them during transliteration
  • Text Cleaning: Removes emojis, URLs, and emails; normalizes Unicode and punctuation
  • Special Characters: &, @ → a còng, # → thăng
  • High Performance: ~0.6ms per call with 17K+ dictionary entries

Installation

pip install vietnormalizer

Or install from source:

git clone https://github.com/nghimestudio/vietnormalizer.git
cd vietnormalizer
pip install -e .

Quick Start

from vietnormalizer import VietnameseNormalizer

normalizer = VietnameseNormalizer()

# Numbers, dates, and times
normalizer.normalize("Hôm nay là 25/12/2023, lúc 14:30")
# → "hôm nay là ngày hai mươi lăm tháng mười hai năm hai nghìn không trăm hai mươi ba, lúc mười bốn giờ ba mươi phút"

# Non-Vietnamese word replacement (from built-in dictionary)
normalizer.normalize("Hello container from Singapore")
# → "hê-lô công-tê-nơ phờ-rôm xin-ga-po"

# Acronym expansion (from built-in dictionary)
normalizer.normalize("Tôi xem TV và dùng AI hàng ngày")
# → "tôi xem ti vi và dùng trí tuệ nhân tạo hàng ngày"

# Rule-based transliteration for words NOT in dictionary
normalizer.normalize("database server configuration")
# → "đa-ta-bê xơ-vơ con-phi-gu-raân"

# Measurement units
normalizer.normalize("Tốc độ 120km/h, diện tích 500m2")
# → "tốc độ một trăm hai mươi ki-lô-mét trên giờ, diện tích năm trăm mét vuông"

# Currency with thousand separators
normalizer.normalize("Giá 50.000đ cho mỗi người")
# → "giá năm mươi nghìn đồng cho mỗi người"

# Year ranges, ordinals, percentages
normalizer.normalize("1873-1907, thứ 2, tăng 6,5%")
# → "một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy, thứ hai, tăng sáu phẩy năm phần trăm"

# Date ranges
normalizer.normalize("ngày 25-26/12/2023")
# → "ngày hai mươi lăm đến hai mươi sáu tháng mười hai năm hai nghìn không trăm hai mươi ba"

# Percentage ranges
normalizer.normalize("3-5% dân số")
# → "ba đến năm phần trăm dân số"

Transliteration Control

# Disable transliteration (only use CSV dictionary replacements)
normalizer = VietnameseNormalizer(enable_transliteration=False)
normalizer.normalize("machine learning algorithm")
# → "ma-sin li-nin algorithm"  (words in CSV replaced, others kept as-is)

# Or override per-call
normalizer = VietnameseNormalizer(enable_transliteration=True)
normalizer.normalize("machine learning", enable_transliteration=False)

Vietnamese Word Detection

from vietnormalizer import is_vietnamese_word

is_vietnamese_word("xin")      # True (valid Vietnamese structure)
is_vietnamese_word("chào")     # True (has Vietnamese diacritics)
is_vietnamese_word("database") # False (invalid Vietnamese syllable structure)
is_vietnamese_word("flow")     # False (contains 'f' and 'w')
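
The idea behind such a check can be illustrated with a small stdlib-only sketch (this is not the library's actual algorithm — `looks_vietnamese`, the letter set, and the syllable pattern below are simplified assumptions): reject letters that never occur in Vietnamese orthography, then test whether the diacritic-stripped word matches a rough single-syllable pattern.

```python
import re
import unicodedata

# Letters that never appear in Vietnamese orthography
FOREIGN_LETTERS = set("fjwz")

# Very rough single-syllable pattern over the diacritic-stripped word:
# optional onset, a vowel nucleus, optional final consonant.
SYLLABLE = re.compile(
    r"(?:ngh|ng|nh|ph|th|tr|ch|gh|gi|kh|qu|[bcdghklmnpqrstvx])?"
    r"[aeiouy]+"
    r"(?:ch|ng|nh|[cmnpt])?"
)

def looks_vietnamese(word: str) -> bool:
    """Heuristic: Vietnamese tokens are single syllables with a limited shape."""
    w = unicodedata.normalize("NFC", word.lower())
    if any(ch in FOREIGN_LETTERS for ch in w):
        return False  # e.g. "flow" fails on 'f' and 'w'
    # Strip tone marks to test the bare consonant-vowel skeleton
    bare = "".join(c for c in unicodedata.normalize("NFD", w)
                   if not unicodedata.combining(c)).replace("đ", "d")
    return SYLLABLE.fullmatch(bare) is not None

looks_vietnamese("xin")       # True
looks_vietnamese("chào")      # True
looks_vietnamese("database")  # False (three syllables in one token)
looks_vietnamese("flow")      # False
```

Multi-syllable tokens like "database" fail because the whole word must match one syllable; the real detector is of course more thorough about valid onsets, nuclei, and codas.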

Direct Transliteration

from vietnormalizer import transliterate_word, english_to_vietnamese

transliterate_word("database")   # "đa-ta-bâi" (checks if Vietnamese first)
transliterate_word("xin")        # "xin" (detected as Vietnamese, kept as-is)
english_to_vietnamese("computer") # "com-pu-tơ" (always transliterates)

Custom Dictionaries

normalizer = VietnameseNormalizer(
    acronyms_path="path/to/acronyms.csv",
    non_vietnamese_words_path="path/to/words.csv"
)

# Or specify a directory containing both files
normalizer = VietnameseNormalizer(data_dir="path/to/data/")

# Reload dictionaries at runtime
normalizer.reload_dictionaries(acronyms_path="path/to/updated.csv")

CSV Formats

acronyms.csv:

acronym,transliteration
NASA,na-sa
GDP,tổng sản phẩm quốc nội
AI,trí tuệ nhân tạo

non-vietnamese-words.csv:

original,transliteration
container,công-tê-nơ
singapore,xin-ga-po
server,xơ-vơ
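Both files are plain two-column CSVs: a header row, then one source/replacement pair per line. A minimal loader (illustrative only — `load_replacements` is not part of the library's API) just builds a dict for O(1) lookups:

```python
import csv
import io

def load_replacements(csv_text: str) -> dict:
    """Parse a two-column CSV (header row + source,replacement pairs)
    into a plain dict keyed by the lowercased source token."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return {src.lower(): repl for src, repl in reader if src}

acronyms = load_replacements(
    "acronym,transliteration\nNASA,na-sa\nAI,trí tuệ nhân tạo\n"
)
acronyms["nasa"]  # 'na-sa'
```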

Advanced Usage

Using the Processor Directly

from vietnormalizer import VietnameseTextProcessor

processor = VietnameseTextProcessor()

# Convert numbers
processor.number_to_words("123")  # "một trăm hai mươi ba"

# Process text (numbers, dates, times, units - no dictionary replacements)
processor.process_vietnamese_text("Giá 50.000đ lúc 15h30")
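
For intuition, the digit-reading rules that `number_to_words` applies (the mốt/lăm/lẻ special cases visible in the examples above) can be sketched for 0-999 — this is an independent illustration, not the library's implementation:

```python
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]

def small_number_to_words(n: int) -> str:
    """Vietnamese reading for 0-999, illustrating the mốt/lăm/lẻ rules."""
    if n < 10:
        return DIGITS[n]
    if n < 100:
        tens, units = divmod(n, 10)
        words = "mười" if tens == 1 else DIGITS[tens] + " mươi"
        if units == 0:
            return words
        if units == 1 and tens > 1:
            return words + " mốt"      # 21 → hai mươi mốt
        if units == 5:
            return words + " lăm"      # 25 → hai mươi lăm
        return words + " " + DIGITS[units]
    hundreds, rest = divmod(n, 100)
    words = DIGITS[hundreds] + " trăm"
    if rest == 0:
        return words
    if rest < 10:
        return words + " lẻ " + DIGITS[rest]   # 907 → chín trăm lẻ bảy
    return words + " " + small_number_to_words(rest)

small_number_to_words(123)  # 'một trăm hai mươi ba'
```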

Disable Preprocessing

# Only apply dictionary replacements and transliteration, skip number/date conversion
normalizer.normalize(text, enable_preprocessing=False)

Processing Pipeline

The normalization follows this pipeline (matching nghitts):

  1. Unicode normalization (NFC)
  2. Special character replacement (&, @ → a còng; URL/email removal)
  3. Punctuation normalization
  4. Text cleaning (emojis, non-Latin chars)
  5. Year range conversion
  6. Percentage range conversion
  7. Date/time conversion (including ranges)
  8. Ordinal conversion
  9. Thousand separator removal
  10. Currency conversion
  11. Remaining percentage conversion
  12. Phone number conversion
  13. Decimal conversion
  14. Measurement unit conversion
  15. Standalone number conversion
  16. Lowercase normalization
  17. Acronym replacement (from CSV)
  18. Non-Vietnamese word replacement (from CSV)
  19. Rule-based transliteration (for remaining non-Vietnamese words)
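
The first few steps are ordinary text hygiene and can be approximated with the standard library — a rough sketch of steps 1-3, not the library's actual patterns (`pre_clean` and both regexes are assumptions for illustration):

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def pre_clean(text: str) -> str:
    """Approximate the early pipeline stages."""
    text = unicodedata.normalize("NFC", text)   # 1. Unicode NFC normalization
    text = URL_RE.sub(" ", text)                # 2. strip URLs
    text = EMAIL_RE.sub(" ", text)              #    and email addresses
    text = re.sub(r"[“”]", '"', text)           # 3. normalize curly quotes
    text = re.sub(r"\s+", " ", text).strip()    #    collapse whitespace
    return text

pre_clean("Xem https://example.com nhé!")  # 'Xem nhé!'
```

Later stages (5-15) each run a pre-compiled pattern over the text in the fixed order above, so that ranges (years, dates, percentages) are consumed before the more general number rules see them.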

Performance

  • ~0.6ms per normalization call with 17K+ dictionary entries
  • All regex patterns pre-compiled at initialization
  • Dictionary lookups use O(1) hash map instead of regex alternation
  • Total initialization time: ~40ms
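
The hash-map point is worth unpacking: instead of compiling one enormous `(?:word1|word2|...)` alternation over 17K+ entries, a single generic token pattern finds words and each hit is resolved with an O(1) dict probe. A sketch of that strategy (names and patterns here are illustrative, not the library's internals):

```python
import re

# In practice this dict holds the 17K+ CSV entries
replacements = {"container": "công-tê-nơ", "server": "xơ-vơ"}

TOKEN_RE = re.compile(r"\w+")  # one cheap, generic token pattern

def replace_tokens(text: str) -> str:
    # One regex pass to find tokens, then an O(1) dict probe per token —
    # cost stays flat as the dictionary grows, unlike a giant alternation.
    return TOKEN_RE.sub(
        lambda m: replacements.get(m.group(0).lower(), m.group(0)), text
    )

replace_tokens("Hello container and server")  # 'Hello công-tê-nơ and xơ-vơ'
```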

Requirements

  • Python 3.8+
  • No external dependencies (uses only standard library)

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

Ported from the JavaScript implementations in nghitts.
