
Vietnamese Text Normalizer


A Python library for normalizing Vietnamese text, designed for Text-to-Speech (TTS) and Natural Language Processing (NLP) applications. Ported from nghitts.

Features

  • Number Conversion: Numbers to Vietnamese words (123 → một trăm hai mươi ba)
  • Date & Time: Full date/time conversion including date ranges (25-26/12 → hai mươi lăm đến hai mươi sáu tháng mười hai)
  • Currency: VND and USD amounts (50.000đ → năm mươi nghìn đồng)
  • Percentages: Including ranges (3-5% → ba đến năm phần trăm)
  • Year Ranges: 1873-1907 → một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy
  • Ordinals: thứ 2 → thứ hai
  • Phone Numbers: Digit-by-digit reading
  • Measurement Units: 120km/h → một trăm hai mươi ki-lô-mét trên giờ
  • Acronym Expansion: Dictionary-based (NASA → na-sa)
  • Non-Vietnamese Word Replacement: Dictionary-based (container → công-tê-nơ)
  • Rule-based Transliteration: Words NOT in dictionaries are automatically transliterated to Vietnamese phonetics (algorithm → a-go-rít)
  • Vietnamese Word Detection: Automatically detects Vietnamese words and skips them during transliteration
  • Text Cleaning: Removes emojis, URLs, and emails; normalizes Unicode and punctuation
  • Special Characters: &, @ → a còng, # → thăng
  • High Performance: ~0.6ms per call with 17K+ dictionary entries

Installation

pip install vietnormalizer

Or install from source:

git clone https://github.com/nghimestudio/vietnormalizer.git
cd vietnormalizer
pip install -e .

Quick Start

from vietnormalizer import VietnameseNormalizer

normalizer = VietnameseNormalizer()

# Numbers, dates, and times
normalizer.normalize("Hôm nay là 25/12/2023, lúc 14:30")
# → "hôm nay là ngày hai mươi lăm tháng mười hai năm hai nghìn không trăm hai mươi ba, lúc mười bốn giờ ba mươi phút"

# Non-Vietnamese word replacement (from built-in dictionary)
normalizer.normalize("Hello container from Singapore")
# → "hê-lô công-tê-nơ phờ-rôm xin-ga-po"

# Acronym expansion (from built-in dictionary)
normalizer.normalize("Tôi xem TV và dùng AI hàng ngày")
# → "tôi xem ti vi và dùng trí tuệ nhân tạo hàng ngày"

# Rule-based transliteration for words NOT in dictionary
normalizer.normalize("database server configuration")
# → "đa-ta-bê xơ-vơ con-phi-gu-raân"

# Measurement units
normalizer.normalize("Tốc độ 120km/h, diện tích 500m2")
# → "tốc độ một trăm hai mươi ki-lô-mét trên giờ, diện tích năm trăm mét vuông"

# Currency with thousand separators
normalizer.normalize("Giá 50.000đ cho mỗi người")
# → "giá năm mươi nghìn đồng cho mỗi người"

# Year ranges, ordinals, percentages
normalizer.normalize("1873-1907, thứ 2, tăng 6,5%")
# → "một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy, thứ hai, tăng sáu phẩy năm phần trăm"

# Date ranges
normalizer.normalize("ngày 25-26/12/2023")
# → "ngày hai mươi lăm đến hai mươi sáu tháng mười hai năm hai nghìn không trăm hai mươi ba"

# Percentage ranges
normalizer.normalize("3-5% dân số")
# → "ba đến năm phần trăm dân số"

Transliteration Control

# Disable transliteration (only use CSV dictionary replacements)
normalizer = VietnameseNormalizer(enable_transliteration=False)
normalizer.normalize("machine learning algorithm")
# → "ma-sin li-nin algorithm"  (words in CSV replaced, others kept as-is)

# Or override per-call
normalizer = VietnameseNormalizer(enable_transliteration=True)
normalizer.normalize("machine learning", enable_transliteration=False)

Vietnamese Word Detection

from vietnormalizer import is_vietnamese_word

is_vietnamese_word("xin")      # True (valid Vietnamese structure)
is_vietnamese_word("chào")     # True (has Vietnamese diacritics)
is_vietnamese_word("database") # False (invalid Vietnamese syllable structure)
is_vietnamese_word("flow")     # False (contains 'f' and 'w')
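The checks above can be approximated with a small heuristic. This is a sketch of the general idea (foreign letters, diacritics, syllable shape), not the library's actual implementation; the character sets and syllable regex are simplified assumptions:

```python
import re
import unicodedata

# Letters that do not occur in Vietnamese orthography.
FOREIGN_LETTERS = set("fjwz")
# A sample of Vietnamese-specific characters (the full set is larger).
VIETNAMESE_CHARS = set("àáảãạăằắẳẵặâầấẩẫậèéẻẽẹêềếểễệìíỉĩịòóỏõọ"
                       "ôồốổỗộơờớởỡợùúủũụưừứửữựỳýỷỹỵđ")
# One Vietnamese syllable: optional onset, vowel nucleus, optional final.
SYLLABLE = re.compile(
    r"(?:ngh|ng|nh|gh|gi|kh|ph|qu|th|tr|ch|[bcdghklmnprstvx])?"
    r"[aeiouy]+"
    r"(?:ng|nh|ch|[cmnpt])?"
)

def looks_vietnamese(word: str) -> bool:
    """Rough check: reject foreign letters, accept diacritics,
    otherwise require a single valid syllable shape."""
    w = unicodedata.normalize("NFC", word.lower())
    if any(c in FOREIGN_LETTERS for c in w):
        return False          # e.g. "flow" contains 'f' and 'w'
    if any(c in VIETNAMESE_CHARS for c in w):
        return True           # e.g. "chào" carries Vietnamese diacritics
    return SYLLABLE.fullmatch(w) is not None  # "xin" passes, "database" fails
```

A multi-syllable ASCII word like "database" fails the single-syllable match, which mirrors why the library rejects it.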

Direct Transliteration

from vietnormalizer import transliterate_word, english_to_vietnamese

transliterate_word("database")   # "đa-ta-bâi" (checks if Vietnamese first)
transliterate_word("xin")        # "xin" (detected as Vietnamese, kept as-is)
english_to_vietnamese("computer") # "com-pu-tơ" (always transliterates)

Custom Dictionaries

normalizer = VietnameseNormalizer(
    acronyms_path="path/to/acronyms.csv",
    non_vietnamese_words_path="path/to/words.csv"
)

# Or specify a directory containing both files
normalizer = VietnameseNormalizer(data_dir="path/to/data/")

# Reload dictionaries at runtime
normalizer.reload_dictionaries(acronyms_path="path/to/updated.csv")

CSV Formats

acronyms.csv:

acronym,transliteration
NASA,na-sa
GDP,tổng sản phẩm quốc nội
AI,trí tuệ nhân tạo

non-vietnamese-words.csv:

original,transliteration
container,công-tê-nơ
singapore,xin-ga-po
server,xơ-vơ

Advanced Usage

Using the Processor Directly

from vietnormalizer import VietnameseTextProcessor

processor = VietnameseTextProcessor()

# Convert numbers
processor.number_to_words("123")  # "một trăm hai mươi ba"

# Process text (numbers, dates, times, units - no dictionary replacements)
processor.process_vietnamese_text("Giá 50.000đ lúc 15h30")

Disable Preprocessing

# Only apply dictionary replacements and transliteration, skip number/date conversion
normalizer.normalize(text, enable_preprocessing=False)

Processing Pipeline

The normalization follows this pipeline (matching nghitts):

  1. Unicode normalization (NFC)
  2. Special character replacement (&, @ → a còng, URL/email removal)
  3. Punctuation normalization
  4. Text cleaning (emojis, non-Latin chars)
  5. Year range conversion
  6. Percentage range conversion
  7. Date/time conversion (including ranges)
  8. Ordinal conversion
  9. Thousand separator removal
  10. Currency conversion
  11. Remaining percentage conversion
  12. Phone number conversion
  13. Decimal conversion
  14. Measurement unit conversion
  15. Standalone number conversion
  16. Lowercase normalization
  17. Acronym replacement (from CSV)
  18. Non-Vietnamese word replacement (from CSV)
  19. Rule-based transliteration (for remaining non-Vietnamese words)
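The ordering matters: for example, thousand separators must be stripped (step 9) before standalone numbers are read (step 15), or "50.000" would be misread as a decimal. A toy three-step pipeline in this style, using a deliberately simplified digit-by-digit reading rather than the library's full numeral builder:

```python
import re

def remove_thousand_separators(text: str) -> str:
    # "50.000" -> "50000" (dot followed by exactly three digits)
    return re.sub(r"(?<=\d)\.(?=\d{3}(?!\d))", "", text)

UNITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]

def digits_to_words(text: str) -> str:
    # Simplified: read each digit out loud (the real step builds full numerals).
    return re.sub(r"\d+", lambda m: " ".join(UNITS[int(d)] for d in m.group()), text)

# Each step is a pure str -> str function applied in a fixed order.
PIPELINE = [remove_thousand_separators, digits_to_words, str.lower]

def normalize(text: str) -> str:
    for step in PIPELINE:
        text = step(text)
    return text
```

Structuring each step as a pure string-to-string function keeps the order explicit and individual steps easy to test in isolation.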

Performance

  • ~0.6ms per normalization call with 17K+ dictionary entries
  • All regex patterns pre-compiled at initialization
  • Dictionary lookups use O(1) hash map instead of regex alternation
  • Total initialization time: ~40ms
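The hash-map-versus-alternation point can be illustrated with a toy benchmark. This is not the library's code; the word lists and timings are hypothetical, and it only contrasts the two lookup strategies:

```python
import re
import timeit

# A synthetic dictionary standing in for the real 17K+ entries.
words = {f"word{i}": f"rep{i}" for i in range(2000)}
# Strategy 1: one big regex alternation over every dictionary key.
pattern = re.compile(r"\b(" + "|".join(words) + r")\b")
# Strategy 2: tokenize, then one O(1) hash lookup per token.
text = "word123 hello word456 world " * 10

def regex_replace(t: str) -> str:
    return pattern.sub(lambda m: words[m.group(1)], t)

def hash_replace(t: str) -> str:
    return " ".join(words.get(tok, tok) for tok in t.split())

t_regex = timeit.timeit(lambda: regex_replace(text), number=50)
t_hash = timeit.timeit(lambda: hash_replace(text), number=50)
```

Both strategies produce the same replacements; the regex alternation's cost grows with the number of dictionary entries, while the hash lookup does not.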

Requirements

  • Python 3.8+
  • No external dependencies (uses only standard library)

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

Ported from the JavaScript implementations in nghitts.
