A Python library for normalizing Vietnamese text for TTS and NLP applications

Vietnamese Text Normalizer

A Python library for normalizing Vietnamese text, designed for Text-to-Speech (TTS) and Natural Language Processing (NLP) applications. Ported from nghitts.

Features

  • Number Conversion: Numbers to Vietnamese words (123 → một trăm hai mươi ba)
  • Date & Time: Full date/time conversion including date ranges (25-26/12 → hai mươi lăm đến hai mươi sáu tháng mười hai)
  • Currency: VND and USD amounts (50.000đ → năm mươi nghìn đồng)
  • Percentages: Including ranges (3-5% → ba đến năm phần trăm)
  • Year Ranges: 1873-1907 → một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy
  • Ordinals: thứ 2 → thứ hai
  • Phone Numbers: Digit-by-digit reading
  • Measurement Units: 120km/h → một trăm hai mươi ki-lô-mét trên giờ
  • Acronym Expansion: Dictionary-based (NASA → na-sa)
  • Non-Vietnamese Word Replacement: Dictionary-based (container → công-tê-nơ)
  • Optional Rule-based Transliteration: Words NOT in dictionaries can be transliterated to Vietnamese phonetics when enabled (algorithm → a-go-rít)
  • Vietnamese Word Detection: Automatically detects Vietnamese words and skips them during transliteration
  • Text Cleaning: Removes emojis, URLs, and emails; normalizes Unicode and punctuation
  • Special Characters: &, @ → a còng, # → thăng
  • High Performance: ~0.6ms per call with 17K+ dictionary entries

Installation

pip install vietnormalizer-thuan

Or install from source:

git clone https://github.com/iamdinhthuan/vietnormalizer.git
cd vietnormalizer
pip install -e .

Quick Start

from vietnormalizer import VietnameseNormalizer

normalizer = VietnameseNormalizer()

# Numbers, dates, and times
normalizer.normalize("Hôm nay là 25/12/2023, lúc 14:30")
# → "hôm nay là ngày hai mươi lăm tháng mười hai năm hai nghìn không trăm hai mươi ba, lúc mười bốn giờ ba mươi phút"

# Non-Vietnamese word replacement (from built-in dictionary) applies only when enabled; by default the words are just lowercased
normalizer.normalize("Hello container from Singapore")
# → "hello container from singapore"

# Acronym expansion (from built-in dictionary)
normalizer.normalize("Tôi xem TV và dùng AI hàng ngày")
# → "tôi xem ti vi và dùng trí tuệ nhân tạo hàng ngày"

# By default, English words are kept as-is
normalizer.normalize("database server configuration")
# → "database server configuration"

# Measurement units
normalizer.normalize("Tốc độ 120km/h, diện tích 500m2")
# → "tốc độ một trăm hai mươi ki-lô-mét trên giờ, diện tích năm trăm mét vuông"

# Currency with thousand separators
normalizer.normalize("Giá 50.000đ cho mỗi người")
# → "giá năm mươi nghìn đồng cho mỗi người"

# Year ranges, ordinals, percentages
normalizer.normalize("1873-1907, thứ 2, tăng 6,5%")
# → "một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy, thứ hai, tăng sáu phẩy năm phần trăm"

# Date ranges
normalizer.normalize("ngày 25-26/12/2023")
# → "ngày hai mươi lăm đến hai mươi sáu tháng mười hai năm hai nghìn không trăm hai mươi ba"

# Percentage ranges
normalizer.normalize("3-5% dân số")
# → "ba đến năm phần trăm dân số"

Transliteration Control

# English-to-Vietnamese pronunciation is disabled by default; enable it explicitly when needed
normalizer = VietnameseNormalizer(enable_transliteration=True)
normalizer.normalize("Hello container from Singapore")
# → "hê-lô công-tê-nơ phờ-rôm xin-ga-po"

normalizer.normalize("machine learning algorithm")
# → "ma-xin lơn-ning a-go-rít"

# Or override per-call
normalizer.normalize("machine learning", enable_transliteration=False)

Vietnamese Word Detection

from vietnormalizer import is_vietnamese_word

is_vietnamese_word("xin")      # True (valid Vietnamese structure)
is_vietnamese_word("chào")     # True (has Vietnamese diacritics)
is_vietnamese_word("database") # False (contains 'b' ending, invalid structure)
is_vietnamese_word("flow")     # False (contains 'f' and 'w')
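
The checks in the comments above (diacritics, foreign letters, syllable structure) can be approximated with a small heuristic. This is a minimal sketch and not the library's implementation: the name looks_vietnamese, the onset and coda lists, and the single-syllable assumption are all hypothetical simplifications. It will accept some short English words such as "tin" and any word carrying a combining accent.

```python
import re
import unicodedata

# Letters that do not occur in native Vietnamese spelling.
FOREIGN_LETTERS = set("fjwz")

# Hypothetical, simplified syllable shape: optional onset, 1-3 vowels, optional coda.
ONSETS = "ngh|ng|nh|gh|gi|kh|ph|qu|th|tr|ch|[bcdghklmnprstvx]"
CODAS = "ng|nh|ch|[cmnpt]"
SYLLABLE = re.compile(rf"(?:{ONSETS})?[aeiouy]{{1,3}}(?:{CODAS})?")


def has_vietnamese_diacritics(word: str) -> bool:
    """True if the word contains đ or any combining mark after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "đ" in decomposed or any(unicodedata.combining(ch) for ch in decomposed)


def looks_vietnamese(word: str) -> bool:
    """Rough single-syllable check; an assumption-laden stand-in, not is_vietnamese_word."""
    word = word.lower()
    if has_vietnamese_diacritics(word):
        return True
    if FOREIGN_LETTERS & set(word):
        return False
    # Unaccented words must fit the onset + vowels + coda shape exactly.
    return SYLLABLE.fullmatch(word) is not None


for sample in ("xin", "chào", "database", "flow"):
    print(sample, looks_vietnamese(sample))
```

Checking diacritics first keeps the common case cheap; the syllable pattern only runs for unaccented ASCII words.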

Direct Transliteration

from vietnormalizer import transliterate_word, english_to_vietnamese

transliterate_word("database")   # "đa-ta-bâi" (checks if Vietnamese first)
transliterate_word("xin")        # "xin" (detected as Vietnamese, kept as-is)
english_to_vietnamese("computer") # "com-pu-tơ" (always transliterates)

Custom Dictionaries

normalizer = VietnameseNormalizer(
    acronyms_path="path/to/acronyms.csv",
    non_vietnamese_words_path="path/to/words.csv"
)

# Or specify a directory containing both files
normalizer = VietnameseNormalizer(data_dir="path/to/data/")

# Reload dictionaries at runtime
normalizer.reload_dictionaries(acronyms_path="path/to/updated.csv")

CSV Formats

acronyms.csv:

acronym,transliteration
NASA,na-sa
GDP,tổng sản phẩm quốc nội
AI,trí tuệ nhân tạo

non-vietnamese-words.csv:

original,transliteration
container,công-tê-nơ
singapore,xin-ga-po
server,xơ-vơ
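
Before pointing the normalizer at a custom file, it can help to check that the CSV parses into the mapping you expect. The sketch below uses only the standard csv module; load_mapping is a hypothetical helper, and lowercasing the keys is an assumption based on the pipeline lowercasing text before dictionary replacement.

```python
import csv
import io


def load_mapping(csv_file, key_column, value_column="transliteration"):
    """Read a two-column CSV into a dict keyed by the lowercased source term."""
    reader = csv.DictReader(csv_file)
    return {
        row[key_column].strip().lower(): row[value_column].strip()
        for row in reader
        if row.get(key_column) and row.get(value_column)  # skip incomplete rows
    }


# Same rows as the acronyms.csv example; a real file would be opened with
# open(path, newline="", encoding="utf-8") instead of io.StringIO.
ACRONYMS_CSV = """acronym,transliteration
NASA,na-sa
GDP,tổng sản phẩm quốc nội
AI,trí tuệ nhân tạo
"""

acronyms = load_mapping(io.StringIO(ACRONYMS_CSV), key_column="acronym")
print(acronyms["gdp"])  # tổng sản phẩm quốc nội
```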

Advanced Usage

Using the Processor Directly

from vietnormalizer import VietnameseTextProcessor

processor = VietnameseTextProcessor()

# Convert numbers
processor.number_to_words("123")  # "một trăm hai mươi ba"

# Process text (numbers, dates, times, units - no dictionary replacements)
processor.process_vietnamese_text("Giá 50.000đ lúc 15h30")
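
For readers curious about the reading rules behind outputs such as "một nghìn chín trăm lẻ bảy", here is a minimal standalone sketch for integers up to 999,999. It is not the library's code: read_number and its helpers are hypothetical names, and the rules covered (mười vs. mươi, mốt, lăm, lẻ, and the "không trăm" filler) are only the ones visible in the examples in this README.

```python
UNITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def read_two_digits(n: int) -> str:
    """Read 0..99."""
    tens, unit = divmod(n, 10)
    if tens == 0:
        return UNITS[unit]
    if tens == 1:
        head = "mười"
        tail = {0: "", 5: "lăm"}.get(unit, UNITS[unit])
    else:
        head = f"{UNITS[tens]} mươi"
        tail = {0: "", 1: "mốt", 5: "lăm"}.get(unit, UNITS[unit])
    return f"{head} {tail}".strip()


def read_three_digits(n: int, pad: bool = False) -> str:
    """Read 0..999; pad=True forces a hundreds word (e.g. 'không trăm') after a higher group."""
    hundreds, rest = divmod(n, 100)
    if not hundreds and not pad:
        return read_two_digits(rest)
    parts = [f"{UNITS[hundreds]} trăm"]
    if 0 < rest < 10:
        parts.append(f"lẻ {UNITS[rest]}")  # 907 -> chín trăm lẻ bảy
    elif rest:
        parts.append(read_two_digits(rest))
    return " ".join(parts)


def read_number(n: int) -> str:
    """Read 0..999_999 as Vietnamese words."""
    if not 0 <= n <= 999_999:
        raise ValueError("this sketch only covers 0..999999")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return read_three_digits(rest)
    words = f"{read_three_digits(thousands)} nghìn"
    if rest:
        words += " " + read_three_digits(rest, pad=True)
    return words


print(read_number(123))   # một trăm hai mươi ba
print(read_number(1907))  # một nghìn chín trăm lẻ bảy
print(read_number(2023))  # hai nghìn không trăm hai mươi ba
```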

Disable Preprocessing

# Only apply dictionary replacements and transliteration; skip number/date conversion
text = "Tôi xem TV lúc 14:30"
normalizer.normalize(text, enable_preprocessing=False)

Processing Pipeline

The normalization follows this pipeline (matching nghitts):

  1. Unicode normalization (NFC)
  2. Special character replacement (&, @ → a còng, URL/email removal)
  3. Punctuation normalization
  4. Text cleaning (emojis, non-Latin chars)
  5. Year range conversion
  6. Percentage range conversion
  7. Date/time conversion (including ranges)
  8. Ordinal conversion
  9. Thousand separator removal
  10. Currency conversion
  11. Remaining percentage conversion
  12. Phone number conversion
  13. Decimal conversion
  14. Measurement unit conversion
  15. Standalone number conversion
  16. Lowercase normalization
  17. Acronym replacement (from CSV)
  18. Non-Vietnamese word replacement (from CSV)
  19. Optional rule-based transliteration (for remaining non-Vietnamese words)
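
The order of the steps matters: range patterns have to run before the plain percentage and standalone number steps, or the range is split into pieces. The toy pipeline below illustrates this with a handful of stages. It is a sketch under heavy simplification (single digits only, whitespace tokenization), and every function name in it is hypothetical rather than part of vietnormalizer.

```python
import re
import unicodedata

DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]
ACRONYMS = {"gdp": "tổng sản phẩm quốc nội"}  # sample entry, as in acronyms.csv

# Patterns are compiled once, up front.
PERCENT_RANGE = re.compile(r"(\d)-(\d)%")
PERCENT = re.compile(r"(\d)%")
DIGIT = re.compile(r"\d")


def nfc(text):
    return unicodedata.normalize("NFC", text)


def percent_ranges(text):
    return PERCENT_RANGE.sub(
        lambda m: f"{DIGITS[int(m[1])]} đến {DIGITS[int(m[2])]} phần trăm", text
    )


def percents(text):
    return PERCENT.sub(lambda m: f"{DIGITS[int(m[1])]} phần trăm", text)


def digits(text):
    return DIGIT.sub(lambda m: DIGITS[int(m[0])], text)


def lowercase(text):
    return text.lower()


def acronyms(text):
    return " ".join(ACRONYMS.get(token, token) for token in text.split())


STAGES = [nfc, percent_ranges, percents, digits, lowercase, acronyms]


def run(text, stages=STAGES):
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        text = stage(text)
    return text


print(run("Tăng 3-5% GDP"))
# tăng ba đến năm phần trăm tổng sản phẩm quốc nội
```

Swapping percent_ranges and percents in STAGES yields "tăng ba-năm phần trăm ..." instead, because the single-percentage pattern consumes "5%" before the range pattern can see it.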

Performance

  • ~0.6ms per normalization call with 17K+ dictionary entries
  • All regex patterns pre-compiled at initialization
  • Dictionary lookups use an O(1) hash map instead of regex alternation
  • Total initialization time: ~40ms
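
To make the hash-map point concrete, the sketch below contrasts the two approaches on a three-entry dictionary. It only shows that both produce the same output; it is not a benchmark and not the library's code, and the function names are hypothetical. With an alternation, the compiled pattern grows with the dictionary, whereas the lookup variant keeps one fixed word pattern and does a dict lookup per token.

```python
import re

# Sample entries, as in non-vietnamese-words.csv.
REPLACEMENTS = {"container": "công-tê-nơ", "singapore": "xin-ga-po", "server": "xơ-vơ"}

# Approach A: one alternation over every key; the pattern grows with the dictionary.
ALTERNATION = re.compile(r"\b(?:" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b")


def replace_with_alternation(text):
    return ALTERNATION.sub(lambda m: REPLACEMENTS[m[0]], text)


# Approach B: one fixed word pattern, compiled once, plus a hash lookup per token.
WORD = re.compile(r"\w+")


def replace_with_lookup(text):
    return WORD.sub(lambda m: REPLACEMENTS.get(m[0], m[0]), text)


sample = "hello container from singapore"
print(replace_with_alternation(sample))  # hello công-tê-nơ from xin-ga-po
print(replace_with_lookup(sample))       # hello công-tê-nơ from xin-ga-po
```

Timing both functions with the standard timeit module on a dictionary of realistic size is a reasonable way to check the difference on your own data.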

Requirements

  • Python 3.8+
  • No external dependencies (uses only standard library)

Publishing

To release a new version to PyPI, see docs/publish-to-pypi.md. Quick path: bump the version in pyproject.toml, setup.py, and vietnormalizer/__init__.py, then run ./scripts/publish-to-pypi.sh.

License

MIT License

Contributing

Contributions are welcome! See CONTRIBUTING.md for a short guide (fork, quick wins like fixing typos, and how to open a Pull Request).

Acknowledgments

Ported from the JavaScript implementations in nghitts.
