A Python library for normalizing Vietnamese text for TTS and NLP applications

Vietnamese Text Normalizer

A Python library for normalizing Vietnamese text, designed for Text-to-Speech (TTS) and Natural Language Processing (NLP) applications. Ported from nghitts.

Features

Number Conversion: Numbers to Vietnamese words (123 → một trăm hai mươi ba)
Date & Time: Full date/time conversion including date ranges (25-26/12 → hai mươi lăm đến hai mươi sáu tháng mười hai)
Currency: VND and USD amounts (50.000đ → năm mươi nghìn đồng)
Percentages: Including ranges (3-5% → ba đến năm phần trăm)
Year Ranges: 1873-1907 → một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy
Ordinals: thứ 2 → thứ hai
Phone Numbers: Digit-by-digit reading
Measurement Units: 120km/h → một trăm hai mươi ki-lô-mét trên giờ
Acronym Expansion: Dictionary-based (NASA → na-sa)
Non-Vietnamese Word Replacement: Dictionary-based (container → công-tê-nơ)
Optional Rule-based Transliteration: Words NOT in dictionaries can be transliterated to Vietnamese phonetics when enabled (algorithm → a-go-rít)
Vietnamese Word Detection: Automatically detects Vietnamese words and skips them during transliteration
Text Cleaning: Removes emojis, URLs, emails, normalizes Unicode and punctuation
Special Characters: & → và, @ → a còng, # → thăng
High Performance: ~0.6ms per call with 17K+ dictionary entries
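
The text-cleaning and special-character features can be pictured with a small standard-library sketch. This is not the library's code: the patterns, the `SPECIAL` table, and the function name `clean_text_sketch` are assumptions made for illustration, and emoji removal is left out.

```python
import re
import unicodedata

# Illustrative patterns only; the library's real patterns may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
SPECIAL = {"&": " và ", "@": " a còng ", "#": " thăng "}


def clean_text_sketch(text):
    """Normalize to NFC, drop URLs and emails, then read out special characters."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    # Emails are removed before '@' is replaced, so addresses are not read aloud.
    text = EMAIL_RE.sub(" ", text)
    for char, spoken in SPECIAL.items():
        text = text.replace(char, spoken)
    # Collapse the extra whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_text_sketch("Tom & Jerry"))
print(clean_text_sketch("Liên hệ a@b.com hoặc https://x.vn nhé"))
```

Removing URLs and emails before replacing `@` matters here; in the opposite order an address would be partially spelled out.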

Installation

pip install vietnormalizer-thuan

Or install from source:

git clone https://github.com/iamdinhthuan/vietnormalizer.git
cd vietnormalizer
pip install -e .

Quick Start

from vietnormalizer import VietnameseNormalizer

normalizer = VietnameseNormalizer()

# Numbers, dates, and times
normalizer.normalize("Hôm nay là 25/12/2023, lúc 14:30")
# → "hôm nay là ngày hai mươi lăm tháng mười hai năm hai nghìn không trăm hai mươi ba, lúc mười bốn giờ ba mươi phút"

# Non-Vietnamese words are left as-is by default (only lowercased); replacement applies once enabled, see Transliteration Control
normalizer.normalize("Hello container from Singapore")
# → "hello container from singapore"

# Acronym expansion (from built-in dictionary)
normalizer.normalize("Tôi xem TV và dùng AI hàng ngày")
# → "tôi xem ti vi và dùng trí tuệ nhân tạo hàng ngày"

# By default, English words are kept as-is
normalizer.normalize("database server configuration")
# → "database server configuration"

# Measurement units
normalizer.normalize("Tốc độ 120km/h, diện tích 500m2")
# → "tốc độ một trăm hai mươi ki-lô-mét trên giờ, diện tích năm trăm mét vuông"

# Currency with thousand separators
normalizer.normalize("Giá 50.000đ cho mỗi người")
# → "giá năm mươi nghìn đồng cho mỗi người"

# Year ranges, ordinals, percentages
normalizer.normalize("1873-1907, thứ 2, tăng 6,5%")
# → "một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy, thứ hai, tăng sáu phẩy năm phần trăm"

# Date ranges
normalizer.normalize("ngày 25-26/12/2023")
# → "ngày hai mươi lăm đến hai mươi sáu tháng mười hai năm hai nghìn không trăm hai mươi ba"

# Percentage ranges
normalizer.normalize("3-5% dân số")
# → "ba đến năm phần trăm dân số"

Transliteration Control

# English-to-Vietnamese pronunciation is disabled by default; enable it explicitly when needed
normalizer = VietnameseNormalizer(enable_transliteration=True)
normalizer.normalize("Hello container from Singapore")
# → "hê-lô công-tê-nơ phờ-rôm xin-ga-po"

normalizer.normalize("machine learning algorithm")
# → "ma-xin lơn-ning a-go-rít"

# Or override per-call
normalizer.normalize("machine learning", enable_transliteration=False)
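
One common way to combine a constructor default with a per-call override is a `None` sentinel: the call-level argument wins only when it is given. The class below is a hypothetical stand-in (the names `ToggleSketch` and `resolve` are assumptions), not the library's implementation.

```python
from typing import Optional


class ToggleSketch:
    """Hypothetical stand-in showing constructor default vs. per-call override."""

    def __init__(self, enable_transliteration: bool = False):
        self.enable_transliteration = enable_transliteration

    def resolve(self, enable_transliteration: Optional[bool] = None) -> bool:
        # None means "not specified", so fall back to the instance default.
        if enable_transliteration is None:
            return self.enable_transliteration
        return enable_transliteration


toggle = ToggleSketch(enable_transliteration=True)
print(toggle.resolve())        # uses the constructor default
print(toggle.resolve(False))   # per-call override wins
```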

Vietnamese Word Detection

from vietnormalizer import is_vietnamese_word

is_vietnamese_word("xin")      # True (valid Vietnamese structure)
is_vietnamese_word("chào")     # True (has Vietnamese diacritics)
is_vietnamese_word("database") # False (not a valid Vietnamese syllable structure)
is_vietnamese_word("flow")     # False (contains 'f' and 'w')
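
A rough idea of how structure-based detection can work, as a sketch under stated assumptions: a Vietnamese word is treated as a single syllable made of an optional onset, a vowel nucleus, and an optional coda, checked after tone marks are stripped. The onset and coda lists and the name `looks_vietnamese_sketch` are illustrative; the library's `is_vietnamese_word` may use different rules.

```python
import re
import unicodedata

# Simplified syllable shape: optional onset + vowels + optional coda.
_ONSET = r"(?:ngh|ng|nh|gh|gi|kh|ph|th|tr|ch|qu|[bcdđghklmnpqrstvx])?"
_NUCLEUS = r"[aeiouy]+"
_CODA = r"(?:ng|nh|ch|[cmnpt])?"
_SYLLABLE_RE = re.compile(_ONSET + _NUCLEUS + _CODA)


def strip_tone_marks(word):
    """Lowercase and drop combining marks so 'chào' becomes 'chao'."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def looks_vietnamese_sketch(word):
    """Return True when the whole word fits the simplified syllable shape."""
    return _SYLLABLE_RE.fullmatch(strip_tone_marks(word)) is not None


for sample in ["xin", "chào", "database", "flow"]:
    print(sample, looks_vietnamese_sketch(sample))
```

A heuristic like this accepts short English words that happen to fit the shape (for example "an" or "to"), which is one reason to consult dictionaries before falling back to rules.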

Direct Transliteration

from vietnormalizer import transliterate_word, english_to_vietnamese

transliterate_word("database")   # "đa-ta-bâi" (checks if Vietnamese first)
transliterate_word("xin")        # "xin" (detected as Vietnamese, kept as-is)
english_to_vietnamese("computer") # "com-pu-tơ" (always transliterates)

Custom Dictionaries

normalizer = VietnameseNormalizer(
    acronyms_path="path/to/acronyms.csv",
    non_vietnamese_words_path="path/to/words.csv"
)

# Or specify a directory containing both files
normalizer = VietnameseNormalizer(data_dir="path/to/data/")

# Reload dictionaries at runtime
normalizer.reload_dictionaries(acronyms_path="path/to/updated.csv")

CSV Formats

acronyms.csv:

acronym,transliteration
NASA,na-sa
GDP,tổng sản phẩm quốc nội
AI,trí tuệ nhân tạo

non-vietnamese-words.csv:

original,transliteration
container,công-tê-nơ
singapore,xin-ga-po
server,xơ-vơ
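
To show how a two-column file like these could be turned into a lookup table, here is a minimal sketch using the standard `csv` module. The helper name `load_mapping_sketch` is hypothetical, and lowercasing the keys is an assumption based on the pipeline lowercasing text before dictionary replacement.

```python
import csv
import io


def load_mapping_sketch(csv_text, key_column):
    """Build {key: transliteration} from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keys are lowercased on the assumption that lookups run on lowercased text.
    return {
        row[key_column].strip().lower(): row["transliteration"].strip()
        for row in reader
    }


ACRONYMS_CSV = """acronym,transliteration
NASA,na-sa
GDP,tổng sản phẩm quốc nội
"""

acronyms = load_mapping_sketch(ACRONYMS_CSV, "acronym")
print(acronyms)
```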

Advanced Usage

Using the Processor Directly

from vietnormalizer import VietnameseTextProcessor

processor = VietnameseTextProcessor()

# Convert numbers
processor.number_to_words("123")  # "một trăm hai mươi ba"

# Process text (numbers, dates, times, units - no dictionary replacements)
processor.process_vietnamese_text("Giá 50.000đ lúc 15h30")
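
For readers curious about the reading rules behind number conversion, the sketch below covers 0 to 999,999. It is an independent illustration, not the code behind `number_to_words`; the function names are assumptions, and it always reads 4 as "bốn" although "tư" is also common after "mươi".

```python
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def _two_digits(n):
    """Read 0-99, applying the 'mười', 'lăm', and 'mốt' rules."""
    tens, units = divmod(n, 10)
    if tens == 0:
        return DIGITS[units]
    head = "mười" if tens == 1 else DIGITS[tens] + " mươi"
    if units == 0:
        return head
    if units == 5:
        return head + " lăm"      # 15 -> mười lăm, 25 -> hai mươi lăm
    if units == 1 and tens >= 2:
        return head + " mốt"      # 21 -> hai mươi mốt
    return head + " " + DIGITS[units]


def _three_digits(n, pad=False):
    """Read 0-999; pad=True forces 'không trăm' for groups after 'nghìn'."""
    hundreds, rest = divmod(n, 100)
    if not hundreds and not pad:
        return _two_digits(rest)
    parts = [DIGITS[hundreds] + " trăm"]
    if 0 < rest < 10:
        parts.append("lẻ " + DIGITS[rest])   # 907 -> chín trăm lẻ bảy
    elif rest >= 10:
        parts.append(_two_digits(rest))
    return " ".join(parts)


def number_to_words_sketch(n):
    """Read an integer in the range 0-999,999 as Vietnamese words."""
    if not 0 <= n < 1_000_000:
        raise ValueError("sketch only covers 0 to 999,999")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return _three_digits(rest)
    words = _three_digits(thousands) + " nghìn"
    if rest:
        words += " " + _three_digits(rest, pad=True)
    return words


for value in (123, 1907, 2023, 50000):
    print(value, number_to_words_sketch(value))
```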

Disable Preprocessing

# Only apply dictionary replacements and transliteration, skip number/date conversion
normalizer.normalize(text, enable_preprocessing=False)

Processing Pipeline

The normalization follows this pipeline (matching nghitts):

1. Unicode normalization (NFC)
2. Special character replacement (& → và, @ → a còng, URL/email removal)
3. Punctuation normalization
4. Text cleaning (emojis, non-Latin chars)
5. Year range conversion
6. Percentage range conversion
7. Date/time conversion (including ranges)
8. Ordinal conversion
9. Thousand separator removal
10. Currency conversion
11. Remaining percentage conversion
12. Phone number conversion
13. Decimal conversion
14. Measurement unit conversion
15. Standalone number conversion
16. Lowercase normalization
17. Acronym replacement (from CSV)
18. Non-Vietnamese word replacement (from CSV)
19. Optional rule-based transliteration (for remaining non-Vietnamese words)
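
The order of the stages matters: range patterns have to run before the simpler patterns they contain. A minimal sketch with two stages shows why percentage ranges come before remaining percentages. The regexes and function names are assumptions for illustration, and digits are left unconverted to keep the example short.

```python
import re

PERCENT_RANGE_RE = re.compile(r"(\d+)\s*-\s*(\d+)\s*%")
PERCENT_RE = re.compile(r"(\d+)\s*%")


def percent_ranges(text):
    """Rewrite '3-5%' style ranges."""
    return PERCENT_RANGE_RE.sub(r"\1 đến \2 phần trăm", text)


def percents(text):
    """Rewrite any remaining single percentages."""
    return PERCENT_RE.sub(r"\1 phần trăm", text)


def run_pipeline(text, stages):
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        text = stage(text)
    return text


# Ranges first: the range is read as a whole.
print(run_pipeline("3-5% dân số", [percent_ranges, percents]))
# Wrong order: the single-percentage stage consumes '5%' and the range is lost.
print(run_pipeline("3-5% dân số", [percents, percent_ranges]))
```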

Performance

~0.6ms per normalization call with 17K+ dictionary entries
All regex patterns pre-compiled at initialization
Dictionary lookups use O(1) hash map instead of regex alternation
Total initialization time: ~40ms
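
The hash-map point can be sketched as follows: instead of compiling one large alternation of every dictionary entry, a single pre-compiled word pattern finds tokens and each token is looked up in a dict. The table contents and the name `replace_words_sketch` are illustrative assumptions, not the library's internals.

```python
import re

# One small pattern compiled once, regardless of dictionary size.
WORD_RE = re.compile(r"\w+")

REPLACEMENTS = {
    "container": "công-tê-nơ",
    "singapore": "xin-ga-po",
    "server": "xơ-vơ",
}


def replace_words_sketch(text, table=REPLACEMENTS):
    """Replace each token found in the table; leave other tokens untouched."""
    return WORD_RE.sub(lambda m: table.get(m.group(0), m.group(0)), text)


print(replace_words_sketch("hello container from singapore"))
```

With this shape, the cost per token is one average-case constant-time dict lookup, so adding entries to the dictionary does not make the pattern itself any larger.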

Requirements

Python 3.8+
No external dependencies (uses only standard library)

Publishing

To release a new version to PyPI, see docs/publish-to-pypi.md. Quick path: bump version in pyproject.toml, setup.py, and vietnormalizer/__init__.py, then run ./scripts/publish-to-pypi.sh.

License

MIT License

Contributing

Contributions are welcome! See CONTRIBUTING.md for a short guide (fork, quick wins like fixing typos, and how to open a Pull Request).

Acknowledgments

Ported from the JavaScript implementations in nghitts.

This version

0.3.0

Apr 1, 2026

Download files

Source Distribution

vietnormalizer_thuan-0.3.0.tar.gz (180.7 kB)

Uploaded Apr 1, 2026 Source

Built Distribution

vietnormalizer_thuan-0.3.0-py3-none-any.whl (179.1 kB)

Uploaded Apr 1, 2026 Python 3

File details

Details for the file vietnormalizer_thuan-0.3.0.tar.gz.

File metadata

Download URL: vietnormalizer_thuan-0.3.0.tar.gz
Upload date: Apr 1, 2026
Size: 180.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for vietnormalizer_thuan-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`b847299462e892b693bcb71dfae93f84463b381ed77e47cd5a570a95a40cc215`
MD5	`4e4e22684bc37b363b7a055eddb1b141`
BLAKE2b-256	`c4cde3fd22029c9ce2492ddbaa37a50d1c7db38289adc12072f588fe00058f95`

File details

Details for the file vietnormalizer_thuan-0.3.0-py3-none-any.whl.

File metadata

Download URL: vietnormalizer_thuan-0.3.0-py3-none-any.whl
Upload date: Apr 1, 2026
Size: 179.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for vietnormalizer_thuan-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`539e42864dcaa1df4dad6007f8756df67abdb0fbeb64a8e0f3558cc25de65488`
MD5	`1525ac41b83c55db923cae2113318b79`
BLAKE2b-256	`595ac0a0d48ecf986655dba7ec6bcacfb4479484fa152f7929e9fc76397d5914`
