A Python library for normalizing Vietnamese text for TTS and NLP applications
Vietnamese Text Normalizer
A Python library for normalizing Vietnamese text, designed for Text-to-Speech (TTS) and Natural Language Processing (NLP) applications. Ported from nghitts.
Features
- Number Conversion: Numbers to Vietnamese words (123 → một trăm hai mươi ba)
- Date & Time: Full date/time conversion, including date ranges (25-26/12 → hai mươi lăm đến hai mươi sáu tháng mười hai)
- Currency: VND and USD amounts (50.000đ → năm mươi nghìn đồng)
- Percentages: Including ranges (3-5% → ba đến năm phần trăm)
- Year Ranges: 1873-1907 → một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy
- Ordinals: thứ 2 → thứ hai
- Phone Numbers: Digit-by-digit reading
- Measurement Units: 120km/h → một trăm hai mươi ki-lô-mét trên giờ
- Acronym Expansion: Dictionary-based (NASA → na-sa)
- Non-Vietnamese Word Replacement: Dictionary-based (container → công-tê-nơ)
- Rule-based Transliteration: Words not found in the dictionaries are automatically transliterated to Vietnamese phonetics (algorithm → a-go-rít)
- Vietnamese Word Detection: Automatically detects Vietnamese words and skips them during transliteration
- Text Cleaning: Removes emojis, URLs, and emails; normalizes Unicode and punctuation
- Special Characters: & → và, @ → a còng, # → thăng
- High Performance: ~0.6ms per call with 17K+ dictionary entries
Installation
pip install vietnormalizer
Or install from source:
git clone https://github.com/nghimestudio/vietnormalizer.git
cd vietnormalizer
pip install -e .
Quick Start
from vietnormalizer import VietnameseNormalizer
normalizer = VietnameseNormalizer()
# Numbers, dates, and times
normalizer.normalize("Hôm nay là 25/12/2023, lúc 14:30")
# → "hôm nay là ngày hai mươi lăm tháng mười hai năm hai nghìn không trăm hai mươi ba, lúc mười bốn giờ ba mươi phút"
# Non-Vietnamese word replacement (from built-in dictionary)
normalizer.normalize("Hello container from Singapore")
# → "hê-lô công-tê-nơ phờ-rôm xin-ga-po"
# Acronym expansion (from built-in dictionary)
normalizer.normalize("Tôi xem TV và dùng AI hàng ngày")
# → "tôi xem ti vi và dùng trí tuệ nhân tạo hàng ngày"
# Rule-based transliteration for words NOT in dictionary
normalizer.normalize("database server configuration")
# → "đa-ta-bê xơ-vơ con-phi-gu-raân"
# Measurement units
normalizer.normalize("Tốc độ 120km/h, diện tích 500m2")
# → "tốc độ một trăm hai mươi ki-lô-mét trên giờ, diện tích năm trăm mét vuông"
# Currency with thousand separators
normalizer.normalize("Giá 50.000đ cho mỗi người")
# → "giá năm mươi nghìn đồng cho mỗi người"
# Year ranges, ordinals, percentages
normalizer.normalize("1873-1907, thứ 2, tăng 6,5%")
# → "một nghìn tám trăm bảy mươi ba đến một nghìn chín trăm lẻ bảy, thứ hai, tăng sáu phẩy năm phần trăm"
# Date ranges
normalizer.normalize("ngày 25-26/12/2023")
# → "ngày hai mươi lăm đến hai mươi sáu tháng mười hai năm hai nghìn không trăm hai mươi ba"
# Percentage ranges
normalizer.normalize("3-5% dân số")
# → "ba đến năm phần trăm dân số"
Transliteration Control
# Disable transliteration (only use CSV dictionary replacements)
normalizer = VietnameseNormalizer(enable_transliteration=False)
normalizer.normalize("machine learning algorithm")
# → "ma-sin li-nin algorithm" (words in CSV replaced, others kept as-is)
# Or override per-call
normalizer = VietnameseNormalizer(enable_transliteration=True)
normalizer.normalize("machine learning", enable_transliteration=False)
Vietnamese Word Detection
from vietnormalizer import is_vietnamese_word
is_vietnamese_word("xin") # True (valid Vietnamese structure)
is_vietnamese_word("chào") # True (has Vietnamese diacritics)
is_vietnamese_word("database") # False (not a valid Vietnamese syllable structure)
is_vietnamese_word("flow") # False (contains 'f' and 'w')
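To make the idea behind the detection more concrete, here is a minimal sketch of one possible heuristic. It is not the library's implementation: the function name `looks_vietnamese`, the simplified onset/nucleus/coda pattern, and the rule that any diacritic implies Vietnamese are all assumptions made for illustration, and the heuristic will accept some short English words such as "man".

```python
import re
import unicodedata

# Simplified single-syllable shape: optional onset, vowel nucleus, optional coda.
# These patterns are illustrative assumptions, not the library's rules.
ONSET = r"(?:ngh|ng|nh|gh|gi|kh|ph|th|tr|ch|qu|[bcdđghklmnprstvx])?"
NUCLEUS = r"[aeiouy]+"
CODA = r"(?:ng|nh|ch|[cmnpt])?"
SYLLABLE = re.compile(f"^{ONSET}{NUCLEUS}{CODA}$")


def strip_marks(word):
    """Remove combining marks (tones, breve, circumflex, horn) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def looks_vietnamese(word):
    """Hypothetical heuristic: diacritics, or a valid single-syllable shape."""
    word = unicodedata.normalize("NFC", word.lower())
    if strip_marks(word) != word:
        # The word carries at least one diacritic, so treat it as Vietnamese.
        return True
    return SYLLABLE.match(word) is not None


for sample in ["xin", "chào", "database", "flow"]:
    print(sample, looks_vietnamese(sample))
```

Letters such as f, j, w, and z never appear in the onset set, so words containing them fail the syllable pattern without needing a separate check.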
Direct Transliteration
from vietnormalizer import transliterate_word, english_to_vietnamese
transliterate_word("database") # "đa-ta-bâi" (checks if Vietnamese first)
transliterate_word("xin") # "xin" (detected as Vietnamese, kept as-is)
english_to_vietnamese("computer") # "com-pu-tơ" (always transliterates)
Custom Dictionaries
normalizer = VietnameseNormalizer(
acronyms_path="path/to/acronyms.csv",
non_vietnamese_words_path="path/to/words.csv"
)
# Or specify a directory containing both files
normalizer = VietnameseNormalizer(data_dir="path/to/data/")
# Reload dictionaries at runtime
normalizer.reload_dictionaries(acronyms_path="path/to/updated.csv")
CSV Formats
acronyms.csv:
acronym,transliteration
NASA,na-sa
GDP,tổng sản phẩm quốc nội
AI,trí tuệ nhân tạo
non-vietnamese-words.csv:
original,transliteration
container,công-tê-nơ
singapore,xin-ga-po
server,xơ-vơ
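A custom dictionary file can be produced with the standard `csv` module. The sketch below writes and reads a file using the `acronym,transliteration` header shown above; the helper names `write_acronyms_csv` and `read_acronyms_csv` are hypothetical and not part of the library.

```python
import csv
import tempfile
from pathlib import Path


def write_acronyms_csv(path, entries):
    """Write a mapping of acronym -> reading using the header shown above."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["acronym", "transliteration"])
        writer.writerows(entries.items())


def read_acronyms_csv(path):
    """Load the file back into a dict, e.g. to check it before using it."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["acronym"]: row["transliteration"] for row in csv.DictReader(f)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "acronyms.csv"
        write_acronyms_csv(path, {"GDP": "tổng sản phẩm quốc nội", "NASA": "na-sa"})
        print(read_acronyms_csv(path))
```

The resulting path can then be passed as `acronyms_path` as shown under Custom Dictionaries.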
Advanced Usage
Using the Processor Directly
from vietnormalizer import VietnameseTextProcessor
processor = VietnameseTextProcessor()
# Convert numbers
processor.number_to_words("123") # "một trăm hai mươi ba"
# Process text (numbers, dates, times, units - no dictionary replacements)
processor.process_vietnamese_text("Giá 50.000đ lúc 15h30")
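For readers curious about how number reading works, the following is a simplified sketch for integers below one million. It is an independent illustration, not the code behind `number_to_words`; the function name `number_to_words_sketch` and its helpers are hypothetical. It follows the conventions visible in the examples above ("lẻ" for a zero tens digit, "không trăm" for a zero hundreds digit after thousands).

```python
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def _tens(n):
    """Read an integer in 1..99."""
    tens, unit = divmod(n, 10)
    if tens == 0:
        return DIGITS[unit]
    head = "mười" if tens == 1 else f"{DIGITS[tens]} mươi"
    if unit == 0:
        return head
    if unit == 1 and tens > 1:
        tail = "mốt"   # 21 -> hai mươi mốt
    elif unit == 5:
        tail = "lăm"   # 15 -> mười lăm, 25 -> hai mươi lăm
    else:
        tail = DIGITS[unit]
    return f"{head} {tail}"


def _group(n, pad):
    """Read 1..999; pad=True voices a zero hundreds digit as 'không trăm'."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds or pad:
        parts.append(f"{DIGITS[hundreds]} trăm")
    if rest:
        if (hundreds or pad) and rest < 10:
            parts.append(f"lẻ {DIGITS[rest]}")  # 907 -> chín trăm lẻ bảy
        else:
            parts.append(_tens(rest))
    return " ".join(parts)


def number_to_words_sketch(n):
    """Hypothetical helper: read an integer in 0..999999 in Vietnamese."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only handles 0..999999")
    if n == 0:
        return DIGITS[0]
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(f"{_group(thousands, pad=False)} nghìn")
    if rest:
        parts.append(_group(rest, pad=bool(thousands)))
    return " ".join(parts)


print(number_to_words_sketch(123))   # một trăm hai mươi ba
print(number_to_words_sketch(1907))  # một nghìn chín trăm lẻ bảy
```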
Disable Preprocessing
# Only apply dictionary replacements and transliteration, skip number/date conversion
text = "Hello container 123"  # any input string
normalizer.normalize(text, enable_preprocessing=False)
Processing Pipeline
The normalization follows this pipeline (matching nghitts):
- Unicode normalization (NFC)
- Special character replacement (& → và, @ → a còng, URL/email removal)
- Punctuation normalization
- Text cleaning (emojis, non-Latin chars)
- Year range conversion
- Percentage range conversion
- Date/time conversion (including ranges)
- Ordinal conversion
- Thousand separator removal
- Currency conversion
- Remaining percentage conversion
- Phone number conversion
- Decimal conversion
- Measurement unit conversion
- Standalone number conversion
- Lowercase normalization
- Acronym replacement (from CSV)
- Non-Vietnamese word replacement (from CSV)
- Rule-based transliteration (for remaining non-Vietnamese words)
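The order of the steps above matters: range patterns have to run before the rules that would consume part of them. The sketch below illustrates this with a toy pipeline of four stages. The step functions and regular expressions are simplified assumptions for illustration, not the library's actual rules, and digits are left unconverted to keep the example short.

```python
import re
import unicodedata

# Patterns are compiled once, up front (illustrative, simplified rules).
PERCENT_RANGE = re.compile(r"(\d+)\s*-\s*(\d+)\s*%")
PERCENT = re.compile(r"(\d+)\s*%")
AMPERSAND = re.compile(r"\s*&\s*")


def nfc(text):
    return unicodedata.normalize("NFC", text)


def replace_special(text):
    return AMPERSAND.sub(" và ", text)


def percent_ranges(text):
    return PERCENT_RANGE.sub(r"\1 đến \2 phần trăm", text)


def percents(text):
    return PERCENT.sub(r"\1 phần trăm", text)


# Ranges are handled before single percentages, mirroring the list above.
PIPELINE = [nfc, replace_special, percent_ranges, percents, str.lower]


def run_pipeline(text, steps=PIPELINE):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


print(run_pipeline("Tăng 3-5% & giảm 2%"))
# tăng 3 đến 5 phần trăm và giảm 2 phần trăm

# Swapping the two percentage steps breaks the range:
wrong_order = [nfc, replace_special, percents, percent_ranges, str.lower]
print(run_pipeline("Tăng 3-5%", wrong_order))
# tăng 3-5 phần trăm
```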
Performance
- ~0.6ms per normalization call with 17K+ dictionary entries
- All regex patterns pre-compiled at initialization
- Dictionary lookups use O(1) hash map instead of regex alternation
- Total initialization time: ~40ms
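The hash-map point can be sketched as follows: instead of compiling one regex alternation over every dictionary entry, a single generic word pattern is compiled once and each match is looked up in a dict. This is an illustration of the technique under assumed names (`replace_words`, `REPLACEMENTS`), not the library's code; the sample entries come from the CSV format section.

```python
import re

# One pre-compiled pattern that matches any run of letters (Unicode-aware).
WORD = re.compile(r"[^\W\d_]+")

# Sample entries; a real table could hold thousands without slowing the regex.
REPLACEMENTS = {
    "container": "công-tê-nơ",
    "singapore": "xin-ga-po",
    "server": "xơ-vơ",
}


def replace_words(text, table=REPLACEMENTS):
    """Replace each word found in the table; leave all other words untouched."""
    def lookup(match):
        word = match.group(0)
        # Average O(1) dict lookup per word, independent of table size.
        return table.get(word.lower(), word)
    return WORD.sub(lookup, text)


print(replace_words("Hello container from Singapore"))
# Hello công-tê-nơ from xin-ga-po
```

With this design, the cost per call grows with the length of the input text rather than with the number of dictionary entries.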
Requirements
- Python 3.8+
- No external dependencies (uses only standard library)
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
Ported from the JavaScript implementations in nghitts.