A Python library for normalizing Vietnamese text for TTS and NLP applications
Vietnamese Text Normalizer
A Python library for normalizing Vietnamese text, designed for Text-to-Speech (TTS) and Natural Language Processing (NLP) applications.
Features
- Number Conversion: Converts numbers to Vietnamese words (e.g., 123 → một trăm hai mươi ba)
- Date & Time Normalization: Converts dates and times to Vietnamese words
- Currency Conversion: Handles VND and USD amounts
- Percentage Conversion: Converts percentages to Vietnamese words
- Acronym Expansion: Expands acronyms using dictionary mappings
- Non-Vietnamese Word Transliteration: Transliterates foreign words to Vietnamese pronunciation
- Text Cleaning: Removes emojis, special characters, and normalizes Unicode
- High Performance: Pre-compiled regex patterns for fast processing
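As an illustration of the Text Cleaning feature, here is a minimal sketch of Unicode normalization and emoji removal using only the standard library. The function name `clean_text` is hypothetical, and this is not the library's own implementation. It assumes NFC composition is the desired form and that dropping characters in the Unicode category "So" is an acceptable approximation of emoji removal.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical helper: normalize Unicode, drop symbols, tidy whitespace."""
    # Compose decomposed sequences (e.g. "e" + combining marks) into single
    # code points, so the same Vietnamese syllable always has one encoding.
    text = unicodedata.normalize("NFC", text)
    # Drop characters in category "So" (Symbol, other), which covers most
    # emoji. Modifiers such as variation selectors are not handled here.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "So")
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


# Decomposed input with doubled spaces and a trailing emoji.
print(clean_text("Tie\u0302\u0301ng  Vie\u0323\u0302t \U0001F600"))
```

Normalizing before any dictionary lookup matters for Vietnamese, because a composed and a decomposed spelling of the same word would otherwise fail to match.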
Installation
Install from PyPI:
pip install vietnormalizer
Or install from source:
git clone https://github.com/nghimestudio/vietnormalizer.git
cd vietnormalizer
pip install -e .
Quick Start
from vietnormalizer import VietnameseNormalizer
# Initialize the normalizer
normalizer = VietnameseNormalizer()
# Example 1: Numbers, dates, and times
text = "Hôm nay là 25/12/2023, lúc 14:30"
normalized = normalizer.normalize(text)
print(normalized)
# Output: "hôm nay là ngày hai mươi lăm tháng mười hai năm hai nghìn không trăm hai mươi ba, lúc mười bốn giờ ba mươi phút"
# Example 2: Acronym expansion (from built-in dictionary)
text = "Tôi làm việc tại NASA và xem TV"
normalized = normalizer.normalize(text)
print(normalized)
# Output: "Tôi làm việc tại na-sa và xem Ti vi"
# Example 3: Non-Vietnamese word transliteration (from built-in dictionary)
text = "Hello container from Singapore"
normalized = normalizer.normalize(text)
print(normalized)
# Output: "hê-lô công-tê-nơ phờ-rôm xin-ga-po"
# Example 4: Combined - numbers, acronyms, and foreign words
text = "Giá container là 1.500.000 đồng, giao hàng từ Singapore"
normalized = normalizer.normalize(text)
print(normalized)
# Output: "Giá công-tê-nơ là một triệu năm trăm nghìn đồng, giao hàng từ xin-ga-po"
Usage Examples
Basic Normalization
from vietnormalizer import VietnameseNormalizer
normalizer = VietnameseNormalizer()
# Numbers
normalizer.normalize("Tôi có 123 quyển sách")
# "Tôi có một trăm hai mươi ba quyển sách"
# Dates
normalizer.normalize("Sinh nhật vào 15/08/1990")
# "Sinh nhật vào mười lăm tháng tám năm một nghìn chín trăm chín mươi"
# Times
normalizer.normalize("Cuộc họp lúc 9:30")
# "Cuộc họp lúc chín giờ ba mươi"
# Currency
normalizer.normalize("Giá là 1.500.000 đồng")
# "Giá là một triệu năm trăm nghìn đồng"
# Percentages
normalizer.normalize("Tăng 25% so với năm ngoái")
# "Tăng hai mươi lăm phần trăm so với năm ngoái"
Custom Dictionary Paths
from vietnormalizer import VietnameseNormalizer
# Use custom CSV files
normalizer = VietnameseNormalizer(
acronyms_path="path/to/custom/acronyms.csv",
non_vietnamese_words_path="path/to/custom/words.csv"
)
Disable Preprocessing
# Only apply dictionary replacements, skip number/date conversion
normalized = normalizer.normalize(text, enable_preprocessing=False)
Reload Dictionaries
# Reload dictionaries without recreating the normalizer
normalizer.reload_dictionaries(
acronyms_path="path/to/updated/acronyms.csv"
)
Advanced Usage
Using the Processor Directly
For more control, you can use the VietnameseTextProcessor class directly:
from vietnormalizer import VietnameseTextProcessor
processor = VietnameseTextProcessor()
# Convert numbers only
words = processor.number_to_words("123")
# "một trăm hai mươi ba"
# Process text without dictionary replacements
processed = processor.process_vietnamese_text("Hôm nay là 25/12/2023")
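To give a sense of what a number-to-words conversion involves, here is an independent sketch that reads integers from 0 to 999. It is not the code behind `number_to_words`. The helper names `read_two_digits` and `read_under_1000` are hypothetical, and the sketch assumes the northern reading "linh" for a zero in the tens place.

```python
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def read_two_digits(n: int) -> str:
    """Hypothetical helper: read an integer in 0..99."""
    tens, unit = divmod(n, 10)
    if tens == 0:
        return DIGITS[unit]
    if tens == 1:
        # 10..19 use "mười"; a trailing 5 is read "lăm".
        if unit == 0:
            return "mười"
        if unit == 5:
            return "mười lăm"
        return "mười " + DIGITS[unit]
    # 20..99 use "<digit> mươi"; trailing 1 and 5 change form.
    words = DIGITS[tens] + " mươi"
    if unit == 0:
        return words
    if unit == 1:
        return words + " mốt"
    if unit == 5:
        return words + " lăm"
    return words + " " + DIGITS[unit]


def read_under_1000(n: int) -> str:
    """Hypothetical helper: read an integer in 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return read_two_digits(rest)
    words = DIGITS[hundreds] + " trăm"
    if rest == 0:
        return words
    if rest < 10:
        # A zero tens digit is read "linh" (assumed northern form).
        return words + " linh " + DIGITS[rest]
    return words + " " + read_two_digits(rest)


print(read_under_1000(123))  # một trăm hai mươi ba
print(read_under_1000(25))   # hai mươi lăm
```

Larger numbers would additionally need grouping by thousands (nghìn, triệu) and are left out of this sketch.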
CSV Dictionary Format
Acronyms CSV
acronym,transliteration
USA,Hoa Kỳ
GDP,Tổng sản phẩm quốc nội
AI,trí tuệ nhân tạo
Non-Vietnamese Words CSV
original,transliteration
original,ô-ri-gin-nồ
container,công-tê-nơ
singapore,xin-ga-po
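The two-column format above can be produced and checked with the standard `csv` module. The following is a minimal sketch: the helper names `write_dictionary_csv` and `read_dictionary_csv` are hypothetical, the temporary directory is only for demonstration, and UTF-8 encoding is an assumption about what the library expects.

```python
import csv
import tempfile
from pathlib import Path


def write_dictionary_csv(path, header, rows):
    """Hypothetical helper: write a two-column dictionary file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_dictionary_csv(path):
    """Hypothetical helper: load a dictionary file into a dict."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip only the first row, which is the header
        return {row[0]: row[1] for row in reader if len(row) >= 2}


tmp_dir = Path(tempfile.mkdtemp())
acronyms_path = tmp_dir / "acronyms.csv"
write_dictionary_csv(
    acronyms_path,
    ["acronym", "transliteration"],
    [("USA", "Hoa Kỳ"), ("AI", "trí tuệ nhân tạo")],
)

acronyms = read_dictionary_csv(acronyms_path)
print(acronyms)
```

A file written this way could then be passed through the `acronyms_path` or `non_vietnamese_words_path` parameters shown under Custom Dictionary Paths.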
Performance
The library is optimized for performance:
- All regex patterns are pre-compiled at initialization
- Dictionary replacements use a single combined regex pass
- Minimal memory allocations during processing
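The idea of a single combined regex pass can be sketched as follows. This is an illustration of the general technique, not the library's code. The function name `build_replacer` is hypothetical, and the sketch assumes case-insensitive, whole-word matching.

```python
import re


def build_replacer(mapping):
    """Hypothetical helper: compile one pattern for all dictionary keys."""
    if not mapping:
        return lambda text: text
    # Longest keys first, so a longer entry wins over a shorter prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE,
    )
    lowered = {k.lower(): v for k, v in mapping.items()}

    def replace(text):
        # One scan over the text; each match is looked up in the dict.
        return pattern.sub(lambda m: lowered[m.group(1).lower()], text)

    return replace


replace = build_replacer({"container": "công-tê-nơ", "singapore": "xin-ga-po"})
print(replace("Giá container là 1.500.000 đồng, giao hàng từ Singapore"))
```

Compiling the pattern once and reusing the returned function avoids scanning the text once per dictionary entry.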
Requirements
- Python 3.8+
- No external dependencies (uses only standard library)
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
This library is ported from JavaScript implementations used in Vietnamese TTS systems.
File details
Details for the file vietnormalizer-0.1.2.tar.gz.
File metadata
- Download URL: vietnormalizer-0.1.2.tar.gz
- Size: 171.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 06618908ba6c38919346248879f7f374bebf6606824ba2bfdc57520b7b6a9d06 |
| MD5 | 063dae2a62c2b7e3673075b12ac13a84 |
| BLAKE2b-256 | 2ba1cb9ef45130d3cf1fc0ce7667c1f9a7029e7d441c3cdc8cf8600e795e9ca3 |
File details
Details for the file vietnormalizer-0.1.2-py3-none-any.whl.
File metadata
- Download URL: vietnormalizer-0.1.2-py3-none-any.whl
- Size: 169.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7639dd70e9cabc9de470c68eb0c111831e57588475678956939e19707fba29d7 |
| MD5 | f88c598e392cadeb1065e796594ea9b1 |
| BLAKE2b-256 | d65d6ea3cc992a43fa04e21e7fd030f8e66195154c6a8e6b22ed9e9bb54e073f |