A Python library for normalizing Vietnamese text for TTS and NLP applications

Vietnamese Text Normalizer

A Python library for normalizing Vietnamese text, designed for Text-to-Speech (TTS) and Natural Language Processing (NLP) applications.

Features

  • Number Conversion: Converts numbers to Vietnamese words (e.g., 123 → một trăm hai mươi ba)
  • Date & Time Normalization: Converts dates and times to Vietnamese words
  • Currency Conversion: Handles VND and USD amounts
  • Percentage Conversion: Converts percentages to Vietnamese words
  • Acronym Expansion: Expands acronyms using dictionary mappings
  • Non-Vietnamese Word Transliteration: Transliterates foreign words to Vietnamese pronunciation
  • Text Cleaning: Removes emojis, special characters, and normalizes Unicode
  • High Performance: Pre-compiled regex patterns for fast processing
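
As a rough illustration of the text-cleaning feature listed above, here is a minimal sketch using only the standard library. The function name `clean_text` and the emoji code-point ranges are assumptions chosen for this example; the library's own cleaning rules may differ. NFC normalization matters for Vietnamese because the same accented letter can be encoded either as one code point or as a base letter plus combining marks.

```python
import re
import unicodedata

# Hypothetical, deliberately small set of emoji/symbol ranges (not exhaustive).
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
_SPACES = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize to NFC, drop emoji in the ranges above, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)  # compose base letter + combining marks
    text = _EMOJI.sub(" ", text)               # replace emoji with a space
    return _SPACES.sub(" ", text).strip()      # tidy up leftover whitespace


# "cha\u0300o" is "chào" written with a combining grave accent.
cleaned = clean_text("Xin cha\u0300o \U0001F600")
print(cleaned == "Xin ch\u00e0o")  # True
```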

Installation

Install from PyPI:

pip install vietnormalizer

Or install from source:

git clone https://github.com/nghimestudio/vietnormalizer.git
cd vietnormalizer
pip install -e .

Quick Start

from vietnormalizer import VietnameseNormalizer

# Initialize the normalizer
normalizer = VietnameseNormalizer()

# Normalize text
text = "Hôm nay là 25/12/2023, lúc 14:30"
normalized = normalizer.normalize(text)
print(normalized)
# Output: "Hôm nay là ngày hai mươi lăm tháng mười hai năm hai nghìn không trăm hai mươi ba, lúc mười bốn giờ ba mươi"

Usage Examples

Basic Normalization

from vietnormalizer import VietnameseNormalizer

normalizer = VietnameseNormalizer()

# Numbers
normalizer.normalize("Tôi có 123 quyển sách")
# "Tôi có một trăm hai mươi ba quyển sách"

# Dates
normalizer.normalize("Sinh nhật vào 15/08/1990")
# "Sinh nhật vào mười lăm tháng tám năm một nghìn chín trăm chín mươi"

# Times
normalizer.normalize("Cuộc họp lúc 9:30")
# "Cuộc họp lúc chín giờ ba mươi"

# Currency
normalizer.normalize("Giá là 1.500.000 đồng")
# "Giá là một triệu năm trăm nghìn đồng"

# Percentages
normalizer.normalize("Tăng 25% so với năm ngoái")
# "Tăng hai mươi lăm phần trăm so với năm ngoái"

Custom Dictionary Paths

from vietnormalizer import VietnameseNormalizer

# Use custom CSV files
normalizer = VietnameseNormalizer(
    acronyms_path="path/to/custom/acronyms.csv",
    non_vietnamese_words_path="path/to/custom/words.csv"
)

Disable Preprocessing

# Only apply dictionary replacements, skip number/date conversion
normalized = normalizer.normalize(text, enable_preprocessing=False)

Reload Dictionaries

# Reload dictionaries without recreating the normalizer
normalizer.reload_dictionaries(
    acronyms_path="path/to/updated/acronyms.csv"
)

Advanced Usage

Using the Processor Directly

For more control, you can use the VietnameseTextProcessor class directly:

from vietnormalizer import VietnameseTextProcessor

processor = VietnameseTextProcessor()

# Convert numbers only
words = processor.number_to_words("123")
# "một trăm hai mươi ba"

# Process text without dictionary replacements
processed = processor.process_vietnamese_text("Hôm nay là 25/12/2023")
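
To give a feel for what number conversion involves, the following is a minimal standalone sketch, not the library's implementation. The name `sketch_number_to_words` is hypothetical, and the choice of "linh" (rather than "lẻ") for a zero in the tens place is an assumption. It follows the common reading rules: "mốt" after "mươi", "lăm" for a trailing five, and "không trăm" for an empty hundreds place inside a larger number.

```python
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]
UNITS = ["", "nghìn", "triệu", "tỷ"]  # one entry per group of three digits


def _two_digits(n: int) -> str:
    """Read 0..99."""
    tens, ones = divmod(n, 10)
    if tens == 0:
        return DIGITS[ones]
    words = ["mười"] if tens == 1 else [DIGITS[tens], "mươi"]
    if ones == 1 and tens > 1:
        words.append("mốt")        # 21 -> hai mươi mốt
    elif ones == 5:
        words.append("lăm")        # 15 -> mười lăm, 25 -> hai mươi lăm
    elif ones:
        words.append(DIGITS[ones])
    return " ".join(words)


def _three_digits(n: int, pad: bool) -> str:
    """Read 0..999; pad=True forces the hundreds word (e.g. 'không trăm')."""
    hundreds, rest = divmod(n, 100)
    words = []
    if hundreds or pad:
        words += [DIGITS[hundreds], "trăm"]
        if 0 < rest < 10:
            words.append("linh")   # assumption: "linh" rather than "lẻ"
    if rest or not words:
        words.append(_two_digits(rest))
    return " ".join(words)


def sketch_number_to_words(n: int) -> str:
    """Read a non-negative integer below 10**12 (hypothetical helper)."""
    if not 0 <= n < 10**12:
        raise ValueError("supported range is 0 <= n < 10**12")
    if n == 0:
        return DIGITS[0]
    groups = []
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)       # least significant group first
    parts = []
    for i in range(len(groups) - 1, -1, -1):
        if groups[i] == 0:
            continue               # skip empty groups entirely
        text = _three_digits(groups[i], pad=i != len(groups) - 1)
        parts.append(f"{text} {UNITS[i]}".strip())
    return " ".join(parts)


print(sketch_number_to_words(123))      # một trăm hai mươi ba
print(sketch_number_to_words(2023))     # hai nghìn không trăm hai mươi ba
print(sketch_number_to_words(1500000))  # một triệu năm trăm nghìn
```

Regional variants such as "ngàn" for "nghìn" or "tư" for a trailing four are left out to keep the sketch short.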

CSV Dictionary Format

Acronyms CSV

acronym,transliteration
USA,Hoa Kỳ
GDP,Tổng sản phẩm quốc nội
AI,trí tuệ nhân tạo

Non-Vietnamese Words CSV

word,vietnamese_pronunciation
hello,xin chào
computer,máy tính
internet,in-tơ-nét
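
A custom dictionary in the format above can be produced and checked with the standard `csv` module. The sketch below writes a small acronyms file and reads it back; `load_dictionary` is a hypothetical helper written for this example and is not part of the library.

```python
import csv
import tempfile
from pathlib import Path


def load_dictionary(path, key_column, value_column):
    """Read a two-column CSV with a header row into a dict (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as f:
        return {
            row[key_column].strip(): row[value_column].strip()
            for row in csv.DictReader(f)
            if row.get(key_column) and row.get(value_column)  # skip incomplete rows
        }


# Write a small acronyms file using the header shown above, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "acronyms.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["acronym", "transliteration"])
        writer.writerow(["USA", "Hoa Kỳ"])
        writer.writerow(["AI", "trí tuệ nhân tạo"])
    acronyms = load_dictionary(csv_path, "acronym", "transliteration")

print(acronyms)
```

A file written this way should be usable as the `acronyms_path` argument shown earlier, provided it is saved as UTF-8.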

Performance

The library is optimized for performance:

  • All regex patterns are pre-compiled at initialization
  • Dictionary replacements use a single combined regex pass
  • Minimal memory allocations during processing
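
The single-pass idea in the list above can be sketched as follows. This is an illustration of the general technique, not the library's code: `build_replacer` is a hypothetical name, and whole-word, case-sensitive matching is an assumption made for the example.

```python
import re


def build_replacer(mapping):
    """Compile one alternation over all keys and return a replace function."""
    if not mapping:
        return lambda text: text  # nothing to replace
    # Longest keys first so a longer entry wins over a shorter prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )

    def replace(text):
        # One scan of the text; each match is looked up in the dict.
        return pattern.sub(lambda m: mapping[m.group(0)], text)

    return replace


replace = build_replacer({
    "GDP": "Tổng sản phẩm quốc nội",
    "AI": "trí tuệ nhân tạo",
})
print(replace("GDP và AI"))  # Tổng sản phẩm quốc nội và trí tuệ nhân tạo
```

Compiling the pattern once and reusing the returned function avoids rebuilding the regex for every input, and a single alternation scans the text once regardless of how many entries the dictionary holds.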

Requirements

  • Python 3.8+
  • No external dependencies (uses only standard library)

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

This library is ported from JavaScript implementations used in Vietnamese TTS systems.
