An effective text normalization tool for Vietnamese

Soe Vinorm - Vietnamese Text Normalization Toolkit

Soe Vinorm is an effective and extensible toolkit for Vietnamese text normalization, designed for use in Text-to-Speech (TTS) and NLP pipelines. It detects and expands non-standard words (NSWs) such as numbers, dates, abbreviations, and more, converting them into their spoken forms. This project is based on the paper Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech.

Installation

Option 1: Clone the repository (for development)

# Clone the repository
git clone https://github.com/vinhdq842/soe-vinorm.git
cd soe-vinorm

# Install dependencies including development dependencies (using uv)
uv sync --dev

Option 2: Install from PyPI

# Install using uv
uv add soe-vinorm

# Or using pip
pip install soe-vinorm

Option 3: Install from source

# Install directly from GitHub
uv pip install git+https://github.com/vinhdq842/soe-vinorm.git

Usage

from soe_vinorm import SoeNormalizer

normalizer = SoeNormalizer()
text = 'Từ năm 2021 đến nay, đây là lần thứ 3 Bộ Công an xây dựng thông tư để quy định liên quan đến mẫu hộ chiếu, giấy thông hành.'

result = normalizer.normalize(text)
print(result)
# Output: Từ năm hai nghìn không trăm hai mươi mốt đến nay , đây là lần thứ ba Bộ Công an xây dựng thông tư để quy định liên quan đến mẫu hộ chiếu , giấy thông hành .

Quick function usage

from soe_vinorm import normalize_text

text = "1kg dâu 25 quả, giá 700.000 - Trung bình 30.000đ/quả"
result = normalize_text(text)
print(result)
# Output: một ki lô gam dâu hai mươi lăm quả , giá bảy trăm nghìn - Trung bình ba mươi nghìn đồng trên quả

Batch processing

from soe_vinorm import batch_normalize_texts

texts = [
    "Tôi có 123.456 đồng trong tài khoản",
    "ĐT Việt Nam giành HCV tại SEA Games 32",
    "Nhiệt độ hôm nay là 25°C, ngày 25/04/2014",
    "Tốc độ xe đạt 60km/h trên quãng đường 150km"
]

# Process multiple texts in parallel (4 worker processes)
results = batch_normalize_texts(texts, n_jobs=4)

for original, normalized in zip(texts, results):
    print(f"Original: {original}")
    print(f"Normalized: {normalized}")
    print("-" * 50)

Output:

Original: Tôi có 123.456 đồng trong tài khoản
Normalized: Tôi có một trăm hai mươi ba nghìn bốn trăm năm mươi sáu đồng trong tài khoản
--------------------------------------------------
Original: ĐT Việt Nam giành HCV tại SEA Games 32
Normalized: đội tuyển Việt Nam giành Huy chương vàng tại SEA Games ba mươi hai
--------------------------------------------------
Original: Nhiệt độ hôm nay là 25°C, ngày 25/04/2014
Normalized: Nhiệt độ hôm nay là hai mươi lăm độ xê , ngày hai mươi lăm tháng bốn năm hai nghìn không trăm mười bốn
--------------------------------------------------
Original: Tốc độ xe đạt 60km/h trên quãng đường 150km
Normalized: Tốc độ xe đạt sáu mươi ki lô mét trên giờ trên quãng đường một trăm năm mươi ki lô mét
--------------------------------------------------

Approach: Two-stage normalization

Preprocessing & tokenizing

Extra whitespace, ASCII art, emojis, HTML entities, unspoken words, and similar noise are removed.
A regex-based tokenizer then splits the sentence into tokens.
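
The cleanup and tokenizing step can be sketched roughly as follows. This is a minimal illustration, not the toolkit's actual code: the names `EMOJI_RE`, `TOKEN_RE`, `preprocess`, and `tokenize` are hypothetical, the patterns cover only a few of the cases listed above (no ASCII-art or unspoken-word removal), and the real rules are assumed to be more extensive.

```python
import html
import re
import unicodedata

# Hypothetical patterns for illustration only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
TOKEN_RE = re.compile(
    r"\d+(?:[.,/:-]\d+)*[^\W\d_]*(?:/[^\W\d_]+)?"  # numbers, dates, "30.000đ/quả"
    r"|[^\W\d_]+"                                   # words made of Unicode letters
    r"|[^\w\s]"                                     # single punctuation marks
)


def preprocess(text: str) -> str:
    """Decode HTML entities, drop emojis, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)  # compose Vietnamese diacritics
    text = html.unescape(text)
    text = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split a cleaned sentence into word, number-like, and punctuation tokens."""
    return TOKEN_RE.findall(preprocess(text))
```

Keeping number-like strings such as `25/04/2014` or `30.000đ/quả` as single tokens leaves them intact for the detection stage instead of scattering their parts.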

Stage 1: Non-standard word detection

A sequence tagger extracts non-standard words (NSWs) and assigns each one to one of 18 types.
The type determines how each NSW is verbalized later.
Any sequence labeling model can serve as the tagger. This implementation uses a Conditional Random Field (CRF) because training data is scarce.
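
To make the detection stage concrete, here is a rough sketch of the two pieces that surround a CRF tagger: per-token features going in, and typed NSW spans coming out. Everything here is an assumption for illustration: the feature names, the BIO tagging scheme, and the labels `DATE` and `NUMBER` are hypothetical and are not the toolkit's 18 types. The CRF itself is omitted; the tags in the usage example are written by hand.

```python
def token_features(tokens: list[str], i: int) -> dict[str, object]:
    """Build a simple feature dict for token i (hypothetical feature set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "has_digit": any(c.isdigit() for c in tok),
        "is_upper": tok.isupper(),
        "has_slash": "/" in tok,
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def decode_bio(tags: list[str]) -> list[tuple[int, int, str]]:
    """Turn BIO tags into (start, end, label) spans, with end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["ngày", "25/04/2014", ",", "lần", "thứ", "3"]
tags = ["O", "B-DATE", "O", "O", "O", "B-NUMBER"]  # hand-written stand-in for tagger output
spans = decode_bio(tags)
```

Feature dicts of this shape are what CRF libraries typically consume, one list of dicts per sentence.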

Stage 2: Non-standard word normalization

Given the NSWs detected in Stage 1 and their types, regex-based expanders produce the normalized forms.
Each NSW type has its own dedicated expander.
The normalized forms are then substituted back into the original sentence to produce the final normalized sentence.
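
The dispatch-and-substitute idea can be sketched as below. This is a simplified illustration under stated assumptions: the type labels `NUMBER` and `DATE`, the `EXPANDERS` registry, and all function names are hypothetical, and the number reader only covers 0 to 99, whereas the toolkit handles far more cases.

```python
import re

DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def read_below_100(n: int) -> str:
    """Spell out 0..99 in Vietnamese (larger numbers are out of scope here)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    tens, ones = divmod(n, 10)
    if tens == 0:
        return DIGITS[ones]
    head = "mười" if tens == 1 else f"{DIGITS[tens]} mươi"
    if ones == 0:
        return head
    if ones == 5:
        tail = "lăm"   # 15 -> "mười lăm", 25 -> "hai mươi lăm"
    elif ones == 1 and tens > 1:
        tail = "mốt"   # 21 -> "hai mươi mốt", but 11 -> "mười một"
    else:
        tail = DIGITS[ones]
    return f"{head} {tail}"


def expand_number(nsw: str) -> str:
    """Expand a 1-2 digit number; anything else is left unchanged."""
    if re.fullmatch(r"[0-9]{1,2}", nsw):
        return read_below_100(int(nsw))
    return nsw


def expand_date(nsw: str) -> str:
    """Expand a day/month date such as "25/04"; anything else is left unchanged."""
    m = re.fullmatch(r"([0-9]{1,2})/([0-9]{1,2})", nsw)
    if m is None:
        return nsw
    return f"{read_below_100(int(m.group(1)))} tháng {read_below_100(int(m.group(2)))}"


# One expander per NSW type (hypothetical labels).
EXPANDERS = {"NUMBER": expand_number, "DATE": expand_date}


def normalize(tokens: list[str], spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with its expansion and rejoin."""
    out, pos = [], 0
    for start, end, label in sorted(spans):
        out.extend(tokens[pos:start])
        nsw = " ".join(tokens[start:end])
        out.append(EXPANDERS.get(label, lambda s: s)(nsw))
        pos = end
    out.extend(tokens[pos:])
    return " ".join(out)
```

Keeping the expanders in a registry keyed by type means a new NSW type only needs a new function and one dictionary entry; unknown types fall through unchanged.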

Minor details

Foreign NSWs are currently kept as is.
Abbreviation NSWs are expanded using a language model (BERT) combined with a Vietnamese abbreviation dictionary.
...
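
One plausible way to combine a dictionary with a context-aware scorer is sketched below. The description above does not say how the two are actually combined, so this is only an assumption: the dictionary proposes candidate expansions and a scoring function picks the one that best fits the context. The `ABBREVIATIONS` table, `expand_abbreviation`, and `toy_scorer` are all hypothetical, and `toy_scorer` is a cue-word counter standing in for a real language-model score.

```python
from typing import Callable

# Hypothetical miniature dictionary; a real one would be much larger.
ABBREVIATIONS = {
    "ĐT": ["đội tuyển", "điện thoại"],
    "HCV": ["Huy chương vàng"],
}

# A scorer rates how well a candidate fits at tokens[index]; higher is better.
Scorer = Callable[[list[str], int, str], float]


def expand_abbreviation(tokens: list[str], index: int, scorer: Scorer) -> str:
    """Pick the best dictionary expansion for tokens[index], if any."""
    candidates = ABBREVIATIONS.get(tokens[index])
    if not candidates:
        return tokens[index]  # unknown abbreviation: leave unchanged
    if len(candidates) == 1:
        return candidates[0]  # unambiguous: no scoring needed
    return max(candidates, key=lambda c: scorer(tokens, index, c))


def toy_scorer(tokens: list[str], index: int, candidate: str) -> float:
    """Stand-in for a language-model score: counts hand-picked cue words."""
    cues = {
        "đội tuyển": {"giành", "HCV", "trận"},
        "điện thoại": {"số", "gọi", "pin"},
    }
    context = set(tokens[:index] + tokens[index + 1:])
    return float(len(cues.get(candidate, set()) & context))
```

In a real setup, the scorer would be replaced by something like a masked language model's likelihood of each candidate in the sentence; the selection logic would stay the same.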

Testing

Run all tests with:

pytest tests

Author

Vinh Dang (quangvinh0842@gmail.com)

License

MIT License

Project details

Maintainers

vinhdq842

These details have not been verified by PyPI

Release history

0.3.2 (Oct 17, 2025)
0.3.1 (Oct 7, 2025)
0.3.0 (Oct 5, 2025)
0.2.2 (Sep 7, 2025)
0.2.1 (Aug 27, 2025)
0.2.0 (Aug 13, 2025), this version
0.1.7 (Aug 10, 2025)
0.1.6 (Aug 2, 2025)
0.1.5 (Jul 1, 2025)
0.1.4 (Jun 30, 2025)
0.1.3 (Jun 26, 2025)
0.1.2 (Jun 25, 2025)
0.1.1 (Jun 24, 2025)
0.1.0 (Jun 24, 2025)

Download files

Source Distribution

soe_vinorm-0.2.0.tar.gz (191.4 kB)

Uploaded Aug 13, 2025 Source

Built Distribution

soe_vinorm-0.2.0-py3-none-any.whl (86.7 kB)

Uploaded Aug 13, 2025 Python 3

File details

Details for the file soe_vinorm-0.2.0.tar.gz.

File metadata

Download URL: soe_vinorm-0.2.0.tar.gz
Upload date: Aug 13, 2025
Size: 191.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.7.19

File hashes

Hashes for soe_vinorm-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7854180f9170e54754e9b3a30a6e41d74ba1300e503ce4bea6329f73e8293262`
MD5	`70d0328e596a61f0dc7ae251af9c1fa9`
BLAKE2b-256	`392c8d75597e17ade43b5741b0cf15f4fdc540f750c133e97c9da3ef16ad3cd0`

File details

Details for the file soe_vinorm-0.2.0-py3-none-any.whl.

File metadata

Download URL: soe_vinorm-0.2.0-py3-none-any.whl
Upload date: Aug 13, 2025
Size: 86.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.7.19

File hashes

Hashes for soe_vinorm-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`02b7e952f78913df9913fba7ad70288086ca2b76e8c8ccf46c6fd67917148261`
MD5	`e39d9dc341a217c1aac24cc99385c3f4`
BLAKE2b-256	`79eb5f73ed93909c0b65bd4c5c6222eccdcdb0bfaa92f0dc742b2e469c9ea5c2`
