Vietnamese Social Media Lexical Normalization Toolkit
📦 ViSoNorm Toolkit — Vietnamese Text Normalization & Processing
ViSoNorm is a specialized toolkit for Vietnamese text normalization and processing, optimized for NLP environments and easily installable via PyPI. Resources (datasets, models) are stored and managed directly on Hugging Face Hub and GitHub Releases.
🚀 Key Features
1. 🔧 BasicNormalizer — Basic Text Normalization
- Case folding: convert entire text to lowercase/uppercase/capitalize.
- Tone normalization: normalize Vietnamese tone marks.
- Basic preprocessing: remove extra whitespace, special characters, sentence formatting.
2. 😀 EmojiHandler — Emoji Processing
- Detect emojis: detect emojis in text.
- Split emoji text: separate emojis from sentences.
- Remove emojis: remove all emojis.
3. ✏️ Lexical Normalization — Social Media Text Normalization
- ViSoLexNormalizer: Normalize text using deep learning models from HuggingFace.
- NswDetector: Detect non-standard words (NSW).
- detect_nsw(): Utility function to detect NSW.
- normalize_sentence(): Utility function to normalize sentences.
4. 📊 Resource Management — Dataset Management
- list_datasets() — List available datasets.
- load_dataset() — Load a dataset from GitHub Releases.
- get_dataset_info() — View detailed dataset information.
5. 🧠 Task Models — Task Processing Models
- SpamReviewDetection — Spam detection.
- HateSpeechDetection — Hate speech detection.
- HateSpeechSpanDetection — Hate speech span detection.
- EmotionRecognition — Emotion recognition.
- AspectSentimentAnalysis — Aspect-based sentiment analysis.
📥 Installation
Install from PyPI (Recommended)
```bash
pip install visonorm
```
Requirements
- Python >= 3.10
- PyTorch >= 1.10.0
- Transformers >= 4.0.0
- scikit-learn >= 0.24.0
- pandas >= 1.3.0
📚 Usage Guide
1. 🔧 BasicNormalizer — Basic Text Normalization
```python
from visonorm import BasicNormalizer

# Initialize BasicNormalizer
normalizer = BasicNormalizer()

# Example text
text = "Hôm nay tôi rất VUI 😊 và HẠNH PHÚC 🎉!"

# Case folding
print(normalizer.case_folding(text, mode='lower'))
# Output: hôm nay tôi rất vui 😊 và hạnh phúc 🎉!
print(normalizer.case_folding(text, mode='upper'))
# Output: HÔM NAY TÔI RẤT VUI 😊 VÀ HẠNH PHÚC 🎉!
print(normalizer.case_folding(text, mode='capitalize'))
# Output: Hôm Nay Tôi Rất Vui 😊 Và Hạnh Phúc 🎉!

# Tone normalization
text2 = "Bận xong rồi. Xoã đi :)"
print(normalizer.tone_normalization(text2))
# Output: Bận xong rồi. Xõa đi :)

# Basic normalization with options
normalized = normalizer.basic_normalizer(
    text,
    case_folding=True,
    mode='lower',
    remove_emoji=False,
    split_emoji=True
)
print(normalized)
# Output: ['hôm', 'nay', 'tôi', 'rất', 'vui', '😊', 'và', 'hạnh', 'phúc', '🎉', '!']

# Remove emojis
normalized_no_emoji = normalizer.basic_normalizer(
    text,
    case_folding=True,
    remove_emoji=True
)
print(normalized_no_emoji)
# Output: ['hôm', 'nay', 'tôi', 'rất', 'vui', 'và', 'hạnh', 'phúc', '!']
```
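For intuition, the tone-normalization step can be approximated with a small replacement table that moves the tone mark within vowel clusters (e.g. "oã" → "õa", matching the example output above). This is only a sketch with a hypothetical OLD_TO_NEW table, not the toolkit's actual rule set:

```python
# Toy sketch: move the tone mark in a few vowel clusters, following the
# convention shown in the example output ("Xoã" -> "Xõa").
# The real BasicNormalizer covers the full set of Vietnamese clusters.
OLD_TO_NEW = {
    "oà": "òa", "oá": "óa", "oả": "ỏa", "oã": "õa", "oạ": "ọa",
    "uỳ": "ùy", "uý": "úy", "uỷ": "ủy", "uỹ": "ũy", "uỵ": "ụy",
}

def tone_normalize(text: str) -> str:
    # Plain substring replacement is enough for this illustration.
    for old, new in OLD_TO_NEW.items():
        text = text.replace(old, new)
    return text

print(tone_normalize("Bận xong rồi. Xoã đi :)"))
# → Bận xong rồi. Xõa đi :)
```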
2. 😊 EmojiHandler — Emoji Processing
```python
from visonorm import EmojiHandler

# Initialize EmojiHandler
emoji_handler = EmojiHandler()

text = "Hôm nay tôi rất vui 😊🎉😊 và hạnh phúc 🎉!"

# Detect emojis
emojis = emoji_handler.detect_emoji(text)
print(f"Detected emojis: {emojis}")
# Output: Detected emojis: ['😊🎉😊', '🎉']

# Split emojis from the surrounding text
split_text = emoji_handler.split_emoji_text(text)
print(f"Split emoji text: {split_text}")
# Output: Hôm nay tôi rất vui 😊 🎉 😊 và hạnh phúc 🎉 !

# Split consecutive emojis
text_consecutive = "Hôm nay tôi rất vui 😊🎉😊"
split_consecutive = emoji_handler.split_emoji_emoji(text_consecutive)
print(f"Split consecutive: {split_consecutive}")
# Output: Hôm nay tôi rất vui 😊 🎉 😊

# Remove emojis
text_no_emoji = emoji_handler.remove_emojis(text)
print(f"Text without emojis: {text_no_emoji}")
# Output: Hôm nay tôi rất vui và hạnh phúc !
```
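Conceptually, splitting emojis from text amounts to padding each emoji with spaces. A minimal sketch using a rough regex over common emoji code-point ranges (the real EmojiHandler covers the full Unicode emoji data, so this is only an illustration):

```python
import re

# Rough emoji ranges; not exhaustive, just enough for common emojis.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def split_emoji_text(text: str) -> str:
    # Surround every emoji with spaces, then collapse runs of whitespace.
    spaced = EMOJI_RE.sub(lambda m: f" {m.group(0)} ", text)
    return re.sub(r"\s+", " ", spaced).strip()

print(split_emoji_text("Hôm nay tôi rất vui 😊🎉😊"))
# → Hôm nay tôi rất vui 😊 🎉 😊
```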
3. ✏️ Lexical Normalization — Social Media Text Normalization
Using ViSoLexNormalizer
```python
from visonorm import ViSoLexNormalizer

# Initialize with the default model (visolex/visobert-normalizer-mix100)
normalizer = ViSoLexNormalizer()

# Or specify a model from HuggingFace
# normalizer = ViSoLexNormalizer(model_repo="visolex/visobert-normalizer-mix100")
# normalizer = ViSoLexNormalizer(model_repo="visolex/bartpho-normalizer-mix100")

# Normalize a sentence
input_str = "sv dh gia dinh chua cho di lam :))"
normalized = normalizer.normalize_sentence(input_str)
print(f"Original: {input_str}")
print(f"Normalized: {normalized}")
# Output:
# Original: sv dh gia dinh chua cho di lam :))
# Normalized: sinh viên đại học gia đình chưa cho đi làm :))

# Normalize and detect NSW simultaneously
nsw_spans, normalized_text = normalizer.normalize_sentence(input_str, detect_nsw=True)
print(f"Normalized: {normalized_text}")
print("Detected NSW:")
for nsw in nsw_spans:
    print(f" - '{nsw['nsw']}' → '{nsw['prediction']}' (confidence: {nsw['confidence_score']})")
# Output:
# Normalized: sinh viên đại học gia đình chưa cho đi làm :))
# Detected NSW:
# - 'sv' → 'sinh viên' (confidence: 1.0)
# - 'dh' → 'đại học' (confidence: 1.0)
# - 'dinh' → 'đình' (confidence: 1.0)
# - 'chua' → 'chưa' (confidence: 1.0)
# - 'di' → 'đi' (confidence: 1.0)
# - 'lam' → 'làm' (confidence: 1.0)
```
Using NswDetector
```python
from visonorm import NswDetector

# Initialize detector
detector = NswDetector()

# Detect NSW
input_str = "sv dh gia dinh chua cho di lam"
nsw_spans = detector.detect_nsw(input_str)
for nsw in nsw_spans:
    print(f"NSW: '{nsw['nsw']}' → '{nsw['prediction']}' (confidence: {nsw['confidence_score']})")
```
Using Utility Functions
```python
from visonorm import detect_nsw, normalize_sentence

# Detect NSW
nsw_spans = detect_nsw("sv dh gia dinh chua cho di lam")

# Normalize a sentence
normalized = normalize_sentence("sv dh gia dinh chua cho di lam")

# Normalize and detect NSW
nsw_spans, normalized = normalize_sentence("sv dh gia dinh chua cho di lam", detect_nsw=True)
```
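For intuition only, the word-level mapping in the example above can be mimicked with a toy dictionary lookup. The actual normalizer is a fine-tuned transformer that uses sentence context (e.g. to decide that "dinh" in "gia dinh" means "đình"), so this sketch is purely illustrative:

```python
# Toy lookup mirroring the example above; the real ViSoLexNormalizer is a
# context-aware deep model, not a dictionary.
TOY_DICT = {
    "sv": "sinh viên", "dh": "đại học", "chua": "chưa",
    "di": "đi", "lam": "làm",
}

def toy_normalize(sentence: str) -> str:
    # Replace each token if it appears in the dictionary, else keep it.
    return " ".join(TOY_DICT.get(tok, tok) for tok in sentence.split())

print(toy_normalize("sv dh chua cho di lam"))
# → sinh viên đại học chưa cho đi làm
```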
4. 📊 Resource Management — Dataset Management
Datasets are stored on GitHub Releases and automatically downloaded when needed.
```python
from visonorm import list_datasets, load_dataset, get_dataset_info

# List all available datasets
datasets = list_datasets()
print("Available datasets:")
for i, dataset in enumerate(datasets, 1):
    print(f"{i}. {dataset}")

# Get detailed information about a dataset
info = get_dataset_info("ViLexNorm")
print(f"URL: {info['url']}")
print(f"Type: {info['type']}")

# Load a dataset (auto-cached)
df = load_dataset("ViLexNorm")
print(f"Dataset shape: {df.shape}")
print(df.head())

# Force re-download of a dataset
df = load_dataset("ViLexNorm", force_download=True)
```
Available datasets:
- ViLexNorm: Vietnamese Lexical Normalization Dataset
- ViHSD: Vietnamese Hate Speech Detection Dataset
- ViHOS: Vietnamese Hate and Offensive Speech Dataset
- UIT-VSMEC: Vietnamese Social Media Emotion Corpus
- ViSpamReviews: Vietnamese Spam Review Detection Dataset
- UIT-ViSFD: Vietnamese Sentiment and Emotion Detection Dataset
- UIT-ViCTSD: Vietnamese Customer Review Sentiment Dataset
- ViTHSD: Vietnamese Toxic Hate Speech Detection Dataset
- BKEE: Vietnamese Emotion Recognition Dataset
- UIT-ViQuAD: Vietnamese Question Answering Dataset
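The auto-caching described above follows a common pattern: download once into a local cache directory and reuse the file until a re-download is forced. A generic sketch of that pattern (the cache location and the fetch callback are assumptions for illustration, not the package's actual internals):

```python
from pathlib import Path

def cached_path(name, fetch, cache_dir=None, force_download=False):
    """Return the local path for `name`, calling `fetch(path)` only when
    the file is missing or a re-download is forced."""
    cache_dir = Path(cache_dir or Path.home() / ".cache" / "visonorm")
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / name
    if force_download or not path.exists():
        fetch(path)  # e.g. download the GitHub Release asset to `path`
    return path
```

On the second call with the same name, `fetch` is skipped entirely; passing `force_download=True` mirrors the `load_dataset` flag shown above.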
5. 🧠 Task Models — Task Processing Models
All task models are stored on HuggingFace Hub at https://huggingface.co/visolex.
SpamReviewDetection — Spam Detection
```python
from visonorm import SpamReviewDetection

# View available models
models = SpamReviewDetection.list_models()
print("Available models:", SpamReviewDetection.list_model_names())

# Initialize with the phobert-v1 model (binary classification)
spam_detector = SpamReviewDetection("phobert-v1")
# Or use other models
# spam_detector = SpamReviewDetection("phobert-v1-multiclass")  # Multiclass model

# Detect spam
text = "Sản phẩm rất tốt, chất lượng cao!"
result = spam_detector.predict(text)
print(f"Text: {text}")
print(f"Result: {result}")
# Output: Result: Non-spam
```
HateSpeechDetection — Hate Speech Detection
```python
from visonorm import HateSpeechDetection

# View available models
print("Available models:", HateSpeechDetection.list_model_names())

# Initialize detector
hate_detector = HateSpeechDetection("phobert-v1")
# Or: HateSpeechDetection("phobert-v2"), HateSpeechDetection("visobert"), etc.

# Detect hate speech
text = "Văn bản cần kiểm tra hate speech"
result = hate_detector.predict(text)
print(f"Result: {result}")
# Output: Result: CLEAN
```
HateSpeechSpanDetection — Hate Speech Span Detection
```python
from visonorm import HateSpeechSpanDetection

# View available models
print("Available models:", HateSpeechSpanDetection.list_model_names())

# Initialize detector
hate_span_detector = HateSpeechSpanDetection("phobert-v1")
# Or: HateSpeechSpanDetection("vihate-t5"), HateSpeechSpanDetection("visobert"), etc.

# Detect spans
text = "Nói cái lồn gì mà khó nghe"
result = hate_span_detector.predict(text)
print(f"Result: {result}")
# Output: {'tokens': [...], 'text': '...'}
```
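Span detectors of this kind typically tag each token and then merge contiguous tagged tokens into spans. A generic BIO-tag merger, shown as an illustration (the label names are hypothetical and the toolkit's actual output format may differ):

```python
def bio_to_spans(tokens, labels):
    """Merge contiguous B-/I- tagged tokens into (start, end, text) spans."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:  # close the previous span
                spans.append((start, i - 1, " ".join(tokens[start:i])))
            start = i
        elif lab.startswith("I-") and start is not None:
            continue  # span keeps growing
        else:
            if start is not None:
                spans.append((start, i - 1, " ".join(tokens[start:i])))
                start = None
    if start is not None:  # span runs to the end of the sentence
        spans.append((start, len(labels) - 1, " ".join(tokens[start:])))
    return spans

tokens = ["Nói", "cái", "lồn", "gì", "mà", "khó", "nghe"]
labels = ["O", "B-TOXIC", "I-TOXIC", "O", "O", "O", "O"]
print(bio_to_spans(tokens, labels))
# → [(1, 2, 'cái lồn')]
```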
EmotionRecognition — Emotion Recognition
```python
from visonorm import EmotionRecognition

# View available models
print("Available models:", EmotionRecognition.list_model_names())

# Initialize detector
emotion_detector = EmotionRecognition("phobert-v2")
# Or: EmotionRecognition("phobert-v1"), EmotionRecognition("visobert"), etc.

# Recognize emotion
text = "Tôi rất vui mừng và hạnh phúc!"
emotion = emotion_detector.predict(text)
print(f"Emotion: {emotion}")
# Output: Emotion: Enjoyment
```
AspectSentimentAnalysis — Aspect-based Sentiment Analysis
```python
from visonorm import AspectSentimentAnalysis

# View available domains
print("Available domains:", AspectSentimentAnalysis.list_domains())

# View available models for a specific domain
print("Models for smartphone:", AspectSentimentAnalysis.list_model_names("smartphone"))
print("Models for restaurant:", AspectSentimentAnalysis.list_model_names("restaurant"))
print("Models for hotel:", AspectSentimentAnalysis.list_model_names("hotel"))

# Initialize with the smartphone domain and phobert model
absa = AspectSentimentAnalysis("smartphone", "phobert")
# Or use other models: "phobert-v2", "bartpho", "vit5", "visobert", etc.
# Or other domains
# absa = AspectSentimentAnalysis("restaurant", "phobert-v1")
# absa = AspectSentimentAnalysis("hotel", "phobert-v1")

# Analyze sentiment
text = "Điện thoại có camera rất tốt nhưng pin nhanh hết"
aspects = absa.predict(text, threshold=0.25)
print(f"Aspects: {aspects}")
# Output: [('BATTERY', 'neutral'), ('FEATURES', 'neutral'), ('PERFORMANCE', 'positive'), ...]
```
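A threshold like the one above typically gates per-aspect probabilities: an aspect is reported only if its best sentiment score clears the cutoff. A sketch of that selection logic (the score structure here is hypothetical, not the model's actual output):

```python
def select_aspects(scores, threshold=0.25):
    """scores: {aspect: {sentiment: probability}} -> [(aspect, best_sentiment)]
    Keep each aspect whose highest-probability sentiment clears the threshold."""
    picked = []
    for aspect, sentiments in scores.items():
        label, prob = max(sentiments.items(), key=lambda kv: kv[1])
        if prob >= threshold:
            picked.append((aspect, label))
    return picked

toy_scores = {
    "CAMERA":  {"positive": 0.90, "negative": 0.05},
    "BATTERY": {"positive": 0.10, "negative": 0.80},
    "SCREEN":  {"positive": 0.20, "negative": 0.10},  # below threshold: dropped
}
print(select_aspects(toy_scores, threshold=0.25))
# → [('CAMERA', 'positive'), ('BATTERY', 'negative')]
```

Raising the threshold trades recall for precision: fewer aspects are reported, but with higher confidence.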
6. 🎯 Advanced Usage
Combining Multiple Functions
```python
from visonorm import BasicNormalizer, EmojiHandler, ViSoLexNormalizer

def process_text_advanced(text):
    """Process text with multiple steps"""
    print(f"Original text: {text}")

    # Step 1: Emoji processing
    emoji_handler = EmojiHandler()
    emojis = emoji_handler.detect_emoji(text)
    print(f"Detected emojis: {emojis}")

    # Step 2: Basic normalization
    normalizer = BasicNormalizer()
    normalized = normalizer.basic_normalizer(text, case_folding=True)
    print(f"Basic normalized: {normalized}")

    # Step 3: Lexical normalization with deep learning
    lex_normalizer = ViSoLexNormalizer()
    final_normalized = lex_normalizer.normalize_sentence(text)
    print(f"Lexical normalized: {final_normalized}")

    return {
        'original': text,
        'emojis': emojis,
        'basic_normalized': normalized,
        'lexical_normalized': final_normalized
    }

# Test
result = process_text_advanced("Hôm nay tôi rất😊 VUI 😊😊 và HẠNH PHÚC!")
```
🌐 Resources
HuggingFace Hub
All models and resources are published on HuggingFace Hub:
- Organization: https://huggingface.co/visolex
- Models: View full list at https://huggingface.co/visolex
Available normalization models:
- visolex/visobert-normalizer-mix100 (default)
- visolex/bartpho-normalizer-mix100
GitHub Releases
Datasets are stored as GitHub Releases and automatically downloaded when used:
- Repository: https://github.com/AnhHoang0529/visonorm
- Releases: https://github.com/AnhHoang0529/visonorm/releases
🔬 Examples
See test_toolkit.ipynb for more detailed and comprehensive examples.
📝 Citation
If you use ViSoNorm in your research, please cite:
```bibtex
@misc{visonorm2024,
  title={ViSoNorm: Vietnamese Social Media Lexical Normalization Toolkit},
  author={Ha Dung Nguyen},
  year={2024},
  url={https://github.com/AnhHoang0529/visonorm},
  note={Available at https://huggingface.co/visolex}
}
```
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
👥 Authors
- Anh Thi-Hoang Nguyen - Maintainer - anhnth@uit.edu.vn
- Ha Dung Nguyen - Maintainer - dungngh@uit.edu.vn
🙏 Acknowledgments
- HuggingFace for hosting models and providing the transformers library
- The Vietnamese NLP community for datasets and feedback
📞 Contact & Support
- GitHub Issues: https://github.com/AnhHoang0529/visonorm/issues
- Email: anhnth@uit.edu.vn
- HuggingFace: https://huggingface.co/visolex
🔗 Links
- GitHub Repository: https://github.com/AnhHoang0529/visonorm
- PyPI Package: https://pypi.org/project/visonorm/
- HuggingFace Hub: https://huggingface.co/visolex
- Documentation: https://github.com/AnhHoang0529/visonorm