Vietnamese Social Media Lexical Normalization Toolkit
📦 ViSoNorm Toolkit — Vietnamese Text Normalization & Processing
ViSoNorm is a specialized toolkit for Vietnamese text normalization and processing, optimized for NLP environments and easily installable via PyPI. Resources (datasets, models) are stored and managed directly on Hugging Face Hub and GitHub Releases.
🚀 Key Features
1. 🔧 BasicNormalizer — Basic Text Normalization
- Case folding: convert entire text to lowercase/uppercase/capitalize.
- Tone normalization: normalize Vietnamese tone marks.
- Basic preprocessing: remove extra whitespace, special characters, sentence formatting.
2. 😀 EmojiHandler — Emoji Processing
- Detect emojis: detect emojis in text.
- Split emoji text: separate emojis from sentences.
- Remove emojis: remove all emojis.
3. ✏️ Lexical Normalization — Social Media Text Normalization
- ViSoLexNormalizer: Normalize text using deep learning models from HuggingFace.
- NswDetector: Detect non-standard words (NSW).
- detect_nsw(): Utility function to detect NSW.
- normalize_sentence(): Utility function to normalize sentences.
4. 📊 Resource Management — Dataset Management
- list_datasets() — List available datasets.
- load_dataset() — Load a dataset from GitHub Releases.
- get_dataset_info() — View detailed dataset information.
5. 🧠 Task Models — Task Processing Models
- SpamReviewDetection — Spam detection.
- HateSpeechDetection — Hate speech detection.
- HateSpeechSpanDetection — Hate speech span detection.
- EmotionRecognition — Emotion recognition.
- AspectSentimentAnalysis — Aspect-based sentiment analysis.
📥 Installation
Install from PyPI (Recommended)
```bash
pip install visonorm
```
Requirements
- Python >= 3.10
- PyTorch >= 1.10.0
- Transformers >= 4.0.0
- scikit-learn >= 0.24.0
- pandas >= 1.3.0
📚 Usage Guide
1. 🔧 BasicNormalizer — Basic Text Normalization
```python
from visonorm import BasicNormalizer

# Initialize BasicNormalizer
normalizer = BasicNormalizer()

# Example text
text = "Hôm nay tôi rất VUI 😊 và HẠNH PHÚC 🎉!"

# Case folding
print(normalizer.case_folding(text, mode='lower'))
# Output: hôm nay tôi rất vui 😊 và hạnh phúc 🎉!
print(normalizer.case_folding(text, mode='upper'))
# Output: HÔM NAY TÔI RẤT VUI 😊 VÀ HẠNH PHÚC 🎉!
print(normalizer.case_folding(text, mode='capitalize'))
# Output: Hôm Nay Tôi Rất Vui 😊 Và Hạnh Phúc 🎉!

# Tone normalization
text2 = "Bận xong rồi. Xoã đi :)"
print(normalizer.tone_normalization(text2))
# Output: Bận xong rồi. Xõa đi :)

# Basic normalization with options
normalized = normalizer.basic_normalizer(
    text,
    case_folding=True,
    mode='lower',
    remove_emoji=False,
    split_emoji=True
)
print(normalized)
# Output: ['hôm', 'nay', 'tôi', 'rất', 'vui', '😊', 'và', 'hạnh', 'phúc', '🎉', '!']

# Remove emojis
normalized_no_emoji = normalizer.basic_normalizer(
    text,
    case_folding=True,
    remove_emoji=True
)
print(normalized_no_emoji)
# Output: ['hôm', 'nay', 'tôi', 'rất', 'vui', 'và', 'hạnh', 'phúc', '!']
```
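For intuition, the tone-normalization step can be approximated with a small replacement table that moves the tone mark within vowel clusters (e.g. "oã" → "õa", matching the example output above). This is only a sketch with a hypothetical OLD_TO_NEW table, not the toolkit's actual rule set:

```python
# Toy sketch: move the tone mark in a few vowel clusters, following the
# convention shown in the example output ("Xoã" -> "Xõa").
# The real BasicNormalizer covers the full set of Vietnamese clusters.
OLD_TO_NEW = {
    "oà": "òa", "oá": "óa", "oả": "ỏa", "oã": "õa", "oạ": "ọa",
    "uỳ": "ùy", "uý": "úy", "uỷ": "ủy", "uỹ": "ũy", "uỵ": "ụy",
}

def tone_normalize(text: str) -> str:
    # Plain substring replacement is enough for this illustration.
    for old, new in OLD_TO_NEW.items():
        text = text.replace(old, new)
    return text

print(tone_normalize("Bận xong rồi. Xoã đi :)"))
# → Bận xong rồi. Xõa đi :)
```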
2. 😊 EmojiHandler — Emoji Processing
```python
from visonorm import EmojiHandler

# Initialize EmojiHandler
emoji_handler = EmojiHandler()

text = "Hôm nay tôi rất vui 😊🎉😊 và hạnh phúc 🎉!"

# Detect emojis
emojis = emoji_handler.detect_emoji(text)
print(f"Detected emojis: {emojis}")
# Output: Detected emojis: ['😊🎉😊', '🎉']

# Split emojis from the surrounding text
split_text = emoji_handler.split_emoji_text(text)
print(f"Split emoji text: {split_text}")
# Output: Hôm nay tôi rất vui 😊 🎉 😊 và hạnh phúc 🎉 !

# Split consecutive emojis
text_consecutive = "Hôm nay tôi rất vui 😊🎉😊"
split_consecutive = emoji_handler.split_emoji_emoji(text_consecutive)
print(f"Split consecutive: {split_consecutive}")
# Output: Hôm nay tôi rất vui 😊 🎉 😊

# Remove emojis
text_no_emoji = emoji_handler.remove_emojis(text)
print(f"Text without emojis: {text_no_emoji}")
# Output: Hôm nay tôi rất vui và hạnh phúc !
```
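Conceptually, splitting emojis from text amounts to padding each emoji with spaces. A minimal sketch using a rough regex over common emoji code-point ranges (the real EmojiHandler covers the full Unicode emoji data, so this is only an illustration):

```python
import re

# Rough emoji ranges; not exhaustive, just enough for common emojis.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def split_emoji_text(text: str) -> str:
    # Surround every emoji with spaces, then collapse runs of whitespace.
    spaced = EMOJI_RE.sub(lambda m: f" {m.group(0)} ", text)
    return re.sub(r"\s+", " ", spaced).strip()

print(split_emoji_text("Hôm nay tôi rất vui 😊🎉😊"))
# → Hôm nay tôi rất vui 😊 🎉 😊
```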
3. ✏️ Lexical Normalization — Social Media Text Normalization
Using ViSoLexNormalizer
```python
from visonorm import ViSoLexNormalizer

# Initialize with the default model (visolex/visobert-normalizer-mix100)
normalizer = ViSoLexNormalizer()

# Or specify a model from HuggingFace
# normalizer = ViSoLexNormalizer(model_repo="visolex/visobert-normalizer-mix100")
# normalizer = ViSoLexNormalizer(model_repo="visolex/bartpho-normalizer-mix100")

# Normalize a sentence
input_str = "sv dh gia dinh chua cho di lam :))"
normalized = normalizer.normalize_sentence(input_str)
print(f"Original: {input_str}")
print(f"Normalized: {normalized}")
# Output:
# Original: sv dh gia dinh chua cho di lam :))
# Normalized: sinh viên đại học gia đình chưa cho đi làm :))

# Normalize and detect NSW simultaneously
nsw_spans, normalized_text = normalizer.normalize_sentence(input_str, detect_nsw=True)
print(f"Normalized: {normalized_text}")
print("Detected NSW:")
for nsw in nsw_spans:
    print(f" - '{nsw['nsw']}' → '{nsw['prediction']}' (confidence: {nsw['confidence_score']})")
# Output:
# Normalized: sinh viên đại học gia đình chưa cho đi làm :))
# Detected NSW:
# - 'sv' → 'sinh viên' (confidence: 1.0)
# - 'dh' → 'đại học' (confidence: 1.0)
# - 'dinh' → 'đình' (confidence: 1.0)
# - 'chua' → 'chưa' (confidence: 1.0)
# - 'di' → 'đi' (confidence: 1.0)
# - 'lam' → 'làm' (confidence: 1.0)
```
Using NswDetector
```python
from visonorm import NswDetector

# Initialize detector
detector = NswDetector()

# Detect NSW
input_str = "sv dh gia dinh chua cho di lam"
nsw_spans = detector.detect_nsw(input_str)
for nsw in nsw_spans:
    print(f"NSW: '{nsw['nsw']}' → '{nsw['prediction']}' (confidence: {nsw['confidence_score']})")
```
Using Utility Functions
```python
from visonorm import detect_nsw, normalize_sentence

# Detect NSW
nsw_spans = detect_nsw("sv dh gia dinh chua cho di lam")

# Normalize a sentence
normalized = normalize_sentence("sv dh gia dinh chua cho di lam")

# Normalize and detect NSW
nsw_spans, normalized = normalize_sentence("sv dh gia dinh chua cho di lam", detect_nsw=True)
```
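For intuition only, the word-level mapping in the example above can be mimicked with a toy dictionary lookup. The actual normalizer is a fine-tuned transformer that uses sentence context (e.g. to decide that "dinh" in "gia dinh" means "đình"), so this sketch is purely illustrative:

```python
# Toy lookup mirroring the example above; the real ViSoLexNormalizer is a
# context-aware deep model, not a dictionary.
TOY_DICT = {
    "sv": "sinh viên", "dh": "đại học", "chua": "chưa",
    "di": "đi", "lam": "làm",
}

def toy_normalize(sentence: str) -> str:
    # Replace each token if it appears in the dictionary, else keep it.
    return " ".join(TOY_DICT.get(tok, tok) for tok in sentence.split())

print(toy_normalize("sv dh chua cho di lam"))
# → sinh viên đại học chưa cho đi làm
```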
4. 📊 Resource Management — Dataset Management
Datasets are stored on GitHub Releases and automatically downloaded when needed.
```python
from visonorm import list_datasets, load_dataset, get_dataset_info

# List all available datasets
datasets = list_datasets()
print("Available datasets:")
for i, dataset in enumerate(datasets, 1):
    print(f"{i}. {dataset}")

# Get detailed information about a dataset
info = get_dataset_info("ViLexNorm")
print(f"URL: {info['url']}")
print(f"Type: {info['type']}")

# Load a dataset (auto-cached)
df = load_dataset("ViLexNorm")
print(f"Dataset shape: {df.shape}")
print(df.head())

# Force re-download of a dataset
df = load_dataset("ViLexNorm", force_download=True)
```
Available datasets:
- ViLexNorm: Vietnamese Lexical Normalization Dataset
- ViHSD: Vietnamese Hate Speech Detection Dataset
- ViHOS: Vietnamese Hate and Offensive Speech Dataset
- UIT-VSMEC: Vietnamese Social Media Emotion Corpus
- ViSpamReviews: Vietnamese Spam Review Detection Dataset
- UIT-ViSFD: Vietnamese Sentiment and Emotion Detection Dataset
- UIT-ViCTSD: Vietnamese Customer Review Sentiment Dataset
- ViTHSD: Vietnamese Toxic Hate Speech Detection Dataset
- BKEE: Vietnamese Emotion Recognition Dataset
- UIT-ViQuAD: Vietnamese Question Answering Dataset
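The auto-caching described above follows a common pattern: download once into a local cache directory and reuse the file until a re-download is forced. A generic sketch of that pattern (the cache location and the fetch callback are assumptions for illustration, not the package's actual internals):

```python
from pathlib import Path

def cached_path(name, fetch, cache_dir=None, force_download=False):
    """Return the local path for `name`, calling `fetch(path)` only when
    the file is missing or a re-download is forced."""
    cache_dir = Path(cache_dir or Path.home() / ".cache" / "visonorm")
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / name
    if force_download or not path.exists():
        fetch(path)  # e.g. download the GitHub Release asset to `path`
    return path
```

On the second call with the same name, `fetch` is skipped entirely; passing `force_download=True` mirrors the `load_dataset` flag shown above.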
5. 🧠 Task Models — Task Processing Models
All task models are stored on HuggingFace Hub at https://huggingface.co/visolex.
SpamReviewDetection — Spam Detection
```python
from visonorm import SpamReviewDetection

# View available models
models = SpamReviewDetection.list_models()
print("Available models:", SpamReviewDetection.list_model_names())

# Initialize with the phobert-v1 model (binary classification)
spam_detector = SpamReviewDetection("phobert-v1")
# Or use other models
# spam_detector = SpamReviewDetection("phobert-v1-multiclass")  # Multiclass model

# Detect spam
text = "Sản phẩm rất tốt, chất lượng cao!"
result = spam_detector.predict(text)
print(f"Text: {text}")
print(f"Result: {result}")
# Output: Result: Non-spam
```
HateSpeechDetection — Hate Speech Detection
```python
from visonorm import HateSpeechDetection

# View available models
print("Available models:", HateSpeechDetection.list_model_names())

# Initialize detector
hate_detector = HateSpeechDetection("phobert-v1")
# Or: HateSpeechDetection("phobert-v2"), HateSpeechDetection("visobert"), etc.

# Detect hate speech
text = "Văn bản cần kiểm tra hate speech"
result = hate_detector.predict(text)
print(f"Result: {result}")
# Output: Result: CLEAN
```
HateSpeechSpanDetection — Hate Speech Span Detection
```python
from visonorm import HateSpeechSpanDetection

# View available models
print("Available models:", HateSpeechSpanDetection.list_model_names())

# Initialize detector
hate_span_detector = HateSpeechSpanDetection("phobert-v1")
# Or: HateSpeechSpanDetection("vihate-t5"), HateSpeechSpanDetection("visobert"), etc.

# Detect spans
text = "Nói cái lồn gì mà khó nghe"
result = hate_span_detector.predict(text)
print(f"Result: {result}")
# Output: {'tokens': [...], 'text': '...'}
```
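Span detectors of this kind typically tag each token and then merge contiguous tagged tokens into spans. A generic BIO-tag merger, shown as an illustration (the label names are hypothetical and the toolkit's actual output format may differ):

```python
def bio_to_spans(tokens, labels):
    """Merge contiguous B-/I- tagged tokens into (start, end, text) spans."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:  # close the previous span
                spans.append((start, i - 1, " ".join(tokens[start:i])))
            start = i
        elif lab.startswith("I-") and start is not None:
            continue  # span keeps growing
        else:
            if start is not None:
                spans.append((start, i - 1, " ".join(tokens[start:i])))
                start = None
    if start is not None:  # span runs to the end of the sentence
        spans.append((start, len(labels) - 1, " ".join(tokens[start:])))
    return spans

tokens = ["Nói", "cái", "lồn", "gì", "mà", "khó", "nghe"]
labels = ["O", "B-TOXIC", "I-TOXIC", "O", "O", "O", "O"]
print(bio_to_spans(tokens, labels))
# → [(1, 2, 'cái lồn')]
```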
EmotionRecognition — Emotion Recognition
```python
from visonorm import EmotionRecognition

# View available models
print("Available models:", EmotionRecognition.list_model_names())

# Initialize detector
emotion_detector = EmotionRecognition("phobert-v2")
# Or: EmotionRecognition("phobert-v1"), EmotionRecognition("visobert"), etc.

# Recognize emotion
text = "Tôi rất vui mừng và hạnh phúc!"
emotion = emotion_detector.predict(text)
print(f"Emotion: {emotion}")
# Output: Emotion: Enjoyment
```
AspectSentimentAnalysis — Aspect-based Sentiment Analysis
```python
from visonorm import AspectSentimentAnalysis

# View available domains
print("Available domains:", AspectSentimentAnalysis.list_domains())

# View available models for a specific domain
print("Models for smartphone:", AspectSentimentAnalysis.list_model_names("smartphone"))
print("Models for restaurant:", AspectSentimentAnalysis.list_model_names("restaurant"))
print("Models for hotel:", AspectSentimentAnalysis.list_model_names("hotel"))

# Initialize with the smartphone domain and phobert model
absa = AspectSentimentAnalysis("smartphone", "phobert")
# Or use other models: "phobert-v2", "bartpho", "vit5", "visobert", etc.
# Or other domains
# absa = AspectSentimentAnalysis("restaurant", "phobert-v1")
# absa = AspectSentimentAnalysis("hotel", "phobert-v1")

# Analyze sentiment
text = "Điện thoại có camera rất tốt nhưng pin nhanh hết"
aspects = absa.predict(text, threshold=0.25)
print(f"Aspects: {aspects}")
# Output: [('BATTERY', 'neutral'), ('FEATURES', 'neutral'), ('PERFORMANCE', 'positive'), ...]
```
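A threshold like the one above typically gates per-aspect probabilities: an aspect is reported only if its best sentiment score clears the cutoff. A sketch of that selection logic (the score structure here is hypothetical, not the model's actual output):

```python
def select_aspects(scores, threshold=0.25):
    """scores: {aspect: {sentiment: probability}} -> [(aspect, best_sentiment)]
    Keep each aspect whose highest-probability sentiment clears the threshold."""
    picked = []
    for aspect, sentiments in scores.items():
        label, prob = max(sentiments.items(), key=lambda kv: kv[1])
        if prob >= threshold:
            picked.append((aspect, label))
    return picked

toy_scores = {
    "CAMERA":  {"positive": 0.90, "negative": 0.05},
    "BATTERY": {"positive": 0.10, "negative": 0.80},
    "SCREEN":  {"positive": 0.20, "negative": 0.10},  # below threshold: dropped
}
print(select_aspects(toy_scores, threshold=0.25))
# → [('CAMERA', 'positive'), ('BATTERY', 'negative')]
```

Raising the threshold trades recall for precision: fewer aspects are reported, but with higher confidence.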
6. 🎯 Advanced Usage
Combining Multiple Functions
```python
from visonorm import BasicNormalizer, EmojiHandler, ViSoLexNormalizer

def process_text_advanced(text):
    """Process text with multiple steps"""
    print(f"Original text: {text}")

    # Step 1: Emoji processing
    emoji_handler = EmojiHandler()
    emojis = emoji_handler.detect_emoji(text)
    print(f"Detected emojis: {emojis}")

    # Step 2: Basic normalization
    normalizer = BasicNormalizer()
    normalized = normalizer.basic_normalizer(text, case_folding=True)
    print(f"Basic normalized: {normalized}")

    # Step 3: Lexical normalization with deep learning
    lex_normalizer = ViSoLexNormalizer()
    final_normalized = lex_normalizer.normalize_sentence(text)
    print(f"Lexical normalized: {final_normalized}")

    return {
        'original': text,
        'emojis': emojis,
        'basic_normalized': normalized,
        'lexical_normalized': final_normalized
    }

# Test
result = process_text_advanced("Hôm nay tôi rất😊 VUI 😊😊 và HẠNH PHÚC!")
```
🌐 Resources
HuggingFace Hub
All models and resources are published on HuggingFace Hub:
- Organization: https://huggingface.co/visolex
- Models: View full list at https://huggingface.co/visolex
Available normalization models:
- visolex/visobert-normalizer-mix100 (default)
- visolex/bartpho-normalizer-mix100
GitHub Releases
Datasets are stored as GitHub Releases and automatically downloaded when used:
- Repository: https://github.com/AnhHoang0529/visonorm
- Releases: https://github.com/AnhHoang0529/visonorm/releases
🔬 Examples
See test_toolkit.ipynb for more detailed and comprehensive examples.
📝 Citation
If you use ViSoNorm in your research, please cite:
```bibtex
@misc{visonorm2024,
  title={ViSoNorm: Vietnamese Social Media Lexical Normalization Toolkit},
  author={Ha Dung Nguyen},
  year={2024},
  url={https://github.com/AnhHoang0529/visonorm},
  note={Available at https://huggingface.co/visolex}
}
```
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
👥 Authors
- Anh Thi-Hoang Nguyen - Maintainer - anhnth@uit.edu.vn
- Ha Dung Nguyen - Maintainer - dungngh@uit.edu.vn
🙏 Acknowledgments
- HuggingFace for hosting models and providing the transformers library
- The Vietnamese NLP community for datasets and feedback
📞 Contact & Support
- GitHub Issues: https://github.com/AnhHoang0529/visonorm/issues
- Email: anhnth@uit.edu.vn
- HuggingFace: https://huggingface.co/visolex
🔗 Links
- GitHub Repository: https://github.com/AnhHoang0529/visonorm
- PyPI Package: https://pypi.org/project/visonorm/
- HuggingFace Hub: https://huggingface.co/visolex
- Documentation: https://github.com/AnhHoang0529/visonorm