malaysian-manglish-nlp
Full NLP toolkit for Malaysian Manglish — 51 modules, zero dependencies for core.
Built for real-world Malaysian text: social media, news, chat messages, code-switched Malay-English content.
Installation
pip install malaysian-manglish-nlp
Extras
pip install malaysian-manglish-nlp[transformers] # HuggingFace transformer models
pip install malaysian-manglish-nlp[embeddings] # Word2Vec/FastText embeddings
pip install malaysian-manglish-nlp[spacy] # spaCy integration
pip install malaysian-manglish-nlp[api] # FastAPI REST API
pip install malaysian-manglish-nlp[langchain] # LangChain tools
pip install malaysian-manglish-nlp[all] # Everything
Quick Start
from malaysian_manglish_nlp import sentiment, normalize, ner, detect_language
# Sentiment analysis
result = sentiment("Weh best gila makanan dia!")
print(result)
# {'sentiment': 'positive', 'score': 0.94, 'raw_score': 2.5}
# Text normalization
clean = normalize("xpe la bro, aku ok je")
print(clean) # "tidak apa la bro, aku okay sahaja"
# Named Entity Recognition
entities = ner("Ali pergi Pavilion KL semalam")
print(entities)
# [{'text': 'Ali', 'type': 'PERSON', 'start': 0, 'end': 3},
# {'text': 'Pavilion KL', 'type': 'LOCATION', 'start': 10, 'end': 21}]
# Language detection
lang = detect_language("Eh jom makan, I'm hungry gila")
print(lang)
# {'language': 'manglish', 'confidence': 0.87}
HuggingFace
| Resource | Link and description |
|---|---|
| 🤗 Model | vexccz/manglish-nlp-sentiment — XLM-RoBERTa multi-task (sentiment 95.0%, emotion 90.3%, intent 97.5%) |
| 🤗 Dataset | vexccz/manglish-nlp-dataset — 14,384 labeled Manglish examples |
| 🤗 Demo | vexccz/manglish-nlp-demo — Gradio interactive demo (7 tabs) |
Features
Text Processing
- normalize — Expand shortforms (638+ mappings: nk→nak, mcm→macam, sbb→sebab)
- clean — Remove URLs, mentions, repeated chars, HTML
- formalize — Convert informal to formal Malay (aku→saya, ko→anda)
- tokenize — Malaysian-aware tokenizer (handles URLs, hashtags, emoticons)
- stemmer — Malay stemmer with nasal assimilation (250+ roots)
- segment — Sentence segmentation for code-switched text
- spelling — Spell checking with Malaysian dictionary
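To make the shortform expansion idea concrete, here is a minimal, self-contained sketch. It is not the package's `normalize` implementation: the function name `expand_shortforms` and the three-entry table are illustrative assumptions, with the mappings taken from the examples listed above.

```python
import re

# Illustrative subset only: the package advertises 638+ mappings,
# these three are the examples named in the feature list.
SHORTFORMS = {"nk": "nak", "mcm": "macam", "sbb": "sebab"}

# Split into words, punctuation runs, and whitespace runs so that
# joining the pieces reproduces the original spacing exactly.
_TOKEN = re.compile(r"\w+|[^\w\s]+|\s+")


def expand_shortforms(text, mapping=SHORTFORMS):
    """Replace whole-word shortforms; leave everything else untouched."""
    parts = _TOKEN.findall(text)
    return "".join(mapping.get(part.lower(), part) for part in parts)


print(expand_shortforms("aku nk makan sbb lapar"))
# aku nak makan sebab lapar
```

Matching whole tokens rather than substrings avoids rewriting letters inside longer words, which is one plausible reason to tokenize before lookup.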
Analysis
- sentiment — Sentiment analysis, including aspect-based sentiment (food, service, price, etc.)
- emotion — 8 emotion categories (happy, sad, angry, fear, surprise, disgust, love, neutral)
- sarcasm — Sarcasm detection for Malaysian text
- hate_speech — Hate speech detection (6 categories, severity levels)
- intent — 8 intent types (question, request, complaint, greeting, opinion, statement, command, offer)
- topic — Topic classification into 12 categories (food, politics, sports, tech, education, etc.)
- stance — Stance detection (support/oppose/neutral)
- profanity — Profanity detection with leetspeak evasion handling
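Leetspeak evasion handling can be pictured as undoing character substitutions before a blocklist lookup. The sketch below assumes a small substitution table and a one-word blocklist, both hypothetical; the package's actual rules and word lists are not shown in this README.

```python
# Hypothetical substitution table and blocklist, for illustration only.
LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)
BLOCKLIST = {"bodoh"}


def deobfuscate(token):
    """Lowercase a token and map common leetspeak characters back to letters."""
    return token.lower().translate(LEET)


def flag_profanity(text, blocklist=BLOCKLIST):
    """Return the original tokens whose de-obfuscated form is blocklisted."""
    flagged = []
    for token in text.split():
        if deobfuscate(token).strip(".,!?") in blocklist:
            flagged.append(token)
    return flagged


print(flag_profanity("kau ni b0d0h la!"))
# ['b0d0h']
```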
Entity & Structure
- ner — Named Entity Recognition (11 types: PERSON, ORG, LOC, PRODUCT, EVENT, MONEY, PHONE, EMAIL, DATE, TIME, PERCENT)
- pos_tag — Part-of-speech tagging (15 tags)
- dependency — Dependency parsing (SVO extraction)
- coreference — Pronoun resolution with Malaysian gender heuristics
- keywords — Keyword extraction (frequency, RAKE, TF-IDF, TextRank)
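Several of the entity types listed above (EMAIL, MONEY, PERCENT) are pattern-like and can be approximated with regular expressions. The following is a rough sketch under that assumption, not the package's `ner` code. The patterns and the function name `pattern_entities` are hypothetical; the output dictionaries mirror the `text`/`type`/`start`/`end` shape shown in Quick Start.

```python
import re

# Hypothetical, deliberately simple patterns; real-world text needs stricter ones.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "MONEY": re.compile(r"RM\s?\d+(?:\.\d{2})?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
}


def pattern_entities(text):
    """Return regex matches as entity dicts, ordered by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(
                {
                    "text": match.group(),
                    "type": label,
                    "start": match.start(),
                    "end": match.end(),
                }
            )
    return sorted(found, key=lambda entity: entity["start"])


for entity in pattern_entities("Diskaun 20% untuk RM50, email ali@example.com"):
    print(entity)
```

Names of people, organisations, and places do not follow fixed patterns, so they would need gazetteers or a trained model rather than this approach.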
Language Detection & Code-Switching
- language — Language identification (Malay/English/Manglish/Mixed)
- code_switching — Code-switching point detection, switch ratio, segmentation by language
- dialect — Standard Malay plus 6 regional dialects (Kelantan, Terengganu, N9, Kedah, Sarawak, Sabah), with normalization
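As a rough illustration of switch points and switch ratio, the sketch below tags tokens against two tiny word lists and counts language changes between consecutive known tokens. The lexicons, the function names, and the ratio definition (switches divided by adjacent known-token pairs) are all assumptions made for this example; the `code_switching` module may define these differently.

```python
# Tiny hypothetical lexicons; a real detector needs far larger lists or a model.
MALAY = {"jom", "makan", "gila", "aku", "nak", "pergi"}
ENGLISH = {"i", "am", "hungry", "the", "is", "go"}


def tag_tokens(tokens):
    """Label each token as 'ms', 'en', or 'unk' by lexicon lookup."""
    tags = []
    for token in tokens:
        word = token.lower()
        if word in MALAY:
            tags.append("ms")
        elif word in ENGLISH:
            tags.append("en")
        else:
            tags.append("unk")
    return tags


def switch_points(tags):
    """Indices where the language differs from the previous known token."""
    known = [(i, tag) for i, tag in enumerate(tags) if tag != "unk"]
    return [i for (_, prev), (i, cur) in zip(known, known[1:]) if prev != cur]


def switch_ratio(tags):
    """Switches per adjacent pair of known tokens (assumed definition)."""
    known = [tag for tag in tags if tag != "unk"]
    if len(known) < 2:
        return 0.0
    return len(switch_points(tags)) / (len(known) - 1)


tags = tag_tokens("jom makan i am hungry gila".split())
print(tags, switch_points(tags), switch_ratio(tags))
```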
Semantic & Similarity
- similarity — Text similarity (Jaccard, cosine, overlap, semantic)
- embeddings — Word2Vec/FastText trained on Malaysian social media (518 vocab, 100d)
- augmentation — Text augmentation (synonym replacement, shortform variation)
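The lexical metrics named above have standard definitions, sketched here on whitespace tokens. These are generic textbook formulas, not the package's `similarity` functions, and the function names are chosen for this example only.

```python
import math
from collections import Counter


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the sets of lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|), the overlap coefficient."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def cosine(a, b):
    """Cosine of the angle between token-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


x, y = "makanan sini sedap", "makanan sana sedap"
print(jaccard(x, y), overlap(x, y), cosine(x, y))
```

For this pair, two of four distinct tokens are shared, so Jaccard is 0.5, while overlap and cosine both come to about 0.667.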
Generation & Understanding
- translation — Rule-based BM↔EN translation (1000+ word pairs, phrase translation)
- summarization — Extractive summarization using TextRank algorithm
- text_generation — N-gram based text generation and autocomplete
- qa — Extractive question answering with TF-IDF retrieval
- discourse — Argument mining and fallacy detection
- ocr_normalize — OCR text correction for Malaysian documents
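N-gram based autocomplete can be sketched with a bigram count table: for each word, count which words follow it, then suggest the most frequent followers. The toy corpus and the names `train_bigrams` and `autocomplete` are assumptions for illustration, not the `text_generation` API.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, how often each next word follows it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def autocomplete(model, word, k=2):
    """Suggest up to k most frequent followers of `word`."""
    followers = model.get(word.lower(), Counter())
    return [candidate for candidate, _ in followers.most_common(k)]


corpus = ["jom makan nasi", "jom makan roti", "jom makan nasi lemak", "jom tidur"]
model = train_bigrams(corpus)
print(autocomplete(model, "jom"))    # ['makan', 'tidur']
print(autocomplete(model, "makan"))  # ['nasi', 'roti']
```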
Preprocessing & Utilities
- normalizer — Advanced normalization (money, dates, times, elongated text)
- dictionary — Malay-English dictionary lookup
- similarity — Multiple similarity metrics
- pipeline — Chain multiple modules together
- calibration — Confidence scoring for predictions
- hybrid_ml — Feature extraction + logistic classifier
- evaluate — Model evaluation and regression tracking
- cache — LRU caching for performance
- profiler — Performance benchmarking tools
- tuning — Hyperparameter tuning and threshold optimization
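Chaining modules amounts to function composition: each step takes text and returns text. The sketch below shows that idea with three stand-in steps. `make_pipeline` and the step functions are hypothetical and do not reflect the `pipeline` module's actual interface.

```python
import re
from functools import reduce


def make_pipeline(*steps):
    """Compose text -> text functions, applied left to right."""

    def run(text):
        return reduce(lambda value, step: step(value), steps, text)

    return run


# Stand-in steps for illustration; the package's own modules would slot in here.
def strip_urls(text):
    return re.sub(r"https?://\S+", "", text)


def strip_mentions(text):
    return re.sub(r"@\w+", "", text)


def collapse_spaces(text):
    return " ".join(text.split())


clean_text = make_pipeline(strip_urls, strip_mentions, collapse_spaces)
print(clean_text("@ali tengok ni https://example.com  BEST gila"))
# tengok ni BEST gila
```

Order matters in a chain like this: whitespace is collapsed last so that the gaps left by removed URLs and mentions are cleaned up too.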
Integration
- spacy_integration — Custom spaCy Language class and pipeline components
- rest_api — FastAPI REST API with rate limiting and CORS
- langchain_tool — LangChain tool wrappers
- CLI — Command-line interface with subcommands
Fine-Tuned Model
An XLM-RoBERTa multi-task model fine-tuned on 14,384 labeled Manglish examples for sentiment, emotion, and intent classification.
from malaysian_manglish_nlp.transformers.manglish_model import load_model, predict
model = load_model() # Auto-downloads from HuggingFace on first use
result = predict("gila best servis ni")
# {'sentiment': {'label': 'positive', 'confidence': 0.96},
# 'emotion': {'label': 'happy', 'confidence': 0.85},
# 'intent': {'label': 'opinion', 'confidence': 1.00}}
| Task | Accuracy | Classes |
|---|---|---|
| Sentiment | 95.0% | positive, negative, neutral |
| Emotion | 90.3% | happy, sad, angry, fear, surprise, disgust, love, neutral |
| Intent | 97.5% | question, statement, request, complaint, greeting, opinion, command, offer |
| Average | 94.3% | — |
Model: vexccz/manglish-nlp-sentiment on HuggingFace. Requires pip install malaysian-manglish-nlp[transformers].
Performance
- 23,000+ texts/sec sentiment analysis throughput (rule-based)
- <0.5s import time for core modules
- Zero dependencies for core text processing
- LRU caching on heavy operations (stemmer, normalize, sentiment, language detection)
- Lazy loading — only imports what you use
- Pre-compiled regex patterns across 6 modules
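Two of the techniques above, LRU caching and pre-compiled regex patterns, can be shown with the standard library alone. This is a generic sketch rather than the package's code: the `squeeze` function and the cache size are assumptions for illustration.

```python
import re
from functools import lru_cache

# Compiled once at import time instead of on every call.
_ELONGATED = re.compile(r"([a-z])\1{2,}", re.IGNORECASE)


@lru_cache(maxsize=4096)
def squeeze(token):
    """Collapse runs of 3+ identical letters to one; results are memoized."""
    return _ELONGATED.sub(r"\1", token)


tokens = "bestttt gila bestttt weh bestttt".split()
cleaned = [squeeze(token) for token in tokens]
info = squeeze.cache_info()
print(cleaned)
print(f"hits={info.hits} misses={info.misses}")
```

Informal text repeats the same tokens often, so a per-token cache like this tends to pay off; here three distinct tokens produce three misses and the two repeats are served from the cache.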
Comparison with Malaya
| Feature | malaysian-manglish-nlp | Malaya |
|---|---|---|
| Core dependencies | None | TensorFlow/PyTorch required |
| Import time | <0.5s | 10-30s |
| Manglish-first | Built for informal MY text | Formal BM focus |
| Modules | 51 | ~40 |
| Throughput | 23k+ texts/sec | Varies (GPU recommended) |
| Python support | 3.8-3.12 | 3.8+ |
| Aspect sentiment | ✅ | ❌ |
| Code-switching detection | ✅ | ❌ |
| Hate speech detection | ✅ | Limited |
| Discourse analysis | ✅ | ❌ |
| OCR normalization | ✅ | ❌ |
| Translation (rule-based) | ✅ | ❌ |
| Text generation | ✅ | ❌ |
Both are solid choices. Malaya excels at formal Bahasa Melayu with deep learning models. malaysian-manglish-nlp is optimized for informal, code-switched Malaysian text, with minimal overhead and extras such as aspect sentiment, code-switching detection, and discourse analysis.
CLI Usage
# Full analysis
manglish analyze "Weh best gila makanan dia!"
# Sentiment
manglish sentiment "Teruk la service kat sini"
# Normalize shortforms
manglish normalize "xpe la bro aku otw"
# Translate
manglish translate "Aku nak pergi makan" --to en
# NER
manglish ner "Ahmad kerja kat Google Malaysia"
# Summarize file
manglish summarize --file article.txt
# Run benchmarks
manglish benchmark
# Profile performance
manglish profile "Sample text here"
REST API
# Start API server
uvicorn malaysian_manglish_nlp.rest_api:app --port 8000
# Or with Docker
docker-compose up -d
Endpoints:
- POST /analyze — Full analysis
- POST /sentiment — Sentiment only
- POST /normalize — Normalize text
- POST /translate — Translate text
- POST /ner — Named entities
- POST /pos — POS tags
- POST /summarize — Summarize text
- POST /batch — Batch process multiple texts
- GET /health — Health check
- GET /modules — List available modules
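A client call against one of these endpoints might look like the sketch below, using only the standard library. The request body schema is not documented here, so the JSON field name `text` is an assumption, as is the helper name `build_post`. The port matches the uvicorn command above.

```python
import json
from urllib import request


def build_post(endpoint, text, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON POST request for an endpoint."""
    # Assumed schema: a single "text" field. Check the API docs for the real one.
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_post("/sentiment", "Teruk la service kat sini")
print(req.get_method(), req.full_url)

# With the server running, the request could be sent like this:
# with request.urlopen(req) as response:
#     print(json.load(response))
```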
Testing
# Run all tests (900+ tests)
python -m pytest tests/ -q
# Run specific test file
python -m pytest tests/test_sentiment.py -v
# Run with coverage
python -m pytest tests/ --cov=malaysian_manglish_nlp
# Run heavy tests (requires gensim)
RUN_HEAVY_TESTS=1 python -m pytest tests/test_word_embeddings.py -v
Documentation
Full documentation is available at manglish-nlp.readthedocs.io.
It includes:
- Module reference for all 51 modules
- API documentation with examples
- Performance benchmarks
- Comparison with Malaya
- Contributing guide
- Changelog
Contributing
Contributions welcome! Areas where help is needed:
- More training data — Manglish text samples from social media
- Dialect support — More regional variants and normalization rules
- Benchmarks — Comparative benchmarks on Malaysian NLP datasets
- Documentation — More usage examples and tutorials
git clone https://github.com/ZafranYusof/malaysia-manglish-nlp.git
cd malaysia-manglish-nlp
pip install -e ".[all]"
python -m pytest tests/ -q
License
MIT — see LICENSE for details.
Citation
If you use malaysian-manglish-nlp in your research, please cite:
@software{malaysian_manglish_nlp,
author = {Zafran},
title = {malaysian-manglish-nlp: Full NLP toolkit for Malaysian Manglish},
year = {2026},
url = {https://github.com/ZafranYusof/malaysia-manglish-nlp},
version = {3.2.0}
}