malaysian-manglish-nlp

Full NLP toolkit for Malaysian Manglish — 51 modules, zero dependencies for core.

Built for real-world Malaysian text: social media, news, chat messages, code-switched Malay-English content.

Installation

pip install malaysian-manglish-nlp

Extras

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "malaysian-manglish-nlp[transformers]"   # HuggingFace transformer models
pip install "malaysian-manglish-nlp[embeddings]"     # Word2Vec/FastText embeddings
pip install "malaysian-manglish-nlp[spacy]"          # spaCy integration
pip install "malaysian-manglish-nlp[api]"            # FastAPI REST API
pip install "malaysian-manglish-nlp[langchain]"      # LangChain tools
pip install "malaysian-manglish-nlp[all]"            # Everything

Quick Start

from malaysian_manglish_nlp import sentiment, normalize, ner, detect_language

# Sentiment analysis
result = sentiment("Weh best gila makanan dia!")
print(result)
# {'sentiment': 'positive', 'score': 0.94, 'raw_score': 2.5}

# Text normalization
clean = normalize("xpe la bro, aku ok je")
print(clean)  # "tidak apa la bro, aku okay sahaja"

# Named Entity Recognition
entities = ner("Ali pergi Pavilion KL semalam")
print(entities)
# [{'text': 'Ali', 'type': 'PERSON', 'start': 0, 'end': 3},
#  {'text': 'Pavilion KL', 'type': 'LOCATION', 'start': 10, 'end': 21}]

# Language detection
lang = detect_language("Eh jom makan, I'm hungry gila")
print(lang)
# {'language': 'manglish', 'confidence': 0.87}

HuggingFace

  • 🤗 Model: vexccz/manglish-nlp-sentiment, an XLM-RoBERTa multi-task model (sentiment 95.0%, emotion 90.3%, intent 97.5%)
  • 🤗 Dataset: vexccz/manglish-nlp-dataset, 14,384 labeled Manglish examples
  • 🤗 Demo: vexccz/manglish-nlp-demo, a Gradio interactive demo (7 tabs)

Features

Text Processing

  • normalize — Expand shortforms (638+ mappings: nk→nak, mcm→macam, sbb→sebab)
  • clean — Remove URLs, mentions, repeated chars, HTML
  • formalize — Convert informal to formal Malay (aku→saya, ko→anda)
  • tokenize — Malaysian-aware tokenizer (handles URLs, hashtags, emoticons)
  • stemmer — Malay stemmer with nasal assimilation (250+ roots)
  • segment — Sentence segmentation for code-switched text
  • spelling — Spell checking with Malaysian dictionary
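
The shortform expansion described for normalize can be pictured as a token-level dictionary lookup. The sketch below is not the library's implementation: it uses only the three mappings quoted in the list above, and SHORTFORMS and expand_shortforms are hypothetical names chosen for illustration.

```python
import re

# Illustrative subset: only the three mappings quoted in the feature list.
SHORTFORMS = {"nk": "nak", "mcm": "macam", "sbb": "sebab"}

# Compiled once; matches runs of letters (candidate words).
_WORD = re.compile(r"[A-Za-z]+")


def expand_shortforms(text: str) -> str:
    """Replace each known shortform with its full form; leave other text untouched."""
    def _swap(match):
        word = match.group(0)
        # Lookup is case-insensitive; unknown words are returned unchanged.
        return SHORTFORMS.get(word.lower(), word)
    return _WORD.sub(_swap, text)


print(expand_shortforms("aku nk balik sbb penat"))
```

A real mapping table would also need to handle ambiguous shortforms, which a plain lookup like this cannot resolve.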

Analysis

  • sentiment — Sentiment analysis, including aspect-based sentiment (food, service, price, etc.)
  • emotion — 8 emotion categories (happy, sad, angry, fear, surprise, disgust, love, neutral)
  • sarcasm — Sarcasm detection for Malaysian text
  • hate_speech — Hate speech detection (6 categories, severity levels)
  • intent — 8 intent types (question, request, complaint, greeting, opinion, statement, command, offer)
  • topic — Topic classification across 12 topics (food, politics, sports, tech, education, etc.)
  • stance — Stance detection (support/oppose/neutral)
  • profanity — Profanity detection with leetspeak evasion handling
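
To illustrate what "leetspeak evasion handling" can mean in practice, here is a minimal sketch that undoes common character substitutions before checking a blocklist. It is an assumption about the general technique, not the profanity module's actual logic; LEET_MAP, BLOCKLIST, deobfuscate and flag_tokens are hypothetical names, and the single blocklist entry is a placeholder.

```python
# Common digit/symbol substitutions used to disguise words (illustrative, not exhaustive).
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Placeholder blocklist; a real list would be far larger.
BLOCKLIST = {"bodoh"}


def deobfuscate(token: str) -> str:
    """Lower-case a token and undo simple leetspeak substitutions."""
    return token.lower().translate(LEET_MAP)


def flag_tokens(text: str) -> list:
    """Return the original tokens whose de-obfuscated form is on the blocklist."""
    return [
        tok for tok in text.split()
        if deobfuscate(tok.strip(".,!?")) in BLOCKLIST
    ]


print(flag_tokens("B0d0h betul la"))
```

Mapping digits back to letters is only applied for the comparison, so numbers in the original text are left as they were.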

Entity & Structure

  • ner — Named Entity Recognition (11 types: PERSON, ORG, LOC, PRODUCT, EVENT, MONEY, PHONE, EMAIL, DATE, TIME, PERCENT)
  • pos_tag — Part-of-speech tagging (15 tags)
  • dependency — Dependency parsing (SVO extraction)
  • coreference — Pronoun resolution with Malaysian gender heuristics
  • keywords — Keyword extraction (frequency, RAKE, TF-IDF, TextRank)

Language Detection & Code-Switching

  • language — Language identification (Malay/English/Manglish/Mixed)
  • code_switching — Code-switching point detection, switch ratio, segmentation by language
  • dialect — Malay dialect identification (Standard, Kelantan, Terengganu, N9, Kedah, Sarawak, Sabah) with normalization
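
One simple way to think about code-switching points and a switch ratio is to tag each token with a language and count how often adjacent tags differ. The sketch below does this with two tiny hand-made lexicons. It is an assumption about the general approach, not the code_switching module's algorithm, and MALAY, ENGLISH, tag_tokens and switch_ratio are hypothetical names.

```python
# Tiny illustrative lexicons; real language tagging needs far larger resources.
MALAY = {"jom", "makan", "gila", "aku", "nak", "pergi"}
ENGLISH = {"i'm", "hungry", "the", "is", "food", "very"}


def tag_tokens(text: str) -> list:
    """Label each whitespace token as 'ms', 'en', or 'unk' by lexicon lookup."""
    tags = []
    for tok in text.lower().split():
        word = tok.strip(".,!?")
        if word in MALAY:
            tags.append("ms")
        elif word in ENGLISH:
            tags.append("en")
        else:
            tags.append("unk")
    return tags


def switch_ratio(tags: list) -> float:
    """Fraction of adjacent known-language pairs whose language differs."""
    known = [t for t in tags if t != "unk"]
    if len(known) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(known, known[1:]) if a != b)
    return switches / (len(known) - 1)


tags = tag_tokens("Eh jom makan, I'm hungry gila")
print(tags, switch_ratio(tags))
```

Unknown tokens are skipped when counting switches, so particles that belong to neither lexicon do not inflate the ratio.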

Semantic & Similarity

  • similarity — Text similarity (Jaccard, cosine, overlap, semantic)
  • embeddings — Word2Vec/FastText trained on Malaysian social media (518 vocab, 100d)
  • augmentation — Text augmentation (synonym replacement, shortform variation)
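
Two of the metrics named for similarity, Jaccard and cosine, have short standard definitions over word tokens. The sketch below implements those textbook definitions with whitespace tokenization; it is not taken from the package, and the function names are hypothetical.

```python
import math
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Size of the shared token set divided by the size of the combined token set."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def cosine(a: str, b: str) -> float:
    """Cosine of the angle between bag-of-words count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


print(jaccard("makanan sedap gila", "makanan best gila"))
print(cosine("makanan sedap gila", "makanan best gila"))
```

The two texts share two of four distinct words, so Jaccard gives 2/4, while cosine gives 2/3 because each text has three words of equal weight.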

Generation & Understanding

  • translation — Rule-based BM↔EN translation (1000+ word pairs, phrase translation)
  • summarization — Extractive summarization using TextRank algorithm
  • text_generation — N-gram based text generation and autocomplete
  • qa — Extractive question answering with TF-IDF retrieval
  • discourse — Argument mining and fallacy detection
  • ocr_normalize — OCR text correction for Malaysian documents
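
N-gram based autocomplete, as mentioned for text_generation, can be reduced to counting which word follows which. The sketch below trains a bigram table on a made-up toy corpus and suggests the most frequent continuations. It illustrates the general technique only; train_bigrams and suggest are hypothetical names, not the module's API.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences: list) -> dict:
    """Count, for every word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            following[current][nxt] += 1
    return following


def suggest(model: dict, word: str, k: int = 2) -> list:
    """Return up to k of the most frequent next words after `word`."""
    return [w for w, _ in model.get(word.lower(), Counter()).most_common(k)]


# Made-up toy corpus, purely for illustration.
corpus = ["jom makan nasi", "jom makan mee", "jom makan nasi lemak", "jom pergi kedai"]
model = train_bigrams(corpus)
print(suggest(model, "makan"))
print(suggest(model, "jom", k=1))
```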

Preprocessing & Utilities

  • normalizer — Advanced normalization (money, dates, times, elongated text)
  • dictionary — Malay-English dictionary lookup
  • similarity — Multiple similarity metrics
  • pipeline — Chain multiple modules together
  • calibration — Confidence scoring for predictions
  • hybrid_ml — Feature extraction + logistic classifier
  • evaluate — Model evaluation and regression tracking
  • cache — LRU caching for performance
  • profiler — Performance benchmarking tools
  • tuning — Hyperparameter tuning and threshold optimization
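
Chaining modules, as pipeline is described as doing, amounts to function composition over text. The sketch below composes three small stand-in cleaning steps. It does not use the package's pipeline API, whose signature is not shown here; make_pipeline and the step functions are hypothetical.

```python
import re
from typing import Callable, List


def strip_urls(text: str) -> str:
    """Remove http(s) URLs."""
    return re.sub(r"https?://\S+", "", text)


def strip_mentions(text: str) -> str:
    """Remove @username mentions."""
    return re.sub(r"@\w+", "", text)


def collapse_spaces(text: str) -> str:
    """Reduce runs of whitespace to single spaces and trim the ends."""
    return " ".join(text.split())


def make_pipeline(steps: List[Callable[[str], str]]) -> Callable[[str], str]:
    """Return a function that applies each step in order."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run


cleaner = make_pipeline([strip_urls, strip_mentions, collapse_spaces])
print(cleaner("@ali tengok https://example.com   best gila"))
```

Order matters here: whitespace is collapsed last so that gaps left by removed URLs and mentions are cleaned up.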

Integration

  • spacy_integration — Custom spaCy Language class and pipeline components
  • rest_api — FastAPI REST API with rate limiting and CORS
  • langchain_tool — LangChain tool wrappers
  • CLI — Command-line interface with subcommands

Fine-Tuned Model

An XLM-RoBERTa multi-task model fine-tuned on 14,384 labeled Manglish examples for sentiment, emotion, and intent classification.

from malaysian_manglish_nlp.transformers.manglish_model import load_model, predict

model = load_model()  # Auto-downloads from HuggingFace on first use
result = predict("gila best servis ni")
# {'sentiment': {'label': 'positive', 'confidence': 0.96},
#  'emotion':    {'label': 'happy',    'confidence': 0.85},
#  'intent':     {'label': 'opinion',  'confidence': 1.00}}

Task | Accuracy | Classes
--- | --- | ---
Sentiment | 95.0% | positive, negative, neutral
Emotion | 90.3% | happy, sad, angry, fear, surprise, disgust, love, neutral
Intent | 97.5% | question, statement, request, complaint, greeting, opinion, command, offer
Average | 94.3% |

Model: vexccz/manglish-nlp-sentiment on HuggingFace. Requires pip install malaysian-manglish-nlp[transformers].

Performance

  • 23,000+ texts/sec sentiment analysis throughput (rule-based)
  • <0.5s import time for core modules
  • Zero dependencies for core text processing
  • LRU caching on heavy operations (stemmer, normalize, sentiment, language detection)
  • Lazy loading — only imports what you use
  • Pre-compiled regex patterns across 6 modules
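
Two of the techniques listed above, LRU caching and pre-compiled regex patterns, are available in the Python standard library. The sketch below combines them on a small elongated-text helper and inspects the cache counters. It shows the general pattern only; squeeze and _ELONGATED are hypothetical names, and the cache size is an arbitrary choice.

```python
import re
from functools import lru_cache

# Compiled once at import time rather than on every call.
_ELONGATED = re.compile(r"(.)\1{2,}")


@lru_cache(maxsize=4096)
def squeeze(token: str) -> str:
    """Collapse characters repeated three or more times; results are memoised."""
    return _ELONGATED.sub(r"\1", token)


# Repeated tokens are served from the cache after the first call.
for tok in ["gilaaa", "besttt", "gilaaa", "gilaaa"]:
    squeeze(tok)

info = squeeze.cache_info()
print(info.hits, info.misses)
```

Caching pays off on social media text because the same tokens recur often; on mostly unique inputs it only adds memory overhead.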

Comparison with Malaya

Feature | malaysian-manglish-nlp | Malaya
--- | --- | ---
Core dependencies | None | TensorFlow/PyTorch required
Import time | <0.5s | 10-30s
Manglish-first | Built for informal MY text | Formal BM focus
Modules | 51 | ~40
Throughput | 23k+ texts/sec | Varies (GPU recommended)
Python support | 3.8-3.12 | 3.8+
Aspect sentiment | |
Code-switching detection | |
Hate speech detection | | Limited
Discourse analysis | |
OCR normalization | |
Translation (rule-based) | |
Text generation | |

Both are solid choices. Malaya excels at formal Bahasa Melayu with deep learning models. malaysian-manglish-nlp is optimized for informal, code-switched Malaysian text with minimal overhead and advanced NLP features.

CLI Usage

# Full analysis
manglish analyze "Weh best gila makanan dia!"

# Sentiment
manglish sentiment "Teruk la service kat sini"

# Normalize shortforms
manglish normalize "xpe la bro aku otw"

# Translate
manglish translate "Aku nak pergi makan" --to en

# NER
manglish ner "Ahmad kerja kat Google Malaysia"

# Summarize file
manglish summarize --file article.txt

# Run benchmarks
manglish benchmark

# Profile performance
manglish profile "Sample text here"

REST API

# Start API server
uvicorn malaysian_manglish_nlp.rest_api:app --port 8000

# Or with Docker
docker-compose up -d

Endpoints:

  • POST /analyze — Full analysis
  • POST /sentiment — Sentiment only
  • POST /normalize — Normalize text
  • POST /translate — Translate text
  • POST /ner — Named entities
  • POST /pos — POS tags
  • POST /summarize — Summarize text
  • POST /batch — Batch process multiple texts
  • GET /health — Health check
  • GET /modules — List available modules
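
A client for these endpoints can be written with the standard library alone. The sketch below builds a JSON POST request for /sentiment against the server started on port 8000 above. The request body shape ({"text": ...}) is an assumption, since the payload schema is not documented here, and build_post is a hypothetical helper.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # matches the --port 8000 example above


def build_post(endpoint: str, text: str) -> request.Request:
    """Build (but do not send) a JSON POST request for one of the listed endpoints."""
    body = json.dumps({"text": text}).encode("utf-8")  # assumed payload shape
    return request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_post("/sentiment", "Teruk la service kat sini")
print(req.get_method(), req.full_url)

# With the server running, the request could be sent like this:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```

Because the API is built on FastAPI, the generated /docs page of a running server is the place to confirm the real request and response schemas.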

Testing

# Run all tests (900+ tests)
python -m pytest tests/ -q

# Run specific test file
python -m pytest tests/test_sentiment.py -v

# Run with coverage
python -m pytest tests/ --cov=malaysian_manglish_nlp

# Run heavy tests (requires gensim)
RUN_HEAVY_TESTS=1 python -m pytest tests/test_word_embeddings.py -v

Documentation

Full documentation available at manglish-nlp.readthedocs.io

Includes:

  • Module reference for all 51 modules
  • API documentation with examples
  • Performance benchmarks
  • Comparison with Malaya
  • Contributing guide
  • Changelog

Contributing

Contributions welcome! Areas where help is needed:

  1. More training data — Manglish text samples from social media
  2. Dialect support — More regional variants and normalization rules
  3. Benchmarks — Comparative benchmarks on Malaysian NLP datasets
  4. Documentation — More usage examples and tutorials

To set up a development checkout and run the tests:

git clone https://github.com/ZafranYusof/malaysia-manglish-nlp.git
cd malaysia-manglish-nlp
pip install -e ".[all]"
python -m pytest tests/ -q

License

MIT — see LICENSE for details.

Citation

If you use malaysian-manglish-nlp in your research, please cite:

@software{malaysian_manglish_nlp,
  author = {Zafran},
  title = {malaysian-manglish-nlp: Full NLP toolkit for Malaysian Manglish},
  year = {2026},
  url = {https://github.com/ZafranYusof/malaysia-manglish-nlp},
  version = {3.2.0}
}
