malaysian-manglish-nlp

Full NLP toolkit for Malaysian Manglish — 51 modules, zero dependencies for core.

Built for real-world Malaysian text: social media, news, chat messages, code-switched Malay-English content.

Installation

pip install malaysian-manglish-nlp

Extras

# Quote the requirement so shells such as zsh do not expand the square brackets
pip install "malaysian-manglish-nlp[transformers]"   # HuggingFace transformer models
pip install "malaysian-manglish-nlp[embeddings]"     # Word2Vec/FastText embeddings
pip install "malaysian-manglish-nlp[spacy]"          # spaCy integration
pip install "malaysian-manglish-nlp[api]"            # FastAPI REST API
pip install "malaysian-manglish-nlp[langchain]"      # LangChain tools
pip install "malaysian-manglish-nlp[all]"            # Everything

Quick Start

from malaysian_manglish_nlp import sentiment, normalize, ner, detect_language

# Sentiment analysis
result = sentiment("Weh best gila makanan dia!")
print(result)
# {'sentiment': 'positive', 'score': 0.94, 'raw_score': 2.5}

# Text normalization
clean = normalize("xpe la bro, aku ok je")
print(clean)  # "tidak apa la bro, aku okay sahaja"

# Named Entity Recognition
entities = ner("Ali pergi Pavilion KL semalam")
print(entities)
# [{'text': 'Ali', 'type': 'PERSON', 'start': 0, 'end': 3},
#  {'text': 'Pavilion KL', 'type': 'LOCATION', 'start': 10, 'end': 21}]

# Language detection
lang = detect_language("Eh jom makan, I'm hungry gila")
print(lang)
# {'language': 'manglish', 'confidence': 0.87}
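
The results shown above are plain dicts and lists, so they can be post-processed with ordinary Python. The following is a minimal sketch, assuming the `ner` output has exactly the shape printed in the comment above. The sample data is copied from that comment rather than produced by calling the library, and `entities_of_type` and `offsets_match` are hypothetical helpers, not part of the package.

```python
# Sample output copied from the Quick Start comment; not produced by a live call.
text = "Ali pergi Pavilion KL semalam"
entities = [
    {"text": "Ali", "type": "PERSON", "start": 0, "end": 3},
    {"text": "Pavilion KL", "type": "LOCATION", "start": 10, "end": 21},
]


def entities_of_type(items, entity_type):
    """Return the surface strings of all entities with the given type (hypothetical helper)."""
    return [item["text"] for item in items if item["type"] == entity_type]


def offsets_match(source, items):
    """Check that each entity's start/end offsets slice back to its text."""
    return all(source[item["start"]:item["end"]] == item["text"] for item in items)


print(entities_of_type(entities, "LOCATION"))
print(offsets_match(text, entities))
```

The offset check is useful when highlighting entities in a UI, since it confirms that `start` and `end` are character indices with an exclusive end.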

HuggingFace

  • 🤗 Model: vexccz/manglish-nlp-sentiment (DistilBERT multi-task; sentiment 88.5%, emotion 83.6%, intent 94.5%)
  • 🤗 Dataset: vexccz/manglish-nlp-dataset (7,884 labeled Manglish examples)
  • 🤗 Demo: vexccz/manglish-nlp-demo (Gradio interactive demo, 7 tabs)

Features

Text Processing

  • normalize — Expand shortforms (638+ mappings: nk→nak, mcm→macam, sbb→sebab)
  • clean — Remove URLs, mentions, repeated chars, HTML
  • formalize — Convert informal to formal Malay (aku→saya, ko→anda)
  • tokenize — Malaysian-aware tokenizer (handles URLs, hashtags, emoticons)
  • stemmer — Malay stemmer with nasal assimilation (250+ roots)
  • segment — Sentence segmentation for code-switched text
  • spelling — Spell checking with Malaysian dictionary
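
Shortform expansion of the kind `normalize` describes can be pictured as a whole-word dictionary lookup. The sketch below is not the library's implementation. It uses only the three mappings named in the list above, and the `expand_shortforms` name is hypothetical.

```python
import re

# Three of the mappings named in the feature list; the real table is much larger.
SHORTFORMS = {"nk": "nak", "mcm": "macam", "sbb": "sebab"}

# Pre-compiled pattern that matches runs of word characters as candidate tokens.
_WORD = re.compile(r"\w+")


def expand_shortforms(text, mapping=SHORTFORMS):
    """Replace each whole-word shortform with its expansion, leaving other text untouched."""
    def _swap(match):
        word = match.group(0)
        # Look up the lowercased token; unknown words are returned unchanged.
        return mapping.get(word.lower(), word)
    return _WORD.sub(_swap, text)


print(expand_shortforms("aku nk makan sbb lapar"))
```

Matching whole tokens rather than substrings keeps a mapping such as nk→nak from rewriting the middle of a longer word.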

Analysis

  • sentiment — Sentiment analysis, including aspect-based sentiment (food, service, price, etc.)
  • emotion — 8 emotion categories (happy, sad, angry, fear, surprise, disgust, love, neutral)
  • sarcasm — Sarcasm detection for Malaysian text
  • hate_speech — Hate speech detection (6 categories, severity levels)
  • intent — 8 intent types (question, request, complaint, greeting, opinion, statement, command, offer)
  • topic — Topic classification across 12 topics (food, politics, sports, tech, education, etc.)
  • stance — Stance detection (support/oppose/neutral)
  • profanity — Profanity detection with leetspeak evasion handling
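
Leetspeak evasion handling typically starts by undoing common character substitutions before any word-list lookup. The sketch below works under that assumption. The substitution table, the `deleet` and `contains_blocked` names, and the one-word blocked set are illustrative and not taken from the library.

```python
# Common digit/symbol substitutions (an illustrative table, not the library's).
LEET_TABLE = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)


def deleet(text):
    """Lowercase the text and map leetspeak characters back to letters."""
    return text.lower().translate(LEET_TABLE)


def contains_blocked(text, blocked):
    """Return True if any de-leeted whitespace token is in the blocked word set."""
    return any(token in blocked for token in deleet(text).split())


print(deleet("B0d0h"))
print(contains_blocked("kau b0d0h la", {"bodoh"}))
```

A blanket translation also rewrites genuine numbers (for example "10" becomes "io"), so a fuller version would apply it only to tokens that mix letters with digits or symbols.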

Entity & Structure

  • ner — Named Entity Recognition (11 types: PERSON, ORG, LOC, PRODUCT, EVENT, MONEY, PHONE, EMAIL, DATE, TIME, PERCENT)
  • pos_tag — Part-of-speech tagging (15 tags)
  • dependency — Dependency parsing (SVO extraction)
  • coreference — Pronoun resolution with Malaysian gender heuristics
  • keywords — Keyword extraction (frequency, RAKE, TF-IDF, TextRank)
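
Pattern-like entity types such as EMAIL, MONEY, and PERCENT lend themselves to regular expressions. The sketch below is an assumption about how such types could be matched, not a description of the library's `ner` module. The patterns and the `pattern_entities` name are hypothetical, and the output mirrors the dict shape shown in the Quick Start.

```python
import re

# Illustrative patterns only; real-world coverage needs many more cases.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "MONEY": re.compile(r"\bRM\s?\d+(?:\.\d{2})?\b"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
}


def pattern_entities(text):
    """Find pattern-like entities and return them sorted by start offset."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({
                "text": match.group(0),
                "type": entity_type,
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(found, key=lambda item: item["start"])


sample = "Diskaun 20% je, harga RM15.90, email ali@example.com"
for entity in pattern_entities(sample):
    print(entity)
```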

Language Detection & Code-Switching

  • language — Language identification (Malay/English/Manglish/Mixed)
  • code_switching — Code-switching point detection, switch ratio, segmentation by language
  • dialect — Standard Malay plus 6 regional dialects (Kelantan, Terengganu, N9, Kedah, Sarawak, Sabah) with normalization
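
Switch points and a switch ratio can be illustrated with per-token language tags. The sketch below makes several assumptions: the tiny lexicons are made up for the example, unknown tokens are skipped, and the ratio is defined as switches divided by the number of adjacent known-language token pairs. The library may define these quantities differently.

```python
# Tiny illustrative lexicons; a real detector needs far larger vocabularies.
MALAY = {"jom", "makan", "gila", "aku", "nak", "pergi"}
ENGLISH = {"i'm", "hungry", "i", "am", "the", "food"}


def tag_tokens(text):
    """Label each whitespace token as 'ms', 'en', or 'unk' by lexicon lookup."""
    tags = []
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in MALAY:
            tags.append("ms")
        elif word in ENGLISH:
            tags.append("en")
        else:
            tags.append("unk")
    return tags


def switch_points(tags):
    """Return token indices where the language differs from the previous known token."""
    points, previous = [], None
    for index, tag in enumerate(tags):
        if tag == "unk":
            continue
        if previous is not None and tag != previous:
            points.append(index)
        previous = tag
    return points


def switch_ratio(tags):
    """Switches divided by the number of adjacent known-language token pairs."""
    known = [tag for tag in tags if tag != "unk"]
    if len(known) < 2:
        return 0.0
    return len(switch_points(tags)) / (len(known) - 1)


tags = tag_tokens("Eh jom makan, I'm hungry gila")
print(tags, switch_points(tags), switch_ratio(tags))
```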

Semantic & Similarity

  • similarity — Text similarity (Jaccard, cosine, overlap, semantic)
  • embeddings — Word2Vec/FastText trained on Malaysian social media (518 vocab, 100d)
  • augmentation — Text augmentation (synonym replacement, shortform variation)
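
Two of the lexical metrics named above, Jaccard and cosine, have standard definitions that fit in a few lines. The sketch below uses whitespace tokens and raw counts. The function names are illustrative, and the library's tokenization and weighting may differ.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(jaccard("makanan sedap gila", "makanan mahal gila"))
print(cosine("makanan sedap gila", "makanan mahal gila"))
```

For the two sample sentences, two of four distinct tokens are shared, so the Jaccard score is 0.5, while the cosine score is 2/3 because each sentence has three tokens of weight one.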

Generation & Understanding

  • translation — Rule-based BM↔EN translation (1000+ word pairs, phrase translation)
  • summarization — Extractive summarization using TextRank algorithm
  • text_generation — N-gram based text generation and autocomplete
  • qa — Extractive question answering with TF-IDF retrieval
  • discourse — Argument mining and fallacy detection
  • ocr_normalize — OCR text correction for Malaysian documents
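
N-gram autocomplete can be sketched with a bigram count table: record which word follows which, then suggest the most frequent followers. This is a generic illustration with a made-up three-sentence corpus and hypothetical function names, not the `text_generation` module's code.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count which word follows which across the training sentences."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            model[current][following] += 1
    return model


def suggest(model, word, k=3):
    """Return up to k most frequent next words after `word`."""
    followers = model.get(word.lower(), Counter())
    return [token for token, _ in followers.most_common(k)]


corpus = ["jom makan nasi lemak", "jom makan roti canai", "jom pergi pasar"]
model = train_bigrams(corpus)
print(suggest(model, "jom"))
```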

Preprocessing & Utilities

  • normalizer — Advanced normalization (money, dates, times, elongated text)
  • dictionary — Malay-English dictionary lookup
  • similarity — Multiple similarity metrics
  • pipeline — Chain multiple modules together
  • calibration — Confidence scoring for predictions
  • hybrid_ml — Feature extraction + logistic classifier
  • evaluate — Model evaluation and regression tracking
  • cache — LRU caching for performance
  • profiler — Performance benchmarking tools
  • tuning — Hyperparameter tuning and threshold optimization

Integration

  • spacy_integration — Custom spaCy Language class and pipeline components
  • rest_api — FastAPI REST API with rate limiting and CORS
  • langchain_tool — LangChain tool wrappers
  • CLI — Command-line interface with subcommands

Fine-Tuned Model

A DistilBERT multi-task model fine-tuned on 7,884 labeled Manglish examples for sentiment, emotion, and intent classification.

from malaysian_manglish_nlp.transformers.manglish_model import load_model, predict

model = load_model()  # Auto-downloads from HuggingFace on first use
result = predict("gila best servis ni")
# {'sentiment': {'label': 'positive', 'confidence': 0.96},
#  'emotion':    {'label': 'happy',    'confidence': 0.85},
#  'intent':     {'label': 'opinion',  'confidence': 1.00}}

| Task | Accuracy | Classes |
| --- | --- | --- |
| Sentiment | 88.5% | positive, negative, neutral |
| Emotion | 83.6% | happy, sad, angry, fear, surprise, disgust, love, neutral |
| Intent | 94.5% | question, statement, request, complaint, greeting, opinion |
| Average | 88.9% | |

Model: vexccz/manglish-nlp-sentiment on HuggingFace. Requires pip install malaysian-manglish-nlp[transformers].
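
The reported average is consistent with an unweighted mean of the three per-task accuracies. A quick check, assuming that is how the figure was derived:

```python
from statistics import mean

# Per-task accuracies as reported for the fine-tuned model (percent).
accuracy = {"sentiment": 88.5, "emotion": 83.6, "intent": 94.5}

# Unweighted mean across tasks, rounded to one decimal place.
average = round(mean(accuracy.values()), 1)
print(average)
```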

Performance

  • 23,000+ texts/sec sentiment analysis throughput (rule-based)
  • <0.5s import time for core modules
  • Zero dependencies for core text processing
  • LRU caching on heavy operations (stemmer, normalize, sentiment, language detection)
  • Lazy loading — only imports what you use
  • Pre-compiled regex patterns across 6 modules
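
Two of the techniques listed above, LRU caching and pre-compiled regex patterns, can be combined using only the standard library. The sketch below is generic. The `squeeze_repeats` function and its pattern are illustrative and are not the library's own code.

```python
import re
from functools import lru_cache

# Compile once at import time instead of on every call.
_REPEATED = re.compile(r"(.)\1{2,}")


@lru_cache(maxsize=4096)
def squeeze_repeats(text):
    """Collapse runs of three or more identical characters to a single character."""
    return _REPEATED.sub(r"\1", text)


print(squeeze_repeats("bestttt gilaaaa"))
print(squeeze_repeats("bestttt gilaaaa"))  # second call is served from the cache
print(squeeze_repeats.cache_info())
```

Caching pays off on social media text because the same short messages and tokens recur often. `cache_info()` reports hits and misses, so the benefit can be measured instead of assumed.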

Comparison with Malaya

| Feature | malaysian-manglish-nlp | Malaya |
| --- | --- | --- |
| Core dependencies | None | TensorFlow/PyTorch required |
| Import time | <0.5s | 10-30s |
| Manglish-first | Built for informal MY text | Formal BM focus |
| Modules | 51 | ~40 |
| Throughput | 23k+ texts/sec | Varies (GPU recommended) |
| Python support | 3.8-3.12 | 3.8+ |
| Aspect sentiment | | |
| Code-switching detection | | |
| Hate speech detection | | Limited |
| Discourse analysis | | |
| OCR normalization | | |
| Translation (rule-based) | | |
| Text generation | | |

Both are solid choices. Malaya excels at formal Bahasa Melayu with deep learning models. malaysian-manglish-nlp is optimized for informal, code-switched Malaysian text with minimal overhead and advanced NLP features.

CLI Usage

# Full analysis
manglish analyze "Weh best gila makanan dia!"

# Sentiment
manglish sentiment "Teruk la service kat sini"

# Normalize shortforms
manglish normalize "xpe la bro aku otw"

# Translate
manglish translate "Aku nak pergi makan" --to en

# NER
manglish ner "Ahmad kerja kat Google Malaysia"

# Summarize file
manglish summarize --file article.txt

# Run benchmarks
manglish benchmark

# Profile performance
manglish profile "Sample text here"

REST API

# Start API server
uvicorn malaysian_manglish_nlp.rest_api:app --port 8000

# Or with Docker
docker-compose up -d

Endpoints:

  • POST /analyze — Full analysis
  • POST /sentiment — Sentiment only
  • POST /normalize — Normalize text
  • POST /translate — Translate text
  • POST /ner — Named entities
  • POST /pos — POS tags
  • POST /summarize — Summarize text
  • POST /batch — Batch process multiple texts
  • GET /health — Health check
  • GET /modules — List available modules
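
A client for these endpoints needs nothing beyond the standard library. The sketch below assumes the server runs on localhost:8000 as in the uvicorn command above, and that POST endpoints accept a JSON body of the form `{"text": ...}`. That request schema is an assumption, not something stated here, so check the API documentation before relying on it. The `build_request` and `call` helpers are hypothetical.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # matches the --port 8000 example


def build_request(endpoint, text):
    """Build a JSON POST request; the {"text": ...} body is an assumed schema."""
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(endpoint, text, timeout=10):
    """Send the request and decode the JSON response (requires a running server)."""
    with request.urlopen(build_request(endpoint, text), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network, so this runs without a server.
req = build_request("/sentiment", "Teruk la service kat sini")
print(req.get_method(), req.full_url)
```

With the server running, `call("/sentiment", "Teruk la service kat sini")` would send the request and return the decoded response.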

Testing

# Run all tests (900+ tests)
python -m pytest tests/ -q

# Run specific test file
python -m pytest tests/test_sentiment.py -v

# Run with coverage
python -m pytest tests/ --cov=malaysian_manglish_nlp

# Run heavy tests (requires gensim)
RUN_HEAVY_TESTS=1 python -m pytest tests/test_word_embeddings.py -v

Documentation

Full documentation is available at manglish-nlp.readthedocs.io.

Includes:

  • Module reference for all 51 modules
  • API documentation with examples
  • Performance benchmarks
  • Comparison with Malaya
  • Contributing guide
  • Changelog

Contributing

Contributions welcome! Areas where help is needed:

  1. More training data — Manglish text samples from social media
  2. Dialect support — More regional variants and normalization rules
  3. Benchmarks — Comparative benchmarks on Malaysian NLP datasets
  4. Documentation — More usage examples and tutorials

git clone https://github.com/ZafranYusof/malaysia-manglish-nlp.git
# git clone names the directory after the repository (malaysia-manglish-nlp)
cd malaysia-manglish-nlp
pip install -e ".[all]"
python -m pytest tests/ -q

License

MIT — see LICENSE for details.

Citation

If you use malaysian-manglish-nlp in your research, please cite:

@software{malaysian_manglish_nlp,
  author = {Zafran},
  title = {malaysian-manglish-nlp: Full NLP toolkit for Malaysian Manglish},
  year = {2026},
  url = {https://github.com/ZafranYusof/malaysia-manglish-nlp},
  version = {3.1.0}
}
