malaysian-manglish-nlp

Full NLP toolkit for Malaysian Manglish — 51 modules, zero dependencies for core.

Built for real-world Malaysian text: social media, news, chat messages, and code-switched Malay-English content.

Installation

pip install malaysian-manglish-nlp

Extras

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "malaysian-manglish-nlp[transformers]"   # HuggingFace transformer models
pip install "malaysian-manglish-nlp[embeddings]"     # Word2Vec/FastText embeddings
pip install "malaysian-manglish-nlp[spacy]"          # spaCy integration
pip install "malaysian-manglish-nlp[api]"            # FastAPI REST API
pip install "malaysian-manglish-nlp[langchain]"      # LangChain tools
pip install "malaysian-manglish-nlp[all]"            # Everything

Quick Start

from malaysian_manglish_nlp import sentiment, normalize, ner, detect_language

# Sentiment analysis
result = sentiment("Weh best gila makanan dia!")
print(result)
# {'sentiment': 'positive', 'score': 0.94, 'raw_score': 2.5}

# Text normalization
clean = normalize("xpe la bro, aku ok je")
print(clean)  # "tidak apa la bro, aku okay sahaja"

# Named Entity Recognition
entities = ner("Ali pergi Pavilion KL semalam")
print(entities)
# [{'text': 'Ali', 'type': 'PERSON', 'start': 0, 'end': 3},
#  {'text': 'Pavilion KL', 'type': 'LOCATION', 'start': 10, 'end': 21}]

# Language detection
lang = detect_language("Eh jom makan, I'm hungry gila")
print(lang)
# {'language': 'manglish', 'confidence': 0.87}

HuggingFace

Resource	Link
🤗 Model	vexccz/manglish-nlp-sentiment — XLM-Roberta multi-task (sentiment 95.0%, emotion 90.3%, intent 97.5%)
🤗 Dataset	vexccz/manglish-nlp-dataset — 14,384 labeled Manglish examples
🤗 Demo	vexccz/manglish-nlp-demo — Gradio interactive demo (7 tabs)

Features

Text Processing

normalize — Expand shortforms (638+ mappings: nk→nak, mcm→macam, sbb→sebab)
clean — Remove URLs, mentions, repeated chars, HTML
formalize — Convert informal to formal Malay (aku→saya, ko→anda)
tokenize — Malaysian-aware tokenizer (handles URLs, hashtags, emoticons)
stemmer — Malay stemmer with nasal assimilation (250+ roots)
segment — Sentence segmentation for code-switched text
spelling — Spell checking with Malaysian dictionary
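
As a rough illustration of what shortform expansion involves, here is a minimal standalone sketch. It does not call the package: the `SHORTFORMS` table holds only the three mappings quoted above, and `expand_shortforms` is a hypothetical helper name, so treat it as a picture of the idea rather than the library's implementation.

```python
import re

# Three mappings quoted in the feature list; the package's own table is larger.
SHORTFORMS = {"nk": "nak", "mcm": "macam", "sbb": "sebab"}

# Matches any character repeated three or more times in a row.
_ELONGATED = re.compile(r"(.)\1{2,}")


def expand_shortforms(text: str) -> str:
    """Lower-case, squeeze elongated characters, then expand known shortforms."""
    squeezed = _ELONGATED.sub(r"\1", text.lower())
    return " ".join(SHORTFORMS.get(tok, tok) for tok in squeezed.split())


print(expand_shortforms("Aku nk makan sbb laparrrr"))
# aku nak makan sebab lapar
```

Splitting on whitespace keeps the sketch short; a tokenizer that separates punctuation would be needed before this could handle text like "nk," correctly.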

Analysis

sentiment — Sentiment analysis, including aspect-based sentiment (food, service, price, etc.)
emotion — 8 emotion categories (happy, sad, angry, fear, surprise, disgust, love, neutral)
sarcasm — Sarcasm detection for Malaysian text
hate_speech — Hate speech detection (6 categories, severity levels)
intent — 8 intent types (question, request, complaint, greeting, opinion, statement, command, offer)
topic — Topic classification across 12 topics (food, politics, sports, tech, education, etc.)
stance — Stance detection (support/oppose/neutral)
profanity — Profanity detection with leetspeak evasion handling
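
One common way to handle leetspeak evasion is to undo character substitutions before matching against a word list. The sketch below assumes that approach; the substitution table, the one-word blocklist, and the function name are all hypothetical and are not taken from the package.

```python
# Hypothetical substitution table and blocklist, for illustration only.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)
BLOCKLIST = {"bodoh"}


def contains_profanity(text: str) -> bool:
    """Undo common leetspeak substitutions, then check tokens against a blocklist."""
    decoded = text.lower().translate(LEET_MAP)
    tokens = (tok.strip(".,!?") for tok in decoded.split())
    return any(tok in BLOCKLIST for tok in tokens)


print(contains_profanity("b0d0h la kau!"))  # True
print(contains_profanity("best gila makanan dia"))  # False
```

Decoding every digit this way also rewrites genuine numbers, so a real detector would need to be more selective about when to apply the mapping.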

Entity & Structure

ner — Named Entity Recognition (11 types: PERSON, ORG, LOC, PRODUCT, EVENT, MONEY, PHONE, EMAIL, DATE, TIME, PERCENT)
pos_tag — Part-of-speech tagging (15 tags)
dependency — Dependency parsing (SVO extraction)
coreference — Pronoun resolution with Malaysian gender heuristics
keywords — Keyword extraction (frequency, RAKE, TF-IDF, TextRank)
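
Pattern-like entity types such as EMAIL, MONEY, and PERCENT can be approximated with regular expressions. The following is a sketch under that assumption; the patterns are illustrative guesses, not the package's rules, and the output keys simply mirror the shape shown in Quick Start.

```python
import re

# Illustrative patterns only; real-world formats vary more than this.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "MONEY": re.compile(r"RM\s?\d+(?:\.\d{2})?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
}


def find_entities(text: str) -> list:
    """Return pattern matches as dicts with character offsets, sorted by position."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(
                {
                    "text": match.group(),
                    "type": label,
                    "start": match.start(),
                    "end": match.end(),
                }
            )
    return sorted(entities, key=lambda ent: ent["start"])


sample = "Diskaun 20% untuk order RM50, email ali@example.com"
print([ent["type"] for ent in find_entities(sample)])
# ['PERCENT', 'MONEY', 'EMAIL']
```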

Language Detection & Code-Switching

language — Language identification (Malay/English/Manglish/Mixed)
code_switching — Code-switching point detection, switch ratio, segmentation by language
dialect — 7 Malay varieties (Standard plus 6 regional dialects: Kelantan, Terengganu, N9, Kedah, Sarawak, Sabah) with normalization
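
To make "switch points" and "switch ratio" concrete, here is a small dictionary-based sketch. The word lists are tiny placeholders, and the ratio definition (switches divided by the number of adjacent known-language pairs) is an assumption chosen for illustration; the package may define it differently.

```python
# Placeholder word lists; a real detector would use far larger lexicons.
MALAY = {"jom", "makan", "gila", "aku", "nak", "pergi"}
ENGLISH = {"i'm", "hungry", "i", "am", "the", "is"}


def tag_tokens(text: str) -> list:
    """Tag each token as 'ms', 'en', or 'unk' by dictionary lookup."""
    tags = []
    for tok in text.lower().split():
        word = tok.strip(".,!?")
        if word in MALAY:
            lang = "ms"
        elif word in ENGLISH:
            lang = "en"
        else:
            lang = "unk"
        tags.append((word, lang))
    return tags


def switch_points(tags: list) -> list:
    """Token indices where the language differs from the previous known token."""
    points, prev = [], None
    for i, (_, lang) in enumerate(tags):
        if lang == "unk":
            continue
        if prev is not None and lang != prev:
            points.append(i)
        prev = lang
    return points


def switch_ratio(tags: list) -> float:
    """Switches divided by the number of adjacent known-language pairs."""
    known = [lang for _, lang in tags if lang != "unk"]
    if len(known) < 2:
        return 0.0
    return len(switch_points(tags)) / (len(known) - 1)


tags = tag_tokens("Eh jom makan, I'm hungry gila")
print(switch_points(tags))  # [3, 5]
print(switch_ratio(tags))  # 0.5
```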

Semantic & Similarity

similarity — Text similarity (Jaccard, cosine, overlap, semantic)
embeddings — Word2Vec/FastText trained on Malaysian social media (518-word vocabulary, 100 dimensions)
augmentation — Text augmentation (synonym replacement, shortform variation)
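
For reference, the three lexical metrics named above can be written in a few lines over whitespace tokens. This is a generic sketch of the standard formulas, not the package's code; the function names are chosen here for illustration.

```python
import math
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def overlap(a: str, b: str) -> float:
    """Shared tokens divided by the size of the smaller token set."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def cosine(a: str, b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


x, y = "makanan dia best", "makanan dia teruk"
print(jaccard(x, y))  # 0.5
print(round(cosine(x, y), 3))  # 0.667
```

Note that these two sentences score fairly high despite opposite sentiment, which is why lexical metrics are usually paired with a semantic one.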

Generation & Understanding

translation — Rule-based BM↔EN translation (1000+ word pairs, phrase translation)
summarization — Extractive summarization using TextRank algorithm
text_generation — N-gram based text generation and autocomplete
qa — Extractive question answering with TF-IDF retrieval
discourse — Argument mining and fallacy detection
ocr_normalize — OCR text correction for Malaysian documents
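
N-gram autocomplete can be pictured with a bigram count table: for each word, count which words follow it and suggest the most frequent ones. The sketch below assumes exactly that, with a three-sentence toy corpus and hypothetical function names.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences: list) -> dict:
    """Count, for every word, the words that directly follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def suggest(model: dict, word: str, k: int = 3) -> list:
    """Return up to k most frequent continuations of `word`."""
    followers = model.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]


corpus = ["aku nak makan", "aku nak pergi makan", "aku nak makan nasi"]
model = train_bigrams(corpus)
print(suggest(model, "nak"))  # ['makan', 'pergi']
```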

Preprocessing & Utilities

normalizer — Advanced normalization (money, dates, times, elongated text)
dictionary — Malay-English dictionary lookup
pipeline — Chain multiple modules together
calibration — Confidence scoring for predictions
hybrid_ml — Feature extraction + logistic classifier
evaluate — Model evaluation and regression tracking
cache — LRU caching for performance
profiler — Performance benchmarking tools
tuning — Hyperparameter tuning and threshold optimization
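
The pipeline and cache ideas combine naturally: compose text-to-text functions into one callable, then memoize it. Below is a minimal sketch using only the standard library; the step functions and `make_pipeline` are hypothetical stand-ins, not the package's `pipeline` or `cache` modules.

```python
from functools import lru_cache, reduce

SHORTFORMS = {"nk": "nak"}  # placeholder table


def lowercase(text: str) -> str:
    return text.lower()


def drop_mentions(text: str) -> str:
    return " ".join(tok for tok in text.split() if not tok.startswith("@"))


def expand(text: str) -> str:
    return " ".join(SHORTFORMS.get(tok, tok) for tok in text.split())


def make_pipeline(*steps):
    """Compose text-to-text steps into a single left-to-right function."""

    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)

    return run


# Memoize the composed function so repeated inputs skip all steps.
process = lru_cache(maxsize=1024)(make_pipeline(lowercase, drop_mentions, expand))

print(process("@ali Aku NK makan"))  # aku nak makan
process("@ali Aku NK makan")  # served from the cache
print(process.cache_info().hits)  # 1
```

Caching pays off on social media data, where the same short messages tend to repeat many times.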

Integration

spacy_integration — Custom spaCy Language class and pipeline components
rest_api — FastAPI REST API with rate limiting and CORS
langchain_tool — LangChain tool wrappers
CLI — Command-line interface with subcommands

Fine-Tuned Model

An XLM-Roberta multi-task model fine-tuned on 14,384 labeled Manglish examples for sentiment, emotion, and intent classification.

from malaysian_manglish_nlp.transformers.manglish_model import load_model, predict

model = load_model()  # Auto-downloads from HuggingFace on first use
result = predict("gila best servis ni")
# {'sentiment': {'label': 'positive', 'confidence': 0.96},
#  'emotion':    {'label': 'happy',    'confidence': 0.85},
#  'intent':     {'label': 'opinion',  'confidence': 1.00}}
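
Downstream code often wants only the predictions it can trust. Here is a small sketch that filters a result of the shape shown above by confidence; the dictionary is copied from the example output rather than produced by the model, and the 0.9 threshold and helper name are arbitrary choices for illustration.

```python
# Copied from the example output above, not generated by running the model.
result = {
    "sentiment": {"label": "positive", "confidence": 0.96},
    "emotion": {"label": "happy", "confidence": 0.85},
    "intent": {"label": "opinion", "confidence": 1.00},
}


def confident_labels(result: dict, threshold: float = 0.9) -> dict:
    """Keep only the tasks whose confidence meets the threshold."""
    return {
        task: out["label"]
        for task, out in result.items()
        if out["confidence"] >= threshold
    }


print(confident_labels(result))
# {'sentiment': 'positive', 'intent': 'opinion'}
```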

Task	Accuracy	Classes
Sentiment	95.0%	positive, negative, neutral
Emotion	90.3%	happy, sad, angry, fear, surprise, disgust, love, neutral
Intent	97.5%	question, statement, request, complaint, greeting, opinion, command, offer
Average	94.3%	—
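
The Average row is consistent with an unweighted mean of the three task accuracies, which can be checked directly:

```python
# Accuracy figures from the table above, in percent.
accuracy = {"sentiment": 95.0, "emotion": 90.3, "intent": 97.5}

macro_average = sum(accuracy.values()) / len(accuracy)
print(round(macro_average, 1))  # 94.3
```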

Model: vexccz/manglish-nlp-sentiment on HuggingFace. Requires pip install malaysian-manglish-nlp[transformers].

Performance

23,000+ texts/sec sentiment analysis throughput (rule-based)
<0.5s import time for core modules
Zero dependencies for core text processing
LRU caching on heavy operations (stemmer, normalize, sentiment, language detection)
Lazy loading — only imports what you use
Pre-compiled regex patterns across 6 modules
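
To show what pre-compiled patterns and a throughput figure look like in practice, here is a standalone sketch. The three patterns and the `clean` and `texts_per_second` helpers are hypothetical, and the measured rate depends entirely on the machine and the function being timed, so no number is quoted.

```python
import re
import time

# Compiled once at import time instead of on every call.
_URL = re.compile(r"https?://\S+")
_MENTION = re.compile(r"@\w+")
_SPACES = re.compile(r"\s+")


def clean(text: str) -> str:
    """Strip URLs and mentions, then collapse leftover whitespace."""
    text = _URL.sub("", text)
    text = _MENTION.sub("", text)
    return _SPACES.sub(" ", text).strip()


def texts_per_second(func, texts: list) -> float:
    """Time one pass of func over texts and return the processing rate."""
    start = time.perf_counter()
    for text in texts:
        func(text)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")


sample = "@ali tengok ni https://example.com best gila"
print(clean(sample))  # tengok ni best gila
print(f"{texts_per_second(clean, [sample] * 10_000):,.0f} texts/sec")
```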

Comparison with Malaya

Feature	malaysian-manglish-nlp	Malaya
Core dependencies	None	TensorFlow/PyTorch required
Import time	<0.5s	10-30s
Manglish-first	Built for informal MY text	Formal BM focus
Modules	51	~40
Throughput	23k+ texts/sec	Varies (GPU recommended)
Python support	3.8-3.12	3.8+
Aspect sentiment	✅	❌
Code-switching detection	✅	❌
Hate speech detection	✅	Limited
Discourse analysis	✅	❌
OCR normalization	✅	❌
Translation (rule-based)	✅	❌
Text generation	✅	❌

Both are solid choices. Malaya excels at formal Bahasa Melayu with deep learning models. malaysian-manglish-nlp is optimized for informal, code-switched Malaysian text with minimal overhead, and adds features such as aspect sentiment, code-switching detection, and discourse analysis.

CLI Usage

# Full analysis
manglish analyze "Weh best gila makanan dia!"

# Sentiment
manglish sentiment "Teruk la service kat sini"

# Normalize shortforms
manglish normalize "xpe la bro aku otw"

# Translate
manglish translate "Aku nak pergi makan" --to en

# NER
manglish ner "Ahmad kerja kat Google Malaysia"

# Summarize file
manglish summarize --file article.txt

# Run benchmarks
manglish benchmark

# Profile performance
manglish profile "Sample text here"
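
For readers curious how a subcommand layout like the one above can be wired, here is a sketch using `argparse`. It mirrors a few of the commands and flags shown but is not the package's CLI code; `build_parser` and the `target` attribute name are made up for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subparser per command."""
    parser = argparse.ArgumentParser(prog="manglish")
    sub = parser.add_subparsers(dest="command", required=True)

    # Commands that take a single text argument.
    for name in ("analyze", "sentiment", "normalize", "ner"):
        sub.add_parser(name).add_argument("text")

    translate = sub.add_parser("translate")
    translate.add_argument("text")
    translate.add_argument("--to", dest="target", default="en")

    summarize = sub.add_parser("summarize")
    summarize.add_argument("--file", required=True)
    return parser


args = build_parser().parse_args(["translate", "Aku nak pergi makan", "--to", "en"])
print(args.command, args.target)  # translate en
```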

REST API

# Start API server
uvicorn malaysian_manglish_nlp.rest_api:app --port 8000

# Or with Docker
docker-compose up -d

Endpoints:

POST /analyze — Full analysis
POST /sentiment — Sentiment only
POST /normalize — Normalize text
POST /translate — Translate text
POST /ner — Named entities
POST /pos — POS tags
POST /summarize — Summarize text
POST /batch — Batch process multiple texts
GET /health — Health check
GET /modules — List available modules
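
A client call can be sketched with the standard library alone. The request body schema is not documented here, so the `"text"` field name below is an assumption, as are the `build_request` helper and the localhost address; check the API documentation for the real payload format.

```python
import json
import urllib.request


def build_request(base_url: str, endpoint: str, text: str) -> urllib.request.Request:
    """Build a JSON POST request; the "text" field name is an assumption."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}{endpoint}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("http://localhost:8000", "/sentiment", "Teruk la service kat sini")
print(req.get_method(), req.full_url)  # POST http://localhost:8000/sentiment

# With the server running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```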

Testing

# Run all tests (900+ tests)
python -m pytest tests/ -q

# Run specific test file
python -m pytest tests/test_sentiment.py -v

# Run with coverage
python -m pytest tests/ --cov=malaysian_manglish_nlp

# Run heavy tests (requires gensim)
RUN_HEAVY_TESTS=1 python -m pytest tests/test_word_embeddings.py -v

Documentation

Full documentation is available at manglish-nlp.readthedocs.io

Includes:

Module reference for all 51 modules
API documentation with examples
Performance benchmarks
Comparison with Malaya
Contributing guide
Changelog

Contributing

Contributions welcome! Areas where help is needed:

More training data — Manglish text samples from social media
Dialect support — More regional variants and normalization rules
Benchmarks — Comparative benchmarks on Malaysian NLP datasets
Documentation — More usage examples and tutorials

git clone https://github.com/ZafranYusof/malaysia-manglish-nlp.git
cd malaysia-manglish-nlp
pip install -e ".[all]"
python -m pytest tests/ -q

License

MIT — see LICENSE for details.

Citation

If you use malaysian-manglish-nlp in your research, please cite:

@software{malaysian_manglish_nlp,
  author = {Zafran},
  title = {malaysian-manglish-nlp: Full NLP toolkit for Malaysian Manglish},
  year = {2026},
  url = {https://github.com/ZafranYusof/malaysia-manglish-nlp},
  version = {3.2.0}
}

Release history

Version	Released
3.3.0 (this version)	Jun 1, 2026
3.2.0	May 31, 2026
3.1.0	May 30, 2026
3.0.0	May 29, 2026

Download files

Source Distribution

malaysian_manglish_nlp-3.3.0.tar.gz (1.3 MB), uploaded Jun 1, 2026

Built Distribution

malaysian_manglish_nlp-3.3.0-cp314-cp314-win_amd64.whl (1.3 MB), uploaded Jun 1, 2026, CPython 3.14, Windows x86-64

File details

Details for the file malaysian_manglish_nlp-3.3.0.tar.gz.

File metadata

Download URL: malaysian_manglish_nlp-3.3.0.tar.gz
Upload date: Jun 1, 2026
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for malaysian_manglish_nlp-3.3.0.tar.gz
Algorithm	Hash digest
SHA256	`a34084e1fb5d2dec8dfd419daee1238e15780292593cbf3e262c80eb2ab12b79`
MD5	`dd4ea9d8db75de5e059432ee9510fda3`
BLAKE2b-256	`063ea553c58acb2538275893db61d77fa029be40402ba532fa9dc4ca49f7c288`

File details

Details for the file malaysian_manglish_nlp-3.3.0-cp314-cp314-win_amd64.whl.

File metadata

Download URL: malaysian_manglish_nlp-3.3.0-cp314-cp314-win_amd64.whl
Upload date: Jun 1, 2026
Size: 1.3 MB
Tags: CPython 3.14, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for malaysian_manglish_nlp-3.3.0-cp314-cp314-win_amd64.whl
Algorithm	Hash digest
SHA256	`fd37419f6f62c2184611440a968c52c08c933a7f1d850b97189f8e6d99290b4e`
MD5	`f65d68be989f471ef910ff74f4afc9bc`
BLAKE2b-256	`05b5e7259d81775d63b7ea67911714320989d4ae1ad489edce7a843bca10140a`
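
A downloaded file can be checked against the SHA256 digests listed above with `hashlib`. The sketch below assumes the source distribution sits in the current directory; the helper names are made up for this example, and the check is skipped if the file is absent.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest listed above for the 3.3.0 source distribution.
EXPECTED = "a34084e1fb5d2dec8dfd419daee1238e15780292593cbf3e262c80eb2ab12b79"
sdist = Path("malaysian_manglish_nlp-3.3.0.tar.gz")
if sdist.exists():
    print(matches(sdist, EXPECTED))
```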
