manglish-nlp

Full NLP toolkit for Malaysian Manglish — 51 modules, zero dependencies for core.

Built for real-world Malaysian text: social media, news, chat messages, code-switched Malay-English content.

Installation

pip install manglish-nlp

Extras

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern
pip install "manglish-nlp[transformers]"   # HuggingFace transformer models
pip install "manglish-nlp[embeddings]"     # Word2Vec/FastText embeddings
pip install "manglish-nlp[spacy]"          # spaCy integration
pip install "manglish-nlp[api]"            # FastAPI REST API
pip install "manglish-nlp[langchain]"      # LangChain tools
pip install "manglish-nlp[all]"            # Everything

Quick Start

from manglish_nlp import sentiment, normalize, ner, detect_language

# Sentiment analysis
result = sentiment("Weh best gila makanan dia!")
print(result)
# {'sentiment': 'positive', 'score': 0.94, 'raw_score': 2.5}

# Text normalization
clean = normalize("xpe la bro, aku ok je")
print(clean)  # "tidak apa la bro, aku okay sahaja"

# Named Entity Recognition
entities = ner("Ali pergi Pavilion KL semalam")
print(entities)
# [{'text': 'Ali', 'type': 'PERSON', 'start': 0, 'end': 3},
#  {'text': 'Pavilion KL', 'type': 'LOCATION', 'start': 10, 'end': 21}]

# Language detection
lang = detect_language("Eh jom makan, I'm hungry gila")
print(lang)
# {'language': 'manglish', 'confidence': 0.87}
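
The entity dictionaries shown above carry character offsets, which makes them easy to check and post-process. The following is a minimal sketch that uses only the sample output from this section (hard-coded, so it runs without the package installed). It assumes `end` is exclusive, which is consistent with the sample but not otherwise confirmed; `spans_match` and `group_by_type` are hypothetical helper names, not part of manglish-nlp.

```python
# Sample text and entity list copied from the Quick Start output above.
text = "Ali pergi Pavilion KL semalam"
entities = [
    {"text": "Ali", "type": "PERSON", "start": 0, "end": 3},
    {"text": "Pavilion KL", "type": "LOCATION", "start": 10, "end": 21},
]


def spans_match(text, entities):
    """Return True if every entity's offsets slice back to its text.

    Assumes 'end' is an exclusive offset (hypothetical helper).
    """
    return all(text[e["start"]:e["end"]] == e["text"] for e in entities)


def group_by_type(entities):
    """Group entity strings by their type label (hypothetical helper)."""
    grouped = {}
    for e in entities:
        grouped.setdefault(e["type"], []).append(e["text"])
    return grouped


print(spans_match(text, entities))   # True
print(group_by_type(entities))       # {'PERSON': ['Ali'], 'LOCATION': ['Pavilion KL']}
```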

Features (51 Modules)

Text Processing

  • normalize — Expand shortforms (638+ mappings: nk→nak, mcm→macam, sbb→sebab)
  • clean — Remove URLs, mentions, repeated chars, HTML
  • formalize — Convert informal to formal Malay (aku→saya, ko→anda)
  • tokenize — Malaysian-aware tokenizer (handles URLs, hashtags, emoticons)
  • stemmer — Malay stemmer with nasal assimilation (250+ roots)
  • segment — Sentence segmentation for code-switched text
  • spelling — Spell checking with Malaysian dictionary
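
To illustrate the idea behind shortform expansion, here is a minimal dictionary-lookup sketch. It is not the library's implementation: the table holds only the three pairs quoted above (nk→nak, mcm→macam, sbb→sebab), and `expand_shortforms` is a hypothetical name.

```python
import re

# Tiny illustrative table; only these three pairs come from the README.
SHORTFORMS = {"nk": "nak", "mcm": "macam", "sbb": "sebab"}

# Split into word runs and non-word runs so spacing and punctuation survive.
TOKEN_RE = re.compile(r"\w+|\W+")


def expand_shortforms(text, table=SHORTFORMS):
    """Replace known shortforms token by token, leaving everything else as-is."""
    return "".join(table.get(tok.lower(), tok) for tok in TOKEN_RE.findall(text))


print(expand_shortforms("aku nk makan sbb lapar"))
# aku nak makan sebab lapar
```

Matching whole tokens rather than substrings avoids rewriting letters inside longer words.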

Analysis

  • sentiment — Sentiment analysis, including aspect-based sentiment (food, service, price, etc.)
  • emotion — 8 emotion categories (happy, sad, angry, fear, surprise, disgust, love, neutral)
  • sarcasm — Sarcasm detection for Malaysian text
  • hate_speech — Hate speech detection (6 categories, severity levels)
  • intent — 8 intent types (question, request, complaint, greeting, opinion, statement, command, offer)
  • topic — Topic classification across 12 topics (food, politics, sports, tech, education, etc.)
  • stance — Stance detection (support/oppose/neutral)
  • profanity — Profanity detection with leetspeak evasion handling
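
One common way to handle leetspeak evasion is to undo character substitutions before checking a word list. The sketch below shows that general technique only; the substitution table, the one-word blocklist, and the function names are all hypothetical and do not reflect the library's actual rules.

```python
# Hypothetical character substitutions and blocklist, for illustration only.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"})
BLOCKLIST = {"bodoh"}


def squeeze(word):
    """Collapse runs of the same character: 'bodoooh' -> 'bodoh'."""
    out = []
    for ch in word:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def flag_profanity(text):
    """Return the raw tokens that match the blocklist after de-obfuscation."""
    hits = []
    for raw in text.lower().split():
        word = squeeze(raw.translate(LEET).strip(".,!?"))
        if word in BLOCKLIST:
            hits.append(raw)
    return hits


print(flag_profanity("b0d0h la kau"))  # ['b0d0h']
```

Because `squeeze` also collapses legitimate double letters, blocklist entries would need to be stored in squeezed form in a fuller version.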

Entity & Structure

  • ner — Named Entity Recognition (11 types: PERSON, ORG, LOC, PRODUCT, EVENT, MONEY, PHONE, EMAIL, DATE, TIME, PERCENT)
  • pos_tag — Part-of-speech tagging (15 tags)
  • dependency — Dependency parsing (SVO extraction)
  • coreference — Pronoun resolution with Malaysian gender heuristics
  • keywords — Keyword extraction (frequency, RAKE, TF-IDF, TextRank)

Language Detection & Code-Switching

  • language — Language identification (Malay/English/Manglish/Mixed)
  • code_switching — Code-switching point detection, switch ratio, segmentation by language
  • dialect — Standard Malay plus 6 regional dialects (Kelantan, Terengganu, Negeri Sembilan (N9), Kedah, Sarawak, Sabah) with normalization
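
Switch points and switch ratio can be illustrated with a toy lexicon-based tagger. This is a sketch under stated assumptions: the two word sets are hypothetical mini-lexicons, the ratio is defined here as switches divided by adjacent pairs of known-language tokens, and none of the names come from the library.

```python
# Hypothetical mini-lexicons; a real system would use far larger word lists.
MALAY = {"jom", "makan", "gila", "aku", "nak", "pergi"}
ENGLISH = {"i'm", "hungry", "the", "is", "food", "good"}


def tag_tokens(text):
    """Tag each token as 'ms', 'en', or 'unk' by lexicon lookup."""
    tags = []
    for tok in text.lower().split():
        tok = tok.strip(".,!?")
        if tok in MALAY:
            tags.append((tok, "ms"))
        elif tok in ENGLISH:
            tags.append((tok, "en"))
        else:
            tags.append((tok, "unk"))
    return tags


def switch_points(tags):
    """Token indices where the language differs from the previous known token."""
    points, prev = [], None
    for i, (_, lang) in enumerate(tags):
        if lang == "unk":
            continue
        if prev is not None and lang != prev:
            points.append(i)
        prev = lang
    return points


def switch_ratio(tags):
    """Switches divided by the number of adjacent known-language pairs."""
    known = [lang for _, lang in tags if lang != "unk"]
    if len(known) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(known, known[1:]) if a != b)
    return switches / (len(known) - 1)


tags = tag_tokens("Eh jom makan, I'm hungry gila")
print(switch_points(tags))  # [3, 5]
print(switch_ratio(tags))   # 0.5
```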

Semantic & Similarity

  • similarity — Text similarity (Jaccard, cosine, overlap, semantic)
  • embeddings — Word2Vec/FastText trained on Malaysian social media (518-word vocabulary, 100 dimensions)
  • augmentation — Text augmentation (synonym replacement, shortform variation)
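
For reference, the three lexical metrics named above have simple textbook definitions over whitespace tokens. The sketch below implements those definitions with the standard library; the function names and the whitespace tokenization are assumptions, and the library's own tokenization may differ.

```python
import math
from collections import Counter


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|) over token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def cosine(a, b):
    """Cosine of the angle between token-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


print(jaccard("aku nak makan", "aku nak tidur"))  # 0.5
```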

Generation & Understanding

  • translation — Rule-based BM↔EN translation (1000+ word pairs, phrase translation)
  • summarization — Extractive summarization using TextRank algorithm
  • text_generation — N-gram based text generation and autocomplete
  • qa — Extractive question answering with TF-IDF retrieval
  • discourse — Argument mining and fallacy detection
  • ocr_normalize — OCR text correction for Malaysian documents
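
N-gram autocomplete can be sketched with a bigram count table: record which word follows which, then suggest the most frequent followers. The four-sentence corpus and the names `train_bigrams` and `suggest` below are hypothetical and only illustrate the technique.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, which words follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def suggest(model, word, k=2):
    """Return up to k most frequent next words after `word`."""
    followers = model.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]


# Hypothetical toy corpus.
corpus = ["aku nak makan", "aku nak tidur", "aku nak makan nasi", "jom makan nasi"]
model = train_bigrams(corpus)
print(suggest(model, "nak"))    # ['makan', 'tidur']
print(suggest(model, "makan"))  # ['nasi']
```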

Preprocessing & Utilities

  • normalizer — Advanced normalization (money, dates, times, elongated text)
  • dictionary — Malay-English dictionary lookup
  • pipeline — Chain multiple modules together
  • calibration — Confidence scoring for predictions
  • hybrid_ml — Feature extraction + logistic classifier
  • evaluate — Model evaluation and regression tracking
  • cache — LRU caching for performance
  • profiler — Performance benchmarking tools
  • tuning — Hyperparameter tuning and threshold optimization
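
Chaining steps into a pipeline amounts to function composition. The sketch below shows one way to do it with plain Python callables; the three cleaning steps and `make_pipeline` are hypothetical stand-ins, not the interface of the `pipeline` module.

```python
import re
from functools import reduce


def strip_urls(text):
    """Remove http/https URLs."""
    return re.sub(r"https?://\S+", "", text)


def squeeze_elongated(text):
    """Collapse 3+ repeats of a character to one: 'bestttt' -> 'best'."""
    return re.sub(r"(.)\1{2,}", r"\1", text)


def collapse_spaces(text):
    """Normalize whitespace runs to a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def make_pipeline(*steps):
    """Return a function that applies each step to the output of the previous one."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


clean_text = make_pipeline(strip_urls, squeeze_elongated, collapse_spaces)
print(clean_text("Bestttt gilaaa   makanan ni https://example.com"))
# Best gila makanan ni
```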

Integration

  • spacy_integration — Custom spaCy Language class and pipeline components
  • rest_api — FastAPI REST API with rate limiting and CORS
  • langchain_tool — LangChain tool wrappers
  • CLI — Command-line interface with subcommands

Performance

  • 23,000+ texts/sec sentiment analysis throughput
  • <0.5s import time for core modules
  • Zero dependencies for core text processing
  • LRU caching on heavy operations (stemmer, normalize, sentiment, language detection)
  • Lazy loading — only imports what you use
  • Pre-compiled regex patterns across 6 modules
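
Two of the techniques listed above, pre-compiled regex patterns and LRU caching, can be shown with the standard library alone. This is a generic sketch, not the library's code; `strip_mentions` and the cache size are hypothetical choices.

```python
import re
from functools import lru_cache

# Compiled once at import time rather than on every call.
MENTION_RE = re.compile(r"@\w+")


@lru_cache(maxsize=1024)
def strip_mentions(text):
    """Remove @mentions; repeated inputs are answered from the cache."""
    return MENTION_RE.sub("", text).strip()


strip_mentions("@ali jom makan")
strip_mentions("@ali jom makan")  # second call is served from the cache
info = strip_mentions.cache_info()
print(info.hits, info.misses)  # 1 1
```

Caching pays off when the same strings recur, as with retweets or repeated chat phrases; for mostly unique inputs it only adds memory overhead.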

Comparison with Malaya

| Feature | manglish-nlp | Malaya |
|---|---|---|
| Core dependencies | None | TensorFlow/PyTorch required |
| Import time | <0.5s | 10-30s |
| Manglish-first | Built for informal MY text | Formal BM focus |
| Modules | 51 | ~40 |
| Throughput | 23k+ texts/sec | Varies (GPU recommended) |
| Python support | 3.8-3.12 | 3.8+ |
| Aspect sentiment | | |
| Code-switching detection | | |
| Hate speech detection | | Limited |
| Discourse analysis | | |
| OCR normalization | | |
| Translation (rule-based) | | |
| Text generation | | |

Both are solid choices. Malaya excels at formal Bahasa Melayu with deep learning models. manglish-nlp targets informal, code-switched Malaysian text, with no core dependencies and a fast import time.

CLI Usage

# Full analysis
manglish analyze "Weh best gila makanan dia!"

# Sentiment
manglish sentiment "Teruk la service kat sini"

# Normalize shortforms
manglish normalize "xpe la bro aku otw"

# Translate
manglish translate "Aku nak pergi makan" --to en

# NER
manglish ner "Ahmad kerja kat Google Malaysia"

# Summarize file
manglish summarize --file article.txt

# Run benchmarks
manglish benchmark

# Profile performance
manglish profile "Sample text here"

REST API

# Start API server
uvicorn manglish_nlp.rest_api:app --port 8000

# Or with Docker
docker-compose up -d

Endpoints:

  • POST /analyze — Full analysis
  • POST /sentiment — Sentiment only
  • POST /normalize — Normalize text
  • POST /translate — Translate text
  • POST /ner — Named entities
  • POST /pos — POS tags
  • POST /summarize — Summarize text
  • POST /batch — Batch process multiple texts
  • GET /health — Health check
  • GET /modules — List available modules
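
A client for these endpoints can be written with the standard library. The sketch below builds a POST request for `/sentiment` without sending it. The request schema is not documented here, so the JSON body with a single `text` field is an assumption, as are the `localhost` host and the helper names `build_request` and `call`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port taken from the uvicorn example; host is assumed


def build_request(endpoint, text):
    """Build a JSON POST request. Assumes the API accepts {"text": ...}."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(endpoint, text, timeout=5):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(endpoint, text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_request("/sentiment", "Teruk la service kat sini")
print(req.get_method(), req.full_url)  # POST http://localhost:8000/sentiment
```

With the server started as shown above, `call("/sentiment", "Teruk la service kat sini")` would send the request; check the API documentation for the actual field names before relying on this shape.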

Testing

# Run all tests (900+ tests)
python -m pytest tests/ -q

# Run specific test file
python -m pytest tests/test_sentiment.py -v

# Run with coverage
python -m pytest tests/ --cov=manglish_nlp

# Run heavy tests (requires gensim)
RUN_HEAVY_TESTS=1 python -m pytest tests/test_word_embeddings.py -v

Documentation

Full documentation is available at manglish-nlp.readthedocs.io.

Includes:

  • Module reference for all 51 modules
  • API documentation with examples
  • Performance benchmarks
  • Comparison with Malaya
  • Contributing guide
  • Changelog

Contributing

Contributions welcome! Areas where help is needed:

  1. More training data — Manglish text samples from social media
  2. Dialect support — More regional variants and normalization rules
  3. Benchmarks — Comparative benchmarks on Malaysian NLP datasets
  4. Documentation — More usage examples and tutorials

git clone https://github.com/ZafranYusof/manglish-nlp.git
cd manglish-nlp
pip install -e ".[all]"
python -m pytest tests/ -q

License

MIT — see LICENSE for details.

Citation

If you use manglish-nlp in your research, please cite:

@software{manglish_nlp,
  author = {Zafran},
  title = {manglish-nlp: Full NLP toolkit for Malaysian Manglish},
  year = {2026},
  url = {https://github.com/ZafranYusof/manglish-nlp},
  version = {3.0.0}
}
