ML-Ask Official — high-performance Japanese emotion analysis (original system, Python rewrite)
ML-Ask Official
eMotive eLement and Expression Analysis system — official Python rewrite by the original author.
High-performance Japanese emotion analysis. Originally developed by Michal Ptaszynski, Pawel Dybala, Rafal Rzepka and Kenji Araki at Hokkaido University, the system was first described in Ptaszynski et al. (2017, Journal of Open Research Software) and extended with new dictionaries in Wang et al. (2024, Applied Sciences). This package is the official Python implementation maintained by the original author.
- 📦 PyPI: https://pypi.org/project/mlask-official/
- 🧪 Hosted demo: https://mlask-official.streamlit.app/
- 🐙 Source + issues: https://github.com/ptaszynski/mlask-official
- 📝 Changelog: CHANGELOG.md
- 🗺️ Roadmap: IMPROVEMENTS.md
- 📚 Citation file: CITATION.cff
Quick start
# 1. system MeCab + IPADIC (macOS shown; see Installation for other OSes)
brew install mecab mecab-ipadic
# 2. the package
pip install 'mlask-official[app]'
# 3. analyze a sentence
python -c "from mlask_official import MLAskOfficial; \
print(MLAskOfficial().analyze('彼のことは嫌いではない!')['valence'])"
# → POSITIVE
CLI and web app both ship in the same install:
echo "腹が立つ" | mlask analyze --format pipe
streamlit run streamlit_app.py # from a source checkout
About ML-Ask
ML-Ask (eMotive eLement and Expression Analysis system) is a keyword-based rule system for automatic affect annotation of Japanese utterances. It combines a curated lexicon of ~4,700 emotive expressions across 10 categories with a particle-stripped content-form pass and a Contextual Valence Shifter (CVS) layer for negation.
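To make the role of the CVS layer concrete, here is a minimal sketch of a negation flip. It is not the package's implementation: the one-entry lexicon, the two negation endings, and the `detect` helper are all hypothetical stand-ins. The only mapping taken from this README is that a negated 嫌い is reported under yorokobi and suki with a `*CVS` marker.

```python
import re

# Hypothetical one-entry lexicon: expression -> emotion class.
LEXICON = {"嫌い": "iya"}

# Illustrative subset of negation endings (the package ships 108 patterns).
NEGATION = re.compile(r"(ではない|じゃない)")

# Flip target for a negated class; only this mapping appears in the README.
FLIP = {"iya": ["yorokobi", "suki"]}


def detect(sentence: str) -> dict[str, list[str]]:
    """Return {emotion_class: [expressions]}, flipping negated matches."""
    found: dict[str, list[str]] = {}
    for expression, emotion in LEXICON.items():
        start = sentence.find(expression)
        if start == -1:
            continue
        # Text immediately following the matched expression.
        tail = sentence[start + len(expression):]
        if NEGATION.match(tail):
            # Negated: report under the flipped classes, marked with *CVS.
            for flipped in FLIP.get(emotion, [emotion]):
                found.setdefault(flipped, []).append(expression + "*CVS")
        else:
            found.setdefault(emotion, []).append(expression)
    return found


print(detect("彼のことは嫌いではない!"))
print(detect("彼のことは嫌いだ"))
```

The sketch only looks at the text directly after the match, which is why a dependency-tree pass is useful for negation that sits further away.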
Features
- Combined and expanded dictionaries: Nakamura's original Dictionary of Emotive Expressions merged with the Wang & Isomura (2024) two-dictionary expansion (Hiejima's and Murakami's emotion dictionaries, plus automatically extracted expressions). Total: ~4,700 entries across 10 emotion classes, augmented with modern internet language (emoji, kaomoji, gyaru-go, katakana borrowings) per class.
- Russell's 2D circumplex model of affect: every emotion class is placed on a (valence, arousal) plane; aggregate sentence orientation is reported as valence (POSITIVE / NEGATIVE / NEUTRAL, optionally mostly_*) and activation (ACTIVE / PASSIVE / NEUTRAL).
- Plutchik-wheel colour palette: all 10 emotion classes are colour-coded by the hue angles of Plutchik's published wheel for a familiar, paper-ready palette.
- Dual Aho-Corasick matching: two automata are built at startup, one over fully lemmatised dictionary entries (covers verb inflections) and one over particle-stripped content forms (covers particle-omission variants such as 腹がたつ ↔ 腹たつ). Both automata scan in a single O(n + k) pass per sentence; sustained throughput is ~50,000 sentences/sec on a single core, ~100,000 sentences/sec on four.
- CVS (Contextual Valence Shifters): 108 Japanese negation patterns reverse emotion polarity when applied (嫌いではない → positive, not negative). An optional GiNZA dependency-tree pass catches long-distance negation that the local regex misses.
- Three-state emotive distinction: analyze() returns emotive: bool even when no specific emotion word is detected, so callers can distinguish emotive-but-unclassifiable sentences (interjections / kaomoji only) from fully non-emotive ones.
- Streaming + multiprocessing APIs: analyze_stream() for memory-light corpus processing; auto-parallel batches at ≥ 50,000 sentences.
- JA/EN Streamlit web app with publication-quality charts (radar + Russell 2D + time-series + heatmap), PNG export at 2× scale, and a language toggle that flips all UI strings + emotion labels.
- On-disk lemma cache: sub-millisecond warm-start once the cache is primed; MD5-invalidated.
- Optional UniDic backend via fugashi for users who prefer modern morphological analysis.
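The two-index idea behind the dual matching can be sketched in a few lines. This is an illustration under simplifying assumptions, not the package's code: plain substring search stands in for the Aho-Corasick automata, lemmatisation is skipped entirely, and particles are stripped character by character (`PARTICLES`, `content_form`, `build_indexes`, and `match` are hypothetical names). The `≈` suffix follows the marker shown in the Usage examples for content-form matches.

```python
# Case particles dropped to build the content form (naive, character-level;
# a real system would rely on part-of-speech tags from the tokeniser).
PARTICLES = "がを"

# Two entries borrowed from this README's examples.
ENTRIES = {"腹が立つ": "ikari", "身の毛がよだつ": "kowa"}


def content_form(entry: str) -> str:
    """Strip particles from a dictionary entry."""
    return "".join(ch for ch in entry if ch not in PARTICLES)


def build_indexes(entries: dict[str, str]):
    """Build the full-form index and the particle-stripped index."""
    full = dict(entries)
    content = {}
    for entry, emotion in entries.items():
        stripped = content_form(entry)
        if stripped != entry:
            content[stripped] = (entry, emotion)
    return full, content


def match(sentence: str, full, content) -> dict[str, list[str]]:
    """Look the sentence up in both indexes; mark content-form hits with ≈."""
    hits: dict[str, list[str]] = {}
    for entry, emotion in full.items():
        if entry in sentence:
            hits.setdefault(emotion, []).append(entry)
    for stripped, (entry, emotion) in content.items():
        if stripped in sentence and entry not in sentence:
            hits.setdefault(emotion, []).append(entry + "≈")
    return hits


FULL, CONTENT = build_indexes(ENTRIES)
print(match("身の毛よだつ", FULL, CONTENT))
print(match("腹が立つ", FULL, CONTENT))
```

Looping over every entry costs time proportional to the dictionary size per sentence; an automaton avoids that by scanning the sentence once for all entries at the same time.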
Emotion classes
| Name | English gloss | Japanese | Valence | Arousal | Plutchik hue |
|---|---|---|---|---|---|
| yorokobi | joy | 喜び | POS | ACT | yellow |
| suki | affection | 好き | POS | — | yellow-green |
| yasu | relief | 安らぎ | POS | PAS | green-yellow |
| takaburi | excitement | 昂り | NorP | ACT | orange |
| odoroki | surprise | 驚き | NorP | ACT | teal-cyan |
| haji | shame | 恥 | NorP | — | rose-purple |
| aware | sadness | 哀しみ | NEG | PAS | royal blue |
| iya | disgust | 嫌悪 | NEG | — | dark orchid |
| kowa | fear | 恐れ | NEG | ACT | green |
| ikari | anger | 怒り | NEG | ACT | crimson |
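As a rough illustration of how per-class coordinates could roll up into a sentence-level orientation, the sketch below transcribes the table and applies a simple unanimity rule. The rule is an assumption chosen because it reproduces the two outputs shown in this README (kowa alone gives NEGATIVE / ACTIVE; yorokobi + suki gives POSITIVE / NEUTRAL); the package's actual aggregation, including the mostly_* values, may well differ. `NorP` is assumed here to mean "negative or positive", and `—` is treated as no arousal assignment.

```python
# (valence, arousal) per class, transcribed from the table above.
CLASSES = {
    "yorokobi": ("POS", "ACT"),
    "suki": ("POS", None),
    "yasu": ("POS", "PAS"),
    "takaburi": ("NorP", "ACT"),
    "odoroki": ("NorP", "ACT"),
    "haji": ("NorP", None),
    "aware": ("NEG", "PAS"),
    "iya": ("NEG", None),
    "kowa": ("NEG", "ACT"),
    "ikari": ("NEG", "ACT"),
}

VALENCE_LABELS = {"POS": "POSITIVE", "NEG": "NEGATIVE"}
AROUSAL_LABELS = {"ACT": "ACTIVE", "PAS": "PASSIVE"}


def _unanimous(values: list, labels: dict) -> str:
    """Return the shared label if all values agree, else NEUTRAL (assumed rule)."""
    distinct = set(values)
    if len(distinct) == 1:
        return labels.get(distinct.pop(), "NEUTRAL")
    return "NEUTRAL"


def orientation(detected: list[str]) -> tuple[str, str]:
    """Aggregate detected class names into (valence, activation)."""
    valence = _unanimous([CLASSES[c][0] for c in detected], VALENCE_LABELS)
    activation = _unanimous([CLASSES[c][1] for c in detected], AROUSAL_LABELS)
    return valence, activation


print(orientation(["kowa"]))
print(orientation(["yorokobi", "suki"]))
```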
Representative emotion
For sentences where ML-Ask detects multiple emotion classes, the representative emotion is the single class chosen as the dominant one for the sentence. The heuristic — inherited from the original Perl ML-Ask — is:
The class whose longest matched expression has the most characters.
For example, in 「腹がたって仕方ない、もう嫌だ」 both ikari (腹が立つ) and
iya (嫌だ) match; ikari wins because 腹が立つ is longer than 嫌だ.
The intuition is that longer dictionary entries are more specific — and
therefore more diagnostic of the speaker's emotion — than shorter, more
generic ones.
Returned as result["representative"] = (class_name, [matching_words]).
Installation
ML-Ask Official runs on Linux, macOS and Windows (WSL recommended) with Python 3.10 – 3.13. It depends on the MeCab morphological analyser, which is a system package (not Python), so the install is split into two parts:
- install MeCab + a Japanese dictionary at the OS level,
- install the mlask-official Python package inside a virtualenv.
Step 1 — Install MeCab + a Japanese dictionary
macOS (Homebrew)
brew install mecab mecab-ipadic
Verify:
echo "今日は嬉しい" | mecab
You should see one token per line and an EOS marker.
Ubuntu / Debian
sudo apt-get update
sudo apt-get install -y mecab libmecab-dev mecab-ipadic-utf8
Fedora / RHEL
sudo dnf install mecab mecab-devel mecab-ipadic
Arch Linux
sudo pacman -S mecab mecab-ipadic
Windows
Native Windows MeCab is fragile — the recommended path is Windows Subsystem for Linux (WSL2): install Ubuntu under WSL and follow the Ubuntu instructions above. If you must run on bare Windows, see the mecab-python3 README for the MSVC build steps.
Step 2 — Create a Python virtual environment
Strongly recommended (keeps the package's dependencies out of your system Python):
python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate.bat # Windows cmd
# .venv\Scripts\Activate.ps1 # Windows PowerShell
Make sure python --version reports 3.10 or newer.
Step 3 — Install mlask-official
From PyPI
pip install mlask-official # core: analyzer + CLI
pip install 'mlask-official[app]' # + Streamlit web app
pip install 'mlask-official[fugashi]' # + UniDic backend via fugashi
pip install 'mlask-official[deps]' # + GiNZA dependency-tree CVS
pip install 'mlask-official[all]' # everything above
From a source checkout
git clone https://github.com/ptaszynski/mlask-official.git
cd mlask-official
pip install -e . # editable core install
pip install -e '.[all]' # editable + every extra
The base install pulls in mecab-python3, pyahocorasick, and typer
automatically.
Step 4 — Verify the installation
Python API
python -c "from mlask_official import MLAskOfficial; \
print(MLAskOfficial().analyze('今日は嬉しい!')['valence'])"
You should see:
POSITIVE
CLI
mlask --help
echo "彼のことは嫌いではない!" | mlask analyze --format pipe
The pipe-format output should look like:
彼のことは嫌いではない!|emotions:(2)|YOR:嫌い*CVS 嫌いな*CVS|SUK:嫌い*CVS 嫌いな*CVS||2D|POSITIVE|NEUTRAL
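If you need to post-process pipe-format output, a small parser along these lines may help. The field layout is inferred solely from the single sample line above (sentence first, class fields of the form TAG:words, valence and activation last), so treat it as an assumption rather than a format specification; `parse_pipe_line` is a hypothetical helper, and it will misparse sentences that themselves contain a `|` character.

```python
def parse_pipe_line(line: str) -> dict:
    """Split one pipe-format line into sentence, emotions, valence, activation."""
    fields = line.rstrip("\n").split("|")
    emotions: dict[str, list[str]] = {}
    # Class fields sit between the sentence and the trailing orientation pair.
    for field in fields[1:-2]:
        if ":" not in field or field.startswith("emotions:"):
            continue  # skips the count field, empty fields, and the 2D marker
        tag, _, words = field.partition(":")
        emotions[tag] = words.split(" ")
    return {
        "sentence": fields[0],
        "emotions": emotions,
        "valence": fields[-2],
        "activation": fields[-1],
    }


SAMPLE = "彼のことは嫌いではない!|emotions:(2)|YOR:嫌い*CVS 嫌いな*CVS|SUK:嫌い*CVS 嫌いな*CVS||2D|POSITIVE|NEUTRAL"
print(parse_pipe_line(SAMPLE))
```

For anything beyond quick inspection, the JSON output format is likely the safer choice because it does not depend on delimiter conventions.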
Streamlit app
The Streamlit application is part of the source repository. The easiest way to try it is the hosted demo: https://mlask-official.streamlit.app/
To run it locally, clone the repo and launch from there:
git clone https://github.com/ptaszynski/mlask-official.git
cd mlask-official
pip install -e '.[app]'
bash run_app.sh # → http://localhost:8501
bash run_app.sh --server.port 8505 # custom port
Open the URL in your browser and try the Quick examples under the input box.
Step 5 — Troubleshooting
RuntimeError: Failed initializing MeCab (no such file: /usr/local/etc/mecabrc)
mecab-python3 can't find mecabrc. Find it and pass it explicitly:
mecab-config --sysconfdir # → e.g. /opt/homebrew/etc
python -c "from mlask_official import MLAskOfficial; \
print(MLAskOfficial(mecab_arg='-r /opt/homebrew/etc/mecabrc').analyze('嬉しい'))"
Or pass --mecab-arg "-r /opt/homebrew/etc/mecabrc" to the CLI. The
Streamlit app has a MeCab arguments field in the sidebar for the same
purpose.
No module named 'MeCab'
mecab-python3 failed to compile against your system MeCab. Re-install
with verbose output:
pip install --force-reinstall --verbose mecab-python3
On macOS the most common cause is missing Xcode command-line tools
(xcode-select --install).
No module named 'fugashi' / 'spacy'
Optional extras aren't installed. Either disable the feature
(MLAskOfficial(backend="mecab", use_dependency_cvs=False)) or install
the relevant extra group from step 3.
built an empty emotion index / Not an Aho-Corasick automaton yet
MeCab returned no tokens for the shipped dictionary entries — usually a bad or mismatched dictionary path. The error message lists the three most common causes and the fix for each. See also the Notes on tokenisation section below.
Stale lemma cache after a manual dictionary edit
The cache is invalidated by file content (MD5), so saving the file will already invalidate it. To force a rebuild explicitly:
rm -rf ~/.cache/mlask_official
# or per-call:
python -c "from mlask_official import MLAskOfficial; MLAskOfficial(use_cache=False)"
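Content-based invalidation means the cache is keyed on what the dictionary file contains, not on its timestamp. A minimal sketch of that idea, assuming a stored digest is compared against a fresh one (`content_digest` and `cache_is_fresh` are hypothetical helpers, not the package's API):

```python
import hashlib
from pathlib import Path


def content_digest(path: str) -> str:
    """Return the MD5 hex digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def cache_is_fresh(dictionary_path: str, stored_digest: str) -> bool:
    """A cache built from the file is reusable only if the content is unchanged."""
    return content_digest(dictionary_path) == stored_digest
```

Because any saved change alters the digest, a manual `rm -rf` of the cache directory should only be needed when the cache itself is suspected to be corrupt.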
Usage
from mlask_official import MLAskOfficial
a = MLAskOfficial()
# Inflected form — handled by full-lemma automaton
r = a.analyze("身の毛がよだった!")
print(r["emotion"]) # {'kowa': ['身の毛がよだつ']}
print(r["valence"]) # 'NEGATIVE'
print(r["activation"]) # 'ACTIVE'
print(r["emotive"]) # True
# Particle dropped — handled by content-lemma automaton
# (use the kanji form when possible — IPADIC's lemma for the kana
# writing `たつ` is the unrelated verb `経つ` "to elapse",
# so kana variants of ambiguous verbs may miss; see §Notes.)
r = a.analyze("腹立つ!")
print(r["emotion"]) # {'ikari': ['腹立ち', '腹立つ', '腹が立つ≈']}
r = a.analyze("身の毛よだつ") # particle が dropped
print(r["emotion"]) # {'kowa': ['身の毛がよだつ≈']}
# Negation via CVS
r = a.analyze("彼のことは嫌いではない!")
print(r["valence"]) # 'POSITIVE' ← 嫌い → CVS flip → yorokobi/suki
# Emotive but no classifiable emotion
r = a.analyze("あーもう!!")
print(r["emotion"]) # None
print(r["emotive"]) # True ← emotemes detected
print(r["intensifier"]) # {'emotemes': ['!','!'], 'interjections': ['あー','もう']}
# Non-emotive
r = a.analyze("今日は晴れです。")
print(r["emotion"]) # None
print(r["emotive"]) # False
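The last two cases show why the emotive flag matters: emotion alone cannot tell them apart. A small helper along these lines makes the three states explicit. It is a sketch that operates on plain dicts shaped like the results above; `emotive_state` and its return labels are hypothetical, not part of the package.

```python
def emotive_state(result: dict) -> str:
    """Map an analysis result onto one of three states."""
    if result.get("emotion"):
        return "classified"            # at least one emotion class detected
    if result.get("emotive"):
        return "emotive-unclassified"  # emotemes / interjections only
    return "non-emotive"


# Result fragments copied from the usage examples above.
print(emotive_state({"emotion": {"kowa": ["身の毛がよだつ"]}, "emotive": True}))
print(emotive_state({"emotion": None, "emotive": True}))
print(emotive_state({"emotion": None, "emotive": False}))
```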
Streaming + parallel APIs
# Generator — constant-memory for large corpora
for result in a.analyze_stream(open("big_corpus.txt", encoding="utf-8")):
process(result)
# Multiprocessing — auto-on for batches ≥ 50,000 sentences
results = a.analyze_batch(texts) # auto: parallel iff len(texts) ≥ 50_000
results = a.analyze_batch(texts, parallel=True, # force on
workers=8)
Notes on tokenisation
ML-Ask delegates tokenisation and lemmatisation to MeCab. Two practical consequences worth knowing:
- Use IPADIC, not UniDic. The shipped dictionaries (Nakamura + Wang & Isomura) were compiled against the IPADIC POS scheme. UniDic tokenises some compounds differently and won't kanji-normalise kana writings, which reduces match coverage. If you previously installed unidic-lite as a side effect of another package, point MeCab back at IPADIC explicitly:

      brew install mecab-ipadic
      # then either edit /opt/homebrew/etc/mecabrc to set
      #   dicdir = /opt/homebrew/lib/mecab/dic/ipadic
      # or pass -d per call:
      MLAskOfficial(mecab_arg="-d /opt/homebrew/lib/mecab/dic/ipadic")

- Kana writings of ambiguous verbs may miss. IPADIC's lemma table picks the most frequent reading for a kana writing. たつ in isolation lemmatises to 経つ ("to elapse"), not 立つ ("to stand"), so a kana-only input like 腹たつ won't reach the 腹が立つ dictionary entry even with particle omission. The same input written 腹立つ or 腹が立つ matches cleanly. Robust yomi/N-best parsing for these cases is tracked as IMPROVEMENTS.md §1.1 + §1.2.
Command-line interface
# Single sentence (stdin or --text)
echo "腹が立つ" | mlask analyze --format pipe
echo "彼のことは嫌いではない!" | mlask analyze --format json
# Batch a file
mlask batch -i corpus.txt -o results.csv --format csv
mlask batch -i corpus.txt -o results.json --format json --parallel
mlask batch -i corpus.txt --format pipe > results.txt
# Throughput benchmark
mlask benchmark --sentences 10000
mlask benchmark --sentences 100000 --parallel -j 8
# Mine candidate emotive expressions from a corpus (manual-review TSV)
mlask extract corpus.txt --output candidates.tsv --min-freq 5
All commands accept --backend mecab|fugashi and
--mecab-arg "-r /path/to/mecabrc".
Performance
On Apple Silicon (Python 3.14, mecab-python3 + IPADIC):
| Workload | Latency / throughput |
|---|---|
| Cold start (no cache) | ~37 ms |
| Warm start (cache hit) | ~17 ms |
| Single sentence (steady-state) | 20 µs median, 46 µs p99 |
| Sequential batch (10,000 sentences) | ~50,000 sentences/sec |
| Multiprocessing batch (10,000 × 4 workers) | ~50,000 sentences/sec |
| Auto-parallel analyze_batch(50,000) | ~100,000 sentences/sec |
See CHANGELOG.md for full benchmark methodology.
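The latency and throughput rows are two views of the same measurement: a 20 µs median per sentence corresponds to roughly 50,000 sentences per second on one core. The short calculation below shows the conversion and a back-of-the-envelope runtime estimate; the helper names are hypothetical, and real runs will vary with sentence length and hardware.

```python
def throughput_per_sec(latency_us: float) -> float:
    """Convert per-sentence latency in microseconds to sentences per second."""
    return 1_000_000 / latency_us


def batch_seconds(n_sentences: int, sentences_per_sec: float) -> float:
    """Estimate wall-clock seconds for a batch at a given throughput."""
    return n_sentences / sentences_per_sec


print(throughput_per_sec(20))             # median latency from the table
print(batch_seconds(1_000_000, 50_000))   # one million sentences, sequential
print(batch_seconds(1_000_000, 100_000))  # same corpus at the auto-parallel rate
```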
Citation
When using ML-Ask in research, please cite both of the following:
Ptaszynski, M., Dybala, P., Rzepka, R., Araki, K., & Masui, F. (2017). ML-Ask: Open source affect analysis software for textual input in Japanese. Journal of Open Research Software, 5(1), 16-16.
@article{ptaszynski2017ml,
title={ML-Ask: Open source affect analysis software for textual input in Japanese},
author={Ptaszynski, Michal and Dybala, Pawel and Rzepka, Rafal and Araki, Kenji and Masui, Fumito},
journal={Journal of Open Research Software},
volume={5},
number={1},
pages={16--16},
year={2017}
}
Wang, L., Isomura, S., Ptaszynski, M., Dybala, P., Urabe, Y., Rzepka, R., & Masui, F. (2024). The limits of words: expanding a word-based emotion analysis system with multiple emotion dictionaries and the automatic extraction of emotive expressions. Applied Sciences, 14(11), 4439.
@article{wang2024limits,
title={The limits of words: expanding a word-based emotion analysis system with multiple emotion dictionaries and the automatic extraction of emotive expressions},
author={Wang, Lu and Isomura, Sho and Ptaszynski, Michal and Dybala, Pawel and Urabe, Yuki and Rzepka, Rafal and Masui, Fumito},
journal={Applied Sciences},
volume={14},
number={11},
pages={4439},
year={2024},
publisher={MDPI}
}
A machine-readable Citation File Format manifest is at
CITATION.cff.
Contributing
Issues, pull requests, and dictionary submissions are welcome at https://github.com/ptaszynski/mlask-official. See IMPROVEMENTS.md for the active roadmap; ❮ HIGH PRIORITY ❯ items are the best first contributions.
When opening a PR that touches the emotion dictionaries (mlask_official/emotions/*.txt), please include:
- The source / rationale for each entry (paper, corpus reference, or example sentence).
- Evidence that the entry doesn't collide with an existing class (mlask analyze --text "<entry>" before and after).
- A note in CHANGELOG.md under an [Unreleased] section.
License
BSD 3-Clause — the same licence as the original ML-Ask system.