CAO Official — fast, clean emotiCon Analysis and decOding of affective information (Japanese kaomoji affect analysis)
CAO Official 0.5.0
A fast, clean reimplementation of CAO — emotiCon Analysis and decOding of
affective information — the system that detects Japanese emoticons (kaomoji,
e.g. (^_^)) in text and classifies the emotion they express, grounded in
Birdwhistell's theory of kinesics (an emoticon is body language split into
semantic kinemes: eyes, mouth, decorations).
Based on: Ptaszynski et al., "CAO: A Fully Automatic Emoticon Analysis System Based on Theory of Kinesics", IEEE Transactions on Affective Computing, 2010. This is a from-scratch Python rewrite of the original C# system — faster, fixed, and runnable. See CHANGELOG.md for the lineage, docs/API.md for the API reference, and ANALYSIS.md for how it differs from the legacy code and the paper.
PyPI name: the bare cao is taken, so the distribution is cao-official and the import package is cao_official (CLI: cao-official).
New in 0.5: partial faces (bracketless ^o^, one-bracket (^o^, mouthless
(^^) / (--)); a probabilistic Naive-Bayes scorer that is the new default
(fixes the old relief bias, gives calibrated confidence); a statistical
borderline detector; mmap model; batch/async API; and pip-installable packaging.
Citation
If you use CAO, please cite:
M. Ptaszynski, J. Maciejewski, P. Dybala, R. Rzepka, K. Araki. "CAO: A Fully Automatic Emoticon Analysis System Based on Theory of Kinesics." IEEE Transactions on Affective Computing, Vol. 1, No. 1, 2010.
@article{ptaszynski2010cao,
title={CAO: A fully automatic emoticon analysis system based on theory of kinesics},
author={Ptaszynski, Michal and Maciejewski, Jacek and Dybala, Pawel and Rzepka, Rafal and Araki, Kenji},
journal={IEEE Transactions on Affective Computing},
volume={1},
number={1},
pages={46--59},
year={2010},
publisher={IEEE}
}
What it does
Given an emoticon or a sentence, CAO runs three procedures:
- Detection: find emoticon spans in free text (face-anchored: a candidate must contain a recognized face core; brackets are optional). 0.5 also finds partial faces (bracketless, single-bracket, and mouthless eye–eye forms) via a gated fallback anchor, and rejects prose/number noise.
- Extraction: segment each emoticon into its seven structural areas
  [additional][bracket][internal][ FACE ][internal][bracket][additional]
  and decompose the face into eye / mouth / eye (occurrence-weighted; empty mouth allowed).
- Affect analysis: score the parts against ten per-emotion databases and decide a single emotion, with a calibrated confidence and a 2-D coordinate.
The ten emotions (Nakamura): anger, dislike, excitement, fear, fondness, joy, relief, shame, sorrow, surprise — also projected onto Russell's valence × activation plane.
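As a rough illustration of the extraction step, the sketch below splits an emoticon into the seven areas around a known face core. It is not the package's implementation: the `split_areas` helper, the tiny `FACE_CORES` tuple, the bracket sets, and the rule that unbracketed leftovers count as "additional" are all assumptions made for the example. Only the seven-area order and the `N/A` placeholder come from the README.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical stand-ins: the real system looks face cores up in its model.
FACE_CORES = ("^_^", "^o^", "-_-")
OPEN_BRACKETS = "([{<"
CLOSE_BRACKETS = ")]}>"


def split_areas(emoticon: str, face_cores=FACE_CORES) -> Optional[list[str]]:
    """Return [additional, bracket, internal, FACE, internal, bracket, additional]."""
    hits = [(emoticon.find(core), core) for core in face_cores if core in emoticon]
    if not hits:
        return None  # no recognized face core, so not treated as an emoticon
    start, core = max(hits, key=lambda hit: len(hit[1]))  # prefer the longest core
    left, right = emoticon[:start], emoticon[start + len(core):]

    # Left side: the opening bracket nearest the face separates additional from internal.
    open_at = max(left.rfind(b) for b in OPEN_BRACKETS)
    if open_at >= 0:
        add_l, br_l, int_l = left[:open_at], left[open_at], left[open_at + 1:]
    else:
        add_l, br_l, int_l = left, "", ""  # assumption: no bracket means "additional"

    # Right side: the closing bracket nearest the face, mirrored.
    close_hits = [i for i in (right.find(b) for b in CLOSE_BRACKETS) if i >= 0]
    if close_hits:
        close_at = min(close_hits)
        int_r, br_r, add_r = right[:close_at], right[close_at], right[close_at + 1:]
    else:
        int_r, br_r, add_r = "", "", right

    row = [add_l, br_l, int_l, core, int_r, br_r, add_r]
    return [part or "N/A" for part in row]


print(split_areas("(^_^)"))  # ['N/A', '(', 'N/A', '^_^', 'N/A', ')', 'N/A']
```

The real extractor is described as position-aware and occurrence-weighted, so it presumably resolves ambiguous splits using database statistics rather than the fixed nearest-bracket rule used here.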
Install
Requires Python 3.10+ and numpy. pyahocorasick is recommended
(C-accelerated matching; the code falls back to a pure-Python automaton — only
~15% slower — without it).
From PyPI:
pip install cao-official # core
pip install "cao-official[fast]" # + pyahocorasick (recommended)
The wheel ships a prebuilt model (cao.model + cao.model.npy), so the
first Cao() loads in ~5 ms with no database needed — the package is
self-contained.
From source (for development / rebuilding the model from the database):
git clone https://github.com/ptaszynski/cao-official
cd cao-official
python3 -m venv .venv && ./.venv/bin/pip install -e ".[fast,dev]"
./.venv/bin/python -m cao_official --build # rebuild model from ../cao_0.2/data
Quickstart
Python API
from cao_official import Cao
cao = Cao() # loads the model (builds it once if missing)
r = cao.analyze("(`Д´)")
print(r.label) # 'anger'
print(r.confidence) # 0.71 (calibrated posterior)
print(r.valence, r.activation) # -0.51 0.70 -> negative-activated
print(r.areas.as_row()) # ['N/A', '(', 'N/A', '`Д́', 'N/A', ')', 'N/A']
print(r.ranking()[:3]) # [('anger', 0.71), ('excitement', 0.10), ...]
print(r.attribution[0]) # ('face', '`Д́', 'anger', 88.0) -- why
# detect + analyze every emoticon in a sentence (partial faces included)
for r in cao.analyze_text("今日は嬉しい^o^けど(--)気分"):
    print(r.span, r.emoticon, "->", r.label)
# stream over a big corpus (optionally across processes)
for i, results in cao.analyze_batch(open("corpus.txt"), workers=4):
    ...
# pick a different scoring method (the five paper methods remain available)
cao.analyze("(^o^)", method="frequency")
Web app
An interactive Streamlit demo (app.py), bilingual (English / 日本語), with
single-emoticon, free-text, and whole-document (file upload) modes; it shows
detection, the emotion ranking, the Russell valence×activation plot, the 7-area
breakdown, and the kineme attribution. The header logo recolours to the light/dark
theme automatically.
pip install "cao-official[app]" # streamlit + altair + pandas + pillow
streamlit run app.py
Deploy on Streamlit Community Cloud: point it at this repo and app.py. The
included requirements.txt installs the dependencies, and the package and its
prebuilt model are in the repo, so it runs as-is.
Command line
cao-official "(^_^)" "(`Д´)" # after `pip install`
python -m cao_official "(^_^)" "(`Д´)" # or via the module
echo "嬉しい(^o^)です" | python -m cao_official --text # detect in free text
python -m cao_official --json "orz" # JSON output
python -m cao_official --explain "(`Д´)" # show kineme attribution
python -m cao_official --method frequency "(^o^)" # pick a scoring method
python -m cao_official --build # rebuild the model
Example:
emoticon: (`Д´)
split : N/A | ( | N/A | `Д́ | N/A | ) | N/A
label : anger (confidence 0.71, via bayes)
2-D : valence=-0.51 activation=+0.70 [negative-activated]
top-3 : anger=0.7058, excitement=0.1009, dislike=0.07281
why : face='`Д́'->anger(88), left_eye+right_eye='`\t́'->anger(65), mouth='Д'->anger(42)
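The bracketed tag on the 2-D line reads like a quadrant of Russell's valence × activation plane. A minimal sketch of how such a tag could be derived from the two coordinates follows; the `quadrant` function, the "deactivated" wording, and the treatment of exactly zero as positive/activated are assumptions, since the README only shows the `negative-activated` case.

```python
def quadrant(valence: float, activation: float) -> str:
    """Name the Russell quadrant from the signs of the two coordinates."""
    # Assumption: zero counts as positive / activated.
    v = "positive" if valence >= 0 else "negative"
    a = "activated" if activation >= 0 else "deactivated"
    return f"{v}-{a}"


print(quadrant(-0.51, 0.70))  # negative-activated
```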
How it works
text ──► normalize (NFKC) ──► detect (Aho-Corasick, face-anchored + mouthless fallback)
──► extract (position-aware 7 areas + eye/mouth) ──► score (5 methods + Naive-Bayes)
──► decide (label + calibrated confidence + Russell 2-D)
- One model, built once. The released per-emotion databases are parsed into contiguous float32 stat tensors (one (N, 10, 5) buffer + a dict[str → row] per table) and a face-core automaton, serialized to cao_official/cao.model (+ a sidecar cao.model.npy tensor buffer, mmappable). Runtime never re-parses text files; loading is ~5 ms.
- One O(text) pass, shared. A single Aho-Corasick automaton over the normalized line replaces the legacy giant regex alternation, the per-character detection loop, and every linear database scan, and its hits drive both detection and extraction (no second scan).
- Normalized matching with offset map. NFKC unifies full/half-width and combining/precomposed forms (（＾＿＾） ≡ (^_^)); a canonical→original index map lets detection work in normalized space yet report exact original spans.
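The normalized-matching idea can be sketched with the standard library alone. The helper names `normalize_with_map` and `original_span` are hypothetical, and normalizing one character at a time is a simplification: whole-string NFKC can also compose adjacent characters (for example a base letter followed by a combining accent), which this sketch does not handle.

```python
from __future__ import annotations

import unicodedata


def normalize_with_map(text: str) -> tuple[str, list[int]]:
    """NFKC-normalize per character and record each output char's source index."""
    out_chars: list[str] = []
    index_map: list[int] = []  # index_map[k] = original index of normalized char k
    for i, ch in enumerate(text):
        norm = unicodedata.normalize("NFKC", ch)
        out_chars.append(norm)
        index_map.extend([i] * len(norm))  # one source char may expand to several
    return "".join(out_chars), index_map


def original_span(index_map: list[int], start: int, end: int) -> tuple[int, int]:
    """Map a half-open [start, end) span in normalized text back to the original."""
    return index_map[start], index_map[end - 1] + 1


text = "嬉しい（＾＿＾）です"          # full-width face
norm, index_map = normalize_with_map(text)
start = norm.find("(^_^)")              # match in normalized space
lo, hi = original_span(index_map, start, start + 5)
print(text[lo:hi])                      # the original full-width characters
```

Matching in normalized space keeps the pattern set small (one canonical form per face), while the index map preserves the exact spans callers expect.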
Scoring methods
Each part is matched exactly against its database; the face follows the cascade
raw whole-emoticon → triplet core → eyes + mouth, and the surrounding areas
are added at the paper's 0.25 weight (now separately configurable for internal
vs additional). Six methods are available:
| method | meaning | direction |
|---|---|---|
| occurrence | raw count in the emotion DB | higher = stronger |
| frequency | occurrence ÷ total occurrences in that DB | higher = stronger |
| uniqueFrequency | occurrence ÷ #unique elements in that DB | higher = stronger |
| position | rank by occurrence (ties share) | lower = stronger |
| uniquePosition | dense rank by occurrence | lower = stronger |
| bayes (default) | smoothed Naive-Bayes posterior over the parts | higher = probability |
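To make the five paper statistics concrete, here is a small sketch that derives them from a single emotion database given as occurrence counts. The `paper_stats` function and the toy counts are hypothetical. Reading "ties share" as competition ranking (tied elements share the best rank, the next rank is skipped) is an assumption; the dense rank follows the usual definition.

```python
from __future__ import annotations

from collections import Counter


def paper_stats(db: Counter) -> dict[str, dict[str, float]]:
    """Compute the five count-based statistics for every element of one DB."""
    total = sum(db.values())          # total occurrences in this DB
    unique = len(db)                  # number of unique elements in this DB
    distinct_desc = sorted(set(db.values()), reverse=True)
    dense = {count: rank for rank, count in enumerate(distinct_desc, start=1)}
    stats = {}
    for element, occ in db.items():
        # Quadratic for brevity; a real build step would precompute ranks once.
        strictly_greater = sum(1 for other in db.values() if other > occ)
        stats[element] = {
            "occurrence": occ,
            "frequency": occ / total,
            "uniqueFrequency": occ / unique,
            "position": strictly_greater + 1,   # assumed competition rank
            "uniquePosition": dense[occ],       # dense rank
        }
    return stats


toy_db = Counter({"^_^": 6, "^o^": 3, "-_-": 3, "T_T": 1})  # made-up counts
for element, row in paper_stats(toy_db).items():
    print(element, row)
```

Note how uniqueFrequency divides by the number of unique elements: a small database inflates every score, which is consistent with the small-DB bias mentioned elsewhere in this README.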
bayes is the 0.5 default. It composes the kineme evidence as a log-product
log P(e) + Σ wᵢ·log P(partᵢ|e) with Lidstone smoothing — the canonical
generative model, of which the paper's frequency is a single-part special case.
It yields a real posterior probability (so confidence is calibrated by
temperature scaling) and removes the uniqueFrequency small-DB bias toward
relief. It won the cross-validation bake-off (eval bakeoff: best macro-F1 0.322 and
best top-3 69.1%; in-context probe top-1 28.1% vs 22.1% for the old default,
top-3 57.8%). All methods are selectable via method= and exposed in
result.scores.
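The scoring rule above, log P(e) + Σ wᵢ·log P(partᵢ|e) with Lidstone smoothing, can be sketched in a few lines. This is an illustration under assumptions, not the package's scorer: the function names, the `alpha` default, the toy counts, and applying the temperature to the log-joint scores before the softmax are all choices made for the example.

```python
import math


def lidstone(count: int, total: int, vocab_size: int, alpha: float = 0.5) -> float:
    """Lidstone-smoothed estimate of P(part | emotion)."""
    return (count + alpha) / (total + alpha * vocab_size)


def bayes_posterior(parts, counts, priors, vocab_size, alpha=0.5, temperature=1.0):
    """parts: [(part, weight)]; counts: emotion -> {part: count}; returns posteriors."""
    log_scores = {}
    for emotion, table in counts.items():
        total = sum(table.values())
        score = math.log(priors[emotion])                 # log P(e)
        for part, weight in parts:
            p = lidstone(table.get(part, 0), total, vocab_size, alpha)
            score += weight * math.log(p)                 # w_i * log P(part_i | e)
        log_scores[emotion] = score / temperature         # assumed temperature scaling
    peak = max(log_scores.values())                       # subtract max for stability
    exps = {e: math.exp(s - peak) for e, s in log_scores.items()}
    z = sum(exps.values())
    return {e: v / z for e, v in exps.items()}


# Made-up counts for two emotions over a four-part vocabulary.
toy_counts = {"anger": {"D": 4, "`": 8}, "joy": {"^": 9, "o": 3}}
toy_priors = {"anger": 0.5, "joy": 0.5}
print(bayes_posterior([("D", 1.0)], toy_counts, toy_priors, vocab_size=4))
```

Because every emotion receives a smoothed, normalized probability, an unseen part lowers the score without zeroing it, and the outputs sum to one across emotions.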
Performance
| operation | time |
|---|---|
| build model | ~0.2 s (once) |
| load model | ~6 ms (eager) |
| model artifact | ~5 MB (0.8 MB pickle + 4.2 MB mmappable .npy) |
| full 1000-sentence corpus | ~15 ms (~65k sentences/s end-to-end) |
| pure-Python Aho-Corasick | ~15% slower than native pyahocorasick |
Run the harness: python -m cao_official.eval (add bakeoff, calibrate, or
bench). Throughput is slightly lower than in version 0.4 because the default
bayes scorer computes a posterior per emoticon, but it is still well over 60k
sentences/s.
Scales with the database, not against it. Detection, matching and scoring
are all O(text), independent of how many emoticons are in the model: a single
Aho-Corasick pass anchors detection and feeds extraction, lookups are
dict → row into one contiguous float32 tensor, and every statistic is
precomputed at build time. Growing the emoticon database 10× or 100× leaves
per-analysis cost unchanged — only the one-off build and the (compact, linear)
artifact grow.
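For readers unfamiliar with the technique, here is a compact pure-Python Aho-Corasick sketch showing how one left-to-right pass reports every pattern hit. It is a textbook construction written for this README, not the code in automaton.py; the class name, method names, and the three toy patterns are assumptions.

```python
from collections import deque


class AhoCorasick:
    """Minimal multi-pattern matcher: trie + failure links + merged outputs."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[node][char] -> child node
        self.fail = [0]    # failure link per node
        self.out = [[]]    # patterns ending at each node (own + inherited)
        for pattern in patterns:
            self._add(pattern)
        self._build()

    def _add(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(pattern)

    def _build(self):
        # Breadth-first, so a node's failure target is finished before the node.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def finditer(self, text):
        """Yield (start_index, pattern) for every occurrence, in one pass."""
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                yield i - len(pattern) + 1, pattern


ac = AhoCorasick(["^_^", "^o^", "_^"])
print(list(ac.finditer("a(^_^)b ^o^")))  # [(2, '^_^'), (3, '_^'), (8, '^o^')]
```

The scan visits each text character once plus the reported matches, so adding more patterns grows the automaton built up front rather than the per-character work.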
Project layout
cao-official_0.5/
  pyproject.toml       pip-installable (dist `cao-official`, CLI `cao-official`)
  cao_official/        the package
    normalize.py       NFKC canonicalization
    model.py           Emotion set, Russell 2-D, EmoResult, methods
    automaton.py       Aho-Corasick (pyahocorasick + pure-python fallback)
    langmodel.py       one-class char-LM detector for borderline spans
    database.py        build/serialize the model (cao.model + cao.model.npy)
    detect.py          face-anchored detection + mouthless fallback
    extract.py         position-aware 7-area segmentation + eye/mouth split
    score.py           5 paper methods + Naive-Bayes, cascade, decision, 2-D
    cao.py             the Cao facade (analyze / analyze_text / analyze_batch)
    cli.py / __main__  command line
    eval.py            throughput + bake-off + calibrate + bench
  reference_port/      faithful 1:1 port of the C# (regression oracle)
  tests/               behaviour tests (python tests/test_cao.py)
  docs/API.md          API reference + theory-of-kinesics primer
  README.md  CHANGELOG.md  ROADMAP.md  ANALYSIS.md  PLAN.md
Limitations & honest caveats
- Scope: kaomoji only. Unicode emoji and Western emoticons are out of scope for now (the design extends to them; see ROADMAP. Emoji support was deferred in this pass by decision).
- Accuracy is data-bound. Labels reflect the released database, which is more deduplicated than the paper's full DB; e.g. ^o^ resolves where it is densest in this DB. The joy class barely surfaces because it overlaps almost entirely with fondness in this release.
- Evaluation is in-database. The bake-off uses the labeled DB itself as gold (resubstitution-style, raw stage disabled); the in-context probe only labels 5 corpus blocks. A truly held-out, hand-labeled gold set is the top ROADMAP item.
- The relief small-DB bias is fixed by the default bayes scorer (relief predictions 1018 → 249 on the bake-off; gold 91). uniqueFrequency and the other paper methods remain available but carry the original bias.
- Bare 2-char bracketless faces in prose (e.g. ^^ in わー^^ね) are below the detection length floor and skipped to avoid prose false-positives; bracketed ((^^)) and 3-char+ forms are found.
For the faithful, bug-for-bug behaviour of the original system, use
reference_port/cao_reference.py (validated against cao_0.2/eval/1000eval4.txt).
License
CAO is released under The BSD 3-Clause License — the whole system, including
the cao-official Python package and the bundled emoticon databases.
Copyright (c) 2007-2026, Michal Ptaszynski, Jacek Maciejewski, Pawel Dybala, Rafal Rzepka, Kenji Araki.
See LICENSE for the full text.