CAO Official — fast, clean emotiCon Analysis and decOding of affective information (Japanese kaomoji affect analysis)

CAO Official 0.5.0

A fast, clean reimplementation of CAO (emotiCon Analysis and decOding of affective information), the system that detects Japanese emoticons (kaomoji, e.g. (^_^)) in text and classifies the emotion they express. It is grounded in Birdwhistell's theory of kinesics: an emoticon is treated as body language that can be split into semantic kinemes (eyes, mouth, decorations).

Based on: Ptaszynski et al., "CAO: A Fully Automatic Emoticon Analysis System Based on Theory of Kinesics", IEEE Transactions on Affective Computing, 2010. This is a from-scratch Python rewrite of the original C# system — faster, fixed, and runnable. See CHANGELOG.md for the lineage, docs/API.md for the API reference, and ANALYSIS.md for how it differs from the legacy code and the paper.

PyPI name: the bare cao is taken, so the distribution is cao-official and the import package is cao_official (CLI: cao-official).

New in 0.5: partial faces (bracketless ^o^, one-bracket (^o^, mouthless (^^) / (--)); a probabilistic Naive-Bayes scorer that is the new default (fixes the old relief bias, gives calibrated confidence); a statistical borderline detector; mmap model; batch/async API; and pip-installable packaging.


Citation

If you use CAO, please cite:

M. Ptaszynski, J. Maciejewski, P. Dybala, R. Rzepka, K. Araki. "CAO: A Fully Automatic Emoticon Analysis System Based on Theory of Kinesics." IEEE Transactions on Affective Computing, Vol. 1, No. 1, 2010.

@article{ptaszynski2010cao,
  title={CAO: A fully automatic emoticon analysis system based on theory of kinesics},
  author={Ptaszynski, Michal and Maciejewski, Jacek and Dybala, Pawel and Rzepka, Rafal and Araki, Kenji},
  journal={IEEE Transactions on Affective Computing},
  volume={1},
  number={1},
  pages={46--59},
  year={2010},
  publisher={IEEE}
}

What it does

Given an emoticon or a sentence, CAO runs three procedures:

  1. Detection — find emoticon spans in free text (face-anchored: a candidate must contain a recognized face core; brackets optional). 0.5 also finds partial faces — bracketless, single-bracket, and mouthless (eye–eye) — via a gated fallback anchor, and rejects prose/number noise.
  2. Extraction — segment each emoticon into its seven structural areas [additional][bracket][internal][ FACE ][internal][bracket][additional] and decompose the face into eye / mouth / eye (occurrence-weighted; empty mouth allowed).
  3. Affect analysis — score the parts against ten per-emotion databases and decide a single emotion, with a calibrated confidence and a 2-D coordinate.
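The extraction step can be illustrated with a minimal sketch. It is not the package's extractor: the function name `split_areas`, the bracket sets, and the "longest known face core wins" rule are assumptions made for illustration, and the eye / mouth / eye decomposition of the face is left out.

```python
# Hypothetical bracket inventories; the real system's sets may differ.
OPEN_BRACKETS = "([{"
CLOSE_BRACKETS = ")]}"
EMPTY = "N/A"  # placeholder used for empty areas, as in the README output


def split_areas(emoticon, face_cores):
    """Split an emoticon into seven areas around the longest known face core.

    Returns [additional, bracket, internal, face, internal, bracket, additional],
    or None when no known face core occurs in the emoticon.
    """
    core = max((c for c in face_cores if c in emoticon), key=len, default=None)
    if core is None:
        return None
    start = emoticon.index(core)
    left, right = emoticon[:start], emoticon[start + len(core):]

    # Left side: the last opening bracket separates additional | bracket | internal.
    lpos = max((left.rfind(b) for b in OPEN_BRACKETS), default=-1)
    if lpos >= 0:
        l_add, l_br, l_int = left[:lpos], left[lpos], left[lpos + 1:]
    else:
        l_add, l_br, l_int = left, "", ""  # bracketless: treat as decoration

    # Right side: the first closing bracket separates internal | bracket | additional.
    found = [p for p in (right.find(b) for b in CLOSE_BRACKETS) if p >= 0]
    if found:
        rpos = min(found)
        r_int, r_br, r_add = right[:rpos], right[rpos], right[rpos + 1:]
    else:
        r_int, r_br, r_add = "", "", right

    row = [l_add, l_br, l_int, core, r_int, r_br, r_add]
    return [part or EMPTY for part in row]


cores = {"^_^", "^o^"}
plain = split_areas("(^_^)", cores)
decorated = split_areas("\\(^o^)/", cores)
bracketless = split_areas("^o^", cores)
```

In this sketch a bracketless face yields `N/A` for all six surrounding areas, which mirrors how partial faces keep the same seven-slot layout.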

The ten emotions (Nakamura): anger, dislike, excitement, fear, fondness, joy, relief, shame, sorrow, surprise — also projected onto Russell's valence × activation plane.
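As a small illustration of reading the 2-D coordinate, the sketch below names the quadrant of a (valence, activation) pair by sign. The helper `quadrant` is hypothetical, and the label "deactivated" for negative activation is an assumption; only "negative-activated" appears in this README's sample output.

```python
def quadrant(valence, activation):
    """Name the quadrant of a point on the valence x activation plane."""
    v = "positive" if valence >= 0 else "negative"
    a = "activated" if activation >= 0 else "deactivated"  # assumed label
    return f"{v}-{a}"


# Coordinates taken from the README's anger example.
label = quadrant(-0.51, 0.70)
```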


Install

Requires Python 3.10+ and numpy. pyahocorasick is recommended (C-accelerated matching; the code falls back to a pure-Python automaton — only ~15% slower — without it).

From PyPI:

pip install cao-official          # core
pip install "cao-official[fast]"  # + pyahocorasick (recommended)

The wheel ships a prebuilt model (cao.model + cao.model.npy), so the first Cao() loads in ~5 ms with no database needed — the package is self-contained.

From source (for development / rebuilding the model from the database):

git clone https://github.com/ptaszynski/cao-official
cd cao-official
python3 -m venv .venv && ./.venv/bin/pip install -e ".[fast,dev]"
./.venv/bin/python -m cao_official --build   # rebuild model from ../cao_0.2/data

Quickstart

Python API

from cao_official import Cao

cao = Cao()                       # loads the model (builds it once if missing)

r = cao.analyze("(`Д´)")
print(r.label)                    # 'anger'
print(r.confidence)               # 0.71   (calibrated posterior)
print(r.valence, r.activation)    # -0.51 0.70   -> negative-activated
print(r.areas.as_row())           # ['N/A', '(', 'N/A', '`Д́', 'N/A', ')', 'N/A']
print(r.ranking()[:3])            # [('anger', 0.71), ('excitement', 0.10), ...]
print(r.attribution[0])           # ('face', '`Д́', 'anger', 88.0)  -- why

# detect + analyze every emoticon in a sentence (partial faces included)
for r in cao.analyze_text("今日は嬉しい^o^けど(--)気分"):
    print(r.span, r.emoticon, "->", r.label)

# stream over a big corpus (optionally across processes)
with open("corpus.txt", encoding="utf-8") as corpus:   # explicit encoding for Japanese text
    for i, results in cao.analyze_batch(corpus, workers=4):
        ...

# pick a different scoring method (the five paper methods remain available)
cao.analyze("(^o^)", method="frequency")

Web app

An interactive Streamlit demo (app.py), bilingual (English / 日本語), with single-emoticon, free-text, and whole-document (file upload) modes; it shows detection, the emotion ranking, the Russell valence×activation plot, the 7-area breakdown, and the kineme attribution. The header logo recolours to the light/dark theme automatically.

pip install "cao-official[app]"   # streamlit + altair + pandas + pillow
streamlit run app.py

Deploy on Streamlit Community Cloud: point it at this repo and app.py. The included requirements.txt installs the dependencies, and the package and its prebuilt model are in the repo, so it runs as-is.

Command line

cao-official "(^_^)" "(`Д´)"                       # after `pip install`
python -m cao_official "(^_^)" "(`Д´)"             # or via the module
echo "嬉しい(^o^)です" | python -m cao_official --text   # detect in free text
python -m cao_official --json "orz"                  # JSON output
python -m cao_official --explain "(`Д´)"             # show kineme attribution
python -m cao_official --method frequency "(^o^)"    # pick a scoring method
python -m cao_official --build                       # rebuild the model

Example:

emoticon: (`Д´)
  split : N/A | ( | N/A | `Д́ | N/A | ) | N/A
  label : anger  (confidence 0.71, via bayes)
  2-D   : valence=-0.51 activation=+0.70 [negative-activated]
  top-3 : anger=0.7058, excitement=0.1009, dislike=0.07281
  why   : face='`Д́'->anger(88), left_eye+right_eye='`\t́'->anger(65), mouth='Д'->anger(42)

How it works

text ──► normalize (NFKC) ──► detect (Aho-Corasick, face-anchored + mouthless fallback)
     ──► extract (position-aware 7 areas + eye/mouth) ──► score (5 methods + Naive-Bayes)
     ──► decide (label + calibrated confidence + Russell 2-D)
  • One model, built once. The released per-emotion databases are parsed into contiguous float32 stat tensors (one (N, 10, 5) buffer + a dict[str → row] per table) and a face-core automaton, serialized to cao_official/cao.model (+ a sidecar cao.model.npy tensor buffer, mmappable). Runtime never re-parses text files; loading is ~5 ms.
  • One O(text) pass, shared. A single Aho-Corasick automaton over the normalized line replaces the legacy giant regex alternation, the per-character detection loop, and every linear database scan — and its hits drive both detection and extraction (no second scan).
  • Normalized matching with offset map. NFKC unifies full/half-width and combining/precomposed forms (for example, full-width （＾＿＾） becomes (^_^)); a canonical→original index map lets detection work in normalized space yet report exact original spans.
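The offset-map idea can be sketched with the standard library alone. This is a simplified illustration, not the package's normalizer: it normalizes one character at a time, so sequences that only compose across characters (such as half-width katakana followed by a voicing mark) would differ from whole-string NFKC. The function names are hypothetical.

```python
import unicodedata


def normalize_with_offsets(text):
    """Return (normalized_text, offsets), where offsets[j] is the index in
    `text` of the original character that produced normalized character j."""
    pieces, offsets = [], []
    for i, ch in enumerate(text):
        norm = unicodedata.normalize("NFKC", ch)
        pieces.append(norm)
        offsets.extend([i] * len(norm))  # one entry per normalized character
    return "".join(pieces), offsets


def to_original_span(offsets, start, end):
    """Map a half-open span in normalized space back to the original text."""
    return offsets[start], offsets[end - 1] + 1


text = "嬉しい（＾＿＾）…です"
norm, offsets = normalize_with_offsets(text)
start = norm.index("(^_^)")
span = to_original_span(offsets, start, start + len("(^_^)"))
original_emoticon = text[span[0]:span[1]]
```

Matching runs against `norm`, while `span` indexes the untouched input, so the reported emoticon keeps its original full-width characters.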

Scoring methods

Each part is matched exactly against its database; the face follows the cascade raw whole-emoticon → triplet core → eyes + mouth, and the surrounding areas are added at the paper's 0.25 weight (now separately configurable for internal vs additional). Six methods are available:

method            meaning                                         direction
occurrence        raw count in the emotion DB                     higher = stronger
frequency         occurrence ÷ total occurrences in that DB       higher = stronger
uniqueFrequency   occurrence ÷ #unique elements in that DB        higher = stronger
position          rank by occurrence (ties share)                 lower = stronger
uniquePosition    dense rank by occurrence                        lower = stronger
bayes (default)   smoothed Naive-Bayes posterior over the parts   higher = probability
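The five paper statistics can be computed from a single emotion database as in the sketch below. The function `paper_statistics` and the toy counts are hypothetical, and reading "ties share" as competition ranking (tied elements share a rank and the next rank skips ahead) is an assumption about the intended definition.

```python
def paper_statistics(occurrences):
    """Compute the five per-element statistics for one emotion database.

    occurrences: dict mapping an element (face, eye pair, mouth, ...) to its
    raw count in that database.
    """
    total = sum(occurrences.values())
    unique = len(occurrences)
    counts_desc = sorted(occurrences.values(), reverse=True)
    distinct_desc = sorted(set(occurrences.values()), reverse=True)
    stats = {}
    for element, occ in occurrences.items():
        stats[element] = {
            "occurrence": occ,
            "frequency": occ / total,
            "uniqueFrequency": occ / unique,
            # Rank 1 is the most frequent element; tied elements share a rank.
            "position": counts_desc.index(occ) + 1,
            # Dense rank: no gaps after ties.
            "uniquePosition": distinct_desc.index(occ) + 1,
        }
    return stats


toy_db = {"^_^": 6, "^o^": 3, "^-^": 3, "^^": 1}  # invented counts
stats = paper_statistics(toy_db)
```

With these toy counts the two tied elements both get position 2, and the rarest element gets position 4 but uniquePosition 3, which is the only place the two rank methods diverge.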

bayes is the 0.5 default. It composes the kineme evidence as a log-product log P(e) + Σ wᵢ·log P(partᵢ|e) with Lidstone smoothing — the canonical generative model, of which the paper's frequency is a single-part special case. It yields a real posterior probability (so confidence is calibrated by temperature scaling) and removes the uniqueFrequency small-DB bias toward relief. It won the cross-validation bake-off (eval bakeoff: best macro-F1 0.322 and best top-3 69.1%; in-context probe top-1 28.1% vs 22.1% for the old default, top-3 57.8%). All methods are selectable via method= and exposed in result.scores.
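A minimal sketch of the log-product above, with Lidstone smoothing and a temperature-scaled softmax, might look as follows. The function `bayes_posterior`, its parameters, the smoothing constant, and the toy counts are all assumptions for illustration; the package's actual scorer and its fitted temperature are not shown here.

```python
import math


def bayes_posterior(parts, counts, priors, weights=None, alpha=0.5, temperature=1.0):
    """Posterior over emotions for a list of observed parts.

    counts: {emotion: {part: count}}, priors: {emotion: P(e)},
    weights: one weight per part (defaults to 1.0 each).
    """
    weights = weights or [1.0] * len(parts)
    vocab = {p for table in counts.values() for p in table} | set(parts)
    logits = {}
    for emotion, table in counts.items():
        total = sum(table.values())
        score = math.log(priors[emotion])
        for part, w in zip(parts, weights):
            # Lidstone smoothing: unseen parts get a small non-zero probability.
            p = (table.get(part, 0) + alpha) / (total + alpha * len(vocab))
            score += w * math.log(p)
        logits[emotion] = score / temperature  # temperature scaling of the logits
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {e: math.exp(s - peak) for e, s in logits.items()}
    z = sum(exps.values())
    return {e: v / z for e, v in exps.items()}


toy_counts = {"joy": {"^": 8, "o": 2}, "anger": {"`": 6, "Д": 4}}  # invented
toy_priors = {"joy": 0.5, "anger": 0.5}
sharp = bayes_posterior(["^", "o"], toy_counts, toy_priors)
soft = bayes_posterior(["^", "o"], toy_counts, toy_priors, temperature=2.0)
```

A temperature above 1 flattens the distribution without changing the ranking, which is why it can adjust confidence while leaving the predicted label untouched.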


Performance

operation                    time
build model                  ~0.2 s (once)
load model                   ~6 ms (eager)
model artifact               ~5 MB (0.8 MB pickle + 4.2 MB mmappable .npy)
full 1000-sentence corpus    ~15 ms (~65k sentences/s end-to-end)
pure-Python Aho-Corasick     ~15% slower than native pyahocorasick

Run the harness with python -m cao_official.eval (add bakeoff, calibrate, or bench). Throughput is slightly lower than in version 0.4 because the default bayes scorer computes a posterior per emoticon, but it remains well over 60k sentences/s.

Scales with the database, not against it. Detection, matching and scoring are all O(text), independent of how many emoticons are in the model: a single Aho-Corasick pass anchors detection and feeds extraction, lookups are dict → row into one contiguous float32 tensor, and every statistic is precomputed at build time. Growing the emoticon database 10× or 100× leaves per-analysis cost unchanged — only the one-off build and the (compact, linear) artifact grow.
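To make the single-pass claim concrete, here is a compact pure-Python Aho-Corasick sketch. It is a textbook construction written for illustration, not the package's automaton.py; the function names and the toy pattern list are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto / fail / output tables for a set of non-empty patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    # Breadth-first pass: depth-1 nodes fail to the root, deeper nodes follow
    # their parent's failure chain until the same character can be consumed.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]  # inherit suffix matches
            queue.append(child)
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every match, in one pass over text."""
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits


face_cores = ["^_^", "^o^", "--"]  # toy face cores
hits = find_all("嬉しい(^o^)けど(--)気分", build_automaton(face_cores))
```

Adding more patterns grows the tables built once up front, while the scan itself still visits each character of the text a bounded number of times.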


Project layout

cao-official_0.5/
  pyproject.toml       pip-installable (dist `cao-official`, CLI `cao-official`)
  cao_official/        the package
    normalize.py       NFKC canonicalization
    model.py           Emotion set, Russell 2-D, EmoResult, methods
    automaton.py       Aho-Corasick (pyahocorasick + pure-python fallback)
    langmodel.py       one-class char-LM detector for borderline spans
    database.py        build/serialize the model (cao.model + cao.model.npy)
    detect.py          face-anchored detection + mouthless fallback
    extract.py         position-aware 7-area segmentation + eye/mouth split
    score.py           5 paper methods + Naive-Bayes, cascade, decision, 2-D
    cao.py             the Cao facade (analyze / analyze_text / analyze_batch)
    cli.py / __main__  command line
    eval.py            throughput + bake-off + calibrate + bench
  reference_port/      faithful 1:1 port of the C# (regression oracle)
  tests/               behaviour tests (python tests/test_cao.py)
  docs/API.md          API reference + theory-of-kinesics primer
  README.md  CHANGELOG.md  ROADMAP.md  ANALYSIS.md  PLAN.md

Limitations & honest caveats

  • Scope: kaomoji only. Unicode emoji and Western emoticons are out of scope for now. The design extends to them (see ROADMAP), but emoji support was deliberately deferred in this release.
  • Accuracy is data-bound. Labels reflect the released database, which is more deduplicated than the paper's full DB; e.g. ^o^ resolves where it is densest in this DB. The joy class barely surfaces because it overlaps almost entirely with fondness in this release.
  • Evaluation is in-database. The bake-off uses the labeled DB itself as gold (resubstitution-style, raw stage disabled); the in-context probe only labels 5 corpus blocks. A truly held-out, hand-labeled gold set is the top ROADMAP item.
  • The relief small-DB bias is fixed by the default bayes scorer (relief predictions 1018 → 249 on the bake-off; gold 91). uniqueFrequency and the other paper methods remain available but carry the original bias.
  • Bare 2-char bracketless faces in prose (e.g. ^^ in わー^^ね) are below the detection length floor and skipped to avoid prose false-positives; bracketed ((^^)) and 3-char+ forms are found.

For the faithful, bug-for-bug behaviour of the original system, use reference_port/cao_reference.py (validated against cao_0.2/eval/1000eval4.txt).


License

CAO is released under The BSD 3-Clause License — the whole system, including the cao-official Python package and the bundled emoticon databases.

Copyright (c) 2007-2026, Michal Ptaszynski, Jacek Maciejewski, Pawel Dybala, Rafal Rzepka, Kenji Araki.

See LICENSE for the full text.
