CAO Official — fast, clean emotiCon Analysis and decOding of affective information (Japanese kaomoji affect analysis)

CAO Official 0.5.0

A fast, clean reimplementation of CAO — emotiCon Analysis and decOding of affective information — the system that detects Japanese emoticons (kaomoji, e.g. (^_^)) in text and classifies the emotion they express, grounded in Birdwhistell's theory of kinesics (an emoticon is body language split into semantic kinemes: eyes, mouth, decorations).

Based on: Ptaszynski et al., "CAO: A Fully Automatic Emoticon Analysis System Based on Theory of Kinesics", IEEE Transactions on Affective Computing, 2010. This is a from-scratch Python rewrite of the original C# system — faster, fixed, and runnable. See CHANGELOG.md for the lineage, docs/API.md for the API reference, and ANALYSIS.md for how it differs from the legacy code and the paper.

PyPI name: the bare cao is taken, so the distribution is cao-official and the import package is cao_official (CLI: cao-official).

New in 0.5: partial faces (bracketless ^o^, one-bracket (^o^, mouthless (^^) / (--)); a probabilistic Naive-Bayes scorer that is the new default (fixes the old relief bias, gives calibrated confidence); a statistical borderline detector; mmap model; batch/async API; and pip-installable packaging.

Citation

If you use CAO, please cite:

M. Ptaszynski, J. Maciejewski, P. Dybala, R. Rzepka, K. Araki. "CAO: A Fully Automatic Emoticon Analysis System Based on Theory of Kinesics." IEEE Transactions on Affective Computing, Vol. 1, No. 1, pp. 46-59, 2010.

@article{ptaszynski2010cao,
  title={CAO: A fully automatic emoticon analysis system based on theory of kinesics},
  author={Ptaszynski, Michal and Maciejewski, Jacek and Dybala, Pawel and Rzepka, Rafal and Araki, Kenji},
  journal={IEEE Transactions on Affective Computing},
  volume={1},
  number={1},
  pages={46--59},
  year={2010},
  publisher={IEEE}
}

What it does

Given an emoticon or a sentence, CAO runs three procedures:

Detection — find emoticon spans in free text (face-anchored: a candidate must contain a recognized face core; brackets optional). 0.5 also finds partial faces — bracketless, single-bracket, and mouthless (eye–eye) — via a gated fallback anchor, and rejects prose/number noise.
Extraction — segment each emoticon into its seven structural areas [additional][bracket][internal][ FACE ][internal][bracket][additional] and decompose the face into eye / mouth / eye (occurrence-weighted; empty mouth allowed).
Affect analysis — score the parts against ten per-emotion databases and decide a single emotion, with a calibrated confidence and a 2-D coordinate.
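The seven-area layout from the Extraction step can be illustrated with a small standalone sketch. It is not the package's extract.py: the function name seven_areas, the bracket sets, and the rule that bracketless leftovers count as "additional" are assumptions made for illustration, and the face core is passed in rather than detected.

```python
OPEN_BRACKETS = "([{"    # assumed bracket set (after NFKC normalization)
CLOSE_BRACKETS = ")]}"


def seven_areas(emoticon: str, face: str) -> list[str]:
    """Split an emoticon around a known face core into seven areas:
    [additional][bracket][internal][FACE][internal][bracket][additional]."""
    start = emoticon.find(face)
    if start < 0:
        raise ValueError("face core not found in emoticon")
    left, right = emoticon[:start], emoticon[start + len(face):]

    # Left side: the opening bracket nearest to the face, if any.
    lb = max(left.rfind(b) for b in OPEN_BRACKETS)
    if lb >= 0:
        add_l, br_l, int_l = left[:lb], left[lb], left[lb + 1:]
    else:
        add_l, br_l, int_l = left, "", ""   # assumption: no bracket -> additional

    # Right side: the closing bracket nearest to the face, if any.
    hits = [i for i in (right.find(b) for b in CLOSE_BRACKETS) if i >= 0]
    if hits:
        rb = min(hits)
        int_r, br_r, add_r = right[:rb], right[rb], right[rb + 1:]
    else:
        int_r, br_r, add_r = "", "", right

    areas = [add_l, br_l, int_l, face, int_r, br_r, add_r]
    return [a if a else "N/A" for a in areas]   # empty areas shown as N/A


print(seven_areas("(^_^)", "^_^"))
# ['N/A', '(', 'N/A', '^_^', 'N/A', ')', 'N/A']
print(seven_areas("\\(*^o^*)/", "^o^"))
# ['\\', '(', '*', '^o^', '*', ')', '/']
```

In the real system the face core comes from the detection pass, so the same hit that anchors detection also fixes where the split happens.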

The ten emotions (Nakamura): anger, dislike, excitement, fear, fondness, joy, relief, shame, sorrow, surprise — also projected onto Russell's valence × activation plane.
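As a rough illustration of the valence × activation plane, the sketch below names the quadrant for a coordinate pair. The helper quadrant is hypothetical, and only the "negative-activated" label appears in CAO's own output; the other three labels and the treatment of exact zeros are assumptions.

```python
def quadrant(valence: float, activation: float) -> str:
    """Name the Russell-plane quadrant for a (valence, activation) pair."""
    v = "positive" if valence >= 0 else "negative"
    a = "activated" if activation >= 0 else "deactivated"   # assumed label
    return f"{v}-{a}"


print(quadrant(-0.51, 0.70))   # negative-activated
```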

Install

Requires Python 3.10+ and numpy. pyahocorasick is recommended (C-accelerated matching; the code falls back to a pure-Python automaton — only ~15% slower — without it).

From PyPI:

pip install cao-official          # core
pip install "cao-official[fast]"  # + pyahocorasick (recommended)

The wheel ships a prebuilt model (cao.model + cao.model.npy), so the first Cao() loads in ~5 ms with no database needed — the package is self-contained.

From source (for development / rebuilding the model from the database):

git clone https://github.com/ptaszynski/cao-official
cd cao-official
python3 -m venv .venv && ./.venv/bin/pip install -e ".[fast,dev]"
./.venv/bin/python -m cao_official --build   # rebuild model from ../cao_0.2/data

Quickstart

Python API

from cao_official import Cao

cao = Cao()                       # loads the model (builds it once if missing)

r = cao.analyze("(｀Д´)")
print(r.label)                    # 'anger'
print(r.confidence)               # 0.71   (calibrated posterior)
print(r.valence, r.activation)    # -0.51 0.70   -> negative-activated
print(r.areas.as_row())           # ['N/A', '(', 'N/A', '`Д́', 'N/A', ')', 'N/A']
print(r.ranking()[:3])            # [('anger', 0.71), ('excitement', 0.10), ...]
print(r.attribution[0])           # ('face', '`Д́', 'anger', 88.0)  -- why

# detect + analyze every emoticon in a sentence (partial faces included)
for r in cao.analyze_text("今日は嬉しい^o^けど(--)気分"):
    print(r.span, r.emoticon, "->", r.label)

# stream over a big corpus (optionally across processes)
with open("corpus.txt", encoding="utf-8") as corpus:
    for i, results in cao.analyze_batch(corpus, workers=4):
        ...

# pick a different scoring method (the five paper methods remain available)
cao.analyze("(^o^)", method="frequency")

Web app

An interactive Streamlit demo (app.py), bilingual (English / 日本語), with single-emoticon, free-text, and whole-document (file upload) modes; it shows detection, the emotion ranking, the Russell valence×activation plot, the 7-area breakdown, and the kineme attribution. The header logo recolours to the light/dark theme automatically.

pip install "cao-official[app]"   # streamlit + altair + pandas + pillow
streamlit run app.py

Deploy on Streamlit Community Cloud: point it at this repo and app.py. The included requirements.txt installs the dependencies, and the package and its prebuilt model are in the repo, so it runs as-is.

Command line

cao-official "(^_^)" "(｀Д´)"                       # after `pip install`
python -m cao_official "(^_^)" "(｀Д´)"             # or via the module
echo "嬉しい(^o^)です" | python -m cao_official --text   # detect in free text
python -m cao_official --json "orz"                  # JSON output
python -m cao_official --explain "(｀Д´)"             # show kineme attribution
python -m cao_official --method frequency "(^o^)"    # pick a scoring method
python -m cao_official --build                       # rebuild the model

Example:

emoticon: (｀Д´)
  split : N/A | ( | N/A | `Д́ | N/A | ) | N/A
  label : anger  (confidence 0.71, via bayes)
  2-D   : valence=-0.51 activation=+0.70 [negative-activated]
  top-3 : anger=0.7058, excitement=0.1009, dislike=0.07281
  why   : face='`Д́'->anger(88), left_eye+right_eye='`\t́'->anger(65), mouth='Д'->anger(42)

How it works

text ──► normalize (NFKC) ──► detect (Aho-Corasick, face-anchored + mouthless fallback)
     ──► extract (position-aware 7 areas + eye/mouth) ──► score (5 methods + Naive-Bayes)
     ──► decide (label + calibrated confidence + Russell 2-D)

One model, built once. The released per-emotion databases are parsed into contiguous float32 stat tensors (one (N, 10, 5) buffer + a dict[str → row] per table) and a face-core automaton, serialized to cao_official/cao.model (+ a sidecar cao.model.npy tensor buffer, mmappable). Runtime never re-parses text files; loading is ~5 ms.
One O(text) pass, shared. A single Aho-Corasick automaton over the normalized line replaces the legacy giant regex alternation, the per-character detection loop, and every linear database scan — and its hits drive both detection and extraction (no second scan).
Normalized matching with offset map. NFKC unifies full/half-width and combining/precomposed forms (（＾＿＾） ≡ (^_^)); a canonical→original index map lets detection work in normalized space yet report exact original spans.
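The offset-map idea can be sketched with the standard library alone. This is a simplified illustration, not the package's normalize.py: it normalizes one character at a time, which keeps the bookkeeping trivial but can differ from whole-string NFKC when adjacent characters would compose (for example half-width katakana followed by a voicing mark). The function names are hypothetical.

```python
import unicodedata


def normalize_with_offsets(text: str) -> tuple[str, list[int]]:
    """NFKC-normalize text and record, for every normalized character,
    the index of the original character it came from."""
    chars, offsets = [], []
    for i, ch in enumerate(text):
        norm = unicodedata.normalize("NFKC", ch)   # may expand to several chars
        chars.append(norm)
        offsets.extend([i] * len(norm))
    return "".join(chars), offsets


def original_span(offsets: list[int], start: int, end: int) -> tuple[int, int]:
    """Map a half-open [start, end) span in normalized text to the original."""
    return offsets[start], offsets[end - 1] + 1


text = "えっ…（＾＿＾）"              # the ellipsis expands to three dots under NFKC
norm, offsets = normalize_with_offsets(text)
start = norm.find("(^_^)")
span = original_span(offsets, start, start + len("(^_^)"))
print(norm)                    # えっ...(^_^)
print(span)                    # (3, 8)
print(text[span[0]:span[1]])   # （＾＿＾）
```

Matching runs on the normalized string, while the reported span still indexes the untouched input, even though normalization changed the string length.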

Scoring methods

Each part is matched exactly against its database; the face follows the cascade raw whole-emoticon → triplet core → eyes + mouth, and the surrounding areas are added at the paper's 0.25 weight (now separately configurable for internal vs additional). Six methods are available:

method	meaning	direction
`occurrence`	raw count in the emotion DB	higher = stronger
`frequency`	occurrence ÷ total occurrences in that DB	higher = stronger
`uniqueFrequency`	occurrence ÷ #unique elements in that DB	higher = stronger
`position`	rank by occurrence (ties share)	lower = stronger
`uniquePosition`	dense rank by occurrence	lower = stronger
`bayes` (default)	smoothed Naive-Bayes posterior over the parts	higher = more probable
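To make the five paper statistics in the table concrete, here is a minimal sketch computed from a toy count table for one emotion DB. The counts are made up, the function name paper_stats is hypothetical, and reading "ties share" as competition ranking (1, 2, 2, 4) is an assumption; the package precomputes these values at build time rather than on demand.

```python
def paper_stats(counts: dict[str, int]) -> dict[str, dict[str, float]]:
    """Compute the five per-element statistics for one emotion database."""
    total = sum(counts.values())          # total occurrences in the DB
    unique = len(counts)                  # number of unique elements
    distinct = sorted(set(counts.values()), reverse=True)
    stats = {}
    for elem, occ in counts.items():
        stats[elem] = {
            "occurrence": occ,
            "frequency": occ / total,
            "uniqueFrequency": occ / unique,
            # rank by occurrence; tied elements share the same rank
            "position": 1 + sum(1 for c in counts.values() if c > occ),
            # dense rank: no gaps after ties
            "uniquePosition": 1 + distinct.index(occ),
        }
    return stats


toy_db = {"^": 5, "o": 3, "-": 3, ";": 1}     # toy data, not from CAO
stats = paper_stats(toy_db)
print(stats[";"]["position"], stats[";"]["uniquePosition"])   # 4 3
```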

bayes is the 0.5 default. It combines the kineme evidence as a weighted sum of log-probabilities, log P(e) + Σ wᵢ·log P(partᵢ|e), with Lidstone smoothing. This is the canonical generative model, of which the paper's frequency is a single-part special case. Because it yields a real posterior probability, the confidence can be calibrated (by temperature scaling), and it removes the uniqueFrequency small-DB bias toward relief. It won the cross-validation bake-off (eval bakeoff: best macro-F1 0.322 and best top-3 69.1%; in-context probe: top-1 28.1% vs 22.1% for the old default, top-3 57.8%). All methods are selectable via method= and exposed in result.scores.
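The scoring formula above can be sketched as follows. This is an illustrative toy, not the package's score.py: the function name, the alpha value, the toy counts and priors, and applying the temperature by dividing the log-scores before the softmax are all assumptions.

```python
import math


def bayes_posterior(parts, counts, priors, vocab_size,
                    weights=None, alpha=0.5, temperature=1.0):
    """Posterior over emotions from log P(e) + sum_i w_i * log P(part_i | e)."""
    weights = weights or [1.0] * len(parts)
    log_scores = {}
    for emotion, prior in priors.items():
        table = counts.get(emotion, {})
        total = sum(table.values())
        score = math.log(prior)
        for part, w in zip(parts, weights):
            # Lidstone smoothing: unseen parts keep a small non-zero probability.
            p = (table.get(part, 0) + alpha) / (total + alpha * vocab_size)
            score += w * math.log(p)
        log_scores[emotion] = score / temperature
    # Softmax, subtracting the maximum for numerical stability.
    top = max(log_scores.values())
    exp = {e: math.exp(s - top) for e, s in log_scores.items()}
    z = sum(exp.values())
    return {e: v / z for e, v in exp.items()}


# Toy data for two emotions (made up for illustration).
toy_counts = {"joy": {"^": 8, "o": 2}, "anger": {"`": 6, "Д": 4}}
toy_priors = {"joy": 0.5, "anger": 0.5}
post = bayes_posterior(["^", "o", "^"], toy_counts, toy_priors, vocab_size=4)
print(max(post, key=post.get))   # joy
```

With a temperature above 1 the same ranking is kept but the distribution flattens, which is how a single scalar can temper overconfident posteriors.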

Performance

operation	time
build model	~0.2 s (once)
load model	~6 ms (eager)
model artifact	~5 MB (0.8 MB pickle + 4.2 MB mmappable `.npy`)
full 1000-sentence corpus	~15 ms (~65k sentences/s end-to-end)
pure-Python Aho-Corasick	~15% slower than native `pyahocorasick`

Run the harness: python -m cao_official.eval (add bakeoff, calibrate, or bench). Throughput is slightly lower than in 0.4 because the default bayes scorer computes a posterior per emoticon, but it is still well over 60k sentences/s.

Scales with the database, not against it. Detection, matching and scoring are all O(text), independent of how many emoticons are in the model: a single Aho-Corasick pass anchors detection and feeds extraction, lookups are dict → row into one contiguous float32 tensor, and every statistic is precomputed at build time. Growing the emoticon database 10× or 100× leaves per-analysis cost unchanged — only the one-off build and the (compact, linear) artifact grow.
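For readers unfamiliar with the technique, the following is a compact, generic Aho-Corasick automaton in pure Python. It is a textbook sketch, not the package's automaton.py or the pyahocorasick API; the class and method names are hypothetical. It shows why one left-to-right pass finds every pattern regardless of how many patterns are loaded.

```python
from collections import deque


class AhoCorasick:
    def __init__(self, patterns):
        self.goto = [{}]     # per-state transitions: char -> state
        self.fail = [0]      # failure links
        self.out = [[]]      # patterns that end at each state
        for p in patterns:
            self._add(p)
        self._build_links()

    def _add(self, pattern):
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[state][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            state = nxt
        self.out[state].append(pattern)

    def _build_links(self):
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())   # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target (suffix matches).
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def finditer(self, text):
        """Yield (start_index, pattern) for every match, in one pass over text."""
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for p in self.out[state]:
                yield i - len(p) + 1, p


ac = AhoCorasick(["^_^", "^o^", "T_T"])        # toy face cores
hits = list(ac.finditer("嬉しい(^o^)でもT_T"))
print(hits)   # [(4, '^o^'), (10, 'T_T')]
```

Adding more patterns grows the automaton, but each input character still moves along only its transition and failure links, so scan time depends on the text length and the number of matches rather than on the pattern count.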

Project layout

cao-official_0.5/
  pyproject.toml       pip-installable (dist `cao-official`, CLI `cao-official`)
  cao_official/        the package
    normalize.py       NFKC canonicalization
    model.py           Emotion set, Russell 2-D, EmoResult, methods
    automaton.py       Aho-Corasick (pyahocorasick + pure-python fallback)
    langmodel.py       one-class char-LM detector for borderline spans
    database.py        build/serialize the model (cao.model + cao.model.npy)
    detect.py          face-anchored detection + mouthless fallback
    extract.py         position-aware 7-area segmentation + eye/mouth split
    score.py           5 paper methods + Naive-Bayes, cascade, decision, 2-D
    cao.py             the Cao facade (analyze / analyze_text / analyze_batch)
    cli.py / __main__  command line
    eval.py            throughput + bake-off + calibrate + bench
  reference_port/      faithful 1:1 port of the C# (regression oracle)
  tests/               behaviour tests (python tests/test_cao.py)
  docs/API.md          API reference + theory-of-kinesics primer
  README.md  CHANGELOG.md  ROADMAP.md  ANALYSIS.md  PLAN.md

Limitations & honest caveats

Scope: kaomoji only. Unicode emoji and Western emoticons are out of scope for now. The design extends to them (see ROADMAP), but emoji support was deliberately deferred in this release.
Accuracy is data-bound. Labels reflect the released database, which is more deduplicated than the paper's full DB; e.g. ^o^ resolves to the emotion where it is densest in this DB. The joy class barely surfaces because it overlaps almost entirely with fondness in this release.
Evaluation is in-database. The bake-off uses the labeled DB itself as gold (resubstitution-style, raw stage disabled); the in-context probe only labels 5 corpus blocks. A truly held-out, hand-labeled gold set is the top ROADMAP item.
The relief small-DB bias is fixed by the default bayes scorer (relief predictions 1018 → 249 on the bake-off; gold 91). uniqueFrequency and the other paper methods remain available but carry the original bias.
Bare 2-char bracketless faces in prose (e.g. ^^ in わー^^ね) are below the detection length floor and skipped to avoid prose false-positives; bracketed ((^^)) and 3-char+ forms are found.

For the faithful, bug-for-bug behaviour of the original system, use reference_port/cao_reference.py (validated against cao_0.2/eval/1000eval4.txt).

License

CAO is released under The BSD 3-Clause License — the whole system, including the cao-official Python package and the bundled emoticon databases.

Copyright (c) 2007-2026, Michal Ptaszynski, Jacek Maciejewski, Pawel Dybala, Rafal Rzepka, Kenji Araki.

See LICENSE for the full text.

Project details

Release history

This version

0.5.0

May 21, 2026

Download files

Source Distribution

cao_official-0.5.0.tar.gz (529.7 kB)

Uploaded May 21, 2026 Source

Built Distribution

cao_official-0.5.0-py3-none-any.whl (522.9 kB)

Uploaded May 21, 2026 Python 3

File details

Details for the file cao_official-0.5.0.tar.gz.

File metadata

Download URL: cao_official-0.5.0.tar.gz
Upload date: May 21, 2026
Size: 529.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for cao_official-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`a48bce1695614ecdf9f12c8d0f6f21187af796c2a44591af223b913fd5b2a8bd`
MD5	`c2cb9bfede7a4c916858d81d8f728330`
BLAKE2b-256	`96617616841b96bf81601edb6a9f01ae6c42280f46333401be0b7961b43131b0`

File details

Details for the file cao_official-0.5.0-py3-none-any.whl.

File metadata

Download URL: cao_official-0.5.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 522.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for cao_official-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d9dc2c20fd27fca504a7787cbde63450b8fff79a64cdc4cc040a08de95bcfa83`
MD5	`7bd5a2819f2549daec6bd31dfc8ce3f3`
BLAKE2b-256	`153dcd559f676265851b60c91251de2658f148e7379cdb44dc88870c0add5205`
