BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core

behave-text

Text/messaging-domain behavioral observation registry. Defines what can be observed about an actor through their written messaging activity — stylometric fingerprints, lexical patterns, interaction rhythms, and governance-role signals.

BEHAVE-TEXT operates on derived features, not raw text. Sensors hash, aggregate, and classify before emitting, so raw message content never enters a BEHAVE observation. This is a tighter constraint than in BEHAVE-SHELL: the source signal here is the message text itself, which carries a higher PII risk.
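As a minimal sketch of the hash-and-classify step described above, the snippet below derives two features from raw messages and keeps only the derived values. The function names, the `intended>typed` canonical form, and the handling of fractional medians are assumptions made for illustration; they are not taken from the actual sensors. The length buckets follow the ranges listed for stylometric.message_length_class.

```python
import hashlib
import statistics


def typo_signature(typo_pairs: set[tuple[str, str]]) -> str:
    """Hash a set of (intended, typed) pairs; only the digest is emitted.

    Assumption: the canonical form is the sorted pairs joined by newlines.
    """
    canonical = "\n".join(
        f"{intended}>{typed}" for intended, typed in sorted(typo_pairs)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def message_length_class(messages: list[str]) -> str:
    """Bucket the median per-message word count (raises on an empty list)."""
    median_words = statistics.median(len(m.split()) for m in messages)
    if median_words <= 5:
        return "short"
    if median_words <= 20:
        return "medium"
    if median_words <= 50:
        return "long"
    return "paragraph"


# Raw text stays inside the sensor; only derived values leave it.
raw_messages = ["hola", "xq no tengo nada", "ya te dije que si"]
derived = {
    "stylometric.typo_signature": typo_signature({("porque", "xq")}),
    "stylometric.message_length_class": message_length_class(raw_messages),
}
print(derived["stylometric.message_length_class"])  # short
```

Sorting before hashing makes the digest independent of the order in which typos were discovered, which is what makes it usable as a searchable signature.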

The topic prefix is actor.observation.text (not attacker.) because chat groups include non-attacker roles — admins, buyers, sellers, bots, lurkers. The framing is deliberately neutral: BEHAVE-TEXT observes actors, not adversaries.

Install

pip install -e ../core/ -e .
# development:
pip install -e ../core/ -e ".[dev]"

Quickstart

from behave_text.spec import Observation, Window, TOPIC_PREFIX, event_topic_for

obs = Observation(
    primitive="stylometric.capitalization_habit",
    value="lowercase",
    confidence=0.91,
    window=Window(start_ts=1714000000.0, end_ts=1714086400.0),
    source="behave/text-sensor/stylometry.py",
)
topic = event_topic_for("stylometric.capitalization_habit")
# → "actor.observation.text.stylometric.capitalization_habit"

Public API (behave_text.spec)

Symbol Description
Observation Registry-aware subclass of behave_core.spec.Observation. Validates primitive and value against PRIMITIVE_REGISTRY.
Window Re-exported from behave_core.
ObservationValue Re-exported union type.
PRIMITIVE_REGISTRY dict[str, ValueTypeSpec] — the full primitive catalog (35 entries).
ValueKind Enum: CATEGORICAL, NUMERIC, HASH, ARRAY, FREE_STRING, BOOL.
ValueTypeSpec Pydantic model: kind, allowed values, bounds, notes.
is_known(primitive) bool — whether a primitive path is registered.
get(primitive) Returns the ValueTypeSpec; raises KeyError if unknown.
TOPIC_PREFIX "actor.observation.text"
event_topic_for(primitive) Returns the full event bus topic string.
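The table above can be read as a small registry plus a few lookup helpers. The following is a self-contained stand-in, not the library's code: it uses dataclasses instead of Pydantic, registers only two primitives, and the enum values and the `validate` helper are assumptions. It is meant only to show the shape of registry-aware validation and topic construction.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

TOPIC_PREFIX = "actor.observation.text"


class ValueKind(Enum):
    # Only two of the six kinds are modelled in this sketch.
    CATEGORICAL = "categorical"
    NUMERIC = "numeric"


@dataclass(frozen=True)
class ValueTypeSpec:
    kind: ValueKind
    allowed: frozenset = frozenset()
    bounds: Optional[tuple] = None


# Hypothetical two-entry excerpt; the real catalog has 35 entries.
REGISTRY = {
    "stylometric.capitalization_habit": ValueTypeSpec(
        ValueKind.CATEGORICAL,
        allowed=frozenset({"lowercase", "proper", "random_caps", "mixed_i"}),
    ),
    "lexical.vocabulary_richness": ValueTypeSpec(
        ValueKind.NUMERIC, bounds=(0.0, 1.0)
    ),
}


def is_known(primitive: str) -> bool:
    return primitive in REGISTRY


def get(primitive: str) -> ValueTypeSpec:
    return REGISTRY[primitive]  # KeyError if the primitive is unknown


def event_topic_for(primitive: str) -> str:
    return f"{TOPIC_PREFIX}.{primitive}"


def validate(primitive: str, value: object) -> None:
    """Raise ValueError when the value does not fit the registered spec."""
    spec = get(primitive)
    if spec.kind is ValueKind.CATEGORICAL and value not in spec.allowed:
        raise ValueError(f"{value!r} is not allowed for {primitive}")
    if spec.kind is ValueKind.NUMERIC and spec.bounds is not None:
        low, high = spec.bounds
        if not (isinstance(value, (int, float)) and low <= value <= high):
            raise ValueError(f"{value!r} is outside [{low}, {high}]")


validate("stylometric.capitalization_habit", "lowercase")  # passes silently
print(event_topic_for("lexical.vocabulary_richness"))
```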

Note: to_event_payload / from_event_payload (full round-trip helpers) are present in behave-shell but not yet implemented here — status: planned.

Primitives

35 primitives across 6 categories.


stylometric.* — Writing style fingerprints (12 primitives)

Stylometric primitives capture the unconscious writing habits that distinguish one author from another. The field goes back to the Mosteller-Wallace Federalist Papers study (1963): function-word frequencies alone can attribute authorship with high accuracy in long-form English text. BEHAVE-TEXT adapts these methods to short-form Spanish chat, which introduces domain-specific challenges (short messages, informal register, code-switching, emoji). Calibration results from the Rutify corpus are noted inline where they affect interpretation.

Primitive Kind Description
stylometric.punctuation_style hash Canonical punctuation-pattern fingerprint hash. Captures the author's consistent punctuation tics (double spaces, comma habits, no-period endings) as a searchable signature.
stylometric.capitalization_habit categorical Dominant capitalization rule. lowercase = no capitals. proper = standard sentence/title case. random_caps = no consistent rule. mixed_i = consistent lowercase 'i' mid-sentence — common in Spanish chat where the standalone-'I' habit doesn't apply but the behavior transfers.
stylometric.emoji_usage categorical Rate of emoji use. none, occasional, frequent, exclusive (messages rarely without emoji). Captures tone and register.
stylometric.emoji_placement categorical Emoji position relative to sentence-ending punctuation. pre_punctuation = 'Hola 😊.' post_punctuation = 'Hola. 😊' Individual authors are strikingly consistent in this micro-habit.
stylometric.message_length_class categorical Median message length bucket: short 1-5 words, medium 6-20, long 21-50, paragraph >50. See also message_length_variance_class for distribution shape.
stylometric.message_length_variance_class categorical Distribution shape of per-message word counts. tight CV<0.5 (always 1-3 words). varied 0.5≤CV<1.5 (normal mix). bimodal CV≥1.5 (mostly short with occasional rants). Two authors can share the same median length but have wildly different variance.
stylometric.linebreak_style categorical Whether the author sends one complete thought per message or bursts multiple short sequential messages. multi_line = habitual 3-5 short messages per turn. wall_of_text = dense blocks, rarely uses line breaks. Captures a stylistic rhythm that is hard to consciously alter.
stylometric.typo_signature hash SHA-256 of the canonical persistent-typo set — the specific recurring errors the author makes consistently (e.g. always writes tener as tenet, or porque as xq). Persistent typos are strong authorship signals because they reflect keyboard-motor habits.
stylometric.function_word_distribution_top50 hash 64-bit SimHash over the frequency vector of the 50 most common Spanish function words. Based on the Mosteller-Wallace method. Calibration note (2026-05-02, Rutify corpus): in short-message chat, the within-author and cross-author Hamming distance distributions overlap (within-author median 8 bits, cross-author median 10 bits), so this primitive alone cannot discriminate authors. Engines should weight it low and combine it with character n-grams and distinctive vocabulary. Kept in v0 for calibration grids.
stylometric.function_word_distribution_top200 hash 64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like tampoco, aunque, mientras) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2.
stylometric.character_ngram_simhash hash 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. #char3gram).
stylometric.distinctive_vocabulary_signature hash 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where function_word_* captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. #tfidf-top50).
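Several of the hash primitives above are 64-bit SimHashes compared by Hamming distance. The sketch below shows one common way to build such a fingerprint over lowercased character 3-grams with accents preserved. The choice of BLAKE2b as the per-feature hash, the use of raw counts as weights, and all function names are assumptions for illustration; the prototype's actual hashing scheme is not specified here.

```python
import hashlib
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams; lowercased, accents kept."""
    lowered = text.lower()
    return Counter(lowered[i:i + n] for i in range(len(lowered) - n + 1))


def simhash64(weights: dict) -> int:
    """Fold weighted features into a 64-bit fingerprint."""
    totals = [0.0] * 64
    for feature, weight in weights.items():
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            # Each feature votes +weight or -weight on every bit position.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    return sum(1 << bit for bit in range(64) if totals[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


fp_a = simhash64(char_ngrams("ya te dije q no tengo, xq preguntas"))
fp_b = simhash64(char_ngrams("ya te dije q no, xq insistes"))
print(hamming(fp_a, fp_b))  # smaller distance means more similar n-gram profiles
```

Because similar feature vectors produce fingerprints that differ in few bits, an engine can compare actors with a cheap XOR and popcount rather than storing the underlying n-gram counts.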

lexical.* — Vocabulary and linguistic patterns (8 primitives)

Lexical primitives characterize what and how an actor writes at the word and sentence level. Where stylometric primitives fingerprint unconscious micro-habits, lexical primitives capture deliberate linguistic choices — vocabulary richness, how questions are formed, register.

Primitive Kind Description
lexical.vocabulary_richness numeric [0,1] Moving-Average Type-Token Ratio (MATTR) over a sliding window (default 50 tokens). Volume-independent: each window contributes its own unique/total ratio, the value is the mean. Avoids the standard TTR bias where larger corpora mechanically score lower. Source label declares window size.
lexical.slang_density numeric [0,1] Rate of slang terms per message, against a locale-tuned slang corpus.
lexical.code_switching_rate numeric [0,1] Language switches per N tokens (Solorio & Liu metric). A speaker who switches between Spanish and English, or Spanish and lunfardo/caló, will have a higher rate than a monolingual writer.
lexical.code_switching_matrix_language free_string BCP-47 tag of the dominant (matrix) language in code-switching texts (e.g. es-CL, es-AR). The matrix language is the grammatical scaffold; embedded languages appear as inserts.
lexical.code_switching_embedded_languages array[free_string] BCP-47 list of non-matrix languages observed in the actor's messages.
lexical.sentence_complexity_class categorical Dominant clause structure. simple = single-clause. compound = two independent clauses joined by coordinating conjunctions (pero, y, o). complex = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment.
lexical.question_formation_style categorical How questions are formed. punctuation_only = question mark without interrogative words ('¿Vienes?'), very common in Spanish chat. lexical = explicit interrogatives (qué, cómo, cuándo). formal = inverted subject-verb or formal register.
lexical.imperative_style categorical How commands and requests are framed. informal_directive = tú/vos imperative (dame, hazlo). formal_directive = usted imperative (hágame el favor). polite = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts.
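The MATTR computation behind lexical.vocabulary_richness can be sketched as follows. This is a straightforward, unoptimized version; the fallback to a plain type-token ratio for inputs shorter than one window is an assumption, as is the function name.

```python
def mattr(tokens: list[str], window: int = 50) -> float:
    """Mean of the unique/total ratio over every sliding window of tokens."""
    if not tokens:
        raise ValueError("at least one token is required")
    if len(tokens) <= window:
        # Assumption: fall back to a plain type-token ratio for short input.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


tokens = "no se si voy pero si voy te aviso y si no voy tambien".split()
print(round(mattr(tokens, window=5), 3))
```

Each window has the same size, so a long history and a short one are scored on the same footing, which is the volume independence the table entry describes.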

temporal_evolution.* — Behavioral change over time (1 primitive)

Primitive Kind Description
temporal_evolution.lifecycle_phase categorical Auto-classified lifecycle stage from windowed within-corpus analysis. arrival_burst = first 24hr, first-window volume dominates (empirically validated against OxPayload's first 12 hours in Rutify). stable_member = low drift across the full tenure. fluctuating_member = tenure ≥24hr with median drift between stable and inflection thresholds — established noisy regulars (e.g. lamarabitch). inflection_member = long-tenure actor with a real behavioral shift in at least one window-pair. declining_member = monotonically decreasing per-window message counts. unknown = insufficient data. Window size adapts to tenure: <24hr → 2h, <7d → 12h, <30d → 1d, otherwise 7d.
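The adaptive window rule at the end of the lifecycle_phase entry can be written as a small lookup. The thresholds come from the table; the function names and the way the tenure is split into consecutive windows are assumptions for illustration.

```python
HOUR = 3600
DAY = 24 * HOUR


def window_seconds_for(tenure_seconds: float) -> int:
    """Pick the analysis window size from the actor's tenure."""
    if tenure_seconds < DAY:
        return 2 * HOUR
    if tenure_seconds < 7 * DAY:
        return 12 * HOUR
    if tenure_seconds < 30 * DAY:
        return DAY
    return 7 * DAY


def split_into_windows(first_ts: float, last_ts: float) -> list:
    """Cut [first_ts, last_ts) into consecutive windows; the last may be shorter."""
    size = window_seconds_for(last_ts - first_ts)
    windows = []
    start = first_ts
    while start < last_ts:
        windows.append((start, min(start + size, last_ts)))
        start += size
    return windows


print(len(split_into_windows(0.0, 6 * HOUR)))  # 3
```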

network.* — Governance and role signals (2 primitives)

Network primitives capture the actor's structural role in the group — inferred from interaction patterns rather than content — and a bot detector. These are heuristic composites built from other primitives; treat them as candidate signals, not verdicts.

Primitive Kind Description
network.is_likely_bot categorical Heuristic bot detector. likely_bot when conversation_initiation_rate ≥ 0.95 AND attention_pattern = broadcast AND vocabulary_richness < 0.65. Validated (2026-05-03) against SangMata_beta_bot (caught) vs 11 high-volume humans (no false positives). Low-volume bots (e.g. QuotLyBot, 9 messages) sit below the fingerprint threshold. Source label declares heuristic version (e.g. #bot-heuristic-v1).
network.governance_role_signal categorical Heuristic role shape from interaction primitives + lifecycle. admin_pattern = init_rate ≥ 0.80, attention reciprocal, non-bot, non-arrival_burst. responder_pattern = init_rate ≤ 0.45, attention reciprocal. bot_pattern = matches is_likely_bot. regular = everything else above volume threshold. Empirically caught 4/4 high-volume Rutify admins, sebaImlI as responder, SangMata as bot. NOT a ground-truth admin label.
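The two heuristics above reduce to a handful of threshold checks. The sketch below encodes them as stated in the table; the function names and argument order are assumptions, and the volume threshold is assumed to have been applied by the caller because its value is not given here.

```python
def is_likely_bot(init_rate: float, attention: str, vocab_richness: float) -> bool:
    """True when all three bot conditions from the table hold."""
    return init_rate >= 0.95 and attention == "broadcast" and vocab_richness < 0.65


def governance_role(
    init_rate: float,
    attention: str,
    vocab_richness: float,
    lifecycle_phase: str,
) -> str:
    """Classify an actor that is already above the volume threshold."""
    if is_likely_bot(init_rate, attention, vocab_richness):
        return "bot_pattern"
    reciprocal = attention == "reciprocal"
    if init_rate >= 0.80 and reciprocal and lifecycle_phase != "arrival_burst":
        return "admin_pattern"
    if init_rate <= 0.45 and reciprocal:
        return "responder_pattern"
    return "regular"


print(governance_role(0.85, "reciprocal", 0.80, "stable_member"))  # admin_pattern
```

As written, the three named patterns are mutually exclusive: the bot rule requires a broadcast attention pattern while the other two require a reciprocal one, and the admin and responder initiation-rate ranges do not overlap. The order of the checks therefore does not change the result.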

interaction.* — Messaging behavior (6 primitives)

Interaction primitives characterize how the actor participates in conversations — timing, initiation rate, and attention patterns.

Primitive Kind Description
interaction.response_latency_class categorical How quickly the actor responds to messages directed at them. immediate <30s (suggests active monitoring or automation). fast 30s-5min. normal 5-60min. slow 1-24hr. sporadic = no consistent pattern.
interaction.conversation_initiation_rate numeric [0,1] Thread-starting messages / total messages. High rate = the actor drives conversations.
interaction.message_burst_rate categorical Whether the actor sends multiple messages per turn. habitual = almost always bursts (3+ messages before any reply). single = almost always one message per turn. Closely related to the multi_line value of stylometric.linebreak_style.
interaction.active_hours_class free_string UTC active-hours window summary (e.g. 05:00-14:00 UTC). Free string — the window shape varies by actor and doesn't fit a closed enum.
interaction.session_duration_class categorical Typical session length: short <15min, medium 15-90min, long 90min-4hr, marathon >4hr. Shares the enum with behave_shell's temporal.session_duration.
interaction.attention_pattern categorical Reply-graph centrality shape. broadcast = sends to many, replies to few (one-to-many). focused = concentrates on a small set of interlocutors. reciprocal = balanced give-and-take.
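The closed-enum timing primitives above are simple bucketing rules. A minimal sketch follows, assuming half-open intervals at the shared boundaries (for example, exactly 30 seconds counts as fast). The sporadic latency class depends on the consistency of the pattern rather than on a single number, so it is not modelled, and latencies above 24 hours return None because the table does not assign them a class. Function names are hypothetical.

```python
from typing import Optional

MINUTE = 60
HOUR = 60 * MINUTE


def response_latency_class(typical_latency_s: float) -> Optional[str]:
    """Map a typical response latency in seconds to a latency class."""
    if typical_latency_s < 30:
        return "immediate"
    if typical_latency_s < 5 * MINUTE:
        return "fast"
    if typical_latency_s < 60 * MINUTE:
        return "normal"
    if typical_latency_s <= 24 * HOUR:
        return "slow"
    return None  # not covered by the table


def session_duration_class(typical_session_s: float) -> str:
    """Map a typical session length in seconds to a duration class."""
    if typical_session_s < 15 * MINUTE:
        return "short"
    if typical_session_s < 90 * MINUTE:
        return "medium"
    if typical_session_s <= 4 * HOUR:
        return "long"
    return "marathon"


print(response_latency_class(12.0), session_duration_class(2 * HOUR))
```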

content.* — Content-derived signals, EXPERIMENTAL (6 primitives)

Content primitives are derived from message text through classifiers rather than structural/timing analysis. They carry the highest risk of false positives, are brittle to vocabulary drift, and are locale-specific. An attribution engine may choose to weight these at zero until field-validated against labeled data.

Primitive Kind Description
content.role_signal categorical Locale-tuned role-vocabulary classifier. Values: admin, seller, buyer, lurker, newbie. May be moved to a separate IOC/keyword-detection layer after Rutify testing. EXPERIMENTAL
content.transactional_language numeric [0,1] Rate of transactional terms per message. Locale-specific; brittle to vocabulary drift. EXPERIMENTAL
content.opsec_awareness numeric [0,1] Rate of security-conscious phrases. HIGH FALSE-POSITIVE RISK on casual conversation about deleting files/messages. EXPERIMENTAL
content.targeting_language array[free_string] IOC-shaped target patterns (bank names, government portals, RUT ranges). Consider moving to a dedicated IOC layer. EXPERIMENTAL
content.boasting_pattern categorical Success-claim frequency: none, occasional, frequent. Corpus-dependent regex. EXPERIMENTAL
content.conflict_style categorical Dispute-tone classification: aggressive, defusing, appellate. Needs labelled training data. EXPERIMENTAL

Schema

Machine-readable JSON Schema: json/observation.schema.json

Regenerate after model changes:

python scripts/generate_schema.py

Tests

pytest tests/

Attribution recipes

attribution-recipes.md — placeholder document sketching how an external attribution engine would consume actor.observation.text.* topics to build actor profiles (credential_broker, low_skill_buyer, group_admin, etc.). Not populated yet — awaiting Rutify corpus calibration. Not part of the BEHAVE spec.

License

Code and schemas: GPL-3.0-or-later
Spec prose (this file, attribution-recipes.md): CC-BY-SA-4.0
