BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core

behave-text

Text/messaging-domain behavioral observation registry. Defines what can be observed about an actor through their written messaging activity — stylometric fingerprints, lexical patterns, interaction rhythms, and governance-role signals.

BEHAVE-TEXT operates on derived features, not raw text. Sensors hash, aggregate, and classify before emitting — the raw message content never enters a BEHAVE observation. This is a tighter constraint than BEHAVE-SHELL because the source signal is text content; the PII risk is higher.
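
As a rough illustration of the "derive first, emit second" rule, the sketch below reduces a batch of messages to one categorical value and one digest. The function name `derive_features`, the two-way capitalization split, and the choice of SHA-256 over the sorted set of punctuation marks are assumptions made for this example, not the package's sensor code.

```python
import hashlib


def derive_features(messages: list[str]) -> dict[str, str]:
    """Hypothetical sensor step: classify and hash, then discard the text."""
    letters = [ch for msg in messages for ch in msg if ch.isalpha()]
    # Distinct non-alphanumeric, non-space characters, in a canonical order.
    marks = sorted({ch for msg in messages for ch in msg
                    if not ch.isalnum() and not ch.isspace()})
    all_lower = bool(letters) and all(ch.islower() for ch in letters)
    return {
        # Simplified two-way split (assumption); the registry defines more values.
        "capitalization": "lowercase" if all_lower else "other",
        "punctuation_hash": hashlib.sha256("".join(marks).encode("utf-8")).hexdigest(),
    }


features = derive_features(["hola, todo bien?", "si... nos vemos"])
print(features["capitalization"])  # lowercase
```

Only the returned dictionary would leave the sensor; the message strings themselves never appear in it.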

The topic prefix is actor.observation.text (not attacker.) because chat groups include non-attacker roles — admins, buyers, sellers, bots, lurkers. The framing is deliberately neutral: BEHAVE-TEXT observes actors, not adversaries.

Install

pip install behave-text

For local development:

pip install -e ../core/ -e ".[dev]"

Quickstart

from behave_text.spec import Observation, Window, TOPIC_PREFIX, event_topic_for

obs = Observation(
    primitive="stylometric.capitalization_habit",
    value="lowercase",
    confidence=0.91,
    window=Window(start_ts=1714000000.0, end_ts=1714086400.0),
    source="behave/text-sensor/stylometry.py",
)
topic = event_topic_for("stylometric.capitalization_habit")
# → "actor.observation.text.stylometric.capitalization_habit"

Public API (behave_text.spec)

| Symbol | Description |
| --- | --- |
| Observation | Registry-aware subclass of behave_core.spec.Observation. Validates primitive and value against PRIMITIVE_REGISTRY. |
| Window | Re-exported from behave_core. |
| ObservationValue | Re-exported union type. |
| PRIMITIVE_REGISTRY | dict[str, ValueTypeSpec]: the full primitive catalog (43 entries). |
| ValueKind | Enum: CATEGORICAL, NUMERIC, HASH, ARRAY, FREE_STRING, BOOL. |
| ValueTypeSpec | Pydantic model: kind, allowed values, bounds, notes. |
| is_known(primitive) | bool: whether a primitive path is registered. |
| get(primitive) | Returns the ValueTypeSpec; raises KeyError if unknown. |
| TOPIC_PREFIX | "actor.observation.text" |
| event_topic_for(primitive) | Returns the full event bus topic string. |
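
To make the idea of registry-backed validation concrete, here is a standalone sketch that does not import behave_text. The names `Kind`, `Spec`, `SKETCH_REGISTRY`, and `value_fits` are hypothetical, as are the field names; the real ValueTypeSpec is a Pydantic model whose attributes may differ. Only the two example entries and their allowed values and bounds come from the primitive tables in this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Kind(Enum):
    CATEGORICAL = "categorical"
    NUMERIC = "numeric"


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for a value-type spec (field names assumed)."""
    kind: Kind
    allowed: Tuple[str, ...] = ()
    bounds: Optional[Tuple[float, float]] = None


SKETCH_REGISTRY = {
    "stylometric.emoji_usage": Spec(
        Kind.CATEGORICAL, allowed=("none", "occasional", "frequent", "exclusive")
    ),
    "interaction.conversation_initiation_rate": Spec(Kind.NUMERIC, bounds=(0.0, 1.0)),
}


def value_fits(primitive: str, value: object) -> bool:
    """Check a value against the sketch registry; unknown primitives raise KeyError."""
    spec = SKETCH_REGISTRY[primitive]
    if spec.kind is Kind.CATEGORICAL:
        return value in spec.allowed
    # Numeric branch: reject non-numbers (and bools, which are ints in Python).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    if spec.bounds is None:
        return True
    low, high = spec.bounds
    return low <= value <= high


print(value_fits("stylometric.emoji_usage", "frequent"))  # True
```

Raising KeyError on an unknown path mirrors the behavior the table describes for get(primitive); everything else here is an illustration of the concept rather than the library's logic.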

Note: to_event_payload / from_event_payload (full round-trip helpers) are present in behave-shell but not yet implemented here — status: planned.

Primitives

43 primitives across 7 categories.


meta.* — Corpus-snapshot footprint (8 primitives)

Meta primitives describe the actor's presence in the corpus window itself — how many messages, how long a span, how densely distributed. They are not stylometric features; they are the scaffolding that other primitives assume. Several primitives (notably temporal_evolution.lifecycle_phase) implicitly depend on these quantities; meta.* makes them first-class so downstream attribution engines can access and weight them explicitly.

| Primitive | Kind | Description |
| --- | --- | --- |
| meta.total_messages | numeric | Raw message count for this actor in the corpus snapshot. Anchor for msg_per_day and fingerprint_confidence. |
| meta.corpus_span_days | numeric | Wall-clock fractional days between first and last message. First-to-last only, so blind to gaps: a 47-day span with 5 active days still yields 47. Recomputable from first_seen_ts / last_seen_ts. |
| meta.msg_per_day | numeric | total_messages / corpus_span_days. Separates bursty visitors (53 msgs / 0.3 days ≈ 177/day) from long-tail lurkers (53 msgs / 47 days ≈ 1.1/day). Undefined when span = 0; extractors emit null/omit rather than divide by zero. |
| meta.active_days | numeric | Distinct calendar days (UTC) with ≥1 message. Bounded by the span: at most ⌈corpus_span_days⌉ + 1, since a fractional span can touch one extra calendar day. Distinguishes a periodic visitor (span=47, active=3) from a near-daily regular (span=47, active=40). |
| meta.activity_density | numeric [0,1] | active_days / corpus_span_days. 1.0 = present every day of the window. Near-0 = appeared once or twice across a long window. Undefined when span = 0; emit null/omit for single-day actors. |
| meta.first_seen_ts | free_string | ISO 8601 timestamp (UTC offset) of the actor's earliest message. Anchors corpus_span_days in absolute time for cross-extraction comparison. |
| meta.last_seen_ts | free_string | ISO 8601 timestamp (UTC offset) of the actor's latest message. See first_seen_ts. |
| meta.fingerprint_confidence | categorical | Qualitative reliability of this actor's full fingerprint: low, medium, high. Attribution engines should weight all other observations by this before compositing. Derivation is extractor-defined: extractors declare their heuristic in the source label (e.g. #confidence-v1). |
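
The arithmetic behind the numeric meta.* rows can be sketched from a list of POSIX timestamps. The function name `meta_footprint` and the dictionary layout are assumptions for this example. Clamping activity_density to 1.0 is also an assumption: the stated range is [0,1], but a short span that crosses midnight would otherwise exceed it.

```python
from datetime import datetime, timezone


def meta_footprint(timestamps: list[float]) -> dict[str, object]:
    """Derive meta.* style quantities from message timestamps (seconds, UTC)."""
    first, last = min(timestamps), max(timestamps)
    span_days = (last - first) / 86400.0
    # Distinct UTC calendar days that contain at least one message.
    active_days = len({
        datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in timestamps
    })
    total = len(timestamps)
    return {
        "total_messages": total,
        "corpus_span_days": span_days,
        "active_days": active_days,
        # Undefined for a zero span: emit None instead of dividing by zero.
        "msg_per_day": total / span_days if span_days > 0 else None,
        # Clamp to 1.0 (assumption, see lead-in).
        "activity_density": min(1.0, active_days / span_days) if span_days > 0 else None,
        "first_seen_ts": datetime.fromtimestamp(first, tz=timezone.utc).isoformat(),
        "last_seen_ts": datetime.fromtimestamp(last, tz=timezone.utc).isoformat(),
    }


base = 1714003200  # 2024-04-25T00:00:00Z
result = meta_footprint([base + 3600, base + 7200, base + 4 * 86400 + 3600])
print(result["msg_per_day"], result["activity_density"])  # 0.75 0.5
```

Three messages over a 4-day span with two active days give 0.75 messages per day and a density of 0.5; a single-message actor yields None for both ratios.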

stylometric.* — Writing style fingerprints (12 primitives)

Stylometric primitives capture the unconscious writing habits that distinguish one author from another. The field goes back to the Mosteller-Wallace Federalist Papers study (1963): function-word frequencies alone can attribute authorship with high accuracy in long-form English text. BEHAVE-TEXT adapts these methods to short-form Spanish chat, which introduces domain-specific challenges (short messages, informal register, code-switching, emoji). Calibration results from the Rutify corpus are noted inline where they affect interpretation.

| Primitive | Kind | Description |
| --- | --- | --- |
| stylometric.punctuation_style | hash | Canonical punctuation-pattern fingerprint hash. Captures the author's consistent punctuation tics (double spaces, comma habits, no-period endings) as a searchable signature. |
| stylometric.capitalization_habit | categorical | Dominant capitalization rule. lowercase = no capitals. proper = standard sentence/title case. random_caps = no consistent rule. mixed_i = consistent lowercase 'i' mid-sentence, common in Spanish chat where the standalone-'I' habit doesn't apply but the behavior transfers. |
| stylometric.emoji_usage | categorical | Rate of emoji use. none, occasional, frequent, exclusive (messages rarely without emoji). Captures tone and register. |
| stylometric.emoji_placement | categorical | Emoji position relative to sentence-ending punctuation. pre_punctuation = 'Hola 😊.' post_punctuation = 'Hola. 😊' Individual authors are strikingly consistent in this micro-habit. |
| stylometric.message_length_class | categorical | Median message length bucket: short 1-5 words, medium 6-20, long 21-50, paragraph >50. See also message_length_variance_class for distribution shape. |
| stylometric.message_length_variance_class | categorical | Distribution shape of per-message word counts. tight CV<0.5 (always 1-3 words). varied 0.5≤CV<1.5 (normal mix). bimodal CV≥1.5 (mostly short with occasional rants). Two authors can share the same median length but have wildly different variance. |
| stylometric.linebreak_style | categorical | Whether the author sends one complete thought per message or bursts multiple short sequential messages. multi_line = habitual 3-5 short messages per turn. wall_of_text = dense blocks, rarely uses line breaks. Captures a stylistic rhythm that is hard to consciously alter. |
| stylometric.typo_signature | hash | SHA-256 of the canonical persistent-typo set: the specific recurring errors the author makes consistently (e.g. always writes tener as tenet, or porque as xq). Persistent typos are strong authorship signals because they reflect keyboard-motor habits. |
| stylometric.function_word_distribution_top50 | hash | 64-bit SimHash over the frequency vector of the 50 most common Spanish function words. Based on the Mosteller-Wallace method. Calibration note (2026-05-02, Rutify corpus): within-author and cross-author Hamming distance distributions overlap (within median 8 bits, cross median 10 bits) in short-message chat, so this primitive alone cannot discriminate authors. Engines should weight it low and composite with character n-grams and distinctive vocabulary. Kept in v0 for calibration grids. |
| stylometric.function_word_distribution_top200 | hash | 64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like tampoco, aunque, mientras) that carry more discriminating signal in short-message corpora. Not yet emitted by the v0 prototype; populated in v0.2. |
| stylometric.character_ngram_simhash | hash | 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. #char3gram). |
| stylometric.distinctive_vocabulary_signature | hash | 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon: words they use that other authors in the same corpus do not. Complementary to function-word distributions: where function_word_* captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. #tfidf-top50). |
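
Several rows above rely on a 64-bit SimHash compared by Hamming distance. The sketch below shows the general technique over lowercased character trigrams with accents left intact. The per-feature hash (SHA-256 truncated to 64 bits) and the function names are assumptions; the package's extractors may hash features differently.

```python
import hashlib
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of the lowercased text (accents preserved)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simhash64(weights: dict[str, float]) -> int:
    """Weighted SimHash: each feature votes +w or -w on every one of 64 bits."""
    acc = [0.0] * 64
    for feature, weight in weights.items():
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # assumed 64-bit feature hash
        for bit in range(64):
            acc[bit] += weight if (h >> bit) & 1 else -weight
    # A bit is set in the fingerprint when its accumulated vote is positive.
    return sum(1 << bit for bit in range(64) if acc[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


fp_a = simhash64(char_ngrams("ya po, dime cuánto sale"))
fp_b = simhash64(char_ngrams("ya po, dime cuanto vale"))
print(hamming(fp_a, fp_b))
```

An attribution engine would compare such distances against within-author and cross-author baselines, as the calibration note for function_word_distribution_top50 describes, rather than read a single distance in isolation.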

lexical.* — Vocabulary and linguistic patterns (8 primitives)

Lexical primitives characterize what and how an actor writes at the word and sentence level. Where stylometric primitives fingerprint unconscious micro-habits, lexical primitives capture deliberate linguistic choices — vocabulary richness, how questions are formed, register.

| Primitive | Kind | Description |
| --- | --- | --- |
| lexical.vocabulary_richness | numeric [0,1] | Moving-Average Type-Token Ratio (MATTR) over a sliding window (default 50 tokens). Volume-independent: each window contributes its own unique/total ratio, the value is the mean. Avoids the standard TTR bias where larger corpora mechanically score lower. Source label declares window size. |
| lexical.slang_density | numeric [0,1] | Rate of slang terms per message, against a locale-tuned slang corpus. |
| lexical.code_switching_rate | numeric [0,1] | Language switches per N tokens (Solorio & Liu metric). A speaker who switches between Spanish and English, or Spanish and lunfardo/caló, will have a higher rate than a monolingual writer. |
| lexical.code_switching_matrix_language | free_string | BCP-47 tag of the dominant (matrix) language in code-switching texts (e.g. es-CL, es-AR). The matrix language is the grammatical scaffold; embedded languages appear as inserts. |
| lexical.code_switching_embedded_languages | array[free_string] | BCP-47 list of non-matrix languages observed in the actor's messages. |
| lexical.sentence_complexity_class | categorical | Dominant clause structure. simple = single-clause. compound = two independent clauses joined by coordinating conjunctions (pero, y, o). complex = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment. |
| lexical.question_formation_style | categorical | How questions are formed. punctuation_only = question mark without interrogative words ('¿Cuánto?'), very common in Spanish chat. lexical = explicit interrogatives (¿qué, cómo, cuándo). formal = inverted subject-verb or formal register. |
| lexical.imperative_style | categorical | How commands and requests are framed. informal_directive = tú/vos imperative (dame, hazlo). formal_directive = usted imperative (hágame el favor). polite = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts. |
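
The MATTR computation described for lexical.vocabulary_richness can be sketched as follows. The fallback for token lists shorter than the window (a single ratio over the whole list) and the None return for empty input are assumptions, since the table does not specify those edge cases.

```python
from typing import Optional


def mattr(tokens: list[str], window: int = 50) -> Optional[float]:
    """Moving-Average Type-Token Ratio: mean unique/total ratio per window."""
    if not tokens:
        return None
    if len(tokens) <= window:
        # Assumed fallback: one window covering everything.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


tokens = "si po si po si po".split()
plain_ttr = len(set(tokens)) / len(tokens)
print(mattr(tokens, window=2), round(plain_ttr, 3))  # 1.0 0.333
```

The plain type-token ratio drops as the same two words repeat, while the windowed mean stays at 1.0 because every 2-token window contains two distinct words. This is the volume independence the description refers to.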

temporal_evolution.* — Behavioral change over time (1 primitive)

| Primitive | Kind | Description |
| --- | --- | --- |
| temporal_evolution.lifecycle_phase | categorical | Auto-classified lifecycle stage from windowed within-corpus analysis. arrival_burst = first 24hr, first-window volume dominates (empirically validated against OxPayload's first 12 hours in Rutify). stable_member = low drift across the full tenure. fluctuating_member = tenure ≥24hr with median drift between stable and inflection thresholds: established noisy regulars (e.g. lamarabitch). inflection_member = long-tenure actor with a real behavioral shift in at least one window-pair. declining_member = monotonically decreasing per-window message counts. unknown = insufficient data. Window size adapts to tenure: <24hr → 2h, <7d → 12h, <30d → 1d, otherwise 7d. |
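
The tenure-adaptive window rule and the declining_member test lend themselves to a short sketch. The function names are hypothetical, and reading "monotonically decreasing" as a strict decrease over at least two windows is an assumption. The drift thresholds for the other phases are not stated in this document and are left out.

```python
def window_hours(tenure_hours: float) -> int:
    """Window width by tenure: <24hr -> 2h, <7d -> 12h, <30d -> 1d, else 7d."""
    if tenure_hours < 24:
        return 2
    if tenure_hours < 7 * 24:
        return 12
    if tenure_hours < 30 * 24:
        return 24
    return 7 * 24


def per_window_counts(timestamps: list[float]) -> list[int]:
    """Bucket message timestamps (seconds) into tenure-adapted windows."""
    start, end = min(timestamps), max(timestamps)
    width = window_hours((end - start) / 3600) * 3600
    counts = [0] * (int((end - start) // width) + 1)
    for ts in timestamps:
        counts[int((ts - start) // width)] += 1
    return counts


def is_declining(counts: list[int]) -> bool:
    """Strict decrease across at least two windows (strictness assumed)."""
    return len(counts) >= 2 and all(a > b for a, b in zip(counts, counts[1:]))


counts = per_window_counts([0, 1800, 3600, 9000, 10800, 14760])
print(counts, is_declining(counts))  # [3, 2, 1] True
```

With a tenure of 4.1 hours the window is 2 hours wide, so the six messages fall into three windows with counts 3, 2, and 1.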

network.* — Governance and role signals (2 primitives)

Network primitives capture the actor's structural role in the group — inferred from interaction patterns rather than content — and a bot detector. These are heuristic composites built from other primitives; treat them as candidate signals, not verdicts.

| Primitive | Kind | Description |
| --- | --- | --- |
| network.is_likely_bot | categorical | Heuristic bot detector. likely_bot when conversation_initiation_rate ≥ 0.95 AND attention_pattern = broadcast AND vocabulary_richness < 0.65. Validated (2026-05-03) against SangMata_beta_bot (caught) vs 11 high-volume humans (no false positives). Low-volume bots (e.g. QuotLyBot, 9 messages) sit below the fingerprint threshold. Source label declares heuristic version (e.g. #bot-heuristic-v1). |
| network.governance_role_signal | categorical | Heuristic role shape from interaction primitives + lifecycle. admin_pattern = init_rate ≥ 0.80, attention reciprocal, non-bot, non-arrival_burst. responder_pattern = init_rate ≤ 0.45, attention reciprocal. bot_pattern = matches is_likely_bot. regular = everything else above volume threshold. Empirically caught 4/4 high-volume Rutify admins, sebaImlI as responder, SangMata as bot. NOT a ground-truth admin label. |
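
The two heuristics above compose existing primitives with fixed thresholds, which the following sketch spells out. The function signatures, the boolean return for the bot check, and the order in which the rules are tried are assumptions. The volume threshold is omitted because its value is not given here.

```python
def is_likely_bot(init_rate: float, attention: str, vocab_richness: float) -> bool:
    """Bot heuristic: near-total initiation, broadcast attention, low richness."""
    return init_rate >= 0.95 and attention == "broadcast" and vocab_richness < 0.65


def governance_role(init_rate: float, attention: str,
                    vocab_richness: float, lifecycle: str) -> str:
    """Role shape from interaction primitives plus lifecycle (rule order assumed)."""
    if is_likely_bot(init_rate, attention, vocab_richness):
        return "bot_pattern"
    if (init_rate >= 0.80 and attention == "reciprocal"
            and lifecycle != "arrival_burst"):
        return "admin_pattern"
    if init_rate <= 0.45 and attention == "reciprocal":
        return "responder_pattern"
    # Everything else; the real rule also requires a minimum message volume.
    return "regular"


print(governance_role(0.85, "reciprocal", 0.80, "stable_member"))  # admin_pattern
```

As the section notes, the output is a candidate signal: an actor matching admin_pattern behaves like the admins seen during calibration, which is not the same as holding admin rights.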

interaction.* — Messaging behavior (6 primitives)

Interaction primitives characterize how the actor participates in conversations — timing, initiation rate, and attention patterns.

| Primitive | Kind | Description |
| --- | --- | --- |
| interaction.response_latency_class | categorical | How quickly the actor responds to messages directed at them. immediate <30s (suggests active monitoring or automation). fast 30s-5min. normal 5-60min. slow 1-24hr. sporadic = no consistent pattern. |
| interaction.conversation_initiation_rate | numeric [0,1] | Thread-starting messages / total messages. High rate = the actor drives conversations. |
| interaction.message_burst_rate | categorical | Whether the actor sends multiple messages per turn. habitual = almost always bursts (3+ messages before any reply). single = almost always one message per turn. Tied to stylometric.linebreak_style multi_line. |
| interaction.active_hours_class | free_string | UTC active-hours window summary (e.g. 05:00-14:00 UTC). Free string, because the window shape varies by actor and doesn't fit a closed enum. |
| interaction.session_duration_class | categorical | Typical session length: short <15min, medium 15-90min, long 90min-4hr, marathon >4hr. Shares the enum with behave_shell's temporal.session_duration. |
| interaction.attention_pattern | categorical | Reply-graph centrality shape. broadcast = sends to many, replies to few (one-to-many). focused = concentrates on a small set of interlocutors. reciprocal = balanced give-and-take. |
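
Two of the interaction primitives reduce to simple arithmetic, sketched below. The function names are hypothetical. The sketch covers only the four time-bounded latency buckets: sporadic describes inconsistency, which a single latency value cannot express, and leaving latencies beyond 24 hours in slow is an assumption.

```python
from bisect import bisect_right
from typing import Optional

# Bucket edges in seconds: 30 s, 5 min, 60 min (lower bounds are inclusive).
LATENCY_EDGES = [30, 300, 3600]
LATENCY_LABELS = ["immediate", "fast", "normal", "slow"]


def response_latency_class(seconds: float) -> str:
    """Map a representative response latency to its bucket label."""
    return LATENCY_LABELS[bisect_right(LATENCY_EDGES, seconds)]


def conversation_initiation_rate(thread_starts: int, total: int) -> Optional[float]:
    """Thread-starting messages / total messages; None when there are no messages."""
    return thread_starts / total if total else None


print(response_latency_class(45), conversation_initiation_rate(3, 12))  # fast 0.25
```

Using bisect_right makes each lower bound inclusive, so exactly 30 seconds lands in fast and exactly 5 minutes lands in normal, matching the "30s-5min" and "5-60min" ranges as written.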

content.* — Content-derived signals, EXPERIMENTAL (6 primitives)

Content primitives are derived from message text through classifiers rather than structural/timing analysis. They carry the highest risk of false positives, are brittle to vocabulary drift, and are locale-specific. An attribution engine may choose to weight these at zero until field-validated against labeled data.

| Primitive | Kind | Description |
| --- | --- | --- |
| content.role_signal | categorical | Locale-tuned role-vocabulary classifier. Values: admin, seller, buyer, lurker, newbie. May be moved to a separate IOC/keyword-detection layer after Rutify testing. EXPERIMENTAL |
| content.transactional_language | numeric [0,1] | Rate of transactional terms per message. Locale-specific; brittle to vocabulary drift. EXPERIMENTAL |
| content.opsec_awareness | numeric [0,1] | Rate of security-conscious phrases. HIGH FALSE-POSITIVE RISK on casual conversation about deleting files/messages. EXPERIMENTAL |
| content.targeting_language | array[free_string] | IOC-shaped target patterns (bank names, government portals, RUT ranges). Consider moving to a dedicated IOC layer. EXPERIMENTAL |
| content.boasting_pattern | categorical | Success-claim frequency: none, occasional, frequent. Corpus-dependent regex. EXPERIMENTAL |
| content.conflict_style | categorical | Dispute-tone classification: aggressive, defusing, appellate. Needs labelled training data. EXPERIMENTAL |

Schema

Machine-readable JSON Schema: json/observation.schema.json

Regenerate after model changes:

python scripts/generate_schema.py

Tests

pytest tests/

Attribution recipes

attribution-recipes.md — placeholder document sketching how an external attribution engine would consume actor.observation.text.* topics to build actor profiles (credential_broker, low_skill_buyer, group_admin, etc.). Not populated yet — awaiting Rutify corpus calibration. Not part of the BEHAVE spec.

License

Code and schemas: GPL-3.0-or-later

Spec prose (this file, attribution-recipes.md): CC-BY-SA-4.0
