BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core
behave-text
Text/messaging-domain behavioral observation registry. Defines what can be observed about an actor through their written messaging activity — stylometric fingerprints, lexical patterns, interaction rhythms, and governance-role signals.
BEHAVE-TEXT operates on derived features, not raw text. Sensors hash, aggregate, and classify before emitting — the raw message content never enters a BEHAVE observation. This is a tighter constraint than BEHAVE-SHELL because the source signal is text content; the PII risk is higher.
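The hash-before-emit rule can be illustrated with a small sketch. It assumes a sensor has already collected an author's recurring typos in memory; the function name `typo_signature` and the canonical form (lowercased pairs, sorted, compact JSON) are illustrative assumptions, not the package's canonicalization.

```python
import hashlib
import json


def typo_signature(persistent_typos: dict[str, str]) -> str:
    """Return a SHA-256 hex digest of a canonicalized typo set.

    Keys are intended spellings, values are the author's recurring
    misspellings. The canonical form used here is an assumption.
    """
    # Sort the lowercased pairs so the digest does not depend on input order.
    canonical = sorted((k.lower(), v.lower()) for k, v in persistent_typos.items())
    payload = json.dumps(canonical, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Only the digest would leave the sensor; the typo pairs stay local.
digest = typo_signature({"tener": "tenet", "porque": "xq"})
```

The digest is stable under reordering of the input, which is what makes it usable as a searchable signature across extractions.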
The topic prefix is actor.observation.text (not attacker.) because chat groups
include non-attacker roles — admins, buyers, sellers, bots, lurkers. The framing
is deliberately neutral: BEHAVE-TEXT observes actors, not adversaries.
Install
pip install behave-text
For local development:
pip install -e ../core/ -e ".[dev]"
Quickstart
from behave_text.spec import Observation, Window, TOPIC_PREFIX, event_topic_for

obs = Observation(
    primitive="stylometric.capitalization_habit",
    value="lowercase",
    confidence=0.91,
    window=Window(start_ts=1714000000.0, end_ts=1714086400.0),
    source="behave/text-sensor/stylometry.py",
)

topic = event_topic_for("stylometric.capitalization_habit")
# → "actor.observation.text.stylometric.capitalization_habit"
Public API (behave_text.spec)
| Symbol | Description |
|---|---|
| Observation | Registry-aware subclass of behave_core.spec.Observation. Validates primitive and value against PRIMITIVE_REGISTRY. |
| Window | Re-exported from behave_core. |
| ObservationValue | Re-exported union type. |
| PRIMITIVE_REGISTRY | dict[str, ValueTypeSpec]: the full primitive catalog (47 entries). |
| ValueKind | Enum: CATEGORICAL, NUMERIC, HASH, ARRAY, FREE_STRING, BOOL. |
| ValueTypeSpec | Pydantic model: kind, allowed values, bounds, notes. |
| is_known(primitive) | bool: whether a primitive path is registered. |
| get(primitive) | Returns the ValueTypeSpec; raises KeyError if unknown. |
| TOPIC_PREFIX | "actor.observation.text" |
| event_topic_for(primitive) | Returns the full event bus topic string. |
Note: to_event_payload / from_event_payload (full round-trip helpers) are
present in behave-shell but not yet implemented here — status: planned.
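To make the registry semantics concrete without installing the package, here is a toy stand-in. Everything prefixed `toy_` or `Toy` is hypothetical and is not the behave_text.spec implementation; only the TOPIC_PREFIX value, the KeyError behavior of get, and the capitalization_habit values come from this document.

```python
from dataclasses import dataclass

TOPIC_PREFIX = "actor.observation.text"  # value documented in this README


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical, much-reduced stand-in for ValueTypeSpec."""
    kind: str
    allowed: frozenset = frozenset()


# Two entries only; the real catalog has 47.
TOY_REGISTRY = {
    "stylometric.capitalization_habit": ToySpec(
        "categorical",
        frozenset({"lowercase", "proper", "random_caps", "mixed_i"}),
    ),
    "meta.total_messages": ToySpec("numeric"),
}


def toy_is_known(primitive: str) -> bool:
    return primitive in TOY_REGISTRY


def toy_get(primitive: str) -> ToySpec:
    # Plain dict lookup, so an unknown primitive raises KeyError.
    return TOY_REGISTRY[primitive]


def toy_topic_for(primitive: str) -> str:
    # Assumption: the topic is simply the prefix joined to the primitive path.
    return f"{TOPIC_PREFIX}.{primitive}"


def toy_validate(primitive: str, value) -> bool:
    spec = toy_get(primitive)
    if spec.kind == "categorical":
        return value in spec.allowed
    if spec.kind == "numeric":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return False
```

The joined-prefix assumption matches the Quickstart output for stylometric.capitalization_habit; whether the real helper also rejects unknown primitives is not stated here.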
Primitives
47 primitives across 7 categories.
meta.* — Corpus-snapshot footprint (8 primitives)
Meta primitives describe the actor's presence in the corpus window itself —
how many messages, how long a span, how densely distributed. They are not
stylometric features; they are the scaffolding that other primitives assume.
Several primitives (notably temporal_evolution.lifecycle_phase) implicitly
depend on these quantities; meta.* makes them first-class so downstream
attribution engines can access and weight them explicitly.
| Primitive | Kind | Description |
|---|---|---|
| meta.total_messages | numeric | Raw message count for this actor in the corpus snapshot. Anchor for msg_per_day and fingerprint_confidence. |
| meta.corpus_span_days | numeric | Wall-clock fractional days between first and last message. First-to-last only, so blind to gaps. A 47-day span with 5 active days still yields 47. Recomputable from first_seen_ts / last_seen_ts. |
| meta.msg_per_day | numeric | total_messages / corpus_span_days. Separates bursty visitors (53 msgs / 0.3 days ≈ 177/day) from long-tail lurkers (53 msgs / 47 days ≈ 1.1/day). Undefined when span = 0; extractors emit null/omit rather than divide-by-zero. |
| meta.active_days | numeric | Distinct calendar days (UTC) with ≥1 message. Always ≤ corpus_span_days. Distinguishes a periodic visitor (span=47, active=3) from a near-daily regular (span=47, active=40). |
| meta.activity_density | numeric [0,1] | active_days / corpus_span_days. 1.0 = present every day of the window. Near-0 = appeared once or twice across a long window. Undefined when span = 0; emit null/omit for single-day actors. |
| meta.first_seen_ts | free_string | ISO 8601 timestamp (UTC offset) of the actor's earliest message. Anchors corpus_span_days in absolute time for cross-extraction comparison. |
| meta.last_seen_ts | free_string | ISO 8601 timestamp (UTC offset) of the actor's latest message. See first_seen_ts. |
| meta.fingerprint_confidence | categorical | Qualitative reliability of this actor's full fingerprint: low, medium, high. Attribution engines should weight all other observations by this before compositing. Derivation is extractor-defined: extractors declare their heuristic in the source label (e.g. #confidence-v1). |
stylometric.* — Writing style fingerprints (13 primitives)
Stylometric primitives capture the unconscious writing habits that distinguish one author from another. The field goes back to the Mosteller-Wallace Federalist Papers study (1963): function-word frequencies alone can attribute authorship with high accuracy in long-form English text. BEHAVE-TEXT adapts these methods to short-form Spanish chat, which introduces domain-specific challenges (short messages, informal register, code-switching, emoji). Calibration results from the Rutify corpus are noted inline where they affect interpretation.
| Primitive | Kind | Description |
|---|---|---|
| stylometric.punctuation_style | hash | Canonical punctuation-pattern fingerprint hash. Captures the author's consistent punctuation tics (double spaces, comma habits, no-period endings) as a searchable signature. |
| stylometric.capitalization_habit | categorical | Dominant capitalization rule. lowercase = no capitals. proper = standard sentence/title case. random_caps = no consistent rule. mixed_i = consistent lowercase 'i' mid-sentence, common in Spanish chat where the standalone-'I' habit doesn't apply but the behavior transfers. |
| stylometric.emoji_usage | categorical | Rate of emoji use. none, occasional, frequent, exclusive (messages rarely without emoji). Captures tone and register. |
| stylometric.emoji_placement | categorical | Emoji position relative to sentence-ending punctuation. pre_punctuation = 'Hola 😊.' post_punctuation = 'Hola. 😊' Individual authors are strikingly consistent in this micro-habit. |
| stylometric.message_length_class | categorical | Median message length bucket: short 1-5 words, medium 6-20, long 21-50, paragraph >50. See also message_length_variance_class for distribution shape. |
| stylometric.message_length_variance_class | categorical | Distribution shape of per-message word counts, measured by the coefficient of variation (CV). tight CV<0.5 (always 1-3 words). varied 0.5≤CV<1.5 (normal mix). bimodal CV≥1.5 (mostly short with occasional rants). Two authors can share the same median length but have wildly different variance. |
| stylometric.linebreak_style | categorical | Whether the author sends one complete thought per message or bursts multiple short sequential messages. multi_line = habitual 3-5 short messages per turn. wall_of_text = dense blocks, rarely uses line breaks. Captures a stylistic rhythm that is hard to consciously alter. |
| stylometric.typo_signature | hash | SHA-256 of the canonical persistent-typo set: the specific recurring errors the author makes consistently (e.g. always writes tener as tenet, or porque as xq). Persistent typos are strong authorship signals because they reflect keyboard-motor habits. |
| stylometric.function_word_distribution_top50 | hash | 64-bit SimHash over the frequency vector of the 50 most common Spanish function words. Based on the Mosteller-Wallace method. Calibration note (2026-05-02, Rutify corpus): within-author and cross-author Hamming distance distributions overlap (within median 8 bits, cross median 10 bits) in short-message chat, so this primitive alone cannot discriminate authors. Engines should weight it low and composite with character n-grams and distinctive vocabulary. Kept in v0 for calibration grids. |
| stylometric.function_word_distribution_top200 | hash | 64-bit SimHash over the frequency vector of the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like tampoco, aunque, mientras) that carry more discriminating signal in short-message corpora. Not yet emitted by the v0 prototype; populated in v0.2. |
| stylometric.character_ngram_simhash | hash | 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. #char3gram). |
| stylometric.distinctive_vocabulary_signature | hash | 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon: words they use that other authors in the same corpus do not. Complementary to function-word distributions: where function_word_* captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. #tfidf-top50). |
| stylometric.pos_ngram_signature | hash | 64-bit SimHash over a POS n-gram (default bigram) frequency vector. Captures syntactic skeleton independent of vocabulary: an author can change every word and retain the same grammatical fingerprint. Orthogonal to character n-grams and function-word distributions. Tagger-dependent: source label must declare tagger, language model, and n (e.g. #spacy-es_core_news_sm-bi). Calibration note: chat-domain text produces tagger noise; weight low until validated on labelled chat corpora. |
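Several rows above describe a 64-bit SimHash compared by Hamming distance. As a rough illustration of that technique for the character-trigram case, here is a minimal standard-library sketch. The per-feature hash (BLAKE2b truncated to 8 bytes) and the helper names are assumptions; the hash function the prototype uses is not stated in this document.

```python
import hashlib
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count lowercased character n-grams; accents are left untouched."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simhash64(weights: dict) -> int:
    """Fold a weighted feature vector into a 64-bit SimHash."""
    acc = [0.0] * 64
    for feature, weight in weights.items():
        # Assumed feature hash: 8-byte BLAKE2b digest read as an integer.
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            # Each feature votes +weight or -weight on every bit position.
            acc[bit] += weight if (h >> bit) & 1 else -weight
    # A bit is set when the weighted votes for it are positive.
    return sum(1 << bit for bit in range(64) if acc[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


sig_a = simhash64(char_ngrams("hola xq no vienes"))
sig_b = simhash64(char_ngrams("hola xq no vienes"))
print(hamming(sig_a, sig_b))  # identical input, so the distance is 0
```

Similar feature vectors tend to land a few bits apart, which is why the calibration notes above talk in terms of median Hamming distances rather than exact matches.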
lexical.* — Vocabulary and linguistic patterns (11 primitives)
Lexical primitives characterize what and how an actor writes at the word and sentence level. Where stylometric primitives fingerprint unconscious micro-habits, lexical primitives capture deliberate linguistic choices — vocabulary richness, how questions are formed, register.
| Primitive | Kind | Description |
|---|---|---|
| lexical.vocabulary_richness | numeric [0,1] | Moving-Average Type-Token Ratio (MATTR) over a sliding window (default 50 tokens). Volume-independent: each window contributes its own unique/total ratio, the value is the mean. Avoids the standard TTR bias where larger corpora mechanically score lower. Source label declares window size. |
| lexical.slang_density | numeric [0,1] | Rate of slang terms per message, against a locale-tuned slang corpus. |
| lexical.code_switching_rate | numeric [0,1] | Language switches per N tokens (Solorio & Liu metric). A speaker who switches between Spanish and English, or Spanish and lunfardo/caló, will have a higher rate than a monolingual writer. |
| lexical.code_switching_matrix_language | free_string | BCP-47 tag of the dominant (matrix) language in code-switching texts (e.g. es-CL, es-AR). The matrix language is the grammatical scaffold; embedded languages appear as inserts. |
| lexical.code_switching_embedded_languages | array[free_string] | BCP-47 list of non-matrix languages observed in the actor's messages. |
| lexical.sentence_complexity_class | categorical | Dominant clause structure. simple = single-clause. compound = two independent clauses joined by coordinating conjunctions (pero, y, o). complex = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment. |
| lexical.question_formation_style | categorical | How questions are formed. punctuation_only = question mark without interrogative words ('¿Cuánto?'), very common in Spanish chat. lexical = explicit interrogatives (¿qué, cómo, cuándo). formal = inverted subject-verb or formal register. |
| lexical.imperative_style | categorical | How commands and requests are framed. informal_directive = tú/vos imperative (dame, hazlo). formal_directive = usted imperative (hágame el favor). polite = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts. |
| lexical.dialect_region | free_string | Dominant regional variety of the actor's matrix language as a BCP-47 language-region tag (e.g. es-CL, es-AR, es-MX, es-ES, en-US). Detected from lexical marker density against per-region vocabulary tables. Emit literal unknown below confidence threshold. Detection method declared in source label (e.g. #dialect-markers-v1). Complementary to code_switching_matrix_language, which derives language via switching analysis rather than direct marker lookup. |
| lexical.evaluative_morphology_density | numeric [0,1] | Rate of evaluative morpheme tokens / total tokens. Covers Spanish diminutives (-ito/-ita), augmentatives (-ón/-ote), pejoratives (-ejo/-ucho), and intensives (-azo). Heavy diminutive use is characteristic of Mexican/Central American Spanish; River Plate speakers use them significantly less. Stable per-author: baked into language acquisition and hard to consciously suppress. Source label declares morpheme set and tool version (e.g. #eval-morph-es-v1). |
| lexical.optional_grammar_signature | hash | 64-bit SimHash over the author's preference probability vector at optional-grammar choice points. For Spanish: compound vs simple past (he comido vs comí, a high-reliability Spain/LatAm discriminator), subjunctive usage rate, leísmo/laísmo/loísmo clitic patterns, and relative pronoun choice (que vs el cual). Each choice point is a scalar [0,1]; the SimHash is computed over the concatenated vector. Choice-point set is extractor-defined and declared in source label (e.g. #optgrammar-es-v1). Requires sufficient corpus volume for stable probabilities; gate on meta.fingerprint_confidence before use. |
temporal_evolution.* — Behavioral change over time (1 primitive)
| Primitive | Kind | Description |
|---|---|---|
| temporal_evolution.lifecycle_phase | categorical | Auto-classified lifecycle stage from windowed within-corpus analysis. arrival_burst = first 24hr, first-window volume dominates (empirically validated against OxPayload's first 12 hours in Rutify). stable_member = low drift across the full tenure. fluctuating_member = tenure ≥24hr with median drift between stable and inflection thresholds: established noisy regulars (e.g. lamarabitch). inflection_member = long-tenure actor with a real behavioral shift in at least one window-pair. declining_member = monotonically decreasing per-window message counts. unknown = insufficient data. Window size adapts to tenure: <24hr → 2h, <7d → 12h, <30d → 1d, otherwise 7d. |
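The tenure-adaptive window rule at the end of the lifecycle_phase description can be written as a small lookup. This sketch covers only the window sizing, not the drift thresholds (which are not given here); the function name and the use of hours as the unit are assumptions.

```python
def lifecycle_window_hours(tenure_hours: float) -> int:
    """Pick the analysis window size (in hours) from an actor's tenure."""
    if tenure_hours < 24:
        return 2          # <24hr tenure: 2h windows
    if tenure_hours < 7 * 24:
        return 12         # <7d tenure: 12h windows
    if tenure_hours < 30 * 24:
        return 24         # <30d tenure: 1d windows
    return 7 * 24         # otherwise: 7d windows


print(lifecycle_window_hours(12))       # 2
print(lifecycle_window_hours(47 * 24))  # 168
```

Boundaries follow the strict "<" in the table, so an actor with exactly 24 hours of tenure already moves to 12-hour windows.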
network.* — Governance and role signals (2 primitives)
Network primitives cover two things: the actor's structural role in the group, inferred from interaction patterns rather than content, and a heuristic bot detector. Both are composites built from other primitives; treat them as candidate signals, not verdicts.
| Primitive | Kind | Description |
|---|---|---|
| network.is_likely_bot | categorical | Heuristic bot detector. likely_bot when conversation_initiation_rate ≥ 0.95 AND attention_pattern = broadcast AND vocabulary_richness < 0.65. Validated (2026-05-03) against SangMata_beta_bot (caught) vs 11 high-volume humans (no false positives). Low-volume bots (e.g. QuotLyBot, 9 messages) sit below the fingerprint threshold. Source label declares heuristic version (e.g. #bot-heuristic-v1). |
| network.governance_role_signal | categorical | Heuristic role shape from interaction primitives + lifecycle. admin_pattern = init_rate ≥ 0.80, attention reciprocal, non-bot, non-arrival_burst. responder_pattern = init_rate ≤ 0.45, attention reciprocal. bot_pattern = matches is_likely_bot. regular = everything else above volume threshold. Empirically caught 4/4 high-volume Rutify admins, sebaImlI as responder, SangMata as bot. NOT a ground-truth admin label. |
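The two heuristics combine thresholds already listed in the table, so they can be sketched as plain predicates. The function names, the order in which rules are checked, and the boolean return of the bot check are assumptions; the sketch also assumes the caller has already filtered out actors below the volume threshold, whose value is not given here.

```python
def is_likely_bot(init_rate: float, attention: str, vocab_richness: float) -> bool:
    """Bot rule from the table: all three conditions must hold."""
    return init_rate >= 0.95 and attention == "broadcast" and vocab_richness < 0.65


def governance_role(
    init_rate: float, attention: str, vocab_richness: float, lifecycle_phase: str
) -> str:
    """Map interaction primitives plus lifecycle phase to a role shape."""
    # Assumed precedence: the bot rule is checked first.
    if is_likely_bot(init_rate, attention, vocab_richness):
        return "bot_pattern"
    if attention == "reciprocal":
        if init_rate >= 0.80 and lifecycle_phase != "arrival_burst":
            return "admin_pattern"
        if init_rate <= 0.45:
            return "responder_pattern"
    # Everything else (already above the volume threshold) is a regular.
    return "regular"


print(governance_role(0.85, "reciprocal", 0.80, "stable_member"))  # admin_pattern
print(governance_role(0.97, "broadcast", 0.40, "stable_member"))   # bot_pattern
```

As the table stresses, the output is a candidate signal: an admin_pattern result says the interaction shape resembles an admin, not that the actor holds admin rights.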
interaction.* — Messaging behavior (6 primitives)
Interaction primitives characterize how the actor participates in conversations — timing, initiation rate, and attention patterns.
| Primitive | Kind | Description |
|---|---|---|
| interaction.response_latency_class | categorical | How quickly the actor responds to messages directed at them. immediate <30s (suggests active monitoring or automation). fast 30s-5min. normal 5-60min. slow 1-24hr. sporadic = no consistent pattern. |
| interaction.conversation_initiation_rate | numeric [0,1] | Thread-starting messages / total messages. High rate = the actor drives conversations. |
| interaction.message_burst_rate | categorical | Whether the actor sends multiple messages per turn. habitual = almost always bursts (3+ messages before any reply). single = almost always one message per turn. Tied to stylometric.linebreak_style multi_line. |
| interaction.active_hours_class | free_string | UTC active-hours window summary (e.g. 05:00-14:00 UTC). Free string because the window shape varies by actor and doesn't fit a closed enum. |
| interaction.session_duration_class | categorical | Typical session length: short <15min, medium 15-90min, long 90min-4hr, marathon >4hr. Shares the enum with behave_shell's temporal.session_duration. |
| interaction.attention_pattern | categorical | Reply-graph centrality shape. broadcast = sends to many, replies to few (one-to-many). focused = concentrates on a small set of interlocutors. reciprocal = balanced give-and-take. |
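Most categorical interaction primitives are bucketed numeric measurements. As one example, the sketch below maps session lengths to interaction.session_duration_class. Using the median as the "typical" length, and treating each bucket as closed on its lower edge (with exactly 4 hours still counted as long), are assumptions; the table gives the ranges but not the edge handling or the aggregation.

```python
from statistics import median


def session_duration_class(session_minutes: list) -> str:
    """Bucket an actor's typical (median) session length in minutes."""
    if not session_minutes:
        raise ValueError("need at least one session length")
    typical = median(session_minutes)
    if typical < 15:
        return "short"       # <15min
    if typical < 90:
        return "medium"      # 15-90min
    if typical <= 240:
        return "long"        # 90min-4hr
    return "marathon"        # >4hr


print(session_duration_class([5, 20, 45]))       # medium (median is 20)
print(session_duration_class([300, 310, 20]))    # marathon (median is 300)
```

Only the class label would be emitted as the observation value; the per-session lengths stay inside the sensor.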
content.* — Content-derived signals, EXPERIMENTAL (6 primitives)
Content primitives are derived from message text through classifiers rather than structural/timing analysis. They carry the highest risk of false positives, are brittle to vocabulary drift, and are locale-specific. An attribution engine may choose to weight these at zero until field-validated against labeled data.
| Primitive | Kind | Description |
|---|---|---|
| content.role_signal | categorical | Locale-tuned role-vocabulary classifier. Values: admin, seller, buyer, lurker, newbie. May be moved to a separate IOC/keyword-detection layer after Rutify testing. EXPERIMENTAL |
| content.transactional_language | numeric [0,1] | Rate of transactional terms per message. Locale-specific; brittle to vocabulary drift. EXPERIMENTAL |
| content.opsec_awareness | numeric [0,1] | Rate of security-conscious phrases. HIGH FALSE-POSITIVE RISK on casual conversation about deleting files/messages. EXPERIMENTAL |
| content.targeting_language | array[free_string] | IOC-shaped target patterns (bank names, government portals, RUT ranges). Consider moving to a dedicated IOC layer. EXPERIMENTAL |
| content.boasting_pattern | categorical | Success-claim frequency: none, occasional, frequent. Corpus-dependent regex. EXPERIMENTAL |
| content.conflict_style | categorical | Dispute-tone classification: aggressive, defusing, appellate. Needs labelled training data. EXPERIMENTAL |
Schema
Machine-readable JSON Schema:
json/observation.schema.json
Regenerate after model changes:
python scripts/generate_schema.py
Tests
pytest tests/
Attribution recipes
attribution-recipes.md — placeholder document sketching
how an external attribution engine would consume actor.observation.text.* topics
to build actor profiles (credential_broker, low_skill_buyer, group_admin, etc.).
Not populated yet — awaiting Rutify corpus calibration. Not part of the BEHAVE spec.
License
Code and schemas: GPL-3.0-or-later
Spec prose (this file, attribution-recipes.md): CC-BY-SA-4.0