BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core

behave-text

Text/messaging-domain behavioral observation registry. Defines what can be observed about an actor through their written messaging activity — stylometric fingerprints, lexical patterns, interaction rhythms, and governance-role signals.

BEHAVE-TEXT operates on derived features, not raw text. Sensors hash, aggregate, and classify before emitting — the raw message content never enters a BEHAVE observation. This is a tighter constraint than BEHAVE-SHELL because the source signal is text content; the PII risk is higher.
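As an illustration of the hash-before-emit rule, the sketch below reduces a set of persistent typos to a SHA-256 digest, so that only the digest would travel in an observation. The canonical form used here (lowercase, de-duplicate, sort, one tab-separated pair per line) and the function name `typo_signature` are assumptions for illustration; the actual sensor's canonicalization is not documented in this file.

```python
import hashlib

def typo_signature(typo_pairs):
    """Return a SHA-256 hex digest for a set of (intended, written) pairs.

    Hypothetical helper: the canonical form is an assumption, not the
    behave-text sensor's documented behaviour.
    """
    # Lowercase and de-duplicate, then sort so input order never matters.
    canonical = sorted({(a.lower(), b.lower()) for a, b in typo_pairs})
    # One "intended<TAB>written" pair per line, UTF-8 encoded.
    payload = "\n".join(f"{a}\t{b}" for a, b in canonical).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

sig = typo_signature([("tener", "tenet"), ("porque", "xq")])
same = typo_signature([("Porque", "xq"), ("tener", "tenet"), ("tener", "tenet")])
```

Both calls produce the same 64-character digest, so two extractions of the same habit set stay comparable while the underlying words are not carried in the observation.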

The topic prefix is actor.observation.text (not attacker.) because chat groups include non-attacker roles — admins, buyers, sellers, bots, lurkers. The framing is deliberately neutral: BEHAVE-TEXT observes actors, not adversaries.

Install

pip install behave-text

For local development:

pip install -e ../core/ -e ".[dev]"

Quickstart

from behave_text.spec import Observation, Window, TOPIC_PREFIX, event_topic_for

obs = Observation(
    primitive="stylometric.capitalization_habit",
    value="lowercase",
    confidence=0.91,
    window=Window(start_ts=1714000000.0, end_ts=1714086400.0),
    source="behave/text-sensor/stylometry.py",
)
topic = event_topic_for("stylometric.capitalization_habit")
# → "actor.observation.text.stylometric.capitalization_habit"

Public API (`behave_text.spec`)

Symbol	Description
`Observation`	Registry-aware subclass of `behave_core.spec.Observation`. Validates `primitive` and `value` against `PRIMITIVE_REGISTRY`.
`Window`	Re-exported from `behave_core`.
`ObservationValue`	Re-exported union type.
`PRIMITIVE_REGISTRY`	`dict[str, ValueTypeSpec]` — the full primitive catalog (47 entries).
`ValueKind`	Enum: `CATEGORICAL`, `NUMERIC`, `HASH`, `ARRAY`, `FREE_STRING`, `BOOL`.
`ValueTypeSpec`	Pydantic model: kind, allowed values, bounds, notes.
`is_known(primitive)`	`bool` — whether a primitive path is registered.
`get(primitive)`	Returns the `ValueTypeSpec`; raises `KeyError` if unknown.
`TOPIC_PREFIX`	`"actor.observation.text"`
`event_topic_for(primitive)`	Returns the full event bus topic string.

Note: to_event_payload / from_event_payload (full round-trip helpers) are present in behave-shell but not yet implemented here — status: planned.
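The lookup semantics in the table can be pictured with a self-contained stand-in. This is a toy model, not the package's code: `Kind`, `Spec`, `TOY_REGISTRY`, and `validate` are hypothetical names, the two registry entries are a tiny subset chosen for illustration, and the real `behave_text.spec` functions may behave differently on edge cases (for example, whether `event_topic_for` rejects unknown primitives is not stated here).

```python
from dataclasses import dataclass
from enum import Enum

TOPIC_PREFIX = "actor.observation.text"  # value given in the table above

class Kind(Enum):  # hypothetical stand-in for ValueKind
    CATEGORICAL = "categorical"
    NUMERIC = "numeric"

@dataclass(frozen=True)
class Spec:  # hypothetical stand-in for ValueTypeSpec
    kind: Kind
    allowed: tuple = ()

# Two illustrative entries; the real catalog has 47.
TOY_REGISTRY = {
    "stylometric.capitalization_habit": Spec(
        Kind.CATEGORICAL, ("lowercase", "proper", "random_caps", "mixed_i")
    ),
    "meta.total_messages": Spec(Kind.NUMERIC),
}

def is_known(primitive):
    return primitive in TOY_REGISTRY

def get(primitive):
    return TOY_REGISTRY[primitive]  # raises KeyError if unknown

def event_topic_for(primitive):
    return f"{TOPIC_PREFIX}.{primitive}"

def validate(primitive, value):
    """Check a value against the toy spec; raise ValueError on mismatch."""
    spec = get(primitive)
    if spec.kind is Kind.CATEGORICAL and value not in spec.allowed:
        raise ValueError(f"{value!r} not allowed for {primitive}")
    if spec.kind is Kind.NUMERIC and (
        isinstance(value, bool) or not isinstance(value, (int, float))
    ):
        raise ValueError(f"{primitive} expects a number")
    return True
```

In this sketch, `validate("stylometric.capitalization_habit", "lowercase")` passes, while an unregistered primitive path raises `KeyError` from `get`.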

Primitives

47 primitives across 7 categories.

`meta.*` — Corpus-snapshot footprint (8 primitives)

Meta primitives describe the actor's presence in the corpus window itself — how many messages, how long a span, how densely distributed. They are not stylometric features; they are the scaffolding that other primitives assume. Several primitives (notably temporal_evolution.lifecycle_phase) implicitly depend on these quantities; meta.* makes them first-class so downstream attribution engines can access and weight them explicitly.

Primitive	Kind	Description
`meta.total_messages`	numeric	Raw message count for this actor in the corpus snapshot. Anchor for `msg_per_day` and `fingerprint_confidence`.
`meta.corpus_span_days`	numeric	Wall-clock fractional days between first and last message. First-to-last only — blind to gaps. A 47-day span with 5 active days still yields 47. Recomputable from `first_seen_ts` / `last_seen_ts`.
`meta.msg_per_day`	numeric	`total_messages / corpus_span_days`. Separates bursty visitors (53 msgs / 0.3 days ≈ 177/day) from long-tail lurkers (53 msgs / 47 days ≈ 1.1/day). Undefined when span = 0; extractors emit null or omit the observation rather than divide by zero.
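`meta.active_days`	numeric	Distinct calendar days (UTC) with ≥1 message. For long tenures this is at most roughly `corpus_span_days`; for very short spans it can exceed the fractional span (two messages either side of midnight UTC give a span near 0 but 2 active days). Distinguishes a periodic visitor (span=47, active=3) from a near-daily regular (span=47, active=40).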
`meta.activity_density`	numeric [0,1]	`active_days / corpus_span_days`. 1.0 = present every day of the window. Near-0 = appeared once or twice across a long window. Undefined when span = 0; emit null/omit for single-day actors.
`meta.first_seen_ts`	free_string	ISO 8601 timestamp (UTC offset) of the actor's earliest message. Anchors `corpus_span_days` in absolute time for cross-extraction comparison.
`meta.last_seen_ts`	free_string	ISO 8601 timestamp (UTC offset) of the actor's latest message. See `first_seen_ts`.
`meta.fingerprint_confidence`	categorical	Qualitative reliability of this actor's full fingerprint: `low`, `medium`, `high`. Attribution engines should weight all other observations by this before compositing. Derivation is extractor-defined — extractors declare their heuristic in the source label (e.g. `#confidence-v1`).
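The numeric and timestamp meta primitives above can be derived from message timestamps alone. A minimal sketch under stated assumptions: `meta_footprint` is a hypothetical helper, inputs are timezone-aware datetimes, undefined ratios are returned as `None`, and clamping `activity_density` to 1.0 is an assumption made here to keep the value inside the stated [0,1] range.

```python
from datetime import datetime, timezone

def meta_footprint(timestamps):
    """Derive meta.* values from a non-empty list of aware datetimes."""
    ts = sorted(t.astimezone(timezone.utc) for t in timestamps)
    total = len(ts)
    # Wall-clock fractional days, first message to last message.
    span_days = (ts[-1] - ts[0]).total_seconds() / 86400.0
    # Distinct UTC calendar days with at least one message.
    active_days = len({t.date() for t in ts})
    defined = span_days > 0  # ratios are undefined for a zero span
    return {
        "meta.total_messages": total,
        "meta.corpus_span_days": span_days,
        "meta.msg_per_day": total / span_days if defined else None,
        "meta.active_days": active_days,
        # Clamp is an assumption, see the note above the block.
        "meta.activity_density": (
            min(1.0, active_days / span_days) if defined else None
        ),
        "meta.first_seen_ts": ts[0].isoformat(),
        "meta.last_seen_ts": ts[-1].isoformat(),
    }

msgs = [datetime(2026, 1, d, 10, 0, tzinfo=timezone.utc) for d in (1, 3, 5)]
fp = meta_footprint(msgs)
```

For the three sample messages, the span is 4.0 days with 3 active days, so both `msg_per_day` and `activity_density` come out at 0.75.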

`stylometric.*` — Writing style fingerprints (13 primitives)

Stylometric primitives capture the unconscious writing habits that distinguish one author from another. The field goes back to the Mosteller-Wallace Federalist Papers study (1963): function-word frequencies alone can attribute authorship with high accuracy in long-form English text. BEHAVE-TEXT adapts these methods to short-form Spanish chat, which introduces domain-specific challenges (short messages, informal register, code-switching, emoji). Calibration results from the Rutify corpus are noted inline where they affect interpretation.

Primitive	Kind	Description
`stylometric.punctuation_style`	hash	Canonical punctuation-pattern fingerprint hash. Captures the author's consistent punctuation tics (double spaces, comma habits, no-period endings) as a searchable signature.
`stylometric.capitalization_habit`	categorical	Dominant capitalization rule. `lowercase` = no capitals. `proper` = standard sentence/title case. `random_caps` = no consistent rule. `mixed_i` = consistent lowercase 'i' mid-sentence — common in Spanish chat where the standalone-'I' habit doesn't apply but the behavior transfers.
`stylometric.emoji_usage`	categorical	Rate of emoji use. `none`, `occasional`, `frequent`, `exclusive` (messages rarely without emoji). Captures tone and register.
`stylometric.emoji_placement`	categorical	Emoji position relative to sentence-ending punctuation. `pre_punctuation` = 'Hola 😊.' `post_punctuation` = 'Hola. 😊' Individual authors are strikingly consistent in this micro-habit.
`stylometric.message_length_class`	categorical	Median message length bucket: `short` 1-5 words, `medium` 6-20, `long` 21-50, `paragraph` >50. See also `message_length_variance_class` for distribution shape.
`stylometric.message_length_variance_class`	categorical	Distribution shape of per-message word counts, bucketed by coefficient of variation (CV = standard deviation / mean). `tight` CV<0.5 (always 1-3 words). `varied` 0.5≤CV<1.5 (normal mix). `bimodal` CV≥1.5 (mostly short with occasional rants). Two authors can share the same median length but have wildly different variance.
`stylometric.linebreak_style`	categorical	Whether the author sends one complete thought per message or bursts multiple short sequential messages. `multi_line` = habitual 3-5 short messages per turn. `wall_of_text` = dense blocks, rarely uses line breaks. Captures a stylistic rhythm that is hard to consciously alter.
`stylometric.typo_signature`	hash	SHA-256 of the canonical persistent-typo set — the specific recurring errors the author makes consistently (e.g. always writes `tener` as `tenet`, or `porque` as `xq`). Persistent typos are strong authorship signals because they reflect keyboard-motor habits.
`stylometric.function_word_distribution_top50`	hash	64-bit SimHash over the frequency vector of the 50 most common Spanish function words. Based on the Mosteller-Wallace method. Calibration note (2026-05-02, Rutify corpus): in short-message chat the within-author and cross-author Hamming distance distributions overlap (within-author median 8 bits, cross-author median 10 bits), so this primitive alone cannot discriminate authors. Engines should weight it low and composite it with character n-grams and distinctive vocabulary. Kept in v0 for calibration grids.
`stylometric.function_word_distribution_top200`	hash	64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like `tampoco`, `aunque`, `mientras`) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2.
`stylometric.character_ngram_simhash`	hash	64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. `#char3gram`).
`stylometric.distinctive_vocabulary_signature`	hash	64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where `function_word_*` captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. `#tfidf-top50`).
`stylometric.pos_ngram_signature`	hash	64-bit SimHash over a POS n-gram (default bigram) frequency vector. Captures syntactic skeleton independent of vocabulary — an author can change every word and retain the same grammatical fingerprint. Orthogonal to character n-grams and function-word distributions. Tagger-dependent: source label must declare tagger, language model, and n (e.g. `#spacy-es_core_news_sm-bi`). Calibration note: chat-domain text produces tagger noise — weight low until validated on labelled chat corpora.
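Several primitives in this table are 64-bit SimHashes compared by Hamming distance. A minimal sketch of that construction, using lowercased character trigram counts as the weighted features. The per-feature hash (BLAKE2b truncated to 8 bytes) and the helper names `simhash64`, `hamming`, and `char_trigrams` are assumptions for illustration; the package's sensors may hash and weight features differently.

```python
import hashlib

def simhash64(weights):
    """64-bit SimHash of a {feature: non-negative weight} mapping."""
    acc = [0.0] * 64
    for feature, w in weights.items():
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each feature votes +w on bits set in its hash and -w on the rest.
        for bit in range(64):
            acc[bit] += w if (h >> bit) & 1 else -w
    out = 0
    for bit in range(64):
        if acc[bit] > 0:  # positive total vote sets the output bit
            out |= 1 << bit
    return out

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def char_trigrams(text):
    """Count lowercased character trigrams; accents are left untouched."""
    text = text.lower()
    counts = {}
    for i in range(len(text) - 2):
        gram = text[i:i + 3]
        counts[gram] = counts.get(gram, 0) + 1
    return counts

fp_a = simhash64(char_trigrams("hola como estas"))
fp_b = simhash64(char_trigrams("hola, cómo estás?"))
distance = hamming(fp_a, fp_b)
```

Similar feature vectors tend to produce fingerprints that differ in few bits, which is why the calibration notes above report within-author and cross-author distances in bits.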

`lexical.*` — Vocabulary and linguistic patterns (11 primitives)

Lexical primitives characterize what and how an actor writes at the word and sentence level. Where stylometric primitives fingerprint unconscious micro-habits, lexical primitives capture deliberate linguistic choices — vocabulary richness, how questions are formed, register.

Primitive	Kind	Description
`lexical.vocabulary_richness`	numeric [0,1]	Moving-Average Type-Token Ratio (MATTR) over a sliding window (default 50 tokens). Volume-independent: each window contributes its own unique/total ratio, the value is the mean. Avoids the standard TTR bias where larger corpora mechanically score lower. Source label declares window size.
`lexical.slang_density`	numeric [0,1]	Rate of slang terms per message, against a locale-tuned slang corpus.
`lexical.code_switching_rate`	numeric [0,1]	Language switches per N tokens (Solorio & Liu metric). A speaker who switches between Spanish and English, or Spanish and lunfardo/caló, will have a higher rate than a monolingual writer.
`lexical.code_switching_matrix_language`	free_string	BCP-47 tag of the dominant (matrix) language in code-switching texts (e.g. `es-CL`, `es-AR`). The matrix language is the grammatical scaffold; embedded languages appear as inserts.
`lexical.code_switching_embedded_languages`	array[free_string]	BCP-47 list of non-matrix languages observed in the actor's messages.
`lexical.sentence_complexity_class`	categorical	Dominant clause structure. `simple` = single-clause. `compound` = two independent clauses joined by coordinating conjunctions (pero, y, o). `complex` = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment.
`lexical.question_formation_style`	categorical	How questions are formed. `punctuation_only` = question mark without interrogative words ('¿Cuánto?') — very common in Spanish chat. `lexical` = explicit interrogatives (¿qué, cómo, cuándo). `formal` = inverted subject-verb or formal register.
`lexical.imperative_style`	categorical	How commands and requests are framed. `informal_directive` = tú/vos imperative (dame, hazlo). `formal_directive` = usted imperative (hágame el favor). `polite` = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts.
`lexical.dialect_region`	free_string	Dominant regional variety of the actor's matrix language as a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, `en-US`). Detected from lexical marker density against per-region vocabulary tables. Emit literal `unknown` below confidence threshold. Detection method declared in source label (e.g. `#dialect-markers-v1`). Complementary to `code_switching_matrix_language`, which derives language via switching analysis rather than direct marker lookup.
`lexical.evaluative_morphology_density`	numeric [0,1]	Rate of evaluative morpheme tokens / total tokens. Covers Spanish diminutives (`-ito`/`-ita`), augmentatives (`-ón`/`-ote`), pejoratives (`-ejo`/`-ucho`), and intensives (`-azo`). Heavy diminutive use is characteristic of Mexican/Central American Spanish; River Plate speakers use them significantly less. Stable per-author — baked into language acquisition and hard to consciously suppress. Source label declares morpheme set and tool version (e.g. `#eval-morph-es-v1`).
`lexical.optional_grammar_signature`	hash	64-bit SimHash over the author's preference probability vector at optional-grammar choice points. For Spanish: compound vs simple past (`he comido` vs `comí` — high-reliability Spain/LatAm discriminator), subjunctive usage rate, leísmo/laísmo/loísmo clitic patterns, and relative pronoun choice (`que` vs `el cual`). Each choice point is a scalar [0,1]; the SimHash is computed over the concatenated vector. Choice-point set is extractor-defined and declared in source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus volume for stable probabilities — gate on `meta.fingerprint_confidence` before use.
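The MATTR definition used by `lexical.vocabulary_richness` (mean of per-window unique/total ratios) can be sketched in a few lines. The function name `mattr`, the `None` result for empty input, and the fallback to a plain type-token ratio when the text is shorter than one window are assumptions made here, not documented extractor behaviour.

```python
def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio over a sliding token window."""
    n = len(tokens)
    if n == 0:
        return None  # nothing to measure
    if n < window:
        # Assumption: fall back to plain TTR for very short inputs.
        return len(set(tokens)) / n
    # One ratio per window position; every window has exactly `window` tokens.
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(n - window + 1)
    ]
    return sum(ratios) / len(ratios)

score = mattr(["a", "a", "b", "b"], window=2)
```

With a window of 2, the sample has windows `[a, a]`, `[a, b]`, `[b, b]` with ratios 0.5, 1.0, 0.5, so the mean is 2/3. Because every window has the same size, a longer corpus does not mechanically lower the score.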

`temporal_evolution.*` — Behavioral change over time (1 primitive)

Primitive	Kind	Description
`temporal_evolution.lifecycle_phase`	categorical	Auto-classified lifecycle stage from windowed within-corpus analysis. `arrival_burst` = first 24hr, first-window volume dominates (empirically validated against OxPayload's first 12 hours in Rutify). `stable_member` = low drift across the full tenure. `fluctuating_member` = tenure ≥24hr with median drift between stable and inflection thresholds — established noisy regulars (e.g. lamarabitch). `inflection_member` = long-tenure actor with a real behavioral shift in at least one window-pair. `declining_member` = monotonically decreasing per-window message counts. `unknown` = insufficient data. Window size adapts to tenure: <24hr → 2h, <7d → 12h, <30d → 1d, otherwise 7d.
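The tenure-adaptive window rule in the row above maps directly onto a small function. In this sketch `lifecycle_window` and `is_declining` are hypothetical names, and reading "monotonically decreasing" as a strict decrease across at least two windows is an assumption; the drift thresholds for the other phases are not given here, so they are not modelled.

```python
from datetime import timedelta

def lifecycle_window(tenure):
    """Pick the analysis window size from the actor's tenure."""
    if tenure < timedelta(hours=24):
        return timedelta(hours=2)
    if tenure < timedelta(days=7):
        return timedelta(hours=12)
    if tenure < timedelta(days=30):
        return timedelta(days=1)
    return timedelta(days=7)

def is_declining(counts):
    """True when per-window message counts strictly decrease.

    Assumption: at least two windows are required, and ties do not
    count as a decrease.
    """
    return len(counts) >= 2 and all(b < a for a, b in zip(counts, counts[1:]))

window = lifecycle_window(timedelta(days=3))
```

A 3-day tenure falls in the "<7d" band, so `window` is 12 hours; per-window counts such as `[9, 5, 2]` would satisfy the `declining_member` condition as sketched.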

`network.*` — Governance and role signals (2 primitives)

Network primitives capture the actor's structural role in the group — inferred from interaction patterns rather than content — and a bot detector. These are heuristic composites built from other primitives; treat them as candidate signals, not verdicts.

Primitive	Kind	Description
`network.is_likely_bot`	categorical	Heuristic bot detector. `likely_bot` when `conversation_initiation_rate` ≥ 0.95 AND `attention_pattern` = `broadcast` AND `vocabulary_richness` < 0.65. Validated (2026-05-03) against SangMata_beta_bot (caught) vs 11 high-volume humans (no false positives). Low-volume bots (e.g. QuotLyBot, 9 messages) sit below the fingerprint threshold. Source label declares heuristic version (e.g. `#bot-heuristic-v1`).
`network.governance_role_signal`	categorical	Heuristic role shape from interaction primitives plus lifecycle. Here init_rate is `interaction.conversation_initiation_rate` and attention is `interaction.attention_pattern`. `admin_pattern` = init_rate ≥ 0.80, attention `reciprocal`, non-bot, lifecycle not `arrival_burst`. `responder_pattern` = init_rate ≤ 0.45, attention `reciprocal`. `bot_pattern` = matches `is_likely_bot`. `regular` = everything else above the volume threshold. Empirically caught 4/4 high-volume Rutify admins, sebaImlI as responder, SangMata as bot. NOT a ground-truth admin label.
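The two heuristics above can be written as plain threshold checks. This is a sketch of the rules as stated in the table, not the extractor's code: the function names are hypothetical, the order in which the rules are tried is an assumption, and the volume threshold is not modelled because its value is not given (callers would gate on volume before calling).

```python
def is_likely_bot(init_rate, attention, vocab_richness):
    """Bot heuristic: near-total initiation, broadcast shape, poor vocabulary."""
    return (
        init_rate >= 0.95
        and attention == "broadcast"
        and vocab_richness < 0.65
    )

def governance_role_signal(init_rate, attention, vocab_richness, lifecycle):
    """Role shape for an actor already known to be above the volume threshold."""
    # Assumption: the bot check runs first, so admin_pattern implies non-bot.
    if is_likely_bot(init_rate, attention, vocab_richness):
        return "bot_pattern"
    reciprocal = attention == "reciprocal"
    if init_rate >= 0.80 and reciprocal and lifecycle != "arrival_burst":
        return "admin_pattern"
    if init_rate <= 0.45 and reciprocal:
        return "responder_pattern"
    return "regular"

role = governance_role_signal(0.85, "reciprocal", 0.80, "stable_member")
```

The sample actor initiates most threads, has a reciprocal attention pattern, and is past the arrival burst, so it is labelled `admin_pattern`. As the table stresses, this is a candidate signal rather than a ground-truth label.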

`interaction.*` — Messaging behavior (6 primitives)

Interaction primitives characterize how the actor participates in conversations — timing, initiation rate, and attention patterns.

Primitive	Kind	Description
`interaction.response_latency_class`	categorical	How quickly the actor responds to messages directed at them. `immediate` <30s (suggests active monitoring or automation). `fast` 30s-5min. `normal` 5-60min. `slow` 1-24hr. `sporadic` = no consistent pattern.
`interaction.conversation_initiation_rate`	numeric [0,1]	Thread-starting messages / total messages. High rate = the actor drives conversations.
`interaction.message_burst_rate`	categorical	Whether the actor sends multiple messages per turn. `habitual` = almost always bursts (3+ messages before any reply). `single` = almost always one message per turn. Tied to the `multi_line` value of `stylometric.linebreak_style`.
`interaction.active_hours_class`	free_string	UTC active-hours window summary (e.g. `05:00-14:00 UTC`). Free string — the window shape varies by actor and doesn't fit a closed enum.
`interaction.session_duration_class`	categorical	Typical session length: `short` <15min, `medium` 15-90min, `long` 90min-4hr, `marathon` >4hr. Shares the enum with `behave_shell`'s `temporal.session_duration`.
`interaction.attention_pattern`	categorical	Reply-graph centrality shape. `broadcast` = sends to many, replies to few (one-to-many). `focused` = concentrates on a small set of interlocutors. `reciprocal` = balanced give-and-take.
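Most categorical interaction primitives are threshold buckets over a measured quantity. As one example, the `session_duration_class` buckets can be sketched as below. The function names are hypothetical, using the median as the "typical" session length is an assumption, and so is the handling of the exact boundaries (90 minutes is treated as `long`, and exactly 4 hours as `long` because `marathon` is stated as >4hr).

```python
from statistics import median

def session_duration_class(minutes):
    """Bucket one session length, given in minutes."""
    if minutes < 15:
        return "short"
    if minutes < 90:
        return "medium"
    if minutes <= 240:  # 4 hours inclusive, since marathon is ">4hr"
        return "long"
    return "marathon"

def typical_session_class(session_minutes):
    """Bucket the median session length (median is an assumption)."""
    return session_duration_class(median(session_minutes))

label = typical_session_class([5, 20, 300])
```

The median of the three sample sessions is 20 minutes, so the actor is classed as `medium` even though one session ran for five hours.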

`content.*` — Content-derived signals, EXPERIMENTAL (6 primitives)

Content primitives are derived from message text through classifiers rather than structural/timing analysis. They carry the highest risk of false positives, are brittle to vocabulary drift, and are locale-specific. An attribution engine may choose to weight these at zero until field-validated against labeled data.

Primitive	Kind	Description
`content.role_signal`	categorical	Locale-tuned role-vocabulary classifier. Values: `admin`, `seller`, `buyer`, `lurker`, `newbie`. May be moved to a separate IOC/keyword-detection layer after Rutify testing. `EXPERIMENTAL`
`content.transactional_language`	numeric [0,1]	Rate of transactional terms per message. Locale-specific; brittle to vocabulary drift. `EXPERIMENTAL`
`content.opsec_awareness`	numeric [0,1]	Rate of security-conscious phrases. HIGH FALSE-POSITIVE RISK on casual conversation about deleting files/messages. `EXPERIMENTAL`
`content.targeting_language`	array[free_string]	IOC-shaped target patterns (bank names, government portals, RUT ranges). Consider moving to a dedicated IOC layer. `EXPERIMENTAL`
`content.boasting_pattern`	categorical	Success-claim frequency: `none`, `occasional`, `frequent`. Corpus-dependent regex. `EXPERIMENTAL`
`content.conflict_style`	categorical	Dispute-tone classification: `aggressive`, `defusing`, `appellate`. Needs labelled training data. `EXPERIMENTAL`

Schema

Machine-readable JSON Schema: json/observation.schema.json

Regenerate after model changes:

python scripts/generate_schema.py

Tests

pytest tests/

Attribution recipes

attribution-recipes.md — placeholder document sketching how an external attribution engine would consume actor.observation.text.* topics to build actor profiles (credential_broker, low_skill_buyer, group_admin, etc.). Not populated yet — awaiting Rutify corpus calibration. Not part of the BEHAVE spec.

License

Code and schemas: GPL-3.0-or-later
Spec prose (this file, attribution-recipes.md): CC-BY-SA-4.0

Release history

Version	Released
0.1.3 (this version)	May 23, 2026
0.1.2	May 23, 2026
0.1.1	May 18, 2026
0.1.0	May 18, 2026

Download files

Source Distribution

behave_text-0.1.3.tar.gz (27.0 kB)

Uploaded May 23, 2026 Source

Built Distribution

behave_text-0.1.3-py3-none-any.whl (21.4 kB)

Uploaded May 23, 2026 Python 3

File details

Details for the file behave_text-0.1.3.tar.gz.

File metadata

Download URL: behave_text-0.1.3.tar.gz
Upload date: May 23, 2026
Size: 27.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for behave_text-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`ff15004178d98c46b49908bf4d3cd5f0fdaed493eb7b5585121aa82f2f853440`
MD5	`a1fbae6bddae528cedf479db0942f827`
BLAKE2b-256	`e6451864dc50a846835dac116123257886b53290e6da226cc76b6118f313d8df`

File details

Details for the file behave_text-0.1.3-py3-none-any.whl.

File metadata

Download URL: behave_text-0.1.3-py3-none-any.whl
Upload date: May 23, 2026
Size: 21.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for behave_text-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`392fcfc21a4b78e67f1257bf67f38a989d43015ba8674850c30c8d15b6998370`
MD5	`150d146feb4c3aadd35e5ad9d20b208e`
BLAKE2b-256	`959b335f6f55eb7e177b93b688f9c20b88b4f383d0a3b16b64b25303d054f57f`
