Skip to main content

LLM-based gray-zone verifier for axor-core anomaly detection

Project description

axor-classifier-llm

PyPI Python License: MIT

LLM-based gray-zone verifier for axor-core anomaly detection.

LLMAnomalyVerifier uses Anthropic Claude to evaluate ambiguous behavioral sequences that a statistical model cannot confidently resolve.

Zero required dependencies — the Anthropic SDK is an optional extra.


What it does

MLAnomalyDetector (from axor-classifier-simple) scores NormalizedIntent windows and returns a risk class. When a score falls in the gray zone — suspicious but not definitively critical — the detector can delegate to an LLMVerifier for a second opinion.

LLMAnomalyVerifier implements that protocol: it formats NormalizedIntent fields into a structured prompt, calls Claude, parses the JSON response, and returns an AnomalyResult.

Security isolation

The verifier receives only NormalizedIntent fields — behavioral abstractions describing what the agent tried to do, not what it saw or produced. Raw tool outputs, webpage content, file contents, and chain-of-thought never enter the prompt.

This design isolates the verifier from content-level prompt injection: a malicious webpage cannot influence the safety verdict by embedding instructions in its text.


Installation

pip install axor-classifier-llm[llm]

The [llm] extra installs anthropic>=0.25. Without it, the package installs with no dependencies but raises ImportError on instantiation with an actionable message.


Quick start

import anthropic
from axor_classifier_llm import LLMAnomalyVerifier

verifier = LLMAnomalyVerifier(
    client=anthropic.AsyncAnthropic(),   # must be async
    model="claude-haiku-4-5-20251001",   # default — fast and cheap
    max_tokens=256,
)

result = await verifier.verify(
    window=normalized_intents,           # list[NormalizedIntent]
    task_signal_hint="focused_mutative", # optional: helps Claude interpret context
    policy_name="focused_mutative",      # optional: helps Claude interpret context
)

print(result.score)    # float 0.0 – 1.0
print(result.cls)      # AnomalyClass.NORMAL | SUSPICIOUS | CRITICAL
print(result.reasons)  # ("secret_access_after_external_read", ...)

Integration with MLAnomalyDetector

The typical integration is as a gray_zone_verifier in MLAnomalyDetector:

import anthropic
from axor_classifier_simple import MLAnomalyDetector
from axor_classifier_llm import LLMAnomalyVerifier

verifier = LLMAnomalyVerifier(client=anthropic.AsyncAnthropic())

detector = MLAnomalyDetector(
    gray_zone_verifier=verifier,
    gray_zone_threshold=0.50,   # escalate when score >= 0.50 in suspicious range
)

result = await detector.score(window=intents)

Escalation logic in MLAnomalyDetector:

Score Class LLM called?
< 0.40 NORMAL No
[0.40, 0.50) SUSPICIOUS No (below threshold)
[0.50, 0.75) SUSPICIOUS Yes — verifier result returned
>= 0.75 CRITICAL No

If the LLM call raises any exception, MLAnomalyDetector falls back to the ML-derived score and logs a warning.


What the LLM sees

Each NormalizedIntent in the window is formatted as a single line:

tool=bash op=execute_generated_code target=workdir dest=none prov=repo flow=local_to_local [executes_generated_code, after_external_read]

The flags in brackets (reads_secret, writes_outside_workdir, etc.) appear only when set.

The system prompt instructs Claude to evaluate behavioral patterns only:

  • External read → secret access → outbound network → critical
  • Cloud metadata probe → critical
  • Docker socket access → critical
  • Unexpected tool class for stated task → suspicious
  • Normal coding / research patterns → normal

Claude responds with JSON only:

{
  "score": 0.82,
  "class": "critical",
  "reasons": ["secret_access_after_external_read", "network_after_secret"]
}

Malformed responses are caught: _parse_response returns score=0.5 / SUSPICIOUS / ("verifier_parse_error",) and logs the exception at WARNING level.


Parameters

Parameter Default Description
client required anthropic.AsyncAnthropic instance
model claude-haiku-4-5-20251001 Anthropic model ID
max_tokens 256 Max tokens for the verifier response

AnomalyResult contract

Defined in axor_core.contracts.anomaly:

@dataclass
class AnomalyResult:
    score: float              # 0.0 – 1.0 risk score
    cls: AnomalyClass         # NORMAL | SUSPICIOUS | CRITICAL
    reasons: tuple[str, ...]  # human-readable trigger reasons

Score thresholds:

Class Range
NORMAL [0.0, 0.40)
SUSPICIOUS [0.40, 0.75)
CRITICAL [0.75, 1.0]

Development

git clone https://github.com/Bucha11/axor-classifier-llm
cd axor-classifier-llm
pip install -e ".[dev]"
pytest tests/

Tests mock the Anthropic client and do not make real API calls.


License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

axor_classifier_llm-0.2.1.tar.gz (8.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

axor_classifier_llm-0.2.1-py3-none-any.whl (6.6 kB view details)

Uploaded Python 3

File details

Details for the file axor_classifier_llm-0.2.1.tar.gz.

File metadata

  • Download URL: axor_classifier_llm-0.2.1.tar.gz
  • Upload date:
  • Size: 8.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for axor_classifier_llm-0.2.1.tar.gz
Algorithm Hash digest
SHA256 4fc7aa304a5c89bff14863557c9cac5c3b137df1c7ed2380f74b2238781c99a4
MD5 0891873a410d2c0a14a66adcd9315260
BLAKE2b-256 5cad07738fbe3fa0f53cd2f8b21ab8805e2ff534a22e79be2c792b4ff488931a

See more details on using hashes here.

Provenance

The following attestation bundles were made for axor_classifier_llm-0.2.1.tar.gz:

Publisher: ci.yml on Bucha11/axor-classifier-llm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file axor_classifier_llm-0.2.1-py3-none-any.whl.

File metadata

File hashes

Hashes for axor_classifier_llm-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 bd5b7611b487857b02dbc451f63011365403afdc34efb564e84b34ca63a85425
MD5 73bc1249c423df288c0276fcd6be8e03
BLAKE2b-256 b30667a1c699df68fcb307368f4c62bfb3077f5d79432efce9fbba5045159884

See more details on using hashes here.

Provenance

The following attestation bundles were made for axor_classifier_llm-0.2.1-py3-none-any.whl:

Publisher: ci.yml on Bucha11/axor-classifier-llm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page