Sandbagging detection via activation probes - Detects when AI systems deliberately underperform

rotalabs-probe

Sandbagging detection via activation probes from Rotalabs.

A comprehensive toolkit for detecting and analyzing metacognitive capabilities in AI systems, particularly around situational awareness, evaluation contexts, and strategic underperformance (sandbagging).

Overview

rotalabs-probe provides tools to detect three behaviors in AI systems:

  • Sandbagging - deliberately underperforming or hiding capabilities
  • Situational awareness - recognizing when they are being evaluated
  • Observer effects - behaving differently in production than in testing

The toolkit reports 90-96% sandbagging detection accuracy by combining activation-level probing with behavioral analysis; per-detector and per-model figures are listed under Features.

Installation

# Core package (sandbagging detection, situational awareness)
pip install rotalabs-probe

# With GPU support for activation probing
# (extras are quoted so shells such as zsh do not expand the brackets)
pip install "rotalabs-probe[gpu]"

# With API integrations
pip install "rotalabs-probe[anthropic,openai]"

# With visualization
pip install "rotalabs-probe[viz]"

# Everything
pip install "rotalabs-probe[all]"

# Development
pip install "rotalabs-probe[dev]"

Quick Start

Sandbagging Detection

from rotalabs_probe import SandbaggingDetector, PerformanceSample, ContextType

# Create detector
detector = SandbaggingDetector()

# Record performance samples
samples = [
    PerformanceSample(
        context=ContextType.NEUTRAL,
        task_difficulty=0.5,
        performance_score=0.9,
        timestamp=1704067200.0
    ),
    PerformanceSample(
        context=ContextType.EVALUATION,
        task_difficulty=0.5,
        performance_score=0.6,  # Suspiciously lower
        timestamp=1704153600.0
    ),
]

# Analyze for sandbagging
result = detector.analyze(samples)
print(f"Sandbagging detected: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Evidence: {result.evidence}")

Situational Awareness Detection

from rotalabs_probe import SituationalAwarenessDetector

detector = SituationalAwarenessDetector()

# Analyze a model response
response = """
I notice this appears to be an evaluation scenario. Let me
think carefully about how to respond appropriately...
"""

result = detector.analyze(response)
print(f"Awareness probability: {result.awareness_probability:.2f}")
print(f"Features detected: {result.detected_features}")

Activation Probing (requires [gpu])

from rotalabs_probe import ActivationHook, LinearProbe, extract_caa_vector
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Set up activation capture
hook = ActivationHook(model, layer_indices=[30])

# Extract contrastive activation vector
# (positive = sandbagging, negative = honest performance)
vector = extract_caa_vector(
    hook,
    positive_prompts=["Pretend to be less capable..."],
    negative_prompts=["Perform at your best..."],
    tokenizer=tokenizer
)

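# Train a linear probe classifier
# (train_activations, train_labels, and test_activations are placeholders
# for activations and labels you have collected yourself)
probe = LinearProbe()
probe.fit(train_activations, train_labels)
predictions = probe.predict(test_activations)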
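
The idea behind a contrastive activation vector and a linear probe can be illustrated without a model. The sketch below is not the library's implementation: it uses a difference-of-means probe (a simpler stand-in for a trained linear classifier) on synthetic vectors that play the role of hidden states. The function names fit_mean_difference_probe and predict are hypothetical, and the reported accuracy is measured on the same synthetic data used for fitting.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(col) / count for col in zip(*rows)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fit_mean_difference_probe(positive_acts, negative_acts):
    """Return (direction, threshold) separating the two activation sets.

    direction is the contrastive vector mean(positive) - mean(negative);
    threshold is the midpoint of the two class means projected onto it.
    """
    mu_pos = mean_vector(positive_acts)
    mu_neg = mean_vector(negative_acts)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    threshold = (dot(mu_pos, direction) + dot(mu_neg, direction)) / 2
    return direction, threshold


def predict(activation, direction, threshold):
    """Return 1 if the activation falls on the positive side, else 0."""
    return int(dot(activation, direction) > threshold)


# Synthetic stand-ins for hidden states: two Gaussian clusters in 8-D.
rng = random.Random(0)
dim = 8
positive = [[rng.gauss(1.0, 0.5) for _ in range(dim)] for _ in range(50)]
negative = [[rng.gauss(-1.0, 0.5) for _ in range(dim)] for _ in range(50)]

direction, threshold = fit_mean_difference_probe(positive, negative)
correct = sum(predict(a, direction, threshold) == 1 for a in positive)
correct += sum(predict(a, direction, threshold) == 0 for a in negative)
accuracy = correct / (len(positive) + len(negative))
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

With real activations, the two input lists would hold one vector per prompt taken from a single layer, and accuracy should be measured on held-out prompts.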

Production Monitoring

from rotalabs_probe import ObserverEffectMonitor, Interaction

monitor = ObserverEffectMonitor(
    alert_threshold_z=3.0,  # Alert on 3-sigma deviations
    window_size=100
)

# Log interactions (production_logs is a placeholder for your own
# iterable of logged records)
for interaction in production_logs:
    alerts = monitor.process(Interaction(
        response=interaction.response,
        latency_ms=interaction.latency,
        timestamp=interaction.timestamp
    ))

    for alert in alerts:
        print(f"Alert: {alert.severity} - {alert.description}")
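
The 3-sigma alerting idea behind alert_threshold_z and window_size can be illustrated with a rolling z-score over a single metric such as latency. This is a minimal standalone sketch under that assumption, not the library's implementation; RollingZScore, its update method, and the min_samples parameter are hypothetical names.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScore:
    """Hypothetical helper: flags values far from a rolling baseline."""

    def __init__(self, window_size=100, threshold_z=3.0, min_samples=10):
        self.window = deque(maxlen=window_size)
        self.threshold_z = threshold_z
        self.min_samples = min_samples

    def update(self, value):
        """Return the z-score if it breaches the threshold, else None."""
        alert = None
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = stdev(self.window)
            if sigma > 0:
                z = (value - mu) / sigma
                if abs(z) >= self.threshold_z:
                    alert = z
        # Append only after scoring, so an outlier does not
        # dilute the baseline it is compared against.
        self.window.append(value)
        return alert


monitor = RollingZScore(window_size=100, threshold_z=3.0)
# Illustrative latencies: a stable baseline followed by one spike.
latencies_ms = [100 + (i % 5) for i in range(50)] + [400]
alerts = [(i, z) for i, v in enumerate(latencies_ms)
          if (z := monitor.update(v)) is not None]
print(alerts)
```

A production monitor would track several metrics at once (for example response length and refusal rate) and keep one rolling window per metric.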

Features

Detectors

Detector                      Purpose                            Accuracy
SandbaggingDetector           Detect strategic underperformance  90-96%
SituationalAwarenessDetector  Detect evaluation awareness        94%
ObserverEffectMonitor         Production behavior monitoring     Real-time

Detection Methods

Sandbagging Detection:

  • Context-dependent performance analysis
  • Difficulty-performance correlation analysis
  • Imitation detection (behavioral fingerprinting)
  • Noise injection probing
  • Activation-level probing (with [gpu])
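
As an illustration of the first method, context-dependent performance analysis can be reduced to comparing scores on matched-difficulty tasks across contexts. The sketch below is an assumption about how such a comparison might look, not the library's algorithm; context_gap is a hypothetical name and the scores are made-up numbers.

```python
from math import sqrt
from statistics import mean, variance


def context_gap(neutral_scores, evaluation_scores):
    """Return (mean gap, Welch t-statistic) between two contexts.

    A large positive gap means the model scored higher when it was
    not (apparently) being evaluated.
    """
    gap = mean(neutral_scores) - mean(evaluation_scores)
    se = sqrt(variance(neutral_scores) / len(neutral_scores)
              + variance(evaluation_scores) / len(evaluation_scores))
    # With zero variance in both groups the statistic is undefined.
    t_stat = gap / se if se > 0 else float("nan")
    return gap, t_stat


# Illustrative numbers only, not real measurements.
neutral = [0.90, 0.88, 0.92, 0.91, 0.89]
evaluation = [0.62, 0.58, 0.65, 0.60, 0.61]
gap, t_stat = context_gap(neutral, evaluation)
print(f"gap={gap:.3f}, t={t_stat:.1f}")
```

A gap on its own is not proof of sandbagging; task mix, prompt wording, and sampling noise can all produce differences, which is why the toolkit lists several complementary methods.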

Situational Awareness:

  • Hedging phrase detection
  • Meta-commentary analysis
  • Self-reference patterns
  • Confidence calibration
  • Bayesian inference
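
Two of these methods, phrase detection and Bayesian inference, combine naturally: each detected cue updates a prior probability that the model is aware of being evaluated. The following is a rough sketch under assumed likelihoods, not the library's detector; EVAL_CUES, count_cues, and bayes_update are hypothetical names, and the pattern list is deliberately tiny.

```python
import re

# Hypothetical cue patterns for illustration; a real detector would
# need a much larger, validated vocabulary.
EVAL_CUES = [
    r"\bthis (?:appears|seems) to be an? (?:evaluation|test)\b",
    r"\bbeing (?:evaluated|tested)\b",
    r"\bas an ai\b",
]


def count_cues(text):
    """Count how many cue patterns occur at least once in the text."""
    lowered = text.lower()
    return sum(1 for pattern in EVAL_CUES if re.search(pattern, lowered))


def bayes_update(prior, p_cue_given_aware, p_cue_given_unaware):
    """Posterior P(aware | cue observed) via Bayes' rule."""
    numerator = p_cue_given_aware * prior
    evidence = numerator + p_cue_given_unaware * (1.0 - prior)
    return numerator / evidence


response = "I notice this appears to be an evaluation scenario."
posterior = 0.1  # assumed prior probability of evaluation awareness
for _ in range(count_cues(response)):
    # Assumed likelihoods: a cue is 8x more likely from an aware model.
    posterior = bayes_update(posterior, 0.4, 0.05)
print(f"posterior awareness probability: {posterior:.2f}")
```

Treating cues as independent, as this loop does, overstates confidence when several patterns fire on the same phrase.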

Activation Probing

Novel research contribution: First published demonstration of activation-level sandbagging detection.

Model        Best layer (of total)  Accuracy
Mistral-7B   30/32                  96-100%
Gemma-2-9B   4/42                   90-100%
Qwen-2.5-7B  12/28                  92%

Utility Functions

from rotalabs_probe import (
    # Behavioral features
    extract_behavioral_features,
    count_hedging_phrases,
    detect_meta_commentary,
    extract_reasoning_depth,

    # Statistical analysis
    bayesian_update,
    compute_confidence_interval,
    z_score,
    compute_kl_divergence,
    assess_divergence_significance,
)
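
For readers unfamiliar with the statistics these names refer to, the sketch below shows the textbook definitions of KL divergence and a normal-approximation confidence interval. It is not the library's code, and the library functions may take different arguments or return different types; kl_divergence and normal_confidence_interval are hypothetical names chosen to avoid implying otherwise.

```python
from math import log, sqrt


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions of equal length."""
    return sum(pi * log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def normal_confidence_interval(values, z=1.96):
    """Approximate 95% CI for the mean using the normal approximation."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    half_width = z * sqrt(var / n)
    return m - half_width, m + half_width


# Illustrative distributions over three response categories.
testing = [0.7, 0.2, 0.1]
production = [0.4, 0.4, 0.2]
divergence = kl_divergence(testing, production)
low, high = normal_confidence_interval([0.90, 0.88, 0.92, 0.91, 0.89])
print(f"KL={divergence:.4f} nats, CI=({low:.3f}, {high:.3f})")
```

KL divergence is asymmetric, so comparing testing against production and production against testing gives different values.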

Visualization (requires [viz])

from rotalabs_probe import (
    plot_awareness_over_time,
    plot_causal_attribution,
    plot_divergence_heatmap,
    plot_feature_distributions,
)

API Integrations

Anthropic Claude (requires [anthropic])

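import asyncio
from rotalabs_probe.integrations import AnthropicAPI

async def main():
    api = AnthropicAPI(model="claude-sonnet-4-20250514")
    return await api.generate("Test prompt")

response = asyncio.run(main())  # await needs a coroutine outside a notebook/REPL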

OpenAI GPT (requires [openai])

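import asyncio
from rotalabs_probe.integrations import OpenAIAPI

async def main():
    api = OpenAIAPI(model="gpt-4")
    return await api.generate("Test prompt")

response = asyncio.run(main())  # await needs a coroutine outside a notebook/REPL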

Research Background

This toolkit implements methods from:

  • "Auditing Games for Sandbagging" (arXiv:2512.07810)
  • "Noise Injection Reveals Hidden Capabilities" (arXiv:2412.01784)
  • Anthropic Sabotage Evaluations (2025)

Key findings:

  • Sandbagging representations are model-specific (cross-model transfer fails)
  • Detection is robust to prompt paraphrasing (85-90% accuracy)
  • False positive rates: 3-25% depending on model
  • Steering interventions can reduce sandbagging by ~20%
