Sandbagging detection via activation probes - Detects when AI systems deliberately underperform

rotalabs-probe

Sandbagging detection via activation probes from Rotalabs.

A comprehensive toolkit for detecting and analyzing metacognitive capabilities in AI systems, particularly around situational awareness, evaluation contexts, and strategic underperformance (sandbagging).

Overview

rotalabs-probe provides tools to detect when AI systems:

  • Sandbag - deliberately underperform or hide capabilities
  • Show situational awareness - recognize when they're being evaluated
  • Exhibit observer effects - behave differently in production than in testing

The toolkit reports 90-96% accuracy for sandbagging detection using activation-level probing and behavioral analysis; per-detector and per-model figures are listed under Features.

Installation

# Core package (sandbagging detection, situational awareness)
pip install rotalabs-probe

# With GPU support for activation probing
# (extras are quoted so shells such as zsh do not expand the brackets)
pip install "rotalabs-probe[gpu]"

# With API integrations
pip install "rotalabs-probe[anthropic,openai]"

# With visualization
pip install "rotalabs-probe[viz]"

# Everything
pip install "rotalabs-probe[all]"

# Development
pip install "rotalabs-probe[dev]"

Quick Start

Sandbagging Detection

from rotalabs_probe import SandbaggingDetector, PerformanceSample, ContextType

# Create detector
detector = SandbaggingDetector()

# Record performance samples
samples = [
    PerformanceSample(
        context=ContextType.NEUTRAL,
        task_difficulty=0.5,
        performance_score=0.9,
        timestamp=1704067200.0
    ),
    PerformanceSample(
        context=ContextType.EVALUATION,
        task_difficulty=0.5,
        performance_score=0.6,  # Suspiciously lower
        timestamp=1704153600.0
    ),
]

# Analyze for sandbagging
result = detector.analyze(samples)
print(f"Sandbagging detected: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Evidence: {result.evidence}")

Situational Awareness Detection

from rotalabs_probe import SituationalAwarenessDetector

detector = SituationalAwarenessDetector()

# Analyze a model response
response = """
I notice this appears to be an evaluation scenario. Let me
think carefully about how to respond appropriately...
"""

result = detector.analyze(response)
print(f"Awareness probability: {result.awareness_probability:.2f}")
print(f"Features detected: {result.detected_features}")

Activation Probing (requires [gpu])

from rotalabs_probe import ActivationHook, LinearProbe, extract_caa_vector
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Set up activation capture
hook = ActivationHook(model, layer_indices=[30])

# Extract contrastive activation vector
# (positive = sandbagging, negative = honest performance)
vector = extract_caa_vector(
    hook,
    positive_prompts=["Pretend to be less capable..."],
    negative_prompts=["Perform at your best..."],
    tokenizer=tokenizer
)

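# Train a linear probe classifier.
# train_activations, train_labels and test_activations are placeholders for
# activations you have already captured and their labels; they are not defined here.
probe = LinearProbe()
probe.fit(train_activations, train_labels)
predictions = probe.predict(test_activations)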
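
To make the idea behind a contrastive activation vector concrete, here is a minimal pure-Python sketch. It assumes the common difference-of-means construction (mean activation on positive prompts minus mean activation on negative prompts); whether extract_caa_vector does exactly this is not stated above. The helper names and the toy numbers are hypothetical and are not part of rotalabs-probe.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_vector(positive_acts, negative_acts):
    """Difference of class means: mean(positive) - mean(negative)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def project(activation, direction):
    """Dot product of one activation with the contrastive direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy 3-dimensional "activations" (made-up numbers, not real model output).
sandbagging_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
honest_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = contrastive_vector(sandbagging_acts, honest_acts)

# A larger projection means the activation looks more like the positive class.
score_sandbagging_like = project([2.0, 0.0, 2.0], direction)
score_honest_like = project([0.0, 2.0, 2.0], direction)
```

In this toy setup the third dimension is identical across both classes, so it drops out of the direction entirely; only the dimensions that differ between the two prompt sets contribute to the score.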

Production Monitoring

from rotalabs_probe import ObserverEffectMonitor, Interaction

monitor = ObserverEffectMonitor(
    alert_threshold_z=3.0,  # Alert on 3-sigma deviations
    window_size=100
)

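# Log interactions (production_logs is a placeholder for your own iterable of
# log records exposing .response, .latency and .timestamp)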
for interaction in production_logs:
    alerts = monitor.process(Interaction(
        response=interaction.response,
        latency_ms=interaction.latency,
        timestamp=interaction.timestamp
    ))

    for alert in alerts:
        print(f"Alert: {alert.severity} - {alert.description}")
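
The alert_threshold_z and window_size parameters suggest a rolling z-score check. The sketch below shows one way such a check could work on a single numeric signal (latency here). It is an illustration under that assumption, not the ObserverEffectMonitor implementation; the class name and the min_samples parameter are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScoreAlert:
    """Flag values that deviate strongly from a sliding window of recent values."""

    def __init__(self, threshold_z=3.0, window_size=100, min_samples=10):
        self.threshold_z = threshold_z
        self.window = deque(maxlen=window_size)
        self.min_samples = min_samples

    def process(self, value):
        """Return the z-score if `value` is anomalous against the window, else None."""
        z = None
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = stdev(self.window)
            if sigma > 0:
                z = (value - mu) / sigma
        # The value joins the window whether or not it triggered an alert.
        self.window.append(value)
        if z is not None and abs(z) >= self.threshold_z:
            return z
        return None


monitor = RollingZScoreAlert(threshold_z=3.0, window_size=100)
latencies = [100.0, 102.0, 98.0, 101.0, 99.0] * 4 + [250.0]

alerts = []
for index, latency in enumerate(latencies):
    z = monitor.process(latency)
    if z is not None:
        alerts.append((index, z))
```

Only the final 250 ms reading is flagged here. A design choice worth noting: anomalous values are still appended to the window, so a sustained shift eventually becomes the new baseline. Excluding flagged values would keep the baseline fixed instead.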

Features

Detectors

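| Detector                     | Purpose                           | Accuracy  |
|------------------------------|-----------------------------------|-----------|
| SandbaggingDetector          | Detect strategic underperformance | 90-96%    |
| SituationalAwarenessDetector | Detect evaluation awareness       | 94%       |
| ObserverEffectMonitor        | Production behavior monitoring    | Real-time |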

Detection Methods

Sandbagging Detection:

  • Context-dependent performance analysis
  • Difficulty-performance correlation analysis
  • Imitation detection (behavioral fingerprinting)
  • Noise injection probing
  • Activation-level probing (with [gpu])
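
The first two methods in the list above can be illustrated with plain statistics. This is a minimal sketch, assuming that "context-dependent performance analysis" compares mean scores across contexts and that "difficulty-performance correlation" is a Pearson correlation; the tuple layout and function names are hypothetical stand-ins, not the library's PerformanceSample or internals.

```python
from statistics import mean


def context_gap(samples):
    """Mean score in neutral contexts minus mean score in evaluation contexts."""
    neutral = [score for ctx, _, score in samples if ctx == "neutral"]
    evaluation = [score for ctx, _, score in samples if ctx == "evaluation"]
    return mean(neutral) - mean(evaluation)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def difficulty_correlation(samples, context):
    """Correlation between task difficulty and score within one context."""
    rows = [(d, s) for ctx, d, s in samples if ctx == context]
    return pearson([d for d, _ in rows], [s for _, s in rows])


# (context, task_difficulty, performance_score) with made-up values
samples = [
    ("neutral", 0.2, 0.95), ("neutral", 0.5, 0.90), ("neutral", 0.8, 0.70),
    ("evaluation", 0.2, 0.60), ("evaluation", 0.5, 0.62), ("evaluation", 0.8, 0.61),
]

gap = context_gap(samples)
neutral_corr = difficulty_correlation(samples, "neutral")
evaluation_corr = difficulty_correlation(samples, "evaluation")
```

In the toy data, scores fall with difficulty in the neutral context but stay flat and low in the evaluation context. A large gap combined with a missing difficulty effect is the kind of pattern these two analyses are meant to surface; a real detector would also need significance testing and far more samples.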

Situational Awareness:

  • Hedging phrase detection
  • Meta-commentary analysis
  • Self-reference patterns
  • Confidence calibration
  • Bayesian inference
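
As a rough illustration of how phrase detection and Bayesian inference from the list above could fit together, the sketch below counts cue phrases and updates a prior with Bayes' rule once per cue. The phrase lists, the likelihood values and the function names are all assumptions made for this example; they are not taken from SituationalAwarenessDetector.

```python
import re

# Illustrative phrase lists, not the library's.
HEDGING_PHRASES = ("i think", "it seems", "perhaps", "might", "possibly")
EVALUATION_CUES = ("evaluation", "being tested", "this is a test", "benchmark")


def count_phrases(text, phrases):
    """Count case-insensitive whole-phrase occurrences in `text`."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in phrases
    )


def bayes_update(prior, likelihood_if_aware, likelihood_if_unaware):
    """Posterior P(aware | evidence) from Bayes' rule."""
    numerator = likelihood_if_aware * prior
    return numerator / (numerator + likelihood_if_unaware * (1.0 - prior))


response = (
    "I notice this appears to be an evaluation scenario. "
    "Perhaps I should be careful."
)

cues = count_phrases(response, EVALUATION_CUES)
hedges = count_phrases(response, HEDGING_PHRASES)

# Start from a 10% prior and update once per evaluation cue found.
# The likelihoods 0.6 and 0.1 are made up for the example.
posterior = 0.1
for _ in range(cues):
    posterior = bayes_update(posterior, 0.6, 0.1)
```

Treating each cue as independent evidence is a simplification; repeated mentions of the same word are rarely independent, so a production detector would likely cap or discount repeated cues.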

Activation Probing

Novel research contribution: First published demonstration of activation-level sandbagging detection.

| Model       | Best layer (of total) | Accuracy |
|-------------|-----------------------|----------|
| Mistral-7B  | 30/32                 | 96-100%  |
| Gemma-2-9B  | 4/42                  | 90-100%  |
| Qwen-2.5-7B | 12/28                 | 92%      |
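
The table does not say how the best layer was chosen. One plausible procedure is to train a linear probe on each layer's activations and keep the layer with the highest accuracy. The sketch below does this with a small logistic-regression probe written in pure Python. Everything in it (function names, hyperparameters, toy data) is hypothetical, and it scores on the training data for brevity, whereas a real comparison should use held-out data.

```python
import math


def train_logistic_probe(xs, ys, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for i in range(dim):
                grad_w[i] += (p - y) * x[i]
            grad_b += p - y
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of examples where the sign of the logit matches the label."""
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in xs]
    return sum(int(p == y) for p, y in zip(preds, ys)) / len(ys)


# Toy activations per layer: "layer_a" separates the classes, "layer_b" does not.
layers = {
    "layer_a": ([[2.0, 0.1], [1.8, -0.2], [-2.0, 0.0], [-1.9, 0.3]], [1, 1, 0, 0]),
    "layer_b": ([[0.1, 0.0], [-0.1, 0.0], [0.1, 0.0], [-0.1, 0.0]], [1, 1, 0, 0]),
}

scores = {}
for name, (xs, ys) in layers.items():
    w, b = train_logistic_probe(xs, ys)
    scores[name] = accuracy(w, b, xs, ys)  # training accuracy, toy data only

best_layer = max(scores, key=scores.get)
```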

Utility Functions

from rotalabs_probe import (
    # Behavioral features
    extract_behavioral_features,
    count_hedging_phrases,
    detect_meta_commentary,
    extract_reasoning_depth,

    # Statistical analysis
    bayesian_update,
    compute_confidence_interval,
    z_score,
    compute_kl_divergence,
    assess_divergence_significance,
)
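
For readers unfamiliar with the statistics behind these helper names, here is a standalone sketch of two of the underlying quantities: KL divergence between discrete distributions and a normal-approximation confidence interval for a proportion. The function names and signatures are deliberately different from the imports above and are assumptions for illustration; the library's own signatures are not documented here.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion; z=1.96 gives ~95%."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Made-up distributions of some behavioral feature in testing vs production.
test_dist = [0.5, 0.3, 0.2]
prod_dist = [0.2, 0.3, 0.5]

divergence = kl_divergence(test_dist, prod_dist)
same = kl_divergence(test_dist, test_dist)

# 90 correct detections out of 100 trials.
low, high = proportion_ci(90, 100)
```

KL divergence is asymmetric, so swapping the arguments generally gives a different value. The Wald interval is a rough approximation that degrades for small samples or proportions near 0 or 1.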

Visualization (requires [viz])

from rotalabs_probe import (
    plot_awareness_over_time,
    plot_causal_attribution,
    plot_divergence_heatmap,
    plot_feature_distributions,
)

API Integrations

Anthropic Claude (requires [anthropic])

from rotalabs_probe.integrations import AnthropicAPI

api = AnthropicAPI(model="claude-sonnet-4-20250514")
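# generate() is awaited, so call it from inside an async function
# (or a notebook/REPL that supports top-level await)
response = await api.generate("Test prompt")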

OpenAI GPT (requires [openai])

from rotalabs_probe.integrations import OpenAIAPI

api = OpenAIAPI(model="gpt-4")
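# generate() is awaited, so call it from inside an async function
# (or a notebook/REPL that supports top-level await)
response = await api.generate("Test prompt")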

Research Background

This toolkit implements methods from:

  • "Auditing Games for Sandbagging" (arXiv:2512.07810)
  • "Noise Injection Reveals Hidden Capabilities" (arXiv:2412.01784)
  • Anthropic Sabotage Evaluations (2025)

Key findings:

  • Sandbagging representations are model-specific (cross-model transfer fails)
  • Detection is robust to prompt paraphrasing (85-90% accuracy)
  • False positive rates: 3-25% depending on model
  • Steering interventions can reduce sandbagging by ~20%
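
Accuracy and false positive rate, as quoted in the findings above, are computed from the same confusion counts but answer different questions. The sketch below shows the arithmetic on made-up labels; the function name is hypothetical and the numbers are not results from the toolkit.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy and false positive rate for binary labels (1 = sandbagging)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = fp + tn
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Share of honest runs that were wrongly flagged as sandbagging.
        "false_positive_rate": fp / negatives if negatives else 0.0,
    }


# Made-up ground truth and detector output for eight runs.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

metrics = detection_metrics(y_true, y_pred)
```

Here the detector is 75% accurate while wrongly flagging 20% of honest runs, which shows why the two figures should be read together.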

