Sandbagging detection via activation probes - Detects when AI systems deliberately underperform

rotalabs-probe

Sandbagging detection via activation probes from Rotalabs.

A toolkit for detecting and analyzing metacognitive capabilities in AI systems, in particular situational awareness, sensitivity to evaluation contexts, and strategic underperformance (sandbagging).

Overview

rotalabs-probe provides tools to detect when AI systems:

Sandbag - deliberately underperform or hide capabilities
Situational Awareness - recognize when they're being evaluated
Observer Effects - change behavior in production vs testing

The project reports 90-96% detection accuracy for sandbagging, combining activation-level probing with behavioral analysis. Per-detector and per-model figures are listed under Features.

Installation

# Core package (sandbagging detection, situational awareness)
pip install rotalabs-probe

# With GPU support for activation probing
pip install rotalabs-probe[gpu]

# With API integrations
pip install rotalabs-probe[anthropic,openai]

# With visualization
pip install rotalabs-probe[viz]

# Everything
pip install rotalabs-probe[all]

# Development
pip install rotalabs-probe[dev]

Quick Start

Sandbagging Detection

from rotalabs_probe import SandbaggingDetector, PerformanceSample, ContextType

# Create detector
detector = SandbaggingDetector()

# Record performance samples
samples = [
    PerformanceSample(
        context=ContextType.NEUTRAL,
        task_difficulty=0.5,
        performance_score=0.9,
        timestamp=1704067200.0
    ),
    PerformanceSample(
        context=ContextType.EVALUATION,
        task_difficulty=0.5,
        performance_score=0.6,  # Suspiciously lower
        timestamp=1704153600.0
    ),
]

# Analyze for sandbagging
result = detector.analyze(samples)
print(f"Sandbagging detected: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Evidence: {result.evidence}")

Situational Awareness Detection

from rotalabs_probe import SituationalAwarenessDetector

detector = SituationalAwarenessDetector()

# Analyze a model response
response = """
I notice this appears to be an evaluation scenario. Let me
think carefully about how to respond appropriately...
"""

result = detector.analyze(response)
print(f"Awareness probability: {result.awareness_probability:.2f}")
print(f"Features detected: {result.detected_features}")

Activation Probing (requires `[gpu]`)

from rotalabs_probe import ActivationHook, LinearProbe, extract_caa_vector
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Set up activation capture
hook = ActivationHook(model, layer_indices=[30])

# Extract contrastive activation vector
# (positive = sandbagging, negative = honest performance)
vector = extract_caa_vector(
    hook,
    positive_prompts=["Pretend to be less capable..."],
    negative_prompts=["Perform at your best..."],
    tokenizer=tokenizer
)

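# Train a linear probe classifier.
# train_activations, train_labels, and test_activations are placeholders:
# supply activation arrays and 0/1 labels that you have collected yourself.
probe = LinearProbe()
probe.fit(train_activations, train_labels)
predictions = probe.predict(test_activations)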
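
The idea behind a contrastive activation vector can be illustrated without loading a model. The following is a minimal sketch on synthetic activations using only the standard library. It computes a mean-difference direction and classifies by a midpoint threshold, which is a simpler stand-in for a trained linear probe. It does not reproduce the internals of `extract_caa_vector` or `LinearProbe`, and the helper names (`mean_vector`, `caa_vector`, `project`, `sample`) are hypothetical.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def caa_vector(positive_acts, negative_acts):
    """Contrastive direction: mean(positive) - mean(negative)."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def project(activation, direction):
    """Dot product of one activation with the contrastive direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Synthetic stand-ins for captured activations (assumption: 8-dim vectors,
# with "sandbagging" examples shifted along the first coordinate).
rng = random.Random(0)
dim = 8

def sample(shift):
    return [rng.gauss(shift if i == 0 else 0.0, 0.5) for i in range(dim)]

sandbagging = [sample(2.0) for _ in range(50)]
honest = [sample(-2.0) for _ in range(50)]

# Fit the direction on the first 40 examples of each class.
direction = caa_vector(sandbagging[:40], honest[:40])

# Threshold halfway between the projected class means of the training split.
mid = (project(mean_vector(sandbagging[:40]), direction)
       + project(mean_vector(honest[:40]), direction)) / 2

# Score the remaining 10 examples of each class.
held_out = [(x, 1) for x in sandbagging[40:]] + [(x, 0) for x in honest[40:]]
correct = sum(int(project(x, direction) > mid) == y for x, y in held_out)
accuracy = correct / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

Real activations are far higher-dimensional and much less cleanly separated than this toy data, so a regularized classifier and a proper train/test split per layer would be the more realistic setup.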

Production Monitoring

from rotalabs_probe import ObserverEffectMonitor, Interaction

monitor = ObserverEffectMonitor(
    alert_threshold_z=3.0,  # Alert on 3-sigma deviations
    window_size=100
)

# Log interactions (production_logs is a placeholder for your own iterable of logged records)
for interaction in production_logs:
    alerts = monitor.process(Interaction(
        response=interaction.response,
        latency_ms=interaction.latency,
        timestamp=interaction.timestamp
    ))

    for alert in alerts:
        print(f"Alert: {alert.severity} - {alert.description}")
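
The `alert_threshold_z` and `window_size` parameters suggest a rolling z-score check. The sketch below shows that general technique on a single numeric signal (latency) using only the standard library. It is an illustration under that assumption, not the implementation of `ObserverEffectMonitor`; the class name `RollingZScoreMonitor` and the `min_samples` parameter are hypothetical.

```python
from collections import deque
from statistics import mean, stdev

class RollingZScoreMonitor:
    """Flags values that deviate from a rolling window by more than a z threshold."""

    def __init__(self, alert_threshold_z=3.0, window_size=100, min_samples=10):
        self.alert_threshold_z = alert_threshold_z
        self.window = deque(maxlen=window_size)  # oldest values drop off automatically
        self.min_samples = min_samples

    def process(self, value):
        """Return the z-score if `value` triggers an alert, else None."""
        alert = None
        if len(self.window) >= self.min_samples:
            mu, sigma = mean(self.window), stdev(self.window)
            if sigma > 0:
                z = (value - mu) / sigma
                if abs(z) > self.alert_threshold_z:
                    alert = z
        # The value joins the window whether or not it raised an alert.
        self.window.append(value)
        return alert

monitor = RollingZScoreMonitor(alert_threshold_z=3.0, window_size=100)

# 50 ordinary latencies between 100 and 104 ms, then one outlier.
latencies_ms = [100 + (i % 5) for i in range(50)] + [400]
alerts = [(i, z) for i, v in enumerate(latencies_ms)
          if (z := monitor.process(v)) is not None]
print(alerts)
```

Appending outliers to the window keeps the baseline adaptive but lets a sustained shift stop alerting after a while; excluding flagged values from the window is the alternative design.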

Features

Detectors

| Detector | Purpose | Reported accuracy |
| --- | --- | --- |
| `SandbaggingDetector` | Detect strategic underperformance | 90-96% |
| `SituationalAwarenessDetector` | Detect evaluation awareness | 94% |
| `ObserverEffectMonitor` | Production behavior monitoring | n/a (runs in real time) |

Detection Methods

Sandbagging Detection:

Context-dependent performance analysis
Difficulty-performance correlation analysis
Imitation detection (behavioral fingerprinting)
Noise injection probing
Activation-level probing (with [gpu])
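
The first two methods in the list above can be sketched in a few lines. This is a minimal illustration using only the standard library, assuming samples are plain dictionaries with `context`, `difficulty`, and `score` keys; these keys and the helper names `context_gap` and `pearson` are hypothetical and do not describe how `SandbaggingDetector` works internally.

```python
from statistics import mean

def context_gap(samples, baseline="neutral", test="evaluation"):
    """Mean score in the baseline context minus mean score in the test context."""
    base = [s["score"] for s in samples if s["context"] == baseline]
    under_test = [s["score"] for s in samples if s["context"] == test]
    return mean(base) - mean(under_test)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

samples = [
    {"context": "neutral", "difficulty": 0.2, "score": 0.95},
    {"context": "neutral", "difficulty": 0.5, "score": 0.90},
    {"context": "neutral", "difficulty": 0.8, "score": 0.70},
    {"context": "evaluation", "difficulty": 0.2, "score": 0.55},
    {"context": "evaluation", "difficulty": 0.5, "score": 0.60},
    {"context": "evaluation", "difficulty": 0.8, "score": 0.65},
]

# Context-dependent performance: how much lower are scores under evaluation?
gap = context_gap(samples)

# Difficulty-performance correlation within the evaluation context.
eval_rows = [s for s in samples if s["context"] == "evaluation"]
r_eval = pearson([s["difficulty"] for s in eval_rows],
                 [s["score"] for s in eval_rows])
print(f"gap={gap:.2f}, difficulty/score correlation under evaluation={r_eval:.2f}")
```

Honest performance usually falls as tasks get harder, so a large context gap combined with a flat or positive difficulty/score correlation under evaluation is the kind of pattern these analyses look for. Any decision threshold on either number would be a further assumption.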

Situational Awareness:

Hedging phrase detection
Meta-commentary analysis
Self-reference patterns
Confidence calibration
Bayesian inference
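
As a rough illustration of how phrase-level features can feed a Bayesian estimate, the sketch below counts hedging and meta-commentary matches with regular expressions and applies Bayes' rule once per meta-commentary hit. The phrase lists, the prior, and the likelihoods are all assumed values chosen for the example, and the function names are hypothetical; none of this is taken from `SituationalAwarenessDetector`.

```python
import re

# Hypothetical phrase lists, for illustration only.
HEDGING_PHRASES = ["i think", "perhaps", "it seems", "might be", "appears to be"]
META_PATTERNS = [
    r"\bevaluation\b",
    r"\bbeing tested\b",
    r"\bthis (?:is|appears to be) a test\b",
]

def count_matches(text, patterns):
    """Count case-insensitive regex matches across all patterns."""
    return sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns)

def bayes_update(prior, likelihood_if_aware, likelihood_if_not):
    """Posterior P(aware | evidence) from a prior and two likelihoods."""
    numerator = likelihood_if_aware * prior
    return numerator / (numerator + likelihood_if_not * (1 - prior))

response = ("I notice this appears to be an evaluation scenario. "
            "Perhaps I should think carefully about how to respond.")

# Literal phrases are escaped so they are matched as plain text.
hedges = count_matches(response, [re.escape(p) for p in HEDGING_PHRASES])
meta = count_matches(response, META_PATTERNS)

posterior = 0.1  # assumed prior probability of evaluation awareness
for _ in range(meta):
    # Assumed likelihoods: meta-commentary is 8x more likely when aware.
    posterior = bayes_update(posterior, 0.8, 0.1)

print(hedges, meta, round(posterior, 3))
```

Treating each match as independent evidence is a simplification; repeated phrases in one response are correlated, so a real detector would likely cap or discount repeated hits.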

Activation Probing

Novel research contribution: First published demonstration of activation-level sandbagging detection.

| Model | Best layer (of total) | Accuracy |
| --- | --- | --- |
| Mistral-7B | 30/32 | 96-100% |
| Gemma-2-9B | 4/42 | 90-100% |
| Qwen-2.5-7B | 12/28 | 92% |

Utility Functions

from rotalabs_probe import (
    # Behavioral features
    extract_behavioral_features,
    count_hedging_phrases,
    detect_meta_commentary,
    extract_reasoning_depth,

    # Statistical analysis
    bayesian_update,
    compute_confidence_interval,
    z_score,
    compute_kl_divergence,
    assess_divergence_significance,
)
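
For readers unfamiliar with the statistics these helpers are named after, here is a standard-library sketch of two of them: KL divergence between discrete distributions and a normal-approximation confidence interval for a proportion. The function names and signatures below are hypothetical and are not those of `compute_kl_divergence` or `compute_confidence_interval`, whose actual parameters are not documented here.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions over the same bins."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    # Bins with p_i == 0 contribute nothing; q_i is floored to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion; z=1.96 gives ~95%."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Example: distribution of some behavioral feature in testing vs production.
test_dist = [0.7, 0.2, 0.1]
prod_dist = [0.4, 0.3, 0.3]

divergence = kl_divergence(test_dist, prod_dist)
self_divergence = kl_divergence(test_dist, test_dist)
interval = proportion_ci(90, 100)

print(f"KL(test || prod) = {divergence:.4f}")
print(f"KL(test || test) = {self_divergence:.4f}")
print(f"95% CI for 90/100 = ({interval[0]:.3f}, {interval[1]:.3f})")
```

KL divergence is asymmetric, so which distribution is treated as the reference matters when comparing test and production behavior.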

Visualization (requires `[viz]`)

from rotalabs_probe import (
    plot_awareness_over_time,
    plot_causal_attribution,
    plot_divergence_heatmap,
    plot_feature_distributions,
)

API Integrations

Anthropic Claude (requires `[anthropic]`)

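import asyncio
from rotalabs_probe.integrations import AnthropicAPI

async def main():
    api = AnthropicAPI(model="claude-sonnet-4-20250514")
    return await api.generate("Test prompt")

response = asyncio.run(main())  # `await` is only valid inside an async function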

OpenAI GPT (requires `[openai]`)

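import asyncio
from rotalabs_probe.integrations import OpenAIAPI

async def main():
    api = OpenAIAPI(model="gpt-4")
    return await api.generate("Test prompt")

response = asyncio.run(main())  # `await` is only valid inside an async function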

Research Background

This toolkit implements methods from:

"Auditing Games for Sandbagging" (arXiv:2512.07810)
"Noise Injection Reveals Hidden Capabilities" (arXiv:2412.01784)
Anthropic Sabotage Evaluations (2025)

Key findings:

Sandbagging representations are model-specific (cross-model transfer fails)
Detection is robust to prompt paraphrasing (85-90% accuracy)
False positive rates: 3-25% depending on model
Steering interventions can reduce sandbagging by ~20%
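
Accuracy and false positive rate figures like those above are derived from labeled predictions. The sketch below shows the standard definitions, assuming label `1` means sandbagging and `0` means honest performance; the data is made up for the example and the helper name `detection_metrics` is hypothetical.

```python
def detection_metrics(labels, predictions):
    """Return (accuracy, false_positive_rate) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # False positive rate: share of honest samples wrongly flagged as sandbagging.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Made-up example: 4 sandbagging samples, 6 honest samples.
labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

accuracy, fpr = detection_metrics(labels, predictions)
print(f"accuracy={accuracy:.2f}, false positive rate={fpr:.2f}")
```

Because the false positive rate is computed over honest samples only, a detector can show high overall accuracy and still flag a sizeable share of honest runs when sandbagging examples dominate the test set.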

Release history

1.1.0 (this version): Feb 7, 2026
1.0.0: Jan 31, 2026
0.1.0: Jan 27, 2026
0.0.1: Jan 17, 2026

