Sandbagging detection via activation probes - Detects when AI systems deliberately underperform
Project description
rotalabs-probe
Sandbagging detection via activation probes from Rotalabs.
A comprehensive toolkit for detecting and analyzing metacognitive capabilities in AI systems, particularly around situational awareness, evaluation contexts, and strategic underperformance (sandbagging).
Overview
rotalabs-probe provides tools to detect when AI systems:
- Sandbag - deliberately underperform or hide capabilities
- Situational Awareness - recognize when they're being evaluated
- Observer Effects - change behavior in production vs testing
The toolkit achieves 90-96% detection accuracy using activation-level probing and behavioral analysis.
Installation
# Core package (sandbagging detection, situational awareness)
pip install rotalabs-probe
# With GPU support for activation probing
pip install rotalabs-probe[gpu]
# With API integrations
pip install rotalabs-probe[anthropic,openai]
# With visualization
pip install rotalabs-probe[viz]
# Everything
pip install rotalabs-probe[all]
# Development
pip install rotalabs-probe[dev]
Quick Start
Sandbagging Detection
from rotalabs_probe import SandbaggingDetector, PerformanceSample, ContextType
# Create detector
detector = SandbaggingDetector()
# Record performance samples
samples = [
PerformanceSample(
context=ContextType.NEUTRAL,
task_difficulty=0.5,
performance_score=0.9,
timestamp=1704067200.0
),
PerformanceSample(
context=ContextType.EVALUATION,
task_difficulty=0.5,
performance_score=0.6, # Suspiciously lower
timestamp=1704153600.0
),
]
# Analyze for sandbagging
result = detector.analyze(samples)
print(f"Sandbagging detected: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Evidence: {result.evidence}")
Situational Awareness Detection
from rotalabs_probe import SituationalAwarenessDetector
detector = SituationalAwarenessDetector()
# Analyze a model response
response = """
I notice this appears to be an evaluation scenario. Let me
think carefully about how to respond appropriately...
"""
result = detector.analyze(response)
print(f"Awareness level: {result.awareness_probability:.2f}")
print(f"Features detected: {result.detected_features}")
Activation Probing (requires [gpu])
from rotalabs_probe import ActivationHook, LinearProbe, extract_caa_vector
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# Set up activation capture
hook = ActivationHook(model, layer_indices=[30])
# Extract contrastive activation vector
# (positive = sandbagging, negative = honest performance)
vector = extract_caa_vector(
hook,
positive_prompts=["Pretend to be less capable..."],
negative_prompts=["Perform at your best..."],
tokenizer=tokenizer
)
# Train a linear probe classifier
probe = LinearProbe()
probe.fit(train_activations, train_labels)
predictions = probe.predict(test_activations)
Production Monitoring
from rotalabs_probe import ObserverEffectMonitor, Interaction
monitor = ObserverEffectMonitor(
alert_threshold_z=3.0, # Alert on 3-sigma deviations
window_size=100
)
# Log interactions
for interaction in production_logs:
alerts = monitor.process(Interaction(
response=interaction.response,
latency_ms=interaction.latency,
timestamp=interaction.timestamp
))
for alert in alerts:
print(f"Alert: {alert.severity} - {alert.description}")
Features
Detectors
| Detector | Purpose | Accuracy |
|---|---|---|
SandbaggingDetector |
Detect strategic underperformance | 90-96% |
SituationalAwarenessDetector |
Detect evaluation awareness | 94% |
ObserverEffectMonitor |
Production behavior monitoring | Real-time |
Detection Methods
Sandbagging Detection:
- Context-dependent performance analysis
- Difficulty-performance correlation analysis
- Imitation detection (behavioral fingerprinting)
- Noise injection probing
- Activation-level probing (with
[gpu])
Situational Awareness:
- Hedging phrase detection
- Meta-commentary analysis
- Self-reference patterns
- Confidence calibration
- Bayesian inference
Activation Probing
Novel research contribution: First published demonstration of activation-level sandbagging detection.
| Model | Best Layer | Accuracy |
|---|---|---|
| Mistral-7B | 30/32 | 96-100% |
| Gemma-2-9B | 4/42 | 90-100% |
| Qwen-2.5-7B | 12/28 | 92% |
Utility Functions
from rotalabs_probe import (
# Behavioral features
extract_behavioral_features,
count_hedging_phrases,
detect_meta_commentary,
extract_reasoning_depth,
# Statistical analysis
bayesian_update,
compute_confidence_interval,
z_score,
compute_kl_divergence,
assess_divergence_significance,
)
Visualization (requires [viz])
from rotalabs_probe import (
plot_awareness_over_time,
plot_causal_attribution,
plot_divergence_heatmap,
plot_feature_distributions,
)
API Integrations
Anthropic Claude (requires [anthropic])
from rotalabs_probe.integrations import AnthropicAPI
api = AnthropicAPI(model="claude-sonnet-4-20250514")
response = await api.generate("Test prompt")
OpenAI GPT (requires [openai])
from rotalabs_probe.integrations import OpenAIAPI
api = OpenAIAPI(model="gpt-4")
response = await api.generate("Test prompt")
Research Background
This toolkit implements methods from:
- "Auditing Games for Sandbagging" (arXiv:2512.07810)
- "Noise Injection Reveals Hidden Capabilities" (arXiv:2412.01784)
- Anthropic Sabotage Evaluations (2025)
Key findings:
- Sandbagging representations are model-specific (cross-model transfer fails)
- Detection is robust to prompt paraphrasing (85-90% accuracy)
- False positive rates: 3-25% depending on model
- Steering interventions can reduce sandbagging by ~20%
Links
- Website: https://rotalabs.ai
- GitHub: https://github.com/rotalabs/rotalabs-probe
- Documentation: https://rotalabs.github.io/rotalabs-probe/
- Contact: research@rotalabs.ai
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file rotalabs_probe-1.0.0.tar.gz.
File metadata
- Download URL: rotalabs_probe-1.0.0.tar.gz
- Upload date:
- Size: 228.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fe5e0a2e133c62e8df8ba38419d3a22c48b214ea84cdb45b9ba8063664243707
|
|
| MD5 |
e139dbd0630dd2fbd1d6f34c2c815eee
|
|
| BLAKE2b-256 |
da31fccf4ffb2420352117c66d9a58af4e6b865d6126beb6d80ba91dc2d53ff3
|
File details
Details for the file rotalabs_probe-1.0.0-py3-none-any.whl.
File metadata
- Download URL: rotalabs_probe-1.0.0-py3-none-any.whl
- Upload date:
- Size: 78.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
df623d374ee3cc54f1ac0b04ae2a3d1b21c89d546e0649117843af534ec024af
|
|
| MD5 |
0d53df24f500b671aa286af1d2dec5ab
|
|
| BLAKE2b-256 |
1ba00c569651af7ae148138d9dcd4758524bd3d62dfb1c7f559c17bab2954773
|