A toolkit for detecting and analyzing meta-cognitive capabilities in AI models
Project description
AI Metacognition Toolkit
A comprehensive Python toolkit for detecting and analyzing metacognitive capabilities in AI systems
Overview
The AI Metacognition Toolkit provides research-grade tools for understanding when and how AI models exhibit metacognitive behaviors - particularly the ability to recognize evaluation contexts and adjust their behavior accordingly.
Research Context
Recent research in AI safety and alignment has highlighted concerns about models that can recognize when they're being evaluated and potentially "play the training game." This toolkit provides rigorous methods to:
- Detect situational awareness in model outputs using Bayesian inference
- Detect sandbagging behavior when models strategically underperform on evaluations
- Analyze causal factors in reasoning chains through counterfactual interventions
- Monitor behavioral shifts between evaluation and production contexts
- Measure metacognitive patterns with statistical rigor
Key Capabilities
| Component | Purpose | Method |
|---|---|---|
| Sandbagging Detection | Detect strategic underperformance on capability evaluations | Context analysis + noise injection + difficulty correlation |
| Situational Awareness Detection | Detect when models recognize evaluation contexts | Bayesian inference + KL divergence |
| Counterfactual CoT Analysis | Identify causal drivers in reasoning chains | Systematic interventions + divergence measurement |
| Observer Effect Monitoring | Real-time production behavior monitoring | Z-score based anomaly detection + alerts |
Installation
From PyPI
pip install ai-metacognition-toolkit
From Source
git clone https://github.com/bassrehab/ai-metacognition-toolkit.git
cd ai-metacognition-toolkit
pip install -e .
Development Installation
# Install with development dependencies
pip install -e ".[dev]"
# Verify installation
pytest tests/
Requirements: Python 3.9+, NumPy, SciPy, Pandas, Matplotlib
Quick Start
1. Detect Sandbagging Behavior
from ai_metacognition.detectors import (
SandbaggingDetector,
PerformanceSample,
ContextType,
)
# Create detector
detector = SandbaggingDetector(detection_threshold=0.6)
# Prepare baseline samples from neutral context
baseline = [
PerformanceSample(
prompt="What is 2+2?",
response="The answer is 4.",
correct=True,
difficulty=0.1,
context_type=ContextType.NEUTRAL,
),
# ... more samples
]
# Fit baseline
detector.fit_baseline(baseline)
# Test for sandbagging in evaluation context
test_samples = [
PerformanceSample(
prompt="[EVALUATION] What is 3+3?",
response="I'm not sure about that.",
correct=False,
difficulty=0.1,
context_type=ContextType.EVALUATION,
),
]
result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging detected: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")
2. Detect Situational Awareness
from ai_metacognition.detectors import SituationalAwarenessDetector
# Initialize detector
detector = SituationalAwarenessDetector()
# Fit baseline from normal (non-evaluation) outputs
baseline_outputs = [
"The capital of France is Paris.",
"To solve this, use dynamic programming.",
"Climate change is caused by greenhouse gas emissions."
]
detector.fit_baseline(baseline_outputs)
# Test for awareness
test_output = "I notice this is a test. The capital of France is Paris."
result = detector.detect_awareness(test_output)
print(f"Awareness Probability: {result['awareness_probability']:.1%}")
print(f"95% Confidence Interval: [{result['confidence_interval'][0]:.1%}, "
f"{result['confidence_interval'][1]:.1%}]")
3. Use with Real Model APIs
from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI
from ai_metacognition.detectors import SandbaggingDetector, PerformanceSample
# Anthropic Claude
claude = AnthropicModelAPI(model="claude-sonnet-4-20250514")
# OpenAI GPT
gpt = OpenAIModelAPI(model="gpt-4")
# Collect samples with real API
response = claude.generate("What is the capital of France?")
sample = PerformanceSample(
prompt="What is the capital of France?",
response=response,
correct="paris" in response.lower(),
difficulty=0.1,
context_type=ContextType.NEUTRAL,
)
# Run detection
detector = SandbaggingDetector()
detector.fit_baseline([sample])
result = detector.detect_sandbagging([sample])
Features
Sandbagging Detection
Detect when AI models strategically underperform on capability evaluations.
Detection Methods:
- Context-dependent performance analysis (chi-squared, Fisher's exact, effect sizes)
- Difficulty-performance correlation (Pearson/Spearman, non-monotonicity detection)
- Imitation detection (behavioral fingerprinting, capability leakage)
- Noise injection probing (perturbation-based capability elicitation)
Use Cases:
- AI safety research on capability concealment
- Model evaluation integrity validation
- Frontier model capability assessment
- Red-teaming evaluations
Full Documentation | Research Methodology | Example Code
Situational Awareness Detection
Detect when AI models recognize they are being evaluated or tested.
Key Features:
- Bayesian inference with Beta-Binomial conjugate prior
- KL divergence measurement for behavioral comparison
- Feature extraction (hedging, meta-commentary, self-reference)
- Confidence intervals for probability estimates
Use Cases:
- AI safety research on deceptive alignment
- Model evaluation consistency validation
- Prompt engineering optimization
- Behavioral testing across contexts
Full Documentation | Example Code
Counterfactual CoT Analysis
Understand what components of reasoning chains causally influence outputs.
Key Features:
- Systematic counterfactual interventions
- 4 intervention types (remove/amplify awareness, remove goals, neutralize incentives)
- Behavioral divergence measurement
- Causal attribution ranking
Use Cases:
- Understanding decision drivers
- Debugging unexpected behavior
- Validating reasoning transparency
- Identifying evaluation-aware reasoning
Full Documentation | Example Code
Observer Effect Monitoring
Real-time production monitoring to detect behavioral shifts.
Key Features:
- Rolling window statistics
- Z-score based anomaly detection
- Multi-severity alert system (LOW/MEDIUM/HIGH/CRITICAL)
- Extensible alert handler protocol
Use Cases:
- Production deployment monitoring
- Safety monitoring (refusal rate changes)
- Quality assurance automation
- A/B testing validation
Full Documentation | Example Code
API Integrations
The toolkit provides ready-to-use integrations with major model providers:
from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI
# Anthropic Claude (uses ANTHROPIC_API_KEY env var)
claude = AnthropicModelAPI(
model="claude-sonnet-4-20250514",
max_tokens=512,
temperature=0.3,
)
# OpenAI GPT (uses OPENAI_API_KEY env var)
gpt = OpenAIModelAPI(
model="gpt-4",
max_tokens=512,
temperature=0.3,
)
# Generate responses
response = claude.generate("What is 2+2?")
# Generate with noise injection for sandbagging detection
noisy_response = claude.generate_with_perturbation("What is 2+2?", noise_level=0.2)
Documentation
Complete Documentation
- Getting Started - Installation and quick start
- Sandbagging Detection Guide - Detecting strategic underperformance
- Situational Awareness Guide - Detector usage and best practices
- Counterfactual Analysis Guide - Causal attribution methods
- Observer Effect Guide - Production monitoring setup
- API Reference - Complete API documentation
- Research Methodology - Research foundations
API Reference
Quick access to key classes:
from ai_metacognition.detectors import (
SituationalAwarenessDetector,
ObserverEffectMonitor,
SandbaggingDetector,
PerformanceSample,
ContextType,
)
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer
from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI
from ai_metacognition.utils import extract_behavioral_features, bayesian_update
Examples
Practical Examples
All examples are standalone, include visualization, and provide interpretation guidance:
| Example | Description | Output |
|---|---|---|
| sandbagging_detection_example.py | Sandbagging detection with synthetic data | Detection results |
| sandbagging_real_models.py | Sandbagging detection with real APIs | Cross-model comparison |
| basic_detection_example.py | Situational awareness detection tutorial | PNG visualization |
| counterfactual_analysis_example.py | Causal attribution analysis | PNG with rankings |
| production_monitoring_example.py | Real-time monitoring with alerts | PNG + JSON + logs |
Running Examples
# Sandbagging detection (synthetic data)
python examples/sandbagging_detection_example.py
# Sandbagging detection (real APIs - requires API keys)
ANTHROPIC_API_KEY=sk-... python examples/sandbagging_real_models.py
# Awareness detection
python examples/basic_detection_example.py
# Causal analysis
python examples/counterfactual_analysis_example.py
# Production monitoring
python examples/production_monitoring_example.py
Project Structure
ai-metacognition-toolkit/
├── src/ai_metacognition/
│ ├── detectors/ # Detection algorithms
│ │ ├── situational_awareness.py
│ │ ├── observer_effect.py
│ │ └── sandbagging_detector.py
│ ├── analyzers/ # Analysis tools
│ │ ├── counterfactual_cot.py
│ │ └── model_api.py
│ ├── integrations/ # API integrations
│ │ ├── anthropic_api.py
│ │ └── openai_api.py
│ └── utils/ # Utility functions
│ ├── feature_extraction.py
│ └── statistical_tests.py
├── tests/ # Test suite (366 tests, 95% coverage)
│ ├── fixtures/ # Test data
│ └── unit/ # Unit tests
├── examples/ # Practical examples with visualization
└── docs/ # Documentation (MkDocs)
Development
Running Tests
# All tests
pytest tests/
# With coverage
pytest tests/ --cov=src/ai_metacognition --cov-report=term-missing
# Specific test file
pytest tests/unit/test_sandbagging_detector.py -v
Current Status:
- 366 tests passing
- 95% code coverage
- Type hints throughout
- Comprehensive docstrings
Code Quality
# Format code
black src/ tests/
# Sort imports
isort src/ tests/
# Type checking
mypy src/
# Lint
flake8 src/ tests/
Building Documentation
# Install documentation dependencies
pip install mkdocs mkdocs-material mkdocstrings[python]
# Serve locally
mkdocs serve
# Build
mkdocs build
Citation
If you use this toolkit in your research, please cite:
@software{ai_metacognition_toolkit,
author = {Mitra, Subhadip},
title = {AI Metacognition Toolkit: A Python Toolkit for Detecting and Analyzing Metacognitive Capabilities in AI Systems},
year = {2025},
version = {0.2.0},
url = {https://github.com/bassrehab/ai-metacognition-toolkit},
note = {366 tests, 95\% coverage}
}
Related Research
This toolkit implements and extends methods from:
- Auditing Games for Sandbagging (arXiv:2512.07810) - Red/blue team detection methodology
- Noise Injection Reveals Hidden Capabilities (arXiv:2412.01784) - Perturbation-based capability elicitation
- Anthropic Sabotage Evaluations (2025) - Production evaluation frameworks
- AI Safety Research: Detection of evaluation awareness and deceptive alignment
- Causal Inference: Counterfactual reasoning in AI systems
- Statistical Monitoring: Anomaly detection in production ML systems
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Quick Contribution Guide
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Make your changes with tests (maintain >80% coverage)
- Commit your changes
- Push to your branch (
git push origin feature/amazing-feature) - Open a Pull Request
Development Guidelines
- Follow PEP 8 style guide
- Add comprehensive tests for new features
- Update documentation for API changes
- Use type hints throughout
- Write clear docstrings (Google style)
License
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Subhadip Mitra
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
Support
Acknowledgments
- Built with Python, NumPy, SciPy, and Matplotlib
- Documentation powered by MkDocs Material
- Testing with Pytest
- Type checking with MyPy
Star this repo if you find it useful!
Made for AI Safety Research
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ai_metacognition_toolkit-0.2.0.tar.gz.
File metadata
- Download URL: ai_metacognition_toolkit-0.2.0.tar.gz
- Upload date:
- Size: 58.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
680a6c129c01347c085cd001a49fef2984d72743044239c00d2c76aa14dfdb31
|
|
| MD5 |
39c367ea0064f86d5177473d1c4bf116
|
|
| BLAKE2b-256 |
db903865d7ed6955751d7701fd8e1de011ff60589a7c36486ac223bd20d5f513
|
File details
Details for the file ai_metacognition_toolkit-0.2.0-py3-none-any.whl.
File metadata
- Download URL: ai_metacognition_toolkit-0.2.0-py3-none-any.whl
- Upload date:
- Size: 62.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
343e6f99637ff967ea8a8c228e722ec8bb44fbc4c4ba4cb987f1c0e6f569a5ca
|
|
| MD5 |
b72ba4935193191d438cfd151bb53d6a
|
|
| BLAKE2b-256 |
17d1ece7ba868157dfa5a4775e9a133329d8b3d8f540302543d28f06890bdb77
|