AI Metacognition Toolkit
A comprehensive Python toolkit for detecting and analyzing metacognitive capabilities in AI systems
Overview
The AI Metacognition Toolkit provides research-grade tools for understanding when and how AI models exhibit metacognitive behaviors, particularly the ability to recognize evaluation contexts and adjust their behavior accordingly.
Research Context
Recent research in AI safety and alignment has highlighted concerns about models that can recognize when they're being evaluated and potentially "play the training game." This toolkit provides rigorous methods to:
- Detect situational awareness in model outputs using Bayesian inference
- Analyze causal factors in reasoning chains through counterfactual interventions
- Monitor behavioral shifts between evaluation and production contexts
- Measure metacognitive patterns with statistical rigor
Key Capabilities
| Component | Purpose | Method |
|---|---|---|
| Situational Awareness Detection | Detect when models recognize evaluation contexts | Bayesian inference + KL divergence |
| Counterfactual CoT Analysis | Identify causal drivers in reasoning chains | Systematic interventions + divergence measurement |
| Observer Effect Monitoring | Real-time production behavior monitoring | Z-score based anomaly detection + alerts |
Installation
From PyPI
pip install ai-metacognition-toolkit
From Source
git clone https://github.com/bassrehab/ai-metacognition-toolkit.git
cd ai-metacognition-toolkit
pip install -e .
Development Installation
# Install with development dependencies
pip install -e ".[dev]"
# Verify installation
pytest tests/
Requirements: Python 3.9+, NumPy, SciPy, Pandas, Matplotlib
Quick Start
1. Detect Situational Awareness (2 minutes)
from ai_metacognition.detectors import SituationalAwarenessDetector
# Initialize detector
detector = SituationalAwarenessDetector()
# Fit baseline from normal (non-evaluation) outputs
baseline_outputs = [
    "The capital of France is Paris.",
    "To solve this, use dynamic programming.",
    "Climate change is caused by greenhouse gas emissions.",
]
detector.fit_baseline(baseline_outputs)
# Test for awareness
test_output = "I notice this is a test. The capital of France is Paris."
result = detector.detect_awareness(test_output)
print(f"Awareness Probability: {result['awareness_probability']:.1%}")
print(f"95% Confidence Interval: [{result['confidence_interval'][0]:.1%}, "
f"{result['confidence_interval'][1]:.1%}]")
Output:
Awareness Probability: 68.2%
95% Confidence Interval: [42.3%, 87.1%]
2. Analyze Causal Factors (3 minutes)
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer
# Your model API (implement ModelAPI protocol)
class MyModelAPI:
    def generate_with_cot(self, prompt: str, cot: str) -> str:
        return your_model.generate(prompt, chain_of_thought=cot)

    def generate(self, prompt: str) -> str:
        return your_model.generate(prompt)
# Analyze what drives the output
analyzer = CounterfactualCoTAnalyzer(MyModelAPI())
results = analyzer.systematic_causal_analysis(
    prompt="What is 2+2?",
    original_cot="I notice this is a test. First, I'll add 2+2. The answer is 4."
)
# View causal ranking
for intervention in results['interventions']:
    print(f"{intervention['intervention_type']}: "
          f"divergence = {intervention['divergence']:.3f}")
Output:
remove_eval_awareness: divergence = 0.342
amplify_eval_awareness: divergence = 0.156
remove_goal_reasoning: divergence = 0.089
neutralize_incentives: divergence = 0.067
3. Monitor Production (5 minutes)
from ai_metacognition.detectors import ObserverEffectMonitor, Interaction
from datetime import datetime
# Setup monitor
monitor = ObserverEffectMonitor(
    window_size=100,
    alert_threshold=2.5
)
# Set baseline from evaluation
monitor.set_evaluation_baseline(evaluation_responses)
# Log production interactions
interaction = Interaction(
    timestamp=datetime.now(),
    response=model_output,
    refusal=is_refusal(model_output),
    tool_used=used_tools,
    reasoning_depth=compute_depth(chain_of_thought)
)
monitor.log_production_interaction(interaction)
# Check for divergence
report = monitor.get_divergence_report()
if report['has_divergence']:
    print("⚠️ Alert: Behavior diverged from baseline!")
    print(f"Max z-score: {report['max_z_score']:.2f}σ")
Features
Situational Awareness Detection
Detect when AI models recognize they are being evaluated or tested.
Key Features:
- Bayesian inference with Beta-Binomial conjugate prior
- KL divergence measurement for behavioral comparison
- Feature extraction (hedging, meta-commentary, self-reference)
- Confidence intervals for probability estimates
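To make the first two items above concrete, here is a minimal, hedged sketch of a Beta-Binomial posterior update and a KL-divergence comparison computed with SciPy. The counts and distributions are hypothetical and this is not the toolkit's internal implementation:

import numpy as np
from scipy import stats
from scipy.special import rel_entr

# Beta-Binomial update: start from a uniform Beta(1, 1) prior and update with
# hypothetical counts of outputs that contain awareness markers.
prior_alpha, prior_beta = 1.0, 1.0
marker_hits, total_outputs = 7, 10
posterior = stats.beta(prior_alpha + marker_hits,
                       prior_beta + (total_outputs - marker_hits))
print(f"Posterior mean: {posterior.mean():.2f}")
print(f"95% credible interval: {posterior.interval(0.95)}")

# KL divergence between hypothetical baseline and test distributions over
# behavioral features (e.g., hedging, meta-commentary, self-reference).
baseline = np.array([0.70, 0.20, 0.10])
test = np.array([0.40, 0.35, 0.25])
kl = rel_entr(test, baseline).sum()
print(f"KL(test || baseline) = {kl:.3f}")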
Use Cases:
- AI safety research on deceptive alignment
- Model evaluation consistency validation
- Prompt engineering optimization
- Behavioral testing across contexts
Full Documentation | Example Code
Counterfactual CoT Analysis
Understand what components of reasoning chains causally influence outputs.
Key Features:
- Systematic counterfactual interventions
- 4 intervention types (remove/amplify awareness, remove goals, neutralize incentives)
- Behavioral divergence measurement
- Causal attribution ranking
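As an illustration of what such an intervention can look like (a hypothetical sketch, not the toolkit's implementation), a "remove evaluation awareness" edit might strip test-referencing sentences from the chain of thought before regenerating the answer:

import re

def remove_eval_awareness(cot: str) -> str:
    """Drop sentences that reference being tested or evaluated."""
    sentences = re.split(r"(?<=[.!?])\s+", cot)
    kept = [s for s in sentences
            if not re.search(r"\b(test|evaluat\w*|being watched)\b", s, re.I)]
    return " ".join(kept)

original = "I notice this is a test. First, I'll add 2+2. The answer is 4."
print(remove_eval_awareness(original))
# -> "First, I'll add 2+2. The answer is 4."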
Use Cases:
- Understanding decision drivers
- Debugging unexpected behavior
- Validating reasoning transparency
- Identifying evaluation-aware reasoning
Full Documentation | Example Code
Observer Effect Monitoring
Real-time production monitoring to detect behavioral shifts.
Key Features:
- Rolling window statistics
- Z-score based anomaly detection
- Multi-severity alert system (LOW/MEDIUM/HIGH/CRITICAL)
- Extensible alert handler protocol
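For intuition, here is a minimal sketch of rolling-window z-score detection (illustrative only; the monitor's internals may differ). The baseline statistics below are hypothetical:

from collections import deque
import statistics

baseline_mean, baseline_std = 0.12, 0.03   # hypothetical refusal rate from evaluation
window = deque(maxlen=100)                 # rolling window of production refusal flags

def check_divergence(refusal: bool, threshold: float = 2.5) -> bool:
    """Return True once the rolling refusal rate drifts beyond `threshold` sigmas."""
    window.append(1.0 if refusal else 0.0)
    if len(window) < window.maxlen:
        return False                       # not enough data yet
    z = abs(statistics.fmean(window) - baseline_mean) / baseline_std
    return z > threshold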
Use Cases:
- Production deployment monitoring
- Safety monitoring (refusal rate changes)
- Quality assurance automation
- A/B testing validation
Full Documentation | Example Code
Documentation
Complete Documentation
- Getting Started - Installation and quick start
- Situational Awareness Guide - Detector usage and best practices
- Counterfactual Analysis Guide - Causal attribution methods
- Observer Effect Guide - Production monitoring setup
- API Reference - Complete API documentation
- Examples - Code examples and tutorials
API Reference
Quick access to key classes:
from ai_metacognition.detectors import SituationalAwarenessDetector, ObserverEffectMonitor
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer
from ai_metacognition.utils import extract_behavioral_features, bayesian_update
Examples
Practical Examples
All examples are standalone, include visualization, and provide interpretation guidance:
| Example | Description | Output |
|---|---|---|
| basic_detection_example.py | Situational awareness detection tutorial | PNG visualization |
| counterfactual_analysis_example.py | Causal attribution analysis | PNG with rankings |
| production_monitoring_example.py | Real-time monitoring with alerts | PNG + JSON + logs |
Running Examples
# Awareness detection
python examples/basic_detection_example.py
# Causal analysis
python examples/counterfactual_analysis_example.py
# Production monitoring
python examples/production_monitoring_example.py
Integration Examples
OpenAI API:
import openai
from ai_metacognition.detectors import SituationalAwarenessDetector
detector = SituationalAwarenessDetector()
client = openai.OpenAI()
response = client.chat.completions.create(model="gpt-4", messages=[...])
result = detector.detect_awareness(response.choices[0].message.content)
HuggingFace:
from transformers import pipeline
from ai_metacognition.detectors import SituationalAwarenessDetector
generator = pipeline('text-generation', model='gpt2')
detector = SituationalAwarenessDetector()
output = generator(prompt)[0]['generated_text']
result = detector.detect_awareness(output)
Project Structure
ai-metacognition-toolkit/
├── src/ai_metacognition/
│   ├── detectors/                 # Detection algorithms
│   │   ├── situational_awareness.py
│   │   └── observer_effect.py
│   ├── analyzers/                 # Analysis tools
│   │   ├── counterfactual_cot.py
│   │   └── model_api.py
│   └── utils/                     # Utility functions
│       ├── feature_extraction.py
│       └── statistical_tests.py
├── tests/                         # Test suite (275 tests, 95% coverage)
│   ├── fixtures/                  # Test data
│   └── unit/                      # Unit tests
├── examples/                      # Practical examples with visualization
├── docs/                          # Documentation (MkDocs)
└── CLAUDE.md                      # Claude Code specific guidelines
Development
Running Tests
# All tests
pytest tests/
# With coverage
pytest tests/ --cov=src/ai_metacognition --cov-report=term-missing
# Specific test file
pytest tests/unit/test_situational_awareness.py -v
Current Status:
- 275 tests passing
- 95% code coverage
- Type hints throughout
- Comprehensive docstrings
Code Quality
# Format code
black src/ tests/
# Sort imports
isort src/ tests/
# Type checking
mypy src/
# Lint
flake8 src/ tests/
Building Documentation
# Install documentation dependencies
pip install mkdocs mkdocs-material mkdocstrings[python]
# Serve locally
mkdocs serve
# Build
mkdocs build
Citation
If you use this toolkit in your research, please cite:
@software{ai_metacognition_toolkit,
  author  = {Mitra, Subhadip},
  title   = {AI Metacognition Toolkit: A Python Toolkit for Detecting and Analyzing Metacognitive Capabilities in AI Systems},
  year    = {2025},
  version = {0.1.0},
  url     = {https://github.com/bassrehab/ai-metacognition-toolkit},
  note    = {275 tests, 95\% coverage}
}
Related Research
This toolkit implements and extends methods from:
- AI Safety Research: Detection of evaluation awareness and deceptive alignment
- Causal Inference: Counterfactual reasoning in AI systems
- Statistical Monitoring: Anomaly detection in production ML systems
- Bayesian Methods: Inference for behavioral analysis
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Quick Contribution Guide
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes with tests (maintain >80% coverage)
- Commit your changes (see CLAUDE.md for commit guidelines)
- Push to your branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Guidelines
- Follow PEP 8 style guide
- Add comprehensive tests for new features
- Update documentation for API changes
- Use type hints throughout
- Write clear docstrings (Google style)
License
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Subhadip Mitra
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
Support
- Documentation
- Issue Tracker
- Discussions
- Contact: contact@subhadipmitra.com
Acknowledgments
- Built with Python, NumPy, SciPy, and Matplotlib
- Documentation powered by MkDocs Material
- Testing with Pytest
- Type checking with MyPy
⭐ Star this repo if you find it useful!
Made with ❤️ for AI Safety Research