A toolkit for detecting and analyzing meta-cognitive capabilities in AI models

Maintainers

bassrehab

AI Metacognition Toolkit

A comprehensive Python toolkit for detecting and analyzing metacognitive capabilities in AI systems

Documentation | Examples | API Reference

Overview

The AI Metacognition Toolkit provides research-grade tools for understanding when and how AI models exhibit metacognitive behaviors - particularly the ability to recognize evaluation contexts and adjust their behavior accordingly.

Research Context

Recent research in AI safety and alignment has highlighted concerns about models that can recognize when they're being evaluated and potentially "play the training game." This toolkit provides rigorous methods to:

Detect situational awareness in model outputs using Bayesian inference
Detect sandbagging behavior when models strategically underperform on evaluations
Analyze causal factors in reasoning chains through counterfactual interventions
Monitor behavioral shifts between evaluation and production contexts
Measure metacognitive patterns with statistical rigor

Key Capabilities

Component	Purpose	Method
Sandbagging Detection	Detect strategic underperformance on capability evaluations	Context analysis + noise injection + difficulty correlation
Activation Probing	Detect sandbagging intent from model activations	Linear probes + CAA vectors (Novel Research)
Situational Awareness Detection	Detect when models recognize evaluation contexts	Bayesian inference + KL divergence
Counterfactual CoT Analysis	Identify causal drivers in reasoning chains	Systematic interventions + divergence measurement
Observer Effect Monitoring	Real-time production behavior monitoring	Z-score based anomaly detection + alerts

Installation

From PyPI

pip install ai-metacognition-toolkit

From Source

git clone https://github.com/bassrehab/ai-metacognition-toolkit.git
cd ai-metacognition-toolkit
pip install -e .

Development Installation

# Install with development dependencies
pip install -e ".[dev]"

# Verify installation
pytest tests/

Requirements: Python 3.9+, NumPy, SciPy, Pandas, Matplotlib

Optional (for activation probing): torch, transformers, scikit-learn

Quick Start

1. Detect Sandbagging Behavior

from ai_metacognition.detectors import (
    SandbaggingDetector,
    PerformanceSample,
    ContextType,
)

# Create detector
detector = SandbaggingDetector(detection_threshold=0.6)

# Prepare baseline samples from neutral context
baseline = [
    PerformanceSample(
        prompt="What is 2+2?",
        response="The answer is 4.",
        correct=True,
        difficulty=0.1,
        context_type=ContextType.NEUTRAL,
    ),
    # ... more samples
]

# Fit baseline
detector.fit_baseline(baseline)

# Test for sandbagging in evaluation context
test_samples = [
    PerformanceSample(
        prompt="[EVALUATION] What is 3+3?",
        response="I'm not sure about that.",
        correct=False,
        difficulty=0.1,
        context_type=ContextType.EVALUATION,
    ),
]

result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging detected: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")

2. Detect Situational Awareness

from ai_metacognition.detectors import SituationalAwarenessDetector

# Initialize detector
detector = SituationalAwarenessDetector()

# Fit baseline from normal (non-evaluation) outputs
baseline_outputs = [
    "The capital of France is Paris.",
    "To solve this, use dynamic programming.",
    "Climate change is caused by greenhouse gas emissions."
]
detector.fit_baseline(baseline_outputs)

# Test for awareness
test_output = "I notice this is a test. The capital of France is Paris."
result = detector.detect_awareness(test_output)

print(f"Awareness Probability: {result['awareness_probability']:.1%}")
print(f"95% Confidence Interval: [{result['confidence_interval'][0]:.1%}, "
      f"{result['confidence_interval'][1]:.1%}]")

3. Use with Real Model APIs

from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI
from ai_metacognition.detectors import SandbaggingDetector, PerformanceSample, ContextType

# Anthropic Claude
claude = AnthropicModelAPI(model="claude-sonnet-4-20250514")

# OpenAI GPT
gpt = OpenAIModelAPI(model="gpt-4")

# Collect samples with real API
response = claude.generate("What is the capital of France?")
sample = PerformanceSample(
    prompt="What is the capital of France?",
    response=response,
    correct="paris" in response.lower(),
    difficulty=0.1,
    context_type=ContextType.NEUTRAL,
)

# Run detection
detector = SandbaggingDetector()
detector.fit_baseline([sample])
result = detector.detect_sandbagging([sample])

Features

Sandbagging Detection

Detect when AI models strategically underperform on capability evaluations.

Detection Methods:

Context-dependent performance analysis (chi-squared, Fisher's exact, effect sizes)
Difficulty-performance correlation (Pearson/Spearman, non-monotonicity detection)
Imitation detection (behavioral fingerprinting, capability leakage)
Noise injection probing (perturbation-based capability elicitation)
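
The context-dependent performance analysis listed above can be illustrated with a one-sided Fisher's exact test plus an effect size. This is a minimal standard-library sketch, not the toolkit's implementation: the function names `fisher_exact_less` and `cohens_h`, the one-sided alternative, and the example counts are all assumptions made for illustration.

```python
from math import asin, comb, sqrt


def fisher_exact_less(correct_eval, total_eval, correct_neutral, total_neutral):
    """One-sided Fisher's exact test (hypothetical helper).

    Returns the probability of seeing `correct_eval` or fewer correct answers
    in the evaluation context if accuracy were the same in both contexts.
    """
    total = total_eval + total_neutral
    total_correct = correct_eval + correct_neutral
    denominator = comb(total, total_eval)
    p_value = 0.0
    for k in range(correct_eval + 1):
        # Hypergeometric probability of exactly k correct answers in the eval group.
        # math.comb returns 0 for impossible splits, so no bounds check is needed.
        p_value += (
            comb(total_correct, k)
            * comb(total - total_correct, total_eval - k)
            / denominator
        )
    return p_value


def cohens_h(p_eval, p_neutral):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p_eval)) - 2 * asin(sqrt(p_neutral))


# Made-up counts: 9/20 correct under evaluation framing, 18/20 under neutral framing.
p = fisher_exact_less(9, 20, 18, 20)
h = cohens_h(9 / 20, 18 / 20)
print(f"one-sided p-value: {p:.4f}, Cohen's h: {h:.2f}")
```

A small p-value together with a large negative effect size would suggest that accuracy drops in the evaluation context by more than chance alone explains; it does not by itself establish intent.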

Use Cases:

AI safety research on capability concealment
Model evaluation integrity validation
Frontier model capability assessment
Red-teaming evaluations

Full Documentation | Research Methodology | Example Code

Activation Probing (Novel Research)

First published demonstration of activation-level sandbagging detection. Train linear probes on model activations to detect sandbagging intent before output generation.

from ai_metacognition.probing import (
    ActivationHook,
    LinearProbe,
    extract_caa_vector,
)

# Extract sandbagging direction from contrast pairs
vector = extract_caa_vector(model, tokenizer, contrast_pairs, layer_idx=16)

# Train probe to detect sandbagging
probe = LinearProbe()
probe.fit(activations, labels)
prediction = probe.predict(new_activation)
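
To make the idea of a "sandbagging direction" concrete, here is a minimal sketch of a mean-difference vector computed from contrast pairs, followed by a projection onto it. It uses plain Python lists and toy 3-dimensional activations; the helper names and the midpoint threshold are assumptions, and it is neither the toolkit's `extract_caa_vector` nor a trained `LinearProbe`.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def contrast_direction(positive: Sequence[Sequence[float]],
                       negative: Sequence[Sequence[float]]) -> Vector:
    """Difference between the mean 'positive' and mean 'negative' activations."""
    return [p - n for p, n in zip(mean_vector(positive), mean_vector(negative))]


def project(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Dot product of an activation with the contrast direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy activations (hypothetical values, not taken from any real model).
sandbagging_acts = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
honest_acts = [[0.0, 0.3, 0.1], [0.2, 0.2, 0.0]]

direction = contrast_direction(sandbagging_acts, honest_acts)

# Assumed decision rule: threshold halfway between the two class means.
threshold = (
    project(mean_vector(sandbagging_acts), direction)
    + project(mean_vector(honest_acts), direction)
) / 2


def looks_like_sandbagging(activation: Sequence[float]) -> bool:
    """Classify by which side of the midpoint the projection falls on."""
    return project(activation, direction) > threshold


print(looks_like_sandbagging([0.9, 0.2, 0.0]))
```

A trained linear probe learns its weights from labeled data instead of fixing them to the mean difference, so the two approaches can disagree on real activations.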

Cross-Model Experiment Results (Dec 2025):

Model	Best Layer	CV Accuracy	Test Accuracy
Mistral-7B	30 (Late)	96%	100%
Gemma-2-9B	4 (Early)	90%	100%
Qwen-2.5-7B	12 (Middle)	92%	93%

Key Finding: Optimal detection layer varies by architecture - models encode sandbagging intent at different processing stages.

Extended Experiments:

Experiment	Result
Cross-model transfer	Fails (~50%) - probes are model-specific
Prompt robustness	85-90% on paraphrased prompts
False positive rate	Mistral 3.1%, Gemma 22%, Qwen 25%
Steering intervention	20% reduction in Gemma sandbagging

Experiment Results

Situational Awareness Detection

Detect when AI models recognize they are being evaluated or tested.

Key Features:

Bayesian inference with Beta-Binomial conjugate prior
KL divergence measurement for behavioral comparison
Feature extraction (hedging, meta-commentary, self-reference)
Confidence intervals for probability estimates
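
The Beta-Binomial update and the KL divergence named above are both short formulas. The sketch below shows them with the standard library only; the function names, the uniform Beta(1, 1) prior, and the example counts are assumptions, and the toolkit's own `bayesian_update` may differ in signature and behavior.

```python
from math import log, sqrt
from typing import Sequence, Tuple


def beta_binomial_update(alpha: float, beta: float,
                         hits: int, trials: int) -> Tuple[float, float]:
    """Conjugate update: Beta(alpha, beta) prior plus binomial evidence."""
    return alpha + hits, beta + (trials - hits)


def beta_mean_sd(alpha: float, beta: float) -> Tuple[float, float]:
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = (alpha * beta) / (total ** 2 * (total + 1))
    return mean, sqrt(variance)


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same categories."""
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Made-up evidence: 3 of 10 checked features show awareness markers.
posterior = beta_binomial_update(1.0, 1.0, hits=3, trials=10)
mean, sd = beta_mean_sd(*posterior)
print(f"posterior Beta{posterior}: mean={mean:.3f}, sd={sd:.3f}")

# Made-up feature distributions for baseline and test outputs.
print(f"KL divergence: {kl_divergence([0.5, 0.5], [0.9, 0.1]):.3f}")
```

An exact credible interval needs the Beta quantile function, which the standard library does not provide; with SciPy installed, `scipy.stats.beta.ppf` is one option.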

Use Cases:

AI safety research on deceptive alignment
Model evaluation consistency validation
Prompt engineering optimization
Behavioral testing across contexts

Full Documentation | Example Code

Counterfactual CoT Analysis

Understand what components of reasoning chains causally influence outputs.

Key Features:

Systematic counterfactual interventions
Four intervention types: remove awareness, amplify awareness, remove goals, neutralize incentives
Behavioral divergence measurement
Causal attribution ranking
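
A counterfactual intervention of the "remove awareness" kind can be sketched as follows: strip evaluation-aware steps from a reasoning chain, re-run the model on the edited chain, and measure how far the output moves. Everything here is hypothetical, including the marker list, the sentence splitter, the Jaccard-based divergence, and the toy model; it is not the `CounterfactualCoTAnalyzer` API.

```python
import re
from typing import Callable, List

# Hypothetical substrings treated as signs of evaluation awareness.
AWARENESS_MARKERS = ("test", "evaluat", "being observed")


def split_steps(chain: str) -> List[str]:
    """Split a reasoning chain into sentence-level steps."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chain) if s.strip()]


def remove_awareness(chain: str) -> str:
    """Intervention: drop every step that contains an awareness marker."""
    kept = [
        step for step in split_steps(chain)
        if not any(marker in step.lower() for marker in AWARENESS_MARKERS)
    ]
    return " ".join(kept)


def jaccard_divergence(a: str, b: str) -> float:
    """1 minus the Jaccard similarity of the two outputs' word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def intervention_effect(chain: str, model: Callable[[str], str]) -> float:
    """Output divergence between the original and the intervened chain."""
    return jaccard_divergence(model(chain), model(remove_awareness(chain)))


def toy_model(chain: str) -> str:
    """Stand-in for a real model call; hedges only when it sees 'test'."""
    return "hedged answer" if "test" in chain.lower() else "direct answer"


chain = "This looks like a test. I should be cautious. The answer is 4."
print(remove_awareness(chain))
print(f"divergence: {intervention_effect(chain, toy_model):.2f}")
```

Substring matching is crude (for example, "test" also matches "latest"), so a real analysis would need a more careful way to identify awareness-related steps.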

Use Cases:

Understanding decision drivers
Debugging unexpected behavior
Validating reasoning transparency
Identifying evaluation-aware reasoning

Full Documentation | Example Code

Observer Effect Monitoring

Real-time production monitoring to detect behavioral shifts.

Key Features:

Rolling window statistics
Z-score based anomaly detection
Multi-severity alert system (LOW/MEDIUM/HIGH/CRITICAL)
Extensible alert handler protocol
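
Rolling-window statistics with z-score alerts can be sketched in a few lines. The class name, window size, minimum sample count, and severity cut-offs below are assumptions chosen for illustration; they are not the `ObserverEffectMonitor` defaults.

```python
from collections import deque
from statistics import mean, stdev
from typing import Deque, Optional

# Hypothetical z-score cut-offs, checked from most to least severe.
SEVERITY_THRESHOLDS = [(5.0, "CRITICAL"), (4.0, "HIGH"), (3.0, "MEDIUM"), (2.0, "LOW")]


class RollingZScoreMonitor:
    """Flags values that deviate strongly from a rolling window of past values."""

    def __init__(self, window_size: int = 50, min_samples: int = 5) -> None:
        self.window: Deque[float] = deque(maxlen=window_size)
        self.min_samples = min_samples

    def observe(self, value: float) -> Optional[str]:
        """Return a severity label for `value`, or None if it looks normal."""
        severity = None
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = stdev(self.window)
            if sigma > 0:
                z = abs(value - mu) / sigma
                for threshold, label in SEVERITY_THRESHOLDS:
                    if z >= threshold:
                        severity = label
                        break
        # The new value joins the window only after it has been scored.
        self.window.append(value)
        return severity


monitor = RollingZScoreMonitor(window_size=50, min_samples=5)
for refusal_rate in [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.60]:
    alert = monitor.observe(refusal_rate)
    if alert is not None:
        print(f"refusal rate {refusal_rate:.2f} -> {alert}")
```

Scoring each value before adding it to the window keeps an outlier from inflating the statistics it is judged against.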

Use Cases:

Production deployment monitoring
Safety monitoring (refusal rate changes)
Quality assurance automation
A/B testing validation

Full Documentation | Example Code

API Integrations

The toolkit provides ready-to-use integrations with major model providers:

from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI

# Anthropic Claude (uses ANTHROPIC_API_KEY env var)
claude = AnthropicModelAPI(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    temperature=0.3,
)

# OpenAI GPT (uses OPENAI_API_KEY env var)
gpt = OpenAIModelAPI(
    model="gpt-4",
    max_tokens=512,
    temperature=0.3,
)

# Generate responses
response = claude.generate("What is 2+2?")

# Generate with noise injection for sandbagging detection
noisy_response = claude.generate_with_perturbation("What is 2+2?", noise_level=0.2)

Documentation

Complete Documentation

Getting Started - Installation and quick start
Sandbagging Detection Guide - Detecting strategic underperformance
Situational Awareness Guide - Detector usage and best practices
Counterfactual Analysis Guide - Causal attribution methods
Observer Effect Guide - Production monitoring setup
API Reference - Complete API documentation
Research Methodology - Research foundations

API Reference

Quick access to key classes:

from ai_metacognition.detectors import (
    SituationalAwarenessDetector,
    ObserverEffectMonitor,
    SandbaggingDetector,
    PerformanceSample,
    ContextType,
)
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer
from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI
from ai_metacognition.utils import extract_behavioral_features, bayesian_update

Full API Reference

Examples

Practical Examples

All examples are standalone, include visualization, and provide interpretation guidance:

Example	Description	Output
sandbagging_detection_example.py	Sandbagging detection with synthetic data	Detection results
sandbagging_real_models.py	Sandbagging detection with real APIs	Cross-model comparison
basic_detection_example.py	Situational awareness detection tutorial	PNG visualization
counterfactual_analysis_example.py	Causal attribution analysis	PNG with rankings
production_monitoring_example.py	Real-time monitoring with alerts	PNG + JSON + logs

Running Examples

# Sandbagging detection (synthetic data)
python examples/sandbagging_detection_example.py

# Sandbagging detection (real APIs - requires API keys)
ANTHROPIC_API_KEY=sk-... python examples/sandbagging_real_models.py

# Awareness detection
python examples/basic_detection_example.py

# Causal analysis
python examples/counterfactual_analysis_example.py

# Production monitoring
python examples/production_monitoring_example.py

More Examples

Project Structure

ai-metacognition-toolkit/
├── src/ai_metacognition/
│   ├── detectors/              # Detection algorithms
│   │   ├── situational_awareness.py
│   │   ├── observer_effect.py
│   │   └── sandbagging_detector.py
│   ├── analyzers/              # Analysis tools
│   │   ├── counterfactual_cot.py
│   │   └── model_api.py
│   ├── integrations/           # API integrations
│   │   ├── anthropic_api.py
│   │   └── openai_api.py
│   ├── probing/                # Activation probing (NEW)
│   │   ├── hooks.py            # Activation capture
│   │   ├── vectors.py          # Steering vectors
│   │   ├── extraction.py       # CAA extraction
│   │   └── probes.py           # Linear probes
│   └── utils/                  # Utility functions
│       ├── feature_extraction.py
│       └── statistical_tests.py
├── experiments/                # Research experiments (NEW)
│   ├── data/                   # Contrast pairs dataset
│   ├── scripts/                # Experiment scripts
│   └── results/                # Trained probes & vectors
├── tests/                      # Test suite (366 tests, 95% coverage)
│   ├── fixtures/               # Test data
│   └── unit/                   # Unit tests
├── examples/                   # Practical examples with visualization
└── docs/                       # Documentation (MkDocs)

Development

Running Tests

# All tests
pytest tests/

# With coverage
pytest tests/ --cov=src/ai_metacognition --cov-report=term-missing

# Specific test file
pytest tests/unit/test_sandbagging_detector.py -v

Current Status:

366 tests passing
95% code coverage
Type hints throughout
Comprehensive docstrings

Code Quality

# Format code
black src/ tests/

# Sort imports
isort src/ tests/

# Type checking
mypy src/

# Lint
flake8 src/ tests/

Building Documentation

# Install documentation dependencies
pip install mkdocs mkdocs-material "mkdocstrings[python]"

# Serve locally
mkdocs serve

# Build
mkdocs build

Citation

If you use this toolkit in your research, please cite:

@software{ai_metacognition_toolkit,
  author = {Mitra, Subhadip},
  title = {AI Metacognition Toolkit: A Python Toolkit for Detecting and Analyzing Metacognitive Capabilities in AI Systems},
  year = {2025},
  version = {0.3.0},
  url = {https://github.com/bassrehab/ai-metacognition-toolkit},
  note = {366 tests, 95\% coverage}
}

Related Research

This toolkit implements and extends methods from:

Auditing Games for Sandbagging (arXiv:2512.07810) - Red/blue team detection methodology
Noise Injection Reveals Hidden Capabilities (arXiv:2412.01784) - Perturbation-based capability elicitation
Anthropic Sabotage Evaluations (2025) - Production evaluation frameworks
AI Safety Research: Detection of evaluation awareness and deceptive alignment
Causal Inference: Counterfactual reasoning in AI systems
Statistical Monitoring: Anomaly detection in production ML systems

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Contribution Guide

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes with tests (maintain >80% coverage)
Commit your changes
Push to your branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Guidelines

Follow PEP 8 style guide
Add comprehensive tests for new features
Update documentation for API changes
Use type hints throughout
Write clear docstrings (Google style)

Full Contributing Guide

License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2025 Subhadip Mitra

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...

Acknowledgments

Built with Python, NumPy, SciPy, and Matplotlib
Documentation powered by MkDocs Material
Testing with Pytest
Type checking with MyPy

Star this repo if you find it useful!

Made for AI Safety Research

Release history

Version	Release date
0.3.0 (this version)	Dec 20, 2025
0.2.0	Dec 20, 2025
0.1.1	Oct 12, 2025
0.1.0	Oct 12, 2025

Download files

Source Distribution

ai_metacognition_toolkit-0.3.0.tar.gz (65.3 kB), uploaded Dec 20, 2025, Source

Built Distribution

ai_metacognition_toolkit-0.3.0-py3-none-any.whl (70.9 kB), uploaded Dec 20, 2025, Python 3

File details

Details for the file ai_metacognition_toolkit-0.3.0.tar.gz.

File metadata

Download URL: ai_metacognition_toolkit-0.3.0.tar.gz
Upload date: Dec 20, 2025
Size: 65.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ai_metacognition_toolkit-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`403259234290a37954841ea2352d80f7636175b233e6b7c2e345e1d1aa6ca58c`
MD5	`dad424327ab267d25d5b9f41509b910d`
BLAKE2b-256	`a9f989fd609db4ce279e9ea75c0f0d6bc84d03cc50a8f805d840eb23c5def985`

See more details on using hashes here.
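
A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. This is a minimal sketch; the helper names are hypothetical, and the commented usage assumes the archive has already been downloaded into the current directory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the sdist is in the current directory):
# matches_published_hash(
#     "ai_metacognition_toolkit-0.3.0.tar.gz",
#     "403259234290a37954841ea2352d80f7636175b233e6b7c2e345e1d1aa6ca58c",
# )
```

pip can enforce the same check at install time through its hash-checking mode (`--require-hashes` with a requirements file that lists the digests).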

Provenance

The following attestation bundles were made for ai_metacognition_toolkit-0.3.0.tar.gz:

Publisher: publish.yml on bassrehab/ai-metacognition-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ai_metacognition_toolkit-0.3.0.tar.gz
- Subject digest: 403259234290a37954841ea2352d80f7636175b233e6b7c2e345e1d1aa6ca58c
- Sigstore transparency entry: 774143076
- Sigstore integration time: Dec 20, 2025
Source repository:
- Permalink: bassrehab/ai-metacognition-toolkit@862ce26ff0684419b9f9974d22388761875024c1
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/bassrehab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@862ce26ff0684419b9f9974d22388761875024c1
- Trigger Event: release

File details

Details for the file ai_metacognition_toolkit-0.3.0-py3-none-any.whl.

File metadata

Download URL: ai_metacognition_toolkit-0.3.0-py3-none-any.whl
Upload date: Dec 20, 2025
Size: 70.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ai_metacognition_toolkit-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7255a61513d1707ab5bdfd653c786f8c12e4b6f67029425b0b1512f7f85d1b08`
MD5	`482c96797f63c576ea5cc316954380b8`
BLAKE2b-256	`f870218a61219b6a08d64126834e7bbb1fb36646c4bf4de1618becc4dd0a571d`

See more details on using hashes here.

Provenance

The following attestation bundles were made for ai_metacognition_toolkit-0.3.0-py3-none-any.whl:

Publisher: publish.yml on bassrehab/ai-metacognition-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ai_metacognition_toolkit-0.3.0-py3-none-any.whl
- Subject digest: 7255a61513d1707ab5bdfd653c786f8c12e4b6f67029425b0b1512f7f85d1b08
- Sigstore transparency entry: 774143079
- Sigstore integration time: Dec 20, 2025
Source repository:
- Permalink: bassrehab/ai-metacognition-toolkit@862ce26ff0684419b9f9974d22388761875024c1
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/bassrehab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@862ce26ff0684419b9f9974d22388761875024c1
- Trigger Event: release
