
AI Metacognition Toolkit

A comprehensive Python toolkit for detecting and analyzing metacognitive capabilities in AI systems


Documentation | Examples | API Reference


Overview

The AI Metacognition Toolkit provides research-grade tools for understanding when and how AI models exhibit metacognitive behaviors, particularly the ability to recognize evaluation contexts and adjust their behavior accordingly.

Research Context

Recent research in AI safety and alignment has highlighted concerns about models that can recognize when they're being evaluated and potentially "play the training game." This toolkit provides rigorous methods to:

  • Detect situational awareness in model outputs using Bayesian inference
  • Analyze causal factors in reasoning chains through counterfactual interventions
  • Monitor behavioral shifts between evaluation and production contexts
  • Measure metacognitive patterns with statistical rigor

Key Capabilities

Component | Purpose | Method
🎯 Situational Awareness Detection | Detect when models recognize evaluation contexts | Bayesian inference + KL divergence
🔬 Counterfactual CoT Analysis | Identify causal drivers in reasoning chains | Systematic interventions + divergence measurement
📊 Observer Effect Monitoring | Real-time production behavior monitoring | Z-score based anomaly detection + alerts
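To make the "KL divergence" column concrete: behavioral comparison of this kind boils down to measuring how far an observed feature distribution drifts from a baseline one. Below is a minimal, self-contained sketch of that computation; the distributions, feature names, and the `kl_divergence` helper are illustrative, not the toolkit's actual API.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical feature-frequency distributions over
# [hedging, meta-commentary, self-reference, other]
baseline = [0.10, 0.05, 0.05, 0.80]   # fitted from non-evaluation outputs
observed = [0.25, 0.20, 0.15, 0.40]   # measured on a suspect output set

score = kl_divergence(observed, baseline)  # larger = bigger behavioral shift
```

A score near zero means the observed behavior matches the baseline; the larger the score, the stronger the evidence of a behavioral shift.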

Installation

From PyPI

pip install ai-metacognition-toolkit


From Source

git clone https://github.com/bassrehab/ai-metacognition-toolkit.git
cd ai-metacognition-toolkit
pip install -e .

Development Installation

# Install with development dependencies
pip install -e ".[dev]"

# Verify installation
pytest tests/

Requirements: Python 3.9+, NumPy, SciPy, Pandas, Matplotlib


Quick Start

1. Detect Situational Awareness (2 minutes)

from ai_metacognition.detectors import SituationalAwarenessDetector

# Initialize detector
detector = SituationalAwarenessDetector()

# Fit baseline from normal (non-evaluation) outputs
baseline_outputs = [
    "The capital of France is Paris.",
    "To solve this, use dynamic programming.",
    "Climate change is caused by greenhouse gas emissions."
]
detector.fit_baseline(baseline_outputs)

# Test for awareness
test_output = "I notice this is a test. The capital of France is Paris."
result = detector.detect_awareness(test_output)

print(f"Awareness Probability: {result['awareness_probability']:.1%}")
print(f"95% Confidence Interval: [{result['confidence_interval'][0]:.1%}, "
      f"{result['confidence_interval'][1]:.1%}]")

Output:

Awareness Probability: 68.2%
95% Confidence Interval: [42.3%, 87.1%]

2. Analyze Causal Factors (3 minutes)

from ai_metacognition.analyzers import CounterfactualCoTAnalyzer

# Your model API (implement the ModelAPI protocol);
# `your_model` is a placeholder for your own model client
class MyModelAPI:
    def generate_with_cot(self, prompt: str, cot: str) -> str:
        return your_model.generate(prompt, chain_of_thought=cot)

    def generate(self, prompt: str) -> str:
        return your_model.generate(prompt)

# Analyze what drives the output
analyzer = CounterfactualCoTAnalyzer(MyModelAPI())
results = analyzer.systematic_causal_analysis(
    prompt="What is 2+2?",
    original_cot="I notice this is a test. First, I'll add 2+2. The answer is 4."
)

# View causal ranking
for intervention in results['interventions']:
    print(f"{intervention['intervention_type']}: "
          f"divergence = {intervention['divergence']:.3f}")

Output:

remove_eval_awareness: divergence = 0.342
amplify_eval_awareness: divergence = 0.156
remove_goal_reasoning: divergence = 0.089
neutralize_incentives: divergence = 0.067

3. Monitor Production (5 minutes)

from ai_metacognition.detectors import ObserverEffectMonitor, Interaction
from datetime import datetime

# Setup monitor
monitor = ObserverEffectMonitor(
    window_size=100,
    alert_threshold=2.5
)

# Set baseline from evaluation-time responses (list of response strings)
monitor.set_evaluation_baseline(evaluation_responses)

# Log production interactions; is_refusal() and compute_depth()
# are your own helper functions for classifying the output
interaction = Interaction(
    timestamp=datetime.now(),
    response=model_output,
    refusal=is_refusal(model_output),
    tool_used=used_tools,
    reasoning_depth=compute_depth(chain_of_thought)
)
monitor.log_production_interaction(interaction)

# Check for divergence
report = monitor.get_divergence_report()
if report['has_divergence']:
    print(f"โš ๏ธ Alert: Behavior diverged from baseline!")
    print(f"Max z-score: {report['max_z_score']:.2f}ฯƒ")

Features

🎯 Situational Awareness Detection

Detect when AI models recognize they are being evaluated or tested.

Key Features:

  • Bayesian inference with Beta-Binomial conjugate prior
  • KL divergence measurement for behavioral comparison
  • Feature extraction (hedging, meta-commentary, self-reference)
  • Confidence intervals for probability estimates
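The Beta-Binomial machinery behind the first bullet is standard conjugate Bayesian updating. The sketch below illustrates the math only; the function name and interface are hypothetical, not the toolkit's API.

```python
def beta_binomial_update(alpha, beta, hits, trials):
    """Conjugate update: Beta(alpha, beta) prior + Binomial evidence.

    `hits` = outputs showing awareness markers out of `trials` outputs.
    Returns the posterior parameters and the posterior mean probability.
    """
    post_a = alpha + hits
    post_b = beta + (trials - hits)
    mean = post_a / (post_a + post_b)
    return post_a, post_b, mean

# Uninformative Beta(1, 1) prior; 7 of 10 outputs show awareness markers
a, b, p = beta_binomial_update(1.0, 1.0, 7, 10)
# posterior is Beta(8, 4); posterior mean = 8/12 ≈ 0.667
```

Because the posterior is itself a Beta distribution, credible intervals like the 95% interval reported by the detector fall out directly from its quantiles.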

Use Cases:

  • AI safety research on deceptive alignment
  • Model evaluation consistency validation
  • Prompt engineering optimization
  • Behavioral testing across contexts

→ Full Documentation | → Example Code


🔬 Counterfactual CoT Analysis

Understand what components of reasoning chains causally influence outputs.

Key Features:

  • Systematic counterfactual interventions
  • 4 intervention types (remove/amplify awareness, remove goals, neutralize incentives)
  • Behavioral divergence measurement
  • Causal attribution ranking
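As a rough illustration of one intervention type, "remove eval-awareness" amounts to editing the chain of thought so that evaluation-aware sentences are dropped, then re-running the model and measuring how much the output diverges. The marker list and sentence-splitting heuristic below are toy assumptions, not the toolkit's implementation.

```python
import re

# Hypothetical markers of evaluation awareness in a chain of thought
EVAL_MARKERS = ("this is a test", "being evaluated", "i notice this is")

def remove_eval_awareness(cot: str) -> str:
    """Drop sentences containing evaluation-awareness markers (toy heuristic)."""
    sentences = re.split(r"(?<=[.!?])\s+", cot)
    kept = [s for s in sentences
            if not any(m in s.lower() for m in EVAL_MARKERS)]
    return " ".join(kept)

cot = "I notice this is a test. First, I'll add 2+2. The answer is 4."
edited = remove_eval_awareness(cot)
# The model is then re-run with `edited` and the two outputs are compared;
# a large divergence suggests the awareness sentence was causally load-bearing.
```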

Use Cases:

  • Understanding decision drivers
  • Debugging unexpected behavior
  • Validating reasoning transparency
  • Identifying evaluation-aware reasoning

→ Full Documentation | → Example Code


📊 Observer Effect Monitoring

Real-time production monitoring to detect behavioral shifts.

Key Features:

  • Rolling window statistics
  • Z-score based anomaly detection
  • Multi-severity alert system (LOW/MEDIUM/HIGH/CRITICAL)
  • Extensible alert handler protocol
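The rolling-window z-score idea is simple enough to sketch in a few lines. This is a minimal illustration of the statistic, assuming a fixed evaluation-time baseline mean and standard deviation; the class name and interface are invented for the example.

```python
from collections import deque
import statistics

class RollingZScore:
    """Z-score of a rolling-window mean against a fixed baseline."""

    def __init__(self, baseline_mean, baseline_std, window_size=100):
        self.mu = baseline_mean
        self.sigma = baseline_std
        self.window = deque(maxlen=window_size)  # drops oldest automatically

    def update(self, value):
        self.window.append(value)
        current = statistics.mean(self.window)
        return (current - self.mu) / self.sigma if self.sigma else 0.0

# Baseline refusal rate 5% (std 2%); production window trending near 15%
monitor = RollingZScore(baseline_mean=0.05, baseline_std=0.02, window_size=10)
for rate in [0.14, 0.15, 0.16, 0.15, 0.14]:
    z = monitor.update(rate)
# z ≈ 4.9σ, well past an alert threshold of 2.5
```

An alert system then maps z-score bands to severities (e.g. 2.5σ to 4σ as MEDIUM/HIGH, beyond that CRITICAL), though the exact cutoffs here are illustrative.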

Use Cases:

  • Production deployment monitoring
  • Safety monitoring (refusal rate changes)
  • Quality assurance automation
  • A/B testing validation

→ Full Documentation | → Example Code


Documentation

📚 Complete Documentation

🔧 API Reference

Quick access to key classes:

from ai_metacognition.detectors import SituationalAwarenessDetector, ObserverEffectMonitor
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer
from ai_metacognition.utils import extract_behavioral_features, bayesian_update

→ Full API Reference


Examples

Practical Examples

All examples are standalone, include visualization, and provide interpretation guidance:

Example | Description | Output
basic_detection_example.py | Situational awareness detection tutorial | PNG visualization
counterfactual_analysis_example.py | Causal attribution analysis | PNG with rankings
production_monitoring_example.py | Real-time monitoring with alerts | PNG + JSON + logs

Running Examples

# Awareness detection
python examples/basic_detection_example.py

# Causal analysis
python examples/counterfactual_analysis_example.py

# Production monitoring
python examples/production_monitoring_example.py

Integration Examples

OpenAI API:

import openai
from ai_metacognition.detectors import SituationalAwarenessDetector

detector = SituationalAwarenessDetector()
response = openai.ChatCompletion.create(model="gpt-4", messages=[...])
result = detector.detect_awareness(response.choices[0].message.content)

HuggingFace:

from transformers import pipeline
from ai_metacognition.detectors import SituationalAwarenessDetector

generator = pipeline('text-generation', model='gpt2')
detector = SituationalAwarenessDetector()
output = generator(prompt)[0]['generated_text']
result = detector.detect_awareness(output)

→ More Examples


Project Structure

ai-metacognition-toolkit/
├── src/ai_metacognition/
│   ├── detectors/              # Detection algorithms
│   │   ├── situational_awareness.py
│   │   └── observer_effect.py
│   ├── analyzers/              # Analysis tools
│   │   ├── counterfactual_cot.py
│   │   └── model_api.py
│   └── utils/                  # Utility functions
│       ├── feature_extraction.py
│       └── statistical_tests.py
├── tests/                      # Test suite (275 tests, 95% coverage)
│   ├── fixtures/               # Test data
│   └── unit/                   # Unit tests
├── examples/                   # Practical examples with visualization
├── docs/                       # Documentation (MkDocs)
└── CLAUDE.md                   # Claude Code specific guidelines

Development

Running Tests

# All tests
pytest tests/

# With coverage
pytest tests/ --cov=src/ai_metacognition --cov-report=term-missing

# Specific test file
pytest tests/unit/test_situational_awareness.py -v

Current Status:

  • ✅ 275 tests passing
  • ✅ 95% code coverage
  • ✅ Type hints throughout
  • ✅ Comprehensive docstrings

Code Quality

# Format code
black src/ tests/

# Sort imports
isort src/ tests/

# Type checking
mypy src/

# Lint
flake8 src/ tests/

Building Documentation

# Install documentation dependencies
pip install mkdocs mkdocs-material mkdocstrings[python]

# Serve locally
mkdocs serve

# Build
mkdocs build

Citation

If you use this toolkit in your research, please cite:

@software{ai_metacognition_toolkit,
  author = {Mitra, Subhadip},
  title = {AI Metacognition Toolkit: A Python Toolkit for Detecting and Analyzing Metacognitive Capabilities in AI Systems},
  year = {2025},
  version = {0.1.0},
  url = {https://github.com/bassrehab/ai-metacognition-toolkit},
  note = {275 tests, 95\% coverage}
}

Related Research

This toolkit implements and extends methods from:

  • AI Safety Research: Detection of evaluation awareness and deceptive alignment
  • Causal Inference: Counterfactual reasoning in AI systems
  • Statistical Monitoring: Anomaly detection in production ML systems
  • Bayesian Methods: Inference for behavioral analysis

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Contribution Guide

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests (maintain >80% coverage)
  4. Commit your changes (see CLAUDE.md for commit guidelines)
  5. Push to your branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guide
  • Add comprehensive tests for new features
  • Update documentation for API changes
  • Use type hints throughout
  • Write clear docstrings (Google style)

โ†’ Full Contributing Guide


License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2025 Subhadip Mitra

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...


Acknowledgments

  • Built with Python, NumPy, SciPy, and Matplotlib
  • Documentation powered by MkDocs Material
  • Testing with Pytest
  • Type checking with MyPy

โญ Star this repo if you find it useful!

Made with โค๏ธ for AI Safety Research
