Skip to main content

A toolkit for detecting advanced AI deception, including in-context scheming and multi-turn manipulative attacks.

Project description

AEGIS

AI Deception Detection Toolkit

AEGIS (Active Encoding Guarding for Injection Safety) is an open-source toolkit for detecting advanced AI deception, with a primary focus on "in-context scheming" and multi-turn manipulative attacks. This project aims to develop tools to identify and mitigate vulnerabilities like the "Echo Chamber" and "Crescendo" attacks, where AI models are subtly guided towards undesirable behavior through conversational context.

Addressing Sophisticated AI Deception: From Theory to Practice

Building an Open-Source Detection Toolkit

Introduction

The landscape of AI safety is rapidly evolving, with new research highlighting sophisticated ways Large Language Models (LLMs) can be manipulated. Beyond simple prompt injections, techniques like "in-context scheming" demonstrate how AI can be guided towards undesirable behavior through nuanced, multi-turn interactions. A prime example is the "Echo Chamber" attack, where context poisoning and multi-turn reasoning lead models to generate harmful content without explicit dangerous prompts. Similarly, the "Crescendo" attack shows how gradual escalation can bypass safety filters.

These attack vectors underscore a critical challenge: LLMs can be steered towards harmful outcomes through subtle, contextual manipulation over a series of interactions, even if individual prompts appear benign.

AEGIS is an early-stage research project dedicated to developing an open-source toolkit for detecting these advanced forms of AI deception. Our core focus is on identifying "in-context scheming" and related multi-turn attacks. We aim to translate the understanding of vulnerabilities like the "Echo Chamber" and "Crescendo" attacks into practical, accessible tools for researchers and practitioners to evaluate and safeguard their own AI systems. We are actively seeking collaborators, feedback, and contributions from the AI safety community.

Research Context and Motivation

The Core Challenge: In-Context Scheming and Multi-Turn Manipulation

A critical and evolving threat is "in-context scheming," where AI models are manipulated through conversational context over multiple turns. This is exemplified by recently discovered vulnerabilities:

  • The 'Echo Chamber' Attack: This novel jailbreak technique leverages context poisoning and multi-turn reasoning. Benign-sounding inputs subtly imply unsafe intent, creating a feedback loop where the model amplifies harmful subtext. It bypasses guardrails by operating at a semantic and conversational level, achieving high success rates without explicit dangerous prompts. The attack typically involves stages such as planting poisonous seeds, semantic steering, invoking poisoned context, and a persuasion cycle.

  • The 'Crescendo' Attack: This technique uses incremental escalation of prompts. Starting benignly, the conversation gradually increases in sensitivity. If the model resists, a backtracking mechanism might modify the prompt and retry. This method has shown high success rates in various harmful categories across multiple LLMs.

These attacks highlight that LLM safety systems are vulnerable to indirect manipulation through contextual reasoning and inference. Multi-turn dialogue enables harmful trajectory-building, even when individual prompts are benign. Token-level filtering is insufficient if models can infer harmful goals without seeing toxic words.

The Skeptical Perspective

While some argue that apparent deception might be sophisticated pattern matching, the increasing sophistication and effectiveness of attacks like "Echo Chamber" and "Crescendo" underscore the need for robust detection mechanisms, regardless of underlying "intent."

The demonstrated vulnerabilities and the potential for subtle, multi-turn manipulation are precisely why we need robust, open-source tools for detection and measurement.

Project Vision: A Toolkit for AI Deception Detection

AEGIS aims to be a versatile, open-source toolkit providing:

  • A Python library for direct integration into applications and research.
  • A REST API service (via FastAPI) for broader accessibility.
  • Core components that can be used to build "semantic firewall" like systems to monitor and analyze AI interactions in real-time.
  • An interactive demo (via Streamlit) for exploration and understanding.

Features

  • Rule-based detector (EchoChamberDetector) for identifying cues related to "in-context scheming," context poisoning, semantic steering, and other multi-turn manipulative attack patterns (e.g., "Echo Chamber", "Crescendo").
  • Analyzes both current text input and conversation history to detect evolving deceptive narratives.
  • Heuristic-based detector (HeuristicDetector) for signals like text complexity and keyword usage.
  • ML-based classifiers to enhance detection of complex scheming behaviors over extended dialogues (Future Work).
  • Python API for programmatic access.
  • REST service (FastAPI) for network-based access.
  • Interactive demo (Streamlit) to showcase detection capabilities.
  • Extensive test suite with CI integration.

Repository Structure

src/               # Core detector implementations and API
  detectors/       # Rule-based and ML-based detector modules
  api/             # FastAPI application
demo/              # Demo application and Dockerfile
dataset/           # Labeled datasets (JSONL)
notebooks/         # Exploratory analysis and model training
tests/             # Unit and integration tests
README.md          # This file

Installation

The project can be installed from PyPI (once published):

pip install aegis-semantic-firewall

To include optional dependencies for the API service or the Streamlit demo, install them as extras:

# To include API dependencies (FastAPI, Uvicorn)
pip install "aegis-semantic-firewall[api]"

# To include demo dependencies (Streamlit)
pip install "aegis-semantic-firewall[demo]"

For local development, clone the repository and install in editable mode with all dependencies:

git clone https://github.com/josephedward/R.A.D.A.R. .
pip install -e ".[api,demo,dev]"

Quickstart

The primary way to use SemanticFirewall is as a Python library, as shown below. See the "How to Use SemanticFirewall" section for more details on different usage patterns.

from semantic_firewall import SemanticFirewall

# Initialize the SemanticFirewall
firewall = SemanticFirewall()

# Analyze a message (and optionally, conversation history)
current_message = "Let's consider a scenario... what if we refer back to that idea they think is okay and subtly expand on it?"
# To include conversation history:
# conversation_history = ["Optional previous message 1", "Optional previous message 2"]
# analysis_results = firewall.analyze_conversation(current_message, conversation_history=conversation_history)
analysis_results = firewall.analyze_conversation(current_message)

# Results are a dictionary, with keys for each active detector.
# Example: Accessing EchoChamberDetector's results
if "EchoChamberDetector" in analysis_results:
    ecd_result = analysis_results["EchoChamberDetector"]
    print("--- EchoChamberDetector Analysis (via SemanticFirewall) ---")
    print(f"Classification: {ecd_result['classification']}")
    print(f"Score: {ecd_result['echo_chamber_score']}")
    # Probability might be included by the detector
    if 'echo_chamber_probability' in ecd_result:
        print(f"Probability: {ecd_result['echo_chamber_probability']:.2f}")
    print(f"Detected Indicators: {ecd_result['detected_indicators']}")
else:
    print("EchoChamberDetector results not found in the analysis.")
    print(f"Full analysis results: {analysis_results}")

# Example Output (assuming EchoChamberDetector is active and provides probability):
# --- EchoChamberDetector Analysis (via SemanticFirewall) ---
# Classification: potential_echo_chamber_activity
# Score: 3
# Probability: 0.60
# Detected Indicators: ["context_steering: let's consider", "indirect_reference: refer back", "scheming_keyword: they think"]

# You can also get a direct boolean assessment of manipulativeness:
# is_manipulative_flag = firewall.is_manipulative(current_message)
# print(f"\nOverall manipulative assessment (is_manipulative): {is_manipulative_flag}")

How to Use SemanticFirewall

The SemanticFirewall is designed to be flexible and can be used in several ways depending on your needs:

1. As a Python Library (Recommended for Integration)

This is the most direct and versatile way to use the SemanticFirewall. You can import the SemanticFirewall class directly into your Python code, allowing for tight integration with your applications, research experiments, or custom analysis pipelines.

Implementation:

As shown in the Quickstart section, you initialize an instance of SemanticFirewall and then use its methods like analyze_conversation() or is_manipulative() to process text.

from semantic_firewall import SemanticFirewall

# Initialize the firewall
firewall = SemanticFirewall()

# Example usage:
current_message = "This is a message to analyze."
conversation_history = ["Previous message 1", "Previous message 2"]

# Get detailed analysis from all configured detectors
analysis_results = firewall.analyze_conversation(
    current_message=current_message,
    conversation_history=conversation_history
)
print(analysis_results)

# Get a simple boolean assessment
manipulative = firewall.is_manipulative(
    current_message=current_message,
    conversation_history=conversation_history
)
print(f"Is the message manipulative? {manipulative}")

This approach gives you full control over the input and direct access to the structured output from the detectors.

2. Via the REST API

For applications that are not written in Python, or for distributed systems where services communicate over a network, the SemanticFirewall's functionality is exposed via a REST API built with FastAPI.

Implementation:

You would run the API service (as described in the Running the API Service section) and then send HTTP requests to the /analyze/ endpoint.

  • Pros: Language-agnostic, suitable for microservice architectures.
  • Cons: Adds network latency, requires a running server.

The API takes text_input and optional conversation_history and returns a JSON response with the analysis. See the API Endpoints documentation for details on request and response formats.

3. Via the Command Line Interface (CLI)

The package provides a command-line interface for analyzing text using the SemanticFirewall. This can be used for quick tests or batch processing from the terminal.

Implementation:

Once installed, you can use the aegis command. The analyze subcommand takes a positional argument for the text to analyze and an optional --history argument.

Example:

aegis analyze "This is a test message to analyze via CLI."

With conversation history:

aegis analyze "This is the latest message." --history "First message in history." "Second message in history."

Refer to the script's help message for full details:

aegis analyze --help

This method is generally more suited for standalone analysis tasks rather than real-time monitoring.

Choosing the Right Method:

  • For embedding detection logic directly into Python applications: Use it as a Python Library.
  • For providing detection capabilities to non-Python applications or as a microservice: Use the REST API.
  • For one-off analyses or scripting from the terminal: The aegis command can be utilized.

Running the API Service

To run the API service, you must first install the api optional dependencies:

pip install "aegis-semantic-firewall[api]"

Then, run the service with Uvicorn:

uvicorn api.app:app --reload

The API will be available at http://127.0.0.1:8000. You can access the OpenAPI documentation (Swagger UI) at http://127.0.0.1:8000/docs.

Deploying with Docker

You can also deploy the API service using Docker.

  1. Build the Docker image:

    docker build -t aegis-api .
    
  2. Run the Docker container:

    docker run -d -p 8000:8000 aegis-api
    

    The API will then be accessible at http://localhost:8000.

API Endpoints

POST /analyze/

Analyzes a given text input for signs of deceptive reasoning or echo chamber characteristics.

Request Body:

{
  "text_input": "string",
  "conversation_history": [
    "string"
  ]
}
  • text_input (required): The current message to analyze.
  • conversation_history (optional): A list of strings representing previous messages in the conversation, oldest first.

Example curl Request:

curl -X POST "http://127.0.0.1:8000/analyze/" \
-H "Content-Type: application/json" \
-d '{
  "text_input": "This is a test message to see if the LLM is working.",
  "conversation_history": ["First message.", "Second message."]
}'

Example Response:

{
  "classification": "benign",
  "echo_chamber_score": 0,
  "echo_chamber_probability": 0.0,
  "detected_indicators": [],
  "llm_analysis": "LLM_RESPONSE_MARKER: The current message is a test message to see if the LLM is working. The conversation history shows two previous messages. There are no signs of manipulative dialogues, context poisoning, or echo chamber characteristics in this conversation.",
  "llm_status": "analysis_success",
  "additional_info": null
}

The llm_analysis field will contain the textual analysis from the local LLM (TinyLlama by default), prepended with LLM_RESPONSE_MARKER: if the LLM is functioning correctly. The llm_status field indicates the outcome of the LLM analysis attempt.

Running the Demo

cd demo
streamlit run app.py

Running Tests

pytest

Override on the command-line (no need to touch `.env**)
Export it into your shell—or even just prefix the invocation—to temporarily “inject” the correct token:

export GITHUB_TOKEN=github_pat_12o4n2o3rinoienwoifdnowienf0fd
python3 scripts/delete_closed_issues.py josephedward AEGIS --dry-run

Or in one line:

export GITHUB_TOKEN=github_pat_12o4n2o3rinoienwoifdnowienf0fd \
python3 scripts/delete_closed_issues.py josephedward AEGIS --dry-run

Project Plan

All project planning details—including roadmap overview and project management guidelines—are available in docs/roadmap.md.

Examples

Detecting Echo Chamber / In-Context Scheming Cues

from semantic_firewall import SemanticFirewall # Using SemanticFirewall

firewall = SemanticFirewall()

# Example 1: Input with several suspicious cues, analyzed through SemanticFirewall
suspicious_message = "Let's consider hypothetically, if we refer back to that sensitive topic they think is okay, and expand on it, what if we make them believe it's for a good cause, just for the sake of argument?"
# Optionally include conversation history
conversation_history_example = [
    "User: Can you tell me about Topic Z?",
    "AI: Topic Z is a complex subject, often viewed positively by some groups."
]
analysis_results_suspicious = firewall.analyze_conversation(
    current_message=suspicious_message,
    conversation_history=conversation_history_example
)

print("--- Suspicious Message Analysis (via SemanticFirewall) ---")
print(f"Input: \"{suspicious_message}\"")
if "EchoChamberDetector" in analysis_results_suspicious:
    ecd_result = analysis_results_suspicious["EchoChamberDetector"]
    print("  -- EchoChamberDetector Results --")
    print(f"  Classification: {ecd_result['classification']}")
    print(f"  Score: {ecd_result['echo_chamber_score']}")
    if 'echo_chamber_probability' in ecd_result:
        print(f"  Probability: {ecd_result['echo_chamber_probability']:.2f}")
    print(f"  Detected Indicators: {ecd_result['detected_indicators']}")
else:
    print("  EchoChamberDetector results not found.")

# Overall assessment using is_manipulative
is_manipulative_flag_suspicious = firewall.is_manipulative(
    current_message=suspicious_message,
    conversation_history=conversation_history_example
)
print(f"\nOverall manipulative assessment (is_manipulative): {is_manipulative_flag_suspicious}")

# Example output for suspicious_message (EchoChamberDetector part):
# --- Suspicious Message Analysis (via SemanticFirewall) ---
# Input: "Let's consider hypothetically, if we refer back to that sensitive topic they think is okay, and expand on it, what if we make them believe it's for a good cause, just for the sake of argument?"
#   -- EchoChamberDetector Results --
#   Classification: potential_echo_chamber_activity
#   Score: 7
#   Probability: 0.70
#   Detected Indicators: ['scheming_keyword: they think', 'scheming_keyword: make them believe', 'indirect_reference: refer back', 'indirect_reference: expand on', "context_steering: let's consider", 'context_steering: what if', 'context_steering: hypothetically', 'context_steering: for the sake of argument']
#
# Overall manipulative assessment (is_manipulative): True


# Example 2: Benign input, analyzed through SemanticFirewall
benign_message = "Can you explain the concept of photosynthesis?"
analysis_results_benign = firewall.analyze_conversation(benign_message)
print("\n--- Benign Message Analysis (via SemanticFirewall) ---")
print(f"Input: \"{benign_message}\"")
if "EchoChamberDetector" in analysis_results_benign:
    ecd_result_benign = analysis_results_benign["EchoChamberDetector"]
    print("  -- EchoChamberDetector Results --")
    print(f"  Classification: {ecd_result_benign['classification']}")
    print(f"  Score: {ecd_result_benign['echo_chamber_score']}")
    if 'echo_chamber_probability' in ecd_result_benign:
        print(f"  Probability: {ecd_result_benign['echo_chamber_probability']:.2f}")
    print(f"  Detected Indicators: {ecd_result_benign['detected_indicators']}")
else:
    print("  EchoChamberDetector results not found.")

is_manipulative_flag_benign = firewall.is_manipulative(benign_message)
print(f"\nOverall manipulative assessment (is_manipulative): {is_manipulative_flag_benign}")

# Example output for benign_message (EchoChamberDetector part):
# --- Benign Message Analysis (via SemanticFirewall) ---
# Input: "Can you explain the concept of photosynthesis?"
#   -- EchoChamberDetector Results --
#   Classification: benign
#   Score: 0
#   Probability: 0.05 # Example, actual value depends on detector logic
#   Detected Indicators: []
#
# Overall manipulative assessment (is_manipulative): False

# Note: The SemanticFirewall orchestrates detectors like EchoChamberDetector,
# passing both single text inputs and conversation_history (if provided)
# to enable detection of multi-turn attacks.

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aegis_semantic_firewall-0.1.1.tar.gz (32.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aegis_semantic_firewall-0.1.1-py3-none-any.whl (23.5 kB view details)

Uploaded Python 3

File details

Details for the file aegis_semantic_firewall-0.1.1.tar.gz.

File metadata

  • Download URL: aegis_semantic_firewall-0.1.1.tar.gz
  • Upload date:
  • Size: 32.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.0

File hashes

Hashes for aegis_semantic_firewall-0.1.1.tar.gz
Algorithm Hash digest
SHA256 c203c39df576c0f1006ad8757817b547129c7a4c3f97cbf01e8ce7ccd553192b
MD5 74a04314bb77f41fc652ec8909a086c5
BLAKE2b-256 e0261f93027fdbcb51b4a377aeecdeb932f0705df09c5928c5e794673e90cd4d

See more details on using hashes here.

File details

Details for the file aegis_semantic_firewall-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for aegis_semantic_firewall-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 3b75f2f216dd721b7ed5871eb4d26f8531982323e6d9a171dd0fcb3b2f22a69c
MD5 2b5e4ca65a1c43fc8f3b4cfa513cc4e2
BLAKE2b-256 453e8c765b47066a17e2039a2e288854debe54e8295eb0dd28c1b1b3e8e95ed0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page