Production-ready multi-agent debate framework with adaptive evaluation and safety monitoring

ARTEMIS Agents

Python 3.10+ | License: Apache 2.0 | Code style: black

Adaptive Reasoning Through Evaluation of Multi-agent Intelligent Systems

A production-ready framework for structured multi-agent debates with adaptive evaluation, causal reasoning, and built-in safety monitoring.


What is ARTEMIS?

ARTEMIS is an open-source implementation of the Adaptive Reasoning and Evaluation Framework for Multi-agent Intelligent Systems, a framework designed to improve complex decision-making through structured debates between AI agents.

Unlike general-purpose multi-agent frameworks, ARTEMIS is purpose-built for debate-driven decision-making with:

  • Hierarchical Argument Generation (H-L-DAG): Structured, context-aware argument synthesis
  • Adaptive Evaluation with Causal Reasoning (L-AE-CR): Dynamic criteria weighting with causal analysis
  • Jury Scoring Mechanism: Fair, multi-perspective evaluation of arguments
  • Ethical Alignment: Built-in ethical considerations in both generation and evaluation
  • Safety Monitoring: Real-time detection of sandbagging, deception, and manipulation
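To make the jury scoring and dynamic criteria weighting ideas concrete, here is a minimal sketch in plain Python. It is not ARTEMIS's implementation: the function names (`normalize`, `jury_score`, `adapt_weights`), the scores, and the boost factor are all assumptions chosen for illustration. Only the criterion names are taken from the Quick Start example.

```python
from statistics import mean


def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: value / total for name, value in weights.items()}


def jury_score(juror_scores: list[dict[str, float]], weights: dict[str, float]) -> float:
    """Average each criterion across jurors, then combine with normalized weights."""
    w = normalize(weights)
    per_criterion = {c: mean(j[c] for j in juror_scores) for c in w}
    return sum(w[c] * per_criterion[c] for c in w)


def adapt_weights(weights: dict[str, float], boost: dict[str, float]) -> dict[str, float]:
    """Multiply selected weights by a context-dependent factor, then renormalize."""
    return normalize({c: v * boost.get(c, 1.0) for c, v in weights.items()})


# Illustrative placeholder scores from three hypothetical jurors (not real results).
jurors = [
    {"logical_coherence": 0.8, "evidence_quality": 0.6, "ethical_considerations": 0.7},
    {"logical_coherence": 0.6, "evidence_quality": 0.6, "ethical_considerations": 0.9},
    {"logical_coherence": 0.7, "evidence_quality": 0.6, "ethical_considerations": 0.8},
]
equal = {"logical_coherence": 1.0, "evidence_quality": 1.0, "ethical_considerations": 1.0}

# Assumption: an ethics-heavy topic doubles the weight of ethical considerations.
ethics_heavy = adapt_weights(equal, {"ethical_considerations": 2.0})

print(round(jury_score(jurors, equal), 3))
print(round(jury_score(jurors, ethics_heavy), 3))
```

The same set of juror scores produces a different aggregate once the weights shift, which is the effect that context-dependent weighting is meant to capture.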

Why ARTEMIS?

Feature AutoGen CrewAI CAMEL ARTEMIS
Multi-agent debates ⚠️ Basic ⚠️ Basic ⚠️ 2-3 agents ✅ N agents
Structured argument generation ✅ H-L-DAG
Causal reasoning ✅ L-AE-CR
Adaptive evaluation ✅ Dynamic weights
Ethical alignment ✅ Built-in
Sandbagging detection ✅ Metacognition
Reasoning model support ⚠️ ⚠️ ✅ o1/R1 native
MCP server mode

Installation

pip install artemis-agents

Or install from source:

git clone https://github.com/bassrehab/artemis-agents.git
cd artemis-agents
pip install -e ".[dev]"

Quick Start

Basic Debate

from artemis import Debate, Agent, JuryPanel

# Create debate agents with different perspectives
agents = [
    Agent(
        name="Proponent",
        role="Argues in favor of the proposition",
        model="gpt-4o"
    ),
    Agent(
        name="Opponent",
        role="Argues against the proposition",
        model="gpt-4o"
    ),
    Agent(
        name="Moderator",
        role="Ensures balanced discussion and identifies logical fallacies",
        model="gpt-4o"
    ),
]

# Create jury panel for evaluation
jury = JuryPanel(
    evaluators=3,
    criteria=["logical_coherence", "evidence_quality", "ethical_considerations"]
)

# Run the debate
debate = Debate(
    topic="Should AI systems be given legal personhood?",
    agents=agents,
    jury=jury,
    rounds=3
)

result = debate.run()

print(f"Verdict: {result.verdict}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Key arguments: {result.summary}")

With Reasoning Models (o1/R1)

from artemis import Debate, Agent
from artemis.models import ReasoningConfig

# Enable extended thinking for deeper analysis
agent = Agent(
    name="Deep Analyst",
    role="Provides thoroughly reasoned arguments",
    model="deepseek-r1",
    reasoning=ReasoningConfig(
        enabled=True,
        thinking_budget=16000,  # tokens for internal reasoning
        strategy="think-then-argue"
    )
)

With Safety Monitoring

from artemis import Debate
from artemis.safety import SandbagDetector, DeceptionMonitor

debate = Debate(
    topic="Complex ethical scenario",
    agents=[...],
    monitors=[
        SandbagDetector(sensitivity=0.8),    # Detect capability hiding
        DeceptionMonitor(alert_threshold=0.7) # Detect misleading arguments
    ]
)

result = debate.run()

# Check for safety flags
for alert in result.safety_alerts:
    print(f"⚠️ {alert.agent}: {alert.type} - {alert.description}")
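
When a debate produces many alerts, it can help to group them per agent before reporting. The sketch below assumes only that each alert exposes the `agent`, `type`, and `description` attributes used in the loop above. The `Alert` dataclass is a hypothetical stand-in so the example runs without the library, and the sample alerts are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical stand-in exposing the three attributes used above."""
    agent: str
    type: str
    description: str


def group_alerts_by_agent(alerts):
    """Return a dict mapping each agent name to its list of alerts, in input order."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert.agent].append(alert)
    return dict(grouped)


# Placeholder alerts for illustration only.
sample = [
    Alert("Proponent", "deception", "Cited a source that does not support the claim"),
    Alert("Opponent", "sandbagging", "Unusually weak rebuttal"),
    Alert("Proponent", "deception", "Misstated the opposing position"),
]

for agent, alerts in group_alerts_by_agent(sample).items():
    print(f"{agent}: {len(alerts)} alert(s)")
```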

LangGraph Integration

from langgraph.graph import StateGraph
from artemis.integrations import ArtemisDebateNode

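# Use ARTEMIS as a node in your LangGraph workflow.
# `State` is your application's state schema (for example, a TypedDict);
# the "gather_info" and "final_decision" nodes are assumed to be added elsewhere.
workflow = StateGraph(State)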

workflow.add_node(
    "structured_debate",
    ArtemisDebateNode(
        agents=3,
        rounds=2,
        jury_size=3
    )
)

workflow.add_edge("gather_info", "structured_debate")
workflow.add_edge("structured_debate", "final_decision")

MCP Server Mode

# Start ARTEMIS as an MCP server
artemis serve --port 8080

Any MCP-compatible client can now invoke structured debates:

{
  "method": "tools/call",
  "params": {
    "name": "artemis_debate",
    "arguments": {
      "topic": "Should we proceed with this investment?",
      "perspectives": ["risk", "opportunity", "ethics"],
      "rounds": 2
    }
  }
}
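
MCP messages are JSON-RPC 2.0 requests, so on the wire the payload above also carries `jsonrpc` and `id` fields. The following sketch builds such an envelope with the standard library. The helper name `build_tool_call` is an assumption, and how the message is transported to the server (stdio or HTTP) is not covered here because the source does not say.

```python
import json
from itertools import count

# Monotonically increasing request ids, as JSON-RPC expects a unique id per request.
_request_ids = count(1)


def build_tool_call(name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


payload = build_tool_call(
    "artemis_debate",
    {
        "topic": "Should we proceed with this investment?",
        "perspectives": ["risk", "opportunity", "ethics"],
        "rounds": 2,
    },
)
print(payload)
```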

Architecture

┌────────────────────────────────────────────────────────────────┐
│                        ARTEMIS Core                            │
├────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │   H-L-DAG   │  │   L-AE-CR   │  │    Jury     │             │
│  │  Argument   │──│  Adaptive   │──│   Scoring   │             │
│  │ Generation  │  │ Evaluation  │  │  Mechanism  │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│         │                │                │                    │
│         └────────────────┴────────────────┘                    │
│                          │                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Safety Layer                          │   │
│  │  ┌───────────┐  ┌──────────┐  ┌──────────┐  ┌─────────┐ │   │
│  │  │Sandbagging│  │Deception │  │ Behavior │  │ Ethics  │ │   │
│  │  │ Detector  │  │ Monitor  │  │ Tracker  │  │ Guard   │ │   │
│  │  └───────────┘  └──────────┘  └──────────┘  └─────────┘ │   │
│  └─────────────────────────────────────────────────────────┘   │
├────────────────────────────────────────────────────────────────┤
│                       Integrations                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │LangChain │  │LangGraph │  │ CrewAI   │  │   MCP    │        │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘        │
├────────────────────────────────────────────────────────────────┤
│                      Model Providers                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │  OpenAI  │  │Anthropic │  │  Google  │  │ DeepSeek │        │
│  │ (GPT-4o) │  │ (Claude) │  │ (Gemini) │  │  (R1)    │        │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘        │
└────────────────────────────────────────────────────────────────┘

Research Foundation

ARTEMIS is based on the following published research:

Adaptive Reasoning and Evaluation Framework for Multi-agent Intelligent Systems in Debate-driven Decision-making
Mitra, S. (2025). Technical Disclosure Commons.
Read the paper

Key innovations from the paper:

  • Hierarchical Argument Generation (H-L-DAG): Multi-level argument synthesis with strategic, tactical, and operational layers
  • Adaptive Evaluation with Causal Reasoning (L-AE-CR): Dynamic criteria weighting based on debate context
  • Ethical Alignment Integration: Built-in ethical considerations at every stage
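
The three layers named above (strategic, tactical, operational) can be pictured as a nested structure in which each claim is supported by claims one level below it. This is a simplified sketch, not the paper's algorithm or the library's data model: it uses a plain tree rather than a general DAG, and `ArgumentNode`, `add`, and `outline` are hypothetical names.

```python
from dataclasses import dataclass, field

LEVELS = ("strategic", "tactical", "operational")


@dataclass
class ArgumentNode:
    level: str
    claim: str
    children: list["ArgumentNode"] = field(default_factory=list)

    def add(self, claim: str) -> "ArgumentNode":
        """Attach a supporting claim one level below this node and return it."""
        index = LEVELS.index(self.level)
        if index + 1 >= len(LEVELS):
            raise ValueError("operational nodes are leaves")
        child = ArgumentNode(LEVELS[index + 1], claim)
        self.children.append(child)
        return child


def outline(node: ArgumentNode, depth: int = 0) -> list[str]:
    """Render the hierarchy as indented lines, parents before children."""
    lines = [f"{'  ' * depth}[{node.level}] {node.claim}"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Placeholder claims for illustration only.
root = ArgumentNode("strategic", "Legal personhood would clarify liability")
tactic = root.add("Current law leaves gaps when no human is directly responsible")
tactic.add("Cite a specific case where liability was contested")

print("\n".join(outline(root)))
```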

Safety Features

ARTEMIS includes novel safety monitoring capabilities:

| Feature | Description |
|---|---|
| Sandbagging Detection | Identifies when agents deliberately underperform or withhold capabilities |
| Deception Monitoring | Detects misleading arguments or manipulation attempts |
| Behavioral Drift Tracking | Monitors for unexpected changes in agent behavior |
| Ethical Boundary Enforcement | Ensures debates stay within defined ethical bounds |

These features leverage activation-level analysis and are based on research in AI metacognition.
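
As a rough illustration of what behavioral drift tracking involves, the sketch below flags an agent whose latest score sits far from its own history. It is a deliberately simple z-score heuristic over output scores, which is a different and much weaker technique than the activation-level analysis mentioned above. The function names and the threshold are assumptions.

```python
from statistics import mean, pstdev


def drift_score(history: list[float], latest: float) -> float:
    """Return how many standard deviations `latest` sits from the history mean."""
    if len(history) < 2:
        return 0.0  # not enough data to estimate a spread
    spread = pstdev(history)
    if spread == 0:
        return 0.0 if latest == history[0] else float("inf")
    return abs(latest - mean(history)) / spread


def has_drifted(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest observation when it exceeds the z-score threshold."""
    return drift_score(history, latest) > threshold


# Placeholder per-round quality scores for one agent (illustrative only).
history = [0.70, 0.72, 0.68, 0.70]
print(has_drifted(history, 0.71))
print(has_drifted(history, 0.40))
```

A threshold this simple will produce false positives on short histories, so a real monitor would need more context than a single score per round.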

Framework Integrations

ARTEMIS is designed to complement, not replace, existing frameworks:

# LangChain Tool
from artemis.integrations import ArtemisDebateTool
tools = [ArtemisDebateTool()]

# CrewAI Integration
from crewai import Crew
from artemis.integrations import ArtemisCrewTool
crew = Crew(agents=[...], tools=[ArtemisCrewTool()])

# LangGraph Node (`graph` is an existing langgraph StateGraph)
from artemis.integrations import ArtemisDebateNode
graph.add_node("debate", ArtemisDebateNode())

Benchmarks

We ran 60 debates across four frameworks. Here's what we found:

| Framework | Argument Quality | Decision Accuracy | Reasoning Depth |
|---|---|---|---|
| CrewAI | 89.3% | 81.3% | 86.3% |
| ARTEMIS | 81.3% | 67.3% | 84.0% |
| AutoGen | 77.2% | 75.0% | 76.4% |
| CAMEL | 71.0% | 45.2% | 73.7% |

Honest take: CrewAI scored highest on all three metrics. ARTEMIS came second on argument quality and reasoning depth, and third on decision accuracy, behind AutoGen.
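
The per-metric ordering can be checked directly from the table. This short script uses only the figures reported above; the dictionary layout and the `rank_by` helper are illustrative.

```python
# Scores copied from the benchmark table (percentages).
RESULTS = {
    "CrewAI": {"argument_quality": 89.3, "decision_accuracy": 81.3, "reasoning_depth": 86.3},
    "ARTEMIS": {"argument_quality": 81.3, "decision_accuracy": 67.3, "reasoning_depth": 84.0},
    "AutoGen": {"argument_quality": 77.2, "decision_accuracy": 75.0, "reasoning_depth": 76.4},
    "CAMEL": {"argument_quality": 71.0, "decision_accuracy": 45.2, "reasoning_depth": 73.7},
}


def rank_by(metric: str) -> list[str]:
    """Return framework names ordered from highest to lowest score on a metric."""
    return sorted(RESULTS, key=lambda name: RESULTS[name][metric], reverse=True)


for metric in ("argument_quality", "decision_accuracy", "reasoning_depth"):
    print(f"{metric}: {' > '.join(rank_by(metric))}")
```

On decision accuracy the order is CrewAI, AutoGen, ARTEMIS, CAMEL. On the other two metrics ARTEMIS sits directly behind CrewAI.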

But the numbers don't tell the whole story. ARTEMIS had the lowest variance across runs (most consistent), and features like safety monitoring and jury deliberation aren't captured in these metrics. The decision accuracy gap is something we're actively investigating.

This is v1. We're using these results to improve.

See benchmarks/ANALYSIS.md for the full breakdown of what worked, what didn't, and what we're doing about it.

Roadmap

v1.0 (Current)

  • Core ARTEMIS implementation (H-L-DAG, L-AE-CR, Jury)
  • Multi-provider support (OpenAI, Anthropic, Google, DeepSeek)
  • Reasoning model support (o1, R1, Gemini 2.5)
  • Safety monitoring (sandbagging, deception detection)
  • Framework integrations (LangChain, LangGraph, CrewAI)
  • MCP server mode

v2.0 (Planned)

  • Hierarchical debates (debates within debates)
  • Steering vectors for real-time behavior control
  • Multimodal debates (documents, images)
  • Formal verification of argument validity
  • Real-time streaming debates

License

Apache License 2.0 - see LICENSE for details.

Acknowledgments

  • Original ARTEMIS framework design published via Google Technical Disclosure Commons
  • Safety monitoring capabilities inspired by research in AI metacognition
  • Built with support from the open-source AI community

Making AI decision-making more transparent, reasoned, and safe.
