Production-ready multi-agent debate framework with adaptive evaluation and safety monitoring

ARTEMIS Agents

Adaptive Reasoning Through Evaluation of Multi-agent Intelligent Systems

A production-ready framework for structured multi-agent debates with adaptive evaluation, causal reasoning, and built-in safety monitoring.

What is ARTEMIS?

ARTEMIS is an open-source implementation of the Adaptive Reasoning and Evaluation Framework for Multi-agent Intelligent Systems, a framework designed to improve complex decision-making through structured debates between AI agents.

Unlike general-purpose multi-agent frameworks, ARTEMIS is purpose-built for debate-driven decision-making with:

Hierarchical Argument Generation (H-L-DAG): Structured, context-aware argument synthesis
Adaptive Evaluation with Causal Reasoning (L-AE-CR): Dynamic criteria weighting with causal analysis
Jury Scoring Mechanism: Fair, multi-perspective evaluation of arguments
Ethical Alignment: Built-in ethical considerations in both generation and evaluation
Safety Monitoring: Real-time detection of sandbagging, deception, and manipulation
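
To make the jury idea concrete, here is a minimal sketch of multi-juror scoring with weighted criteria. It is not the ARTEMIS implementation: the function names, the weights, and the ballot values are illustrative assumptions, and only the criterion names are taken from the Quick Start example.

```python
from statistics import mean

# Hypothetical criteria weights; the README does not show ARTEMIS's real weighting.
WEIGHTS = {
    "logical_coherence": 0.5,
    "evidence_quality": 0.3,
    "ethical_considerations": 0.2,
}


def juror_score(scores: dict, weights: dict) -> float:
    """Weighted average of one juror's per-criterion scores (each in [0, 1])."""
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight


def panel_score(ballots: list, weights: dict) -> float:
    """Average the weighted scores of all jurors for a single argument."""
    return mean(juror_score(ballot, weights) for ballot in ballots)


# Illustrative ballots from three jurors for one argument (made-up values).
ballots = [
    {"logical_coherence": 0.8, "evidence_quality": 0.6, "ethical_considerations": 0.9},
    {"logical_coherence": 0.7, "evidence_quality": 0.7, "ethical_considerations": 0.8},
    {"logical_coherence": 0.9, "evidence_quality": 0.5, "ethical_considerations": 0.7},
]
score = panel_score(ballots, WEIGHTS)
print(f"Panel score: {score:.2f}")
```

Averaging across jurors keeps a single outlier ballot from deciding the outcome, which is one plausible reading of "multi-perspective evaluation".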

Why ARTEMIS?

Feature	AutoGen	CrewAI	CAMEL	ARTEMIS
Multi-agent debates	⚠️ Basic	⚠️ Basic	⚠️ 2-3 agents	✅ N agents
Structured argument generation	❌	❌	❌	✅ H-L-DAG
Causal reasoning	❌	❌	❌	✅ L-AE-CR
Adaptive evaluation	❌	❌	❌	✅ Dynamic weights
Ethical alignment	❌	❌	❌	✅ Built-in
Sandbagging detection	❌	❌	❌	✅ Metacognition
Reasoning model support	⚠️	⚠️	❌	✅ o1/R1 native
MCP server mode	❌	❌	❌	✅
Real-time streaming	⚠️	❌	❌	✅ v2
Hierarchical debates	❌	❌	❌	✅ v2
Multimodal evidence	⚠️	⚠️	❌	✅ v2
Steering vectors	❌	❌	❌	✅ v2
Argument verification	❌	❌	❌	✅ v2

Installation

pip install artemis-agents

Or install from source:

git clone https://github.com/bassrehab/artemis-agents.git
cd artemis-agents
pip install -e ".[dev]"
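
After either install path, a quick check with the standard library can confirm which version is present. This sketch only uses `importlib.metadata` and the distribution name from the `pip install` command above; the helper name `installed_version` is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "artemis-agents") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"artemis-agents: {found or 'not installed'}")
```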

Quick Start

Basic Debate

from artemis import Debate, Agent, JuryPanel

# Create debate agents with different perspectives
agents = [
    Agent(
        name="Proponent",
        role="Argues in favor of the proposition",
        model="gpt-4o"
    ),
    Agent(
        name="Opponent",
        role="Argues against the proposition",
        model="gpt-4o"
    ),
    Agent(
        name="Moderator",
        role="Ensures balanced discussion and identifies logical fallacies",
        model="gpt-4o"
    ),
]

# Create jury panel for evaluation
jury = JuryPanel(
    evaluators=3,
    criteria=["logical_coherence", "evidence_quality", "ethical_considerations"]
)

# Run the debate
debate = Debate(
    topic="Should AI systems be given legal personhood?",
    agents=agents,
    jury=jury,
    rounds=3
)

result = debate.run()

print(f"Verdict: {result.verdict}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Key arguments: {result.summary}")

With Reasoning Models (o1/R1)

from artemis import Debate, Agent
from artemis.models import ReasoningConfig

# Enable extended thinking for deeper analysis
agent = Agent(
    name="Deep Analyst",
    role="Provides thoroughly reasoned arguments",
    model="deepseek-r1",
    reasoning=ReasoningConfig(
        enabled=True,
        thinking_budget=16000,  # tokens for internal reasoning
        strategy="think-then-argue"
    )
)
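
The README does not define the `"think-then-argue"` strategy, so the following is only one plausible interpretation: a private reasoning pass with a large token budget, followed by a shorter public argument. The `LLM` callable type, `think_then_argue`, `answer_budget`, and `fake_llm` are all hypothetical names used for illustration.

```python
from typing import Callable, Dict

# Stand-in for any text-completion callable: (prompt, max_tokens) -> text.
LLM = Callable[[str, int], str]


def think_then_argue(
    llm: LLM,
    topic: str,
    role: str,
    thinking_budget: int = 16000,
    answer_budget: int = 1000,
) -> Dict[str, str]:
    """Two-phase generation: private reasoning first, then a public argument."""
    thoughts = llm(f"You are: {role}\nThink step by step about: {topic}", thinking_budget)
    argument = llm(
        f"You are: {role}\nTopic: {topic}\nYour private notes:\n{thoughts}\n"
        "Write your argument. Do not quote the notes verbatim.",
        answer_budget,
    )
    return {"thoughts": thoughts, "argument": argument}


def fake_llm(prompt: str, max_tokens: int) -> str:
    """Deterministic stub so the sketch runs without an API key."""
    return f"[{max_tokens} tokens max] reply to {len(prompt)} chars"


out = think_then_argue(
    fake_llm, "Should AI systems be given legal personhood?", "Deep Analyst"
)
print(out["argument"])
```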

With Safety Monitoring

from artemis import Debate
from artemis.safety import SandbagDetector, DeceptionMonitor

debate = Debate(
    topic="Complex ethical scenario",
    agents=[...],
    monitors=[
        SandbagDetector(sensitivity=0.8),       # Detect capability hiding
        DeceptionMonitor(alert_threshold=0.7),  # Detect misleading arguments
    ]
)

result = debate.run()

# Check for safety flags
for alert in result.safety_alerts:
    print(f"⚠️ {alert.agent}: {alert.type} - {alert.description}")
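
How a threshold such as `alert_threshold=0.7` might turn per-agent risk scores into alerts can be sketched as follows. This is an assumption about the general pattern, not ARTEMIS internals: the `Alert` dataclass mirrors only the three attributes printed above, and `threshold_alerts` and the scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    agent: str
    type: str
    description: str


def threshold_alerts(scores: Dict[str, float], alert_threshold: float, kind: str) -> List[Alert]:
    """Emit one alert per agent whose risk score meets or exceeds the threshold."""
    return [
        Alert(agent, kind, f"score {score:.2f} >= threshold {alert_threshold:.2f}")
        for agent, score in scores.items()
        if score >= alert_threshold
    ]


# Illustrative risk scores; a real monitor would compute these from debate turns.
deception_scores = {"Proponent": 0.35, "Opponent": 0.82}
alerts = threshold_alerts(deception_scores, alert_threshold=0.7, kind="deception")
for alert in alerts:
    print(f"{alert.agent}: {alert.type} - {alert.description}")
```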

LangGraph Integration

from langgraph.graph import StateGraph
from artemis.integrations import ArtemisDebateNode

# Use ARTEMIS as a node in your LangGraph workflow.
# `State` is your graph's state schema (for example a TypedDict), defined elsewhere.
workflow = StateGraph(State)

workflow.add_node(
    "structured_debate",
    ArtemisDebateNode(
        agents=3,
        rounds=2,
        jury_size=3
    )
)

# "gather_info" and "final_decision" are nodes you add elsewhere in your workflow
workflow.add_edge("gather_info", "structured_debate")
workflow.add_edge("structured_debate", "final_decision")

MCP Server Mode

# Start ARTEMIS as an MCP server
artemis serve --port 8080

Any MCP-compatible client can now invoke structured debates:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "artemis_debate",
    "arguments": {
      "topic": "Should we proceed with this investment?",
      "perspectives": ["risk", "opportunity", "ethics"],
      "rounds": 2
    }
  }
}
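
MCP messages are JSON-RPC 2.0, so a complete request also carries `jsonrpc` and `id` fields. The sketch below builds such a request with the standard library; the helper name `tools_call` is an assumption, and how the message is transported to the server is left out because the README does not specify it.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def tools_call(name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


request = tools_call(
    "artemis_debate",
    {
        "topic": "Should we proceed with this investment?",
        "perspectives": ["risk", "opportunity", "ethics"],
        "rounds": 2,
    },
)
print(request)
```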

Streaming Debates (v2)

import asyncio

from artemis import StreamingDebate

debate = StreamingDebate(
    topic="Should we adopt microservices?",
    agents=[...],
)

async def main():
    # Stream events in real time
    async for event in debate.run_streaming():
        if event.event_type == "chunk":
            print(event.content, end="", flush=True)
        elif event.event_type == "argument_complete":
            print(f"\n[{event.agent}] argument complete")

# `async for` must run inside a coroutine
asyncio.run(main())
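
To experiment with the event-handling logic without a model behind it, the stream can be replaced by a stub async generator. In this sketch the `Event` dataclass, `fake_stream`, and `collect` are hypothetical stand-ins, and it assumes that `chunk` events carry an `agent` attribute, which the README does not confirm.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass
from typing import AsyncIterator, Dict


@dataclass
class Event:
    event_type: str
    agent: str
    content: str = ""


async def fake_stream() -> AsyncIterator[Event]:
    """Stand-in for `debate.run_streaming()`; yields a fixed event sequence."""
    script = [("Proponent", ["Smaller ", "deploys."]), ("Opponent", ["More ", "ops."])]
    for agent, words in script:
        for word in words:
            await asyncio.sleep(0)  # yield control, as a network stream would
            yield Event("chunk", agent, word)
        yield Event("argument_complete", agent)


async def collect(stream: AsyncIterator[Event]) -> Dict[str, str]:
    """Buffer chunks per agent and return each finished argument."""
    buffers = defaultdict(list)
    finished = {}
    async for event in stream:
        if event.event_type == "chunk":
            buffers[event.agent].append(event.content)
        elif event.event_type == "argument_complete":
            finished[event.agent] = "".join(buffers.pop(event.agent, []))
    return finished


arguments = asyncio.run(collect(fake_stream()))
print(arguments)
```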

Hierarchical Debates (v2)

import asyncio

from artemis import HierarchicalDebate
from artemis.core.decomposition import LLMTopicDecomposer

# Complex topics are automatically decomposed into sub-debates
debate = HierarchicalDebate(
    topic="Should we rewrite the monolith in microservices?",
    agents=[...],
    decomposer=LLMTopicDecomposer(),
    max_depth=2,  # Allow sub-sub-debates
)

async def main():
    result = await debate.run()
    print(f"Final verdict: {result.final_decision}")
    print(f"Sub-verdicts: {len(result.sub_verdicts)}")

# `await` must run inside a coroutine
asyncio.run(main())
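
The decomposition idea can be illustrated with a small recursive sketch: split a topic until `max_depth` is reached or no sub-topics remain, judge the leaves, and combine child verdicts. Majority voting is an assumed aggregation rule, not necessarily what ARTEMIS does, and `hierarchical_verdict`, `TREE`, and `VERDICTS` are hypothetical.

```python
from collections import Counter
from typing import Callable, List

Decomposer = Callable[[str], List[str]]  # topic -> sub-topics (empty if atomic)
Judge = Callable[[str], str]             # topic -> verdict label


def hierarchical_verdict(
    topic: str, decompose: Decomposer, judge: Judge, max_depth: int, depth: int = 0
) -> str:
    """Judge leaf topics directly; combine child verdicts by majority vote."""
    subtopics = decompose(topic) if depth < max_depth else []
    if not subtopics:
        return judge(topic)
    votes = Counter(
        hierarchical_verdict(sub, decompose, judge, max_depth, depth + 1)
        for sub in subtopics
    )
    return votes.most_common(1)[0][0]


# Stub decomposition and leaf verdicts, for illustration only.
TREE = {"rewrite monolith": ["team readiness", "operational cost", "scaling needs"]}
VERDICTS = {"team readiness": "against", "operational cost": "against", "scaling needs": "for"}

verdict = hierarchical_verdict(
    "rewrite monolith",
    decompose=lambda t: TREE.get(t, []),
    judge=lambda t: VERDICTS.get(t, "undecided"),
    max_depth=2,
)
print(f"Final verdict: {verdict}")
```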

Steering Vectors (v2)

from artemis import Agent
from artemis.steering import SteeringController, SteeringVector

# Control agent behavior in real-time
controller = SteeringController(
    vector=SteeringVector(
        formality=0.9,      # Very formal
        aggression=0.2,     # Cooperative
        evidence_emphasis=0.8,  # Data-driven
    )
)

agent = Agent(
    name="Analyst",
    steering=controller,
)

Multimodal Evidence (v2)

import asyncio

from artemis.core.multimodal_evidence import MultimodalEvidenceExtractor
from artemis.core.types import ContentPart, ContentType

# Extract evidence from images and documents
extractor = MultimodalEvidenceExtractor(model="gpt-4o")

chart = ContentPart(
    type=ContentType.IMAGE,
    url="https://example.com/revenue-chart.png"
)

async def main():
    evidence = await extractor.extract(chart)
    print(f"Extracted: {evidence.text}")

# `await` must run inside a coroutine
asyncio.run(main())

Argument Verification (v2)

import asyncio

from artemis.core.verification import ArgumentVerifier, VerificationSpec

# Verify argument validity
verifier = ArgumentVerifier(
    spec=VerificationSpec(
        rules=["causal_chain", "citation", "logical_consistency"],
        strict_mode=True,
    )
)

async def main():
    # `argument` and `context` come from a debate turn and are not defined here
    report = await verifier.verify(argument, context)
    if not report.overall_passed:
        print(f"Violations: {report.violations}")

# `await` must run inside a coroutine
asyncio.run(main())
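
Rule-based verification of this kind can be pictured as a list of checks that each return a violation message or nothing. The sketch below is a simplified, text-only assumption: the regex heuristics, `Report`, and `verify` are hypothetical and far cruder than what a real `causal_chain` or `citation` rule would need.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Rule = Callable[[str], Optional[str]]  # returns a violation message, or None if passed


def citation_rule(text: str) -> Optional[str]:
    """Require at least one bracketed numeric citation such as [1]."""
    if re.search(r"\[\d+\]", text):
        return None
    return "citation: no [n] reference found"


def causal_chain_rule(text: str) -> Optional[str]:
    """Require an explicit causal connective."""
    if re.search(r"\b(because|therefore|leads to)\b", text, re.IGNORECASE):
        return None
    return "causal_chain: no causal connective found"


@dataclass
class Report:
    violations: List[str] = field(default_factory=list)

    @property
    def overall_passed(self) -> bool:
        return not self.violations


def verify(text: str, rules: List[Rule]) -> Report:
    """Run every rule and collect the messages of those that fail."""
    messages = [rule(text) for rule in rules]
    return Report([m for m in messages if m is not None])


rules = [citation_rule, causal_chain_rule]
good = verify("Latency rose because of chatty services [1].", rules)
bad = verify("Microservices are better.", rules)
print(good.overall_passed, bad.violations)
```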

Architecture

┌────────────────────────────────────────────────────────────────┐
│                        ARTEMIS Core                            │
├────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │   H-L-DAG   │  │   L-AE-CR   │  │    Jury     │             │
│  │  Argument   │──│  Adaptive   │──│   Scoring   │             │
│  │ Generation  │  │ Evaluation  │  │  Mechanism  │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│         │                │                │                    │
│         └────────────────┴────────────────┘                    │
│                          │                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Safety Layer                          │   │
│  │  ┌───────────┐  ┌──────────┐  ┌──────────┐  ┌─────────┐ │   │
│  │  │Sandbagging│  │Deception │  │ Behavior │  │ Ethics  │ │   │
│  │  │ Detector  │  │ Monitor  │  │ Tracker  │  │ Guard   │ │   │
│  │  └───────────┘  └──────────┘  └──────────┘  └─────────┘ │   │
│  └─────────────────────────────────────────────────────────┘   │
├────────────────────────────────────────────────────────────────┤
│                       Integrations                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │LangChain │  │LangGraph │  │ CrewAI   │  │   MCP    │        │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘        │
├────────────────────────────────────────────────────────────────┤
│                      Model Providers                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │  OpenAI  │  │Anthropic │  │  Google  │  │ DeepSeek │        │
│  │ (GPT-4o) │  │ (Claude) │  │ (Gemini) │  │  (R1)    │        │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘        │
└────────────────────────────────────────────────────────────────┘

Documentation

Full Documentation - Guides, API reference, examples
Examples - Real-world usage examples
Contributing - How to contribute

Research Foundation

ARTEMIS is based on published research:

Adaptive Reasoning and Evaluation Framework for Multi-agent Intelligent Systems in Debate-driven Decision-making
Mitra, S. (2025). Technical Disclosure Commons.
Read the paper

Key innovations from the paper:

Hierarchical Argument Generation (H-L-DAG): Multi-level argument synthesis with strategic, tactical, and operational layers
Adaptive Evaluation with Causal Reasoning (L-AE-CR): Dynamic criteria weighting based on debate context
Ethical Alignment Integration: Built-in ethical considerations at every stage
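
Dynamic criteria weighting can be sketched as scaling base weights by a context signal and renormalizing. This is an illustrative assumption about the general technique, not the L-AE-CR algorithm from the paper; `adapt_weights` and the boost values are hypothetical.

```python
from typing import Dict


def adapt_weights(base: Dict[str, float], boosts: Dict[str, float]) -> Dict[str, float]:
    """Multiply each base weight by its context boost, then renormalize to sum to 1."""
    raw = {criterion: w * boosts.get(criterion, 1.0) for criterion, w in base.items()}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {criterion: w / total for criterion, w in raw.items()}


base = {"logical_coherence": 0.4, "evidence_quality": 0.4, "ethical_considerations": 0.2}
# Hypothetical context signal: an ethics-heavy topic doubles the ethics criterion.
weights = adapt_weights(base, {"ethical_considerations": 2.0})
print(weights)
```

Renormalizing keeps scores comparable across debates even when the emphasis shifts between criteria.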

Safety Features

ARTEMIS includes novel safety monitoring capabilities:

Feature	Description
Sandbagging Detection	Identifies when agents deliberately underperform or withhold capabilities
Deception Monitoring	Detects misleading arguments or manipulation attempts
Behavioral Drift Tracking	Monitors for unexpected changes in agent behavior
Ethical Boundary Enforcement	Ensures debates stay within defined ethical bounds

These features leverage activation-level analysis and are based on research in AI metacognition.
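
As a rough illustration of behavioral drift tracking, the sketch below flags a metric value that departs sharply from its recent history. It works on output-level scores rather than model activations, so it is a simpler technique than the activation-level analysis mentioned above; `DriftTracker`, its parameters, and the sample values are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class DriftTracker:
    """Flag a metric value that deviates sharply from its recent history."""

    def __init__(self, window: int = 5, z_threshold: float = 3.0) -> None:
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold

    def update(self, value: float) -> bool:
        """Record a value; return True if it drifted from a full history window."""
        drifted = False
        if len(self.history) == self.history.maxlen:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            # Floor the spread so a constant history does not flag tiny changes.
            drifted = abs(value - mu) > self.z_threshold * max(sigma, 1e-6)
        self.history.append(value)
        return drifted


tracker = DriftTracker(window=3, z_threshold=3.0)
# Illustrative per-round quality scores for one agent (made-up values).
flags = [tracker.update(v) for v in [0.70, 0.72, 0.71, 0.70, 0.20]]
print(flags)
```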

Framework Integrations

ARTEMIS is designed to complement, not replace, existing frameworks:

# LangChain Tool
from artemis.integrations import ArtemisDebateTool
tools = [ArtemisDebateTool()]

# CrewAI Integration
from crewai import Crew
from artemis.integrations import ArtemisCrewTool
crew = Crew(agents=[...], tools=[ArtemisCrewTool()])

# LangGraph Node (`graph` is an existing StateGraph)
from artemis.integrations import ArtemisDebateNode
graph.add_node("debate", ArtemisDebateNode())

Benchmarks

We ran 27 debates across three frameworks using GPT-4o. Here's what we found:

Framework	Argument Quality	Decision Accuracy	Reasoning Depth	Consistency (σ)
ARTEMIS	77.9%	86.0%	75.3%	±1.6
AutoGen	77.3%	55.0%	74.7%	±0.5
CrewAI	75.1%	42.8%	57.0%	±16.0

Key findings:

ARTEMIS leads in decision accuracy (86.0% vs next best 55.0%), which suggests the jury deliberation mechanism improves verdicts
Low variance across runs (±1.6 vs CrewAI's ±16.0), although AutoGen's ±0.5 is lower still
Reasoning depth (75.3%) is on par with AutoGen (74.7%) and well above CrewAI (57.0%), consistent with structured H-L-DAG arguments

Trade-off: ARTEMIS averages 102s per debate vs AutoGen's 36s. The jury deliberation adds latency but improves verdict quality.

See benchmarks/ANALYSIS.md for methodology and detailed breakdown.
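
The headline comparisons can be reproduced from the reported figures with a few lines of arithmetic. The numbers below are copied from the table and the trade-off note; nothing here is newly measured.

```python
# Figures copied from the benchmark table and trade-off note above.
accuracy = {"ARTEMIS": 86.0, "AutoGen": 55.0, "CrewAI": 42.8}
latency_s = {"ARTEMIS": 102, "AutoGen": 36}

best, runner_up = sorted(accuracy, key=accuracy.get, reverse=True)[:2]
accuracy_gap = accuracy[best] - accuracy[runner_up]          # percentage points
latency_ratio = latency_s["ARTEMIS"] / latency_s["AutoGen"]  # relative slowdown

print(f"{best} leads {runner_up} by {accuracy_gap:.1f} points in decision accuracy")
print(f"ARTEMIS takes {latency_ratio:.1f}x as long per debate as AutoGen")
```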

Roadmap

v1.0

Core ARTEMIS implementation (H-L-DAG, L-AE-CR, Jury)
Multi-provider support (OpenAI, Anthropic, Google, DeepSeek)
Reasoning model support (o1, R1, Gemini 2.5)
Safety monitoring (sandbagging, deception detection)
Framework integrations (LangChain, LangGraph, CrewAI)
MCP server mode

v2.0 (Current)

Hierarchical debates (sub-debates for complex topics)
Steering vectors for real-time behavior control
Multimodal debates (images, documents, charts)
Formal verification of argument validity
Real-time streaming debates

v3.0 (Planned)

Distributed debate execution
Custom evaluation plugins
Debate templates and presets
Advanced causal graph visualization

License

Apache License 2.0 - see LICENSE for details.

Acknowledgments

Original ARTEMIS framework design published via Google Technical Disclosure Commons
Safety monitoring capabilities inspired by research in AI metacognition
Built with support from the open-source AI community

Contact

Author: Subhadip Mitra
GitHub: @bassrehab
Twitter/X: @bassrehab

Making AI decision-making more transparent, reasoned, and safe.

Project details

Release history

2.0.0 (this version): Dec 29, 2025
1.0.1: Dec 28, 2025
1.0.0: Dec 28, 2025

Download files

Source Distribution

artemis_agents-2.0.0.tar.gz (1.6 MB)

Uploaded Dec 29, 2025 Source

Built Distribution

artemis_agents-2.0.0-py3-none-any.whl (236.5 kB)

Uploaded Dec 29, 2025 Python 3

File details

Details for the file artemis_agents-2.0.0.tar.gz.

File metadata

Download URL: artemis_agents-2.0.0.tar.gz
Upload date: Dec 29, 2025
Size: 1.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for artemis_agents-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`2bfcc7541937f588d11d44ba32a372c5859ebae986e392131f5245b904f5c2b4`
MD5	`4385d2e253eb75b5745964cb26ef39ff`
BLAKE2b-256	`dab67989820b3122b55358235767a4890c3fc9d0a50f6f7343868acc59bb8a9c`

File details

Details for the file artemis_agents-2.0.0-py3-none-any.whl.

File metadata

Download URL: artemis_agents-2.0.0-py3-none-any.whl
Upload date: Dec 29, 2025
Size: 236.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for artemis_agents-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`18ea4e7fc24732b33067714cee89d12c6bf3fc26853a2c13fc87d598a535e1fb`
MD5	`f5aee1ef0c245334a3086ba9caac0df0`
BLAKE2b-256	`97d3900ea2f8afdd7e721df59439a86875d952a97916a9f56a023cd5e186e527`
