ARTEMIS Agents
Adaptive Reasoning Through Evaluation of Multi-agent Intelligent Systems
A production-ready framework for structured multi-agent debates with adaptive evaluation, causal reasoning, and built-in safety monitoring.
What is ARTEMIS?
ARTEMIS is an open-source implementation of the Adaptive Reasoning and Evaluation Framework for Multi-agent Intelligent Systems, a framework designed to improve complex decision-making through structured debates between AI agents.
Unlike general-purpose multi-agent frameworks, ARTEMIS is purpose-built for debate-driven decision-making with:
- Hierarchical Argument Generation (H-L-DAG): Structured, context-aware argument synthesis
- Adaptive Evaluation with Causal Reasoning (L-AE-CR): Dynamic criteria weighting with causal analysis
- Jury Scoring Mechanism: Fair, multi-perspective evaluation of arguments
- Ethical Alignment: Built-in ethical considerations in both generation and evaluation
- Safety Monitoring: Real-time detection of sandbagging, deception, and manipulation
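To make "dynamic criteria weighting" concrete, here is a minimal sketch that is independent of the ARTEMIS codebase. The function names `adapt_weights` and `weighted_score`, the boost factor, and all numeric values are illustrative assumptions, not the library's API: base weights are scaled by a context-dependent boost, renormalized to sum to 1, and then applied to per-criterion scores.

```python
from typing import Dict


def adapt_weights(base: Dict[str, float], boost: Dict[str, float]) -> Dict[str, float]:
    """Scale each base weight by a context boost, then renormalize to sum to 1."""
    scaled = {name: w * boost.get(name, 1.0) for name, w in base.items()}
    total = sum(scaled.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: w / total for name, w in scaled.items()}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-criterion scores (0..1) using the given weights."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical values: an ethics-heavy topic doubles the ethics weight.
base_weights = {
    "logical_coherence": 0.4,
    "evidence_quality": 0.4,
    "ethical_considerations": 0.2,
}
weights = adapt_weights(base_weights, {"ethical_considerations": 2.0})
scores = {
    "logical_coherence": 0.8,
    "evidence_quality": 0.6,
    "ethical_considerations": 0.9,
}
print(weights)
print(round(weighted_score(scores, weights), 3))
```

Renormalizing after the boost keeps scores comparable across debates, since the weights always sum to 1 regardless of how many criteria are boosted.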
Why ARTEMIS?
| Feature | AutoGen | CrewAI | CAMEL | ARTEMIS |
|---|---|---|---|---|
| Multi-agent debates | ⚠️ Basic | ⚠️ Basic | ⚠️ 2-3 agents | ✅ N agents |
| Structured argument generation | ❌ | ❌ | ❌ | ✅ H-L-DAG |
| Causal reasoning | ❌ | ❌ | ❌ | ✅ L-AE-CR |
| Adaptive evaluation | ❌ | ❌ | ❌ | ✅ Dynamic weights |
| Ethical alignment | ❌ | ❌ | ❌ | ✅ Built-in |
| Sandbagging detection | ❌ | ❌ | ❌ | ✅ Metacognition |
| Reasoning model support | ⚠️ | ⚠️ | ❌ | ✅ o1/R1 native |
| MCP server mode | ❌ | ❌ | ❌ | ✅ |
| Real-time streaming | ⚠️ | ❌ | ❌ | ✅ v2 |
| Hierarchical debates | ❌ | ❌ | ❌ | ✅ v2 |
| Multimodal evidence | ⚠️ | ⚠️ | ❌ | ✅ v2 |
| Steering vectors | ❌ | ❌ | ❌ | ✅ v2 |
| Argument verification | ❌ | ❌ | ❌ | ✅ v2 |
Installation
pip install artemis-agents
Or install from source:
git clone https://github.com/bassrehab/artemis-agents.git
cd artemis-agents
pip install -e ".[dev]"
Quick Start
Basic Debate
from artemis import Debate, Agent, JuryPanel

# Create debate agents with different perspectives
agents = [
    Agent(
        name="Proponent",
        role="Argues in favor of the proposition",
        model="gpt-4o"
    ),
    Agent(
        name="Opponent",
        role="Argues against the proposition",
        model="gpt-4o"
    ),
    Agent(
        name="Moderator",
        role="Ensures balanced discussion and identifies logical fallacies",
        model="gpt-4o"
    ),
]

# Create jury panel for evaluation
jury = JuryPanel(
    evaluators=3,
    criteria=["logical_coherence", "evidence_quality", "ethical_considerations"]
)

# Run the debate
debate = Debate(
    topic="Should AI systems be given legal personhood?",
    agents=agents,
    jury=jury,
    rounds=3
)
result = debate.run()

print(f"Verdict: {result.verdict}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Key arguments: {result.summary}")
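For intuition about how a jury panel might turn individual ballots into a verdict and a confidence value, the sketch below uses a plain majority vote. This is an assumption made for illustration: `aggregate_votes` is a hypothetical helper, and ARTEMIS may compute `result.verdict` and `result.confidence` quite differently.

```python
from collections import Counter
from typing import List, Tuple


def aggregate_votes(votes: List[str]) -> Tuple[str, float]:
    """Return the majority verdict and the share of jurors who backed it."""
    if not votes:
        raise ValueError("at least one vote is required")
    # On a tie, Counter.most_common keeps the option that was seen first.
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / len(votes)


# Hypothetical ballots from a three-member jury.
verdict, confidence = aggregate_votes(["Proponent", "Opponent", "Proponent"])
print(f"Verdict: {verdict}")
print(f"Confidence: {confidence:.2%}")
```

An odd number of evaluators, as in the `evaluators=3` example, avoids ties in a two-sided debate under this scheme.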
With Reasoning Models (o1/R1)
from artemis import Debate, Agent
from artemis.models import ReasoningConfig

# Enable extended thinking for deeper analysis
agent = Agent(
    name="Deep Analyst",
    role="Provides thoroughly reasoned arguments",
    model="deepseek-r1",
    reasoning=ReasoningConfig(
        enabled=True,
        thinking_budget=16000,  # tokens for internal reasoning
        strategy="think-then-argue"
    )
)
With Safety Monitoring
from artemis import Debate
from artemis.safety import SandbagDetector, DeceptionMonitor

debate = Debate(
    topic="Complex ethical scenario",
    agents=[...],
    monitors=[
        SandbagDetector(sensitivity=0.8),      # Detect capability hiding
        DeceptionMonitor(alert_threshold=0.7)  # Detect misleading arguments
    ]
)
result = debate.run()

# Check for safety flags
for alert in result.safety_alerts:
    print(f"⚠️ {alert.agent}: {alert.type} - {alert.description}")
LangGraph Integration
from langgraph.graph import StateGraph
from artemis.integrations import ArtemisDebateNode

# Use ARTEMIS as a node in your LangGraph workflow.
# `State` is your application's graph state schema; the "gather_info" and
# "final_decision" nodes are assumed to be added elsewhere in your workflow.
workflow = StateGraph(State)
workflow.add_node(
    "structured_debate",
    ArtemisDebateNode(
        agents=3,
        rounds=2,
        jury_size=3
    )
)
workflow.add_edge("gather_info", "structured_debate")
workflow.add_edge("structured_debate", "final_decision")
MCP Server Mode
# Start ARTEMIS as an MCP server
artemis serve --port 8080
Any MCP-compatible client can now invoke structured debates:
{
  "method": "tools/call",
  "params": {
    "name": "artemis_debate",
    "arguments": {
      "topic": "Should we proceed with this investment?",
      "perspectives": ["risk", "opportunity", "ethics"],
      "rounds": 2
    }
  }
}
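MCP messages are JSON-RPC 2.0, so on the wire a `tools/call` request also carries `jsonrpc` and `id` fields around the payload shown above. The sketch below builds that envelope with the standard library only; `build_tool_call` is a hypothetical helper name, and how the request is delivered (stdio, HTTP, and so on) depends on the client and transport you use.

```python
import json
from itertools import count

_request_ids = count(1)  # each JSON-RPC request needs its own id


def build_tool_call(name: str, arguments: dict) -> dict:
    """Wrap a tool invocation in a JSON-RPC 2.0 request envelope."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    "artemis_debate",
    {
        "topic": "Should we proceed with this investment?",
        "perspectives": ["risk", "opportunity", "ethics"],
        "rounds": 2,
    },
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library usually builds this envelope for you; the sketch only shows what the complete message looks like.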
Streaming Debates (v2)
import asyncio

from artemis import StreamingDebate

debate = StreamingDebate(
    topic="Should we adopt microservices?",
    agents=[...],
)


async def main():
    # Stream events in real-time
    async for event in debate.run_streaming():
        if event.event_type == "chunk":
            print(event.content, end="", flush=True)
        elif event.event_type == "argument_complete":
            print(f"\n[{event.agent}] argument complete")


asyncio.run(main())
Hierarchical Debates (v2)
import asyncio

from artemis import HierarchicalDebate
from artemis.core.decomposition import LLMTopicDecomposer

# Complex topics are automatically decomposed into sub-debates
debate = HierarchicalDebate(
    topic="Should we rewrite the monolith in microservices?",
    agents=[...],
    decomposer=LLMTopicDecomposer(),
    max_depth=2,  # Allow sub-sub-debates
)


async def main():
    result = await debate.run()
    print(f"Final verdict: {result.final_decision}")
    print(f"Sub-verdicts: {len(result.sub_verdicts)}")


asyncio.run(main())
Steering Vectors (v2)
from artemis import Agent
from artemis.steering import SteeringController, SteeringVector

# Control agent behavior in real-time
controller = SteeringController(
    vector=SteeringVector(
        formality=0.9,          # Very formal
        aggression=0.2,         # Cooperative
        evidence_emphasis=0.8,  # Data-driven
    )
)
agent = Agent(
    name="Analyst",
    steering=controller,
)
Multimodal Evidence (v2)
import asyncio

from artemis.core.multimodal_evidence import MultimodalEvidenceExtractor
from artemis.core.types import ContentPart, ContentType

# Extract evidence from images and documents
extractor = MultimodalEvidenceExtractor(model="gpt-4o")
chart = ContentPart(
    type=ContentType.IMAGE,
    url="https://example.com/revenue-chart.png"
)


async def main():
    evidence = await extractor.extract(chart)
    print(f"Extracted: {evidence.text}")


asyncio.run(main())
Argument Verification (v2)
import asyncio

from artemis.core.verification import ArgumentVerifier, VerificationSpec

# Verify argument validity
verifier = ArgumentVerifier(
    spec=VerificationSpec(
        rules=["causal_chain", "citation", "logical_consistency"],
        strict_mode=True,
    )
)


async def main():
    # `argument` and `context` come from a debate you have already run
    report = await verifier.verify(argument, context)
    if not report.overall_passed:
        print(f"Violations: {report.violations}")


asyncio.run(main())
Architecture
┌────────────────────────────────────────────────────────────────┐
│                          ARTEMIS Core                          │
├────────────────────────────────────────────────────────────────┤
│       ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│       │   H-L-DAG   │  │   L-AE-CR   │  │    Jury     │        │
│       │  Argument   │──│  Adaptive   │──│   Scoring   │        │
│       │ Generation  │  │ Evaluation  │  │  Mechanism  │        │
│       └─────────────┘  └─────────────┘  └─────────────┘        │
│              │                │                │               │
│              └────────────────┴────────────────┘               │
│                               │                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                       Safety Layer                       │  │
│  │  ┌───────────┐  ┌──────────┐  ┌──────────┐  ┌─────────┐  │  │
│  │  │Sandbagging│  │Deception │  │ Behavior │  │ Ethics  │  │  │
│  │  │ Detector  │  │ Monitor  │  │ Tracker  │  │  Guard  │  │  │
│  │  └───────────┘  └──────────┘  └──────────┘  └─────────┘  │  │
│  └──────────────────────────────────────────────────────────┘  │
├────────────────────────────────────────────────────────────────┤
│                          Integrations                          │
│     ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│     │LangChain │  │LangGraph │  │  CrewAI  │  │   MCP    │     │
│     └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
├────────────────────────────────────────────────────────────────┤
│                        Model Providers                         │
│     ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│     │  OpenAI  │  │Anthropic │  │  Google  │  │ DeepSeek │     │
│     │ (GPT-4o) │  │ (Claude) │  │ (Gemini) │  │   (R1)   │     │
│     └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
└────────────────────────────────────────────────────────────────┘
Documentation
- Full Documentation - Guides, API reference, examples
- Examples - Real-world usage examples
- Contributing - How to contribute
Research Foundation
ARTEMIS is based on published research:
Adaptive Reasoning and Evaluation Framework for Multi-agent Intelligent Systems in Debate-driven Decision-making
Mitra, S. (2025). Technical Disclosure Commons.
Read the paper
Key innovations from the paper:
- Hierarchical Argument Generation (H-L-DAG): Multi-level argument synthesis with strategic, tactical, and operational layers
- Adaptive Evaluation with Causal Reasoning (L-AE-CR): Dynamic criteria weighting based on debate context
- Ethical Alignment Integration: Built-in ethical considerations at every stage
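One way to picture the strategic, tactical, and operational layers is as a three-level hierarchy of claims. The sketch below uses a plain tree as a simplification; `ArgumentNode`, `LEVELS`, and the example claims are hypothetical and are not taken from the paper or from the ARTEMIS source.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

LEVELS = ("strategic", "tactical", "operational")


@dataclass
class ArgumentNode:
    """One claim in a three-level argument hierarchy (illustrative only)."""
    level: str
    claim: str
    children: List["ArgumentNode"] = field(default_factory=list)

    def add(self, claim: str) -> "ArgumentNode":
        """Attach a supporting claim one level below this node."""
        depth = LEVELS.index(self.level)
        if depth + 1 >= len(LEVELS):
            raise ValueError("operational claims cannot have children")
        child = ArgumentNode(LEVELS[depth + 1], claim)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "ArgumentNode"]]:
        """Yield (depth, node) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Hypothetical claims, one per layer.
root = ArgumentNode("strategic", "Adopt microservices so teams can ship independently")
tactic = root.add("Extract the billing module first")
tactic.add("Cite deployment data for the billing module as evidence")

for depth, node in root.walk():
    print("  " * depth + f"[{node.level}] {node.claim}")
```

Keeping the level on each node makes it straightforward to evaluate layers separately, for example scoring operational claims on evidence quality and strategic claims on coherence.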
Safety Features
ARTEMIS includes novel safety monitoring capabilities:
| Feature | Description |
|---|---|
| Sandbagging Detection | Identifies when agents deliberately underperform or withhold capabilities |
| Deception Monitoring | Detects misleading arguments or manipulation attempts |
| Behavioral Drift Tracking | Monitors for unexpected changes in agent behavior |
| Ethical Boundary Enforcement | Ensures debates stay within defined ethical bounds |
These features leverage activation-level analysis and are based on research in AI metacognition.
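As a rough illustration of what behavioral drift tracking involves, the sketch below flags an agent whose recent per-round scores depart from its earlier average. It is a simple score-based heuristic written for this README, not activation-level analysis and not ARTEMIS's detector; `drift_detected`, the window size, the threshold, and the sample scores are all assumptions.

```python
from statistics import mean
from typing import Sequence


def drift_detected(scores: Sequence[float], window: int = 3, threshold: float = 0.2) -> bool:
    """Flag drift when the latest window's mean differs from the earlier mean by more than threshold."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare
    baseline = mean(scores[:-window])
    recent = mean(scores[-window:])
    return abs(recent - baseline) > threshold


# Hypothetical per-round quality scores for one agent (0..1).
steady = [0.80, 0.82, 0.79, 0.81, 0.80, 0.78]
dropping = [0.80, 0.82, 0.79, 0.50, 0.45, 0.40]
print(drift_detected(steady))
print(drift_detected(dropping))
```

A sudden drop like the second series is also the kind of pattern a sandbagging check would want to inspect, although telling deliberate underperformance apart from a genuinely harder round needs more signal than scores alone.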
Framework Integrations
ARTEMIS is designed to complement, not replace, existing frameworks:
# LangChain Tool
from artemis.integrations import ArtemisDebateTool
tools = [ArtemisDebateTool()]

# CrewAI Integration
from crewai import Crew
from artemis.integrations import ArtemisCrewTool
crew = Crew(agents=[...], tools=[ArtemisCrewTool()])

# LangGraph Node (`graph` is an existing StateGraph)
from artemis.integrations import ArtemisDebateNode
graph.add_node("debate", ArtemisDebateNode())
Benchmarks
We ran 27 debates across three frameworks using GPT-4o. Here's what we found:
| Framework | Argument Quality | Decision Accuracy | Reasoning Depth | Consistency (σ) |
|---|---|---|---|---|
| ARTEMIS | 77.9% | 86.0% | 75.3% | ±1.6 |
| AutoGen | 77.3% | 55.0% | 74.7% | ±0.5 |
| CrewAI | 75.1% | 42.8% | 57.0% | ±16.0 |
Key findings:
- ARTEMIS leads in decision accuracy (86.0% vs. the next best 55.0%), which suggests the jury deliberation mechanism improves verdicts
- Run-to-run variance is low (±1.6 vs. CrewAI's ±16.0), although AutoGen's was lower still (±0.5)
- Structured H-L-DAG arguments score highest on reasoning depth (75.3%), narrowly ahead of AutoGen (74.7%)
Trade-off: ARTEMIS averages 102s per debate vs AutoGen's 36s. The jury deliberation adds latency but improves verdict quality.
See benchmarks/ANALYSIS.md for methodology and detailed breakdown.
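To put the accuracy and latency trade-off side by side, the short calculation below uses only the figures reported in this section; the variable names are illustrative.

```python
# Figures copied from the benchmark table and the trade-off note in this section.
accuracy = {"ARTEMIS": 86.0, "AutoGen": 55.0, "CrewAI": 42.8}
latency_s = {"ARTEMIS": 102, "AutoGen": 36}

accuracy_gap = accuracy["ARTEMIS"] - accuracy["AutoGen"]  # percentage points
slowdown = latency_s["ARTEMIS"] / latency_s["AutoGen"]    # ratio of seconds per debate

print(f"Accuracy gap vs AutoGen: {accuracy_gap:.1f} points")
print(f"Latency ratio vs AutoGen: {slowdown:.2f}x")
```

On these numbers ARTEMIS gains 31 percentage points of decision accuracy over AutoGen while taking roughly 2.8 times as long per debate.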
Roadmap
v1.0
- Core ARTEMIS implementation (H-L-DAG, L-AE-CR, Jury)
- Multi-provider support (OpenAI, Anthropic, Google, DeepSeek)
- Reasoning model support (o1, R1, Gemini 2.5)
- Safety monitoring (sandbagging, deception detection)
- Framework integrations (LangChain, LangGraph, CrewAI)
- MCP server mode
v2.0 (Current)
- Hierarchical debates (sub-debates for complex topics)
- Steering vectors for real-time behavior control
- Multimodal debates (images, documents, charts)
- Formal verification of argument validity
- Real-time streaming debates
v3.0 (Planned)
- Distributed debate execution
- Custom evaluation plugins
- Debate templates and presets
- Advanced causal graph visualization
License
Apache License 2.0 - see LICENSE for details.
Acknowledgments
- Original ARTEMIS framework design published via Google Technical Disclosure Commons
- Safety monitoring capabilities inspired by research in AI metacognition
- Built with support from the open-source AI community
Contact
- Author: Subhadip Mitra
- GitHub: @bassrehab
- Twitter/X: @bassrehab
Making AI decision-making more transparent, reasoned, and safe.