Evaluation-Driven Development toolkit for OpenClaw agents

openclaw-edd

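A zero-friction EDD toolkit. It never touches OpenClaw itself; its only data source is the log files.

Features

  • Zero config - run pip install openclaw-edd && edd watch and it works
  • Zero intrusion - no changes to the OpenClaw configuration, no Gateway restart
  • Zero dependencies - core features need no external libraries (PyYAML is optional)
  • Complete loop - watch → run → suggest → apply → diff → mine → export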

Quick start

# Install
pip install openclaw-edd

# Live monitoring
edd watch

# Standard EDD loop
edd run --cases cases.yaml --output-json report_v1.json
edd edd suggest --report report_v1.json > suggestion.txt
edd edd apply --suggestion-file suggestion.txt
edd run --cases cases.yaml --output-json report_v2.json
edd edd diff --before report_v1.json --after report_v2.json

# Mine golden cases from production logs
edd edd mine --output mined_cases.yaml
edd run --cases mined_cases.yaml --output-json regression.json

# Export the golden dataset
edd edd export --output golden.jsonl
edd edd export --format csv --output review.csv

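Commands

Core commands

  • watch - tail the log in real time and print the stream of tool events
  • trace - replay a historical event chain
  • state - view or modify session state
  • artifacts - manage tool output files
  • sessions - list or inspect past sessions
  • run - run an evaluation case suite
  • gen-cases - generate a case template

EDD loop commands

  • edd suggest - generate change suggestions from failed cases
  • edd apply - apply suggestions to the workspace
  • edd diff - compare the changes between two runs
  • edd mine - mine golden cases from historical logs
  • edd judge - use an LLM to score tool selection and output quality
  • edd export - export the golden dataset (JSONL/CSV)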

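Detailed usage

watch - live monitoring

# Basic usage
edd watch

# Filter by session
edd watch --session <session_id_prefix>

# Read from the start of the file (replays today's history)
edd watch --from-start

# Run in the background
edd watch --daemon
# Stop the background watcher using its PID file
kill $(cat /tmp/openclaw_edd_watch.pid)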

run - run evaluations

# Use the built-in cases
edd run

# Use custom cases
edd run --cases cases.yaml

# Filter by tags
edd run --cases cases.yaml --tags smoke,mysql

# Single case (given on the command line)
edd run --case "What is the weather in Shanghai today" --expect-tools get_weather

# Show a detailed trace of tool calls
edd run --cases cases.yaml --show-trace

# Use --local mode (ensures logs are written locally)
edd run --cases cases.yaml --agent main --local

# Write reports
edd run --output-json report.json
edd run --output-html report.html

# Dry run (sends no messages, only parses the logs)
edd run --dry-run

edd suggest - generate suggestions

edd edd suggest --report report.json
edd edd suggest --report report.json --workspace ~/.openclaw/workspace
edd edd suggest --report report.json > suggestion.txt

edd diff - compare runs

edd edd diff --before report_v1.json --after report_v2.json
edd edd diff --before report_v1.json --after report_v2.json --format json
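
The idea behind comparing two runs can be sketched in a few lines of Python. This is only an illustration, not the implementation of edd diff: the report schema is not documented here, so the sketch assumes each run has already been reduced to a hypothetical mapping from case id to a pass/fail boolean, and diff_runs is a made-up helper name.

```python
def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Classify case ids by how their pass/fail status changed between two runs.

    Both arguments are hypothetical {case_id: passed} mappings, not the
    real report format written by `edd run --output-json`.
    """
    shared = before.keys() & after.keys()
    return {
        # passed before, fails now
        "regressed": sorted(c for c in shared if before[c] and not after[c]),
        # failed before, passes now
        "fixed": sorted(c for c in shared if not before[c] and after[c]),
        # cases present in only one of the two runs
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
    }


v1 = {"mysql_slow_query": False, "mysql_alert_check": True}
v2 = {"mysql_slow_query": True, "mysql_alert_check": False, "mysql_basic_query": True}
print(diff_runs(v1, v2))
```

Keeping regressions separate from fixes matters because a run with an unchanged overall pass rate can still hide a regression that a fix elsewhere cancels out.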

edd mine - mine cases

edd edd mine
edd edd mine --output mined_cases.yaml
edd edd mine --min-tools 2

edd judge - LLM evaluation

# Use an LLM to evaluate the test results
export ANTHROPIC_API_KEY=your_key
edd edd judge --report report.json
edd edd judge --report report.json --output judged_report.json
edd edd judge --report report.json --model claude-opus-4-6

# See the detailed documentation
cat docs/JUDGE_COMMAND.md

edd export - export the dataset

# Export JSONL
edd edd export --output golden.jsonl

# Merge a run report to get a more accurate golden_output
edd run --cases cases.yaml --output-json report.json
edd edd export --merge-report report.json --output golden.jsonl

# Export CSV for manual review by domain experts
edd edd export --format csv --output review.csv

Case format

cases:
  - id: mysql_slow_query
    message: "Has MySQL had any slow queries recently?"
    eval_type: regression          # "regression" (guard against regressions) | "capability" (capability hill-climbing); default: regression
    expect_tools:
      - query_metrics
      - get_alerts
    expect_tools_ordered:
      - query_metrics
      - get_alerts
    expect_output_contains:
      - "slow query"
    forbidden_tools:
      - execute_sql
    expect_tool_args:              # tool argument assertions (white-box evaluation)
      query_metrics:
        time_range: "1h"           # exact match: the actual call must include this argument with an equal value
        metric: "p99_latency"      # arguments not listed here are not checked
    agent: openclaw_agent
    timeout_s: 30
    tags: [mysql, sre]
    description: "Basic check for MySQL slow-query troubleshooting"
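
To make the assertion fields concrete, here is a minimal Python sketch of how one case could be checked against the tool calls observed for a session. It is one plausible reading of the fields above, not the code openclaw-edd runs. In particular, it assumes expect_tools_ordered allows other calls in between, and check_case, is_subsequence, and the {"name", "args"} call shape are hypothetical names chosen for the illustration.

```python
from typing import Any


def is_subsequence(expected: list[str], actual: list[str]) -> bool:
    """True if `expected` appears in `actual` in the same order (gaps allowed)."""
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the first match,
    # so later names can only match later positions.
    return all(name in remaining for name in expected)


def check_case(case: dict[str, Any], calls: list[dict[str, Any]], output: str) -> list[str]:
    """Return failure messages for one case; an empty list means it passed."""
    failures: list[str] = []
    called = [c["name"] for c in calls]

    for tool in case.get("expect_tools", []):
        if tool not in called:
            failures.append(f"missing tool: {tool}")

    ordered = case.get("expect_tools_ordered", [])
    if ordered and not is_subsequence(ordered, called):
        failures.append(f"wrong order: expected {ordered}, got {called}")

    for tool in case.get("forbidden_tools", []):
        if tool in called:
            failures.append(f"forbidden tool called: {tool}")

    for text in case.get("expect_output_contains", []):
        if text not in output:
            failures.append(f"output missing: {text!r}")

    for tool, expected_args in case.get("expect_tool_args", {}).items():
        # Pass if at least one call to `tool` carries every listed argument
        # with an equal value; arguments that are not listed are ignored.
        matched = any(
            c["name"] == tool
            and all(k in c.get("args", {}) and c["args"][k] == v
                    for k, v in expected_args.items())
            for c in calls
        )
        if not matched:
            failures.append(f"args mismatch for {tool}: expected {expected_args}")

    return failures


case = {
    "expect_tools": ["query_metrics", "get_alerts"],
    "expect_tools_ordered": ["query_metrics", "get_alerts"],
    "forbidden_tools": ["execute_sql"],
    "expect_output_contains": ["slow query"],
    "expect_tool_args": {"query_metrics": {"time_range": "1h"}},
}
calls = [
    {"name": "query_metrics", "args": {"time_range": "1h", "metric": "p99_latency"}},
    {"name": "get_alerts", "args": {}},
]
print(check_case(case, calls, "Detected a slow query"))
```

Returning a list of messages instead of a single boolean keeps every failed assertion visible, which is the kind of detail a suggestion step would need.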

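Eval types

  • regression: guards against regressions. The pass rate starts near 100%, and any drop is an alarm signal.
  • capability: tracks capability growth. The pass rate starts low; use it to test things the agent cannot do yet.

The run report groups results by eval_type:

📊 Regression Eval (guard against regressions)
Passed: 8/10  (80%)  ← below 100%, needs attention
FAIL: mysql_slow_query, mysql_alert_check

📈 Capability Eval (capability hill-climbing)
Passed: 3/8  (37.5%)  ← expected; this is a hill-climbing metric
PASS: mysql_basic_query ...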
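
The grouping can be sketched as follows. This is an illustration only: the per-case result shape ({"id", "eval_type", "passed"}) and the summarize helper are assumptions, not the tool's actual report structure. A missing eval_type falls back to regression, matching the documented default.

```python
from collections import defaultdict


def summarize(results: list[dict]) -> dict[str, dict]:
    """Group hypothetical per-case results by eval_type and compute pass rates."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for result in results:
        groups[result.get("eval_type", "regression")].append(result)

    summary = {}
    for eval_type, items in groups.items():
        failed = [r["id"] for r in items if not r["passed"]]
        passed = len(items) - len(failed)
        summary[eval_type] = {
            "passed": passed,
            "total": len(items),
            "rate": passed / len(items),
            "failed_ids": failed,
            # Only regression failures are treated as alarms; a low
            # capability pass rate is expected while the agent improves.
            "needs_attention": eval_type == "regression" and bool(failed),
        }
    return summary


results = [
    {"id": "mysql_slow_query", "eval_type": "regression", "passed": False},
    {"id": "mysql_basic_query", "eval_type": "capability", "passed": True},
    {"id": "mysql_alert_check", "passed": True},  # no eval_type: counts as regression
]
for eval_type, stats in summarize(results).items():
    print(eval_type, f'{stats["passed"]}/{stats["total"]}', stats["failed_ids"])
```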

Golden dataset format

{
  "id": "50a359b5_1",
  "description": "Extracted from session 50a359b5, 2026-02-28",
  "source": "mined",
  "tags": ["mined"],
  "conversation": [
    {
      "turn": 1,
      "user": "Has MySQL had any slow queries recently?",
      "golden_tool_sequence": [
        {
          "name": "query_metrics",
          "args": {"metric": "p99_latency", "time_range": "1h"},
          "output_summary": "P99 latency is 120ms, above the threshold"
        }
      ],
      "golden_output": "Detected a MySQL slow query, P99 latency 120ms",
      "assert": [
        {"type": "tool_called", "value": "query_metrics"},
        {"type": "tool_args", "tool": "query_metrics", "args": {"metric": "p99_latency", "time_range": "1h"}},
        {"type": "contains", "value": "slow query"}
      ]
    }
  ],
  "metadata": {
    "session_id": "50a359b5-184f-4c73-913d-3b53ebbdf109",
    "agent": "openclaw_agent",
    "extracted_at": "2026-02-28T16:00:00",
    "skill_triggered": "skills/mysql_sre.md"
  }
}
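
The assert entries line up closely with the fields of the case format, so a golden record can be turned back into run cases. The sketch below shows one way to do that. The mapping (tool_called to expect_tools, tool_args to expect_tool_args, contains to expect_output_contains) is an assumption drawn from the field names, and golden_to_cases and parse_jsonl are hypothetical helpers, not part of openclaw-edd. It also assumes the exported file holds one record per line, as JSONL normally does.

```python
import json
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def golden_to_cases(record: dict) -> list[dict]:
    """Build one case per conversation turn from a golden record."""
    cases = []
    for turn in record["conversation"]:
        case: dict = {
            "id": f'{record["id"]}_t{turn["turn"]}',
            "message": turn["user"],
            "tags": list(record.get("tags", [])),
        }
        for item in turn.get("assert", []):
            if item["type"] == "tool_called":
                case.setdefault("expect_tools", []).append(item["value"])
            elif item["type"] == "tool_args":
                case.setdefault("expect_tool_args", {})[item["tool"]] = item["args"]
            elif item["type"] == "contains":
                case.setdefault("expect_output_contains", []).append(item["value"])
            # Unknown assertion types are skipped rather than guessed at.
        cases.append(case)
    return cases


line = json.dumps({
    "id": "50a359b5_1",
    "tags": ["mined"],
    "conversation": [{
        "turn": 1,
        "user": "Has MySQL had any slow queries recently?",
        "assert": [
            {"type": "tool_called", "value": "query_metrics"},
            {"type": "tool_args", "tool": "query_metrics", "args": {"time_range": "1h"}},
            {"type": "contains", "value": "slow query"},
        ],
    }],
})
for record in parse_jsonl([line]):
    print(golden_to_cases(record))
```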

Data sources

  • Log location: /tmp/openclaw/openclaw-YYYY-MM-DD.log
  • Format: JSON Lines
  • State: ~/.openclaw_eval/state/<session_id>.json
  • Artifacts: ~/.openclaw_eval/artifacts/<session_id>/
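
Because the log is plain JSON Lines under a dated file name, it can be read with the standard library alone. The sketch below builds the path for a given day and parses records defensively. The helper names are hypothetical, and nothing is assumed about the fields inside each record, since the log schema is not described here.

```python
import json
from datetime import date
from pathlib import Path
from typing import Iterable, Iterator

LOG_DIR = Path("/tmp/openclaw")


def log_path(day: date, log_dir: Path = LOG_DIR) -> Path:
    """Path of the log file for one day: openclaw-YYYY-MM-DD.log."""
    return log_dir / f"openclaw-{day.isoformat()}.log"


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per valid JSON object line, skipping everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A line that is still being written can be truncated; skip it.
            continue
        if isinstance(record, dict):
            yield record


path = log_path(date.today())
if path.exists():
    with path.open(encoding="utf-8") as handle:
        print(sum(1 for _ in iter_records(handle)), "records in", path)
```

Taking an iterable of lines rather than a path lets the same parser serve both an open file and lines received while following a growing log.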

Workspace path resolution

Priority:

  1. The --workspace argument
  2. The agents.defaults.workspace key in ~/.openclaw/openclaw.json
  3. Fallback: ~/.openclaw/workspace
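
The priority order can be sketched as below. This is a hedged illustration rather than the tool's own code: it assumes the second entry refers to a nested agents → defaults → workspace key inside ~/.openclaw/openclaw.json, and load_config and resolve_workspace are hypothetical names.

```python
import json
from pathlib import Path
from typing import Optional


def load_config(path: Path) -> dict:
    """Read a JSON config file; return {} if it is missing or unreadable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def resolve_workspace(cli_workspace: Optional[str], config: dict,
                      home: Optional[Path] = None) -> Path:
    """Pick the workspace: CLI argument, then config key, then the fallback."""
    home = home or Path.home()

    # 1. An explicit --workspace value always wins.
    if cli_workspace:
        return Path(cli_workspace).expanduser()

    # 2. agents.defaults.workspace from the config (assumed nesting).
    try:
        value = config["agents"]["defaults"]["workspace"]
    except (KeyError, TypeError):
        value = None
    if isinstance(value, str) and value:
        return Path(value).expanduser()

    # 3. Fallback location.
    return home / ".openclaw" / "workspace"


config = load_config(Path.home() / ".openclaw" / "openclaw.json")
print(resolve_workspace(None, config))
```

Treating a missing or malformed config file the same as an absent key means the lookup always ends at the fallback instead of raising.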

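Dependencies

  • No required dependencies - core features need no external libraries
  • Optional dependencies:
    • PyYAML (only needed when using --cases)
    • anthropic (only needed when using edd judge)
pip install openclaw-edd[yaml]  # install YAML support
pip install openclaw-edd         # includes the anthropic SDK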

Platform support

  • Linux/macOS - fully supported (including daemon mode)
  • Windows - everything except daemon mode is supported

CI integration

# Run the evaluation
edd run --cases cases.yaml --output-json report.json

# Check the exit code
if [ $? -ne 0 ]; then
  echo "Evaluation failed"
  exit 1
fi

License

MIT
