Evaluate log reduction tools against the LogDx-CI corpus (35 real CI-failure cases): pip install plus five lines of Python gives you a score against 12 reference methods.

logdx-ci

Evaluation harness for log reduction tools that target LLM root-cause diagnosis of CI failures. It wraps the LogDx-CI v1.2 corpus (35 real GitHub Actions failure cases with AI-drafted, author-verified ground truth) in a Python API that takes about five minutes to try.

Install

pip install logdx-ci

The corpus and scoring code (~20 MB) are fetched automatically from the LogDx GitHub release on first use and cached at ~/.logdx_ci_cache/repo/. No clone required.

The LLM-based diagnosers (real-debugger-v1/v2/v3) need credentials as well: either the claude CLI on PATH (Haiku / Sonnet) or OPENAI_API_KEY (gpt-5-mini). The default static-signal-recall diagnoser needs neither: it is deterministic, free, and runs in under a second.
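Before picking an LLM-based diagnoser, it can help to check which prerequisites are visible to the current process. The sketch below is not part of logdx-ci: `available_llm_backends` is a hypothetical helper that only looks for the two things named above (a `claude` executable on PATH and a non-empty `OPENAI_API_KEY`). It does not verify that the CLI is logged in or that the key is valid.

```python
import os
import shutil


def available_llm_backends(env=None):
    """Report which LLM prerequisites are visible (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # True if an executable named `claude` is found on the given PATH.
        "claude_cli": shutil.which("claude", path=env.get("PATH")) is not None,
        # True if OPENAI_API_KEY is set to a non-empty value.
        "openai_key": bool(env.get("OPENAI_API_KEY")),
    }


print(available_llm_backends())
```

If both values are False, the default static-signal-recall diagnoser still works, since it needs neither.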

Five-minute tutorial

import logdx_ci

# 1. Define your log reducer
def my_reducer(raw_log: str) -> str:
    """Toy: keep only lines containing 'error'."""
    return "\n".join(
        line for line in raw_log.split("\n")
        if "error" in line.lower()
    )

# 2. Evaluate on the corpus (default = static, no LLM, no API key, <1s)
result = logdx_ci.evaluate(
    reducer=my_reducer,
    # diagnoser defaults to "static-signal-recall"
    # splits defaults to all 6 (= 35 cases)
)

# 3. Inspect
print(result.summary())

Output:

LogDx-CI evaluation result
  diagnoser:           static-signal-recall
  cases evaluated:     35
  critical_signal_recall: 0.7536
  mean reduced chars:  3,053
  elapsed:             0.05 sec
  closest baseline:    tail (0.754, +0.000)

method                                   csr      tokens  note
--------------------------------------------------------------------------------
**YOU**                               0.7536       3,053
raw                                   0.9649           —  +0.211 vs you
rtk-read                              0.9649           —  +0.211 vs you
grep                                  0.8411           —  +0.087 vs you
hybrid-grep-120k-rtk-tail-v3          0.8225           —  +0.069 vs you
hybrid-grep-120k-tail-v2              0.8189           —  +0.065 vs you
llm-summary-v1-gpt-5-mini             0.8104           —  +0.057 vs you
tail                                  0.7536           —  +0.000 vs you
llm-summary-v1-haiku                  0.7009           —  -0.053 vs you
hybrid-grep-4k-rtk-err-cat-v1         0.6810           —  -0.073 vs you
rtk-err-cat                           0.5372           —  -0.216 vs you
rtk-log                               0.1819           —  -0.572 vs you
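In the output above, the toy reducer ties with the tail baseline. A natural next experiment is a reducer that keeps some context around each match and always keeps the end of the log. The sketch below is illustrative only: `context_reducer`, its keywords, and its default window sizes are assumptions, and no score is claimed for it.

```python
def context_reducer(raw_log: str, before: int = 2, after: int = 5,
                    tail: int = 50) -> str:
    """Keep lines near 'error'/'failed' matches plus the last `tail` lines."""
    lines = raw_log.split("\n")
    # Always keep the final `tail` lines of the log.
    keep = set(range(max(0, len(lines) - tail), len(lines)))
    for i, line in enumerate(lines):
        lowered = line.lower()
        if "error" in lowered or "failed" in lowered:
            # Keep a window of context around each matching line.
            keep.update(range(max(0, i - before),
                              min(len(lines), i + after + 1)))
    # Emit kept lines in their original order, without duplicates.
    return "\n".join(lines[i] for i in sorted(keep))
```

Because it has the same `str -> str` shape as `my_reducer`, it can be passed to `logdx_ci.evaluate` as `reducer=context_reducer`.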

Use the real diagnoser

import os
import logdx_ci

# Placeholder value; supply your own key.
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

result = logdx_ci.evaluate(
    reducer=my_reducer,             # reducer defined in the tutorial above
    diagnoser="real-debugger-v2",   # Claude Sonnet 4.6
)

Cost preview (per case, at 2026-05-20 pricing): ~$0.03 for an average reduced context (~20k tokens). Full 35-case eval ≈ $1.05 + your reducer's own cost.
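The same arithmetic extends to the other paid diagnosers. The sketch below multiplies the approximate per-case prices from the Supported diagnosers table in this README by the 35 corpus cases. `estimate_run_cost` is a hypothetical helper, not a logdx-ci function, and the prices are rough figures that exclude your reducer's own cost.

```python
CASES = 35

# Approximate (low, high) USD per case, copied from the diagnoser table.
PER_CASE_USD = {
    "real-debugger-v1": (0.005, 0.005),
    "real-debugger-v2": (0.03, 0.03),
    "real-debugger-v3": (0.006, 0.006),
    "real-agent-v1": (0.10, 0.15),
}


def estimate_run_cost(diagnoser: str, cases: int = CASES):
    """Return the (low, high) estimated USD cost of one full run."""
    low, high = PER_CASE_USD[diagnoser]
    return round(low * cases, 3), round(high * cases, 3)


print(estimate_run_cost("real-debugger-v2"))  # (1.05, 1.05)
print(estimate_run_cost("real-agent-v1"))     # (3.5, 5.25)
```

Cached diagnoses are not charged again, so repeat runs of an unchanged reducer should cost less than these estimates.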

Command-line

# Define your reducer as `reduce` in a Python file:
cat > my_reducer.py << 'EOF'
def reduce(log):
    return log[-2000:]
EOF

# Evaluate
logdx-ci eval --reducer my_reducer.py --diagnoser stub-debugger-v1 --splits v2/dev

Supported diagnosers

| Name | What it measures | Auth | Speed | Cost |
| --- | --- | --- | --- | --- |
| static-signal-recall (default) | Did the reducer preserve required signals? (text-only, no LLM) | none | <1s / 35 cases | $0 |
| stub-debugger-v1 | Smoke test only (deterministic regex stub) | none | <1s / 35 cases | $0 |
| real-debugger-v1 | Did Haiku 4.5 give a correct diagnosis from the reduced context? | claude CLI logged in | ~3s / case | ~$0.005 / case |
| real-debugger-v2 | Did Sonnet 4.6 give a correct diagnosis? | claude CLI logged in | ~10s / case | ~$0.03 / case |
| real-debugger-v3 | Did gpt-5-mini give a correct diagnosis? | OPENAI_API_KEY | ~10s / case | ~$0.006 / case |
| real-agent-v1 | Sonnet 4.6 + 4 tools (grep/read_file/tail/view_log_lines) on raw.log, 5-turn cap | ANTHROPIC_API_KEY or OPENROUTER_API_KEY | ~20s / case | ~$0.10–0.15 / case |

Recommended workflow: prototype with static-signal-recall (free, deterministic, about 50 ms for 35 cases) until your pipeline works. Then spend about $1 on real-debugger-v2 for leaderboard-comparable single-shot scores. If your end use case is an LLM agent (Claude Code / Cursor / Cline / Aider), also run real-agent-v1 ($3-5) for the more realistic agent-loop score.
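This README does not spell out how critical_signal_recall is computed. As a rough mental model only, a text-only recall metric can be sketched as the fraction of required signal strings that survive reduction. The function name, the plain substring matching, and the example signals below are all assumptions for illustration, not the package's scoring code.

```python
def signal_recall(reduced: str, required_signals: list[str]) -> float:
    """Fraction of required signals found verbatim in the reduced text."""
    if not required_signals:
        # Nothing to preserve, so nothing was lost.
        return 1.0
    hits = sum(1 for signal in required_signals if signal in reduced)
    return hits / len(required_signals)


# Hypothetical case: two of the three signals survive the reduction.
reduced = "ModuleNotFoundError: No module named 'numpy'"
signals = ["ModuleNotFoundError", "numpy", "exit code 1"]
print(signal_recall(reduced, signals))
```

A metric of this shape rewards keeping the right lines but does not penalize output size, which is consistent with the raw log scoring highest in the tutorial output and with reduced size being reported separately.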

Caching

By default, diagnosis results are cached at ~/.logdx_ci_cache/diagnosis/ keyed by (diagnoser, case_id, reduced_context_hash). Re-running the same reducer is free.
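To see why a re-run with an unchanged reducer hits the cache while any change to the reducer's output misses it, consider a key built from the three fields above. This is a sketch under stated assumptions: the SHA-256 hash, the file naming, and the `cache_path` helper are hypothetical, and the real on-disk layout may differ.

```python
import hashlib
from pathlib import Path

# Cache directory named in this README; the layout below it is assumed.
CACHE_ROOT = Path.home() / ".logdx_ci_cache" / "diagnosis"


def cache_path(diagnoser: str, case_id: str, reduced_context: str) -> Path:
    """Map (diagnoser, case_id, reduced_context_hash) to a file path."""
    digest = hashlib.sha256(reduced_context.encode("utf-8")).hexdigest()
    # Make the case id safe to use as a file name.
    safe_case = case_id.replace("/", "__")
    return CACHE_ROOT / diagnoser / f"{safe_case}-{digest[:16]}.json"


same = cache_path("real-debugger-v2", "case-001", "error: boom")
again = cache_path("real-debugger-v2", "case-001", "error: boom")
changed = cache_path("real-debugger-v2", "case-001", "error: boom!")
print(same == again, same == changed)  # True False
```

Since the key depends on the reduced text rather than on the reducer's code, two different reducers that emit identical output for a case would share a cache entry under this scheme.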

Citing

@article{qin2026logdx,
  title         = {{LogDx-CI}: Benchmarking Log Reduction Tools
                  for LLM Root-Cause Diagnosis},
  author        = {Qin, Bowen},
  year          = {2026},
  eprint        = {2605.28876},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
}

File details

Details for the file logdx_ci-0.5.0.tar.gz.

File metadata

  • Download URL: logdx_ci-0.5.0.tar.gz
  • Upload date:
  • Size: 26.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for logdx_ci-0.5.0.tar.gz
Algorithm Hash digest
SHA256 4f5bd04dc7daffbb8247c147b20403bd41d66869462488e4a571dbe2b28e8c17
MD5 2f5471b5ec89b0966ff3258d2b3c0e01
BLAKE2b-256 58d14e6db141fb3a1e80cc968752c2fd33405fe40a29264c5146af3ca14dddab

File details

Details for the file logdx_ci-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: logdx_ci-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 28.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for logdx_ci-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b22518665603a7b369e8bd1d88af0b3852ad47ef87554f347f06b7863817c3e8
MD5 5dd960dba6e518b3652d7a482e0bc369
BLAKE2b-256 7c805444dc9fabd6673768e64811030e67df798e9c650b2e56c5f8e34ca6020c
