
Evaluate log reduction tools against the LogDx-CI corpus (35 real CI-failure cases): pip install plus 5 lines of Python gives a score against 12 reference methods.


logdx-ci

Evaluation harness for log reduction tools targeting LLM root-cause diagnosis on CI failures. Wraps the LogDx-CI v1.2 corpus (35 real GitHub Actions failure cases, AI-drafted + author-verified ground truth) into a five-minute Python API.


Install

pip install logdx-ci

The corpus and scoring code (~20 MB) are fetched automatically from the LogDx GitHub release on first use and cached at ~/.logdx_ci_cache/repo/. No clone required.
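To confirm that the first-use download has happened, a small check such as the following can help. It is only a sketch: the directory path comes from the paragraph above, while `cache_size_bytes` is a hypothetical helper and nothing is assumed about the files inside the cache.

```python
from pathlib import Path

# Cache location as stated above; its internal layout is not assumed here.
CACHE_DIR = Path.home() / ".logdx_ci_cache" / "repo"


def cache_size_bytes(root: Path) -> int:
    """Total size of regular files under root, or 0 if root does not exist."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    size = cache_size_bytes(CACHE_DIR)
    state = "populated" if size else "empty or missing"
    print(f"{CACHE_DIR}: {state} ({size / 1e6:.1f} MB)")
```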

For the LLM-based diagnosers (real-debugger-v1/v2/v3) you also need either the claude CLI on PATH (Haiku / Sonnet) or OPENAI_API_KEY (gpt-5-mini). The default static-signal-recall diagnoser needs neither: it is deterministic, free, and runs in under a second.
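Before choosing an LLM-based diagnoser, it may be worth checking which of these two prerequisites is present. The sketch below uses only the standard library; `llm_backends_available` is a hypothetical helper, not part of logdx_ci, and it cannot tell whether the claude CLI is actually logged in.

```python
import os
import shutil


def llm_backends_available() -> dict[str, bool]:
    """Report which of the two prerequisites named above are present."""
    return {
        # True if an executable named `claude` is found on PATH.
        "claude_cli": shutil.which("claude") is not None,
        # True if the variable is set to a non-empty value.
        "openai_api_key": bool(os.environ.get("OPENAI_API_KEY")),
    }


if __name__ == "__main__":
    for name, present in llm_backends_available().items():
        print(f"{name}: {'found' if present else 'missing'}")
```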

Five-minute tutorial

import logdx_ci

# 1. Define your log reducer
def my_reducer(raw_log: str) -> str:
    """Toy: keep only lines containing 'error'."""
    return "\n".join(
        line for line in raw_log.split("\n")
        if "error" in line.lower()
    )

# 2. Evaluate on the corpus (default = static, no LLM, no API key, <1s)
result = logdx_ci.evaluate(
    reducer=my_reducer,
    # diagnoser defaults to "static-signal-recall"
    # splits defaults to all 6 (= 35 cases)
)

# 3. Inspect
print(result.summary())

Output:

LogDx-CI evaluation result
  diagnoser:           static-signal-recall
  cases evaluated:     35
  critical_signal_recall: 0.7536
  mean reduced chars:  3,053
  elapsed:             0.05 sec
  closest baseline:    tail (0.754, +0.000)

method                                   csr      tokens  note
--------------------------------------------------------------------------------
**YOU**                               0.7536       3,053
raw                                   0.9649           —  +0.211 vs you
rtk-read                              0.9649           —  +0.211 vs you
grep                                  0.8411           —  +0.087 vs you
hybrid-grep-120k-rtk-tail-v3          0.8225           —  +0.069 vs you
hybrid-grep-120k-tail-v2              0.8189           —  +0.065 vs you
llm-summary-v1-gpt-5-mini             0.8104           —  +0.057 vs you
tail                                  0.7536           —  +0.000 vs you
llm-summary-v1-haiku                  0.7009           —  -0.053 vs you
hybrid-grep-4k-rtk-err-cat-v1         0.6810           —  -0.073 vs you
rtk-err-cat                           0.5372           —  -0.216 vs you
rtk-log                               0.1819           —  -0.572 vs you
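In the table above, the toy reducer ties with tail and trails the grep and hybrid baselines. One possible next step is a reducer that keeps matching lines with some surrounding context and also keeps the end of the log. This is a sketch only: it is not the implementation of any listed baseline, and the `context` and `tail_lines` parameters and their defaults are assumptions.

```python
def context_reducer(raw_log: str, context: int = 2, tail_lines: int = 50) -> str:
    """Keep 'error' lines with `context` lines on each side, plus the last `tail_lines` lines."""
    lines = raw_log.split("\n")
    n = len(lines)

    # Always keep the tail of the log.
    keep = set(range(max(0, n - tail_lines), n))

    # Keep each matching line together with its neighbours.
    for i, line in enumerate(lines):
        if "error" in line.lower():
            keep.update(range(max(0, i - context), min(n, i + context + 1)))

    # Emit kept lines in their original order, without duplicates.
    return "\n".join(lines[i] for i in sorted(keep))
```

Because it takes a string and returns a string, it can be passed as `reducer=context_reducer` in the same way as `my_reducer`.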

Use the real diagnoser

import os

import logdx_ci

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # placeholder; substitute your own key

result = logdx_ci.evaluate(
    reducer=my_reducer,
    diagnoser="real-debugger-v2",   # Claude Sonnet 4.6
)

Cost preview (per case, at 2026-05-20 pricing): ~$0.03 for an average reduced context (~20k tokens). Full 35-case eval ≈ $1.05 + your reducer's own cost.
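The arithmetic behind that estimate can be written out as a small helper. This is a sketch with a hypothetical function name; the per-case price is the figure quoted above, and the `cached_cases` parameter assumes that cached diagnoses cost nothing, as described under Caching.

```python
def estimate_eval_cost(
    n_cases: int = 35,
    cost_per_case_usd: float = 0.03,
    cached_cases: int = 0,
) -> float:
    """Rough diagnoser cost in USD; excludes the reducer's own cost."""
    billable = max(0, n_cases - cached_cases)
    return round(billable * cost_per_case_usd, 2)


if __name__ == "__main__":
    print(estimate_eval_cost())                 # full 35-case run
    print(estimate_eval_cost(cached_cases=35))  # fully cached re-run
```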

Command-line

# Define your reducer as `reduce` in a Python file:
cat > my_reducer.py << 'EOF'
def reduce(log):
    return log[-2000:]
EOF

# Evaluate
logdx-ci eval --reducer my_reducer.py --diagnoser stub-debugger-v1 --splits v2/dev
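Before spending a run on the CLI, the reducer file can be sanity-checked locally. The sketch below loads a file by path and verifies that it defines a callable named `reduce`, as the comment above requires. How logdx-ci itself loads the file is not documented here, so `load_reducer` is a hypothetical helper and only an approximation.

```python
import importlib.util


def load_reducer(path: str):
    """Import the Python file at `path` and return its `reduce` function."""
    spec = importlib.util.spec_from_file_location("user_reducer", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    reduce_fn = getattr(module, "reduce", None)
    if not callable(reduce_fn):
        raise AttributeError(f"{path} does not define a callable `reduce`")
    return reduce_fn
```

A quick call such as `load_reducer("my_reducer.py")("some log text")` should return a string.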

Supported diagnosers

Name                  What it measures                                                   API key               Speed           Cost
--------------------------------------------------------------------------------------------------------------------------------------------
static-signal-recall  Did the reducer preserve required signals? (text-only, no LLM)     none                  <1s / 35 cases  $0
stub-debugger-v1      Smoke test only (deterministic regex stub)                         none                  <1s / 35 cases  $0
real-debugger-v2      Did Sonnet 4.6 give a correct diagnosis from the reduced context?  claude CLI logged in  ~3s / case      ~$0.03 / case
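As an illustration of what a text-only signal check can look like, the sketch below scores a reduced log by the fraction of required signal strings it still contains. The exact definition used by static-signal-recall is not given on this page, so case-insensitive substring matching, the function name, and the example signals are all assumptions.

```python
def signal_recall(reduced: str, required_signals: list[str]) -> float:
    """Fraction of required signals found in the reduced text (case-insensitive)."""
    if not required_signals:
        # Design choice for this sketch: nothing required means nothing was lost.
        return 1.0
    text = reduced.lower()
    hits = sum(1 for signal in required_signals if signal.lower() in text)
    return hits / len(required_signals)


if __name__ == "__main__":
    reduced_log = "ModuleNotFoundError: No module named 'yaml'"
    signals = ["ModuleNotFoundError", "yaml", "exit code 1"]  # hypothetical signals
    print(signal_recall(reduced_log, signals))
```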

Recommended workflow: prototype with static-signal-recall (free, deterministic, 50ms for 35 cases) → confirm pipeline → spend $1 on real-debugger-v2 for leaderboard-comparable diagnosis scores.

V0.2 will add real-debugger-v1 (Haiku), real-debugger-v3 (gpt-5-mini), and real-agent-v1 (Sonnet + 4 tools, 5-turn cap).

Caching

By default, diagnosis results are cached at ~/.logdx_ci_cache/diagnosis/ keyed by (diagnoser, case_id, reduced_context_hash). Re-running the same reducer is free.
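The key structure explains when a re-run is free: any change in the reducer's output changes the hash, so that case is diagnosed again. The sketch below builds a key of that shape. The hash function used by logdx_ci is not stated, so SHA-256 is an assumption, and the case id shown is hypothetical.

```python
import hashlib


def cache_key(diagnoser: str, case_id: str, reduced_context: str) -> tuple[str, str, str]:
    """Build a (diagnoser, case_id, reduced_context_hash) key; SHA-256 is assumed."""
    digest = hashlib.sha256(reduced_context.encode("utf-8")).hexdigest()
    return (diagnoser, case_id, digest)


if __name__ == "__main__":
    first = cache_key("real-debugger-v2", "case-001", "error: build failed")
    again = cache_key("real-debugger-v2", "case-001", "error: build failed")
    changed = cache_key("real-debugger-v2", "case-001", "error: build failed\n")
    print(first == again)    # identical output reuses the cached entry
    print(first == changed)  # any change in output is a cache miss
```

One consequence is that a non-deterministic reducer, for example one that samples lines at random, would rarely hit the cache.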

Citing

@article{qin2026logdx,
  title         = {{LogDx-CI}: Benchmarking Log Reduction Tools
                  for LLM Root-Cause Diagnosis},
  author        = {Qin, Bowen},
  year          = {2026},
  eprint        = {2605.28876},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
}


