Evaluate log reduction tools against the LogDx-CI corpus (35 real CI-failure cases) — pip install + 5-line Python = score vs 12 reference methods.
Project description
logdx-ci
Evaluation harness for log reduction tools targeting LLM root-cause diagnosis on CI failures. Wraps the LogDx-CI v1.2 corpus (35 real GitHub Actions failure cases, AI-drafted + author-verified ground truth) into a five-minute Python API.
Install
pip install logdx-ci
The corpus + scoring code (~20 MB) is auto-fetched from the LogDx
GitHub release on first use and cached at ~/.logdx_ci_cache/repo/.
No clone required.
For the LLM-based diagnosers (real-debugger-v1/v2/v3) you also need
either the claude CLI on PATH (Haiku / Sonnet) or OPENAI_API_KEY
(gpt-5-mini). The default static-signal-recall diagnoser needs
neither: it is deterministic, free, and runs in under a second.
Five-minute tutorial
import logdx_ci

# 1. Define your log reducer
def my_reducer(raw_log: str) -> str:
    """Toy: keep only lines containing 'error'."""
    return "\n".join(
        line for line in raw_log.split("\n")
        if "error" in line.lower()
    )

# 2. Evaluate on the corpus (default = static, no LLM, no API key, <1s)
result = logdx_ci.evaluate(
    reducer=my_reducer,
    # diagnoser defaults to "static-signal-recall"
    # splits defaults to all 6 (= 35 cases)
)

# 3. Inspect
print(result.summary())
Output:
LogDx-CI evaluation result
diagnoser: static-signal-recall
cases evaluated: 35
critical_signal_recall: 0.7536
mean reduced chars: 3,053
elapsed: 0.05 sec
closest baseline: tail (0.754, +0.000)
method csr tokens note
--------------------------------------------------------------------------------
**YOU** 0.7536 3,053
raw 0.9649 — +0.211 vs you
rtk-read 0.9649 — +0.211 vs you
grep 0.8411 — +0.087 vs you
hybrid-grep-120k-rtk-tail-v3 0.8225 — +0.069 vs you
hybrid-grep-120k-tail-v2 0.8189 — +0.065 vs you
llm-summary-v1-gpt-5-mini 0.8104 — +0.057 vs you
tail 0.7536 — +0.000 vs you
llm-summary-v1-haiku 0.7009 — -0.053 vs you
hybrid-grep-4k-rtk-err-cat-v1 0.6810 — -0.073 vs you
rtk-err-cat 0.5372 — -0.216 vs you
rtk-log 0.1819 — -0.572 vs you
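The toy reducer above drops every line that does not contain "error", so surrounding context (the failing command, the stack frame above the message) is lost. As a minimal sketch of a slightly less lossy reducer with the same `str -> str` shape, the function below keeps matching lines plus a few neighbours. The name `grep_with_context`, the default pattern, and the context width are illustrative assumptions and are not part of logdx-ci.

```python
import re


def grep_with_context(
    raw_log: str,
    pattern: str = r"error|failed|exception",  # assumed default, tune per project
    context: int = 2,                          # lines kept on each side of a match
) -> str:
    """Keep lines matching `pattern` plus `context` lines before and after."""
    lines = raw_log.split("\n")
    matcher = re.compile(pattern, re.IGNORECASE)
    keep = set()
    for i, line in enumerate(lines):
        if matcher.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    # Emit kept lines in their original order, without duplicates.
    return "\n".join(lines[i] for i in sorted(keep))
```

Because it takes and returns a string, it can be passed as `reducer=grep_with_context` in the same way as `my_reducer`.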
Use the real diagnoser
import os
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

result = logdx_ci.evaluate(
    reducer=my_reducer,
    diagnoser="real-debugger-v2",  # Claude Sonnet 4.6
)
Cost preview (per case, at 2026-05-20 pricing): ~$0.03 for an average reduced context (~20k tokens). Full 35-case eval ≈ $1.05 + your reducer's own cost.
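The arithmetic behind that estimate can be sketched as follows. The per-case figure is the one quoted above; the function name and the `cached_cases` parameter are hypothetical and only illustrate that cases already answered from a cache would not be billed again.

```python
def estimate_eval_cost(
    n_cases: int = 35,
    cost_per_case_usd: float = 0.03,  # figure quoted in the cost preview
    cached_cases: int = 0,            # hypothetical: cases served from cache
) -> float:
    """Rough diagnoser cost in USD, excluding the reducer's own cost."""
    billable = max(0, n_cases - cached_cases)
    return round(billable * cost_per_case_usd, 2)


print(estimate_eval_cost())  # full 35-case run
```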
Command-line
# Define your reducer as `reduce` in a Python file:
cat > my_reducer.py << 'EOF'
def reduce(log):
    return log[-2000:]
EOF

# Evaluate
logdx-ci eval --reducer my_reducer.py --diagnoser stub-debugger-v1 --splits v2/dev
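Before handing a reducer file to the CLI, it can help to check locally that the file really exposes a callable named `reduce` that returns a string. The sketch below does this with the standard library only. It is not the loader that logdx-ci uses internally (that is not documented here); `load_reducer` and `smoke_test` are hypothetical helper names.

```python
import importlib.util
from pathlib import Path


def load_reducer(path: str):
    """Import a Python file by path and return its `reduce` function."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    reduce_fn = getattr(module, "reduce", None)
    if not callable(reduce_fn):
        raise AttributeError(f"{path} does not define a callable `reduce`")
    return reduce_fn


def smoke_test(path: str, sample: str = "step 1 ok\nERROR: build failed\n") -> str:
    """Run the reducer once on a small sample and check the return type."""
    out = load_reducer(path)(sample)
    if not isinstance(out, str):
        raise TypeError("reduce() should return a str")
    return out
```

Calling `smoke_test("my_reducer.py")` on the file from the example above returns the sample unchanged, since the sample is shorter than 2000 characters.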
Supported diagnosers
| Name | What it measures | API key | Speed | Cost |
|---|---|---|---|---|
| static-signal-recall | Did the reducer preserve required signals? (text-only, no LLM) | none | <1s / 35 cases | $0 |
| stub-debugger-v1 | Smoke test only (deterministic regex stub) | none | <1s / 35 cases | $0 |
| real-debugger-v2 | Did Sonnet 4.6 give a correct diagnosis from the reduced context? | claude CLI logged in | ~3s / case | ~$0.03 / case |
Recommended workflow: prototype with static-signal-recall (free,
deterministic, 50ms for 35 cases) → confirm pipeline → spend $1 on
real-debugger-v2 for leaderboard-comparable diagnosis scores.
V0.2 will add real-debugger-v1 (Haiku), real-debugger-v3 (gpt-5-mini),
and real-agent-v1 (Sonnet + 4 tools, 5-turn cap).
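To make "did the reducer preserve required signals?" concrete, here is a minimal sketch of a text-only recall score: the fraction of required signal strings that still appear in the reduced log. This is an assumption about the general shape of such a metric, not the corpus's actual scoring code, which may use different matching rules, weighting, or aggregation. The function name and the example signals are hypothetical.

```python
from typing import Iterable


def signal_recall(reduced: str, required_signals: Iterable[str]) -> float:
    """Fraction of required signals found (case-insensitively) in `reduced`."""
    signals = list(required_signals)
    if not signals:
        return 1.0  # nothing was required, so nothing was lost
    haystack = reduced.lower()
    found = sum(1 for s in signals if s.lower() in haystack)
    return found / len(signals)


# Hypothetical case: two of the three required signals survived reduction.
reduced_log = "ModuleNotFoundError: No module named 'foo'"
score = signal_recall(reduced_log, ["ModuleNotFoundError", "foo", "exit code 1"])
```

A corpus-level number would then be some aggregate of the per-case scores, for example their mean.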
Caching
By default, diagnosis results are cached at ~/.logdx_ci_cache/diagnosis/
keyed by (diagnoser, case_id, reduced_context_hash). Re-running the same
reducer is free.
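One way such a key could map to a file on disk is sketched below. Only the three key components and the cache directory come from the description above; the hash algorithm (SHA-256), the truncation to 16 hex characters, the directory layout, and the `.json` extension are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def cache_path(
    diagnoser: str,
    case_id: str,
    reduced_context: str,
    root: str = "~/.logdx_ci_cache/diagnosis",
) -> Path:
    """Build a deterministic cache file path for one diagnosis result."""
    # Hash the reduced text so any change to the reducer output misses the cache.
    digest = hashlib.sha256(reduced_context.encode("utf-8")).hexdigest()
    safe_case = case_id.replace("/", "_")  # keep case ids usable as file names
    return Path(root).expanduser() / diagnoser / f"{safe_case}-{digest[:16]}.json"
```

Under this scheme an unchanged reducer produces the same hash for every case and reuses all stored results, while changing the reducer output for a case changes its hash and triggers a fresh diagnosis for that case only.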
Citing
@article{qin2026logdx,
  title         = {{LogDx-CI}: Benchmarking Log Reduction Tools
                   for LLM Root-Cause Diagnosis},
  author        = {Qin, Bowen},
  year          = {2026},
  eprint        = {2605.28876},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
}
Download files
File details
Details for the file logdx_ci-0.4.0.tar.gz.
File metadata
- Download URL: logdx_ci-0.4.0.tar.gz
- Upload date:
- Size: 25.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f68e38deba28d5f3a1a8784a377168b278d6ba6d5e80eac7368bfeaf4e4f98e |
| MD5 | 9161c94c5b695b818bd130b54ac81714 |
| BLAKE2b-256 | c644f02253ddb6cdf2e22b20c7a5be3cd3321b1db778e4f438f2ec984689ab8b |
File details
Details for the file logdx_ci-0.4.0-py3-none-any.whl.
File metadata
- Download URL: logdx_ci-0.4.0-py3-none-any.whl
- Upload date:
- Size: 27.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3084f87a7c2ed385c8ed31cdca667e9cbb5a315520edf5383841fa53ed9ba142 |
| MD5 | a93a75ee0372e2ec140491ee074934d0 |
| BLAKE2b-256 | d17aec277455c1e8d22bdd228a07544f272db293593ca1579065bf0edb72e54b |