Agentic TestOps for Python projects: automated pytest execution, failure diagnosis, and repair advice.
Agentic TestOps
Agentic TestOps is a runnable TestOps assistant for Python repositories. It turns a failing test run into a structured engineering report: execute pytest, parse failures, classify likely root causes, rerun failing tests, and produce repair-oriented Markdown/JSON output that can be reviewed by a human or passed to a future code-fixing agent.
The project focuses on the "implementation -> verification -> diagnosis -> improvement" loop for Python codebases, using real tools instead of a slide-only demo.
10-Second Demo
agentic-testops audit examples/service_health --rerun-failures --suggest-fixes
| Raw failure | Structured diagnosis | Patch proposal target |
|---|---|---|
| FileNotFoundError | filesystem-boundary | service_health.py:9 |
| AttributeError | object-interface | service_health.py:15 |
| NameError | symbol-resolution | service_health.py:20 |
See the demo walkthrough, Markdown report, and machine-readable JSON.
Why This Project
Modern AI coding workflows often stop at code generation. Real systems need a feedback loop:
- Run the project's tests with the same command a developer would use.
- Extract failure signals from noisy tool output.
- Diagnose whether the issue is behavior, dependency, API contract, data shape, or input validation.
- Generate a report with evidence and concrete repair advice.
- Feed the result into the next debugging or patching step.
Agentic TestOps implements the first working slice of that loop, with deterministic behavior that can run in CI without an API key.
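As a rough illustration of the first step of that loop (not the project's actual runner), the sketch below runs a command with a time limit and turns a timeout into data instead of an exception. The function name `run_with_timeout` and the result keys are hypothetical; only `subprocess.run` and `subprocess.TimeoutExpired` are real standard-library APIs. Treating every nonzero exit code as "fail" is a simplifying assumption.

```python
import subprocess
import sys


def run_with_timeout(command: list[str], timeout_s: float) -> dict[str, object]:
    """Run a command and describe the outcome as a plain dict (hypothetical shape)."""
    try:
        done = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A timeout becomes a structured result rather than a crash.
        return {"status": "timeout", "returncode": None, "stdout": ""}
    # Simplification: any nonzero exit code is reported as "fail".
    status = "pass" if done.returncode == 0 else "fail"
    return {"status": status, "returncode": done.returncode, "stdout": done.stdout}


quick = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)
slow = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5
)
print(quick["status"], slow["status"])
```

To apply the same idea to a test run, the command would be something like `[sys.executable, "-m", "pytest", "--tb=short", "-q"]`, executed from the target project's directory.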
Current Features
- agentic-testops audit <project> CLI.
- Runs python -m pytest --tb=short -q in the target project.
- Parses pytest failures from JUnit XML first, with text-output parsing as a fallback.
- Optionally reruns only parsed failing node IDs with --rerun-failures.
- Detects flaky failures with --detect-flaky N: each failing test is rerun N extra times and classified as flaky (unstable) or consistent (reproducible), so patch automation can skip unstable targets.
- Converts pytest timeouts into structured reports instead of crashing.
- Preserves user-supplied pytest arguments during focused reruns.
- Diagnoses common Python failure classes:
- assertion or behavioral regression
- dependency/import failure
- API contract mismatch
- data shape issue
- filesystem boundary issue
- input validation boundary bug
- object interface mismatch
- symbol resolution error
- collection/environment failure
- Writes a professional Markdown report.
- Writes machine-readable JSON for later agent orchestration.
- Generates patch proposal objects with target file, suspected line, action, rationale, confidence, and guardrail tests.
- Uses import-aware AST lookup to localize API-contract patch targets before falling back to a conservative project scan.
- Generates conservative dry-run unified diff suggestions with --suggest-fixes or --fix-output; the service health demo patch applies cleanly to a temporary copy and makes its tests pass.
- Optional LLM analysis layer with --llm-explain: the structured failure evidence is sent to an LLM for advisory root-cause explanations rendered alongside the deterministic diagnosis. Works with the Anthropic API and any OpenAI-compatible endpoint (OpenAI, DeepSeek, Qwen, Zhipu, Moonshot, local Ollama/vLLM) via --llm-provider and --llm-base-url. Without an API key the audit runs unchanged and prints a skip notice. No extra dependencies.
- Ships as a reusable GitHub Action for CI report generation.
- Includes four deliberately failing example projects, one of which is a deterministic shared-state flake.
- Includes unit tests, ruff linting, strict mypy type checking, and GitHub Actions CI.
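To make the "JUnit XML first" item above concrete, here is a minimal sketch of pulling failures out of a JUnit-style report with the standard library. It is not the project's parser: the function name `parse_failures`, the returned dict keys, and the sample XML are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the JUnit XML shape; not output from a real run.
SAMPLE_JUNIT = """<testsuites>
  <testsuite name="pytest" tests="2" failures="1" errors="0">
    <testcase classname="test_calculator" name="test_add" time="0.001"/>
    <testcase classname="test_calculator" name="test_divide_rejects_zero" time="0.002">
      <failure message="ZeroDivisionError: division by zero">traceback text</failure>
    </testcase>
  </testsuite>
</testsuites>
"""


def parse_failures(junit_xml: str) -> list[dict[str, str]]:
    """Collect one record per <failure> or <error> child of a <testcase>."""
    root = ET.fromstring(junit_xml)
    records = []
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            records.append(
                {
                    "classname": case.get("classname", ""),
                    "name": case.get("name", ""),
                    "kind": kind,
                    "message": node.get("message", ""),
                }
            )
    return records


failures = parse_failures(SAMPLE_JUNIT)
print(failures)
```

Passing tests have no `failure` or `error` child, so they are skipped without any special handling.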
Real-World Evaluation
The tool is evaluated against historical bugs replayed from real open-source projects (more-itertools, tabulate, boltons) using a SWE-bench style "revert source, keep tests" procedure, with results compared against the files and lines the upstream fixes actually changed. The findings — including where the tool fails — are documented in docs/real-world-evaluation.md and reproducible via scripts/evaluate_real_world.py.
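One way to picture the comparison against upstream fixes is a small scoring helper. This is only a sketch under stated assumptions: the function name `score_localization`, the three outcome labels, and the example path are hypothetical and are not taken from scripts/evaluate_real_world.py.

```python
def score_localization(
    predicted_file: str,
    predicted_line: int,
    upstream_changes: dict[str, set[int]],
) -> str:
    """Compare a proposed patch target with the lines an upstream fix changed."""
    changed_lines = upstream_changes.get(predicted_file)
    if changed_lines is None:
        return "miss"  # the fix touched a different file
    if predicted_line in changed_lines:
        return "line-hit"  # same file and same line
    return "file-hit"  # right file, different line


# Hypothetical ground truth for illustration only.
upstream = {"pkg/module.py": {41, 42}}
print(score_localization("pkg/module.py", 42, upstream))
print(score_localization("pkg/module.py", 10, upstream))
print(score_localization("pkg/other.py", 42, upstream))
```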
Demo Artifacts
- Real-world evaluation
- Project brief
- Demo walkthrough
- Buggy calculator report
- Buggy calculator dry-run fixes
- Task tracker report
- Task tracker dry-run fixes
- Machine-readable task tracker JSON
- Flaky pipeline report
- Machine-readable flaky pipeline JSON
- Service health report
- Service health dry-run fixes
- Machine-readable service health JSON
- GitHub Action usage
Quick Start
Install from PyPI:
pip install agentic-testops
agentic-testops audit path/to/your/project
Or work from a clone:
python -m pip install -e ".[dev]"
python -m pytest
agentic-testops audit examples/buggy_calculator \
--rerun-failures \
--suggest-fixes \
-o reports/buggy-calculator-report.md \
--json-output reports/buggy-calculator-report.json \
--fix-output reports/buggy-calculator-fixes.patch
To pass extra pytest arguments, repeat --pytest-arg:
agentic-testops audit . --pytest-arg tests/test_parser.py --pytest-arg=-q
GitHub Action
- uses: Strangelight-Merser/agentic-testops@main
  with:
    project: "."
    output: reports/agentic-testops-report.md
    json-output: reports/agentic-testops-report.json
    fix-output: reports/agentic-testops-fixes.patch
    rerun-failures: "true"
    suggest-fixes: "true"
    job-summary: "true"
See GitHub Action usage for a complete workflow with job summary output and artifact upload.
The buggy calculator example is expected to fail: divide(10, 0) raises ZeroDivisionError while the test expects ValueError, and average([]) also divides by zero. That is intentional: it demonstrates how the tool converts raw pytest output into repair advice.
For a larger demo with multiple failure categories:
agentic-testops audit examples/task_tracker \
--rerun-failures \
--suggest-fixes \
-o reports/task-tracker-report.md \
--json-output reports/task-tracker-report.json \
--fix-output reports/task-tracker-fixes.patch
For a service-style demo that covers filesystem, object interface, and symbol resolution failures:
agentic-testops audit examples/service_health \
--rerun-failures \
--suggest-fixes \
-o reports/service-health-report.md \
--json-output reports/service-health-report.json \
--fix-output reports/service-health-fixes.patch
For a flakiness demo that separates an unstable shared-state failure from a reproducible bug:
agentic-testops audit examples/flaky_pipeline \
--detect-flaky 2 \
-o reports/flaky-pipeline-report.md \
--json-output reports/flaky-pipeline-report.json
The report's Flakiness Check table classifies test_fetch_rates_includes_eur as flaky (it depends on a cache warm-up side effect) and test_convert_applies_rate_exactly as consistent (a real off-by-one bug). See the sample flaky pipeline report.
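The flaky versus consistent split can be sketched as a small decision rule. The rule used here (any passing rerun marks the failure as flaky) is an assumption about how such a check could work, and `classify_failure` is a hypothetical name, not the tool's API. The `rerun` callable stands in for a focused pytest rerun and returns True when the test passes.

```python
from collections.abc import Callable


def classify_failure(rerun: Callable[[], bool], extra_runs: int) -> str:
    """Rerun a failing test extra_runs times and label the failure."""
    if extra_runs < 1:
        raise ValueError("extra_runs must be at least 1")
    outcomes = [rerun() for _ in range(extra_runs)]
    # Assumed rule: a single pass is enough to call the failure unstable.
    return "flaky" if any(outcomes) else "consistent"


# Simulated reruns: one test fails and then passes, the other always fails.
unstable_results = iter([False, True])
flaky_label = classify_failure(lambda: next(unstable_results), extra_runs=2)
consistent_label = classify_failure(lambda: False, extra_runs=2)
print(flaky_label, consistent_label)
```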
To add an advisory LLM analysis on top of the deterministic diagnosis, use any provider you like:
# Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
agentic-testops audit examples/buggy_calculator --llm-explain
# OpenAI
export OPENAI_API_KEY=sk-...
agentic-testops audit examples/buggy_calculator --llm-explain
# Any OpenAI-compatible endpoint (DeepSeek, Qwen, Zhipu, Moonshot, ...)
export OPENAI_API_KEY=sk-...
agentic-testops audit examples/buggy_calculator --llm-explain \
--llm-base-url https://api.deepseek.com --llm-model deepseek-chat
# Local models (Ollama / vLLM), no API key required
agentic-testops audit examples/buggy_calculator --llm-explain \
--llm-base-url http://localhost:11434/v1 --llm-model qwen3
--llm-provider auto (the default) picks the protocol from whichever API key is set. The LLM section is clearly marked as advisory, never replaces the deterministic output, and the audit degrades gracefully (with a printed notice) when the key is missing or the request fails — so the same command stays CI-safe.
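A possible shape for the auto selection described above is shown below. This is a sketch, not the tool's logic: `pick_provider` is a hypothetical name, and both the precedence when several keys are set and the treatment of a key-less base URL as OpenAI-compatible are assumptions.

```python
from collections.abc import Mapping
from typing import Optional


def pick_provider(env: Mapping[str, str], base_url: Optional[str] = None) -> Optional[str]:
    """Choose a protocol from the available credentials, or None to skip."""
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"  # assumed to win when both keys are present
    if env.get("OPENAI_API_KEY") or base_url:
        # Assumption: a custom base URL (for example a local server)
        # is treated as OpenAI-compatible even without a key.
        return "openai"
    return None  # no credentials: the caller prints a skip notice


print(pick_provider({"OPENAI_API_KEY": "sk-example"}))
print(pick_provider({}, base_url="http://localhost:11434/v1"))
print(pick_provider({}))
```

Returning None rather than raising keeps the no-key path an ordinary outcome, which matches the CI-safe behavior described above.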
Example Output
# Agentic TestOps Audit Report
- Status: **FAIL**
- Parsed failures: `2`
## Agentic Rerun
- Status: **FAIL**
- Command: `python -m pytest --tb=short -q test_calculator.py::test_divide_rejects_zero ...`
## Diagnosis
### 1. `test_calculator.py::test_divide_rejects_zero`
- Category: `input-validation`
- Summary: The implementation likely misses validation for an invalid or boundary input.
Repair advice:
- Define the intended behavior for the boundary input: reject, clamp, or return a neutral value.
- Guard the operation close to the source of the invalid value.
- Document the behavior in a test so future agents preserve it.
## Patch Proposals
### 1. `test_calculator.py::test_divide_rejects_zero`
- Target: `calculator.py:2`
- Action: Add explicit validation for the failing boundary input before the unsafe operation.
## Dry-Run Fix Suggestions
These diffs are review previews only. They are not applied automatically.
```diff
--- a/calculator.py
+++ b/calculator.py
@@ -1,2 +1,4 @@
 def divide(a: float, b: float) -> float:
+    if b == 0:
+        raise ValueError("division by zero")
     return a / b
```
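A preview diff like the one above can be produced without touching the target file. The sketch below uses the standard library's `difflib.unified_diff`; whether the project's fixer works this way is not stated in this README, and `dry_run_diff` is a hypothetical helper name.

```python
import difflib

BEFORE = "def divide(a: float, b: float) -> float:\n    return a / b\n"
AFTER = (
    "def divide(a: float, b: float) -> float:\n"
    "    if b == 0:\n"
    '        raise ValueError("division by zero")\n'
    "    return a / b\n"
)


def dry_run_diff(path: str, before: str, after: str) -> str:
    """Return a unified diff as text; nothing is written to disk."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


preview = dry_run_diff("calculator.py", BEFORE, AFTER)
print(preview)
```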
Architecture
Target Python project
|
v
Pytest runner
|
v
JUnit XML + stdout/stderr capture
|
v
Failure parser
|
v
Rule-based diagnosis agent
|
v
Focused failing-test rerun
|
v
Patch proposal planner
|
v
Markdown / JSON report writer
|
v
Human review or future patch-generation agent
The base pipeline uses deterministic diagnosis rules so it can run without API keys. The optional LLM layer (--llm-explain) sits on top of the structured report and is advisory only, so the base system remains reproducible and easy to evaluate.
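A deterministic diagnosis rule can be as simple as a lookup keyed on the exception type. The sketch below uses only the three pairings shown in the demo table; the `classify` function, the message format it expects, and the "unclassified" fallback label are assumptions, and the real diagnoser likely weighs more evidence than the exception name.

```python
# Pairings taken from the 10-Second Demo table.
CATEGORY_BY_EXCEPTION = {
    "FileNotFoundError": "filesystem-boundary",
    "AttributeError": "object-interface",
    "NameError": "symbol-resolution",
}


def classify(failure_message: str) -> str:
    """Map an 'ExceptionName: detail' message to a diagnosis category."""
    exception_name = failure_message.split(":", 1)[0].strip()
    # "unclassified" is a hypothetical fallback label.
    return CATEGORY_BY_EXCEPTION.get(exception_name, "unclassified")


print(classify("NameError: name 'status' is not defined"))
print(classify("KeyError: 'missing'"))
```

Because the rule is a pure function of the failure text, the same input always produces the same category.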
Repository Layout
src/agentic_testops/
  cli.py        command-line entry point
  runner.py     pytest execution wrapper
  parser.py     pytest output parser
  diagnoser.py  failure classification and repair advice
  patcher.py    structured patch proposal planner
  fixer.py      conservative dry-run unified diff suggestions
  flake.py      flaky-failure detection through repeated reruns
  llm.py        optional advisory LLM analysis (Anthropic + OpenAI-compatible APIs, stdlib HTTP)
  reporter.py   Markdown and JSON report generation
  models.py     shared dataclasses
examples/
  buggy_calculator/
  flaky_pipeline/
  service_health/
  task_tracker/
docs/
  project-brief.md
  sample-buggy-calculator-report.md
  sample-buggy-calculator-fixes.patch
  sample-service-health-report.md
  sample-service-health-fixes.patch
  sample-task-tracker-report.md
  sample-task-tracker-fixes.patch
tests/
.github/workflows/
  ci.yml
action.yml
Project Status
- Runnable CLI and reusable GitHub Action are implemented.
- Markdown, JSON, and dry-run patch artifacts are generated from real pytest runs.
- JUnit XML parsing is preferred, with conservative text parsing as a fallback.
- Focused reruns, timeout reports, and portable command rendering are covered by tests.
- Public examples demonstrate boundary validation, API contract, data shape, empty-state, and shared-state flaky failures.
- Flakiness detection separates unstable failures from reproducible ones before repair planning.
- Real-world evaluation replays historical bugs from more-itertools, tabulate, and boltons and documents both hits and misses against upstream fix ground truth.
- Maintenance files are provided for issues, pull requests, contribution workflow, release checks, and security reporting.
Maintenance
Roadmap
- Safer AST-backed edit planning for more Python syntax shapes and call patterns.
- LLM-assisted patch generation building on the explanation layer.
- GitHub Checks integration that comments summaries on pull requests.
- Historical project memory for repeated failures and flaky-test signals.
- Multi-agent roles: runner, triager, patch planner, verifier.
- Coverage-guided test gap analysis.
Limitations
- The tool suggests repairs but does not edit target code.
- Pytest output parsing is intentionally conservative and may miss exotic plugin formats.
- Diagnosis rules are heuristic; the report is designed to support human review, not replace it.
- Patch proposals are planning hints, not executable code changes.
- Dry-run diffs cover only conservative patterns and should be reviewed before use.
License
MIT