
Agentic TestOps for Python projects: automated pytest execution, failure diagnosis, and repair advice.


Agentic TestOps



Agentic TestOps is a runnable TestOps assistant for Python repositories. It turns a failing test run into a structured engineering report: execute pytest, parse failures, classify likely root causes, rerun failing tests, and produce repair-oriented Markdown/JSON output that can be reviewed by a human or passed to a future code-fixing agent.

The project focuses on the "implementation -> verification -> diagnosis -> improvement" loop for Python codebases, using real tools instead of a slide-only demo.

10-Second Demo

agentic-testops audit examples/service_health --rerun-failures --suggest-fixes
Raw failure        Structured diagnosis   Patch proposal target
FileNotFoundError  filesystem-boundary    service_health.py:9
AttributeError     object-interface       service_health.py:15
NameError          symbol-resolution      service_health.py:20

See the demo walkthrough, Markdown report, and machine-readable JSON.

Why This Project

Modern AI coding workflows often stop at code generation. Real systems need a feedback loop:

  1. Run the project's tests with the same command a developer would use.
  2. Extract failure signals from noisy tool output.
  3. Diagnose whether the issue is behavior, dependency, API contract, data shape, or input validation.
  4. Generate a report with evidence and concrete repair advice.
  5. Feed the result into the next debugging or patching step.

Agentic TestOps implements the first working slice of that loop, with deterministic behavior that can run in CI without an API key.
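
As a rough illustration of step 2 (extracting failure signals), the sketch below pulls failing test cases out of a JUnit XML document of the kind pytest writes with --junitxml. It is not the project's parser: ParsedFailure and parse_junit_failures are hypothetical names, and the node-ID reconstruction assumes module-level test functions (tests inside classes would need extra handling).

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ParsedFailure:
    """Hypothetical record for one failing test case."""
    node_id: str
    kind: str  # "failure" or "error", taken from the XML child tag
    message: str


def parse_junit_failures(xml_text: str) -> list[ParsedFailure]:
    """Collect every <testcase> that has a <failure> or <error> child."""
    root = ET.fromstring(xml_text)
    found: list[ParsedFailure] = []
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            # Assumption: classname is a dotted module path such as
            # "tests.test_parser", so it maps directly onto a file path.
            module_path = case.get("classname", "").replace(".", "/") + ".py"
            found.append(
                ParsedFailure(
                    node_id=f"{module_path}::{case.get('name', '')}",
                    kind=kind,
                    message=node.get("message", ""),
                )
            )
    return found


SAMPLE_XML = """<testsuites><testsuite name="pytest" tests="2" failures="1">
<testcase classname="test_calculator" name="test_divide_rejects_zero">
<failure message="ZeroDivisionError: division by zero">traceback text</failure>
</testcase>
<testcase classname="test_calculator" name="test_add"/>
</testsuite></testsuites>"""

failures = parse_junit_failures(SAMPLE_XML)
for item in failures:
    print(item.node_id, "->", item.message)
```

The node IDs produced this way have the same file::test shape that pytest accepts on the command line, which is what a focused rerun needs.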

Current Features

  • agentic-testops audit <project> CLI.
  • Runs python -m pytest --tb=short -q in the target project.
  • Parses pytest failures from JUnit XML first, with text-output parsing as a fallback.
  • Optionally reruns only parsed failing node IDs with --rerun-failures.
  • Detects flaky failures with --detect-flaky N: each failing test is rerun N extra times and classified as flaky (unstable) or consistent (reproducible), so patch automation can skip unstable targets.
  • Converts pytest timeouts into structured reports instead of crashing.
  • Preserves user-supplied pytest arguments during focused reruns.
  • Diagnoses common Python failure classes:
    • assertion or behavioral regression
    • dependency/import failure
    • API contract mismatch
    • data shape issue
    • filesystem boundary issue
    • input validation boundary bug
    • object interface mismatch
    • symbol resolution error
    • collection/environment failure
  • Writes a professional Markdown report.
  • Writes machine-readable JSON for later agent orchestration.
  • Generates patch proposal objects with target file, suspected line, action, rationale, confidence, and guardrail tests.
  • Uses import-aware AST lookup to localize API-contract patch targets before falling back to a conservative project scan.
  • Generates conservative dry-run unified diff suggestions with --suggest-fixes or --fix-output; the service health demo patch applies cleanly to a temporary copy and makes its tests pass.
  • Optional LLM analysis layer with --llm-explain: the structured failure evidence is sent to an LLM for advisory root-cause explanations rendered alongside the deterministic diagnosis. Works with the Anthropic API and any OpenAI-compatible endpoint (OpenAI, DeepSeek, Qwen, Zhipu, Moonshot, local Ollama/vLLM) via --llm-provider and --llm-base-url. Without an API key the audit runs unchanged and prints a skip notice. No extra dependencies.
  • Ships as a reusable GitHub Action for CI report generation.
  • Includes four deliberately failing example projects, including a deterministic shared-state flake.
  • Includes unit tests, ruff linting, strict mypy type checking, and GitHub Actions CI.
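
To make the rule-based diagnosis concrete, here is a minimal sketch of how an exception type could be mapped to a category label. The four labels come from the outputs shown in this README; the rule table itself, the classify function, and the "unclassified" fallback are assumptions for illustration, and the real rules are likely richer than a lookup on the exception name.

```python
# Hypothetical rule table: exception type -> category label.
# The labels are the ones that appear in this README's sample output.
RULES = {
    "FileNotFoundError": "filesystem-boundary",
    "AttributeError": "object-interface",
    "NameError": "symbol-resolution",
    "ZeroDivisionError": "input-validation",
}


def classify(failure_message: str) -> str:
    """Return a category for a message shaped like 'ExcType: details'."""
    exc_type = failure_message.split(":", 1)[0].strip()
    # "unclassified" is a placeholder fallback, not a label the tool emits.
    return RULES.get(exc_type, "unclassified")


for message in (
    "FileNotFoundError: [Errno 2] No such file or directory: 'health.json'",
    "ZeroDivisionError: division by zero",
    "AssertionError: assert 3 == 4",
):
    print(classify(message), "<-", message)
```

Keeping the rules in plain data makes the behavior deterministic and easy to test, which matches the no-API-key goal stated above.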

Real-World Evaluation

The tool is evaluated against historical bugs replayed from real open-source projects (more-itertools, tabulate, boltons) using a SWE-bench style "revert source, keep tests" procedure, with results compared against the files and lines the upstream fixes actually changed. The findings — including where the tool fails — are documented in docs/real-world-evaluation.md and reproducible via scripts/evaluate_real_world.py.

Quick Start

Install from PyPI:

pip install agentic-testops
agentic-testops audit path/to/your/project

Or work from a clone:

python -m pip install -e ".[dev]"
python -m pytest
agentic-testops audit examples/buggy_calculator \
  --rerun-failures \
  --suggest-fixes \
  -o reports/buggy-calculator-report.md \
  --json-output reports/buggy-calculator-report.json \
  --fix-output reports/buggy-calculator-fixes.patch

To pass extra pytest arguments, repeat --pytest-arg:

agentic-testops audit . --pytest-arg tests/test_parser.py --pytest-arg=-q
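
The repeated-flag pattern can be sketched with argparse's append action. Whether the CLI actually uses argparse is an assumption; build_parser and the pytest_args destination name are hypothetical. The sketch also shows why the second argument above is written as --pytest-arg=-q: with argparse, a bare -q after the flag would be read as a separate option rather than as the flag's value.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented flag shape."""
    parser = argparse.ArgumentParser(prog="audit-sketch")
    parser.add_argument("project")
    # action="append" collects each occurrence of the flag into a list.
    parser.add_argument("--pytest-arg", action="append", dest="pytest_args")
    return parser


args = build_parser().parse_args(
    [".", "--pytest-arg", "tests/test_parser.py", "--pytest-arg=-q"]
)
extra_args = args.pytest_args or []  # None when the flag was never given

# Base command taken from the feature list; extra arguments go at the end.
command = ["python", "-m", "pytest", "--tb=short", "-q", *extra_args]
print(command)
```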

GitHub Action

- uses: Strangelight-Merser/agentic-testops@main
  with:
    project: "."
    output: reports/agentic-testops-report.md
    json-output: reports/agentic-testops-report.json
    fix-output: reports/agentic-testops-fixes.patch
    rerun-failures: "true"
    suggest-fixes: "true"
    job-summary: "true"

See GitHub Action usage for a complete workflow with job summary output and artifact upload.

The buggy_calculator example project should fail: divide(10, 0) raises ZeroDivisionError while the test expects ValueError, and average([]) also divides by zero. That is intentional; it demonstrates how the tool converts raw pytest output into repair advice.

For a larger demo with multiple failure categories:

agentic-testops audit examples/task_tracker \
  --rerun-failures \
  --suggest-fixes \
  -o reports/task-tracker-report.md \
  --json-output reports/task-tracker-report.json \
  --fix-output reports/task-tracker-fixes.patch

For a service-style demo that covers filesystem, object interface, and symbol resolution failures:

agentic-testops audit examples/service_health \
  --rerun-failures \
  --suggest-fixes \
  -o reports/service-health-report.md \
  --json-output reports/service-health-report.json \
  --fix-output reports/service-health-fixes.patch

For a flakiness demo that separates an unstable shared-state failure from a reproducible bug:

agentic-testops audit examples/flaky_pipeline \
  --detect-flaky 2 \
  -o reports/flaky-pipeline-report.md \
  --json-output reports/flaky-pipeline-report.json

The report's Flakiness Check table classifies test_fetch_rates_includes_eur as flaky (it depends on a cache warm-up side effect) and test_convert_applies_rate_exactly as consistent (a real off-by-one bug). See the sample flaky pipeline report.
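
The flaky/consistent split can be sketched as a small decision over rerun outcomes. This is an illustration under a stated assumption: a test that already failed once is called flaky if any extra rerun passes, and consistent if every rerun fails again. The tool's exact rule may differ, and classify_stability is a hypothetical name.

```python
from typing import Sequence


def classify_stability(rerun_passed: Sequence[bool]) -> str:
    """Classify a test that failed in the initial run.

    rerun_passed holds one entry per extra rerun: True if that rerun passed.
    """
    if not rerun_passed:
        raise ValueError("at least one rerun outcome is required")
    # Assumption: any pass after an initial failure means unstable behavior.
    return "flaky" if any(rerun_passed) else "consistent"


# Two extra reruns per test, as with --detect-flaky 2.
print(classify_stability([True, False]))   # mixed outcomes
print(classify_stability([False, False]))  # fails every time
```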

To add an advisory LLM analysis on top of the deterministic diagnosis, use any provider you like:

# Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
agentic-testops audit examples/buggy_calculator --llm-explain

# OpenAI
export OPENAI_API_KEY=sk-...
agentic-testops audit examples/buggy_calculator --llm-explain

# Any OpenAI-compatible endpoint (DeepSeek, Qwen, Zhipu, Moonshot, ...)
export OPENAI_API_KEY=sk-...
agentic-testops audit examples/buggy_calculator --llm-explain \
  --llm-base-url https://api.deepseek.com --llm-model deepseek-chat

# Local models (Ollama / vLLM), no API key required
agentic-testops audit examples/buggy_calculator --llm-explain \
  --llm-base-url http://localhost:11434/v1 --llm-model qwen3

--llm-provider auto (the default) picks the protocol from whichever API key is set. The LLM section is clearly marked as advisory, never replaces the deterministic output, and the audit degrades gracefully (with a printed notice) when the key is missing or the request fails — so the same command stays CI-safe.
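
One plausible reading of the auto behavior is sketched below. The environment variable names come from the examples above; the pick_provider function, the returned labels, the precedence when both keys are set, and the treatment of a key-less base URL as an OpenAI-compatible endpoint are all assumptions rather than documented behavior.

```python
from typing import Mapping, Optional


def pick_provider(env: Mapping[str, str], base_url: Optional[str] = None) -> Optional[str]:
    """Return a protocol label, or None when the LLM step should be skipped."""
    # Assumption: an explicit base URL implies an OpenAI-compatible server,
    # which covers local Ollama/vLLM endpoints that need no API key.
    if base_url:
        return "openai-compatible"
    # Assumption: Anthropic wins if both keys are set; the README does not say.
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY"):
        return "openai-compatible"
    return None  # caller prints a skip notice and continues the audit


print(pick_provider({"ANTHROPIC_API_KEY": "sk-ant-example"}))
print(pick_provider({}, base_url="http://localhost:11434/v1"))
print(pick_provider({}))
```

Returning None instead of raising keeps the deterministic audit path unchanged when no provider is available.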

Example Output

# Agentic TestOps Audit Report

- Status: **FAIL**
- Parsed failures: `2`

## Agentic Rerun

- Status: **FAIL**
- Command: `python -m pytest --tb=short -q test_calculator.py::test_divide_rejects_zero ...`

## Diagnosis

### 1. `test_calculator.py::test_divide_rejects_zero`

- Category: `input-validation`
- Summary: The implementation likely misses validation for an invalid or boundary input.

Repair advice:
- Define the intended behavior for the boundary input: reject, clamp, or return a neutral value.
- Guard the operation close to the source of the invalid value.
- Document the behavior in a test so future agents preserve it.

## Patch Proposals

### 1. `test_calculator.py::test_divide_rejects_zero`

- Target: `calculator.py:2`
- Action: Add explicit validation for the failing boundary input before the unsafe operation.

## Dry-Run Fix Suggestions

These diffs are review previews only. They are not applied automatically.

```diff
--- a/calculator.py
+++ b/calculator.py
@@ -1,2 +1,4 @@
 def divide(a: float, b: float) -> float:
+    if b == 0:
+        raise ValueError("division by zero")
     return a / b
```
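
A dry-run diff like the one above can be produced without touching the target file by comparing the original source with a proposed version in memory. The sketch below uses the standard library's difflib; it is not necessarily how fixer.py works, and dry_run_diff is a hypothetical helper.

```python
import difflib

ORIGINAL = (
    "def divide(a: float, b: float) -> float:\n"
    "    return a / b\n"
)

PROPOSED = (
    "def divide(a: float, b: float) -> float:\n"
    "    if b == 0:\n"
    '        raise ValueError("division by zero")\n'
    "    return a / b\n"
)


def dry_run_diff(original: str, proposed: str, path: str) -> str:
    """Render a unified diff as text; nothing is written to disk."""
    lines = difflib.unified_diff(
        original.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


diff_text = dry_run_diff(ORIGINAL, PROPOSED, "calculator.py")
print(diff_text)
```

Because the result is ordinary unified-diff text, a reviewer can inspect it or apply it to a temporary copy with standard patch tooling.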

Architecture

Target Python project
        |
        v
Pytest runner
        |
        v
JUnit XML + stdout/stderr capture
        |
        v
Failure parser
        |
        v
Rule-based diagnosis agent
        |
        v
Focused failing-test rerun
        |
        v
Patch proposal planner
        |
        v
Markdown / JSON report writer
        |
        v
Human review or future patch-generation agent

The base pipeline uses deterministic diagnosis rules so it can run without API keys. The optional LLM layer (--llm-explain) sits on top of the structured report as advisory output, so the base system remains reproducible and easy to evaluate.
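
As an illustration of what the patch proposal planner might hand to the report writer, the sketch below models a proposal with the fields named in the feature list (target file, suspected line, action, rationale, confidence, guardrail tests) and serializes it to JSON. The class name, field names, field types, and the rationale and confidence values are assumptions; the real models.py may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PatchProposal:
    """Hypothetical shape for one patch proposal object."""
    target_file: str
    suspected_line: int
    action: str
    rationale: str
    confidence: float  # assumption: a score between 0 and 1
    guardrail_tests: List[str] = field(default_factory=list)


proposal = PatchProposal(
    target_file="calculator.py",
    suspected_line=2,
    action="Add explicit validation for the failing boundary input "
           "before the unsafe operation.",
    rationale="Illustrative text only.",  # made-up value
    confidence=0.8,                        # made-up value
    guardrail_tests=["test_calculator.py::test_divide_rejects_zero"],
)

# asdict() turns the dataclass into plain dicts and lists for json.dumps.
payload = json.dumps(asdict(proposal), indent=2)
print(payload)
```

Plain dataclasses keep the JSON output stable and easy for a later agent to consume.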

Repository Layout

src/agentic_testops/
  cli.py          command-line entry point
  runner.py       pytest execution wrapper
  parser.py       pytest output parser
  diagnoser.py    failure classification and repair advice
  patcher.py      structured patch proposal planner
  fixer.py        conservative dry-run unified diff suggestions
  flake.py        flaky-failure detection through repeated reruns
  llm.py          optional advisory LLM analysis (Anthropic + OpenAI-compatible APIs, stdlib HTTP)
  reporter.py     Markdown and JSON report generation
  models.py       shared dataclasses
examples/
  buggy_calculator/
  flaky_pipeline/
  service_health/
  task_tracker/
docs/
  project-brief.md
  sample-buggy-calculator-report.md
  sample-buggy-calculator-fixes.patch
  sample-service-health-report.md
  sample-service-health-fixes.patch
  sample-task-tracker-report.md
  sample-task-tracker-fixes.patch
tests/
.github/workflows/
  ci.yml
action.yml

Project Status

  • Runnable CLI and reusable GitHub Action are implemented.
  • Markdown, JSON, and dry-run patch artifacts are generated from real pytest runs.
  • JUnit XML parsing is preferred, with conservative text parsing as a fallback.
  • Focused reruns, timeout reports, and portable command rendering are covered by tests.
  • Public examples demonstrate boundary validation, API contract, data shape, empty-state, and shared-state flaky failures.
  • Flakiness detection separates unstable failures from reproducible ones before repair planning.
  • Real-world evaluation replays historical bugs from more-itertools, tabulate, and boltons and documents both hits and misses against upstream fix ground truth.
  • Maintenance files are provided for issues, pull requests, contribution workflow, release checks, and security reporting.

Roadmap

  • Safer AST-backed edit planning for more Python syntax shapes and call patterns.
  • LLM-assisted patch generation building on the explanation layer.
  • GitHub Checks integration that comments summaries on pull requests.
  • Historical project memory for repeated failures and flaky-test signals.
  • Multi-agent roles: runner, triager, patch planner, verifier.
  • Coverage-guided test gap analysis.

Limitations

  • The tool suggests repairs but does not edit target code.
  • Pytest output parsing is intentionally conservative and may miss exotic plugin formats.
  • Diagnosis rules are heuristic; the report is designed to support human review, not replace it.
  • Patch proposals are planning hints, not executable code changes.
  • Dry-run diffs cover only conservative patterns and should be reviewed before use.

License

MIT
