Agentic TestOps for Python projects: automated pytest execution, failure diagnosis, and repair advice.

Maintainers

LinTianyi

Agentic TestOps

Agentic TestOps is a runnable TestOps assistant for Python repositories. It turns a failing test run into a structured engineering report: execute pytest, parse failures, classify likely root causes, rerun failing tests, and produce repair-oriented Markdown/JSON output that can be reviewed by a human or passed to a future code-fixing agent.

The project focuses on the "implementation -> verification -> diagnosis -> improvement" loop for Python codebases, using real tools instead of a slide-only demo.

10-Second Demo

agentic-testops audit examples/service_health --rerun-failures --suggest-fixes

| Raw failure | Structured diagnosis | Patch proposal target |
| --- | --- | --- |
| `FileNotFoundError` | `filesystem-boundary` | `service_health.py:9` |
| `AttributeError` | `object-interface` | `service_health.py:15` |
| `NameError` | `symbol-resolution` | `service_health.py:20` |

See the demo walkthrough, Markdown report, and machine-readable JSON.

Why This Project

Modern AI coding workflows often stop at code generation. Real systems need a feedback loop:

1. Run the project's tests with the same command a developer would use.
2. Extract failure signals from noisy tool output.
3. Diagnose whether the issue is behavior, dependency, API contract, data shape, or input validation.
4. Generate a report with evidence and concrete repair advice.
5. Feed the result into the next debugging or patching step.

Agentic TestOps implements the first working slice of that loop, with deterministic behavior that can run in CI without an API key.
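
As a rough illustration of the "extract failure signals" step, the sketch below pulls failing test names and messages out of a pytest-style JUnit XML document using only the standard library. The function name `parse_junit_failures`, the keys of the returned records, and the sample XML are assumptions made for this example; they are not taken from the project's `parser.py`.

```python
import xml.etree.ElementTree as ET


def parse_junit_failures(xml_text: str) -> list[dict[str, str]]:
    """Return one record per <testcase> that contains a <failure> or <error>."""
    root = ET.fromstring(xml_text)
    failures = []
    # iter() walks the whole tree, so this works whether the root element
    # is <testsuites> or a single <testsuite>.
    for case in root.iter("testcase"):
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is None:
            continue  # passed or skipped
        failures.append(
            {
                "classname": case.get("classname", ""),
                "name": case.get("name", ""),
                "kind": problem.tag,
                "message": problem.get("message", ""),
            }
        )
    return failures


# Hand-written sample input, not real tool output.
SAMPLE_XML = """<testsuites>
  <testsuite name="pytest" tests="2" failures="1" errors="0">
    <testcase classname="test_calculator" name="test_add" />
    <testcase classname="test_calculator" name="test_divide_rejects_zero">
      <failure message="ZeroDivisionError: division by zero">traceback text</failure>
    </testcase>
  </testsuite>
</testsuites>"""

for record in parse_junit_failures(SAMPLE_XML):
    print(record["classname"], record["name"], record["message"])
```

Reading structured XML first and falling back to text parsing only when needed keeps the failure list independent of how pytest formats its terminal output.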

Current Features

- agentic-testops audit <project> CLI.
- Runs python -m pytest --tb=short -q in the target project.
- Parses pytest failures from JUnit XML first, with text-output parsing as a fallback.
- Optionally reruns only parsed failing node IDs with --rerun-failures.
- Detects flaky failures with --detect-flaky N: each failing test is rerun N extra times and classified as flaky (unstable) or consistent (reproducible), so patch automation can skip unstable targets.
- Converts pytest timeouts into structured reports instead of crashing.
- Preserves user-supplied pytest arguments during focused reruns.
- Diagnoses common Python failure classes:
  - assertion or behavioral regression
  - dependency/import failure
  - API contract mismatch
  - data shape issue
  - filesystem boundary issue
  - input validation boundary bug
  - object interface mismatch
  - symbol resolution error
  - collection/environment failure
- Writes a professional Markdown report.
- Writes machine-readable JSON for later agent orchestration.
- Generates patch proposal objects with target file, suspected line, action, rationale, confidence, and guardrail tests.
- Uses import-aware AST lookup to localize API-contract patch targets before falling back to a conservative project scan.
- Generates conservative dry-run unified diff suggestions with --suggest-fixes or --fix-output; the service health demo patch applies cleanly to a temporary copy and makes its tests pass.
- Optional LLM analysis layer with --llm-explain: the structured failure evidence is sent to an LLM for advisory root-cause explanations rendered alongside the deterministic diagnosis. Works with the Anthropic API and any OpenAI-compatible endpoint (OpenAI, DeepSeek, Qwen, Zhipu, Moonshot, local Ollama/vLLM) via --llm-provider and --llm-base-url. Without an API key the audit runs unchanged and prints a skip notice. No extra dependencies.
- Ships as a reusable GitHub Action for CI report generation.
- Includes four deliberately failing example projects, including a deterministic shared-state flake.
- Includes unit tests, ruff linting, strict mypy type checking, and GitHub Actions CI.
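
The patch proposal objects listed above could be modelled roughly as follows. This is a minimal sketch: the class name `PatchProposal`, the field names, and the sample values (including the confidence number and the rationale text) are assumptions for illustration. Only the set of fields comes from the feature description.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PatchProposal:
    """Hypothetical shape of one patch proposal; field names are assumed."""

    target_file: str
    suspected_line: int
    action: str
    rationale: str
    confidence: float
    guardrail_tests: list[str] = field(default_factory=list)


proposal = PatchProposal(
    target_file="calculator.py",
    suspected_line=2,
    action="Add explicit validation for the failing boundary input before the unsafe operation.",
    rationale="Illustrative rationale text.",
    confidence=0.5,  # illustrative value only
    guardrail_tests=["test_calculator.py::test_divide_rejects_zero"],
)

# asdict() converts the dataclass into plain dicts and lists that json can serialize.
print(json.dumps(asdict(proposal), indent=2))
```

Keeping the proposal as plain data rather than an applied edit matches the tool's stated position that proposals are planning hints for a reviewer or a later agent.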

Real-World Evaluation

The tool is evaluated against historical bugs replayed from real open-source projects (more-itertools, tabulate, boltons) using a SWE-bench style "revert source, keep tests" procedure, with results compared against the files and lines the upstream fixes actually changed. The findings — including where the tool fails — are documented in docs/real-world-evaluation.md and reproducible via scripts/evaluate_real_world.py.
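
One way to picture the comparison against upstream fixes is a small scoring helper that checks a predicted patch target against the files and line ranges a fix changed. The sketch below is hypothetical: the function name `score_localization`, the three outcome labels, and the sample data are assumptions for this example and do not describe the logic of scripts/evaluate_real_world.py.

```python
def score_localization(
    predicted_file: str,
    predicted_line: int,
    changed: dict[str, list[tuple[int, int]]],
) -> str:
    """Compare a predicted target with ground-truth changed line ranges.

    `changed` maps each file touched by the upstream fix to a list of
    inclusive (start, end) line ranges.
    """
    ranges = changed.get(predicted_file)
    if ranges is None:
        return "miss"  # the fix did not touch the predicted file
    if any(start <= predicted_line <= end for start, end in ranges):
        return "line-hit"  # right file and inside a changed range
    return "file-hit"  # right file, wrong line


# Made-up ground truth for illustration only.
ground_truth = {"pkg/core.py": [(40, 44), (90, 90)]}

print(score_localization("pkg/core.py", 42, ground_truth))
print(score_localization("pkg/core.py", 10, ground_truth))
print(score_localization("pkg/other.py", 42, ground_truth))
```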

Quick Start

Install from PyPI:

pip install agentic-testops
agentic-testops audit path/to/your/project

Or work from a clone:

python -m pip install -e ".[dev]"
python -m pytest
agentic-testops audit examples/buggy_calculator \
  --rerun-failures \
  --suggest-fixes \
  -o reports/buggy-calculator-report.md \
  --json-output reports/buggy-calculator-report.json \
  --fix-output reports/buggy-calculator-fixes.patch

To pass extra pytest arguments, repeat --pytest-arg:

agentic-testops audit . --pytest-arg tests/test_parser.py --pytest-arg=-q
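
A repeatable flag like this is commonly collected into a list and appended to the base pytest command. The sketch below shows that pattern with `argparse`; whether the project's CLI is built this way is an assumption, and `build_parser` and `build_pytest_command` are hypothetical names. It also shows why the `--pytest-arg=-q` form is useful: with `argparse`, a value that starts with a dash has to be attached with `=`, otherwise it is read as a separate option.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="audit-sketch")
    parser.add_argument("project")
    # action="append" collects every occurrence of the flag into one list.
    parser.add_argument("--pytest-arg", action="append", default=[], dest="pytest_args")
    return parser


def build_pytest_command(extra_args: list[str]) -> list[str]:
    # Base command from the feature list, followed by the user's arguments.
    return [sys.executable, "-m", "pytest", "--tb=short", "-q", *extra_args]


args = build_parser().parse_args(
    [".", "--pytest-arg", "tests/test_parser.py", "--pytest-arg=-q"]
)
print(build_pytest_command(args.pytest_args))
```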

GitHub Action

- uses: Strangelight-Merser/agentic-testops@main
  with:
    project: "."
    output: reports/agentic-testops-report.md
    json-output: reports/agentic-testops-report.json
    fix-output: reports/agentic-testops-fixes.patch
    rerun-failures: "true"
    suggest-fixes: "true"
    job-summary: "true"

See GitHub Action usage for a complete workflow with job summary output and artifact upload.

The buggy_calculator example project from Quick Start is expected to fail: divide(10, 0) raises ZeroDivisionError while the test expects ValueError, and average([]) also divides by zero. That is intentional. It demonstrates how the tool converts raw pytest output into repair advice.

For a larger demo with multiple failure categories:

agentic-testops audit examples/task_tracker \
  --rerun-failures \
  --suggest-fixes \
  -o reports/task-tracker-report.md \
  --json-output reports/task-tracker-report.json \
  --fix-output reports/task-tracker-fixes.patch

For a service-style demo that covers filesystem, object interface, and symbol resolution failures:

agentic-testops audit examples/service_health \
  --rerun-failures \
  --suggest-fixes \
  -o reports/service-health-report.md \
  --json-output reports/service-health-report.json \
  --fix-output reports/service-health-fixes.patch

For a flakiness demo that separates an unstable shared-state failure from a reproducible bug:

agentic-testops audit examples/flaky_pipeline \
  --detect-flaky 2 \
  -o reports/flaky-pipeline-report.md \
  --json-output reports/flaky-pipeline-report.json

The report's Flakiness Check table classifies test_fetch_rates_includes_eur as flaky (it depends on a cache warm-up side effect) and test_convert_applies_rate_exactly as consistent (a real off-by-one bug). See the sample flaky pipeline report.
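
The flaky versus consistent split can be sketched as a small classification rule over rerun outcomes. The rule used here (any passing rerun means flaky) is an assumption, as are the function name `classify_failure` and the stand-in test functions; the project's `flake.py` may decide differently.

```python
from typing import Callable


def classify_failure(run_test: Callable[[], bool], extra_runs: int) -> str:
    """Rerun a failing test `extra_runs` times and label it.

    `run_test` returns True when the test passes. Assumed rule: a single
    passing rerun is enough to call the failure flaky.
    """
    outcomes = [run_test() for _ in range(extra_runs)]
    return "flaky" if any(outcomes) else "consistent"


# Stand-ins for real test runs: one fails once and then passes, as a test
# that depends on a warm cache might; the other always fails.
calls = {"count": 0}


def cache_dependent_test() -> bool:
    calls["count"] += 1
    return calls["count"] > 1


def always_failing_test() -> bool:
    return False


print(classify_failure(cache_dependent_test, extra_runs=2))
print(classify_failure(always_failing_test, extra_runs=2))
```

Separating the two labels before repair planning means a patch is only proposed for failures that reproduce on every rerun.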

To add an advisory LLM analysis on top of the deterministic diagnosis, use any provider you like:

# Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
agentic-testops audit examples/buggy_calculator --llm-explain

# OpenAI
export OPENAI_API_KEY=sk-...
agentic-testops audit examples/buggy_calculator --llm-explain

# Any OpenAI-compatible endpoint (DeepSeek, Qwen, Zhipu, Moonshot, ...)
export OPENAI_API_KEY=sk-...
agentic-testops audit examples/buggy_calculator --llm-explain \
  --llm-base-url https://api.deepseek.com --llm-model deepseek-chat

# Local models (Ollama / vLLM), no API key required
agentic-testops audit examples/buggy_calculator --llm-explain \
  --llm-base-url http://localhost:11434/v1 --llm-model qwen3

--llm-provider auto (the default) picks the protocol from whichever API key is set. The LLM section is clearly marked as advisory, never replaces the deterministic output, and the audit degrades gracefully (with a printed notice) when the key is missing or the request fails — so the same command stays CI-safe.
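
The auto-selection behavior described above might look something like the sketch below. It is hypothetical: the function name `pick_provider`, the precedence when both keys are set, and the rule that a custom base URL implies the OpenAI-compatible protocol are all assumptions, not the project's documented logic.

```python
from typing import Mapping, Optional


def pick_provider(env: Mapping[str, str], base_url: Optional[str] = None) -> Optional[str]:
    """Return "anthropic", "openai", or None when the LLM step should be skipped."""
    if base_url:
        # Assumption: a custom endpoint speaks the OpenAI-compatible protocol.
        # Local servers may not need a key at all.
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"  # assumed precedence when both keys are set
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None  # no key: skip the LLM step and keep the deterministic report


print(pick_provider({"ANTHROPIC_API_KEY": "sk-ant-example"}))
print(pick_provider({"OPENAI_API_KEY": "sk-example"}))
print(pick_provider({}, base_url="http://localhost:11434/v1"))
print(pick_provider({}))
```

Returning `None` instead of raising is what lets the same command run in CI without credentials.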

Example Output

# Agentic TestOps Audit Report

- Status: **FAIL**
- Parsed failures: `2`

## Agentic Rerun

- Status: **FAIL**
- Command: `python -m pytest --tb=short -q test_calculator.py::test_divide_rejects_zero ...`

## Diagnosis

### 1. `test_calculator.py::test_divide_rejects_zero`

- Category: `input-validation`
- Summary: The implementation likely misses validation for an invalid or boundary input.

Repair advice:
- Define the intended behavior for the boundary input: reject, clamp, or return a neutral value.
- Guard the operation close to the source of the invalid value.
- Document the behavior in a test so future agents preserve it.

## Patch Proposals

### 1. `test_calculator.py::test_divide_rejects_zero`

- Target: `calculator.py:2`
- Action: Add explicit validation for the failing boundary input before the unsafe operation.

## Dry-Run Fix Suggestions

These diffs are review previews only. They are not applied automatically.

```diff
--- a/calculator.py
+++ b/calculator.py
@@ -1,2 +1,4 @@
 def divide(a: float, b: float) -> float:
+    if b == 0:
+        raise ValueError("division by zero")
     return a / b
```
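
A diff like the one in the sample report can be produced without touching any file by comparing two in-memory versions of the source. The sketch below uses the standard library's `difflib.unified_diff`; it illustrates the dry-run idea only and is not the project's `fixer.py`. The two line lists are copied from the sample diff above.

```python
import difflib

original = [
    "def divide(a: float, b: float) -> float:\n",
    "    return a / b\n",
]
patched = [
    "def divide(a: float, b: float) -> float:\n",
    "    if b == 0:\n",
    '        raise ValueError("division by zero")\n',
    "    return a / b\n",
]

# unified_diff only compares the two line lists; nothing is written to disk,
# so the result is a preview in the same spirit as a dry-run suggestion.
diff_lines = list(
    difflib.unified_diff(
        original,
        patched,
        fromfile="a/calculator.py",
        tofile="b/calculator.py",
    )
)
print("".join(diff_lines), end="")
```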

Architecture

Target Python project
        |
        v
Pytest runner
        |
        v
JUnit XML + stdout/stderr capture
        |
        v
Failure parser
        |
        v
Rule-based diagnosis agent
        |
        v
Focused failing-test rerun
        |
        v
Patch proposal planner
        |
        v
Markdown / JSON report writer
        |
        v
Human review or future patch-generation agent

The core pipeline uses deterministic diagnosis rules so it can run without API keys. The optional LLM layer (--llm-explain) sits on top of the structured report and is advisory only, so the base system remains reproducible and easy to evaluate.
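
The rule-based diagnosis stage in the diagram can be pictured as a lookup from exception type to category. The sketch below is a simplification under stated assumptions: the first three pairs mirror the 10-Second Demo table, the `ZeroDivisionError` pairing is inferred from the example report, and the `diagnose` function and the `unclassified` fallback label are invented for this example. The real `diagnoser.py` likely weighs more evidence than the exception name.

```python
# Exception type -> category slug. See the lead-in for which pairs come
# from the README and which are inferred.
RULES = {
    "FileNotFoundError": "filesystem-boundary",
    "AttributeError": "object-interface",
    "NameError": "symbol-resolution",
    "ZeroDivisionError": "input-validation",
}


def diagnose(failure_message: str) -> str:
    """Classify a failure by the exception name that starts its message."""
    exception_name = failure_message.split(":", 1)[0].strip()
    # "unclassified" is a made-up fallback label for this sketch.
    return RULES.get(exception_name, "unclassified")


print(diagnose("FileNotFoundError: [Errno 2] No such file or directory: 'status.json'"))
print(diagnose("ZeroDivisionError: division by zero"))
print(diagnose("RuntimeError: something else"))
```

Because the table is plain data, the same input always yields the same category, which is what makes the base report reproducible in CI.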

Repository Layout

src/agentic_testops/
  cli.py          command-line entry point
  runner.py       pytest execution wrapper
  parser.py       pytest output parser
  diagnoser.py    failure classification and repair advice
  patcher.py      structured patch proposal planner
  fixer.py        conservative dry-run unified diff suggestions
  flake.py        flaky-failure detection through repeated reruns
  llm.py          optional advisory LLM analysis (Anthropic + OpenAI-compatible APIs, stdlib HTTP)
  reporter.py     Markdown and JSON report generation
  models.py       shared dataclasses
examples/
  buggy_calculator/
  flaky_pipeline/
  service_health/
  task_tracker/
docs/
  project-brief.md
  sample-buggy-calculator-report.md
  sample-buggy-calculator-fixes.patch
  sample-service-health-report.md
  sample-service-health-fixes.patch
  sample-task-tracker-report.md
  sample-task-tracker-fixes.patch
tests/
.github/workflows/
  ci.yml
action.yml

Project Status

- Runnable CLI and reusable GitHub Action are implemented.
- Markdown, JSON, and dry-run patch artifacts are generated from real pytest runs.
- JUnit XML parsing is preferred, with conservative text parsing as a fallback.
- Focused reruns, timeout reports, and portable command rendering are covered by tests.
- Public examples demonstrate boundary validation, API contract, data shape, empty-state, and shared-state flaky failures.
- Flakiness detection separates unstable failures from reproducible ones before repair planning.
- Real-world evaluation replays historical bugs from more-itertools, tabulate, and boltons and documents both hits and misses against upstream fix ground truth.
- Maintenance files are provided for issues, pull requests, contribution workflow, release checks, and security reporting.
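
The timeout reports mentioned in the status list can be illustrated with a wrapper that turns a hung subprocess into a structured result instead of an unhandled exception. This is a minimal sketch, assuming a `subprocess.run`-based runner; the function name `run_with_timeout` and the keys of the returned dictionary are hypothetical and not taken from the project's `runner.py`.

```python
import subprocess
import sys


def run_with_timeout(command: list[str], timeout_seconds: float) -> dict[str, object]:
    """Run a command and always return a report dict, even on timeout."""
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return {"status": "timeout", "returncode": None, "timeout_seconds": timeout_seconds}
    status = "pass" if completed.returncode == 0 else "fail"
    return {"status": status, "returncode": completed.returncode, "timeout_seconds": timeout_seconds}


# A stand-in for a hanging test run: sleep longer than the timeout allows.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(slow, timeout_seconds=0.5)["status"])
```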

Roadmap

- Safer AST-backed edit planning for more Python syntax shapes and call patterns.
- LLM-assisted patch generation building on the explanation layer.
- GitHub Checks integration that comments summaries on pull requests.
- Historical project memory for repeated failures and flaky-test signals.
- Multi-agent roles: runner, triager, patch planner, verifier.
- Coverage-guided test gap analysis.

Limitations

- The tool suggests repairs but does not edit target code.
- Pytest output parsing is intentionally conservative and may miss exotic plugin formats.
- Diagnosis rules are heuristic; the report is designed to support human review, not replace it.
- Patch proposals are planning hints, not executable code changes.
- Dry-run diffs cover only conservative patterns and should be reviewed before use.

License

MIT

Release history

This version: 0.2.0 (Jun 10, 2026)

Download files

Source Distribution

agentic_testops-0.2.0.tar.gz (45.5 kB), uploaded Jun 10, 2026, Source

Built Distribution

agentic_testops-0.2.0-py3-none-any.whl (32.9 kB), uploaded Jun 10, 2026, Python 3

File details

Details for the file agentic_testops-0.2.0.tar.gz.

File metadata

Download URL: agentic_testops-0.2.0.tar.gz
Upload date: Jun 10, 2026
Size: 45.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agentic_testops-0.2.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `9334940eab43a95eab22b67e7861f8f671d046d8ba641f910d4e356694450e7e` |
| MD5 | `00a0337056a546ff1094987ee2f8f874` |
| BLAKE2b-256 | `270fada4607a0adef45bebd8099fb7044bf117fa7ed188ec9d1509c018e03e3f` |

Provenance

The following attestation bundles were made for agentic_testops-0.2.0.tar.gz:

Publisher: release.yml on Strangelight-Merser/agentic-testops

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agentic_testops-0.2.0.tar.gz
- Subject digest: 9334940eab43a95eab22b67e7861f8f671d046d8ba641f910d4e356694450e7e
- Sigstore transparency entry: 1774177904
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: Strangelight-Merser/agentic-testops@0b96eb3822500a6b35d4420af1675b9d2bdc7e29
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Strangelight-Merser
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0b96eb3822500a6b35d4420af1675b9d2bdc7e29
- Trigger Event: release

File details

Details for the file agentic_testops-0.2.0-py3-none-any.whl.

File metadata

Download URL: agentic_testops-0.2.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 32.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agentic_testops-0.2.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `0e622436238177d0cbbc64bb6009035119463d16ed086172ffc1abd63a4d9929` |
| MD5 | `922075c43df37b4df01b1e6e769acdfe` |
| BLAKE2b-256 | `d5035e951904103e188f7092cdb57c19a10f79c131d71abb7e607b870708eee9` |

Provenance

The following attestation bundles were made for agentic_testops-0.2.0-py3-none-any.whl:

Publisher: release.yml on Strangelight-Merser/agentic-testops

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agentic_testops-0.2.0-py3-none-any.whl
- Subject digest: 0e622436238177d0cbbc64bb6009035119463d16ed086172ffc1abd63a4d9929
- Sigstore transparency entry: 1774178045
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: Strangelight-Merser/agentic-testops@0b96eb3822500a6b35d4420af1675b9d2bdc7e29
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Strangelight-Merser
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0b96eb3822500a6b35d4420af1675b9d2bdc7e29
- Trigger Event: release
