Scenario-based harness for improving AI agents through repair and confirmation loops

Tinkerloop

AI agents fail in ways unit tests can't catch — wrong tool calls, hallucinated answers, broken multi-turn flows. Tinkerloop is a scenario-based testing harness that replays conversations against your agent, runs deterministic checks on every response and tool call, and gives you a clear diagnosis when something breaks.

Why Tinkerloop

Deterministic checks, not LLM-as-judge. Every check is reproducible, fast, and free. No flaky evals, no paying for a judge model.
Assert on tool calls, not just text. Verify your agent called the right tools with the right arguments — not just that the final response looks plausible.
Two-stage run / confirm. Fast repair loops iterate on failures quickly; a separate confirmation loop validates against the real agent before you trust the result.
CI-gatable by design. Exit codes distinguish "repair passed" (code 3) from "confirmed green" (code 0) so your pipeline can enforce the difference.
Architecture-agnostic. Works with LangGraph, MCP tool servers, custom orchestrators, or any app that exposes a callable or command entrypoint. The target app doesn't need to be Python.
Built for coding agents. Point Codex, Claude Code, Copilot, or Cursor at the installed package or the repo and the agent picks up the CLI, diagnosis artifacts, and repair loop without additional setup. The outer coding model reads the diagnosis, patches your agent, and reruns — that's the intended workflow.

How It Works

write scenario → tinkerloop run → diagnose → patch → rerun → tinkerloop confirm

1. Write a JSON scenario with user turns and deterministic checks.
2. `tinkerloop run` replays the conversation through your agent's adapter.
3. Tinkerloop evaluates every check and writes a diagnosis artifact.
4. Fix the target code using the diagnosis.
5. Rerun failed scenarios until they pass.
6. `tinkerloop confirm` validates against the real agent before the result is final.

Example

A scenario that tests whether a RAG agent looks up the right document and answers correctly:

{
  "scenario_id": "lookup_refund_policy",
  "description": "Agent should search the knowledge base and cite the refund window",
  "tags": ["rag", "policy"],
  "turns": [
    {
      "user": "What is your refund policy?",
      "checks": [
        { "type": "tool_used", "values": ["search_knowledge_base"] },
        { "type": "assistant_contains_all", "values": ["30 days", "full refund"] },
        { "type": "assistant_not_contains", "values": ["I don't know", "I'm not sure"] }
      ]
    }
  ]
}

Run it:

tinkerloop run \
  --adapter your_project/tinkerloop_adapter.py:create_adapter \
  --user-id test-user \
  --scenarios scenarios/

When a check fails, Tinkerloop tells you exactly what went wrong:

Scenarios: 1, passed: 0, failed: 1
- [FAIL] lookup_refund_policy: Agent should search the knowledge base and cite the refund window
  turn 1: missing tools: ['search_knowledge_base']
          missing substrings: ['30 days']

The diagnosis artifact (latest-diagnosis.json) gives structured evidence for automated or manual triage:

{
  "summary": {
    "failed_scenario_count": 1,
    "failed_scenario_ids": ["lookup_refund_policy"]
  },
  "diagnosis_items": [
    {
      "scenario_id": "lookup_refund_policy",
      "primary_symptoms": [
        "missing tools: ['search_knowledge_base']",
        "missing substrings: ['30 days']"
      ]
    }
  ]
}
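For automated triage, the artifact can be flattened into one line per symptom. The following is a minimal sketch that relies only on the fields visible in the excerpt above (`diagnosis_items`, `scenario_id`, `primary_symptoms`); the real artifact may carry more fields, and the helper name `summarize_diagnosis` is hypothetical.

```python
import json


def summarize_diagnosis(diagnosis: dict) -> list[str]:
    """Flatten a diagnosis artifact into one 'scenario: symptom' line each."""
    lines = []
    for item in diagnosis.get("diagnosis_items", []):
        for symptom in item.get("primary_symptoms", []):
            lines.append(f"{item['scenario_id']}: {symptom}")
    return lines


# Inline copy of the excerpt above. A real script would instead json.load()
# the latest-diagnosis.json file from its report directory.
SAMPLE = json.loads("""
{
  "summary": {
    "failed_scenario_count": 1,
    "failed_scenario_ids": ["lookup_refund_policy"]
  },
  "diagnosis_items": [
    {
      "scenario_id": "lookup_refund_policy",
      "primary_symptoms": [
        "missing tools: ['search_knowledge_base']",
        "missing substrings: ['30 days']"
      ]
    }
  ]
}
""")

for line in summarize_diagnosis(SAMPLE):
    print(line)
```

A coding agent or a CI step could feed these lines into a patch prompt or a job summary without parsing the console output.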

Fix the agent, rerun, confirm:

tinkerloop run --adapter ... --scenarios scenarios/ --failed-from artifacts/reports

tinkerloop confirm --adapter ... --scenarios scenarios/ --non-interactive
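The run, patch, rerun, confirm cycle can also be driven from a script. The sketch below is one possible driver, not part of Tinkerloop: it uses only the flags shown in this README and the two documented exit codes (3 for a passed repair loop, 0 for a confirmed result). Treating every other exit code as "still failing" is an assumption, and `patch` is a hypothetical hook standing in for whoever edits the target code.

```python
import subprocess
from typing import Callable, Sequence

REPAIR_PASSED = 3    # documented: repair loop passed, not yet confirmed
CONFIRMED_GREEN = 0  # documented: confirmation passed


def tinkerloop_argv(command: str, adapter: str, scenarios: str,
                    user_id: str, extra: Sequence[str] = ()) -> list[str]:
    """Build a command line from the flags shown in this README."""
    return ["tinkerloop", command, "--adapter", adapter,
            "--user-id", user_id, "--scenarios", scenarios, *extra]


def repair_then_confirm(execute: Callable[[Sequence[str]], int],
                        patch: Callable[[], None],
                        adapter: str, scenarios: str, user_id: str,
                        max_attempts: int = 3) -> bool:
    """Return True only when a confirmation run exits with code 0."""
    extra: tuple[str, ...] = ()
    for _ in range(max_attempts):
        code = execute(tinkerloop_argv("run", adapter, scenarios, user_id, extra))
        if code == REPAIR_PASSED:
            confirm = tinkerloop_argv("confirm", adapter, scenarios, user_id,
                                      ("--non-interactive",))
            return execute(confirm) == CONFIRMED_GREEN
        # Assumption: any other exit code means the repair loop is still red.
        patch()  # hypothetical hook: a person or coding agent edits the target
        extra = ("--failed-from", "artifacts/reports")
    return False


def run_cli(argv: Sequence[str]) -> int:
    """Execute a command and return its exit code."""
    return subprocess.run(list(argv)).returncode
```

Passing `run_cli` as `execute` runs the real CLI; injecting a stub instead makes the control flow testable without an agent.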

Quick Start

Install from PyPI:

pip install tinkerloop-ai

Create an adapter in your target repo:

from tinkerloop.adapters import PythonAppAdapter

def create_adapter() -> PythonAppAdapter:
    return PythonAppAdapter(
        handler_path="your_app.agent:handle_message",
        patch_targets=["your_app.tools:execute_tool"],
    )

If your app uses a runner command, use CommandAppAdapter. For apps that need full control over trace capture, subclass AppAdapter directly. See the Adapter Guide.

Add a scenario JSON file (see Example above), then run:

tinkerloop run \
  --adapter your_project/tinkerloop_adapter.py:create_adapter \
  --user-id test-user \
  --scenarios scenarios/

When the repair loop passes (exit code 3), confirm against the real agent:

tinkerloop confirm \
  --adapter your_project/tinkerloop_adapter.py:create_adapter \
  --user-id test-user \
  --scenarios scenarios/ \
  --non-interactive

--adapter accepts an import path (your_project.adapter:create_adapter) or a file path (/path/to/adapter.py:create_adapter).
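To make the two accepted forms concrete, here is a generic sketch of how a `target:attribute` string can be resolved with the standard library. It illustrates the convention only and is not Tinkerloop's loader; the function name `load_factory` and the rule "a target ending in .py is a file path" are assumptions.

```python
import importlib
import importlib.util
from pathlib import Path


def load_factory(spec: str):
    """Resolve 'pkg.module:attr' or '/path/to/file.py:attr' to an object."""
    # Split on the last colon so drive letters in file paths stay intact.
    target, _, attr = spec.rpartition(":")
    if not target or not attr:
        raise ValueError(f"expected 'target:attribute', got {spec!r}")
    if target.endswith(".py"):
        # File path form: load the file as a standalone module.
        path = Path(target)
        module_spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(module_spec)
        module_spec.loader.exec_module(module)
    else:
        # Import path form: the module must be importable from sys.path.
        module = importlib.import_module(target)
    return getattr(module, attr)
```

The practical difference is that the import path form depends on the current environment's `sys.path`, while the file path form works from any working directory.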

From Source

git clone https://github.com/bostoneco/tinkerloop.git
cd tinkerloop
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
tinkerloop run \
  --adapter examples/starter_target/adapter.py:create_adapter \
  --user-id demo-user \
  --scenarios examples/starter_target/scenarios

What You Can Check

Check type	What it asserts	Scenario JSON
`assistant_contains_all`	Response includes every substring	`"values": ["30 days", "full refund"]`
`assistant_contains_any`	Response includes at least one	`"values": ["approved", "confirmed"]`
`assistant_not_contains`	Response includes none of these	`"values": ["I don't know"]`
`tool_used`	Agent called the named tool	`"values": ["search_knowledge_base"]`
`tool_call_count_at_most`	Tool was called at most N times	`"tool": "send_email", "max": 1`
`tool_call_matches`	Tool was called with specific arguments	`"tool": "send_email", "arguments": {"confirmed": true}`

Checks are intentionally narrow and deterministic. If a behavior matters, encode it as an explicit check rather than relying on a broad pass/fail signal.
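To show what "narrow and deterministic" can mean in practice, the sketch below evaluates the six check types against one recorded turn. It is an illustration, not Tinkerloop's implementation. The turn shape (an assistant string plus a list of `{"name", "arguments"}` dicts), case-sensitive substring matching, and subset matching on tool arguments are all assumptions.

```python
def evaluate_check(check: dict, assistant_text: str, tool_calls: list[dict]) -> bool:
    """Return True when one check passes for one turn (assumed semantics)."""
    kind = check["type"]
    if kind == "assistant_contains_all":
        return all(value in assistant_text for value in check["values"])
    if kind == "assistant_contains_any":
        return any(value in assistant_text for value in check["values"])
    if kind == "assistant_not_contains":
        return not any(value in assistant_text for value in check["values"])

    names = [call["name"] for call in tool_calls]
    if kind == "tool_used":
        # Assumption: every listed tool must have been called.
        return all(value in names for value in check["values"])
    if kind == "tool_call_count_at_most":
        return names.count(check["tool"]) <= check["max"]
    if kind == "tool_call_matches":
        # Assumption: expected arguments need only be a subset of the actual ones.
        expected = check["arguments"]
        return any(
            call["name"] == check["tool"]
            and all(key in call["arguments"] and call["arguments"][key] == value
                    for key, value in expected.items())
            for call in tool_calls
        )
    raise ValueError(f"unknown check type: {kind}")
```

Because each branch is a plain string or list comparison, the same inputs always give the same verdict, which is the property the checks are designed around.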

Multi-Turn and Destructive Scenarios

Scenarios can span multiple turns and assert on each one independently. Mark destructive scenarios so they can be gated or isolated:

{
  "scenario_id": "send_email_with_confirmation",
  "description": "Agent should compose, confirm, then send — checking tool args at each step",
  "destructive": true,
  "tags": ["email", "guardrail"],
  "turns": [
    {
      "user": "Send a test email to alice@example.com",
      "checks": [
        { "type": "tool_used", "values": ["compose"] },
        { "type": "assistant_contains_all", "values": ["Confirm to send"] }
      ]
    },
    {
      "user": "send",
      "checks": [
        { "type": "tool_call_matches", "tool": "compose", "arguments": { "confirmed": true } },
        { "type": "assistant_contains_all", "values": ["Email sent"] }
      ]
    }
  ]
}
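One way to gate destructive scenarios outside of Tinkerloop is a small pre-flight script that lists them before a run. The sketch below assumes one scenario object per `.json` file in a flat directory and treats a missing `destructive` key as false; neither assumption is confirmed by this README, and both helper names are hypothetical.

```python
import json
from pathlib import Path


def load_scenarios(directory: str) -> list[dict]:
    """Load every *.json file in a directory as one scenario (assumed layout)."""
    return [json.loads(path.read_text(encoding="utf-8"))
            for path in sorted(Path(directory).glob("*.json"))]


def split_by_destructive(scenarios: list[dict]) -> tuple[list[str], list[str]]:
    """Return (safe_ids, destructive_ids); a missing flag is treated as false."""
    safe, destructive = [], []
    for scenario in scenarios:
        bucket = destructive if scenario.get("destructive", False) else safe
        bucket.append(scenario["scenario_id"])
    return safe, destructive
```

A pipeline could, for example, refuse to run against a shared environment whenever the second list is non-empty.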

Run a tagged slice:

tinkerloop run --adapter ... --scenarios scenarios/ --tag email --tag guardrail

Tested Against Real Agents

Tinkerloop has been validated against multiple production agent architectures:

A multi-turn conversational email agent with 25+ scenarios covering destructive flows, guardrails, undo operations, and tool argument assertions via CommandAppAdapter
An enterprise investigation agent built on LangGraph with live API integrations, a custom AppAdapter subclass, and dynamic tool trace capture

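Between them, these integrations exercise two adapter shapes (CommandAppAdapter and a custom AppAdapter subclass), multi-turn and destructive flows, tool argument assertions, and orchestrators ranging from a conversational agent to a LangGraph application with live APIs.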

Artifacts

Each run writes structured JSON artifacts for tooling and CI integration:

Artifact	Purpose
`latest.json`	Full report for the most recent run
`latest-failures.json`	Failed scenarios only
`latest-diagnosis.json`	Structured diagnosis with symptoms and confirmation status
`confirm-latest.json`	Full report for the most recent confirmation run
`confirm-latest-failures.json`	Failed confirmation scenarios
`confirm-latest-diagnosis.json`	Confirmation diagnosis

Repair and confirmation artifacts are kept separate. A green repair loop is not a final pass — confirmation_status in the diagnosis artifact tracks whether the result has been confirmed against the real agent.

Docs

Quickstart: Target Repo — minimal integration path
Adapter Guide — PythonAppAdapter vs CommandAppAdapter vs AppAdapter
Target Contract — public integration boundary
Actor Model — inner target orchestrator vs outer coding model
Trust Model — what pass/fail results mean and don't mean
Worked Example — failure → diagnosis → rerun walkthrough
Troubleshooting — first-run failure modes
Stability — supported v0.x surface and experimental boundaries

Status

Tinkerloop is in alpha (v0.1.x). The CLI commands, adapter interfaces, check types, and report schemas documented above are the supported contract for v0.x. See STABILITY.md for the full stability boundary.

The project is designed for teams that can own a target adapter and scenario library. It is not a zero-config framework or a general benchmark suite.

License

Apache License 2.0 — use, modify, and distribute freely; includes a patent grant. See LICENSE.

Contributing

PRs are accepted from maintainers and invited contributors. For bugs or ideas, open an issue. See CONTRIBUTING.md.

Project details

Maintainers

bostoneco

Release history

Version	Released
0.1.7	Apr 4, 2026
0.1.6 (this version)	Apr 4, 2026
0.1.5	Apr 3, 2026
0.1.4	Apr 3, 2026

Download files

Source Distribution

tinkerloop_ai-0.1.6.tar.gz (20.9 kB)

Uploaded Apr 4, 2026 Source

Built Distribution

tinkerloop_ai-0.1.6-py3-none-any.whl (25.3 kB)

Uploaded Apr 4, 2026 Python 3

File details

Details for the file tinkerloop_ai-0.1.6.tar.gz.

File metadata

Download URL: tinkerloop_ai-0.1.6.tar.gz
Upload date: Apr 4, 2026
Size: 20.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tinkerloop_ai-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`b478b9ff77377c2c7850b3b7276c2c05fc7f812bcdace42db2972b5f9bc11f12`
MD5	`196563a8f62bd157079854c6a436c246`
BLAKE2b-256	`f584ebe0050d04917240b57efe56ac1538326296ec497ac4827d5b54507dc2e5`
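A downloaded file can be checked against the SHA256 digest above with the standard library. This is a minimal sketch; the helper names and the assumption that the archive sits at a local path you supply are illustrative.

```python
import hashlib

# SHA256 digest listed above for tinkerloop_ai-0.1.6.tar.gz.
EXPECTED_SDIST_SHA256 = "b478b9ff77377c2c7850b3b7276c2c05fc7f812bcdace42db2972b5f9bc11f12"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected: str = EXPECTED_SDIST_SHA256) -> bool:
    """Compare the file's SHA256 hex digest with the expected value."""
    return sha256_of(path) == expected.lower()
```

Calling `matches_digest("tinkerloop_ai-0.1.6.tar.gz")` on a downloaded copy returns False if the bytes differ from the published file.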

Provenance

The following attestation bundles were made for tinkerloop_ai-0.1.6.tar.gz:

Publisher: release-wheel.yml on bostoneco/tinkerloop

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tinkerloop_ai-0.1.6.tar.gz
- Subject digest: b478b9ff77377c2c7850b3b7276c2c05fc7f812bcdace42db2972b5f9bc11f12
- Sigstore transparency entry: 1231324326
- Sigstore integration time: Apr 4, 2026
Source repository:
- Permalink: bostoneco/tinkerloop@77338496079f16d95404915ec34d9a1eafaaae83
- Branch / Tag: refs/tags/v0.1.6
- Owner: https://github.com/bostoneco
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-wheel.yml@77338496079f16d95404915ec34d9a1eafaaae83
- Trigger Event: release

File details

Details for the file tinkerloop_ai-0.1.6-py3-none-any.whl.

File metadata

Download URL: tinkerloop_ai-0.1.6-py3-none-any.whl
Upload date: Apr 4, 2026
Size: 25.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tinkerloop_ai-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`17315c5fa5b34e4e653837746997203796d0780812e58ffcbf8b8f4191dafe3a`
MD5	`ecfc1decb76da60ae0f410890ca40005`
BLAKE2b-256	`0f1420e0506e46732851f1b28621ec95b1abaaa348ad2db1651a3a66b938c886`

Provenance

The following attestation bundles were made for tinkerloop_ai-0.1.6-py3-none-any.whl:

Publisher: release-wheel.yml on bostoneco/tinkerloop

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tinkerloop_ai-0.1.6-py3-none-any.whl
- Subject digest: 17315c5fa5b34e4e653837746997203796d0780812e58ffcbf8b8f4191dafe3a
- Sigstore transparency entry: 1231324407
- Sigstore integration time: Apr 4, 2026
Source repository:
- Permalink: bostoneco/tinkerloop@77338496079f16d95404915ec34d9a1eafaaae83
- Branch / Tag: refs/tags/v0.1.6
- Owner: https://github.com/bostoneco
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-wheel.yml@77338496079f16d95404915ec34d9a1eafaaae83
- Trigger Event: release

tinkerloop-ai 0.1.6
