Eval-driven test/fix/improve harness for orchestrator-based apps

Tinkerloop

If the communication between your orchestrator and its MCP app is hard to trust, Tinkerloop gives you a scenario-based loop: reproduce the failure, diagnose it with deterministic checks, patch the target, and rerun until the behavior matches what you expect.

It is an eval-driven harness for testing and improving orchestrator-based apps through repeatable test -> diagnose -> patch -> rerun loops.

Release Status

Tinkerloop is in alpha.

  • The package, CLI, adapters, and report artifacts are usable now.
  • The supported v0.x surface is documented in docs/STABILITY.md.
  • The project is intended for technically strong early adopters who can own a target adapter.
  • It is not yet positioned as a benchmark suite or production-assurance layer.

What It Is

Tinkerloop is not another app-specific bot framework. It is a reusable outer loop for systems that already have:

  • an inner orchestrator model
  • tool or MCP integrations
  • a conversational or API-facing entrypoint

Tinkerloop plays the role of:

  • user simulator
  • integration tester
  • trajectory recorder
  • deterministic judge
  • developer feedback loop driver

Actor Model

There are two distinct roles in a Tinkerloop workflow:

  • inner target orchestrator: the model and tool path inside the app under test
  • outer coding model: the developer tool model using Tinkerloop artifacts to patch and rerun

The outer coding model may analyze results and edit code between runs. It must not replace the inner target orchestrator during a measured run. See docs/ACTOR_MODEL.md.

Who It Is For

  • teams that already have a target app and want deterministic scenario-based regression loops
  • teams that can keep target-specific logic in a target-owned adapter and scenario library
  • teams that want report-driven reruns rather than broad benchmark claims

Who It Is Not For

  • users looking for a zero-config app framework
  • teams that need remote secure-driver support today
  • users who want Tinkerloop to measure general model quality

MVP Scope

Current MVP:

  • load multi-turn scenario files
  • run them against a target app adapter
  • preflight the target app before scenario execution
  • resolve the target app's inner runtime from the target repo boundary
  • trace tool calls by patching configured execution points
  • trace tool calls from target-owned runner commands
  • evaluate deterministic checks
  • write JSON reports for failures and regressions
  • rerun only failed scenarios from report artifacts
  • separate repair-loop and confirmation-loop runs
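
The idea behind "trace tool calls by patching configured execution points" can be illustrated with a generic sketch. This is not Tinkerloop's implementation: the `target` namespace, `execute_tool`, and `make_tracer` are hypothetical names, and the patching here uses only the standard library's `unittest.mock`.

```python
import functools
import types
from unittest import mock


def execute_tool(tool_name, user_id, arguments, correlation_id=None):
    """Hypothetical tool entrypoint inside the app under test."""
    return {"tool": tool_name, "ok": True}


# Hypothetical stand-in for a target module that exposes the entrypoint.
target = types.SimpleNamespace(execute_tool=execute_tool)


def make_tracer(original, trace):
    """Wrap `original` so each call is recorded in `trace` before it runs."""
    @functools.wraps(original)
    def traced(tool_name, user_id, arguments, correlation_id=None):
        trace.append(
            {"tool_name": tool_name, "user_id": user_id, "arguments": arguments}
        )
        return original(tool_name, user_id, arguments, correlation_id=correlation_id)
    return traced


trace = []
# Patch the execution point only for the duration of the measured call.
with mock.patch.object(target, "execute_tool", make_tracer(target.execute_tool, trace)):
    result = target.execute_tool("search", "demo-user", {"q": "invoices"})
# On exit, mock restores the original callable on the target.
```

The recorded `trace` list is the kind of data deterministic checks can then inspect without changing what the target returns.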

Not in scope yet:

  • automatic patch generation
  • automatic deploys
  • autonomous code changes without a human gate
  • benchmark claims beyond the configured scenario set
  • secure non-prod target-driver contracts

Quick Start

Tinkerloop supports Python 3.10+. This repo pins 3.12.9 in .python-version for local development with pyenv.

The PyPI distribution name is tinkerloop-ai. Install it with:

python3 -m pip install tinkerloop-ai

If you need to install directly from a GitHub release asset instead:

python3 -m pip install https://github.com/bostoneco/tinkerloop/releases/download/<tag>/tinkerloop_ai-<version>-py3-none-any.whl

Then run it against a target-owned adapter and scenario directory:

tinkerloop \
  run \
  --adapter /path/to/target_adapter.py:create_adapter \
  --user-id <user-id> \
  --scenarios /path/to/scenarios

tinkerloop run exits with code 3 when the repair loop passes. That is intentional: run tinkerloop confirm ... before treating the result as final.
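
If you script the repair loop, the exit code needs explicit handling, because a passing repair run does not exit with 0. The sketch below is one way to do that with `subprocess`. The helper names are hypothetical, and only exit code 3 is taken from this document; the meaning of other exit codes is not described here, so the sketch treats them all as "not a repair pass" by assumption.

```python
import subprocess

# Documented above: `tinkerloop run` exits with 3 when the repair loop passes.
REPAIR_PASSED_EXIT_CODE = 3


def repair_passed(returncode: int) -> bool:
    """True only for the documented repair-pass code (other codes: assumption)."""
    return returncode == REPAIR_PASSED_EXIT_CODE


def run_repair_loop(adapter: str, user_id: str, scenarios: str) -> bool:
    """Run the repair loop and report whether confirmation is the next step."""
    cmd = [
        "tinkerloop", "run",
        "--adapter", adapter,
        "--user-id", user_id,
        "--scenarios", scenarios,
    ]
    # check=False: a non-zero exit is expected here and must not raise.
    completed = subprocess.run(cmd, check=False)
    return repair_passed(completed.returncode)
```

When `run_repair_loop(...)` returns `True`, the next step is `tinkerloop confirm ...`, not a release.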

When a candidate fix looks good, run the external confirmation loop:

tinkerloop \
  confirm \
  --adapter /path/to/target_adapter.py:create_adapter \
  --user-id <user-id> \
  --scenarios /path/to/scenarios \
  --non-interactive

If your target repo exposes a more realistic runner or adapter for real-agent validation, use that boundary for confirm instead of the faster repair-loop boundary.

For local development from a source checkout:

pyenv local 3.12.9
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pytest -q
tinkerloop \
  run \
  --adapter examples/starter_target/adapter.py:create_adapter \
  --user-id demo-user \
  --scenarios examples/starter_target/scenarios

# fuller demo target
tinkerloop \
  run \
  --adapter examples/demo_app/adapter.py:create_adapter \
  --user-id demo-user \
  --scenarios examples/demo_app/scenarios

For real projects, the target repo should own its adapter and scenarios. --adapter accepts either an import path such as your_project.tinkerloop_adapter:create_adapter or a file path such as /path/to/target_adapter.py:create_adapter.
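
To make the two `--adapter` forms concrete, here is a minimal sketch of how a `<module-or-file>:<factory>` string could be resolved with `importlib`. It is an illustration of the convention, not Tinkerloop's loader: `load_factory` is a hypothetical helper, and treating any target ending in `.py` as a file path is an assumption.

```python
import importlib
import importlib.util
from pathlib import Path


def load_factory(spec: str):
    """Resolve 'pkg.module:attr' or '/path/to/file.py:attr' to an object."""
    # Split on the last colon so drive letters in file paths stay intact.
    target, sep, attr = spec.rpartition(":")
    if not sep or not target or not attr:
        raise ValueError(f"expected '<module-or-file>:<factory>', got {spec!r}")

    if target.endswith(".py"):
        # File-path form: load the module directly from its location.
        path = Path(target)
        module_spec = importlib.util.spec_from_file_location(path.stem, path)
        if module_spec is None or module_spec.loader is None:
            raise ImportError(f"cannot load module from {path}")
        module = importlib.util.module_from_spec(module_spec)
        module_spec.loader.exec_module(module)
    else:
        # Import-path form: the module must be importable from sys.path.
        module = importlib.import_module(target)

    return getattr(module, attr)
```

The import-path form requires the target package to be installed or on `sys.path`; the file-path form does not.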

For PythonAppAdapter, each patch_targets entry should point at a callable with the standard tool-call shape (tool_name, user_id, arguments, correlation_id=None). Scenario files must contain at least one turn, and each turn must define a non-empty user prompt.
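
The sketch below shows a callable with that tool-call shape and a small validator for the two scenario rules. Both are illustrative only: `execute_tool` and `validate_scenario` are hypothetical names, and the scenario keys `turns` and `user` are assumptions, since the scenario file schema is not spelled out here.

```python
from typing import Any, Optional


def execute_tool(
    tool_name: str,
    user_id: str,
    arguments: dict[str, Any],
    correlation_id: Optional[str] = None,
) -> dict[str, Any]:
    """Hypothetical target-side entrypoint with the standard tool-call shape."""
    return {
        "tool_name": tool_name,
        "user_id": user_id,
        "arguments": arguments,
        "correlation_id": correlation_id,
    }


def validate_scenario(scenario: dict[str, Any]) -> list[str]:
    """Return rule violations; an empty list means both rules hold.

    Assumption: turns live under 'turns' and each prompt under 'user'.
    """
    problems: list[str] = []
    turns = scenario.get("turns") or []
    if not turns:
        problems.append("scenario must contain at least one turn")
    for index, turn in enumerate(turns):
        prompt = turn.get("user")
        if not isinstance(prompt, str) or not prompt.strip():
            problems.append(f"turn {index} must define a non-empty user prompt")
    return problems
```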

If the adapter cannot resolve one inner model confidently, Tinkerloop will prompt for a repo-derived candidate in interactive mode. In non-interactive mode, pass explicit overrides:

tinkerloop \
  run \
  --adapter examples/demo_app/adapter.py:create_adapter \
  --user-id demo-user \
  --scenarios examples/demo_app/scenarios \
  --inner-provider <provider> \
  --inner-model <model>

Rerun only failed scenarios from report artifacts:

tinkerloop \
  run \
  --adapter examples/demo_app/adapter.py:create_adapter \
  --user-id demo-user \
  --scenarios examples/demo_app/scenarios \
  --failed-from artifacts/reports

Run only a tagged feature slice:

tinkerloop \
  run \
  --adapter examples/demo_app/adapter.py:create_adapter \
  --user-id demo-user \
  --scenarios examples/demo_app/scenarios \
  --tag cleanup \
  --tag preview

Artifacts written on each run:

  • timestamped report: tinkerloop-<timestamp>.json
  • stable latest report: latest.json
  • stable failure summary: latest-failures.json
  • stable diagnosis payload: latest-diagnosis.json, which includes confirmation_status so repair-loop and confirmation-loop results can be told apart
  • confirmation timestamped report: confirm-tinkerloop-<timestamp>.json
  • confirmation latest report: confirm-latest.json
  • confirmation failure summary: confirm-latest-failures.json
  • confirmation diagnosis payload: confirm-latest-diagnosis.json

When a repair run passes, Tinkerloop exits with code 3 and tells you to run tinkerloop confirm .... Repair-only results do not prove agent quality. If confirmation is blocked, Tinkerloop still writes confirm-latest-diagnosis.json with confirmation_status: "blocked" and the preflight error so the attempt is visible in artifacts.
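
A script that gates on confirmation can read the diagnosis artifact directly. The following is a minimal sketch: `confirmation_blocked` is a hypothetical helper, the report directory is whatever directory your runs write to, and it assumes `confirmation_status` is a top-level key in the JSON payload.

```python
import json
from pathlib import Path


def confirmation_blocked(report_dir: str) -> bool:
    """True if confirm-latest-diagnosis.json reports a blocked confirmation."""
    path = Path(report_dir) / "confirm-latest-diagnosis.json"
    if not path.exists():
        # No confirmation attempt has been recorded in this directory.
        return False
    payload = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: confirmation_status sits at the top level of the payload.
    return payload.get("confirmation_status") == "blocked"
```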

Support Matrix

  • Python: 3.10+
  • Commands: run, confirm
  • Adapter shapes: PythonAppAdapter, CommandAppAdapter
  • Report schemas: tinkerloop.report.v1, tinkerloop.failures.v1, tinkerloop.diagnosis.v1
  • Check types: assistant_contains_all, assistant_contains_any, assistant_not_contains, tool_used, tool_call_count_at_most, tool_call_matches
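
To give a feel for what deterministic checks of this kind do, here is a sketch of five of the listed check types as plain functions. The semantics are inferred from the names only: exact matching rules, the trace record shape (`tool_name` key), and whether `tool_call_count_at_most` counts all calls or calls per tool are assumptions, and `tool_call_matches` is left out because its matching rules are not described here.

```python
from typing import Any

# Hypothetical trace record, e.g. {"tool_name": "search"}.
ToolCall = dict[str, Any]


def assistant_contains_all(reply: str, phrases: list[str]) -> bool:
    return all(phrase in reply for phrase in phrases)


def assistant_contains_any(reply: str, phrases: list[str]) -> bool:
    return any(phrase in reply for phrase in phrases)


def assistant_not_contains(reply: str, phrases: list[str]) -> bool:
    return not any(phrase in reply for phrase in phrases)


def tool_used(calls: list[ToolCall], tool_name: str) -> bool:
    return any(call["tool_name"] == tool_name for call in calls)


def tool_call_count_at_most(calls: list[ToolCall], limit: int) -> bool:
    # Assumption: the limit applies to the total number of calls in the turn.
    return len(calls) <= limit


# Made-up reply and trace, for illustration only.
reply = "Preview ready: 3 files would be removed."
calls = [{"tool_name": "list_files"}, {"tool_name": "preview_cleanup"}]

results = {
    "assistant_contains_all": assistant_contains_all(reply, ["Preview", "3 files"]),
    "assistant_contains_any": assistant_contains_any(reply, ["removed", "deleted"]),
    "assistant_not_contains": assistant_not_contains(reply, ["deleted"]),
    "tool_used": tool_used(calls, "preview_cleanup"),
    "tool_call_count_at_most": tool_call_count_at_most(calls, 2),
}
```

Because each check is a pure function of the reply text and the recorded calls, the same inputs always produce the same verdict, which is what makes reruns comparable.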

Repo Layout

  • src/tinkerloop/: reusable harness engine and adapter interfaces
  • examples/: optional example and transition fixtures
  • docs/: charter, architecture, target contract, MVP plan, implementation handoff, and working agreement
  • tests/: Tinkerloop unit tests

Design Rules

  • keep the core small and inspectable
  • prefer deterministic checks before LLM judges
  • keep target-app integration behind adapters
  • no silent magic around tracing, patching, or scenario selection
  • no automatic production actions
  • future target-driver integrations must be non-prod only and secure by default

License

Apache License 2.0. See LICENSE. Business-friendly: use, modify, and distribute with minimal conditions; includes a patent grant.

Contributing

PRs are accepted from maintainers and invited contributors only. For bugs or ideas, open an issue. See CONTRIBUTING.md.
