Skip to main content

Filesystem-first harness optimization for coding agents.

Project description

metaharness

CI Docs Documentation Python License Status Paper

metaharness is an open source Python library for optimizing executable harnesses around agentic coding systems. It is inspired by the Meta Harness paper and is an unofficial open source implementation of the core ideas in that work. The current implementation and benchmark evidence in this repository are centered on the Codex CLI path, including hosted Codex and Codex over local Ollama models.

It is built for teams who want to improve the code and files around an agent workflow, not just the prompt. That includes instruction files, setup flows, validation scripts, test scripts, routing logic, and other executable support code.

Why metaharness

Many agent failures come from the harness around the model:

  • weak repository instructions
  • missing setup steps
  • broken validation logic
  • incomplete test flows
  • poor iteration memory
  • acceptance checks that do not match the real task

metaharness turns those artifacts into a repeatable optimization target with stored evidence for every proposal. It also captures a compact environment snapshot before each proposal so agents do not waste early turns on basic workspace discovery. Projects can also declare an allowed write scope so off-target edits are rejected automatically.

How It Works

metaharness runs an outer optimization loop around a harness:

  1. start from a baseline workspace
  2. ask a coding agent to improve it
  3. validate and evaluate the result
  4. keep the best candidate
  5. store all artifacts on disk

The result is a practical, inspectable workflow for improving real harnesses instead of ad hoc prompt tinkering.

Who It Is For

  • developers building agentic coding systems who want to optimize harness code, workflow scripts, retrieval wrappers, routing, and evaluation flows
  • practitioners using coding-agent tools who want to improve AGENTS.md, GEMINI.md, bootstrap scripts, validation scripts, and acceptance tests

Quickstart

Install the project:

uv sync

Run the fake backend on a real benchmark:

uv run metaharness run examples/python_fixture_benchmark --backend fake --budget 1 --run-name quickstart

Inspect the run:

uv run metaharness inspect examples/python_fixture_benchmark/runs/quickstart

Export the candidate ledger:

uv run metaharness ledger examples/python_fixture_benchmark/runs/quickstart --tsv

Run a saved experiment matrix:

uv run metaharness experiment --config examples/experiment_configs/fake-benchmarks.json

Core Capabilities

  • a minimal optimization engine
  • a filesystem-backed run store
  • automatic environment bootstrap snapshots for each proposal
  • optional write-scope enforcement through allowed_write_paths
  • a provider-neutral proposer backend interface
  • a real CodexExecBackend
  • a deterministic FakeBackend
  • a coding-tool integration for instruction files and script-based harnesses
  • explicit per-candidate outcomes: keep, discard, crash, timeout, no-change, and scope-violation
  • reporting commands for inspect, ledger, summarize, and compare
  • experiment-matrix execution with JSON and TSV outputs
  • benchmark targets and experiment records

Current Status

The repository currently includes:

  • two real coding-tool benchmark targets
  • a smaller deterministic ticket-router example
  • hosted Codex runs on the real benchmarks
  • local Codex over Ollama runs with gpt-oss:20b and gpt-oss:120b
  • a docs site published from GitHub Actions

Current documented experiments in this repository show:

  • hosted Codex solves both real benchmarks in one proposal iteration
  • local gpt-oss:120b solves python_fixture_benchmark
  • local gpt-oss:20b is useful for smoke checks but timed out on the current real benchmark runs

Detailed experiment records:

Provider Status

  • Codex is the main validated harness path in this repository today
  • hosted Codex is the strongest current path for real runs
  • local Codex over Ollama works and has been exercised with gpt-oss:20b and gpt-oss:120b
  • Gemini exists as a scaffolded backend and is not yet at parity with Codex

All real provider results currently documented in this repository were produced through the Codex CLI path. That includes both hosted Codex runs and local Ollama runs driven through Codex with gpt-oss models. Other coding-agent evaluations in the wider ecosystem often emphasize Claude Code and Opus, but this repository's current benchmark evidence is Codex-first.

Documentation

Installation

Project setup:

uv sync

If you want the docs toolchain too:

uv sync --group dev

Check the CLI:

uv run metaharness --help

Editable install with pip also works:

pip install -e .

Hosted Codex

Requirements:

  • codex CLI installed
  • authenticated Codex session or API key
  • outbound network access

Run a real benchmark with hosted Codex:

uv run metaharness run examples/python_fixture_benchmark --backend codex --hosted --budget 1 --run-name hosted-codex

Important:

  • use --hosted when a project config defaults to local Ollama
  • the library is ready for hosted Codex runs today

Local Codex Over Ollama

Probe the local setup:

uv run metaharness smoke codex examples/python_fixture_benchmark --probe-only --oss --local-provider ollama --model gpt-oss:20b

Run with gpt-oss:20b:

uv run metaharness run examples/python_fixture_benchmark --backend codex --oss --local-provider ollama --model gpt-oss:20b --proposal-timeout 240 --budget 1 --run-name ollama-20b

Run with gpt-oss:120b:

uv run metaharness run examples/python_fixture_benchmark --backend codex --oss --local-provider ollama --model gpt-oss:120b --proposal-timeout 420 --budget 1 --run-name ollama-120b

Benchmarks And Examples

Real benchmarks:

Smaller deterministic example:

Run the ticket router example:

uv run python examples/ticket_router/run.py --backend fake --budget 1

Scaffold Your Own Project

Create a coding-tool project:

uv run metaharness scaffold coding-tool ./my-coding-tool-optimizer

Available profiles:

  • standard
  • local-oss-smoke
  • local-oss-medium

Run the scaffold with the fake backend:

uv run metaharness run ./my-coding-tool-optimizer --backend fake --budget 1

CLI Overview

Create a scaffold:

uv run metaharness scaffold coding-tool ./my-project

Run a project:

uv run metaharness run ./my-project --backend fake --budget 1

Probe Codex:

uv run metaharness smoke codex ./my-project --probe-only

Inspect a run:

uv run metaharness inspect ./my-project/runs/example

Compare runs:

uv run metaharness compare \
  ./examples/python_fixture_benchmark/runs/hosted-codex-20260401 \
  ./examples/python_fixture_benchmark/runs/ollama-20b-20260401 \
  ./examples/python_fixture_benchmark/runs/ollama-120b-20260401

Run an experiment matrix:

uv run metaharness experiment --config examples/experiment_configs/fake-benchmarks.json

Benefits Of The Filesystem Approach

Every run stores:

  • prompts
  • candidate workspaces
  • validation results
  • evaluation results
  • proposal metadata
  • workspace diffs
  • per-candidate manifests

That makes the optimization history reviewable, debuggable, and reusable.

Development

Compile checks:

uv run python -m compileall -q src tests examples docs

Unit tests:

uv run python -m unittest discover -s tests -v

Docs build:

uv run mkdocs build --strict

Fake benchmark smoke runs:

uv run metaharness run examples/python_fixture_benchmark --backend fake --budget 1 --run-name ci-fixture-local
uv run metaharness run examples/python_cli_benchmark --backend fake --budget 1 --run-name ci-cli-local
uv run python examples/ticket_router/run.py --backend fake --budget 1

License

MIT. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

superagentic_metaharness-0.1.0.tar.gz (49.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

superagentic_metaharness-0.1.0-py3-none-any.whl (48.0 kB view details)

Uploaded Python 3

File details

Details for the file superagentic_metaharness-0.1.0.tar.gz.

File metadata

File hashes

Hashes for superagentic_metaharness-0.1.0.tar.gz
Algorithm Hash digest
SHA256 3ef6ee98417f754fea4833e3d2c1461c403f62b4bb4aab3fe340d689e3098b84
MD5 c19992bcd8475ce16829121166efdb82
BLAKE2b-256 ddc9627433711361f0d4444f4d92427c3ca2e339824e8459f30c6295715aa5da

See more details on using hashes here.

File details

Details for the file superagentic_metaharness-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for superagentic_metaharness-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 af14d8ee5d602442b457902e31bdaebbc33deaf722a69241d2eeaa29283daf09
MD5 778451bc000172cff278f96e70c8a039
BLAKE2b-256 235ac18642ce28614cae7f5bc82a3a84619e731555738305c54f13452463762e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page