tp — non-invasive CLI testing framework for AI workflows. Submits results to trapstreet.run.
Project description
trap — CLI
Lives at
trapstreet-mvp/cli/— part of the trapstreet monorepo. The standalone repoAntiNoise-ai/trapis retained for history only; all active development happens here.
Install (from PyPI — recommended):
uv tool install trapstreet-cli
# also works via pipx / pip
From git (latest main, no PyPI release needed):
uv tool install "git+https://github.com/AntiNoise-ai/trapstreet-mvp.git#subdirectory=cli"
The command name is tp. Releases are git-tagged cli-vX.Y.Z; tagging
triggers PyPI publish via GitHub Actions.
trap is a non-invasive CLI testing framework for AI prompts, agents, and workflows. It treats the program under test (the "solution") as a black box: it invokes the solution as a subprocess, captures stdout/stderr/files, then optionally pipes that output through a "judge" (per-case scorer) and a "grader" (overall aggregator) — also subprocesses, also language-agnostic.
The framework knows nothing about how the solution is implemented. Python, shell scripts, compiled binaries, agentic pipelines — anything that can be invoked from a shell works.
Core idea: solution and task are decoupled
Two roles, two repos (or two directories), connected only by a small IO contract:
| Role | Owns | Configures |
|---|---|---|
| Solution author | trap.yaml, the solution code |
how to invoke the solution, which inputs to feed it, which outputs it produces |
| Task author | traptask.yaml, judge.py, grader.py, inputs/, expected/ |
the test cases, scoring logic, expected outputs |
The solution doesn't need to import trap or know it exists. It reads paths from two environment variables (INPUTS, OUTPUTS) and runs.
inputs/{case_id}/ ──[INPUTS env var]──▶ solution ──[OUTPUTS env var]──▶ .trap/{task}/{ts}/{case_id}/
expected/{case_id}/ │
│ │
└──────────────────────── judge ◀──────────────────────────────────────────┘
│
{metrics: any JSON}
│
[collect all cases, hand to grader]
│
grader
│
{passed, score, ...}
Install
Requires Python >=3.14 and uv.
git clone https://github.com/AntiNoise-ai/trap
cd trap
uv sync
The installed entry point is tp (not trap), declared in pyproject.toml:
uv run tp --help
There is no PyPI release yet, so use uv run tp … from a clone, or run it from a wheel you build locally (uv build).
Quick start — the echo example
The repo ships two complete worked examples under examples/. Walk through examples/echo/ to see the moving pieces.
1. Solution side — examples/echo/solution/
echo.py reads JSON from stdin and prints message to stdout (or errors with exit 1 if message is missing):
import json, sys
data = json.load(sys.stdin)
if "message" not in data:
print("error: missing 'message' field", file=sys.stderr)
sys.exit(1)
print(data["message"])
trap.yaml tells trap how to run it and where the task lives:
tasks:
test:
description: Echo solution — reads stdin JSON, writes it back to stdout
cmd: uv run python echo.py
traptask: ../task # path to the task directory (relative to trap.yaml)
inputs:
stdin: input.json # pipe inputs/{case_id}/input.json into stdin
2. Task side — examples/echo/task/
traptask.yaml lists the cases and points at the judge/grader:
dirs:
inputs: inputs/ # optional; this is the default
expected: expected/ # optional; this is the default
cases:
- id: contains_basic
description: stdout contains the substring (case-insensitive)
tags: [smoke]
- id: exit_code_failure
description: exit code is 1 when message field is missing
- id: skipped_example
skip: true
tags: [wip]
judge:
cmd: .venv/bin/python judge.py # optional — omit for output-only mode
grader:
cmd: .venv/bin/python grader.py # optional — omit to skip aggregation
Each case has a directory under inputs/{id}/ (and optionally expected/{id}/) holding whatever files that case needs.
judge.py reads the payload from TRAPTASK_PAYLOAD, evaluates one case, prints a JSON metric to stdout:
import json, os, re
from pathlib import Path
data = json.loads(os.environ["TRAPTASK_PAYLOAD"])
stdout = Path(data["outputs"]["case_stdout"]).read_text().strip()
exit_code = json.loads(Path(data["outputs"]["case_meta.json"]).read_text())["exit_code"]
expected = json.loads(Path(data["expected"]["expected.json"]).read_text())
# … compute results …
print(json.dumps({"score": score}))
grader.py receives the list of all case results and emits the overall verdict:
import json, os
results = json.loads(os.environ["TRAPTASK_PAYLOAD"])
passed = all(r["metrics"]["score"] == 1.0 for r in results)
print(json.dumps({"passed": passed, "score": avg_score}))
3. Run it
From examples/echo/solution/:
uv run tp run # run the first task in trap.yaml
uv run tp run test # run the named task
uv run tp run -t smoke # only run cases tagged `smoke`
uv run tp run --output json # print machine-readable JSON instead of a rich table
uv run tp run --fail-fast # stop on first case whose judge score < 1.0
Trap writes per-run artifacts under .trap/{task}/{timestamp}/ and updates a latest symlink alongside it.
CLI reference
tp run [TASK] [OPTIONS] # execute a task
tp report [TASK] [RUN] # re-print the report for a stored run (defaults to `latest`)
tp init # scaffold trap.yaml + traptask.yaml — NOT YET IMPLEMENTED
tp run options
| Flag | Default | Purpose |
|---|---|---|
TASK (positional) |
first task in trap.yaml |
which task to run |
--config / -c |
trap.yaml |
path to the trap config |
--tag / -t |
(none) | filter cases by tag; repeatable |
--output / -o |
rich |
report renderer: rich or json |
--fail-fast |
false |
stop after the first case whose judge score < 1.0 |
--workspace / -w |
.trap |
where to write run artifacts |
tp report options
Re-renders a previously stored run from disk. Same --config / --output / --workspace flags; the RUN argument is the timestamp directory name, or latest (default).
Exit codes
0— every case exited 0, and (if a grader ran)metrics.passedis notFalse.1— at least one case had a non-zero exit code, or the grader returned{"passed": false}.
Configuration reference
trap.yaml (solution author)
tasks:
test: # task name; arbitrary
description: optional # shown in the report
cmd: uv run python solution.py
traptask: ../task # path to the task dir (contains traptask.yaml)
inputs: # optional
stdin: input.txt # filename in inputs/{case_id}/ to pipe as stdin
files: # optional: filenames to assert exist before running
- config.json
file_outputs: # files the solution promises to write
- result.json
timeout: 30 # seconds; default 30
inputs_envvar: INPUTS # override the env var name if you want
outputs_envvar: OUTPUTS
run: # second task; same traptask, different cmd or inputs
cmd: uv run python solution.py
traptask: ../task
inputs:
stdin: input.txt
file_outputs:
- result.json
tasks:is a mapping; each key is a task name you can pass totp run.traptaskis required for every task — it points at the directory containingtraptask.yaml.cmdis parsed viashlex.split, run with the trap.yaml's directory ascwd.
traptask.yaml (task author)
The entire file is optional. If traptask.yaml is absent, trap scans inputs/ and treats each subdirectory as a case in output-only mode (no judge, no grader, no expected). With it:
dirs:
inputs: inputs/ # default
expected: expected/ # default
cases:
- id: contains_basic # must match an inputs/<id>/ directory
description: optional
tags: [smoke] # for `tp run -t smoke`
- id: skipped_example
skip: true # case is not executed
judge: # optional; omit for output-only mode
cmd: .venv/bin/python judge.py
payload_envvar: TRAPTASK_PAYLOAD # default; override if you must
grader: # optional; omit to skip aggregation
cmd: .venv/bin/python grader.py
judge.cmd and grader.cmd run with cwd set to the task directory.
The IO contract
Trap injects environment variables at three points. Values are always JSON strings (not file paths) so consumers can json.loads(os.environ[…]) directly.
Solution-side: INPUTS and OUTPUTS
Before running each case, trap injects:
// INPUTS — every file in inputs/{case_id}/
{
"input.json": "/abs/path/task/inputs/contains_basic/input.json",
"config.json": "/abs/path/task/inputs/contains_basic/config.json"
}
// OUTPUTS — every filename declared in trap.yaml `file_outputs`
{
"result.json": "/abs/path/.trap/test/2026-05-09T14:30:00/contains_basic/result.json"
}
Keys are full filenames with extension. Values are absolute paths. The solution reads INPUTS["foo.json"], writes to OUTPUTS["result.json"]. If you have nothing to read or write via files you can still receive content on stdin via inputs.stdin in trap.yaml.
stdout, stderr, and meta.json are captured automatically — the solution never writes to OUTPUTS["case_stdout"] itself.
Judge-side: TRAPTASK_PAYLOAD
For each case, the judge receives a JSON string with three namespaces:
{
"inputs": { "input.json": "/abs/path/task/inputs/case1/input.json" },
"outputs": {
"case_stdout": "/abs/path/.trap/test/.../case1/case_stdout",
"case_stderr": "/abs/path/.trap/test/.../case1/case_stderr",
"case_meta.json": "/abs/path/.trap/test/.../case1/case_meta.json"
},
"expected": { "expected.json": "/abs/path/task/expected/case1/expected.json" }
}
Note the captured outputs are keyed case_stdout, case_stderr, case_meta.json (prefixed) — that's what the runner writes to disk and what the example judge.py reads. case_meta.json contains {"exit_code": N, "duration": seconds}.
The judge prints free-form JSON to stdout. Trap stores it verbatim as CaseResult.metrics. Convention: include a numeric score field if you want --fail-fast to be meaningful (it checks metrics.score < 1.0).
Grader-side: TRAPTASK_PAYLOAD
The grader receives the full list of per-case results as JSON:
[
{"case_id": "contains_basic", "exit_code": 0, "duration": 0.12, "metrics": {"score": 1.0}, "skipped": false},
{"case_id": "exact_match", "exit_code": 0, "duration": 0.11, "metrics": {"score": 0.5}, "skipped": false}
]
It prints free-form JSON to stdout. Convention: include passed: bool (used for the exit-code check) and score: float (used in the report header).
The .trap/ workspace
.trap/
└── {task_name}/
├── latest -> 2026-05-09T14:30:00/
└── 2026-05-09T14:30:00/
├── {case_id}/
│ ├── case_stdout
│ ├── case_stderr
│ ├── case_meta.json # {"exit_code": 0, "duration": 0.12}
│ ├── judge_stdout # raw judge output (if judge ran)
│ ├── judge_stderr
│ ├── judge_meta.json
│ └── {any declared file_outputs}
├── grader_stdout # raw grader output (if grader ran)
├── grader_stderr
├── grader_meta.json
└── report.json # the rendered/JSON report for this run
Use tp report to re-display a stored run without re-executing the solution.
Three running modes
Choose by what you put (or don't put) in traptask.yaml:
| Mode | judge |
grader |
Pass/fail signal |
|---|---|---|---|
| Output-only | absent | absent | any case exit_code != 0 → exit 1 |
| Per-case scoring | present | absent | same as above |
| Full evaluation | present | present | grader's metrics.passed == false → exit 1 |
In output-only mode you can even omit traptask.yaml entirely — trap will discover cases by scanning inputs/ subdirectories.
Current limitations
tp initis a stub; scaffolding is not implemented yet.- Cases run sequentially (the
TaskRunner._itergenerator is deliberately left as a seam for future parallelization). - No PyPI release; install from source.
- Python 3.14 minimum.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file trapstreet_cli-0.2.0.tar.gz.
File metadata
- Download URL: trapstreet_cli-0.2.0.tar.gz
- Upload date:
- Size: 18.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c2d19c209fd290187b034a186bd4fbd108b46ace35e6feb12340b4c5a17bade6
|
|
| MD5 |
9516b62bd07973f391e7053369db766d
|
|
| BLAKE2b-256 |
3440ef81eee44f6d84dc4f24ac6fcf89604acd4d83ac244245849c7c85fdabe3
|
File details
Details for the file trapstreet_cli-0.2.0-py3-none-any.whl.
File metadata
- Download URL: trapstreet_cli-0.2.0-py3-none-any.whl
- Upload date:
- Size: 23.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
61eaf9d7354ce72d8459452d8871f71cad531d646cdb20e0c05fb1743eb0f280
|
|
| MD5 |
6ea9dcb99f3458399fe88eec3998ae8d
|
|
| BLAKE2b-256 |
23afe7d4bb715eb2461b16e757ffdb2dd3ee8eaa09a2eb2f7d4eb819800d18ce
|