Sieve
Transparent feedback compression middleware for LLM coding agents. Sieve sits between an agent and its tools, parsing tool output and emitting a compact form before it enters the conversation context.
83.9% of tokens in coding-agent trajectories are tool observations (JetBrains, NeurIPS 2025). Most of those are re-read on every subsequent turn. Sieve targets that bloat by parsing — not truncating — the output of common dev tools, then diffing against prior turns so the agent only sees what changed.
Full design in docs/agent-compress-specs.md.
What's in the box
| Component | Details |
|---|---|
| Parsers | pytest, Python traceback, mypy, tsc, eslint, gcc/clang, pip, generic fallback |
| Output formats | plain, structured (JSON), XML, minimal |
| Integrations | MCP proxy, SWE-bench Lite paired Cursor / Codex runners |
| Dependencies | none for the library; mcp extra for the proxy (Python ≥ 3.11) |
Measured compression
Benchmark over the 22-sample fixture corpus in tests/fixtures/ (uv run python -m benchmarks.run):
| Category | Samples | Raw | Compressed | Ratio |
|---|---|---|---|---|
| pytest | 7 | 13,475 | 1,292 | 90.4% |
| pip | 2 | 9,505 | 138 | 98.5% |
| runtime | 6 | 3,150 | 1,068 | 66.1% |
| gcc | 1 | 1,318 | 721 | 45.3% |
| tsc | 2 | 1,264 | 708 | 44.0% |
| generic | 1 | 480 | 433 | 9.8% |
| eslint | 1 | 844 | 780 | 7.6% |
| mypy | 2 | 758 | 756 | 0.3% |
| total | 22 | 30,794 | 5,896 | 80.9% |
A 5-turn delta scenario in which the same pytest failure repeats reaches an 86.3% cumulative reduction: turn 1 is 297 chars, and turns 2-5 collapse to 238 chars each ("PYTEST DELTA: unchanged" plus the still-failing nodeids).
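The cumulative figure can be reproduced by hand. The sketch below assumes each of the five turns produces the same 1,818-char raw output as the illustrative pytest run in the "What it looks like" section; that raw size is an assumption here, and the actual fixture may differ. The `reduction` helper is a hypothetical name, not part of Sieve's API.

```python
def reduction(raw_chars: int, compressed_chars: int) -> float:
    """Fraction of characters removed: 1 - compressed / raw."""
    return 1.0 - compressed_chars / raw_chars


# Assumption: every turn emits the same 1,818-char raw pytest output.
RAW_PER_TURN = 1_818
TURNS = 5

raw_total = RAW_PER_TURN * TURNS      # 9,090 chars without compression
compressed_total = 297 + 4 * 238      # turn 1 in full, then four deltas

cumulative = reduction(raw_total, compressed_total)
print(f"{cumulative:.1%}")  # 86.3%

# The same formula applied to the "total" row of the fixture-corpus table.
corpus_total = reduction(30_794, 5_896)
print(f"{corpus_total:.1%}")  # 80.9%
```

Under that assumption the numbers line up: 1,249 compressed chars against 9,090 raw chars.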
The compressor enforces a never-larger-than-raw invariant on its default plain output: on inputs already smaller than the framing overhead (mypy clean output, ESLint with terse messages, etc.) it passes the raw text through unchanged. The low ratios on those categories aren't a bug — there's nothing to compress; the structured items are still extracted and used for cross-turn delta dedup. (The structured and xml formats wrap the result in a JSON/XML envelope, so on tiny inputs the framed output can exceed the raw size — they trade bytes for machine-readability.)
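One way to picture the never-larger-than-raw invariant is as a final guard applied after parsing. This is a minimal sketch, not Sieve's implementation: `enforce_never_larger` and the sample strings are hypothetical and only illustrate the pass-through behaviour described above.

```python
def enforce_never_larger(raw: str, compressed: str) -> str:
    """Return the compressed text only when it is strictly shorter than raw."""
    return compressed if len(compressed) < len(raw) else raw


# Tiny input: the framed summary is longer than the raw text, so raw wins.
clean_raw = "Success: no issues found in 3 source files\n"
clean_framed = "MYPY: 0 errors (3 files checked)\n[no diagnostics]\n"
tiny_result = enforce_never_larger(clean_raw, clean_framed)

# Noisy input: the summary is much shorter, so the summary wins.
noisy_raw = "Collecting requests\n" * 20
noisy_summary = "PIP: installed requests\n"
noisy_result = enforce_never_larger(noisy_raw, noisy_summary)

print(tiny_result == clean_raw, noisy_result == noisy_summary)
```

A guard of this shape explains why already-terse categories show ratios near 0%: the output is simply the input.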
End-to-end agent run (SWE-bench Lite, Cursor Composer-2)
Paired baseline-vs-sieve trial with Cursor CLI / Composer-2, scored by the official swebench.harness.run_evaluation Docker harness.
| | baseline | sieve |
|---|---|---|
| instances scored | 4 | 4 |
| resolved | 2 | 2 |
| resolve rate (of scored) | 50.0% | 50.0% |
| patch chars | 21,242 | 19,956 |
| agent-facing chars | 47,688 | 11,613 |
| raw chars | 47,688 | 40,416 |
| compression ratio | 0% | 71.3% |
Resolve rate is unchanged, and agent-facing context drops 75.6% relative to baseline (47,688 → 11,613 chars). The 71.3% compression ratio in the table is measured against the sieve run's own raw output (40,416 → 11,613 chars). This is a small-N pilot (4 instances scored), so treat it as a directional signal, not a statistically robust result. The fixture-corpus numbers above (N=22) are the methodologically clean measurement. Reproduce with:
bash scripts/run_cursor_swe_bench_profiles.sh --resume \
    --eval-with-harness --harness-namespace none
PYTHONPATH=src python3 -m benchmarks.swe_bench_compare \
    --baseline artifacts/cursor-swe-bench-lite.baseline.jsonl \
    --sieve artifacts/cursor-swe-bench-lite.sieve.jsonl
CI-Repair-Bench (noisy-logs benchmark)
CI-Repair-Bench is built from real GitHub Actions failures (workflow YAML + long logs). We measure observation compression over the ci-benchmark-user/ci-repair-bench dataset; each observation is workflow + flattened logs with the gold diff excluded, so the metric is diagnostic bulk, not patch leakage. For repair scoring use the paper's upstream harness.
uv sync --group swe-eval # datasets
uv run python -m benchmarks.ci_repair_bench --compare --json # all 567 rows
Installation
The library has no runtime dependencies:
pip install agent-sieve # library + `sieve` / `sieve-run` CLIs
pip install 'agent-sieve[mcp]' # add the MCP proxy
After install, wire the hooks into your agent harness:
sieve config claude --write # Claude Code PreToolUse (Bash) hook
sieve config cursor --write # Cursor preToolUse (Shell) hook
sieve config claude --status # check what's installed
Both config commands print a dry-run preview by default; add --write to apply
(a .sieve.bak backup is made first) and --uninstall to remove.
Quick start
uv run python -m unittest discover -s tests -v
uv run python -m benchmarks.run # compression report on fixture corpus
SWE-bench Lite (paired baseline-vs-sieve trials)
Wraps the official swebench.harness.run_evaluation Docker harness. The paired runner builds each instance's harness-container, mounts the workspace at /testbed, runs the agent, and emits a predictions.jsonl per profile. The harness then rewrites each row with the authoritative resolved value.
uv sync --group swe-eval
bash scripts/run_cursor_swe_bench_profiles.sh \
    --manifest benchmarks/manifests/lite_smoke.jsonl \
    --resume --eval-with-harness --harness-namespace none
PYTHONPATH=src python3 -m benchmarks.swe_bench_compare \
    --baseline artifacts/cursor-swe-bench-lite.baseline.jsonl \
    --sieve artifacts/cursor-swe-bench-lite.sieve.jsonl
Notes:
- --harness-namespace none reuses locally-built sweb.eval.x86_64.<id>:latest images instead of pulling swebench/... from Docker Hub.
- The runner deletes per-instance images and workspaces after each row, and passes --clean True --cache_level base to the harness. Pass --keep-images / --keep-workspaces to retain them.
- Swap cursor for codex in the script name to use the Codex CLI agent instead.
- The trajectory-only variant (benchmarks.swe_bench_lite, which replays *.traj files for token counts without re-running tests) is still available; it does not run the harness.
benchmarks.swe_bench_compare reports three resolve-rate views: overall (resolved / instances), of-scored (resolved / scored), and scored-rate (scored / instances).
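The three views differ only in their denominators. The sketch below restates the definitions in code; `ResolveRates` and `resolve_rates` are hypothetical names, not the module's actual API, and the instance count of 5 is invented for illustration (the pilot above reports 4 scored and 2 resolved).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolveRates:
    overall: float      # resolved / instances
    of_scored: float    # resolved / scored
    scored_rate: float  # scored / instances


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning 0.0 when there is nothing to divide by."""
    return numerator / denominator if denominator else 0.0


def resolve_rates(instances: int, scored: int, resolved: int) -> ResolveRates:
    return ResolveRates(
        overall=_ratio(resolved, instances),
        of_scored=_ratio(resolved, scored),
        scored_rate=_ratio(scored, instances),
    )


# Hypothetical run: 5 instances attempted, 4 scored by the harness, 2 resolved.
rates = resolve_rates(instances=5, scored=4, resolved=2)
print(rates)
```

Reporting all three keeps an unscored instance (for example, a harness failure) from silently inflating or deflating the headline rate.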
Usage
Direct compression
from sieve import CompressSession

session = CompressSession()
# raw_stdout / raw_stderr hold the captured output of the tool run
result = session.compress(
    command="pytest tests/",
    stdout=raw_stdout,
    stderr=raw_stderr,
    exit_code=1,
)
print(result.text)  # send this to the LLM
print(result.stats.compression_ratio)
CompressSession keeps state across calls, so the second compress(...) for the same test suite emits a delta against the first.
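To make the cross-call behaviour concrete, here is a toy model of session-held delta state. It is a sketch under stated assumptions, not how CompressSession is implemented: `ToyDeltaSession` is hypothetical and tracks only a set of failing test IDs per command.

```python
class ToyDeltaSession:
    """Remembers the failing test IDs last seen for each command."""

    def __init__(self) -> None:
        self._previous: dict[str, set[str]] = {}

    def report(self, command: str, failing: set[str]) -> str:
        prior = self._previous.get(command)
        self._previous[command] = set(failing)
        if prior is None:
            # First time this command is seen: emit the full failure list.
            return "\n".join(f"FAIL {t}" for t in sorted(failing))
        # Later turns: emit only what changed, plus what is still broken.
        lines = [f"PASS {t} now passes" for t in sorted(prior - failing)]
        lines += [f"NEW FAIL {t}" for t in sorted(failing - prior)]
        lines += [f"STILL FAIL {t}" for t in sorted(failing & prior)]
        return "\n".join(lines)


toy = ToyDeltaSession()
turn1 = toy.report("pytest tests/", {"test_user_update", "test_user_delete"})
turn2 = toy.report("pytest tests/", {"test_user_delete"})
print(turn1)
print(turn2)
```

The second call no longer repeats the full report; it names the test that started passing and the one that still fails.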
Decorator wrapper
import subprocess
from sieve import wrap_tool

@wrap_tool
def run_bash(command: str) -> tuple[str, str, int]:
    p = subprocess.run(command, shell=True, capture_output=True, text=True)
    return p.stdout, p.stderr, p.returncode
run_bash(...) now returns the compressed string. Session state is held by the decorator.
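For readers curious how a decorator can hold session state, the following is a minimal sketch of the pattern using a closure. `toy_wrap_tool` is hypothetical and its "compression" is deliberately trivial (it only collapses exact repeats); it is not Sieve's wrap_tool.

```python
import functools
from typing import Callable

ToolResult = tuple[str, str, int]  # (stdout, stderr, exit code)


def toy_wrap_tool(func: Callable[[str], ToolResult]) -> Callable[[str], str]:
    last_output: dict[str, str] = {}  # state shared by every call through this wrapper

    @functools.wraps(func)
    def wrapper(command: str) -> str:
        stdout, stderr, code = func(command)
        text = f"exit={code}\n{stdout}{stderr}"
        if last_output.get(command) == text:
            # Identical to the previous run of this command: emit a short marker.
            return f"exit={code} (output unchanged since last run)"
        last_output[command] = text
        return text

    return wrapper


@toy_wrap_tool
def fake_tool(command: str) -> ToolResult:
    # Stand-in for a real subprocess call so the example is deterministic.
    return "1 failed\n", "", 1


first = fake_tool("pytest")
second = fake_tool("pytest")
print(first)
print(second)
```

Because the dictionary lives in the closure, the wrapped function keeps its plain call signature while still remembering earlier turns.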
Configuration
from sieve import CompressConfig, CompressSession, OutputFormat

session = CompressSession(CompressConfig(
    format=OutputFormat.STRUCTURED,  # plain | structured | xml | minimal
    delta_mode=True,
    include_pattern_hints=True,
    max_raw_lines=50,
))
MCP proxy (no application code changes)
sieve.integrations.mcp is an MCP proxy that wraps any upstream MCP server. The agent talks to the proxy as if it were the upstream; the proxy forwards tools/list and tools/call, and runs every TextContent block in the result through a shared CompressSession before returning it. A single session lives for the proxy's lifetime, so cross-tool delta compression works.
Install the optional dep:
pip install 'agent-sieve[mcp]'
Configure your MCP client (Claude Desktop, Cursor, Continue, etc.) to launch the proxy in place of the upstream. Example wrapping the official server-everything demo server (npx downloads it on first run):
// Claude Desktop's mcp_servers config / Cursor .cursor/mcp.json
{
  "mcpServers": {
    "sieve-demo": {
      "command": "python",
      "args": [
        "-m", "sieve.integrations.mcp",
        "--",
        "npx", "-y", "@modelcontextprotocol/server-everything"
      ]
    }
  }
}
Use whatever upstream you already rely on (e.g. @modelcontextprotocol/server-filesystem with the paths your client expects); there is no published @modelcontextprotocol/server-bash on npm.
That's the entire integration: the agent sees the same upstream tools, but every TextContent result has been compressed.
For non-shell tools, the proxy uses the tool name as the parser-router hint; for shell-like tools that take a command / cmd / shellCommand argument, the actual command string is forwarded so parser detection (pytest, mypy, etc.) works correctly.
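The routing rule above can be sketched as a small lookup. This is an illustration of the described behaviour only: `parser_hint` and `COMMAND_KEYS` are hypothetical names, and the proxy's real logic may handle more cases.

```python
from typing import Any, Mapping

# Argument names that the text above says carry a shell command.
COMMAND_KEYS = ("command", "cmd", "shellCommand")


def parser_hint(tool_name: str, arguments: Mapping[str, Any]) -> str:
    """Prefer the actual command string; fall back to the tool name."""
    for key in COMMAND_KEYS:
        value = arguments.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return tool_name


shell_hint = parser_hint("run_shell", {"command": "pytest tests/"})
file_hint = parser_hint("read_file", {"path": "src/app.py"})
print(shell_hint)
print(file_hint)
```

Forwarding the command string matters because a generic tool name such as a shell runner says nothing about whether the output came from pytest, mypy, or something else.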
What it looks like
Illustrative example (hand-written to show the shape of the transform, not a captured fixture):
Raw pytest run with two failures (1,818 chars):
============================= test session starts ==============================
platform linux -- Python 3.12.0, pytest-8.1.1, pluggy-1.4.0
... [40 lines of header + per-test output] ...
=================================== FAILURES ===================================
________________________________ test_user_update ________________________________
    def test_user_update(self):
        ...
>       assert response.status_code == 200
E       AssertionError: assert 403 == 200
tests/test_views.py:89: AssertionError
... [equivalent block for test_user_delete] ...
=========================== short test summary info ============================
FAILED tests/test_views.py::TestUserViewSet::test_user_update - AssertionError
FAILED tests/test_views.py::TestUserViewSet::test_user_delete - AssertionError
========================= 2 failed, 140 passed, 0 warnings ====================
After Sieve (297 chars, 83.7% reduction):
PYTEST: 2 failed, 140 passed (142 total)
FAIL tests/test_views.py::TestUserViewSet::test_user_update (test_views.py:89)
expected 200, got 403
FAIL tests/test_views.py::TestUserViewSet::test_user_delete (test_views.py:102)
expected 204, got 403
Pattern: All failures return 403 in test_views.py
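As a rough idea of what "parsing, not truncating" involves, the sketch below pulls the failing node IDs and the pass/fail counts out of a pytest short summary with regular expressions. It is a hand-rolled illustration with hypothetical names (`summarize`, `FAILED_RE`, `COUNT_RE`), not Sieve's pytest parser, and it does not attempt the expected/actual extraction or pattern hints shown above.

```python
import re

FAILED_RE = re.compile(r"^FAILED (\S+) - (.+)$", re.MULTILINE)
COUNT_RE = re.compile(r"(\d+) (failed|passed)")


def summarize(raw: str) -> str:
    failures = FAILED_RE.findall(raw)
    # Read the counts from the final summary line only.
    summary_line = raw.strip().splitlines()[-1]
    counts = {label: int(n) for n, label in COUNT_RE.findall(summary_line)}
    failed = counts.get("failed", 0)
    passed = counts.get("passed", 0)
    lines = [f"PYTEST: {failed} failed, {passed} passed ({failed + passed} total)"]
    lines += [f"FAIL {nodeid} ({reason})" for nodeid, reason in failures]
    return "\n".join(lines)


raw = """\
=========================== short test summary info ============================
FAILED tests/test_views.py::TestUserViewSet::test_user_update - AssertionError
FAILED tests/test_views.py::TestUserViewSet::test_user_delete - AssertionError
========================= 2 failed, 140 passed, 0 warnings ====================
"""
summary = summarize(raw)
print(summary)
```

Even this crude version keeps every failing test ID while dropping the banner lines, which is the property truncation cannot guarantee.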
On the next turn, the agent fixes test_user_update and re-runs:
PYTEST DELTA (turn 2)
PASS tests/test_views.py::TestUserViewSet::test_user_update now passes
STILL FAIL tests/test_views.py::TestUserViewSet::test_user_delete (line 102) - expected 204, got 403
Result: 1 failed, 141 passed (142 total)