Transparent feedback compression middleware for LLM coding agents.

Sieve

Transparent feedback compression middleware for LLM coding agents. Sieve sits between an agent and its tools, parsing tool output and emitting a compact form before it enters the conversation context.

83.9% of tokens in coding-agent trajectories are tool observations (JetBrains, NeurIPS 2025). Most of those are re-read on every subsequent turn. Sieve targets that bloat by parsing — not truncating — the output of common dev tools, then diffing against prior turns so the agent only sees what changed.

Full design in docs/agent-compress-specs.md.

What's in the box

Parsers         pytest, Python traceback, mypy, tsc, eslint, gcc/clang, pip, generic fallback
Output formats  plain, structured (JSON), XML, minimal
Integrations    MCP proxy, SWE-bench Lite paired Cursor / Codex runners
Dependencies    none for the library; mcp extra for the proxy (Python ≥ 3.11)

Measured compression

Benchmark over the 22-sample fixture corpus in tests/fixtures/ (uv run python -m benchmarks.run):

Category   Samples      Raw   Compressed   Ratio
pytest           7   13,475        1,292   90.4%
pip              2    9,505          138   98.5%
runtime          6    3,150        1,068   66.1%
gcc              1    1,318          721   45.3%
tsc              2    1,264          708   44.0%
generic          1      480          433    9.8%
eslint           1      844          780    7.6%
mypy             2      758          756    0.3%
total           22   30,794        5,896   80.9%

In a 5-turn delta scenario where the same pytest failure repeats, cumulative compression reaches 86.3%: turn 1 is 297 chars, and turns 2-5 shrink to 238 chars each ("PYTEST DELTA: unchanged" plus the still-failing nodeids).
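The cumulative figure can be reproduced with a few lines of arithmetic. This is a sketch under one assumption: each turn's raw output is taken to be the 1,818-char illustrative pytest run shown under "What it looks like", which this README does not state explicitly for the delta scenario. The `reduction` helper is a hypothetical name, not part of Sieve.

```python
# RAW_PER_TURN is an assumption (the 1,818-char illustrative run); the other
# numbers are the per-turn sizes quoted in the paragraph above.
RAW_PER_TURN = 1818
TURN_SIZES = [297, 238, 238, 238, 238]  # turn 1, then four delta turns


def reduction(raw_chars: int, compressed_chars: int) -> float:
    """Fraction of characters removed, e.g. 0.9 means 90% smaller."""
    return 1 - compressed_chars / raw_chars


cumulative = reduction(RAW_PER_TURN * len(TURN_SIZES), sum(TURN_SIZES))
print(f"{cumulative:.1%}")  # 86.3%
```

Under that assumption the agent sees 1,249 chars over five turns instead of 9,090.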

The compressor enforces a never-larger-than-raw invariant on its default plain output: on inputs already smaller than the framing overhead (mypy clean output, ESLint with terse messages, etc.) it passes the raw text through unchanged. The low ratios on those categories aren't a bug — there's nothing to compress; the structured items are still extracted and used for cross-turn delta dedup. (The structured and xml formats wrap the result in a JSON/XML envelope, so on tiny inputs the framed output can exceed the raw size — they trade bytes for machine-readability.)
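One way to picture the never-larger-than-raw invariant is a final guard that compares lengths and falls back to the raw text. The function below is a minimal sketch with a hypothetical name; how Sieve actually implements the check is not shown in this README.

```python
def never_larger(raw: str, compressed: str) -> str:
    """Return the compressed text only when it is strictly shorter than raw.

    Ties go to the raw text here; that tie-break is a choice made for this
    sketch, not documented Sieve behaviour.
    """
    return compressed if len(compressed) < len(raw) else raw


# A tiny input where any framing would add bytes: the raw text passes through.
raw = "Success: no issues found in 1 source file"
framed = "MYPY: clean\n" + raw
print(never_larger(raw, framed) == raw)  # True
```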

End-to-end agent run (SWE-bench Lite, Cursor Composer-2)

Paired baseline-vs-sieve trial with Cursor CLI / Composer-2, scored by the official swebench.harness.run_evaluation Docker harness.

Metric                      baseline     sieve
instances scored                   4         4
resolved                           2         2
resolve rate (of scored)       50.0%     50.0%
patch chars                   21,242    19,956
agent-facing chars            47,688    11,613
raw chars                     47,688    40,416
compression ratio                 0%     71.3%

Resolve rate is unchanged and agent-facing context drops 75.6% (47.7k → 11.6k chars). This is a small-N pilot (4 instances scored), so treat it as a directional signal, not a statistically robust result. The fixture-corpus numbers above (N=22) are the methodologically clean measurement. Reproduce with:

bash scripts/run_cursor_swe_bench_profiles.sh --resume \
  --eval-with-harness --harness-namespace none
PYTHONPATH=src python3 -m benchmarks.swe_bench_compare \
  --baseline artifacts/cursor-swe-bench-lite.baseline.jsonl \
  --sieve    artifacts/cursor-swe-bench-lite.sieve.jsonl
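The two percentages quoted for this run use different denominators, which is easy to miss. The numbers in the table are consistent with the following arithmetic (variable names are illustrative only):

```python
# Values copied from the paired-run table above.
baseline_agent_facing = 47_688
sieve_agent_facing = 11_613
sieve_raw = 40_416

# Context drop: sieve arm compared against what the baseline agent saw.
context_drop = 1 - sieve_agent_facing / baseline_agent_facing
# Compression ratio: sieve arm compared against its own raw tool output.
compression_ratio = 1 - sieve_agent_facing / sieve_raw

print(f"{context_drop:.1%} {compression_ratio:.1%}")  # 75.6% 71.3%
```

The raw totals differ between arms (47,688 vs 40,416) because each arm is a separate agent run that issued its own tool calls.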

CI-Repair-Bench (noisy-logs benchmark)

CI-Repair-Bench is built from real GitHub Actions failures (workflow YAML + long logs). We measure observation compression over the ci-benchmark-user/ci-repair-bench dataset; each observation is the workflow plus flattened logs with the gold diff excluded, so the metric reflects diagnostic bulk, not patch leakage. For repair scoring, use the paper's upstream harness.

uv sync --group swe-eval   # datasets
uv run python -m benchmarks.ci_repair_bench --compare --json   # all 567 rows

Installation

The library has no runtime dependencies:

pip install agent-sieve            # library + `sieve` / `sieve-run` CLIs
pip install 'agent-sieve[mcp]'     # add the MCP proxy

After install, wire the hooks into your agent harness:

sieve config claude --write        # Claude Code PreToolUse (Bash) hook
sieve config cursor --write        # Cursor preToolUse (Shell) hook
sieve config claude --status       # check what's installed

Both config commands print a dry-run preview by default; add --write to apply (a .sieve.bak backup is made first) and --uninstall to remove.

Quick start

uv run python -m unittest discover -s tests -v
uv run python -m benchmarks.run                   # compression report on fixture corpus

SWE-bench Lite (paired baseline-vs-sieve trials)

Wraps the official swebench.harness.run_evaluation Docker harness. The paired runner builds each instance's harness container, mounts the workspace at /testbed, runs the agent, and emits a predictions.jsonl per profile. The harness then rewrites each row with the authoritative resolved value.

uv sync --group swe-eval

bash scripts/run_cursor_swe_bench_profiles.sh \
  --manifest benchmarks/manifests/lite_smoke.jsonl \
  --resume --eval-with-harness --harness-namespace none

PYTHONPATH=src python3 -m benchmarks.swe_bench_compare \
  --baseline artifacts/cursor-swe-bench-lite.baseline.jsonl \
  --sieve    artifacts/cursor-swe-bench-lite.sieve.jsonl

Notes:

  • --harness-namespace none reuses locally-built sweb.eval.x86_64.<id>:latest images instead of pulling swebench/... from Docker Hub.
  • The runner deletes per-instance images and workspaces after each row, and passes --clean True --cache_level base to the harness. Pass --keep-images / --keep-workspaces to retain them.
  • Swap cursor for codex in the script name to use the Codex CLI agent instead.
  • The trajectory-only variant (benchmarks.swe_bench_lite, replays *.traj files for token counts without re-running tests) is still available; it does not run the harness.

benchmarks.swe_bench_compare reports three resolve-rate views: overall (resolved / instances), of-scored (resolved / scored), and scored-rate (scored / instances).
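The three views differ only in their denominators. As a sketch of the definitions above (the `ResolveRates` and `resolve_rates` names are hypothetical, not the module's API, and the zero-denominator handling is an assumption):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolveRates:
    overall: float      # resolved / instances
    of_scored: float    # resolved / scored
    scored_rate: float  # scored / instances


def resolve_rates(instances: int, scored: int, resolved: int) -> ResolveRates:
    """Compute the three views; a zero denominator yields 0.0 in this sketch."""
    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return ResolveRates(
        overall=ratio(resolved, instances),
        of_scored=ratio(resolved, scored),
        scored_rate=ratio(scored, instances),
    )


# Hypothetical run: 5 instances attempted, 4 scored by the harness, 2 resolved.
print(resolve_rates(instances=5, scored=4, resolved=2))
```

The of-scored view is the one reported in the pilot table (2 of 4 scored, 50.0%); the overall view is lower whenever some instances fail to produce a scorable patch.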

Usage

Direct compression

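import subprocess

from sieve import CompressSession

# Capture the tool output however your harness does; subprocess is one option.
proc = subprocess.run(["pytest", "tests/"], capture_output=True, text=True)

session = CompressSession()
result = session.compress(
    command="pytest tests/",
    stdout=proc.stdout,
    stderr=proc.stderr,
    exit_code=proc.returncode,
)
print(result.text)              # send this to the LLM
print(result.stats.compression_ratio)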

CompressSession keeps state across calls, so the second compress(...) for the same test suite emits a delta against the first.

Decorator wrapper

import subprocess
from sieve import wrap_tool

@wrap_tool
def run_bash(command: str) -> tuple[str, str, int]:
    p = subprocess.run(command, shell=True, capture_output=True, text=True)
    return p.stdout, p.stderr, p.returncode

run_bash(...) now returns the compressed string. Session state is held by the decorator.
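To make "session state is held by the decorator" concrete, here is a toy decorator that closes over one stateful object. It is an illustration of the pattern only: `ToySession` and `wrap_tool_sketch` are hypothetical stand-ins, and nothing here reflects how sieve.wrap_tool is actually implemented.

```python
import functools
from typing import Callable


class ToySession:
    """Stand-in for a stateful compressor; NOT Sieve's CompressSession."""

    def __init__(self) -> None:
        self.calls = 0

    def compress(self, stdout: str, stderr: str, exit_code: int) -> str:
        self.calls += 1
        lines = [ln for ln in (stdout + stderr).splitlines() if ln.strip()]
        return f"[call {self.calls}] exit={exit_code} lines={len(lines)}"


def wrap_tool_sketch(func: Callable[..., tuple[str, str, int]]) -> Callable[..., str]:
    session = ToySession()  # created once per decorated function, shared by every call

    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> str:
        stdout, stderr, exit_code = func(*args, **kwargs)
        return session.compress(stdout, stderr, exit_code)

    return wrapper


@wrap_tool_sketch
def fake_tool(command: str) -> tuple[str, str, int]:
    return "line one\nline two\n", "", 0


print(fake_tool("echo"))  # [call 1] exit=0 lines=2
print(fake_tool("echo"))  # [call 2] exit=0 lines=2
```

Because the session lives in the closure, state carries across calls without the caller passing it around.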

Configuration

from sieve import CompressConfig, CompressSession, OutputFormat

session = CompressSession(CompressConfig(
    format=OutputFormat.STRUCTURED,   # plain | structured | xml | minimal
    delta_mode=True,
    include_pattern_hints=True,
    max_raw_lines=50,
))

MCP proxy (no application code changes)

sieve.integrations.mcp is an MCP proxy that wraps any upstream MCP server. The agent talks to the proxy as if it were the upstream; the proxy forwards tools/list and tools/call, and runs every TextContent block in the result through a shared CompressSession before returning it. A single session lives for the proxy's lifetime, so cross-tool delta compression works.
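The "compress every TextContent block, leave the rest alone" step can be sketched over plain dictionaries shaped like MCP content blocks (a "type" field, plus "text" for text blocks). This is an illustration with hypothetical names; the real proxy works with the MCP SDK's types and a CompressSession, neither of which is used here.

```python
from typing import Any, Callable


def compress_text_blocks(
    content: list[dict[str, Any]],
    compress: Callable[[str], str],
) -> list[dict[str, Any]]:
    """Apply `compress` to text blocks; pass every other block through untouched."""
    out: list[dict[str, Any]] = []
    for block in content:
        if block.get("type") == "text":
            out.append({**block, "text": compress(block["text"])})
        else:
            out.append(block)
    return out


def drop_blank_lines(text: str) -> str:
    """Toy compressor used only for this example."""
    return "\n".join(ln for ln in text.splitlines() if ln.strip())


result = compress_text_blocks(
    [
        {"type": "text", "text": "a\n\n\nb\n"},
        {"type": "image", "data": "...", "mimeType": "image/png"},
    ],
    drop_blank_lines,
)
print(result[0]["text"])  # prints "a" and "b" on separate lines
```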

Install the optional dep:

pip install 'agent-sieve[mcp]'

Configure your MCP client (Claude Desktop, Cursor, Continue, etc.) to launch the proxy in place of the upstream. Example wrapping the official server-everything demo server (npx downloads it on first run):

For Claude Desktop's claude_desktop_config.json or Cursor's .cursor/mcp.json:

{
  "mcpServers": {
    "sieve-demo": {
      "command": "python",
      "args": [
        "-m", "sieve.integrations.mcp",
        "--",
        "npx", "-y", "@modelcontextprotocol/server-everything"
      ]
    }
  }
}

Use whatever upstream you already rely on (e.g. @modelcontextprotocol/server-filesystem with the paths your client expects); there is no published @modelcontextprotocol/server-bash on npm.

That's the entire integration: the agent sees the same upstream tools, but every TextContent result has been compressed.

For non-shell tools, the proxy uses the tool name as the parser-router hint; for shell-like tools that take a command / cmd / shellCommand argument, the actual command string is forwarded so parser detection (pytest, mypy, etc.) works correctly.
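The routing rule described above amounts to "prefer the command string if the call has one, otherwise fall back to the tool name". A minimal sketch, assuming the three argument names listed are checked in that order (the order and the `parser_hint` name are assumptions of this example):

```python
from typing import Any, Mapping

# Argument names the README lists for shell-like tools.
COMMAND_KEYS = ("command", "cmd", "shellCommand")


def parser_hint(tool_name: str, arguments: Mapping[str, Any]) -> str:
    """Return the command string for shell-like calls, else the tool name."""
    for key in COMMAND_KEYS:
        value = arguments.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return tool_name


print(parser_hint("bash", {"command": "pytest tests/"}))  # pytest tests/
print(parser_hint("read_file", {"path": "src/app.py"}))   # read_file
```

With the command forwarded, a downstream detector can recognise pytest or mypy from the command itself instead of from an opaque tool name such as "bash".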

What it looks like

Illustrative example (hand-written to show the shape of the transform, not a captured fixture):

Raw pytest run with two failures (1,818 chars):

============================= test session starts ==============================
platform linux -- Python 3.12.0, pytest-8.1.1, pluggy-1.4.0
... [40 lines of header + per-test output] ...
=================================== FAILURES ===================================
________________________________ test_user_update ________________________________
    def test_user_update(self):
        ...
>       assert response.status_code == 200
E       AssertionError: assert 403 == 200
tests/test_views.py:89: AssertionError
... [equivalent block for test_user_delete] ...
=========================== short test summary info ============================
FAILED tests/test_views.py::TestUserViewSet::test_user_update - AssertionError
FAILED tests/test_views.py::TestUserViewSet::test_user_delete - AssertionError
========================= 2 failed, 140 passed, 0 warnings ====================

After Sieve (297 chars, 83.7% reduction):

PYTEST: 2 failed, 140 passed (142 total)
FAIL tests/test_views.py::TestUserViewSet::test_user_update (test_views.py:89)
  expected 200, got 403
FAIL tests/test_views.py::TestUserViewSet::test_user_delete (test_views.py:102)
  expected 204, got 403
Pattern: All failures return 403 in test_views.py

On the next turn, the agent fixes test_user_update and re-runs:

PYTEST DELTA (turn 2)
PASS tests/test_views.py::TestUserViewSet::test_user_update now passes
STILL FAIL tests/test_views.py::TestUserViewSet::test_user_delete (line 102) - expected 204, got 403
Result: 1 failed, 141 passed (142 total)
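The cross-turn delta in this example can be approximated with set arithmetic over failing test ids. The sketch below parses only the FAILED lines of pytest's short summary and is not Sieve's parser: the regex, the function names, and the NEW FAIL label are all assumptions of this example, and it treats any test that leaves the failing set as passing.

```python
import re

# Matches short-summary lines such as "FAILED path::Class::test - Reason".
# Assumes node ids contain no spaces, which parametrized ids can violate.
FAILED_LINE = re.compile(r"^FAILED (?P<nodeid>\S+)(?: - (?P<reason>.*))?$")


def failing_nodeids(pytest_output: str) -> set[str]:
    """Collect node ids from FAILED lines in the short test summary."""
    return {
        m["nodeid"]
        for line in pytest_output.splitlines()
        if (m := FAILED_LINE.match(line))
    }


def delta(previous: set[str], current: set[str]) -> list[str]:
    """Describe what changed between two turns' failing sets."""
    lines = [f"PASS {n} now passes" for n in sorted(previous - current)]
    lines += [f"STILL FAIL {n}" for n in sorted(previous & current)]
    lines += [f"NEW FAIL {n}" for n in sorted(current - previous)]
    return lines


turn1 = """\
FAILED tests/test_views.py::TestUserViewSet::test_user_update - AssertionError
FAILED tests/test_views.py::TestUserViewSet::test_user_delete - AssertionError
"""
turn2 = "FAILED tests/test_views.py::TestUserViewSet::test_user_delete - AssertionError\n"

for line in delta(failing_nodeids(turn1), failing_nodeids(turn2)):
    print(line)
```

Run on the two turns above, this prints one PASS line for test_user_update and one STILL FAIL line for test_user_delete, mirroring the shape of the delta output shown.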
