High-throughput parallel LLM agent execution with tool deduplication, structured output, and self-hosted inference support.

These details have not been verified by PyPI

Project links

Project description

Batch Agent SDK

Run N LLM agents in parallel. Input list in, result list out. Every number on this page comes from a tests/benchmarks/results/*/results.json file or LOGS.md.

When to use this

Use BatchAgent when you need to run the same LLM task against N inputs (10–500) and want:

Results as they arrive, not after the slowest agent finishes
Tool calls deduplicated across agents (N agents reading the same file → 1 actual read)
Structured output validated, retried, and returned as data — not exceptions
Priority scheduling that drains near-complete agents before starting fresh ones

Do not use BatchAgent when you have a single task, a single turn, no shared system prompt, or you are on a commercial API with strict rate limits. For those cases the raw asyncio.gather approach is simpler and faster.

Install

pip install batch-agent            # core: httpx + pydantic
pip install "batch-agent[bedrock]" # + boto3 for AWS Bedrock
pip install "batch-agent[vllm]"    # + vllm for self-hosted inference
pip install "batch-agent[redis]"   # + redis for distributed mode (Phase 4)

Requires Python ≥ 3.10.

Quickstart

from batch_agent import BatchAgent, Tool
from pydantic import BaseModel

class PaperSummary(BaseModel):
    benchmark_name: str | None
    primary_metric: str
    models_tested: list[str]
    summary: str

results = await BatchAgent.run(
    system_prompt="You are a precise scientific summarizer.",
    task="Extract benchmark info from this paper:\n\n{paper_text}",
    inputs=[{"paper_text": text} for text in papers],
    tools=[Tool.read_file, Tool.web_search],
    output_schema=PaperSummary,
    model="meta-llama/Llama-3.1-70B-Instruct",
    backend="vllm://localhost:8000",
    max_inflight=64,
    max_turns=4,
    max_retries=2,
    on_result=lambda r: print(r.output),
)

Stream results as they arrive:

async for result in BatchAgent.stream(task=..., inputs=...):
    if result.ok:
        process(result.output)   # Pydantic model, arrives as each agent finishes
    else:
        log_error(result.error)  # structured AgentError, not an exception

AWS Bedrock (no self-hosted GPU required):

results = await BatchAgent.run(
    task="Summarize: {text}",
    inputs=[{"text": t} for t in texts],
    model="anthropic.claude-sonnet-4-20250514-v1:0",
    backend="bedrock://us-east-1",
    max_inflight=2,          # start low; AIMD controller raises it automatically
)

Reduce (map → aggregate):

results, summary = await BatchAgent.run_with_reduce(
    task="Extract claims from: {text}",
    inputs=[{"text": t} for t in texts],
    reduce="You received {n} claim lists. Deduplicate and rank by evidence.",
    output_schema=ClaimList,
    reduce_schema=RankedClaimList,
)

Benchmarks

All numbers are from live hardware runs or deterministic mock runs. Source files listed per table.

Table 1 — Fair comparison: BatchAgent SDK vs naive asyncio.gather

Mock backend: 60 ms per forward pass, 200 ms per tool call, asyncio.sleep (fully parallel). Source: tests/benchmarks/results/fair_comparison/results.json.

Both configs do identical work: multi-turn loop, one tool call per agent, Pydantic output validation, retry on failure.

Config	N	Wall (s)	agents/s	OK%	Tool reads	User code
D — naive `asyncio.gather`	50	0.65	76.9	100%	51	87 lines
E — BatchAgent	50	3.46	14.5	100%	2 (inflight dedup)	9 lines
D — naive `asyncio.gather`	200	0.66	303	100%	204	87 lines
E — BatchAgent	200	3.48s	57.5	100%	4 (inflight dedup)	9 lines

Key findings:

9.7× fewer lines of user code (87 vs 9, programmatically verified)
50× tool dedup at N=200 (inflight Future mechanism, cacheable=False, no LRU)
E wall-clock is ~5× higher because the scheduler does more per agent (retry, validation, KVFlow, state). For the equivalent raw throughput the naive approach is faster but requires the user to implement all of that themselves.

Table 2 — Live GPU: A10G 23 GB, Qwen/Qwen2.5-7B-Instruct, vLLM 0.20.1

Source: tests/benchmarks/results/config_d_20/results.json, config_d_200/results.json, config_e_200/results.json, fair_comparison_live/results.json.

Config	N	Wall (s)	agents/s	TTFT P50	Tool reads	Cache hit%
D naive (single-turn, file in prompt)	20	0.46	43.3	0.208s	20	83.5%
D naive (single-turn)	200	2.67	74.8	0.979s	200	63.4%
E BatchAgent (2-turn + tool, old scheduler)	200	36.5	5.5	—	1 (200x dedup)	93.0%
E BatchAgent (2-turn + tool, backpressure dispatch)	200	21.8s	9.2	—	200	90.8%

Key findings:

40% wall-clock improvement from backpressure dispatch fix (36.5s → 21.8s)
93.0% prefix cache hit rate at N=200 (from config_e_200/results.json)
200x tool dedup in the live GPU run (cacheable=True + inflight, all 200 agents dispatched simultaneously)
D naive N=200 does not OOM or timeout — vLLM queues all 200 simultaneously
TTFT P50 degrades 4.7× from N=20 to N=200 (0.208s → 0.979s) under naive gather due to queue depth

Note: E is slower per-agent than D on live GPU because E does 2 forward passes per agent (tool-call round-trip) while D does 1. The comparison is not identical work; see Table 1 for the controlled mock comparison.

How it works

User: BatchAgent.run(task, inputs, tools, output_schema)
        │
        ▼
TaskCompiler → ExecutionPlan (shared prefix extracted, schema injected)
        │
        ▼
WaveScheduler (asyncio)
  ├─ PrioritySemaphore (max_inflight) — near-done agents served first
  ├─ BackpressureController — pauses dispatch when backend queue fills
  ├─ KVFlowAdvisor — emits prefetch hints to vLLM before agents reactivate
  └─ per-agent loop:
       turn 1: acquire semaphore → generate → release semaphore
       if tool_use: TOOL_WAIT (semaphore FREE) → ToolPool.call → inject result
       turn 2: acquire → generate → release
       → parse_and_validate_output(Pydantic) → AgentResult
        │
        ▼
ToolPool
  ├─ inflight Future dedup (N concurrent callers → 1 execution)
  ├─ LRU cache (cacheable=True tools)
  ├─ token-bucket rate limiter per tool
  └─ @batchable SQL grouping (N WHERE id=? → 1 IN (...))
        │
        ▼
BackendAdapter (Anthropic / OpenAI / vLLM / SGLang / Bedrock)

The semaphore wraps only the inference call, not tool waits. This is W5 from the spec — it is the reason GPU utilisation stays near 100% instead of collapsing when agents wait for external tools.

Backends

From BackendAdapter.backend_capabilities():

Backend	URL format	prefix_pinning	kvflow	diff_kv	max_safe_concurrent
Anthropic API	`anthropic://`	✗	✗	✗	5
OpenAI API	`openai://host`	✗	✗	✗	5
vLLM (self-hosted)	`vllm://host:8000`	✓	✓	✓	64
SGLang (self-hosted)	`sglang://host:30000`	✗	✓	✗	64
AWS Bedrock	`bedrock://region/model`	✗	✗	✗	1–3

Bedrock-specific notes (from tests/benchmarks/results/bedrock_cache_isolation/results.json):

Prompt caching (cachePoint) is confirmed active when system prompt ≥ ~1,024 tokens
Prompt caching saves tokens but not latency at <8K token prompts — Bedrock managed queue/model latency dominates prefill savings (confirmed across 3 isolation variants: sequential, region-swap, parallel)
Default concurrency = 1; the AIMD controller increases it automatically after 60 s without throttling
Bedrock mode value: tool deduplication, structured output validation, retry handling, prompt cache management — not GPU scheduling efficiency

Limitations

These are not on the roadmap to soften — they are facts about the current implementation.

vLLM/GPU results are for 7B models on a single A10G. The throughput numbers in Table 2 are ~10 agents/sec for 2-turn tasks. For 70B models or larger batches a different GPU configuration is needed. The 70B benchmark has not been run; publish is blocked on it.
Tool dedup only fires for concurrent callers. If N agents call the same tool but complete their first forward pass at different times (as happens on real GPU with sequential-ish inference), each agent may see a different inflight window and execute the tool independently. The 200x dedup in the GPU run required cacheable=True. The mock shows 50x with cacheable=False because the mock is fully parallel.
Bedrock TTFT does not improve with prompt caching at <8K tokens. Cache-miss P50: 2.32s. Cache-hit P50: 3.24s. Hit/miss ratio: 1.40 (hit is slower, not faster). This is reproducible across 10 sequential identical requests. Source: tests/benchmarks/results/bedrock_cache_isolation/results.json.
Distributed mode (Phase 4) is a prototype. The RedisStreamsStateStore and DistributedWaveScheduler are implemented and unit-tested with a mock Redis, but have not been tested against a real Redis cluster. The 1,000-agent benchmark requires 4 nodes and has not been run.

Roadmap

Phase	Status	Description
0 — Foundation	✅ done	Multi-turn loop, W5 semaphore fix, tool coalescing, Anthropic backend
1 — Inference integration	✅ done	vLLM native mode, prefix warming, priority queue
2 — Scale + robustness	✅ done	500-agent benchmark, retry, compaction, checkpointing, reduce
3A — KVFlow prefetch	✅ done (mock)	KVFlowAdvisor, backpressure dispatch; vLLM patch route written, GPU test pending
3B — TokenDance diff KV	✅ done (mock)	18.76× compression in synthetic test; vLLM patch not deployed
3C — SGLang backend	✅ done (mock)	Full adapter; live GPU test pending
4 — Distributed	✅ prototype	Mock Redis; real cluster test and 1,000-agent benchmark pending
Publish	blocked	Waiting on: 70B GPU benchmark, live vLLM KVFlow measurement, cost-per-task comparison vs naive API

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.0

May 11, 2026

This version

0.1.0

May 10, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

batch_agent-0.1.0.tar.gz (147.8 kB view details)

Uploaded May 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

batch_agent-0.1.0-py3-none-any.whl (70.0 kB view details)

Uploaded May 10, 2026 Python 3

File details

Details for the file batch_agent-0.1.0.tar.gz.

File metadata

Download URL: batch_agent-0.1.0.tar.gz
Upload date: May 10, 2026
Size: 147.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for batch_agent-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a319175f7c6e078b9476c9e6371bffa894e6164b3ecdbf4dc38ffc206cbbe4a6`
MD5	`05e7366a093fffc52de734d52c6f50a5`
BLAKE2b-256	`1f654bdfdb534d9c86e2bf9000e5b46bdda28575e14d93a2c439b0ea6fa954fb`

See more details on using hashes here.

File details

Details for the file batch_agent-0.1.0-py3-none-any.whl.

File metadata

Download URL: batch_agent-0.1.0-py3-none-any.whl
Upload date: May 10, 2026
Size: 70.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for batch_agent-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a3f04b144208aa2fd00927a33571fd433ff7ae35aac7f0b3cd66c0a8c6eb8151`
MD5	`39b44e697bbe595a10f6ea8f2ca94f97`
BLAKE2b-256	`b68d460832c9e4741e8b244e1071c081755919872e7e38cc8ef5bad196bab544`

See more details on using hashes here.

batch-agent 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Batch Agent SDK

When to use this

Install

Quickstart

Benchmarks

Table 1 — Fair comparison: BatchAgent SDK vs naive asyncio.gather

Table 2 — Live GPU: A10G 23 GB, Qwen/Qwen2.5-7B-Instruct, vLLM 0.20.1

How it works

Backends

Limitations

Roadmap

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes