BatchAgent
High-throughput parallel LLM agent execution with tool deduplication, structured output, and self-hosted inference support.
BatchAgent is a Python SDK for running many tool-using LLM agents against one shared inference backend.
It is built for workloads like:
- "Review these 100 files and return one structured bug report per file."
- "Research these 100 papers and return one page of findings per paper."
- "Run 50 coding/research agents, stream each answer, then reduce them into one synthesis."
The point is simple: if 100 agents share the same system prompt, tool set, and workflow shape, they should not each recompute the same prefill or execute the same tool call independently. BatchAgent coordinates the agent loop, tool pool, prefix warming, result streaming, and backend scheduling so vLLM, SGLang, or Dynamo can reuse shared work.
This is not a general agent framework. It is a batch execution engine: input list in, structured result list out.
Results
OpenCode sessions on shared SGLang
Hardware: H100, Qwen/Qwen2.5-32B-Instruct, real OpenCode CLI sessions (opencode run subprocesses).
Condition A is isolated execution with a cache flush between sessions. Condition C is parallel execution against one shared SGLang server with the shared prefix warmed once.
| N | Isolated wall-clock | Shared SGLang wall-clock | Speedup | Prefill tokens isolated | Prefill tokens shared | Reduction |
|---|---|---|---|---|---|---|
| 5 | 28.77s | 10.14s | 2.84x | 63,903 | 1,981 | 97% |
| 10 | 53.56s | 19.86s | 2.70x | 128,182 | 4,920 | 96% |
| 20 | 115.39s | 36.32s | 3.18x | 257,094 | 9,874 | 96% |
| 50 | 286.72s | 98.00s | 2.93x | 643,918 | 23,605 | 96% |
| 100 | 573.21s | 190.80s | 3.00x | 1,287,681 | 50,806 | 96% |
The prefill column is the core result. At N=100, isolated inference computes 1.29M prefill tokens; shared SGLang computes about 51K. Same output, 96% less prefill compute.
Real multi-turn research task
Hardware: H100, Qwen/Qwen2.5-32B-Instruct, 20 fixed arXiv paper IDs, 2048-token shared prompt target, two turns per agent, deterministic 400ms web_search grouped by topic.
Condition A is a raw endpoint loop that holds its concurrency slot during tool wait. Condition C is BatchAgent using the same backend forwards plus tool coalescing and scheduler release during tool wait.
Source files:
- tests/benchmarks/results/research_summary/dynamo_h100_results.json
- tests/benchmarks/results/research_summary/sglang_h100_results.json
- tests/benchmarks/results/research_summary/vllm_h100_results.json
| Backend | Condition | Wall-clock | TTFT P50 | TTFT P95 | Prefix cache hit | Tool dedup |
|---|---|---|---|---|---|---|
| Dynamo on SGLang | A | 5.105s | 0.201s | 1.381s | 93.9% | 1.00x |
| Dynamo on SGLang | C | 3.242s | 0.194s | 0.238s | 98.8% | 2.22x |
| SGLang | A | 6.361s | 0.187s | 2.590s | 0.0% | 1.00x |
| SGLang | C | 3.158s | 0.191s | 0.219s | 98.6% | 2.22x |
| vLLM | A | 4.765s | 0.265s | 1.175s | 96.2% | 1.00x |
| vLLM | C | 2.766s | 0.244s | 0.308s | 97.5% | 2.22x |
BatchAgent speedup on this task:
- Dynamo: 1.57x
- SGLang: 2.01x
- vLLM: 1.72x
In every BatchAgent run, 20 requested web searches were reduced to 9 actual executions through tool coalescing.
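The mechanism behind that reduction is easy to sketch. A minimal illustration of the coalescing idea, assuming nothing about BatchAgent's internals: concurrent identical tool calls share a single in-flight future, keyed by tool name and canonicalized arguments.

import asyncio
import json

class ToolPool:
    """Sketch of tool coalescing: concurrent identical calls share one
    in-flight future instead of each executing independently."""

    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def call(self, tool, **kwargs):
        # Key by tool name plus JSON-canonicalized arguments.
        key = f"{tool.__name__}:{json.dumps(kwargs, sort_keys=True)}"
        if key in self._inflight:
            # An identical call is already running; wait for its result.
            return await self._inflight[key]
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await tool(**kwargs)
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)
            raise
        finally:
            del self._inflight[key]

Only calls whose windows overlap coalesce: once a call completes and leaves the in-flight map, a later identical call executes again unless a deterministic-call cache answers it.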
Why This Exists
Existing tools solve different halves of the problem:
- Orchestration tools can spawn many agents, but usually do not reason about KV cache, prefill, or backend scheduling.
- Inference engines can batch and cache requests, but they do not know that 100 requests are sibling agents in the same workflow.
BatchAgent is the layer between them. It gives the inference backend the workload shape it can optimize:
- shared system prompt prefix
- many sibling agent sessions
- bounded inference concurrency
- tool waits that do not occupy GPU request slots
- duplicate tool calls that can be coalesced
- scheduler hints for Dynamo-compatible backends
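A concrete note on the shared-prefix item in the list above: radix- and hash-based prefix caches only hit when the leading tokens are byte-identical across requests. A minimal sketch of the discipline this implies (the constants and helper here are illustrative, not BatchAgent API):

# Keep everything agent-specific out of the leading tokens so the
# backend's prefix cache can hit on every sibling request.
SYSTEM_PROMPT = "You are a research agent..."        # shared, verbatim
TOOL_SCHEMAS = "<tool definitions, identical for all agents>"
SHARED_PREFIX = SYSTEM_PROMPT + "\n" + TOOL_SCHEMAS  # warmed once

def build_prompt(agent_input: str) -> str:
    # Per-agent content goes strictly after the shared prefix; any
    # agent-specific token inside the prefix breaks cache reuse.
    return SHARED_PREFIX + "\n\n" + agent_input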
Install
pip install batch-agent
From source:
git clone https://github.com/alityb/batchagent.git
cd batchagent
pip install -e .
Requires Python 3.10+.
Basic Usage
from batch_agent import BatchAgent, Tool
from pydantic import BaseModel
class PaperSummary(BaseModel):
    title: str
    contribution: str
    relevance: str

results = await BatchAgent.run(
    task="Summarize this paper and extract the key result:\n\n{paper}",
    inputs=[{"paper": text} for text in papers],
    tools=[Tool.web_search, Tool.read_file],
    output_schema=PaperSummary,
    model="Qwen/Qwen2.5-32B-Instruct",
    backend="sglang://localhost:30000",
    max_inflight=32,
    max_turns=3,
)
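BatchAgent.run is a coroutine, so outside an existing event loop wrap it with asyncio (standard Python; the result attributes mirror the stream example below):

import asyncio

async def main():
    results = await BatchAgent.run(...)  # arguments as above
    for result in results:
        print(result.output)

asyncio.run(main())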
Stream results as they finish:
async for result in BatchAgent.stream(...):
    print(result.job_id, result.output)
Plan -> map -> reduce:
results, paper = await BatchAgent.run_with_map_reduce(
    plan_prompt="Generate 20 research questions about: {topic}",
    plan_inputs={"topic": "KV cache optimization for multi-agent inference"},
    plan_output_schema=ResearchPlan,  # items: list[str]
    task="Research this question: {item}",
    output_schema=ResearchAnswer,
    reduce="Synthesize the answers into a survey.",
    reduce_schema=SurveyPaper,
    tools=[Tool.web_search],
    model="Qwen/Qwen2.5-32B-Instruct",
    backend="sglang://localhost:30000",
)
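The three schemas referenced above are ordinary Pydantic models. One plausible shape, where every field except ResearchPlan.items (pinned by the inline comment above) is an illustrative assumption:

from pydantic import BaseModel

class ResearchPlan(BaseModel):
    items: list[str]  # the 20 generated research questions

class ResearchAnswer(BaseModel):
    # Illustrative fields; define whatever structure your task needs.
    question: str
    findings: str

class SurveyPaper(BaseModel):
    # Illustrative fields for the reduced synthesis.
    title: str
    body: str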
OpenCode runtime:
from batch_agent import BatchAgent
from batch_agent.runtimes import OpenCodeRuntime
results = await BatchAgent.run(
    runtime=OpenCodeRuntime(
        backend="sglang://localhost:30000",
        model="Qwen/Qwen2.5-32B-Instruct",
    ),
    task="Review this file for bugs: {file}",
    inputs=[{"file": f} for f in files],
    max_agents=20,
)
Each worker runs opencode run as a subprocess with OPENCODE_CONFIG_CONTENT pointed at the shared SGLang/vLLM server.
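Schematically, a worker's subprocess invocation looks something like this sketch (not BatchAgent's internal code; the config payload is treated as an opaque string whose schema belongs to OpenCode):

import asyncio
import os

async def run_opencode_session(prompt: str, config_json: str) -> str:
    # config_json points OpenCode at the shared SGLang/vLLM server;
    # its exact schema is OpenCode's and is not reproduced here.
    env = {**os.environ, "OPENCODE_CONFIG_CONTENT": config_json}
    proc = await asyncio.create_subprocess_exec(
        "opencode", "run", prompt,
        env=env,
        stdout=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    return stdout.decode()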
Backends
| Backend | URL | Notes |
|---|---|---|
| SGLang | sglang://localhost:30000 | Best current path for shared-prefix OpenCode sessions |
| vLLM | vllm://localhost:8000 | Live H100 tested with prefix caching |
| NVIDIA Dynamo | dynamo://localhost:8001 | Live H100 tested with nvext.agent_hints path |
| Anthropic API | anthropic:// | Degraded mode; no direct KV control |
| OpenAI API | openai:// | Degraded mode; no direct KV control |
| AWS Bedrock | bedrock://us-east-1 | Degraded mode; managed prompt caching only |
API backends are useful for compatibility, but the main performance story is self-hosted inference where the SDK can align orchestration with the serving engine.
What BatchAgent Optimizes
| Optimization | Status | Verified |
|---|---|---|
| Multi-turn agent loop | implemented | tests |
| Structured Pydantic output | implemented | tests |
| Streaming result delivery | implemented | tests |
| Tool coalescing | implemented | H100 research-summary benchmark |
| Release inference slots during tool wait | implemented | H100 research-summary benchmark |
| Shared-prefix cache usage | backend-dependent | H100 SGLang/vLLM/Dynamo |
| OpenCode shared backend runtime | implemented | H100 OpenCode benchmark |
| Dynamo nvext.agent_hints | implemented | live Dynamo smoke + benchmark |
| KVFlow-style prefetch | hints only | not verified as a latency win |
| TokenDance-style diff KV | prototype | mock only |
| Distributed Redis orchestration | prototype | local/mock only |
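For the Dynamo path, hints travel in the nvext extension field of an OpenAI-compatible request body. A hedged sketch of the shape; the agent_hints keys shown are illustrative assumptions, not a documented schema:

import httpx  # hypothetical direct request, bypassing the SDK

payload = {
    "model": "Qwen/Qwen2.5-32B-Instruct",
    "messages": [{"role": "user", "content": "..."}],
    # NVIDIA extension field; the hint keys below are illustrative only.
    "nvext": {
        "agent_hints": {
            "session_id": "job-17",        # hypothetical key
            "shared_prefix_id": "wave-0",  # hypothetical key
        }
    },
}
resp = httpx.post("http://localhost:8001/v1/chat/completions", json=payload)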
Architecture
BatchAgent.run(...)
        |
        v
Task compiler
  - creates one job per input
  - separates shared prefix from per-agent input
  - injects output schema
        |
        v
Wave scheduler
  - runs model turns with bounded concurrency
  - releases the inference slot during TOOL_WAIT
  - streams completed results immediately
  - emits backend hints where supported
        |
        v
Tool pool
  - coalesces concurrent identical calls
  - caches deterministic calls
  - rate-limits external tools
        |
        v
Backend/runtime
  - SGLang / vLLM / Dynamo / API backend
  - OpenCode subprocess runtime
  - metrics collection
The key invariant: inference slots wrap model calls only. Tool waits do not hold GPU request capacity.
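A minimal sketch of that invariant (illustrative names: session, model_call, and run_tools stand in for the real loop): the concurrency semaphore is held around the model call only and released before tool execution begins.

import asyncio

sem = asyncio.Semaphore(32)  # max_inflight bounds model calls only

async def agent_turn(session):
    async with sem:                    # inference slot held here...
        reply = await session.model_call()
    # ...and released here, before any tool work starts.
    if reply.tool_calls:
        outputs = await run_tools(reply.tool_calls)  # TOOL_WAIT holds no slot
        session.append(outputs)
    return reply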
Limitations
- KVFlow prefetch is not verified. The current vLLM API-route approach cannot safely turn BatchAgent kv_key hints into CPU->GPU block mappings; that needs scheduler-level integration.
- TokenDance-style diff KV is a prototype. The compression number is synthetic, not a live vLLM CacheEngine result.
- Distributed Redis orchestration exists as a prototype but is not yet validated as a 1,000-agent, multi-node deployment.
- API backends cannot expose the same cache/scheduler controls as self-hosted vLLM, SGLang, or Dynamo.
- Tool coalescing only helps when agents request the same cacheable tool call during overlapping windows.
Research Context
BatchAgent is motivated by the gap between agent orchestration and inference serving:
- vLLM provides continuous batching, PagedAttention, and prefix caching.
- SGLang provides RadixAttention, which is well suited to shared-prefix and branching-context workloads.
- NVIDIA Dynamo exposes production serving hooks such as nvext.agent_hints.
- LMCache targets hierarchical KV movement across GPU, CPU, disk, and remote storage.
- TokenDance and KVFlow show the systems opportunity for multi-agent KV sharing and workflow-aware cache movement.
References:
- TokenDance: https://arxiv.org/abs/2604.03143
- KVFlow: https://arxiv.org/abs/2507.07400
- LMCache: https://arxiv.org/abs/2510.09665
- vLLM prefix caching: https://docs.vllm.ai/en/stable/design/prefix_caching.html
- Agent Orchestrator: https://github.com/ComposioHQ/agent-orchestrator
- OpenAI Batch API: https://platform.openai.com/docs/guides/batch/overview
Development
Run tests:
pytest -q
Run the H100 research-summary benchmark:
PYTHONPATH=. python tests/benchmarks/bench_research_summary.py \
--base-url http://localhost:30000 \
--backend sglang://localhost:30000 \
--label sglang_qwen25_32b_h100 \
--max-inflight 8 \
--tool-latency 400 \
--system-prompt-tokens 2048
Run the raw-backend vs BatchAgent slow-tool benchmark:
PYTHONPATH=. python tests/benchmarks/backend_raw_vs_batchagent.py \
--base-url http://localhost:30000 \
--label sglang_h100 \
--n 100 \
--tool-latency 800 \
--system-prompt-tokens 2048 \
--max-inflight 32