Benchmark Suite & E2E Testing for sageLLM
sagellm-benchmark
Protocol Compliance (Mandatory)
- MUST follow Protocol v0.1: https://github.com/intellistream/sagellm-docs/blob/main/docs/specs/protocol_v0.1.md
- Any globally shared definitions (fields, error codes, metrics, IDs, schemas) MUST be added to Protocol first.
Benchmark suite for sageLLM inference engine performance and validation.
New here? See QUICKSTART.md for a 5-minute guide.
Canonical boundary note: performance-path ownership is defined in https://github.com/intellistream/sagellm-docs/blob/main/docs/specs/performance_mainline_architecture.md. sagellm-benchmark validates the main path externally; it does not define runtime semantics for sagellm-core.
Mainline Validation Mapping
Use benchmark output as proof for the canonical performance mainline, not as a standalone score table.
- avg_ttft_ms: admission + prefill responsiveness. Good for spotting whether batching changes hurt interactive startup.
- avg_tbt_ms: decode-step latency on the formal execution path. This is the first field to inspect for stateful batch decode convergence.
- avg_throughput_tps: average per-request throughput. Useful for request-level decode efficiency, but not a substitute for aggregate batch throughput.
- output_throughput_tps: best top-line field for shared-stream and paged/native convergence when comparing batch size >= 2.
- request_throughput_rps: useful when an optimization changes concurrent admission or batch turnover more than single-request token speed.
- shared_stream_markers.hits: required log evidence that shared batching actually activated.
- paged_path_markers.hits: evidence that paged/native or explicit fallback attention implementations were actually reached.
- block_table_markers.hits: evidence that scheduler-provided block_tables crossed the runtime boundary and survived to execution.
Interpretation rule:
- Better avg_tbt_ms or output_throughput_tps without marker evidence is only a performance observation, not proof that the intended mainline path converged.
- Marker evidence without competitive latency/throughput means the path is wired, but not yet performant.
- A valid convergence claim should combine metric deltas with /info, /metrics, or *_log_probe.json evidence.
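The rule above can be expressed as a small decision helper. This is an illustrative sketch, not part of the CLI; the field names match the standard result fields this README documents, but the thresholds (any improvement, any non-zero hit count) are simplifying assumptions.

```python
# Illustrative sketch of the interpretation rule: a convergence claim needs
# BOTH a metric improvement AND non-zero marker evidence.
# Field names follow the standard result fields; the helper is not part of the CLI.

def classify_convergence(before: dict, after: dict) -> str:
    """Classify an after-vs-before comparison per the interpretation rule."""
    faster = (
        after["avg_tbt_ms"] < before["avg_tbt_ms"]
        or after["output_throughput_tps"] > before["output_throughput_tps"]
    )
    markers = (
        after.get("shared_stream_markers", {}).get("hits", 0) > 0
        or after.get("paged_path_markers", {}).get("hits", 0) > 0
        or after.get("block_table_markers", {}).get("hits", 0) > 0
    )
    if faster and markers:
        return "converged"             # valid convergence claim
    if faster:
        return "observation-only"      # faster, but no proof the path was taken
    if markers:
        return "wired-not-performant"  # path reached, not yet competitive
    return "no-evidence"

before = {"avg_tbt_ms": 42.0, "output_throughput_tps": 310.0}
after = {"avg_tbt_ms": 35.5, "output_throughput_tps": 402.0,
         "shared_stream_markers": {"hits": 12}}
print(classify_convergence(before, after))  # converged
```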
Features
- End-to-end Q1-Q8 query workloads covering diverse LLM scenarios
- Standardized JSON metrics and reports
- One-command benchmark runner
- Extensible backend support
- Performance benchmark CLI (perf) for operator and E2E benchmark baselines
- Canonical compare entrypoint for sagellm vs vllm/lmdeploy endpoint benchmarking
- Convergence validation profile for shared-stream batching, block-table usage, and paged/native path evidence
Dependencies
- isagellm-protocol (>=0.4.0.0)
- isagellm-core (>=0.4.0.0)
- isagellm-backend (>=0.4.0.1)
Installation
pip install isagellm-benchmark
For specific backend support:
# With vLLM support (non-Ascend)
pip install 'isagellm-benchmark[vllm-client]'
# With vLLM Ascend support (Ascend machines)
pip install 'isagellm-benchmark[vllm-ascend-client]'
# With LMDeploy support
pip install 'isagellm-benchmark[lmdeploy-client]'
Dependency policy:
- pyproject.toml extras are the single source of truth for third-party compare clients.
- quickstart.sh and setup scripts are convenience layers that install those extras and, when needed, add a validated runtime matrix on top.
- Cross-engine comparison belongs on the benchmark side only; sagellm-core is not a third-party engine compare entry.
Quick Start
# Run all Q1-Q8 workloads with CPU backend
sagellm-benchmark run --workload all --backend cpu --output ./benchmark_results
# Run a single query workload
sagellm-benchmark run --workload Q1 --backend cpu
# Generate a markdown report
sagellm-benchmark report --input ./benchmark_results/benchmark_summary.json --format markdown
# Run migrated performance benchmarks
sagellm-benchmark perf --type operator --device cpu
sagellm-benchmark perf --type e2e --model Qwen/Qwen2-7B-Instruct --batch-size 1 --batch-size 4
# Compare multiple OpenAI-compatible endpoints through benchmark clients
sagellm-benchmark compare \
--target sagellm=http://127.0.0.1:8902/v1 \
--target vllm=http://127.0.0.1:8901/v1 \
--model Qwen/Qwen2.5-0.5B-Instruct
# If GPU memory is tight, capture the two engines sequentially and compare offline.
sagellm-benchmark compare-record \
--label sagellm \
--url http://127.0.0.1:8901/v1 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--output-dir ./benchmark_results/sequential/sagellm
sagellm-benchmark compare-record \
--label vllm \
--url http://127.0.0.1:9100/v1 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--output-dir ./benchmark_results/sequential/vllm
sagellm-benchmark compare-offline \
--result sagellm=./benchmark_results/sequential/sagellm/sagellm.json \
--result vllm=./benchmark_results/sequential/vllm/vllm.json \
--output-dir ./benchmark_results/sequential/compare
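Conceptually, the offline comparison step reduces to loading the two recorded result files and reporting per-metric deltas. The sketch below assumes a flat JSON shape with the standard result fields at the top level, which may not match the actual recorded schema:

```python
# Sketch of what compare-offline conceptually does: load two recorded result
# files and report percentage deltas. The JSON shape assumed here (top-level
# metric fields) is illustrative, not the actual schema.
import json
from pathlib import Path

METRICS = ("avg_ttft_ms", "avg_tbt_ms", "output_throughput_tps")

def load_result(path: str) -> dict:
    return json.loads(Path(path).read_text())

def pct_delta(base: float, cand: float) -> float:
    """Candidate vs baseline, in percent (positive = larger value)."""
    return (cand - base) / base * 100.0

def compare_offline(baseline: dict, candidate: dict) -> dict:
    return {m: round(pct_delta(baseline[m], candidate[m]), 2)
            for m in METRICS if m in baseline and m in candidate}

vllm = {"avg_ttft_ms": 80.0, "avg_tbt_ms": 20.0, "output_throughput_tps": 500.0}
sagellm = {"avg_ttft_ms": 72.0, "avg_tbt_ms": 22.0, "output_throughput_tps": 480.0}
print(compare_offline(vllm, sagellm))
# {'avg_ttft_ms': -10.0, 'avg_tbt_ms': 10.0, 'output_throughput_tps': -4.0}
```

Note that for the latency fields a negative delta favors the candidate, while for throughput a positive delta does.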
# In an interactive terminal, compare can also prompt to kill local target
# processes after the benchmark finishes.
sagellm-benchmark compare \
--target sagellm=http://127.0.0.1:8902/v1 \
--target vllm=http://127.0.0.1:8901/v1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--prompt-cleanup
# If local endpoints are not running yet, provide per-target start commands.
sagellm-benchmark compare \
--target sagellm=http://127.0.0.1:8902/v1 \
--target vllm=http://127.0.0.1:8000/v1 \
--target-command "sagellm=sagellm serve --backend cuda --model Qwen/Qwen2.5-0.5B-Instruct --port 8902" \
--target-command "vllm=vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000" \
--model Qwen/Qwen2.5-0.5B-Instruct \
--prompt-cleanup
# Convenience profile for the standard sageLLM vs vLLM layout
sagellm-benchmark vllm-compare run \
--sagellm-url http://127.0.0.1:8901/v1 \
--vllm-url http://127.0.0.1:8000/v1 \
--model Qwen/Qwen2.5-0.5B-Instruct
# Prompt to clean up the locally running SageLLM/vLLM endpoints afterwards.
sagellm-benchmark vllm-compare run \
--sagellm-url http://127.0.0.1:8901/v1 \
--vllm-url http://127.0.0.1:8000/v1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--prompt-cleanup
# Optionally auto-start local SageLLM/vLLM endpoints if they are not up yet.
sagellm-benchmark vllm-compare run \
--sagellm-url http://127.0.0.1:8901/v1 \
--vllm-url http://127.0.0.1:8000/v1 \
--start-sagellm-cmd "sagellm serve --backend cuda --model Qwen/Qwen2.5-0.5B-Instruct --port 8901" \
--start-vllm-cmd "vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000" \
--model Qwen/Qwen2.5-0.5B-Instruct \
--prompt-cleanup
# Recommended on A100/CUDA hosts: keep vLLM in a dedicated Docker container,
# then reuse the stable endpoint for each compare run.
VLLM_GPU_DEVICE=1 VLLM_PORT=9100 \
./scripts/start_vllm_cuda_docker.sh
sagellm-benchmark vllm-compare run \
--sagellm-url http://127.0.0.1:8901/v1 \
--vllm-url http://127.0.0.1:9100/v1 \
--model Qwen/Qwen2.5-1.5B-Instruct
# Or let compare bootstrap the Dockerized vLLM endpoint on demand.
sagellm-benchmark vllm-compare run \
--sagellm-url http://127.0.0.1:8901/v1 \
--vllm-url http://127.0.0.1:9100/v1 \
--start-vllm-cmd "./scripts/start_vllm_cuda_docker.sh" \
--model Qwen/Qwen2.5-1.5B-Instruct
# If startup fails, logs remain available because the helper does not use
# --rm by default:
docker logs sagellm-benchmark-vllm | tail -n 200
# The helper defaults to --network host, which is more reliable on locked-down
# servers where Docker bridge networking cannot reach huggingface.co.
# Generate charts (PNG/PDF, dark theme)
sagellm-benchmark perf --type e2e --plot --plot-format png --plot-format pdf --theme dark
# Keep the default Q1-Q8 CPU flow
./run_benchmark.sh
# Run the convergence validation loop against two live endpoints
./run_benchmark.sh --profile convergence \
--target before=http://127.0.0.1:8901/v1 \
--target after=http://127.0.0.1:8902/v1 \
--log-file before=/tmp/sagellm-before.log \
--log-file after=/tmp/sagellm-after.log \
--model Qwen/Qwen2.5-0.5B-Instruct
When validating the mainline architecture rather than just endpoint speed, preserve three artifact classes together:
- compare results: comparison.json / comparison.md
- runtime surfaces: *_info.json, *_metrics.prom
- path evidence: *_log_probe.json
Convergence Validation Loop
Use the benchmark repo as the external validation layer for recent sagellm-core and sagellm-backend convergence work. The convergence profile keeps runtime selection outside sagellm-core, then captures both benchmark deltas and endpoint observability snapshots.
Standard result fields to compare:
- avg_ttft_ms
- avg_tbt_ms
- avg_throughput_tps
- output_throughput_tps
- request_throughput_rps
- shared_stream_markers.hits
- paged_path_markers.hits
- block_table_markers.hits
Standard artifacts written by ./run_benchmark.sh --profile convergence:
- comparison.json and comparison.md
- validation_summary.json and VALIDATION.md
- REPRODUCE.sh
- <label>.json and <label>.md
- <label>_info.json
- <label>_metrics.prom
- <label>_log_probe.json when --log-file LABEL=PATH is provided
Recommended benchmark interpretation:
- Shared-stream batching: compare avg_ttft_ms, avg_tbt_ms, and output_throughput_tps at --batch-size 2 and --batch-size 4, then confirm the candidate endpoint shows non-zero shared_stream_markers.hits.
- Paged/native path usage: inspect <label>_metrics.prom, <label>_info.json, and <label>_log_probe.json for non-zero paged_path_markers.hits.
- Formal block-table path: inspect <label>_log_probe.json for non-zero block_table_markers.hits, then correlate with the batch-size latency/throughput deltas.
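As a rough illustration of what marker evidence means, a log probe can be thought of as counting known marker strings in an endpoint log. The literal marker strings below are hypothetical placeholders, not the patterns the real probe uses:

```python
# Sketch of a log probe: count marker strings in an endpoint log to produce
# per-marker hit counts. The marker strings below are hypothetical
# placeholders; the real probe's patterns may differ.
MARKERS = {
    "shared_stream_markers": "shared-stream batch",
    "paged_path_markers": "paged attention path",
    "block_table_markers": "block_tables",
}

def probe_log(text: str) -> dict:
    """Return {marker_name: {"hits": count}} for each known marker."""
    return {name: {"hits": text.count(pat)} for name, pat in MARKERS.items()}

log = (
    "step=1 shared-stream batch size=4\n"
    "step=1 scheduler passed block_tables to runtime\n"
    "step=2 shared-stream batch size=4\n"
)
print(probe_log(log))
```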
Reproducible command templates:
Shared stream before/after:
./run_benchmark.sh --profile convergence \
--target before=http://127.0.0.1:8901/v1 \
--target after=http://127.0.0.1:8902/v1 \
--log-file before=/var/log/sagellm-before.log \
--log-file after=/var/log/sagellm-after.log \
--model Qwen/Qwen2.5-0.5B-Instruct \
--batch-size 1 --batch-size 2 --batch-size 4 \
--max-output-tokens 64
Paged/native on vs off:
./run_benchmark.sh --profile convergence \
--target torch_fallback=http://127.0.0.1:8911/v1 \
--target native_ascend=http://127.0.0.1:8912/v1 \
--log-file torch_fallback=/var/log/sagellm-fallback.log \
--log-file native_ascend=/var/log/sagellm-native.log \
--model Qwen/Qwen2.5-0.5B-Instruct \
--batch-size 1 --batch-size 2 --batch-size 4 \
--max-output-tokens 64
Cross-backend comparison on domestic hardware:
./run_benchmark.sh --profile convergence \
--target ascend_native=http://127.0.0.1:8921/v1 \
--target kunlun_native=http://127.0.0.1:8922/v1 \
--target musa_native=http://127.0.0.1:8923/v1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--batch-size 1 --batch-size 2 --batch-size 4 \
--max-output-tokens 64
On Ascend hosts, start the SageLLM endpoint through the umbrella runtime wrapper before benchmarking:
cd /home/shuhao/sagellm
./scripts/sagellm_with_ascend_env.sh sagellm serve --backend ascend --model Qwen/Qwen2.5-0.5B-Instruct --port 8912
CLI examples:
# Run the full Q1-Q8 suite with the CPU backend
sagellm-benchmark run --workload all --backend cpu
# Run with a CPU model
sagellm-benchmark run --workload all --backend cpu --model sshleifer/tiny-gpt2
# Run a single query workload
sagellm-benchmark run --workload Q3 --backend cpu
# Generate reports
sagellm-benchmark report --input ./benchmark_results/benchmark_summary.json --format markdown
# Generate report from perf JSON
sagellm-benchmark report --input ./benchmark_results/perf_results.json --format markdown
# Re-generate charts from existing perf JSON
sagellm-benchmark report --input ./benchmark_results/perf_results.json --plot --plot-format png
Ascend vLLM Comparison Benchmarks
The compare command in sagellm-benchmark is the only recommended entry point for cross-engine comparison. perf --live remains available as a single-endpoint performance capture capability; actual sagellm vs vllm/lmdeploy comparisons go uniformly through compare or a benchmark client.
Unified external guidance: run third-party engine comparison experiments only in sagellm-benchmark, through the compare or vllm-compare entry points. First complete dependency installation, Ascend environment injection, and endpoint liveness checks, then produce comparison results against OpenAI-compatible endpoints. Do not feed vLLM/LMDeploy/SGLang adaptors, dependencies, or experiment scripts back into sagellm-core.
If the goal is the standard sageLLM vs vLLM comparison, the convenience entry point also works:
sagellm-benchmark vllm-compare install-ascend
sagellm-benchmark vllm-compare run \
--sagellm-url http://127.0.0.1:8901/v1 \
--vllm-url http://127.0.0.1:8000/v1 \
--model Qwen/Qwen2.5-0.5B-Instruct
To reproduce the vllm-ascend vs sagellm comparison on an Ascend machine, start from:
- docs/ASCEND_BENCHMARK.md
- scripts/setup_vllm_ascend_compare_env.sh
- scripts/run_vllm_ascend_container.sh
The benchmark extras in pyproject.toml are the single source of truth for dependency declarations; scripts/setup_vllm_ascend_compare_env.sh only layers a validated Ascend version matrix on top of them, as a convenience layer rather than a new dependency entry point.
If you run ./quickstart.sh directly, the script likewise installs the benchmark extra matching the current hardware first, then adds convenience-layer installation steps as the scenario requires.
- Start the two services separately (for example sageLLM and vLLM Ascend), making sure both expose /v1/models and /v1/chat/completions.
- Run the comparison command:
sagellm-benchmark vllm-compare run \
--sagellm-url http://127.0.0.1:8901/v1 \
--vllm-url http://127.0.0.1:8000/v1 \
--model Qwen/Qwen2.5-0.5B-Instruct
Equivalent formal CLI usage:
sagellm-benchmark compare \
--target sagellm=http://127.0.0.1:8901/v1 \
--target vllm=http://127.0.0.1:8000/v1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--batch-size 1 --batch-size 2 --batch-size 4 \
--max-output-tokens 64
Output is written to benchmark_results/compare_*/ and includes:
- <target>.json / <target>.md
- comparison.md (summarizing TTFT/TBT/TPS deltas)
- comparison.json (structured comparison summary)
To validate shared-stream batching and the paged/block-table path at the same time, prefer run_benchmark.sh --profile convergence described above, since it additionally writes /info, /metrics, and optional log-probe results to disk.
Workloads
- Q1: Short Q&A — 32 prompt → 64 output (5 requests)
- Q2: Long context summarization — 512 prompt → 128 output (3 requests)
- Q3: Code generation — 128 prompt → 256 output (3 requests)
- Q4: Multi-turn conversation — 256 prompt → 256 output (3 requests)
- Q5: Concurrent short requests — 32 prompt → 64 output (10 concurrent)
- Q6: Concurrent long context — 512 prompt → 256 output (10 concurrent)
- Q7: Chain-of-thought reasoning — 256 prompt → 512 output (3 requests)
- Q8: Composite task — 192 prompt → 128 output (4 concurrent)
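For reference, the workload matrix above can be restated as plain data. The real suite's config schema is not shown here; this table only mirrors the list above:

```python
# The Q1-Q8 workload matrix, expressed as plain data mirroring the README list.
# This is a reference table, not the suite's actual config schema.
WORKLOADS = {
    "Q1": {"name": "Short Q&A",                  "prompt": 32,  "output": 64,  "requests": 5,  "concurrent": False},
    "Q2": {"name": "Long context summarization", "prompt": 512, "output": 128, "requests": 3,  "concurrent": False},
    "Q3": {"name": "Code generation",            "prompt": 128, "output": 256, "requests": 3,  "concurrent": False},
    "Q4": {"name": "Multi-turn conversation",    "prompt": 256, "output": 256, "requests": 3,  "concurrent": False},
    "Q5": {"name": "Concurrent short requests",  "prompt": 32,  "output": 64,  "requests": 10, "concurrent": True},
    "Q6": {"name": "Concurrent long context",    "prompt": 512, "output": 256, "requests": 10, "concurrent": True},
    "Q7": {"name": "Chain-of-thought reasoning", "prompt": 256, "output": 512, "requests": 3,  "concurrent": False},
    "Q8": {"name": "Composite task",             "prompt": 192, "output": 128, "requests": 4,  "concurrent": True},
}

# Total output tokens for a full suite run (per-request output * request count):
total_output = sum(w["output"] * w["requests"] for w in WORKLOADS.values())
print(total_output)  # 7488
```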
Outputs
After running the benchmark, results are written to a folder like:
benchmark_results/
├── benchmark_summary.json
├── Q1_metrics.json
├── Q2_metrics.json
├── ...
├── Q8_metrics.json
└── REPORT.md
Metrics include latency, throughput, memory, and error rates. See docs/USAGE.md for details.
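As a minimal post-processing sketch, the per-query metric files can be summarized as below. The flat per-workload JSON schema assumed here is illustrative; see docs/USAGE.md for the real one:

```python
# Minimal sketch for post-processing benchmark output. The per-workload JSON
# structure assumed here (a flat metrics dict per Q*_metrics.json file) is
# illustrative; see docs/USAGE.md for the actual schema.
import json
from pathlib import Path

def summarize(results_dir: str) -> list:
    """One line per Q*_metrics.json file, sorted by workload id."""
    lines = []
    for path in sorted(Path(results_dir).glob("Q*_metrics.json")):
        m = json.loads(path.read_text())
        lines.append(
            f"{path.stem}: ttft={m.get('avg_ttft_ms', '?')}ms "
            f"tps={m.get('output_throughput_tps', '?')}"
        )
    return lines
```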
Backends
- cpu: CPU inference via HuggingFace Transformers (requires --model)
- compare targets: sagellm / vllm / lmdeploy are reached through compare or the benchmark clients, not through run --backend
Compare Policy
- sagellm-benchmark compare is the only recommended entry point for comparing sagellm with third-party engines.
- Prefer OpenAI-compatible endpoints for comparison; if a third-party service does not expose a compatible endpoint, connect through a sagellm_benchmark.clients.* Python client.
- Third-party engine dependencies, startup convenience scripts, endpoint liveness checks, and live metric collection all converge in sagellm-benchmark; sagellm-core no longer carries this responsibility.
- ./quickstart.sh automatically installs the vLLM compare extra matching the current hardware; Ascend machines additionally apply a validated version matrix on top of the extra.
Development
Setup
# 1. Clone the repository
git clone https://github.com/intellistream/sagellm-benchmark.git
cd sagellm-benchmark
# 2. One-command setup (recommended)
./quickstart.sh --dev
# Optional: stable/release-oriented dependency baseline
./quickstart.sh --standard
Quickstart modes:
- --standard: installs baseline dependencies from PyPI, then installs the current repo in editable mode.
- --dev: runs the standard flow, then tries local editable overrides for sibling repos with --no-deps.
- quickstart also installs the matching benchmark compare extra for the current machine; extras remain the dependency source of truth.
- Before install, quickstart.sh dynamically cleans existing isagellm-* packages for re-entrant setup.
Running Tests
pytest tests/
Local CI Fallback (when GitHub Actions is blocked)
bash scripts/local_ci_fallback.sh
This runs the same core checks as .github/workflows/ci.yml locally (pre-commit, version guard, pytest+coverage, build+twine).
Performance Regression Check (CI)
- Workflow: .github/workflows/benchmark.yml
- Baseline directory: benchmark_baselines/
- PR: runs a lightweight E2E benchmark and comments a regression report on the PR
- Release: runs the fuller benchmark matrix and enforces regression thresholds
- Manual baseline refresh: trigger the workflow with update_baseline=true
# Generate current perf snapshot
sagellm-benchmark perf \
--type e2e \
--model Qwen/Qwen2-7B-Instruct \
--batch-size 1 --batch-size 4 --batch-size 8 \
--precision fp16 --precision int8 \
--output-json benchmark_results/perf_current.json \
--output-markdown benchmark_results/perf_current.md
# Compare current snapshot with baseline
python scripts/compare_performance_baseline.py \
--baseline benchmark_baselines/perf_baseline_e2e.json \
--current benchmark_results/perf_current.json \
--warning-threshold 5 \
--critical-threshold 10 \
--summary-json benchmark_results/perf_comparison_summary.json \
--report-md benchmark_results/perf_comparison_report.md
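The warning/critical gate applied by the baseline comparison can be illustrated with a simplified latency check, using the 5%/10% thresholds from the command above. This helper is a sketch, not the actual script logic:

```python
# Simplified sketch of a warning/critical regression gate, using the 5%/10%
# thresholds passed to compare_performance_baseline.py above. Illustrative
# only; not the script's actual implementation.
def regression_status(baseline_ms: float, current_ms: float,
                      warning_pct: float = 5.0,
                      critical_pct: float = 10.0) -> str:
    """Classify a latency metric where a larger value is worse."""
    delta_pct = (current_ms - baseline_ms) / baseline_ms * 100.0
    if delta_pct >= critical_pct:
        return "critical"
    if delta_pct >= warning_pct:
        return "warning"
    return "ok"

print(regression_status(100.0, 103.0))  # ok       (+3%)
print(regression_status(100.0, 107.0))  # warning  (+7%)
print(regression_status(100.0, 112.0))  # critical (+12%)
```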
Code Quality
# Linting
ruff check .
# Type checking
mypy src/
Documentation
- QUICKSTART.md - 5-minute quick start
- docs/USAGE.md - detailed usage guide
- docs/CLIENTS_GUIDE.md - client selection guide
- docs/DEPLOYMENT_ARCHITECTURE.md - deployment architecture notes (HTTP API vs direct connection)
🔄 Contribution Guide
Follow this workflow:
1. Create an issue describing the problem or requirement:
   gh issue create --title "[Bug] description" --label "bug,sagellm-benchmark"
2. Develop the fix on a local fix/#123-xxx branch:
   git checkout -b fix/#123-xxx origin/main-dev
   # develop, test...
   pytest -v
   ruff format . && ruff check . --fix
3. Open a PR against the main-dev branch:
   gh pr create --base main-dev --title "Fix: description" --body "Closes #123"
4. Merge into main-dev after approval.
See .github/copilot-instructions.md for more details.
License
Private - IntelliStream Research Project
File details
Details for the file isagellm_benchmark-0.5.4.14.tar.gz.
File metadata
- Download URL: isagellm_benchmark-0.5.4.14.tar.gz
- Upload date:
- Size: 201.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ede3fa0b851feb2a73b26939900b4dc112948db0e6808714a5d414151952ac4 |
| MD5 | 9df6c111614b7d22b432f9a97b3f219e |
| BLAKE2b-256 | 2dacb49b67f87e9c63155a37db3984bb6933f5751e9e10441d98e21370acfd62 |
File details
Details for the file isagellm_benchmark-0.5.4.14-py2.py3-none-any.whl.
File metadata
- Download URL: isagellm_benchmark-0.5.4.14-py2.py3-none-any.whl
- Upload date:
- Size: 224.1 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | efaebd8b2737846da4f2441c9cbd8284ca95e556d0f8c9b79a4a4284468a16da |
| MD5 | 8a44da8ce5f28e65a9d2742758977d58 |
| BLAKE2b-256 | ba3b5f979af5077de38a6445e682be5aeff28a6f2b2de23070090fec299f8d87 |