# sysml-bench

LLM tool augmentation benchmark for SysML v2 model comprehension.
Benchmarking harness for evaluating LLM performance on SysML v2 model comprehension tasks with CLI tool augmentation. Part of Nomograph Labs.
Repository: gitlab.com/nomograph/sysml-bench
## What This Is
A Python evaluation framework that measures how different tool configurations affect LLM accuracy on structured systems engineering tasks. The benchmark uses a real SysML v2 model corpus (an Eve Online Mining Frigate design) and tests five models across 40+ experimental conditions with 132 active tasks.
The primary question: does giving an LLM more tools improve its ability to answer questions about a SysML v2 model? The answer is nuanced — it depends on the task type, the model, and whether the agent receives guidance on tool selection.
## Key Observations

Our evaluation produced 14 observations. Selected highlights:
| Observation | Result |
|---|---|
| O12: Guided tool selection | One sentence of tool selection guidance eliminates the discovery penalty entirely — guided graph 0.885 vs unguided 0.750 |
| O4: Render vs assembly | Pre-rendered views score 0.868 vs 0.399 for agentic assembly on explanation tasks (47pp gap, 6.6x lower cost) |
| O1: Tool-task interaction | Graph tools hurt discovery but help layer tasks. The effect is heterogeneous — no single tool set dominates |
| O10: Corpus scaling | Performance collapses from 0.880 at 19 files to 0.423 at 95 files; graph tools and vectors don't help at scale |
| O2: Model quality gap | Sonnet 0.880 vs best OpenAI 0.529 — 35pp gap, but GPT-4o-mini is 87x more cost-efficient |
| O8: CLI vs RAG | CLI dominates on discovery (+0.289); RAG edges CLI on reasoning (+0.136) |
| O14: Structural traces | Graph tools don't help even at 4-5 hop traces on a 19-file corpus |
## Corpus
Eve Online Mining Frigate — a SysML v2 model of a fitted mining ship designed for the Eve Online universe. 19 files, 798 elements, 1,515 relationships. Covers requirements, concerns, stakeholders, logical architecture, COTS modules, interfaces, verification cases, and rollup analysis.
A secondary corpus of 95 files (SysML v2 specification examples) is used for scaling experiments.
## Task Categories
### Primary corpus (Eve Online Mining Frigate, 19 files)
| Category | Count | File | Purpose |
|---|---|---|---|
| Discovery | 16 | `eve_discovery.yaml` | Extract attributes, follow references, enumerate elements |
| Reasoning | 12 | `eve_reasoning.yaml` | Multi-hop traversal, counterfactual analysis, exhaustive enumeration |
| Explanation | 8 | `eve_explanation.yaml` | Generate human-readable descriptions from model data |
| Generative | 8 | `eve_generative.yaml` | Open-ended generation tasks with LLM-scored evaluation |
| Layer | 20 | `eve_layer_tools.yaml` | RFLP layer classification, coverage metrics |
| Boundary | 8 | `eve_boundary.yaml` | 2-3 hop traversals at the graph-tool benefit threshold |
| Vector-sensitive | 8 | `eve_vector_sensitive.yaml` | Paraphrase-gap tasks testing semantic retrieval |
| Structural trace | 8 | `eve_structural_trace.yaml` | 4-5 hop traces and exhaustive chain enumeration |
### Multi-corpus discovery (scaling and generalization)
| Corpus | Count | File | Purpose |
|---|---|---|---|
| SysML v2 examples | 20 | `examples_discovery.yaml` | Discovery on 95-file specification corpus |
| Arrowhead | 6 | `arrowhead_discovery.yaml` | Discovery on Arrowhead framework model |
| Drone | 4 | `drone_discovery.yaml` | Discovery on drone system model |
| HVAC / HSUV | 6 | `hvac_hsuv_discovery.yaml` | Discovery on HVAC and hybrid SUV models |
| Vehicle | 8 | `vehicle_discovery.yaml` | Discovery on vehicle system model |
Total: 132 active tasks across 13 task files. An additional 82 archived tasks from earlier experimental phases are in `eval/tasks/archive/`.
## Tool Sets
| Tool Set | Tools | Schema Tokens | Description |
|---|---|---|---|
| `cli_search` | `sysml_search`, `read_file` | ~250 | Search + read. Baseline. |
| `cli_graph` | + `sysml_trace`, `sysml_check`, `sysml_query`, `sysml_inspect` | ~1120 | Graph traversal tools |
| `cli_render` | `cli_graph` + `sysml_render` | ~1300 | Server-side Markdown rendering |
| `cli_full` | `cli_render` + `sysml_stat`, `sysml_plan` | ~1500 | Full tool set including planning |
## Scoring
Per-field structured scoring with the following types:
| Type | Mechanism |
|---|---|
| Bool | Exact boolean match → 1.0 or 0.0 |
| Float | Within tolerance (default ±0.05) → 1.0 or 0.0 |
| Str | Case-insensitive exact match with qualified-name suffix matching (`A::B` matches `B`) |
| StrContains | Case-insensitive substring match |
| ListStr | Set-based F1 score with threshold (default 0.8). Supports qualified-name matching. Binary: ≥ threshold → 1.0, else 0.0 |
Task score = mean of field scores. Condition score = mean of task scores across N runs.
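To make the ListStr mechanism concrete, here is a minimal sketch of a set-based F1 scorer with qualified-name suffix matching. The function and helper names (`score_list_str`, `qname_match`) are illustrative, not the benchmark's actual API; only the mechanism (binary F1 against a threshold, `A::B` matching `B`) comes from the table above.

```python
def qname_match(expected: str, actual: str) -> bool:
    """Case-insensitive match where a bare name matches its qualified form
    (e.g. 'A::B' matches 'B', in either direction)."""
    e, a = expected.lower(), actual.lower()
    return e == a or e.endswith("::" + a) or a.endswith("::" + e)

def score_list_str(expected: list[str], actual: list[str],
                   threshold: float = 0.8) -> float:
    """Set-based F1, thresholded to a binary score: >= threshold -> 1.0, else 0.0."""
    if not expected and not actual:
        return 1.0
    # Count items on each side that find at least one match on the other.
    matched_expected = sum(any(qname_match(e, a) for a in actual) for e in expected)
    matched_actual = sum(any(qname_match(e, a) for e in expected) for a in actual)
    precision = matched_actual / len(actual) if actual else 0.0
    recall = matched_expected / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 if f1 >= threshold else 0.0
```

For example, `score_list_str(["Pkg::Drill", "Pkg::Laser"], ["drill", "laser"])` scores 1.0 under suffix matching, while a list missing two of three expected items falls below the 0.8 threshold and scores 0.0.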
## Reproduction

### Prerequisites

- Python ≥ 3.12
- `uv` package manager
- `nomograph-sysml` CLI binary on `$PATH`
- `ANTHROPIC_API_KEY` and/or `OPENAI_API_KEY` environment variables
### Setup

```shell
git clone https://gitlab.com/nomograph/sysml-bench.git
cd sysml-bench
uv sync
```
### Run an experiment

```shell
uv run python -m eval.llm_cli \
  --task-file eval/tasks/eve_discovery.yaml \
  --models claude-sonnet-4-20250514 \
  --tool-set cli_search \
  --runs 3 --max-turns 15 \
  --output results/my-experiment.json
```
### With Docker

Mount a host directory at `/results` so the output file survives the container:

```shell
docker pull registry.gitlab.com/nomograph/sysml-bench:latest
docker run -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v "$(pwd)/results:/results" \
  registry.gitlab.com/nomograph/sysml-bench:latest \
  python -m eval.llm_cli \
  --task-file eval/tasks/eve_discovery.yaml \
  --models claude-sonnet-4-20250514 \
  --tool-set cli_search --runs 1 --max-turns 15 \
  --output /results/my-experiment.json
```
### With guided system prompt

```shell
uv run python -m eval.llm_cli \
  --task-file eval/tasks/eve_discovery.yaml \
  --models claude-sonnet-4-20250514 \
  --tool-set cli_graph \
  --system-prompt-file eval/prompts/guided_discovery.txt \
  --runs 5 --max-turns 15 \
  --output results/guided-experiment.json
```
### With vector search

```shell
uv run python -m eval.llm_cli \
  --task-file eval/tasks/eve_discovery.yaml \
  --models claude-sonnet-4-20250514 \
  --tool-set cli_search --vectors \
  --runs 3 --max-turns 15 \
  --output results/vector-experiment.json
```
## Results
Result JSON files are stored separately in nomograph/sysml-bench-results (private — contains model outputs and cost data). Each result file is a JSON document containing run metadata and per-task scored results with cost, token counts, and per-field scoring breakdowns.
When running experiments, output files are written to a local `results/` directory (gitignored in this repo).
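The aggregation described under Scoring (task score = mean over N runs, condition score = mean over tasks) can be sketched against such a result file. The JSON field names used here (`results`, `task_id`, `score`) are assumptions for illustration, not the file's documented schema.

```python
import json
from collections import defaultdict
from statistics import mean

def condition_score(path: str) -> float:
    """Condition score for one result file: mean over tasks of the
    per-task mean score across runs. Field names are assumed, not the
    benchmark's actual schema."""
    with open(path) as f:
        doc = json.load(f)
    runs_by_task = defaultdict(list)
    for record in doc["results"]:
        runs_by_task[record["task_id"]].append(record["score"])
    # Task score = mean over its N runs; condition score = mean over tasks.
    return mean(mean(scores) for scores in runs_by_task.values())
```

Note that averaging per-task first keeps a task with many runs from outweighing one with few.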
## Dataset

The benchmark tasks and baseline results are available on HuggingFace:

```python
from datasets import load_dataset

tasks = load_dataset("nomograph/sysml-v2-reasoning-benchmark", split="tasks")
results = load_dataset("nomograph/sysml-v2-reasoning-benchmark", split="results")
```

See nomograph/sysml-v2-reasoning-benchmark for the full dataset card.
## Known Limitations

- **Corpus size:** Primary corpus is 19 files. Results may not generalize to production-scale models (1000+ files). Scaling experiments (O10) show dramatic performance drops at 95 files.
- **Model versions:** All results are from specific model snapshots (claude-sonnet-4-20250514, gpt-4o-2024-08-06, etc.). Results may differ with future model versions.
- **Stochastic variance:** Some tasks show high run-to-run variance (e.g., D12, D13 are bimodal). N=3-5 replication mitigates but does not eliminate this.
- **ListStr scoring:** The set-based F1 scorer handles flat string lists only. Compound answer types (list-of-dicts) require decomposition into separate fields.
- **Single domain:** All tasks are from a SysML v2 model. The observations may not generalize to other modeling languages or domains.
## License
MIT — see LICENSE
## Citation

```bibtex
@article{dunn2026sysmlbench,
  title={sysml-bench: Evaluating Tool-Augmented {LLMs} on {SysML} v2 Model Comprehension},
  author={Dunn, Andrew},
  year={2026},
  journal={arXiv preprint},
  url={https://gitlab.com/nomograph/sysml-bench},
  note={Nomograph Labs}
}
```
## File details

### sysml_bench-0.1.1.tar.gz

- Size: 308.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.2.0 CPython/3.12.13

| Algorithm | Hash digest |
|---|---|
| SHA256 | `98d7b1096e23138df4513fd659721a4c1888183355ce0c13669a81cfcae766b9` |
| MD5 | `9b979b61bfc06a2e698beb615b8b0195` |
| BLAKE2b-256 | `dd0849f855067b0c7af15ff708345c15e6ce8372f5ff8297bbd7e7da4e7bfd2b` |
### sysml_bench-0.1.1-py3-none-any.whl

- Size: 132.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.2.0 CPython/3.12.13

| Algorithm | Hash digest |
|---|---|
| SHA256 | `2cc3f0f10b236b136dd75daa61eea640a68ef37f23f9e7765bd25bbf09f49bcb` |
| MD5 | `41b5adf811d083f0b65155c8427fe3f5` |
| BLAKE2b-256 | `dc3cb5e2275961b3f78869b05285c2afb4c3c1f169621ebfc52e3af4f6ec26ff` |