sysml-bench


Benchmarking harness for evaluating LLM performance on SysML v2 model comprehension tasks with CLI tool augmentation. Part of Nomograph Labs.

Repository: gitlab.com/nomograph/sysml-bench

What This Is

A Python evaluation framework that measures how different tool configurations affect LLM accuracy on structured systems engineering tasks. The benchmark uses a real SysML v2 model corpus (an Eve Online Mining Frigate design) and tests five models across 40+ experimental conditions with 132 active tasks.

The primary question: does giving an LLM more tools improve its ability to answer questions about a SysML v2 model? The answer is nuanced — it depends on the task type, the model, and whether the agent receives guidance on tool selection.

Key Observations

14 observations from our evaluation. Selected highlights:

| Observation | Result |
|---|---|
| O12: Guided tool selection | One sentence of tool-selection guidance eliminates the discovery penalty entirely: guided graph scores 0.885 vs 0.750 unguided |
| O4: Render vs assembly | Pre-rendered views score 0.868 vs 0.399 for agentic assembly on explanation tasks (a 47pp gap at 6.6x lower cost) |
| O1: Tool-task interaction | Graph tools hurt discovery but help layer tasks; the effect is heterogeneous, and no single tool set dominates |
| O10: Corpus scaling | Performance collapses from 0.880 at 19 files to 0.423 at 95 files; neither graph tools nor vectors help at scale |
| O2: Model quality gap | Sonnet scores 0.880 vs 0.529 for the best OpenAI model (a 35pp gap), but GPT-4o-mini is 87x more cost-efficient |
| O8: CLI vs RAG | CLI dominates on discovery (+0.289); RAG edges out CLI on reasoning (+0.136) |
| O14: Structural traces | Graph tools do not help even on 4-5 hop traces over the 19-file corpus |

Corpus

Eve Online Mining Frigate — a SysML v2 model of a fitted mining ship designed for the Eve Online universe. 19 files, 798 elements, 1,515 relationships. Covers requirements, concerns, stakeholders, logical architecture, COTS modules, interfaces, verification cases, and rollup analysis.

A secondary corpus of 95 files (SysML v2 specification examples) is used for scaling experiments.

Task Categories

Primary corpus (Eve Online Mining Frigate, 19 files)

| Category | Count | File | Purpose |
|---|---|---|---|
| Discovery | 16 | eve_discovery.yaml | Extract attributes, follow references, enumerate elements |
| Reasoning | 12 | eve_reasoning.yaml | Multi-hop traversal, counterfactual analysis, exhaustive enumeration |
| Explanation | 8 | eve_explanation.yaml | Generate human-readable descriptions from model data |
| Generative | 8 | eve_generative.yaml | Open-ended generation tasks with LLM-scored evaluation |
| Layer | 20 | eve_layer_tools.yaml | RFLP layer classification, coverage metrics |
| Boundary | 8 | eve_boundary.yaml | 2-3 hop traversals at the graph-tool benefit threshold |
| Vector-sensitive | 8 | eve_vector_sensitive.yaml | Paraphrase-gap tasks testing semantic retrieval |
| Structural trace | 8 | eve_structural_trace.yaml | 4-5 hop traces and exhaustive chain enumeration |

Multi-corpus discovery (scaling and generalization)

| Corpus | Count | File | Purpose |
|---|---|---|---|
| SysML v2 examples | 20 | examples_discovery.yaml | Discovery on the 95-file specification corpus |
| Arrowhead | 6 | arrowhead_discovery.yaml | Discovery on the Arrowhead framework model |
| Drone | 4 | drone_discovery.yaml | Discovery on a drone system model |
| HVAC / HSUV | 6 | hvac_hsuv_discovery.yaml | Discovery on HVAC and hybrid SUV models |
| Vehicle | 8 | vehicle_discovery.yaml | Discovery on a vehicle system model |

Total: 132 active tasks across 13 task files. An additional 82 archived tasks from earlier experimental phases are in eval/tasks/archive/.

Tool Sets

| Tool Set | Tools | Schema Tokens | Description |
|---|---|---|---|
| cli_search | sysml_search, read_file | ~250 | Search + read. Baseline. |
| cli_graph | cli_search + sysml_trace, sysml_check, sysml_query, sysml_inspect | ~1,120 | Graph traversal tools |
| cli_render | cli_graph + sysml_render | ~1,300 | Server-side Markdown rendering |
| cli_full | cli_render + sysml_stat, sysml_plan | ~1,500 | Full tool set including planning |
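
The Schema Tokens column counts the JSON tool schemas that ride along with every request. As a rough, hypothetical illustration of where those tokens go (the parameter names below are invented for this sketch, not the harness's real schema), a single search tool in Anthropic Messages API tool format looks like:

# Hypothetical schema for sysml_search in Anthropic Messages API tool format.
# Parameter names are illustrative; the real schemas ship with the harness.
sysml_search_tool = {
    "name": "sysml_search",
    "description": "Search the SysML v2 corpus for elements matching a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search term or element name"},
            "limit": {"type": "integer", "description": "Maximum number of matches"},
        },
        "required": ["query"],
    },
}

Every tool added to a set enlarges this schema payload, which is what the spread from ~250 to ~1,500 tokens in the table measures.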

Scoring

Per-field structured scoring with the following types:

| Type | Mechanism |
|---|---|
| Bool | Exact boolean match → 1.0 or 0.0 |
| Float | Within tolerance (default ±0.05) → 1.0 or 0.0 |
| Str | Case-insensitive exact match with qualified-name suffix matching (A::B matches B) |
| StrContains | Case-insensitive substring match |
| ListStr | Set-based F1 score with threshold (default 0.8); supports qualified-name matching. Binary: ≥ threshold → 1.0, else 0.0 |

Task score = mean of field scores. Condition score = mean of task scores across N runs.
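
For concreteness, here is a minimal sketch of the ListStr rule and the two aggregation steps. The function names, the suffix-matching details, and the F1 bookkeeping are illustrative assumptions, not the harness's actual implementation:

# Sketch of ListStr scoring and score aggregation as described above.
# All names are illustrative, not the harness's real API.
from statistics import mean

def qualified_match(expected: str, actual: str) -> bool:
    # Case-insensitive; a qualified name A::B also matches a bare B.
    e, a = expected.lower(), actual.lower()
    return e == a or e.endswith("::" + a) or a.endswith("::" + e)

def score_list_str(expected: list[str], actual: list[str], threshold: float = 0.8) -> float:
    # Set-based F1, thresholded to a binary 1.0 / 0.0 score.
    if not expected and not actual:
        return 1.0
    tp = sum(any(qualified_match(e, a) for a in actual) for e in expected)
    precision = tp / len(actual) if actual else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 1.0 if f1 >= threshold else 0.0

def task_score(field_scores: list[float]) -> float:
    return mean(field_scores)  # task score = mean of field scores

def condition_score(per_run_task_scores: list[list[float]]) -> float:
    return mean(mean(run) for run in per_run_task_scores)  # mean across N runs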

Reproduction

Prerequisites

  • Python ≥3.12
  • uv package manager
  • nomograph-sysml CLI binary on $PATH
  • ANTHROPIC_API_KEY and/or OPENAI_API_KEY environment variables

Setup

git clone https://gitlab.com/nomograph/sysml-bench.git
cd sysml-bench
uv sync

Run an experiment

uv run python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_search \
    --runs 3 --max-turns 15 \
    --output results/my-experiment.json

With Docker

docker pull registry.gitlab.com/nomograph/sysml-bench:latest
docker run -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
    -v "$PWD/results:/results" \
    registry.gitlab.com/nomograph/sysml-bench:latest \
    python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_search --runs 1 --max-turns 15 \
    --output /results/my-experiment.json

With guided system prompt

uv run python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_graph \
    --system-prompt-file eval/prompts/guided_discovery.txt \
    --runs 5 --max-turns 15 \
    --output results/guided-experiment.json

With vector search

uv run python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_search --vectors \
    --runs 3 --max-turns 15 \
    --output results/vector-experiment.json

Results

Result JSON files are stored separately in nomograph/sysml-bench-results (private — contains model outputs and cost data). Each result file is a JSON document containing run metadata and per-task scored results with cost, token counts, and per-field scoring breakdowns.

When running experiments, output files are written to a local results/ directory (gitignored in this repo).
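
Assuming the structure described above, a result file can be summarized in a few lines of Python. The key names below are guesses for illustration only, not a documented schema:

# Hypothetical summary of a result file written by --output.
# Key names ("results", "task_id", "score", "cost_usd") are illustrative.
import json

with open("results/my-experiment.json") as f:
    run = json.load(f)

for task in run.get("results", []):  # per-task scored results
    print(task["task_id"], task["score"], task.get("cost_usd"))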

Dataset

The benchmark tasks and baseline results are available on HuggingFace:

from datasets import load_dataset
tasks = load_dataset("nomograph/sysml-v2-reasoning-benchmark", split="tasks")
results = load_dataset("nomograph/sysml-v2-reasoning-benchmark", split="results")

See nomograph/sysml-v2-reasoning-benchmark for the full dataset card.

Known Limitations

  1. Corpus size: Primary corpus is 19 files. Results may not generalize to production-scale models (1000+ files). Scaling experiments (O10) show dramatic performance drops at 95 files.

  2. Model versions: All results are from specific model snapshots (claude-sonnet-4-20250514, gpt-4o-2024-08-06, etc.). Results may differ with future model versions.

  3. Stochastic variance: Some tasks show high run-to-run variance (e.g., D12, D13 are bimodal). N=3-5 replication mitigates but does not eliminate this.

  4. ListStr scoring: The set-based F1 scorer handles flat string lists only. Compound answer types (list-of-dicts) require decomposition into separate fields.

  5. Single domain: All tasks are from a SysML v2 model. The observations may not generalize to other modeling languages or domains.

License

MIT — see LICENSE

Citation

@article{dunn2026sysmlbench,
  title={sysml-bench: Evaluating Tool-Augmented {LLMs} on {SysML} v2 Model Comprehension},
  author={Dunn, Andrew},
  year={2026},
  journal={arXiv preprint},
  url={https://gitlab.com/nomograph/sysml-bench},
  note={Nomograph Labs}
}
