sysml-bench


Benchmarking harness for evaluating LLM performance on SysML v2 model comprehension tasks with CLI tool augmentation. Part of Nomograph Labs.

Repository: gitlab.com/nomograph/sysml-bench

What This Is

A Python evaluation framework that measures how different tool configurations affect LLM accuracy on structured systems engineering tasks. The benchmark uses a real SysML v2 model corpus (an Eve Online Mining Frigate design) and tests five models across 40+ experimental conditions with 132 active tasks.

The primary question: does giving an LLM more tools improve its ability to answer questions about a SysML v2 model? The answer is nuanced — it depends on the task type, the model, and whether the agent receives guidance on tool selection.

Key Observations

14 observations from our evaluation. Selected highlights:

| Observation | Result |
|---|---|
| O12: Guided tool selection | One sentence of tool-selection guidance eliminates the discovery penalty entirely: guided graph scores 0.885 vs 0.750 unguided |
| O4: Render vs assembly | Pre-rendered views score 0.868 vs 0.399 for agentic assembly on explanation tasks (a 47pp gap at 6.6x lower cost) |
| O1: Tool-task interaction | Graph tools hurt discovery but help layer tasks; the effect is heterogeneous, and no single tool set dominates |
| O10: Corpus scaling | Performance collapses from 0.880 at 19 files to 0.423 at 95 files; neither graph tools nor vectors help at scale |
| O2: Model quality gap | Sonnet scores 0.880 vs 0.529 for the best OpenAI model (a 35pp gap), but GPT-4o-mini is 87x more cost-efficient |
| O8: CLI vs RAG | CLI dominates on discovery (+0.289); RAG edges out CLI on reasoning (+0.136) |
| O14: Structural traces | Graph tools do not help even on 4-5 hop traces over the 19-file corpus |

Corpus

Eve Online Mining Frigate — a SysML v2 model of a fitted mining ship designed for the Eve Online universe. 19 files, 798 elements, 1,515 relationships. Covers requirements, concerns, stakeholders, logical architecture, COTS modules, interfaces, verification cases, and rollup analysis.

A secondary corpus of 95 files (SysML v2 specification examples) is used for scaling experiments.

Task Categories

Primary corpus (Eve Online Mining Frigate, 19 files)

| Category | Count | File | Purpose |
|---|---|---|---|
| Discovery | 16 | eve_discovery.yaml | Extract attributes, follow references, enumerate elements |
| Reasoning | 12 | eve_reasoning.yaml | Multi-hop traversal, counterfactual analysis, exhaustive enumeration |
| Explanation | 8 | eve_explanation.yaml | Generate human-readable descriptions from model data |
| Generative | 8 | eve_generative.yaml | Open-ended generation tasks with LLM-scored evaluation |
| Layer | 20 | eve_layer_tools.yaml | RFLP layer classification, coverage metrics |
| Boundary | 8 | eve_boundary.yaml | 2-3 hop traversals at the graph-tool benefit threshold |
| Vector-sensitive | 8 | eve_vector_sensitive.yaml | Paraphrase-gap tasks testing semantic retrieval |
| Structural trace | 8 | eve_structural_trace.yaml | 4-5 hop traces and exhaustive chain enumeration |

Multi-corpus discovery (scaling and generalization)

| Corpus | Count | File | Purpose |
|---|---|---|---|
| SysML v2 examples | 20 | examples_discovery.yaml | Discovery on the 95-file specification corpus |
| Arrowhead | 6 | arrowhead_discovery.yaml | Discovery on the Arrowhead framework model |
| Drone | 4 | drone_discovery.yaml | Discovery on a drone system model |
| HVAC / HSUV | 6 | hvac_hsuv_discovery.yaml | Discovery on HVAC and hybrid SUV models |
| Vehicle | 8 | vehicle_discovery.yaml | Discovery on a vehicle system model |

Total: 132 active tasks across 13 task files. An additional 82 archived tasks from earlier experimental phases are in eval/tasks/archive/.

Tool Sets

| Tool Set | Tools | Schema Tokens | Description |
|---|---|---|---|
| cli_search | sysml_search, read_file | ~250 | Search + read. Baseline. |
| cli_graph | cli_search + sysml_trace, sysml_check, sysml_query, sysml_inspect | ~1,120 | Graph traversal tools |
| cli_render | cli_graph + sysml_render | ~1,300 | Server-side Markdown rendering |
| cli_full | cli_render + sysml_stat, sysml_plan | ~1,500 | Full tool set including planning |
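
The Schema Tokens column counts the JSON tool schemas that ride along with every request. As a rough, hypothetical illustration of where those tokens go (the parameter names below are invented for this sketch, not the harness's real schema), a single search tool in Anthropic Messages API tool format looks like:

# Hypothetical schema for sysml_search in Anthropic Messages API tool format.
# Parameter names are illustrative; the real schemas ship with the harness.
sysml_search_tool = {
    "name": "sysml_search",
    "description": "Search the SysML v2 corpus for elements matching a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search term or element name"},
            "limit": {"type": "integer", "description": "Maximum number of matches"},
        },
        "required": ["query"],
    },
}

Every tool added to a set enlarges this schema payload, which is what the spread from ~250 to ~1,500 tokens in the table measures.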

Scoring

Per-field structured scoring with the following types:

| Type | Mechanism |
|---|---|
| Bool | Exact boolean match → 1.0 or 0.0 |
| Float | Within tolerance (default ±0.05) → 1.0 or 0.0 |
| Str | Case-insensitive exact match with qualified-name suffix matching (A::B matches B) |
| StrContains | Case-insensitive substring match |
| ListStr | Set-based F1 score with threshold (default 0.8); supports qualified-name matching. Binary: ≥ threshold → 1.0, else 0.0 |

Task score = mean of field scores. Condition score = mean of task scores across N runs.
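
For concreteness, here is a minimal sketch of the ListStr rule and the two aggregation steps. The function names, the suffix-matching details, and the F1 bookkeeping are illustrative assumptions, not the harness's actual implementation:

# Sketch of ListStr scoring and score aggregation as described above.
# All names are illustrative, not the harness's real API.
from statistics import mean

def qualified_match(expected: str, actual: str) -> bool:
    # Case-insensitive; a qualified name A::B also matches a bare B.
    e, a = expected.lower(), actual.lower()
    return e == a or e.endswith("::" + a) or a.endswith("::" + e)

def score_list_str(expected: list[str], actual: list[str], threshold: float = 0.8) -> float:
    # Set-based F1, thresholded to a binary 1.0 / 0.0 score.
    if not expected and not actual:
        return 1.0
    tp = sum(any(qualified_match(e, a) for a in actual) for e in expected)
    precision = tp / len(actual) if actual else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 1.0 if f1 >= threshold else 0.0

def task_score(field_scores: list[float]) -> float:
    return mean(field_scores)  # task score = mean of field scores

def condition_score(per_run_task_scores: list[list[float]]) -> float:
    return mean(mean(run) for run in per_run_task_scores)  # mean across N runs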

Reproduction

Prerequisites

  • Python ≥3.12
  • uv package manager
  • nomograph-sysml CLI binary on $PATH
  • ANTHROPIC_API_KEY and/or OPENAI_API_KEY environment variables

Setup

git clone https://gitlab.com/nomograph/sysml-bench.git
cd sysml-bench
uv sync

Run an experiment

uv run python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_search \
    --runs 3 --max-turns 15 \
    --output results/my-experiment.json

With Docker

docker pull registry.gitlab.com/nomograph/sysml-bench:latest
docker run -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
    -v "$PWD/results:/results" \
    registry.gitlab.com/nomograph/sysml-bench:latest \
    python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_search --runs 1 --max-turns 15 \
    --output /results/my-experiment.json

With guided system prompt

uv run python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_graph \
    --system-prompt-file eval/prompts/guided_discovery.txt \
    --runs 5 --max-turns 15 \
    --output results/guided-experiment.json

With vector search

uv run python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_search --vectors \
    --runs 3 --max-turns 15 \
    --output results/vector-experiment.json

Results

Result JSON files are stored separately in nomograph/sysml-bench-results (private — contains model outputs and cost data). Each result file is a JSON document containing run metadata and per-task scored results with cost, token counts, and per-field scoring breakdowns.

When running experiments, output files are written to a local results/ directory (gitignored in this repo).
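
Assuming the structure described above, a result file can be summarized in a few lines of Python. The key names below are guesses for illustration only, not a documented schema:

# Hypothetical summary of a result file written by --output.
# Key names ("results", "task_id", "score", "cost_usd") are illustrative.
import json

with open("results/my-experiment.json") as f:
    run = json.load(f)

for task in run.get("results", []):  # per-task scored results
    print(task["task_id"], task["score"], task.get("cost_usd"))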

Dataset

The benchmark tasks and baseline results are available on HuggingFace:

from datasets import load_dataset
tasks = load_dataset("nomograph/sysml-v2-reasoning-benchmark", split="tasks")
results = load_dataset("nomograph/sysml-v2-reasoning-benchmark", split="results")

See nomograph/sysml-v2-reasoning-benchmark for the full dataset card.

Known Limitations

  1. Corpus size: Primary corpus is 19 files. Results may not generalize to production-scale models (1000+ files). Scaling experiments (O10) show dramatic performance drops at 95 files.

  2. Model versions: All results are from specific model snapshots (claude-sonnet-4-20250514, gpt-4o-2024-08-06, etc.). Results may differ with future model versions.

  3. Stochastic variance: Some tasks show high run-to-run variance (e.g., D12, D13 are bimodal). N=3-5 replication mitigates but does not eliminate this.

  4. ListStr scoring: The set-based F1 scorer handles flat string lists only. Compound answer types (list-of-dicts) require decomposition into separate fields.

  5. Single domain: All tasks are from a SysML v2 model. The observations may not generalize to other modeling languages or domains.

License

MIT — see LICENSE

Citation

@article{dunn2026sysmlbench,
  title={sysml-bench: Evaluating Tool-Augmented {LLMs} on {SysML} v2 Model Comprehension},
  author={Dunn, Andrew},
  year={2026},
  journal={arXiv preprint},
  url={https://gitlab.com/nomograph/sysml-bench},
  note={Nomograph Labs}
}
