SysML v2 grammar for tree-sitter

tree-sitter-sysml

pipeline status parse coverage npm crates

Tree-sitter grammar for SysML v2, the next-generation systems modeling language from the OMG.

SysML v2 replaces the diagram-centric SysML v1 with a textual notation designed for Model-Based Systems Engineering (MBSE). This parser turns that textual notation into concrete syntax trees that editors, linters, and developer tools can consume.

Why This Exists

SysML v2 is a large language — roughly 120 grammar rules covering packages, definitions, usages, constraints, requirements, state machines, actions, flows, views, and more. The only existing parser is the Xtext-based pilot implementation from the OMG, which is tightly coupled to Eclipse.

This tree-sitter grammar provides a standalone, incremental parser with no IDE dependency. Our primary use case is embedding it in Rust CLI tools and MCP (Model Context Protocol) servers for AI-assisted systems engineering — but it works anywhere tree-sitter does: Neovim, Helix, Zed, VS Code, Emacs, and any application using the tree-sitter C library.

Status

Parse coverage is tested on every push against 393 real-world SysML v2 files from 8 independent sources (see badge above).

| Metric | Value |
|---|---|
| Corpus Tests | 192 passing |
| Negative Tests | 18 (12 syntactic, 6 structural) |
| External File Coverage | 393 files across 8 corpora |
| Bindings | C, Rust, Go, Python, Node.js, Swift |
| Queries | highlights, tags, locals, folds, indents |

See parse-coverage.md for per-corpus breakdown and details on any unparseable files.

How the Corpus Was Assembled

Most tree-sitter grammars have the luxury of millions of open-source files to test against. SysML v2 does not — the language was published in 2023 and adoption is early. We assembled test material from every public source we could find:

| Source | Files | Description |
|---|---|---|
| OMG Training sysml/src/training/ | 100 | Official tutorial files covering all major constructs |
| OMG Examples sysml/src/examples/ | 95 | Additional worked examples from the spec authors |
| OMG Validation sysml/src/validation/ | 56 | Validation suite from the reference implementation |
| OMG Standard Library sysml.library/ | 58 | Library definitions (KerML + SysML base types) |
| Sensmetry Advent | 44 | Community examples from "Advent of SysML v2" |
| GfSE Models | 36 | German systems engineering society models |
| SYSMOD | 1 | SYSMOD methodology example |
| Sensmetry SmartHome | 3 | Smart home hub example |
| Total | 393 | |

The training files were the development target — every grammar change was validated against all 100 training files. The remaining corpora serve as independent validation: the grammar was never specifically tuned to pass them, so their pass rate reflects genuine generalization.
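The headline numbers can be cross-checked with a little arithmetic. The sketch below copies the per-corpus file counts from the table above and assumes that the 6 unparseable files listed under Known Limitations are counted within the 393; the function name `coverage` is illustrative and not part of the project's scripts.

```python
# File counts per corpus, copied from the table above.
CORPUS_FILES = {
    "OMG Training": 100,
    "OMG Examples": 95,
    "OMG Validation": 56,
    "OMG Standard Library": 58,
    "Sensmetry Advent": 44,
    "GfSE Models": 36,
    "SYSMOD": 1,
    "Sensmetry SmartHome": 3,
}

# Taken from the Known Limitations list; assumed to be a subset of the 393 files.
UNPARSEABLE = 6


def coverage(total, failed):
    """Return the share of files that parse cleanly, as a percentage."""
    return 100.0 * (total - failed) / total


total = sum(CORPUS_FILES.values())
print(total, round(coverage(total, UNPARSEABLE), 1))  # prints: 393 98.5
```

Under that assumption, 387 of 393 files parse, which is where a figure of about 98% comes from.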

We need more corpus. If you have SysML v2 files (from coursework, research, industry projects, or personal experiments), we would love to test against them. Even files that break the parser are valuable — especially those. See Contributing.

Grammar Approach

The Brute-Force Strategy

This grammar was developed empirically, not derived from the SysML v2 KEBNF specification. The approach:

  1. Start with the simplest possible grammar rules
  2. Try to parse a training file
  3. When it fails, look at the error, add or modify the rule
  4. Regenerate, re-test all files, repeat

This "brute-force" loop ran for hundreds of iterations. The result is a grammar that reliably parses real SysML v2, but makes pragmatic trade-offs that a spec-derived grammar would not.
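A loop like this only works if every iteration re-checks the files that already passed. The following is a minimal sketch of that bookkeeping, not code from this repository: `compare_runs` and the `{file: parsed_ok}` mapping are hypothetical shapes chosen for illustration.

```python
def compare_runs(before, after):
    """Compare two {file: parsed_ok} mappings from successive test runs.

    Returns (fixed, regressed): files that newly parse, and files that
    parsed before the grammar change but no longer do.
    """
    fixed = sorted(
        name for name, ok in after.items() if ok and not before.get(name, False)
    )
    regressed = sorted(
        name for name, ok in after.items() if not ok and before.get(name, False)
    )
    return fixed, regressed


# Example: one file fixed by the change, one file broken by it.
previous = {"a.sysml": True, "b.sysml": False, "c.sysml": True}
current = {"a.sysml": True, "b.sysml": True, "c.sysml": False}
fixed, regressed = compare_runs(previous, current)
print("fixed:", fixed, "regressed:", regressed)
```

A grammar change would be kept only when the regressed list is empty, which matches the "re-test all files" step above.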

Trade-offs

Over-acceptance (deliberate). The grammar does not enforce context-sensitive body rules. For example, a control_node (only valid inside action bodies) will parse without error inside a part body. This keeps the grammar simpler and more resilient to spec evolution, at the cost of accepting some invalid programs. Editors and linters should handle semantic validation — the parser's job is to produce a usable tree.
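To illustrate what such downstream validation might look like, here is a small sketch of a lint pass that flags a control_node outside an action body. It uses a stand-in `Node` class rather than real tree-sitter nodes, and every node kind except `control_node` (for example `part_usage` and `action_usage`) is a hypothetical name that would need checking against src/node-types.json.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a syntax-tree node; a real tool would walk tree-sitter nodes."""
    kind: str
    children: list = field(default_factory=list)


# Only `control_node` is named in this README. The container kinds below are
# hypothetical placeholders, not confirmed node types from the grammar.
ACTION_CONTAINERS = {"action_definition", "action_usage"}
BODY_OWNERS = ACTION_CONTAINERS | {"part_definition", "part_usage"}


def misplaced_control_nodes(node, owner=None):
    """Return the kind of each nearest body owner that wrongly holds a control_node."""
    problems = []
    if node.kind == "control_node" and owner not in ACTION_CONTAINERS:
        problems.append(owner)
    if node.kind in BODY_OWNERS:
        owner = node.kind
    for child in node.children:
        problems.extend(misplaced_control_nodes(child, owner))
    return problems


# The parser accepts this tree; the lint pass reports it.
tree = Node("part_usage", [Node("control_node")])
print(misplaced_control_nodes(tree))
```

The parser stays permissive and the rule lives in a separate pass, so the rule can change with the spec without regenerating the grammar.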

Flat member lists. Rather than maintaining separate member type lists for structural vs. behavioral contexts (which the spec requires), every body accepts a unified _usage_member rule. This avoids exponential conflict growth in the LR parse table.

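Expression precedence is approximated. Binary operators are declared with tree-sitter's prec.left, so they associate to the left, with precedence levels chosen to follow standard mathematical convention. The resulting trees may not match the SysML v2 spec in edge cases.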
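For readers unfamiliar with what left association implies for the tree shape, the sketch below shows the same idea with a tiny precedence-climbing parser in Python. It is not the project's grammar.js and does not use tree-sitter; the operator table and its precedence levels are illustrative assumptions, not values taken from the grammar or the spec.

```python
import re

# Hypothetical precedence levels for illustration only; higher binds tighter.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def tokenize(text):
    """Split an expression into identifiers, numbers, operators, and parentheses."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[-+*/()]", text)


def parse_atom(tokens):
    """Parse an identifier, a number, or a parenthesized expression."""
    token = tokens.pop(0)
    if token == "(":
        inner = parse_expression(tokens)
        tokens.pop(0)  # consume the closing ")"
        return inner
    return token


def parse_expression(tokens, min_prec=1):
    """Precedence climbing in which every binary operator is left-associative."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left association: the right operand may only contain operators
        # that bind strictly tighter than `op`.
        right = parse_expression(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left


def parse(text):
    return parse_expression(tokenize(text))


print(parse("a - b - c"))  # prints: ('-', ('-', 'a', 'b'), 'c')
```

With left association, `a - b - c` groups as `(a - b) - c`. Any operator that the spec defines as right-associative would come out with the wrong shape under this scheme, which is the kind of edge case the paragraph above refers to.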

Could This Be Done Better?

Almost certainly. Some ideas we haven't tried:

  • Derive the grammar from the KEBNF — The SysML v2 specification includes a formal grammar in KEBNF notation. A careful translation to tree-sitter rules could produce a more precise parser, but KEBNF uses features (like ordered alternation) that don't map directly to tree-sitter's GLR parser.
  • Use an external scanner — For constructs like implicit action bodies (brace-less blocks), an external scanner could maintain context state. We avoided this to keep the grammar self-contained.
  • Context-sensitive body rules — Separate member lists per body type (structural, behavioral, etc.) would reject more invalid syntax but at significant grammar complexity cost.
  • Hybrid approach — Use the empirical grammar as a baseline, then systematically tighten it against the KEBNF rule by rule.

If you have experience with tree-sitter grammars for large languages and want to suggest improvements to the approach, we'd welcome the discussion. Open an issue.

Construct Coverage

| Category | Constructs |
|---|---|
| Packages | package, library package, import, alias |
| Definitions | part, item, port, action, state, constraint, requirement, use case, interface, allocation, analysis, case, verification, occurrence, individual, connection, flow, attribute, enumeration, metadata |
| Usages | All definition types as usages, plus ref, end, connect, bind, event occurrence, timeslice, snapshot, variant, exhibit, concern, stakeholder, actor, objective |
| Specialization | :>, specializes, :>>, redefines, subsets, references |
| Multiplicity | [n], [n..m], [n..*], ordered, nonunique |
| Comments | //, /* */, //* */, doc, comment about, locale |
| Connections | connect, bind, interface, allocation |
| Flows | flow, flow def, message, succession flow |
| Actions | first/then, perform, accept/send, if/while/for/loop, assign, terminate |
| States | state def/state, entry/do/exit, transitions with triggers and guards |
| Constraints | constraint def/constraint, assert, require, assume |
| Requirements | requirement def/requirement, satisfy, verify, subject, actor, stakeholder |
| Expressions | Arithmetic, comparison, logical operators, invocations, select (.?), collect (.), index (#) |
| Views | view, viewpoint, rendering, expose, filter |
| Metadata | @metadata, #prefixAnnotation |
| Variations | variation, variant |
| Calculations | calc def/calc, return |

Installation

Rust

[dependencies]
tree-sitter-sysml = "0.1"

Node.js

npm install tree-sitter-sysml

Go

import "github.com/nomograph-ai/tree-sitter-sysml/bindings/go"

Python

pip install tree-sitter-sysml

Usage

Rust

use tree_sitter::Parser;

fn main() {
    let mut parser = Parser::new();
    parser.set_language(&tree_sitter_sysml::LANGUAGE.into()).unwrap();

    let source = r#"
        package Vehicle {
            part def Engine {
                attribute horsePower : Real;
            }
            part engine : Engine;
        }
    "#;

    let tree = parser.parse(source, None).unwrap();
    println!("{}", tree.root_node().to_sexp());
}

Node.js

const Parser = require('tree-sitter');
const SysML = require('tree-sitter-sysml');

const parser = new Parser();
parser.setLanguage(SysML);

const tree = parser.parse(`
package Vehicle {
    part def Engine {
        attribute horsePower : Real;
    }
    part engine : Engine;
}
`);

console.log(tree.rootNode.toString());

Python

import tree_sitter_sysml as tssysml
from tree_sitter import Language, Parser

SYSML_LANGUAGE = Language(tssysml.language())
parser = Parser(SYSML_LANGUAGE)

tree = parser.parse(b"""
package Vehicle {
    part def Engine {
        attribute horsePower : Real;
    }
    part engine : Engine;
}
""")

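# str() on a node yields its S-expression; recent py-tree-sitter releases no longer provide Node.sexp().
print(str(tree.root_node))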

Project Layout

tree-sitter-sysml/
├── grammar.js              # The grammar definition (~2400 lines)
├── src/
│   ├── parser.c            # Generated parser (do not edit)
│   ├── grammar.json        # Generated grammar metadata
│   ├── node-types.json     # Generated node type definitions
│   └── tree_sitter/        # Tree-sitter C library headers
├── queries/
│   ├── highlights.scm      # Syntax highlighting queries
│   ├── tags.scm            # Code navigation (symbol tags)
│   ├── locals.scm          # Scope-aware variable resolution
│   ├── folds.scm           # Code folding regions
│   └── indents.scm         # Auto-indentation rules
├── test/
│   ├── corpus/             # 192 tree-sitter corpus tests
│   │   ├── actions.txt     #   Control flow, send, accept, assign
│   │   ├── attributes.txt  #   Attribute definitions and usages
│   │   ├── calculations.txt#   Calc definitions with return
│   │   ├── connections.txt #   Connect, bind, interface, allocation
│   │   ├── constraints.txt #   Constraint definitions and assertions
│   │   ├── definitions.txt #   All definition types
│   │   ├── expressions.txt #   Operators, invocations, special exprs
│   │   ├── flows.txt       #   Flow definitions and messages
│   │   ├── metadata.txt    #   Metadata annotations
│   │   ├── packages.txt    #   Packages, imports, aliases, comments
│   │   ├── requirements.txt#   Requirements, satisfy, verify
│   │   ├── states.txt      #   State machines, transitions
│   │   ├── successions.txt #   First/then succession chains
│   │   ├── usages.txt      #   All usage types
│   │   └── views.txt       #   Views, viewpoints, rendering
│   └── invalid/            # 18 negative tests (should fail to parse)
│       ├── syntactic/      #   12 tests: bad tokens, missing delimiters
│       └── structural/     #   6 tests: wrong nesting contexts
├── examples/               # 5 curated SysML v2 example files
│   ├── vehicle.sysml       #   Part definitions, attributes, ports
│   ├── requirements.sysml  #   Requirements with satisfy/verify
│   ├── state-machine.sysml #   State definitions with transitions
│   ├── use-cases.sysml     #   Use case with actors and objectives
│   └── verification.sysml  #   Verification with test cases
├── bindings/               # Language bindings
│   ├── c/                  #   C header and pkg-config
│   ├── rust/               #   Rust crate (lib.rs)
│   ├── go/                 #   Go module
│   ├── node/               #   Node.js addon + binding test
│   ├── python/             #   Python package + binding test
│   └── swift/              #   Swift package
├── scripts/
│   ├── fetch-corpora.sh    # Download external test corpora
│   ├── test-corpus.sh      # Run parser against external files
│   ├── check-test-count.sh # Verify corpus test count
│   ├── validate-external.sh# Validate against external corpora
│   └── validate-training.js# Validate against OMG training files
├── docs/
│   ├── parse-coverage.md   # Detailed coverage report and edge cases
│   └── prd-pre-submission.md # Development planning document
├── tree-sitter.json        # Tree-sitter configuration
├── package.json            # Node.js package metadata
├── Cargo.toml              # Rust crate metadata
├── pyproject.toml          # Python package metadata
├── go.mod / go.sum         # Go module metadata
├── CMakeLists.txt          # CMake build system
├── Makefile                # Make build system
├── binding.gyp             # Node.js native addon build
├── Package.swift           # Swift package definition
└── eslint.config.mjs       # ESLint config (tree-sitter conventions)

Development

Prerequisites

  • Node.js 18+
  • tree-sitter CLI: npm install -g tree-sitter-cli

Quick Start

git clone https://gitlab.com/nomograph/tree-sitter-sysml.git
cd tree-sitter-sysml
npm install
npx tree-sitter generate   # ~2 minutes — the grammar is large
npx tree-sitter test        # 192 tests

Parse a File

npx tree-sitter parse examples/vehicle.sysml

Test Against External Corpora

bash scripts/fetch-corpora.sh          # Clone all external repos
bash scripts/test-corpus.sh all        # Parse every .sysml file
bash scripts/test-corpus.sh all --errors-only  # Show only failures

Lint

npx eslint grammar.js

Editor Support

Neovim (nvim-treesitter)

require('nvim-treesitter.parsers').get_parser_configs().sysml = {
  install_info = {
    url = 'https://gitlab.com/nomograph/tree-sitter-sysml',
    files = { 'src/parser.c' },
    branch = 'master',
  },
  filetype = 'sysml',
}

Helix

The grammar can be added to languages.toml once published to the tree-sitter org.

Zed

Tree-sitter grammars in the tree-sitter org are automatically available in Zed.

Intended Use: Rust CLI and MCP Tooling

This grammar was built to power a Rust-based CLI and Model Context Protocol (MCP) server for AI-assisted systems engineering. The intended workflow:

  1. Parse SysML v2 models into concrete syntax trees using tree-sitter-sysml
  2. Extract structured information (definitions, relationships, requirements, constraints) via tree-sitter queries
  3. Serve that information to LLMs through MCP, enabling AI assistants to understand and reason about system models
  4. Generate SysML v2 from natural language descriptions, with the parser validating output

The Rust binding (tree-sitter-sysml crate) is the primary integration point. The grammar's over-accepting nature is actually an advantage here — when AI generates SysML, a lenient parser that produces a usable tree (even for slightly malformed output) is more useful than a strict parser that rejects it entirely.
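Step 4 of the workflow can be pictured as a small generate-and-check loop. The sketch below is an assumption about how such a loop might be wired, not a description of the actual CLI or MCP server: `generate` stands in for an LLM call and `has_parse_errors` for a check built on the parser, and both are hypothetical callables supplied by the caller.

```python
def generate_valid_model(generate, has_parse_errors, max_attempts=3):
    """Request SysML text until it parses cleanly or the attempts run out.

    `generate(feedback)` returns candidate SysML text; `has_parse_errors(text)`
    returns True when the parse tree contains errors. Both are placeholders.
    Returns (text, attempts_used), with text set to None on failure.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        if not has_parse_errors(candidate):
            return candidate, attempt
        feedback = f"attempt {attempt} did not parse cleanly"
    return None, max_attempts


# Usage with stand-in callables: the first candidate is malformed, the second is not.
candidates = iter(["part def {", "part def Engine;"])
text, attempts = generate_valid_model(
    lambda feedback: next(candidates),
    lambda source: source.endswith("{"),
)
print(text, attempts)
```

Because the grammar is lenient, a tool could also choose to keep a candidate whose tree has only minor errors instead of retrying; the strict check here is just the simplest policy to show.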

Contributing

Contributions are welcome. See CONTRIBUTING.md for detailed guidelines.

What We Need Most

More corpus files. The biggest risk to this grammar is constructs we haven't seen. If you have SysML v2 files — from any source — please share them (or point us to public repositories). Files that break the parser are especially valuable.

To test your files against the grammar:

npx tree-sitter parse your-file.sysml

If it produces an ERROR node, please open an issue with the file (or a minimal reproducing snippet).
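When checking many files, scanning the printed tree by hand gets tedious. Here is a minimal sketch that looks for ERROR and MISSING nodes in the S-expression text printed by the command above. The helper names are hypothetical, and `parse_file` assumes `npx` and the tree-sitter CLI are available on the PATH.

```python
import re
import subprocess

# Error and missing nodes appear in the printed tree as "(ERROR ..." and "(MISSING ...".
PROBLEM_PATTERN = re.compile(r"\((ERROR|MISSING)\b")


def has_parse_errors(cli_output):
    """Return True if the printed syntax tree contains an ERROR or MISSING node."""
    return PROBLEM_PATTERN.search(cli_output) is not None


def parse_file(path):
    """Run the CLI on one file and return the tree it prints (assumes npx is installed)."""
    result = subprocess.run(
        ["npx", "tree-sitter", "parse", path],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Example on a captured snippet of output rather than a live CLI run.
sample = "(source_file [0, 0] - [1, 0]\n  (ERROR [0, 0] - [0, 4]))"
print(has_parse_errors(sample))  # prints: True
```

Files for which `has_parse_errors(parse_file(path))` is true are the ones worth attaching to an issue.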

Negative tests. We have 18 tests for syntax that should be rejected. We need more — especially for:

  • Invalid nesting (definitions inside usages, behavioral constructs in structural contexts)
  • Malformed expressions
  • Edge cases around keyword-as-identifier ambiguity

Grammar approach feedback. If you've built tree-sitter grammars for large languages and see a better way to structure ours, we want to hear it. The brute-force empirical approach got us to roughly 98% parse coverage on the external corpora (393 files, 6 of them unparseable), but there may be architectural improvements that would make the grammar more maintainable or more precise.

Query improvements. The highlight, tag, and local queries cover all node types, but the fold and indent queries are minimal. Contributions to improve editor integration are welcome.

Priority Areas

| Area | Impact | Effort |
|---|---|---|
| Corpus contributions | High | Low |
| Negative test cases | High | Low |
| Query improvements (folds, indents) | Medium | Low |
| Specification alignment documentation | Medium | Medium |
| Context-sensitive body rules | High | High |

Known Limitations

  • Over-acceptance: Any member type parses in any body context (see Grammar Approach)
  • 6 unparseable files: 2 intentionally unsupported, 4 regressions from OMG 2026-02 release (see parse-coverage.md)
  • No semantic validation: The parser checks syntax, not type correctness or constraint satisfaction
  • Expression precedence: Approximated with left-association, may differ from spec in edge cases
  • Keyword-as-identifier: Most cases handled, but some ambiguity remains (see parse-coverage.md)

Changelog

See CHANGELOG.md for release history.

License

MIT

Author

Andrew Dunn — Nomograph Labs
