Skip to main content

Generate a Sublime Text syntax definition for a context-free grammar

Project description

Context-free grammar to Sublime-syntax file

This project produces sublime-syntax highlighting files from a description of a context-free grammar.

It implements a "Generalised Recursive Descent Parser" as described in Generalised recursive descent parsing and follow-determinism by Adrian Johnstone and Elizabeth Scott. It's essentially a non-deterministic LL(1) parser. If the grammar happens to be LL(1) then no backtracking will happen and it's just an LL(1) parser. If the grammar is not LL(1), then alternatives will be tried in sequence, backtracking until one succeeds.

IMPORTANT: The grammar must be non-left-recursive, and also must be follow-determined. If the grammar is left-recursive then the program will complain and alert the user, but I don't know an algorithm to detect whether the grammar is follow-determined. IF THE GRAMMAR IS NOT FOLLOW-DETERMINED, THEN THE LANGUAGE RECOGNIZED BY THE GENERATED SYNTAX WILL SIMPLY NOT MATCH THE INPUT GRAMMAR.

A grammar is follow-determined if whenever a nonterminal X produces both <string> and <string> y <...>, then y is not in the follow set of X. Intuitively, a grammar is follow-determined whenever a single lookahead token is enough to tell us whether it's okay to pop out of the context for X or if we should keep going within X. (i.e. if we've just consumed the prefix <string> and need to decide whether to finish with X or to keep consuming, then the presence or absence of the next token in the follow set of X had better be enough to tell us which option to take, because once we pop out of X then we can't backtrack to try other options anymore.)

Implementation

Sublime syntax files allow one to define contexts; within each context one can match against any number of regular expressions (including lookaheads) and then perform actions like pushing other contexts onto the context stack, pop out of the context, set the scope of the consumed tokens (i.e. instruct Sublime Text that a token is e.g. a function definition and highlight it appropriately), and others. One can also set a branch point and try multiple branches in sequence; if an action taken is to fail that branch point, then the syntax engine backtracks and tries the next branch in the sequence.

See the Wikipedia page on LL parsers for more details on how LL parsers work in general. What I do here is always indicate "success" by pop: 2; i.e. popping twice out of the current context, and failure by pop: 1. Contexts for a given production are pushed onto the stack interleaved by a pop2! context which always pops 2 contexts off the stack. Therefore a failure, which pops once, moves into the "always pop 2" stream until it hits a failure context (to backtrack and try a different branch) or pops all the way out of the current stack.

Related

This project shares the goal of automatically generating a Sublime-syntax file with Benjamin Schaaf's sbnf project. While I started working on this idea before learning about the existence of sbnf, I took a lot of inspiration from that project. In particular the idea of using the extended BNF syntax (allowing *, ?, parenthesized expressions) and passive expressions. More generally, I'm using the exact same .sbnf file format as my input. The implementations here are all my own.

Differences between this project and sbnf

One of sbnf's goals is to "Compile to an efficient syntax, comparable to hand-made ones". That is not explicitly a goal of this project. There may be some small differences in implementation; I list some examples below.

main is not implicitly repeated

In sbnf, the rule main : 'a' ; will match any number of repeated a characters. In sublime-from-cfg, only one a will be matched. The parser will mark an invalid parse, and then reset at the beginning of the next new line (so that the whole rest of the document is not marked invalid).

<> represents an empty production

While sublime-from-cfg automatically rewrites rules involving ? (optional) and * (repetition), you can take extra control of the rule-rewriting by explicitly indicating an empty production via <>. For example, the following rewrite is done automatically:

- a  : b c* d ;
+ a  : b a' ;
+ a' : d
+    | c d ;

But if you prefer a different rewrite, you can explicitly write something like:

a     : b c-rep d ;
c-rep : <>
      | c c-rep ;

Setting sort precedence

Sometimes a regular expression can match more than one part of a language, where for example a generic expression to match any identifier, like [a-zA-Z][_a-zA-Z0-9]*, will also match a reserved word like import. To make sure that the reserved word is always tried first, you can add a "sort" option:

IDENTIFIER = '[a-zA-Z][_a-zA-Z0-9]*'
statement : IDENTIFIER{entity.name, sort: 1} `=` '\d+' `;`
          | 'import'{keyword.operator} IDENTIFIER `;`
          ;

At the moment, the same regular expression can only have one sort value across the whole file (defining it twice will pick one arbitrarily). The default value is 0, and smaller values are tried before larger values. In the example above, 'import' has lower precedence (the default value 0) than IDENTIFIER (value 1), so the syntax engine will try to match 'import' first.

String vs Rule parameters

String parameters and arguments must be specified in ALL_CAPS:

- foobar[scope-1, scope-2] : `foo`{#[scope-1]} `bar`{#[scope-2]} ;
+ foobar[SCOPE_1, SCOPE_2] : `foo`{#[SCOPE_1]} `bar`{#[SCOPE_2]} ;

Rule parameters and arguments aren't supported yet (except that a rule argument to an %include is okay).

TO-DO:

  • Self-host. Accept a convenient text description of a grammar rather than require constructing a Python object by hand. Benjamin Schaaf's sbnf is a project with essentially the same goals as this one, and has a very nice syntax for defining grammars so it'd be nice to allow inputs in that format.
    • Parse the sbnf file format
    • Handle variables and rules
    • Handle parameters
    • Handle *, ?, ~, and ( ... )
    • Handle options for scope, meta_scope
    • Handle options for include-prototype, captures
    • Handle prototype
    • Handle %embed and %include
    • Handle global parameters
    • Handle command-line parameters
  • Detect whether the input grammar is follow-determined. This may be undecidable for all I know.
  • Automatically rewrite rules involving left recursion

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sublime-from-cfg-0.2.0.tar.gz (19.7 kB view details)

Uploaded Source

Built Distribution

sublime_from_cfg-0.2.0-py2.py3-none-any.whl (18.8 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file sublime-from-cfg-0.2.0.tar.gz.

File metadata

  • Download URL: sublime-from-cfg-0.2.0.tar.gz
  • Upload date:
  • Size: 19.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.5

File hashes

Hashes for sublime-from-cfg-0.2.0.tar.gz
Algorithm Hash digest
SHA256 d197df022d092b808266cdc486eeec41ddfa5055538cea4b05758b3f03974294
MD5 c6df437c777a01f3261badc187f15a70
BLAKE2b-256 745b5ce3d1b00cb953c1c14d9bf2628eaf0deaa13c9dce0e14fb5cbcc22e4374

See more details on using hashes here.

File details

Details for the file sublime_from_cfg-0.2.0-py2.py3-none-any.whl.

File metadata

  • Download URL: sublime_from_cfg-0.2.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 18.8 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.5

File hashes

Hashes for sublime_from_cfg-0.2.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 9ec2bd13f414c2b31beaa8e69906c3c8254e564645b4d771c1093086cfb225f7
MD5 7ea202a560b458fca268df45af175636
BLAKE2b-256 84f0ce494930ab2a68fce615404ed07b390ce016e7e5ff43f596d94db3f5bc6f

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page