Skip to main content

Simple LR parser generator and parser for easy integration

Project description

Parsion - A simple LR(1) parser

Parsion is intended to be a simple drop-in parser generator, lexer and parser for grammars in python.

It's main intention is to make it possible to load and create a grammar for simple expression parsers without using code generation.

Installation

parsion is available on pypi

To install:

pip install parsion

Usage

To define a language for parsing, three components are required:

  • Lexer rules – Specify how to tokenize the input by splitting it into meaningful units (tokens).
  • Grammar – Defined as Python objects, similar in concept to Backus-Naur Form (BNF).
  • Reducers/Handlers – Functions that handle reductions, with one handler per grammar rule.

A language is defined as a Python class containing these rules.

The lexer rules are implemented as a series of regular expressions that consume the input string from the beginning. The first regex that matches the start of the string determines the next token, which is then extracted from the input before the process repeats. Each lexer rule also contains a method to postprocess the input. For example convert an integer to int, or unpack a quoted string.

The parser rules is defined as a list of rules, where each rule has three parts:

  • A handler to be called when the rule is reduced
  • The symbol that will be the result of the rule
  • A sequence of symbols that is the input to the rule

The sequence of symbols is a space seperated string, for readability. If a symbol starts with _, the _ is dropped, and the value of the corresponding token will not be passed to the handler.

A rule specifying the handler None must have exactly one symbol without _, and the input value will be passed directly as result. It is useful for defining rules to set operator precedence.

Parsing begins from the grammar rule that produces the entry token.

To parse an input string, call the parse(input) method. It will apply the defined handlers and return the result of the entry node.

If an error occurs during parsing, an exception is raised.

Example

For a bigger example, see: example.py

from parsion import Parsion

class ExprLang(Parsion):
    LEXER_RULES = [
        (None,       r'(\s+)',                   lambda x: None),
        ('INT',      r'([0-9]+|0x[0-9a-fA-F]+)', lambda x: int(x, base=0)),

        ('+',        r'(\+)',                    lambda x: None),
        ('-',        r'(-)',                     lambda x: None),
        ('*',        r'(\*)',                    lambda x: None),
        ('/',        r'(\/)',                    lambda x: None),

        ('(',        r'([\(])',                  lambda x: None),
        (')',        r'([\)])',                  lambda x: None)
    ]
    GRAMMAR_RULES = [
        ('entry',       'entry',        'expr'),
        (None,          'expr',         'expr1'),
        ('expr_add',    'expr1',        'expr1 _+ expr2'),
        ('expr_sub',    'expr1',        'expr1 _- expr2'),
        (None,          'expr1',        'expr2'),
        ('expr_mult',   'expr2',        'expr2 _* expr3'),
        ('expr_div',    'expr2',        'expr2 _/ expr3'),
        (None,          'expr2',        'expr3'),
        ('expr_neg',    'expr3',        '_- expr4'),
        (None,          'expr3',        'expr4'),
        ('expr_int',    'expr4',        'INT'),
        (None,          'expr4',        '_( expr _)'),
    ]
    def expr_add(self, lhs, rhs):
        return lhs + rhs

    def expr_sub(self, lhs, rhs):
        return lhs - rhs

    def expr_mult(self, lhs, rhs):
        return lhs * rhs

    def expr_div(self, lhs, rhs):
        return lhs // rhs

    def expr_neg(self, v):
        return -v

    def expr_int(self, v):
        return v

if __name__ == '__main__':
    # Generate the parser, and parsing tables
    expr_lang = ExprLang()

    # Run the parser
    print( expr_lang.parse('12 * (3 + 7)' )) # prints "120"

    # Run the parser again. Same table generation is reused.
    print( expr_lang.parse('2 + -4 * -10' )) # prints "42"

Error recovery

To isolate error handling, an error rule can be created.

Given the grammar:

from parsion import Parsion


class ExprErrorLang(Parsion):
    LEXER_RULES = [
        (None,       r'(\s+)', lambda x: None),
        ('INT',      r'([0-9]+|0x[0-9a-fA-F]+)', lambda x: int(x, base=0)),

        ('+',        r'(\+)', lambda x: None),
        ('-',        r'(-)', lambda x: None),
        ('*',        r'(\*)', lambda x: None),
        ('/',        r'(\/)', lambda x: None),

        ('(',        r'([\(])', lambda x: None),
        (')',        r'([\)])', lambda x: None),
        (';',        r'(;)', lambda x: None)
    ]
    GRAMMAR_RULES = [
        ('entry',       'entry',        'stmts'),
        ('stmts_list',  'stmts',        'stmt _; stmts'),
        ('stmts_tail',  'stmts',        'stmt'),

        # proxy statement, to be able to isolate errors to top level
        (None,          'stmt',         'expr'),
        ('stmt_error',  'stmt',         '$ERROR'),

        (None,          'expr',         'expr1'),
        ('expr_add',    'expr1',        'expr1 _+ expr2'),
        ('expr_sub',    'expr1',        'expr1 _- expr2'),
        (None,          'expr1',        'expr2'),
        ('expr_mult',   'expr2',        'expr2 _* expr3'),
        ('expr_div',    'expr2',        'expr2 _/ expr3'),
        (None,          'expr2',        'expr3'),
        ('expr_neg',    'expr3',        '_- expr4'),
        (None,          'expr3',        'expr4'),
        ('expr_int',    'expr4',        'INT'),
        (None,          'expr4',        '_( expr _)'),
    ]

    def stmts_list(self, expr, list):
        return [expr] + list

    def stmts_tail(self, expr):
        return [expr]

    def expr_add(self, lhs, rhs):
        return lhs + rhs

    def expr_sub(self, lhs, rhs):
        return lhs - rhs

    def expr_mult(self, lhs, rhs):
        return lhs * rhs

    def expr_div(self, lhs, rhs):
        return lhs // rhs

    def expr_neg(self, v):
        return -v

    def expr_int(self, v):
        return v

    def stmt_error(self, gen, start, pos, end, expect):
        return None

A valid input would then be

1 + 3 + 5 + 7; 12*13*14; 313-13

Resulting in a list:

[16, 2184, 300]

The language contains three separate statements. To isolate errors within a single statement, the special rule generated by $ERROR will be used.

Upon a parse error, all tokens will be accumulated up until a possible token following the closest $ERROR rule. Then the error handler will be called, to return a placeholder for the generation, in this case stmt, and parsing will continue.

If no error handler are defined, or error happens outside of defined error handlers, an exception will be raised instead, and parsing will fail.

Given the above grammar and error handler, the following input:

An input with an error in the middle stmt:

1 + 3 + 5 + 7; 12+*13*14; 313-13

Will then result in:

[16, None, 300]

None is there the result of the error handler.

Default Error isolation

Without specifying an error handler, an exception will be raised of type ParsionParseError.

The exception contains the field pos containing the location in the input string of the erroneous token. It also contains start and end, which includes the scope of where the error occurs.

It is possible in the example above to instead specify the error handler default_error instead of stmt_error. In that case, an exception will still be raised, but the start and end fields will better contain the erroneous parse tree.

Precalculated tables

For bigger languages, it may be motivated to actually precalculate the parse tables upon packaging. For that purpose, there is a class ParsionStatic which doesn't invoke the parser generation, but takes the raw parse tables as input.

The interface for that method is to be defined and documented. Open an issue if interested in that feature.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

parsion-0.0.6.tar.gz (15.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

parsion-0.0.6-py3-none-any.whl (14.0 kB view details)

Uploaded Python 3

File details

Details for the file parsion-0.0.6.tar.gz.

File metadata

  • Download URL: parsion-0.0.6.tar.gz
  • Upload date:
  • Size: 15.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for parsion-0.0.6.tar.gz
Algorithm Hash digest
SHA256 874579340165d7d6ad34aa2ed5b6be4c574cb7fd5f77edaf81ad6a8c90dad41f
MD5 a61aef65feb05a6e5b7af75891133531
BLAKE2b-256 ed13be3b85ea67d57c4c102183e6e61d333d6d8fd8197f93e5a554aad905807e

See more details on using hashes here.

Provenance

The following attestation bundles were made for parsion-0.0.6.tar.gz:

Publisher: python-publish.yml on pengi/parsion

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file parsion-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: parsion-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 14.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for parsion-0.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 a5c82ae4e61c1c269b1997fac2eb1194035ee4a2742500c6c6e67b642c97681a
MD5 6877991923dcd7a7f9ed89cc450e9f7d
BLAKE2b-256 6cf91f8b23700ac11d0d12862eb244a153508121d81471362dc10e742d31243c

See more details on using hashes here.

Provenance

The following attestation bundles were made for parsion-0.0.6-py3-none-any.whl:

Publisher: python-publish.yml on pengi/parsion

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page