Skip to main content

A minimalistic PEG parser

Project description

Parsik

A simplistic PEG (parsing expression grammar) parser.

You provide a grammar for your language, and this module gives you a Parser that you can use to recognize if documents (strings) match the grammar. You can attach callback functions for when various parts of the grammar match, and to help construct some "output" of the parse (e.g. a parse tree).

The library is simple and unoptimized in time and space. And a grammar with left-recursion could put the parser into an infinite-loop. But the implementation is small and straightforward, and there is extensive logging to help analyze the parser's behaviour.

If you're looking for something even slightly more complex or elegant, I recommend parsy (https://github.com/python-parsy/parsy).

Example

from parsik import Parser, ParseFailure,\
                   Any, Char, EOF, Fail, OneOrMore, Optional, R,\
                   Regex, Sequence, Times, ZeroOrMore, silent

# A grammar for 7- or 10-digit phone numbers.
grammar = {
    'PHONENUM': Sequence(
                    Optional('AREACODE'),
                    'THREE', silent(Char('-')), 'FOUR'
                ),
    'AREACODE': Sequence(
                    silent(Char('(')), 'THREE', silent(Regex(r'\) ')),
                    on_match=lambda x: x[0]
                ),
    'THREE':    Regex(r'\d{3}'),    # three digits
    'FOUR':     Regex(r'\d{4}'),    # four digits
}
parser = Parser(grammar, "Phone numbers")

output = parser.parse("PHONENUM", "123-4567")
assert output == [ '123', '4567']

output = parser.parse("PHONENUM", "(555) 123-4567")
assert output == [ '555', '123', '4567']

try:
    parser.parse("PHONENUM", "123-456")
except ParseFailure:
    pass
else:
    assert False

Matchers

Parsik comes with several "Matchers" that you can use to express your grammar: components like Sequence, Optional, ZeroOrMore, Regex, etc. These Matchers can be composed and can refer to other rules in the grammar; even recursively, so long as they aren't "left-recursive" which could cause an infinite-loop.

The included Matchers are:

  • Char: matches a single character.
  • Regex: matches a regular expression.
  • EOF: matches the end of the input string.
  • R: matches a rule that is defined elsewhere in the grammar.
    • but usually you can just use a string to refer to a rule by name.
  • Optional: optionally matches an item.
  • Any: matches any one of a number of possible items.
  • Sequence: matches a sequence of items in a specific order.
  • Times: matches an item a specific number of times, or range of times.
  • ZeroOrMore: matches an item zero or more times.
  • OneOrMore: matches an item one or more times.
  • Fail: if an item matches, then abort the parse.

You can also make your own types of Matchers.

There isn't any special syntax or operator-overloading for expressing the grammar. It's just a dictionary where the keys are rule-names, and the values are Matchers. Some Matchers, like Sequence, are composed of other matchers or rules. They just take the list of child matchers in their constructor. You can also refer to other grammar rules by name, even if it forms a cycle.

Logging

Parsik uses python's standard logging. If you enable it at DEBUG level:

logging.basicConfig(level=logging.DEBUG, format='%(message)s')

then you'll get an elaborate analysis of any parse attempt. For example:

grammar = {
    'MAIN': Sequence('FIRST', 'SECOND', 'THIRD'),
    'FIRST': ZeroOrMore(Regex(r'ab')),
    'SECOND': Any(Char('x'), Char('y')),
    'THIRD': 'FIRST',
}
p = Parser(grammar, "Quick example")
p.parse("MAIN", "ababy")

This will log:

Parsing Quick example {MAIN} against input string 'ababy'.

pos     input      output              ✓|✕ rule            rule details
------- ---------- ------------------- --- --------------- -------------------------
0       'a'                                MAIN            [[{FIRST},{SECOND},{THIRD}]]
0       'a'                                                  {FIRST}
0       'a'                                FIRST               *(r'ab')
0       'a'                                                      r'ab'
0-2     'ab'       'ab'                 ✓                        r'ab'
2       'a'                                                      r'ab'
2-4     'ab'       'ab'                 ✓                        r'ab'
4       'y'                                                      r'ab'
4       ''                              ✕                        r'ab'
0-4     'abab'     '[ab, ab]'           ✓  FIRST               *(r'ab')
0-4     'abab'     '[ab, ab]'           ✓                    {FIRST}
4       'y'                                                  {SECOND}
4       'y'                                SECOND              any(['x', 'y'])
4       'y'                                                      'x'
4       ''                              ✕                        'x'
4       'y'                                                      'y'
4-$     'y'        'y'                  ✓                        'y'
4-$     'y'        'y'                  ✓  SECOND              any(['x', 'y'])
4-$     'y'        'y'                  ✓                    {SECOND}
5       '$'                                                  {THIRD}
5       '$'                                THIRD               {FIRST}
5       '$'                                FIRST                 *(r'ab')
5       '$'                                                        r'ab'
5-$     '$'                             ✕                          r'ab'
5-$     '$'        '[]'                 ✓  FIRST                 *(r'ab')
5-$     '$'        '[]'                 ✓  THIRD               {FIRST}
5-$     '$'        '[]'                 ✓                    {THIRD}
0-$     'a'        '[[ab, ab], y, []]'  ✓  MAIN            [[{FIRST},{SECOND},{THIRD}]]
Parse success: '[[ab, ab], y, []]'.

Each row shows a step in the recursive-descent parse attempt. A Matcher will log a line immediately before and immediately after it attempts to match. So, one line where the ✓|✕ column is blank, and another where it has the result.

The columns are:

  • pos: The position(s) (character #s) in the input document under view.
    • $ means it has reached the end.
  • input: The text of the input document at those positions.
  • output: If a matcher was successful, this is its output.
  • ✓|✕:
    • blank means that the matcher is going to attempt a match.
    • ✓ means that the match was successful.
    • ✕ means that the match failed.
  • rule: The name of the rule, if the matcher is a named rule in the grammar.
  • rule details: A concise representation of what the matcher will attempt to recognize. They display as:
    • Char: 'x'
    • Regex: r'pattern'
    • EOF: EOF
    • R: {rule name}
    • Optional: ?(...)
    • Any: any(...)
    • Sequence: [...]
    • Times: X{min,max}(...)
    • ZeroOrMore: *(...)
    • OneOrMore: +(...)
    • Fail: Fail(...)

License

This package is licensed under the GNU GPL v3.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

parsik-0.9.3.tar.gz (14.6 kB view details)

Uploaded Source

Built Distribution

parsik-0.9.3-py3-none-any.whl (30.9 kB view details)

Uploaded Python 3

File details

Details for the file parsik-0.9.3.tar.gz.

File metadata

  • Download URL: parsik-0.9.3.tar.gz
  • Upload date:
  • Size: 14.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.5.3

File hashes

Hashes for parsik-0.9.3.tar.gz
Algorithm Hash digest
SHA256 c43829bd01bacd1f74d57dca3e01cc8ffe7f05e2e5712b24cc9f1afbead562b7
MD5 8b1a211098000be218361e1c3dabf70b
BLAKE2b-256 2d1695324b36249cd946dd33e23130a6e6ba4c95e495e22e38baa2ee7c635b0f

See more details on using hashes here.

File details

Details for the file parsik-0.9.3-py3-none-any.whl.

File metadata

  • Download URL: parsik-0.9.3-py3-none-any.whl
  • Upload date:
  • Size: 30.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.5.3

File hashes

Hashes for parsik-0.9.3-py3-none-any.whl
Algorithm Hash digest
SHA256 a93ab56fd2b7944bb53326ac5451d6672ed5084f9600f0053d61d68f5dfd3755
MD5 a604ec0bf958b56fc0065fa36207a227
BLAKE2b-256 c889eb4a15777cfcced325297a1999beff8def4f87987ad01bff6b774a70de97

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page