Skip to main content

Parser combinator library with a BNF-esque interface

Project description

Comber is a parser combinator library for creating text parsers in plain Python with a BNF flavor.

For instance, we could define a simple grammar for calling functions on integer values like this:

from comber import C, rs, inf

keyword = rs(r'[_a-zA-Z][_a-zA-Z0-9]*')@('keyword')
number = rs(r'[0-9]+')@('number', int)
package = keyword[1, inf, ',']
import_statement = C+ 'import' + package
value = keyword | number
function_call = keyword + '(' + value**',' + ')'
assignment = C+ 'let' + keyword + '=' + (function_call | value)
grammar = (import_statement | assignment | function_call)[0, inf]

Our toy grammar can handle simple import statements, variable assignment, and function calling. For example:

import math
import sys.io

let user_number = read()
let double = multiple(user_number, 2)

print(double)

To use our parser, we simply pass a string to it:

from comber import ParseError

code = "add(17, 3)"
try:
    parseState = grammar(code)
    print('Parsed tokens: %s' % parseState.tree)
except ParseError as ex:
    print(ex)

The grammar call returns the final parser state on a successful parse, from which you can retrieve the final parse tree. In our case, it is effectively a list:

['add', '(', 17, 3, ')']

Our grammar only supports variables and integers. Suppose we tried to assign pythong-style string to a variable:

let text = "Here's some text"

If we tried to parse this, we’d get an exception like:

1:12: Unexpected Text: "Here's som. Expected one of: @keyword, @number

Installing

You can get the latest release from pip:

$ pip install comber

or from the source:

$ pip install .

Documentation

Building Parsers

The most basic parsers are string literals (which match exactly), rs for matching a regular expression, and cs for matching any of the strings of an iterable.

The rs parser can be seen in our toy grammar above. Standard Python regular expressions are supported. The entire match is used as the result of the parse, regardless of groupings etc.

The cs parse takes any iterable of strings. The first string that exactly matches (the start of) the input text. If none of the strings match, the parse fails. Since strings are themselves iterables, we could add addition and subtraction to our toy grammar like this:

from comber import C, cs

addition = value + cs('+-') + value
assignment = C+ 'let' + keyword + '=' + (addition | function_call | value)

String literals can only be used in combination with other parsers. If a string literal would start a sequence, or otherwise appear alone, make use of the C parser.

The C Parser

The C parser, on its own, consumes no text, produces no tokens, and always succeeds. It’s most useful for starting a parser that would otherwise begin with a string literal. E.g. this:

'let' + keyword + '=' + (function_call | value)

would actually throw a Python error because ‘let’ isn’t really a parser - yet! That’s where C comes in:

C+ 'let' + keyword + '=' + (function_call | value)

C starts off the sequence, so we can use any combination of parsers and string literals we like from there. It works similarly with alternatives, so if we wanted to allow set to be used as a synonym for let, we might do:

(C| 'let'|'set') + keyword + '=' + (function_call | value)

C can also be used to wrap a parser to protect it from optimization; for instance, embedding one sequence or alternative set inside another. If, for instance, we extended our grammar to allow a bare value to be a whole statement:

value = (keyword | number)@'value'
grammar = (import_statement | assignment | function_call | value)[0, inf]

Basic Combinators

Parsers can be combined in series with +:

name + address + pet

A sequence of parsers is evaluated left to right, each consuming text before the next is evaluated. If at any point in the sequence a parser fails, the entire sequence fails.

A set of alternatives is built with |:

name | idnumber | location

Alternatives are considered left to right, with the first successful match being the match for the entire set. Be careful! This means that for some sets of alternatives, the “obvious” parser may not be the one used, simply because it came after another match.

Both sequences and alternatives will flatten like combinators, such that:

name = firstname + lastname
salutation = C+ 'Hello' + name + '!'

is equivalent to:

salutation = C+ 'Hello' + firstname + lastname + '!'

If you need to mantain the logical separation (to parse correctly, or maintain the name of a subparser), wrap the subparser with C:

name = C(firstname + lastname)
salutation = C+ 'Hello' + name + '!'

Repetition

The most flexible option for specifying repetition is brackets:

keyword[0, 10, ',']

The above would parse keyword zero to ten times, separated by a comma. The separator is optional - without it, the result would simply parse keyword zero to ten times.

We could also specify parsing an exact number of times:

keyword[10]

Or, with a separator:

keyword[10, None, ',']

Infinity - math.inf - is a valid maximum value. For convenience, it can be imported directly from Comber:

from comber import inf

param_list = keyword[0, inf, ',']

There are several convenience combinators for common types of repetition.

For zero or more with a separator, using *:

parser*','

Or zero or more without a separator, using the unary +:

+parser

You can declare a parser as optional with ~:

~parser

Recursive Grammars

Consider a program in our toy grammar:

let word = exp(2, 16)
let maxint = minus(word, 1)

It’d be nice to simplify this by eliminating the variable “word” and pass the exp() call directly to minus(), but to allow that, we need to extend our grammar to consider a function call to be a value so we can use one as a function argument. But to do that, we’d need to use function_call in our definition of value - but function_call likewise needs to reference value.

To solve this issue, we can define one of them as defer.

from comber import C, rs, inf, defer

...

value = defer()
function_call = keyword + '(' + value**',' + ')'
value.fill(keyword | number | function_call)

This way, we can refer to value wherever we want, and only define its meaning when we’re ready. We can safely build fairly complex grammars this way, but be wary of performance.

Controlling Space

Before each subparser is run, Comber consumes leading whitespace. By default, this is spaces, tabs, and newlines. You can change this in two ways: set the whitespace property on the outermost parser to a string containing the new whitespace characters, or likewise passing a string as the second argument to the parser. In either case, the value None will disable the feature altogether.

Building Parse Trees

When you call a parser on a some text, they return a State object containing the resultant parse tree (State.tree).

By default, parsers output a “flat” tree - a list of strings parsed by the “leaf” parsers (i.e. string literals, rs, and cs).

To build a more useful parse tree, you have to provide emitters. Our toy grammar contains a simple one that converts integers in the input to Python int values:

number = rs(r'[0-9]+')@('number', int)

So if we ran our parser on the input “let foo = 5” the resulting state’s tree property would be ["let", "foo", "=", 5]. But it’d be more useful if it resulted in some kind of “let” object (that could execute the assignment, or be fed to a VM, or whatever else). We could define one we can use as a emitter for our let statement like this:

class Let:
    def __init__(self, let, variable, eq, value):
        self.variable = variable
        self.value = value
        # We can ignore `let` and `eq` (which will always contain "let" and "=", respectively).

    def __repr__(self):
        return f'Let({self.variable}, {self.value})'

and redefine the assignment rule like:

assignment = (C+ 'let' + keyword + '=' + (function_call | value))@Let

Now if rerun the parser on “let foo = 5”, we get [Let(foo, 5)].

As we did with number, you can also combine an emitter with a name, to improve error messages:

assignment = (C+ 'let' + keyword + '=' + (function_call | value))@("assignment", Let)

Performance

Under the covers, Comber is essentially a recursive descent parser. It’s best suited for relatively shallow grammars parsing small amounts of text.

You can gain a bit more speed by calling analyze() on the top-level/outermost parser. This modifies the parser (and subparsers) in place:

grammar.analyze()
grammar('max(1, 4)')

TODO

Available Operators

Operators that Python allows to overridden

Operator

Method

Current use

+

__add__

sequences

|

__or__

selection

[ ]

__getitem__

repeat

@

__matmul__

names and internalization

<

__lt__

>

__gt__

<=

__le__

>=

__ge__

==

__eq__

!=

__ne__

is

_is

is not

is_not

-

__sub__

%

__mod__

*

__mul__

zero or more, with provided separator

**

__pow__

/

__truediv__

//

__floordiv__

&

__and__

^

__xor__

<<

__lshift__

>>

__rshift__

in

__contains__

Unary operators:

Operator

Method

Current use

~

__invert__

optional

not

__not__

-

__neg__

+

__pos__

zero or more

And:

()        __call__   parse a string

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

comber-1.1.1.tar.gz (28.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

comber-1.1.1-py3-none-any.whl (22.2 kB view details)

Uploaded Python 3

File details

Details for the file comber-1.1.1.tar.gz.

File metadata

  • Download URL: comber-1.1.1.tar.gz
  • Upload date:
  • Size: 28.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for comber-1.1.1.tar.gz
Algorithm Hash digest
SHA256 39a18d5de04487b1e353e23a6a6d45b6a67ba03daf4900e9364485fcc4729af3
MD5 3d9e9437a92b65bb545840d89e5f8dd3
BLAKE2b-256 79b66535af895f1ee563fe0645921c64bf17c6f7b358ed261d382710cf761ce3

See more details on using hashes here.

Provenance

The following attestation bundles were made for comber-1.1.1.tar.gz:

Publisher: python-publish.yml on faboo/comber

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file comber-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: comber-1.1.1-py3-none-any.whl
  • Upload date:
  • Size: 22.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for comber-1.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8a7f1f695cc42653382a6f35f258de6f7df7f12537fa07cc830d050f468f7aea
MD5 f8de069c5fb7d2f124b555bd8fe52c06
BLAKE2b-256 33333e5f56183f99b0818042e081cb33d3b96bdf2f7a9708ad416806d9b6c646

See more details on using hashes here.

Provenance

The following attestation bundles were made for comber-1.1.1-py3-none-any.whl:

Publisher: python-publish.yml on faboo/comber

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page