Python library for building Parsers and Lexers Easily

parsergen

A simple library for creating parsers and lexers.

Quickstart

pip install parsergen

Defining a Lexer

Each token is defined by one or more regular expressions. A token can also have a modifier function that post-processes the match; for example, the INT token below converts its matched text into an int.

from parsergen import *
class CalcLexer(Lexer):
    
    @token(r"0x[0-9a-fA-F]+", r"[0-9]+")
    def INT(self, t):
        if t.value.startswith("0x"):
            t.value = int(t.value[2:], base=16)
        else:
            t.value = int(t.value)
        return t

    ADD    =  r"\+"
    SUB    =  r"\-"
    POW    =  r"\*\*" # must be first, as is longer than 'MUL' token!
    MUL    =  r"\*"
    DIV    =  r"\/"
    SET    =  r"set"
    TO     =  r"to"
    ID     =  r"[A-Za-z_]+"
    LPAREN =  r"\("
    RPAREN =  r"\)"
    
    ignore = " \t"
    ignore_comment = r"\#.*"
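
The ordering comment on POW matters whenever a lexer tries its patterns in order and takes the first one that matches. The sketch below illustrates that effect with the standard library `re` module only; it does not use parsergen, and the helper name `tokenize` is hypothetical. It assumes first-match semantics, as the comment above implies.

```python
import re

def tokenize(text, rules):
    """Split text into (name, lexeme) pairs, trying rules in the order given."""
    # One alternation of named groups; `re` tries the alternatives left to right.
    pattern = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in rules))
    tokens = []
    pos = 0
    while pos < len(text):
        match = pattern.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

# POW listed before MUL: "**" is recognised as a single token.
pow_first = tokenize("**", [("POW", r"\*\*"), ("MUL", r"\*")])

# MUL listed first: the shorter pattern wins and "**" splits into two MUL tokens.
mul_first = tokenize("**", [("MUL", r"\*"), ("POW", r"\*\*")])
```

With POW first the input yields one POW token; with MUL first it yields two MUL tokens, which is why the longer pattern has to be declared earlier.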

Creating a Parser

Grammar Expressions

Grammar expressions describe the syntax that can be parsed. Our example is a basic calculator: it gives you a prompt where you can type math expressions.

> 2 + 3 * 4
14
> (2 + 3) * 4
20
> 2 ** 2 ** 3
256
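
Python's own operators follow the same precedence and associativity as the session above, so the expected results can be checked directly. This is plain Python rather than parsergen, and the variable names are only illustrative.

```python
# Precedence: multiplication binds tighter than addition.
assert 2 + 3 * 4 == 14
assert (2 + 3) * 4 == 20

# Associativity: ** groups from the right, so 2 ** 2 ** 3 means 2 ** (2 ** 3).
right_grouped = 2 ** (2 ** 3)   # 2 ** 8
left_grouped = (2 ** 2) ** 3    # 4 ** 3

assert 2 ** 2 ** 3 == right_grouped
```

Grouping from the left would give a different answer, so the grammar has to encode which way each operator groups.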

The precedence of the arithmetic operators must be correct, so we have to account for it when designing the grammar rules. Here is the grammar:

statement       :  assign | expr
assign          :  SET ID TO expr
expr            :  prec3
prec3           :  prec2 (ADD | SUB prec2)*
prec2           :  prec1 (MUL | DIV prec1)*
prec1           :  factor (POW prec1)?
factor          :  INT | ID
factor          :  LPAREN expr RPAREN

The rules prec3 and prec2 are left associative, whereas prec1 is right associative because it implements the power operator (**). We can then define our parser.

class CalcParser(Parser):

    tokens = CalcLexer.tokens
    starting_point = "statement"

    def __init__(self):
        self.names = {}

    @grammar("assign | expr")
    def statement(self, p):
        print(p[0])
    
    @grammar("SET ID TO expr")
    def assign(self, p):
        self.names[p[1]] = p[3]
    
    @grammar("prec3")
    def expr(self, p):
        return p[0]
    
    @grammar("prec2 (ADD | SUB prec2)*") # left associative
    def prec3(self, p):
        r = p[0]
        for op, num in p[1]:
            if op == "+":
                r += num
            else:
                r -= num
        return r
    
    @grammar("prec1 (MUL | DIV prec1)*") # left associative
    def prec2(self, p):
        r = p[0]
        for op, num in p[1]:
            if op == "*":
                r *= num
            else:
                r /= num
        return r
    
    @grammar("factor (POW prec1)?") # right associative
    def prec1(self, p):
        if p[1]:
            return p[0] ** p[1][1]
        return p[0]
    
    @grammar("INT")
    def factor(self, p):
        return p[0]
    
    @grammar("ID")
    def factor(self, p):
        try:
            return self.names[p[0]]
        except KeyError:
            raise Exception(f"variable '{p[0]}' is not defined.")

    @grammar("LPAREN expr RPAREN")
    def factor(self, p):
        return p[1]

# We can then create a simple runtime loop
l = CalcLexer()
p = CalcParser()

while True:
    s = input("> ")
    try:
        l_result = l.lex_string(s)  # lex once and reuse the result
        p.parse(l_result)
    except Exception as e:
        print(e)

Handling Newlines

By default, the Lexer knows nothing about line numbers; you have to tell it how to track them.

class MyLexer(Lexer):
    @token(r"\n+")
    def NEWLINE(self, t):
        self.lineno += len(t.value)
        self.column = 0
        return t
    ...
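
To illustrate the bookkeeping that the NEWLINE rule performs, here is a library-independent sketch. The function `position_after` is hypothetical, and starting the count at line 1 is an assumption of this sketch, not a statement about parsergen's defaults. It mirrors the two updates above: advance the line counter for each newline and reset the column.

```python
def position_after(text, lineno=1, column=0):
    """Return the (lineno, column) reached after scanning all of text."""
    for ch in text:
        if ch == "\n":
            lineno += 1   # one more line consumed
            column = 0    # column restarts at the beginning of the new line
        else:
            column += 1
    return lineno, column

# Two consecutive newlines advance the line counter by two.
end_of_input = position_after("set x to 1\n\nx + 2")
```

Here `end_of_input` is `(3, 5)`: the scan ends on the third line, five characters in.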

Overcoming issues with left recursion

Recursive descent parsers are unable to handle direct or indirect left recursion. This is an issue when writing expressions for left associative operators. The following example is directly left recursive:

expr  :  expr PLUS term

When the parser attempts to process this rule, it falls into an infinite loop. There are different ways to solve this problem; my solution is below:

expr  :  term (PLUS term)*

The disadvantage is that some processing is then required after the pattern matching to reach the originally desired structure or action.

@grammar("term (PLUS term)*")
def expr(self, p):
    rv = p[0]
    for op, term in p[1]:
        rv += term
    
    return rv

See here for more details.
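
For comparison, here is how the same rewrite looks in a hand-written recursive-descent function that does not depend on parsergen. The names `parse_term` and `parse_expr`, the token-list format, and the extra MINUS token are all assumptions made for this sketch. A direct translation of `expr : expr PLUS term` would call `parse_expr` as its very first action without consuming any input, and so recurse forever; the loop form consumes a term first.

```python
def parse_term(tokens, pos):
    """term : INT. Return (value, next position)."""
    if pos >= len(tokens) or tokens[pos][0] != "INT":
        raise SyntaxError(f"expected INT at position {pos}")
    return tokens[pos][1], pos + 1

def parse_expr(tokens, pos=0):
    """expr : term ((PLUS | MINUS) term)*, folded from left to right."""
    result, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos][0] in ("PLUS", "MINUS"):
        op = tokens[pos][0]
        rhs, pos = parse_term(tokens, pos + 1)
        # Combining as we go yields left associativity: (a - b) - c.
        result = result + rhs if op == "PLUS" else result - rhs
    return result, pos

# 10 - 3 - 2 must evaluate as (10 - 3) - 2, not 10 - (3 - 2).
tokens = [("INT", 10), ("MINUS", None), ("INT", 3), ("MINUS", None), ("INT", 2)]
value, end = parse_expr(tokens)
```

Subtraction is used here because, unlike addition, it gives a different result if the terms are grouped from the right.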

Writing expressions for right-associative operators

Some operators are right associative, for example the ** operator. Right recursion does not cause the same problem, so it can be written directly in the grammar expression:

expr  :  term (POW expr)?

This behaves as expected, but after pattern matching you do have to check in your code whether the optional part matched, as shown next:

@grammar("term (POW expr)?")
def expr(self, p):
    if p[1]:
        return p[0] ** p[1][1]
    return p[0]
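
The same idea can be sketched as a hand-written recursive-descent function, independent of parsergen. The name `parse_pow` and the token-list format are assumptions of this sketch. Because the recursive call happens only after a base and a POW token have been consumed, the recursion always makes progress.

```python
def parse_pow(tokens, pos=0):
    """expr : term (POW expr)?, with the recursive call on the right."""
    if pos >= len(tokens) or tokens[pos][0] != "INT":
        raise SyntaxError(f"expected INT at position {pos}")
    base = tokens[pos][1]
    pos += 1
    if pos < len(tokens) and tokens[pos][0] == "POW":
        # Evaluate everything to the right first: a ** (b ** c).
        exponent, pos = parse_pow(tokens, pos + 1)
        return base ** exponent, pos
    return base, pos

# 2 ** 2 ** 3 groups as 2 ** (2 ** 3).
tokens = [("INT", 2), ("POW", None), ("INT", 2), ("POW", None), ("INT", 3)]
value, end = parse_pow(tokens)
```

The optional-part check in the parsergen action above plays the same role as the `if` on the POW token here.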

Printing the grammar for your parser

It is sometimes helpful to see the entire grammar for your parser. This can be done as shown below:

from parsergen import get_grammar
print(get_grammar(CalcParser))

See example_calc.py and example.py for more examples, or look at the source code.
