Skip to main content

A self-hosting parser generator with a modal lexer and LR(1) parser

Project description

Boho

A self-hosting parser generator for Python. Define your grammar in a concise EBNF-based metalanguage and Boho will generate a modal lexer (DFA-based) and an LR(1) parser that produce a clean syntax tree.

Installation

pip install boho

Quick start

from boho import Boho
from boho.interpreter import Interpreter

grammar = '''
start: sum

sum: sum "+" prod
   | prod

prod: prod "*" INT
    | INT

INT: @INT

%ignore " "
'''

b = Boho(grammar)
tree = b("2 + 3 * 4")
print(tree.pretty())

Output:

start:
  sum:
    sum:
      prod:
        'INT' '2'
    prod:
      prod:
        'INT' '3'
      'INT' '4'

Writing an interpreter

Subclass Interpreter and define methods matching your nonterminal names:

class Calc(Interpreter):
    def start(self, tree):
        return self(tree[0])

    def sum(self, tree):
        return sum(self(c) for c in tree)

    def prod(self, tree):
        result = int(self(tree[0]))
        for i in range(1, len(tree)):
            result *= int(self(tree[i]))
        return result

calc = Calc()
print(calc(tree))  # 14

The Boho metalanguage

Terminal definitions

Terminals are named in UPPER_CASE and can be described three ways:

PLUS: "+"               // string literal
NUMBER: /\d+(\.\d+)?/   // regular expression
STRING: @STR             // built-in description (@INT, @FLOAT, @STR)

Terminal descriptions can also be used directly (unnamed) in grammar rules -- they will be pruned from the syntax tree.

Prefixing a terminal name with _ (e.g. _WHITESPACE) prunes it from the tree despite being named.

Grammar rules

Nonterminals use lower_case names. Alternatives are separated with |:

value: NAME | NUMBER
assignment: NAME "=" value

EBNF extensions:

items: item+              // one or more
list: (item ",")*  item   // zero or more (with grouping)
optional: modifier?       // optional

Inline aliases with ->:

expr: term "+" term -> addition
    | term "-" term -> subtraction

Fake terminals

A name like COMMENT_ (uppercase ending with _) defines a fake terminal -- described like a nonterminal but collapsed into a single token. Useful for structures that regular expressions cannot describe (e.g. nested block comments).

Lexer modes

Following ANTLR's approach, a modal lexer is supported. Terminals before the first #mode belong to all modes.

LBRACE: "{" -> +inner    // push mode
RBRACE: "}" -> -         // pop mode

#inner
CONTENT: /[^{}]+/
Syntax Effect
-> +mode push mode onto stack
-> - pop one mode
-> -N pop N modes
-> -- clear the stack
-> mode replace top of stack

Ignoring tokens

%ignore " "
%ignore /\/\/[^\n]*/    // ignore line comments

Project structure

boho/
  __init__.py            # exports the Boho class
  boho.py                # main orchestrator
  lexer.py               # modal finite-automaton lexer
  lexer_generator.py     # terminal descriptions -> lexer DFA
  parser.py              # LR(1) shift-reduce parser
  parser_generator.py    # grammar -> LR(1) parse tables
  grammar_interpreter.py # interprets the Boho metalanguage
  interpreter.py         # base Interpreter class (visitor pattern)
  objects.py             # Token, Tree, LR1Item dataclasses
  regex.py               # regex-to-DFA via greenery
  grammars.py            # pre-compiled Boho grammar tables
docs/                    # English documentation
slo-dokumentacija/       # Slovenian documentation
examples/                # usage examples
tests/                   # test suite

How it works

  1. Your grammar string is parsed by Boho's own (bootstrapped) parser.
  2. Terminal descriptions are compiled into merged DFAs for a modal lexer.
  3. Grammar rules are compiled into LR(1) parse tables.
  4. At runtime, input text is tokenized by the lexer and then parsed into a Tree of Token leaves.

Boho is self-hosting -- its own metalanguage is specified in Boho (see examples/boho_in_boho.py).

API

Boho(grammar, log=False)

Create a parser from a grammar string. Set log=True to print the generated lexer and parser tables.

boho(text, log=False) -> Tree

Parse input text. Returns a Tree with Token leaves. Set log=True for step-by-step tracing.

Tree

  • tree.name -- nonterminal name
  • tree.children -- list of Tree / Token children
  • tree.value -- concatenated text of all descendant tokens
  • tree.pretty() -- indented string representation
  • Supports iteration and indexing (tree[0], for child in tree)

Token

  • token.name -- terminal name
  • token.value -- matched text
  • token.line, token.col -- source location

Interpreter

Base class for tree walkers. Subclass it and define methods named after your nonterminals. The default behavior for unhandled nodes: tokens return their value, trees return a list of children's results.

Dependencies

  • greenery -- regex-to-FSM conversion
  • Python 3.10+ (uses match statements and X | Y type unions)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

boho-0.1.0.tar.gz (34.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

boho-0.1.0-py3-none-any.whl (20.8 kB view details)

Uploaded Python 3

File details

Details for the file boho-0.1.0.tar.gz.

File metadata

  • Download URL: boho-0.1.0.tar.gz
  • Upload date:
  • Size: 34.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for boho-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8c2e5b783ceb1341e63dc1367c74cff09db2da27503cf2efa4be41871da84648
MD5 9e605cf3a9283c7710efb0755339a8f8
BLAKE2b-256 282e8215352788449066fd375f37a4fb492839fb6e91535e3b62f3ef4f2f8863

See more details on using hashes here.

File details

Details for the file boho-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: boho-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 20.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for boho-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2272c7fca7ac12c20919a4ca1258c7677825b3bafee561be52ccddc79a7e0e9e
MD5 e8756985997ca3ba8b2a21b6dd831cd4
BLAKE2b-256 c7470c5117fec41dfd0b9092e219d8f70571a85d64be531f5b53da05976aeda6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page