Skip to main content

A self-hosting parser generator with a modal lexer and LR(1) parser

Project description

Boho

PyPI version Python

A self-hosting parser generator for Python. Define your grammar in a concise EBNF-based metalanguage and Boho will generate a modal lexer (DFA-based) and an LR(1) parser that produce a clean syntax tree.

Installation

pip install boho

Quick start

from boho import Boho
from boho.interpreter import Interpreter

grammar = '''
start: sum

sum: sum "+" prod
   | prod

prod: prod "*" INT
    | INT

INT: @INT

%ignore " "
'''

b = Boho(grammar)
tree = b("2 + 3 * 4")
print(tree.pretty())

Output:

start:
  sum:
    sum:
      prod:
        'INT' '2'
    prod:
      prod:
        'INT' '3'
      'INT' '4'

Writing an interpreter

Subclass Interpreter and define methods matching your nonterminal names:

class Calc(Interpreter):
    def start(self, tree):
        return self(tree[0])

    def sum(self, tree):
        return sum(self(c) for c in tree)

    def prod(self, tree):
        result = int(self(tree[0]))
        for i in range(1, len(tree)):
            result *= int(self(tree[i]))
        return result


calc = Calc()
print(calc(tree))  # 14

The Boho metalanguage

Terminal definitions

Terminals are named in UPPER_CASE and can be described three ways:

PLUS: "+"               // string literal
NUMBER: /\d+(\.\d+)?/   // regular expression
STRING: @STR             // built-in description (@INT, @FLOAT, @STR)

Terminal descriptions can also be used directly (unnamed) in grammar rules -- they will be pruned from the syntax tree.

Prefixing a terminal name with _ (e.g. _WHITESPACE) prunes it from the tree despite being named.

Grammar rules

Nonterminals use lower_case names. Alternatives are separated with |:

value: NAME | NUMBER
assignment: NAME "=" value

EBNF extensions:

items: item+              // one or more
list: (item ",")*  item   // zero or more (with grouping)
optional: modifier?       // optional

Inline aliases with ->:

expr: term "+" term -> addition
    | term "-" term -> subtraction

Fake terminals

A name like COMMENT_ (uppercase ending with _) defines a fake terminal -- described like a nonterminal but collapsed into a single token. Useful for structures that regular expressions cannot describe (e.g. nested block comments).

Lexer modes

Following ANTLR's approach, a modal lexer is supported. Terminals before the first #mode belong to all modes.

LBRACE: "{" -> +inner    // push mode
RBRACE: "}" -> -         // pop mode

#inner
CONTENT: /[^{}]+/
Syntax Effect
-> +mode push mode onto stack
-> - pop one mode
-> -N pop N modes
-> -- clear the stack
-> mode replace top of stack

Ignoring tokens

%ignore " "
%ignore /\/\/[^\n]*/    // ignore line comments

Project structure

boho/
  __init__.py            # exports the Boho class
  boho.py                # main orchestrator
  lexer.py               # modal finite-automaton lexer
  lexer_generator.py     # terminal descriptions -> lexer DFA
  parser.py              # LR(1) shift-reduce parser
  parser_generator.py    # grammar -> LR(1) parse tables
  grammar_interpreter.py # interprets the Boho metalanguage
  interpreter.py         # base Interpreter class (visitor pattern)
  objects.py             # Token, Tree, LR1Item dataclasses
  regex.py               # regex-to-DFA via greenery
  grammars.py            # pre-compiled Boho grammar tables
docs/                    # English documentation
slo-dokumentacija/       # Slovenian documentation
examples/                # usage examples
tests/                   # test suite

How it works

  1. Your grammar string is parsed by Boho's own (bootstrapped) parser.
  2. Terminal descriptions are compiled into merged DFAs for a modal lexer.
  3. Grammar rules are compiled into LR(1) parse tables.
  4. At runtime, input text is tokenized by the lexer and then parsed into a Tree of Token leaves.

Boho is self-hosting -- its own metalanguage is specified in Boho (see examples/boho_in_boho.py).

API

Boho(grammar, log=False)

Create a parser from a grammar string. Set log=True to print the generated lexer and parser tables.

boho(text, log=False) -> Tree

Parse input text. Returns a Tree with Token leaves. Set log=True for step-by-step tracing.

Tree

  • tree.name -- nonterminal name
  • tree.children -- list of Tree / Token children
  • tree.value -- concatenated text of all descendant tokens
  • tree.pretty() -- indented string representation
  • Supports iteration and indexing (tree[0], for child in tree)

Token

  • token.name -- terminal name
  • token.value -- matched text
  • token.line, token.col -- source location

Interpreter

Base class for tree walkers. Subclass it and define methods named after your nonterminals. The default behavior for unhandled nodes: tokens return their value, trees return a list of children's results.

Dependencies

  • greenery -- regex-to-FSM conversion
  • Python 3.10+ (uses match statements and X | Y type unions)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

boho-0.1.1.tar.gz (34.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

boho-0.1.1-py3-none-any.whl (20.9 kB view details)

Uploaded Python 3

File details

Details for the file boho-0.1.1.tar.gz.

File metadata

  • Download URL: boho-0.1.1.tar.gz
  • Upload date:
  • Size: 34.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for boho-0.1.1.tar.gz
Algorithm Hash digest
SHA256 349499e182f71d7bcddfc68d514ba611340b04e9d6b30149283f733dea82ed84
MD5 b6bdbf71b22986444eb21aedf36c16e7
BLAKE2b-256 61eb4a4f49db4c122cf67d4210839d32a74847cfd36fd7c76934afdda2a490ca

See more details on using hashes here.

File details

Details for the file boho-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: boho-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 20.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for boho-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d7ebf87a99737e4c164689bc81c2f7112b78ed47c37ff6e9bb54c69123eed539
MD5 1d5e5d781112742c262a7334f96ef501
BLAKE2b-256 3a43b92e5ee464450226e53ebcb0c438c0daa78cd34679b81183e9465d16f734

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page