Skip to main content

A self-hosting parser generator with a modal lexer and LR(1) parser

Project description

Boho

PyPI version Python

A self-hosting parser generator for Python. Define your grammar in a concise EBNF-based metalanguage and Boho will generate a modal lexer (DFA-based) and an LR(1) parser that produce a clean syntax tree.

Installation

pip install boho

Quick start

from boho import Boho
from boho.interpreter import Interpreter

grammar = '''
start: sum

sum: sum "+" prod
   | prod

prod: prod "*" INT
    | INT

INT: @INT

%ignore " "
'''

b = Boho(grammar)
tree = b("2 + 3 * 4")
print(tree.pretty())

Output:

start:
  sum:
    sum:
      prod:
        'INT' '2'
    prod:
      prod:
        'INT' '3'
      'INT' '4'

Writing an interpreter

Subclass Interpreter and define methods matching your nonterminal names:

class Calc(Interpreter):
    def start(self, tree):
        return self(tree[0])

    def sum(self, tree):
        return sum(self(c) for c in tree)

    def prod(self, tree):
        result = int(self(tree[0]))
        for i in range(1, len(tree)):
            result *= int(self(tree[i]))
        return result


calc = Calc()
print(calc(tree))  # 14

The Boho metalanguage

Terminal definitions

Terminals are named in UPPER_CASE and can be described three ways:

PLUS: "+"               // string literal
NUMBER: /\d+(\.\d+)?/   // regular expression
STRING: @STR             // built-in description (@INT, @FLOAT, @STR)

Terminal descriptions can also be used directly (unnamed) in grammar rules -- they will be pruned from the syntax tree.

Prefixing a terminal name with _ (e.g. _WHITESPACE) prunes it from the tree despite being named.

Grammar rules

Nonterminals use lower_case names. Alternatives are separated with |:

value: NAME | NUMBER
assignment: NAME "=" value

EBNF extensions:

items: item+              // one or more
list: (item ",")*  item   // zero or more (with grouping)
optional: modifier?       // optional

Inline aliases with ->:

expr: term "+" term -> addition
    | term "-" term -> subtraction

Fake terminals

A name like COMMENT_ (uppercase ending with _) defines a fake terminal -- described like a nonterminal but collapsed into a single token. Useful for structures that regular expressions cannot describe (e.g. nested block comments).

Lexer modes

Following ANTLR's approach, a modal lexer is supported. Terminals before the first #mode belong to all modes.

LBRACE: "{" -> +inner    // push mode
RBRACE: "}" -> -         // pop mode

#inner
CONTENT: /[^{}]+/
Syntax Effect
-> +mode push mode onto stack
-> - pop one mode
-> -N pop N modes
-> -- clear the stack
-> mode replace top of stack

Ignoring tokens

%ignore " "
%ignore /\/\/[^\n]*/    // ignore line comments

Project structure

boho/
  __init__.py            # exports the Boho class
  boho.py                # main orchestrator
  lexer.py               # modal finite-automaton lexer
  lexer_generator.py     # terminal descriptions -> lexer DFA
  parser.py              # LR(1) shift-reduce parser
  parser_generator.py    # grammar -> LR(1) parse tables
  grammar_interpreter.py # interprets the Boho metalanguage
  interpreter.py         # base Interpreter class (visitor pattern)
  objects.py             # Token, Tree, LR1Item dataclasses
  regex.py               # regex-to-DFA via greenery
  grammars.py            # pre-compiled Boho grammar tables
docs/                    # English documentation
slo-dokumentacija/       # Slovenian documentation
examples/                # usage examples
tests/                   # test suite

How it works

  1. Your grammar string is parsed by Boho's own (bootstrapped) parser.
  2. Terminal descriptions are compiled into merged DFAs for a modal lexer.
  3. Grammar rules are compiled into LR(1) parse tables.
  4. At runtime, input text is tokenized by the lexer and then parsed into a Tree of Token leaves.

Boho is self-hosting -- its own metalanguage is specified in Boho (see examples/boho_in_boho.py).

API

Boho(grammar, log=False)

Create a parser from a grammar string. Set log=True to print the generated lexer and parser tables.

Boho.from_tables(lex_table, pars_table)

Create a parser from pre-computed lexer and parser tables. Skips the (potentially slow) grammar-to-tables compilation.

Boho.load(path)

Create a parser from a JSON file previously written by save().

boho.save(path)

Write the current [lex_table, pars_table] to a JSON file for fast reloading.

boho.tables

A [lex_table, pars_table] list — ready to pass back into from_tables or serialize yourself.

boho(text, log=False) -> Tree

Parse input text. Returns a Tree with Token leaves. Set log=True for step-by-step tracing.

Caching tables

Grammar compilation can be slow for larger grammars; the lexer/parser tables, on the other hand, load instantly. Compile once, persist, reuse:

from boho import Boho

# First run — compile and cache
Boho("""...grammar...""").save("tables.json")

# Subsequent runs — skip compilation
b = Boho.load("tables.json")
b("some input")

Tree

  • tree.name -- nonterminal name
  • tree.children -- list of Tree / Token children
  • tree.value -- concatenated text of all descendant tokens
  • tree.pretty() -- indented string representation
  • Supports iteration and indexing (tree[0], for child in tree)

Token

  • token.name -- terminal name
  • token.value -- matched text
  • token.line, token.col -- source location

Interpreter

Base class for tree walkers. Subclass it and define methods named after your nonterminals. The default behavior for unhandled nodes: tokens return their value, trees return a list of children's results.

Dependencies

  • greenery -- regex-to-FSM conversion
  • Python 3.10+ (uses match statements and X | Y type unions)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

boho-0.2.0.tar.gz (40.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

boho-0.2.0-py3-none-any.whl (21.8 kB view details)

Uploaded Python 3

File details

Details for the file boho-0.2.0.tar.gz.

File metadata

  • Download URL: boho-0.2.0.tar.gz
  • Upload date:
  • Size: 40.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for boho-0.2.0.tar.gz
Algorithm Hash digest
SHA256 67cbcbb9642095d1647e1c0780a2f32c852aa24871a5b9637f75fe28f2b4763f
MD5 14b83def2601f3610f16f47bb18cbf20
BLAKE2b-256 aa93b48c8a87e2a0979e4c46df19399b659a6223f46dabc5f4debac0af06b595

See more details on using hashes here.

File details

Details for the file boho-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: boho-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 21.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for boho-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 037005e8858947dffbce8cfeddaf2f57c2a594bb0ea8e11137df2022d59cdd5f
MD5 7273950665f3d778dbb01369c471aaca
BLAKE2b-256 3c1a47be92bb220a4f9236d578a4c1c8f0fd10d17c5e3aed5d6148ed6b7d021f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page