Skip to main content

A self-hosting parser generator with a modal lexer and LR(1) parser

Project description

Boho

PyPI version Python

A self-hosting parser generator for Python. Define your grammar in a concise EBNF-based metalanguage and Boho will generate a modal lexer (DFA-based) and an LR(1) parser that produce a clean syntax tree.

Installation

pip install boho

Quick start

from boho import Boho
from boho.interpreter import Interpreter

grammar = '''
start: sum

sum: sum "+" prod
   | prod

prod: prod "*" INT
    | INT

INT: @INT

%ignore " "
'''

b = Boho(grammar)
tree = b("2 + 3 * 4")
print(tree.pretty())

Output:

start:
  sum:
    sum:
      prod:
        'INT' '2'
    prod:
      prod:
        'INT' '3'
      'INT' '4'

Writing an interpreter

Subclass Interpreter and define methods matching your nonterminal names:

class Calc(Interpreter):
    def start(self, tree):
        return self(tree[0])

    def sum(self, tree):
        return sum(self(c) for c in tree)

    def prod(self, tree):
        result = int(self(tree[0]))
        for i in range(1, len(tree)):
            result *= int(self(tree[i]))
        return result


calc = Calc()
print(calc(tree))  # 14

The Boho metalanguage

Terminal definitions

Terminals are named in UPPER_CASE and can be described three ways:

PLUS: "+"               // string literal
NUMBER: /\d+(\.\d+)?/   // regular expression
STRING: @STR             // built-in description (@INT, @FLOAT, @STR)

Terminal descriptions can also be used directly (unnamed) in grammar rules -- they will be pruned from the syntax tree.

Prefixing a terminal name with _ (e.g. _WHITESPACE) prunes it from the tree despite being named.

Grammar rules

Nonterminals use lower_case names. Alternatives are separated with |:

value: NAME | NUMBER
assignment: NAME "=" value

EBNF extensions:

items: item+              // one or more
list: (item ",")*  item   // zero or more (with grouping)
optional: modifier?       // optional

Inline aliases with ->:

expr: term "+" term -> addition
    | term "-" term -> subtraction

Fake terminals

A name like COMMENT_ (uppercase ending with _) defines a fake terminal -- described like a nonterminal but collapsed into a single token. Useful for structures that regular expressions cannot describe (e.g. nested block comments).

Lexer modes

Following ANTLR's approach, a modal lexer is supported. Terminals before the first #mode belong to all modes.

LBRACE: "{" -> +inner    // push mode
RBRACE: "}" -> -         // pop mode

#inner
CONTENT: /[^{}]+/
Syntax Effect
-> +mode push mode onto stack
-> - pop one mode
-> -N pop N modes
-> -- clear the stack
-> mode replace top of stack

Ignoring tokens

%ignore " "
%ignore /\/\/[^\n]*/    // ignore line comments

Project structure

boho/
  __init__.py            # exports the Boho class
  boho.py                # main orchestrator
  lexer.py               # modal finite-automaton lexer
  lexer_generator.py     # terminal descriptions -> lexer DFA
  parser.py              # LR(1) shift-reduce parser
  parser_generator.py    # grammar -> LR(1) parse tables
  grammar_interpreter.py # interprets the Boho metalanguage
  interpreter.py         # base Interpreter class (visitor pattern)
  objects.py             # Token, Tree, LR1Item dataclasses
  regex.py               # regex-to-DFA via greenery
  grammars.py            # pre-compiled Boho grammar tables
docs/                    # English documentation
slo-dokumentacija/       # Slovenian documentation
examples/                # usage examples
tests/                   # test suite

How it works

  1. Your grammar string is parsed by Boho's own (bootstrapped) parser.
  2. Terminal descriptions are compiled into merged DFAs for a modal lexer.
  3. Grammar rules are compiled into LR(1) parse tables.
  4. At runtime, input text is tokenized by the lexer and then parsed into a Tree of Token leaves.

Boho is self-hosting -- its own metalanguage is specified in Boho (see examples/boho_in_boho.py).

API

Boho(grammar, log=False)

Create a parser from a grammar string. Set log=True to print the generated lexer and parser tables.

Boho.from_tables(lex_table, pars_table)

Create a parser from pre-computed lexer and parser tables. Skips the (potentially slow) grammar-to-tables compilation.

Boho.load(path)

Create a parser from a JSON file previously written by save().

boho.save(path)

Write the current [lex_table, pars_table] to a JSON file for fast reloading.

boho.tables

A [lex_table, pars_table] list — ready to pass back into from_tables or serialize yourself.

boho(text, log=False) -> Tree

Parse input text. Returns a Tree with Token leaves. Set log=True for step-by-step tracing.

Caching tables

Grammar compilation can be slow for larger grammars; the lexer/parser tables, on the other hand, load instantly. Compile once, persist, reuse:

from boho import Boho

# First run — compile and cache
Boho("""...grammar...""").save("tables.json")

# Subsequent runs — skip compilation
b = Boho.load("tables.json")
b("some input")

Tree

  • tree.name -- nonterminal name
  • tree.children -- list of Tree / Token children
  • tree.value -- concatenated text of all descendant tokens
  • tree.pretty() -- indented string representation
  • Supports iteration and indexing (tree[0], for child in tree)

Token

  • token.name -- terminal name
  • token.value -- matched text
  • token.line, token.col -- source location

Interpreter

Base class for tree walkers. Subclass it and define methods named after your nonterminals. The default behavior for unhandled nodes: tokens return their value, trees return a list of children's results.

Dependencies

  • greenery -- regex-to-FSM conversion
  • Python 3.10+ (uses match statements and X | Y type unions)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

boho-0.2.2.tar.gz (37.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

boho-0.2.2-py3-none-any.whl (22.2 kB view details)

Uploaded Python 3

File details

Details for the file boho-0.2.2.tar.gz.

File metadata

  • Download URL: boho-0.2.2.tar.gz
  • Upload date:
  • Size: 37.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.6 {"installer":{"name":"uv","version":"0.11.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Arch Linux","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for boho-0.2.2.tar.gz
Algorithm Hash digest
SHA256 7870be6e87a62d00fd944e04430269b2710884732050d59916b8e30c585231e0
MD5 e9251b36a6387c24e3dd5d126894fb71
BLAKE2b-256 733c748f818b0f5d8b2479b3bd9384895630f9757877103c880f269a91553584

See more details on using hashes here.

File details

Details for the file boho-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: boho-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 22.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.6 {"installer":{"name":"uv","version":"0.11.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Arch Linux","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for boho-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 de921097ada784455e6995b1950d36644d54917f346d27f1752d800926a569b5
MD5 c8048d5ff866ee5c4e5f0544604a8478
BLAKE2b-256 14b1a918d90bb2c8af75076f06a216fb05cfb1b25138bb5245981e8bc987328a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page