A self-hosting parser generator with a modal lexer and LR(1) parser
Project description
Boho
A self-hosting parser generator for Python. Define your grammar in a concise EBNF-based metalanguage and Boho will generate a modal lexer (DFA-based) and an LR(1) parser that produce a clean syntax tree.
Installation
pip install boho
Quick start
from boho import Boho
from boho.interpreter import Interpreter
grammar = '''
start: sum
sum: sum "+" prod
| prod
prod: prod "*" INT
| INT
INT: @INT
%ignore " "
'''
b = Boho(grammar)
tree = b("2 + 3 * 4")
print(tree.pretty())
Output:
start:
sum:
sum:
prod:
'INT' '2'
prod:
prod:
'INT' '3'
'INT' '4'
Writing an interpreter
Subclass Interpreter and define methods matching your nonterminal names:
class Calc(Interpreter):
def start(self, tree):
return self(tree[0])
def sum(self, tree):
return sum(self(c) for c in tree)
def prod(self, tree):
result = int(self(tree[0]))
for i in range(1, len(tree)):
result *= int(self(tree[i]))
return result
calc = Calc()
print(calc(tree)) # 14
The Boho metalanguage
Terminal definitions
Terminals are named in UPPER_CASE and can be described three ways:
PLUS: "+" // string literal
NUMBER: /\d+(\.\d+)?/ // regular expression
STRING: @STR // built-in description (@INT, @FLOAT, @STR)
Terminal descriptions can also be used directly (unnamed) in grammar rules -- they will be pruned from the syntax tree.
Prefixing a terminal name with _ (e.g. _WHITESPACE) prunes it from the tree despite being named.
Grammar rules
Nonterminals use lower_case names. Alternatives are separated with |:
value: NAME | NUMBER
assignment: NAME "=" value
EBNF extensions:
items: item+ // one or more
list: (item ",")* item // zero or more (with grouping)
optional: modifier? // optional
Inline aliases with ->:
expr: term "+" term -> addition
| term "-" term -> subtraction
Fake terminals
A name like COMMENT_ (uppercase ending with _) defines a fake terminal -- described like a nonterminal but collapsed
into a single token. Useful for structures that regular expressions cannot describe (e.g. nested block comments).
Lexer modes
Following ANTLR's approach, a modal lexer is supported. Terminals before the first #mode belong to all modes.
LBRACE: "{" -> +inner // push mode
RBRACE: "}" -> - // pop mode
#inner
CONTENT: /[^{}]+/
| Syntax | Effect |
|---|---|
-> +mode |
push mode onto stack |
-> - |
pop one mode |
-> -N |
pop N modes |
-> -- |
clear the stack |
-> mode |
replace top of stack |
Ignoring tokens
%ignore " "
%ignore /\/\/[^\n]*/ // ignore line comments
Project structure
boho/
__init__.py # exports the Boho class
boho.py # main orchestrator
lexer.py # modal finite-automaton lexer
lexer_generator.py # terminal descriptions -> lexer DFA
parser.py # LR(1) shift-reduce parser
parser_generator.py # grammar -> LR(1) parse tables
grammar_interpreter.py # interprets the Boho metalanguage
interpreter.py # base Interpreter class (visitor pattern)
objects.py # Token, Tree, LR1Item dataclasses
regex.py # regex-to-DFA via greenery
grammars.py # pre-compiled Boho grammar tables
docs/ # English documentation
slo-dokumentacija/ # Slovenian documentation
examples/ # usage examples
tests/ # test suite
How it works
- Your grammar string is parsed by Boho's own (bootstrapped) parser.
- Terminal descriptions are compiled into merged DFAs for a modal lexer.
- Grammar rules are compiled into LR(1) parse tables.
- At runtime, input text is tokenized by the lexer and then parsed into a
TreeofTokenleaves.
Boho is self-hosting -- its own metalanguage is specified in Boho (see examples/boho_in_boho.py).
API
Boho(grammar, log=False)
Create a parser from a grammar string. Set log=True to print the generated lexer and parser tables.
boho(text, log=False) -> Tree
Parse input text. Returns a Tree with Token leaves. Set log=True for step-by-step tracing.
Tree
tree.name-- nonterminal nametree.children-- list ofTree/Tokenchildrentree.value-- concatenated text of all descendant tokenstree.pretty()-- indented string representation- Supports iteration and indexing (
tree[0],for child in tree)
Token
token.name-- terminal nametoken.value-- matched texttoken.line,token.col-- source location
Interpreter
Base class for tree walkers. Subclass it and define methods named after your nonterminals. The default behavior for unhandled nodes: tokens return their value, trees return a list of children's results.
Dependencies
- greenery -- regex-to-FSM conversion
- Python 3.10+ (uses
matchstatements andX | Ytype unions)
License
MIT
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file boho-0.1.1.tar.gz.
File metadata
- Download URL: boho-0.1.1.tar.gz
- Upload date:
- Size: 34.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
349499e182f71d7bcddfc68d514ba611340b04e9d6b30149283f733dea82ed84
|
|
| MD5 |
b6bdbf71b22986444eb21aedf36c16e7
|
|
| BLAKE2b-256 |
61eb4a4f49db4c122cf67d4210839d32a74847cfd36fd7c76934afdda2a490ca
|
File details
Details for the file boho-0.1.1-py3-none-any.whl.
File metadata
- Download URL: boho-0.1.1-py3-none-any.whl
- Upload date:
- Size: 20.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d7ebf87a99737e4c164689bc81c2f7112b78ed47c37ff6e9bb54c69123eed539
|
|
| MD5 |
1d5e5d781112742c262a7334f96ef501
|
|
| BLAKE2b-256 |
3a43b92e5ee464450226e53ebcb0c438c0daa78cd34679b81183e9465d16f734
|