A parsing library for Python
Project description
Esrapy — Parsing in Python
esrapy is yparse backwards
Esrapy is a parsing library written entirely in Python. It can parse pretty much any context-free grammar, including left-recursive ones; can compile patterns (grammars) from textual or procedural descriptions; unifies parsing and tokenizing so tokenization can be contextual; has support for precedence (ambiguity resolution) and attributes (limited context-sensitivity); can return "first match" or a forest of all possible parsings (for an ambiguous grammar); is very easy to use and results in very readable applications; and is only about 600 lines of code (not counting comments) — or only 450 if you don't need the textual compiler.
Esrapy works as a simple translator from a raw source text to a friendly parse tree, with emphasis on making that tree very easy to traverse.
Esrapy is not terribly fast and may be a bit of a memory hog while running. It's probably most useful for language prototyping or small interactive applications where generality and ease of use are more important than speed. The parsing method is somewhere between chart parsing and packrat recursive descent.
See also: pyparsing and spark.
Quick example
Here is a complete program using esrapy that implements a simple infix calculator with operator precedence:
import esrapy
# Here's a Pattern that parses expressions like "a = b*5+(2^c)":
pattern = esrapy.compile(r"""
SPACE = <[ \t]*>
expr = name:<[a-zA-Z_]+> op:'=' b:expr
|1; a:expr@1 op:<[+-]> b:expr@2
|2; a:expr@2 op:<[*/]> b:expr@3
|3; a:expr@3 op:'^' b:expr@4
|4; name:<[a-zA-Z_]+> op:\var
| val:<[0-9]+> op:\int
| '(' _:expr ')'
""")
vars = {}
def eval_pt(pt):
if pt.op == 'int': return int(pt.val)
if pt.op == '+': return eval_pt(pt.a) + eval_pt(pt.b)
if pt.op == '-': return eval_pt(pt.a) - eval_pt(pt.b)
if pt.op == '*': return eval_pt(pt.a) * eval_pt(pt.b)
if pt.op == '/': return eval_pt(pt.a) / eval_pt(pt.b)
if pt.op == '^': return eval_pt(pt.a) ** eval_pt(pt.b)
if pt.op == 'var': return vars[pt.name]
if pt.op == '=':
val = eval_pt(pt.b)
vars[pt.name] = val
return val
while True:
try:
s = input("Expr: ")
except EOFError:
break
try:
pt = pattern.match(s, compact=True)
except SyntaxError:
print("Sorry, couldn't parse that.")
continue
print("Value:", eval_pt(pt))
See the examples/ directory for more, including calc.py which steps
through the evolution of a pattern description, and maxicalc.py which
demonstrates the visitor interface.
Getting started
See the examples directory for an implicit tutorial. calc.py in particular
steps through the evolution of a Pattern description.
From within Python, you can also import esrapy and run help(esrapy) to
get an overview. The sub-modules (esrapyCore, esrapyPatterns, etc.) also
have help() content if you want to dig into the internals or use the
procedural interface.
Pattern syntax
Patterns (grammars) are specified textually and compiled with:
pattern = esrapy.compile(textSpec)
Note textSpec is the definition of a pattern, not the text you will
eventually parse with it. To compile and immediately parse:
parsed = esrapy.compile(textSpec).match(sourceText, compact=True)
The text spec is a series of definitions of the form name = pattern where
name is any alphanumeric name (underscores ok), and pattern is any of
those defined below. The name must be at the beginning of the line — no
indentation is allowed. Indented lines are treated as continuations of the
previous definition (no \ needed or allowed). Blank lines are ignored, as
is anything after a #.
See examples/esrapy.pat for a sample pattern specification, which also
happens to describe the syntax of patterns itself (i.e., esrapy.pat parses
itself). See also calc.py and calc*.pat for a series of progressively
richer ways of defining the same pattern.
A pattern can be one of:
-
A named pattern reference, as in
foo, wherefoois defined elsewhere in the pattern file. -
A literal string enclosed in
''or"", as in"foo"or'foo'. -
A regular expression (see the
remodule) enclosed in<>or` `, as in<[a-z]+>or`[a-z]+`. A group may be selected as the value with the@operator, as in<un-([a-z]+)>@1. The default group (if no@is given) is 0, i.e. the entire matched string. -
A sequence of patterns, as in
foo bar bazor'foo' bar <[baz]>. The returned value is a simple array of the values of each pattern. Items may be labeled:a:foo b:bar c:<[baz]>. If only a single item is labeled and that label is_, that item is returned directly as the value of the sequence (passthrough) with all other items discarded. If multiple items are labeled (or a single item labeled with any name other than_), the return value is aMetaDictof those labeled items (seeesrapyMisc.py), which exposes all entries as attributes sofoo["bar"]can be writtenfoo.bar. Seeminicalc.pyfor an example. -
A set of alternates separated by
|, as infoo | bar | baz. The returned value is the value of whichever pattern succeeds. Alternates are tried in order, so the first takes priority, but due to backtracking even the first match may not be the one ultimately selected. Items may be labeled using;as ina;foo|bar|c;baz. If a labeled item is selected, the return value is a 2-tuple(label, value). An integer precedence may also be specified, applying to all successive items:set = 1,a; foo | 2; bar | 3,c; baz | d; zorp subset = set@3 # Equivalent to: c; baz | d; zorp inUse = "yo" set@2 <[a-z]+>This is most useful for binary operator precedence. See
minicalc.py. -
A repetition specified with
*,+,?,{n}, or{n,m}, where{n,m}means from n to m copies,{n}means n or more, and*,+,?correspond to{0},{1},{0,1}. The return value is always an array. A separating pattern may be specified after a/, as infoo/bar+, which matchesfoo,foo bar foo,foo bar foo bar foo, etc. The separator is omitted from the return value. Useful for comma-separated lists:params = varName/","*. -
A keyword inserted into the return stream with no effect on parsing, specified with a leading
\, as in\key. For example:time = hour:int ":" minute:int | "noon" hour:\12 minute:\00 -
An indented block of equally-indented lines, specified as
>> pattern, wherepatternbegins with non-whitespace and ends at end of line or end of file. Most useful for the top level of indentation-sensitive syntaxes. -
A header line followed by an indented block, specified as
head >> pattern.headmatches something like the head of aniforwhileblock. Note that the defaultSPACEpattern is incompatible with block/indent because it causespatternto start with whitespace. -
Grouping with
(), as infoo (bar | baz) zorp.
Summary by example
foo bar baz a sequence of patterns, returns a list
foo|bar|baz a set of alternates, returns one item
foo* zero or more, returns a list
foo+ one or more, returns a list
foo? zero or one (optional), returns a list
foo{2,5} 2 to 5 copies, returns a list
foo/bar+ one or more foo's separated by bar; returns foos
foo/bar{5} 5 or more foo's separated by bar; returns foos
\foo matches empty string, returns the literal "foo"
"foo" a literal, returns itself
'foo' ditto
<foo> a regular expression, returns a string
`foo` ditto
<f(oo)>@1 matches regex, returns the group "oo"
a:foo b:bar sequence returning MetaDict {a:foo, b:bar}
_:foo bar sequence returning only foo (passthrough)
a;foo|b;bar set returning ("a", foo) or ("b", bar)
1;foo|2;bar set annotated with precedence
foo@5 subset of foo with precedence >= 5
(foo|bar)baz equivalent to: foo baz | bar baz
>> foo a block of equally indented lines matching foo
foo >> bar line matching foo followed by indented block of bar
Visitor interface
The traditional way to use esrapy is to call pattern.match() and work
with the result — a nested structure of strings, lists, and MetaDicts.
This works well for simple cases, but for non-trivial grammars a visitor
is cleaner for building ASTs or evaluating parse trees.
Leave compact=False (the default) in match() to get Match objects
directly, then walk the tree with a visitor whose methods are dispatched
by pattern name:
m = pattern.match(sourceText)
result = m.visit(myVisitor)
visit() dispatches to a method named on_<patternname> on the visitor.
If no matching on_* method exists, visit() applies default structural
behavior for the matched pattern type and returns the result directly.
The parts parameter
If a handler declares a parameter named parts, visit() automatically
calls the default structural decomposition and passes it as parts=,
saving the handler from doing so explicitly. The type of parts depends
on the pattern:
- Sequence with multiple named children:
partsis aMetaDict:def on_binop(self, m, parts): return parts.a + parts.b # parts.a, parts.b already visited
- Sequence with sole
_child:partsis the single visited child value. - Regex (terminal):
partsis the matched string:def on_number(self, m, parts): return int(parts)
- NtoM (repetition):
partsis a list of visited children. - Block:
partsis a list of visited statements. - Indent:
partsisMetaDict(head=..., body=...).
In all cases, child matches within parts have already been visited
recursively before the handler is called, so handlers are typically
just one or two lines:
def on_assign(self, m, parts):
vars[parts.name] = parts.val
return parts.val
def on_binop(self, m, parts):
op = parts.op
if op == '+': return parts.a + parts.b
if op == '-': return parts.a - parts.b
if op == '*': return parts.a * parts.b
if op == '/': return parts.a / parts.b
if op == '^': return parts.a ** parts.b
def on_number(self, m, parts):
return int(parts)
def on_var(self, m, parts):
return vars[parts]
Omit parts when you need to control or defer the default recursion:
def on_with_stmt(self, m):
self.context.push(...)
parts = m.visit_parts(self)
self.context.pop()
return parts.body
Default behavior by pattern type
When no handler exists for a named pattern, or when visit_parts() is
called explicitly:
| Pattern type | Default behavior |
|---|---|
| Regex | Returns the matched string (m.value) |
Sequence (sole _ child) |
Returns that child's visited value (passthrough) |
| Sequence (multiple named children) | Returns MetaDict of visited named children; unnamed children discarded |
| Sequence (no named children) | Returns list of all visited children |
| OneOfSet | If the winning alternative has a string label, dispatches to that handler; otherwise recurses into the winning alternative |
| NtoM | Returns list of visited children |
| Block | Returns list of visited children |
| Indent | Returns MetaDict(head=visited_head, body=visited_body) |
Labeled alternatives and OneOfSet
When alternatives are labeled with string names in the grammar:
expr = 1,binop; a:expr@1 op:<[+-]> b:expr@2
| 2,binop; a:expr@2 op:<[*/]> b:expr@3
| var
| number
visit() detects which labeled alternative matched and dispatches to
on_binop, on_var, on_number etc. directly. A label can be shared
across multiple alternatives — the same handler is called regardless of
which precedence level matched. Numeric precedence and string label can be
combined as N,name;.
Navigating children directly: m.children
For handlers that need to navigate the Match tree directly:
def on_binop(self, m):
c = m.children
a = c.a.visit(self)
op = c.op.value
b = c.b.visit(self)
return ...
m.children returns a MatchView with attribute-style access to named
children, recursing transparently through anonymous intermediate nodes.
Use c.get('name', default) for optional children. Iterating m.children
yields all named children in order.
m.value returns the raw first parsing value — for terminal Regex matches
this is the matched string.
Tips for clean visitor dispatch
- Give every semantically meaningful alternative its own named rule or a string label within a set.
- Label all sequence items you need to access by name. Unlabeled items are
treated as punctuation and discarded in the
partsMetaDict. - Use
_as the label for a sole passthrough item in a sequence:paren = '(' _:expr ')' - For if/elif/else chains, fold the chain while handling the enclosing block rather than in a separate post-processing pass.
See examples/maxicalc.py for a complete working example of a visitor-based
expression evaluator.
How it works
Esrapy is essentially a recursive-descent parser with two key additions:
-
Chart parsing: If esrapy finds itself looking for something it's already looking for (such as when a pattern is recursively composed of itself), it doesn't start looking again — it just remembers the context and notifies it if the pattern is later found. This handles left-recursive grammars.
-
Packrat parsing: If esrapy finds itself looking for something it has already found, it uses its cached result rather than re-parsing. This prevents exponential blowup on ambiguous grammars.
That's pretty much the whole of it.
Changes
Version 1.2 (2026-05-11)
- (No change; bump version to sync better with pypi)
Version 1.1 (2026-04-01)
- Added
customparam to the esrapy Pattern language compiler to allow injection of custom Pattern objects/classes into otherwise compiled Patterns.
Version 1.0 (2026-03-23)
- Removed
compact()code (default visitor provides ~identical functionality). - Defaulting
compact=Falseinmatch()(*** API CHANGE ***)
Version 0.8.3 (2026-03-23)
Match.visit()now acceptsNonefor the visitor, falling back to the same behavior ascompact(all=False).
Version 0.8.2 (2026-03-21)
- Misc tweaks.
Version 0.8.1 (2026-03-16)
- Revised Sequence single-label passthrough:
_is now required to trigger passthrough (previously any single label triggered it). - Revised
Match.visit()to apply default structural behavior per pattern type when no handler exists, rather than callingon_default(). - Added
Match.dispatch(),Match.visit_parts(), andvisit_parts()to all Pattern subclasses. - Labeled
OneOfSetalternatives now require an explicit handler.
Version 0.8 (2026-03-15)
- Added visitor pattern support. Call
pattern.match(compact=False)to getMatchobjects, then usem.visit(visitor)to dispatch to handler methods by pattern name. - Added
Match.children(MatchViewwith attribute-style access to named children). - Added
Match.valueproperty for terminal match strings. - Extended
OneOfSetalternative labels to support string names combinable with integer precedences asN,name;.
Version 0.7 (2023-03-20)
- Added
firstLineparameter tomatch()/parse()for better error messages. - Incorporated
esrapyIndentfor indentation-sensitive syntaxes; added>>operator for indented blocks.
Version 0.6.1 (2006-03-25)
- Added catch of infinite repetitions of nil patterns in
NtoM. - Syntax errors now raise
SyntaxErrorinstead ofValueError.
Version 0.6 (2006-03-16)
- Added
SPACEpattern for implicit whitespace. - Added skip expression to
RegexPattern.
Version 0.5 (2006-03-15)
- Initial release.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file esrapy-1.2.tar.gz.
File metadata
- Download URL: esrapy-1.2.tar.gz
- Upload date:
- Size: 29.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1bb97e6feece7ef257d773c9b5f4f5fb2b1b11133bfc7717b476b2b38522016a
|
|
| MD5 |
b58911ae192025b46d7eab063e418170
|
|
| BLAKE2b-256 |
e4f085603a9082133968f9833f2430fd3e733bc9d69e1486d47629f29e7eea9c
|
File details
Details for the file esrapy-1.2-py2.py3-none-any.whl.
File metadata
- Download URL: esrapy-1.2-py2.py3-none-any.whl
- Upload date:
- Size: 27.3 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3971226ed74f98073e9c803eedb639e897a459ebc44e18077d42053939dd147f
|
|
| MD5 |
ba8385f6f8fc07494abdb16afefcc5a8
|
|
| BLAKE2b-256 |
ad5c73facaf9b8cf173fa84e326c42d72ef1ca6d83770191443a6eee5a8e023a
|