Skip to main content

A parsing library for Python

Project description

Esrapy — Parsing in Python

esrapy is yparse backwards

Esrapy is a parsing library written entirely in Python. It can parse pretty much any context-free grammar, including left-recursive ones; can compile patterns (grammars) from textual or procedural descriptions; unifies parsing and tokenizing so tokenization can be contextual; has support for precedence (ambiguity resolution) and attributes (limited context-sensitivity); can return "first match" or a forest of all possible parsings (for an ambiguous grammar); is very easy to use and results in very readable applications; and is only about 600 lines of code (not counting comments) — or only 450 if you don't need the textual compiler.

Esrapy works as a simple translator from a raw source text to a friendly parse tree, with emphasis on making that tree very easy to traverse.

Esrapy is not terribly fast and may be a bit of a memory hog while running. It's probably most useful for language prototyping or small interactive applications where generality and ease of use are more important than speed. The parsing method is somewhere between chart parsing and packrat recursive descent.

See also: pyparsing and spark.

Quick example

Here is a complete program using esrapy that implements a simple infix calculator with operator precedence:

import esrapy

# Here's a Pattern that parses expressions like "a = b*5+(2^c)":
pattern = esrapy.compile(r"""
SPACE = <[ \t]*>
expr  = name:<[a-zA-Z_]+> op:'=' b:expr
        |1; a:expr@1 op:<[+-]> b:expr@2
        |2; a:expr@2 op:<[*/]> b:expr@3
        |3; a:expr@3 op:'^'    b:expr@4
        |4; name:<[a-zA-Z_]+> op:\var
        |    val:<[0-9]+>     op:\int
        | '(' _:expr ')'
""")

vars = {}
def eval_pt(pt):
    if pt.op == 'int': return int(pt.val)
    if pt.op ==   '+': return eval_pt(pt.a)  + eval_pt(pt.b)
    if pt.op ==   '-': return eval_pt(pt.a)  - eval_pt(pt.b)
    if pt.op ==   '*': return eval_pt(pt.a)  * eval_pt(pt.b)
    if pt.op ==   '/': return eval_pt(pt.a)  / eval_pt(pt.b)
    if pt.op ==   '^': return eval_pt(pt.a) ** eval_pt(pt.b)
    if pt.op == 'var': return vars[pt.name]
    if pt.op ==   '=':
        val = eval_pt(pt.b)
        vars[pt.name] = val
        return val

while True:
    try:
        s = input("Expr: ")
    except EOFError:
        break
    try:
        pt = pattern.match(s, compact=True)
    except SyntaxError:
        print("Sorry, couldn't parse that.")
        continue
    print("Value:", eval_pt(pt))

See the examples/ directory for more, including calc.py which steps through the evolution of a pattern description, and maxicalc.py which demonstrates the visitor interface.

Getting started

See the examples directory for an implicit tutorial. calc.py in particular steps through the evolution of a Pattern description.

From within Python, you can also import esrapy and run help(esrapy) to get an overview. The sub-modules (esrapyCore, esrapyPatterns, etc.) also have help() content if you want to dig into the internals or use the procedural interface.

Pattern syntax

Patterns (grammars) are specified textually and compiled with:

pattern = esrapy.compile(textSpec)

Note textSpec is the definition of a pattern, not the text you will eventually parse with it. To compile and immediately parse:

parsed = esrapy.compile(textSpec).match(sourceText, compact=True)

The text spec is a series of definitions of the form name = pattern where name is any alphanumeric name (underscores ok), and pattern is any of those defined below. The name must be at the beginning of the line — no indentation is allowed. Indented lines are treated as continuations of the previous definition (no \ needed or allowed). Blank lines are ignored, as is anything after a #.

See examples/esrapy.pat for a sample pattern specification, which also happens to describe the syntax of patterns itself (i.e., esrapy.pat parses itself). See also calc.py and calc*.pat for a series of progressively richer ways of defining the same pattern.

A pattern can be one of:

  • A named pattern reference, as in foo, where foo is defined elsewhere in the pattern file.

  • A literal string enclosed in '' or "", as in "foo" or 'foo'.

  • A regular expression (see the re module) enclosed in <> or ` `, as in <[a-z]+> or `[a-z]+`. A group may be selected as the value with the @ operator, as in <un-([a-z]+)>@1. The default group (if no @ is given) is 0, i.e. the entire matched string.

  • A sequence of patterns, as in foo bar baz or 'foo' bar <[baz]>. The returned value is a simple array of the values of each pattern. Items may be labeled: a:foo b:bar c:<[baz]>. If only a single item is labeled and that label is _, that item is returned directly as the value of the sequence (passthrough) with all other items discarded. If multiple items are labeled (or a single item labeled with any name other than _), the return value is a MetaDict of those labeled items (see esrapyMisc.py), which exposes all entries as attributes so foo["bar"] can be written foo.bar. See minicalc.py for an example.

  • A set of alternates separated by |, as in foo | bar | baz. The returned value is the value of whichever pattern succeeds. Alternates are tried in order, so the first takes priority, but due to backtracking even the first match may not be the one ultimately selected. Items may be labeled using ; as in a;foo|bar|c;baz. If a labeled item is selected, the return value is a 2-tuple (label, value). An integer precedence may also be specified, applying to all successive items:

    set    = 1,a; foo | 2; bar | 3,c; baz | d; zorp
    subset = set@3    # Equivalent to: c; baz | d; zorp
    inUse  = "yo" set@2 <[a-z]+>
    

    This is most useful for binary operator precedence. See minicalc.py.

  • A repetition specified with *, +, ?, {n}, or {n,m}, where {n,m} means from n to m copies, {n} means n or more, and *, +, ? correspond to {0}, {1}, {0,1}. The return value is always an array. A separating pattern may be specified after a /, as in foo/bar+, which matches foo, foo bar foo, foo bar foo bar foo, etc. The separator is omitted from the return value. Useful for comma-separated lists: params = varName/","*.

  • A keyword inserted into the return stream with no effect on parsing, specified with a leading \, as in \key. For example:

    time = hour:int ":" minute:int | "noon" hour:\12 minute:\00
    
  • An indented block of equally-indented lines, specified as >> pattern, where pattern begins with non-whitespace and ends at end of line or end of file. Most useful for the top level of indentation-sensitive syntaxes.

  • A header line followed by an indented block, specified as head >> pattern. head matches something like the head of an if or while block. Note that the default SPACE pattern is incompatible with block/indent because it causes pattern to start with whitespace.

  • Grouping with (), as in foo (bar | baz) zorp.

Summary by example

foo bar baz    a sequence of patterns, returns a list
foo|bar|baz    a set of alternates, returns one item
foo*           zero or more, returns a list
foo+           one or more, returns a list
foo?           zero or one (optional), returns a list
foo{2,5}       2 to 5 copies, returns a list
foo/bar+       one or more foo's separated by bar; returns foos
foo/bar{5}     5 or more foo's separated by bar; returns foos
\foo           matches empty string, returns the literal "foo"
"foo"          a literal, returns itself
'foo'          ditto
<foo>          a regular expression, returns a string
`foo`          ditto
<f(oo)>@1      matches regex, returns the group "oo"
a:foo b:bar    sequence returning MetaDict {a:foo, b:bar}
_:foo bar      sequence returning only foo (passthrough)
a;foo|b;bar    set returning ("a", foo) or ("b", bar)
1;foo|2;bar    set annotated with precedence
foo@5          subset of foo with precedence >= 5
(foo|bar)baz   equivalent to: foo baz | bar baz
>> foo         a block of equally indented lines matching foo
foo >> bar     line matching foo followed by indented block of bar

Visitor interface

The traditional way to use esrapy is to call pattern.match() and work with the result — a nested structure of strings, lists, and MetaDicts. This works well for simple cases, but for non-trivial grammars a visitor is cleaner for building ASTs or evaluating parse trees.

Leave compact=False (the default) in match() to get Match objects directly, then walk the tree with a visitor whose methods are dispatched by pattern name:

m = pattern.match(sourceText)
result = m.visit(myVisitor)

visit() dispatches to a method named on_<patternname> on the visitor. If no matching on_* method exists, visit() applies default structural behavior for the matched pattern type and returns the result directly.

The parts parameter

If a handler declares a parameter named parts, visit() automatically calls the default structural decomposition and passes it as parts=, saving the handler from doing so explicitly. The type of parts depends on the pattern:

  • Sequence with multiple named children: parts is a MetaDict:
    def on_binop(self, m, parts):
        return parts.a + parts.b   # parts.a, parts.b already visited
    
  • Sequence with sole _ child: parts is the single visited child value.
  • Regex (terminal): parts is the matched string:
    def on_number(self, m, parts):
        return int(parts)
    
  • NtoM (repetition): parts is a list of visited children.
  • Block: parts is a list of visited statements.
  • Indent: parts is MetaDict(head=..., body=...).

In all cases, child matches within parts have already been visited recursively before the handler is called, so handlers are typically just one or two lines:

def on_assign(self, m, parts):
    vars[parts.name] = parts.val
    return parts.val

def on_binop(self, m, parts):
    op = parts.op
    if op == '+': return parts.a + parts.b
    if op == '-': return parts.a - parts.b
    if op == '*': return parts.a * parts.b
    if op == '/': return parts.a / parts.b
    if op == '^': return parts.a ** parts.b

def on_number(self, m, parts):
    return int(parts)

def on_var(self, m, parts):
    return vars[parts]

Omit parts when you need to control or defer the default recursion:

def on_with_stmt(self, m):
    self.context.push(...)
    parts = m.visit_parts(self)
    self.context.pop()
    return parts.body

Default behavior by pattern type

When no handler exists for a named pattern, or when visit_parts() is called explicitly:

Pattern type Default behavior
Regex Returns the matched string (m.value)
Sequence (sole _ child) Returns that child's visited value (passthrough)
Sequence (multiple named children) Returns MetaDict of visited named children; unnamed children discarded
Sequence (no named children) Returns list of all visited children
OneOfSet If the winning alternative has a string label, dispatches to that handler; otherwise recurses into the winning alternative
NtoM Returns list of visited children
Block Returns list of visited children
Indent Returns MetaDict(head=visited_head, body=visited_body)

Labeled alternatives and OneOfSet

When alternatives are labeled with string names in the grammar:

expr = 1,binop; a:expr@1 op:<[+-]> b:expr@2
     | 2,binop; a:expr@2 op:<[*/]> b:expr@3
     | var
     | number

visit() detects which labeled alternative matched and dispatches to on_binop, on_var, on_number etc. directly. A label can be shared across multiple alternatives — the same handler is called regardless of which precedence level matched. Numeric precedence and string label can be combined as N,name;.

Navigating children directly: m.children

For handlers that need to navigate the Match tree directly:

def on_binop(self, m):
    c = m.children
    a  = c.a.visit(self)
    op = c.op.value
    b  = c.b.visit(self)
    return ...

m.children returns a MatchView with attribute-style access to named children, recursing transparently through anonymous intermediate nodes. Use c.get('name', default) for optional children. Iterating m.children yields all named children in order.

m.value returns the raw first parsing value — for terminal Regex matches this is the matched string.

Tips for clean visitor dispatch

  1. Give every semantically meaningful alternative its own named rule or a string label within a set.
  2. Label all sequence items you need to access by name. Unlabeled items are treated as punctuation and discarded in the parts MetaDict.
  3. Use _ as the label for a sole passthrough item in a sequence: paren = '(' _:expr ')'
  4. For if/elif/else chains, fold the chain while handling the enclosing block rather than in a separate post-processing pass.

See examples/maxicalc.py for a complete working example of a visitor-based expression evaluator.

How it works

Esrapy is essentially a recursive-descent parser with two key additions:

  • Chart parsing: If esrapy finds itself looking for something it's already looking for (such as when a pattern is recursively composed of itself), it doesn't start looking again — it just remembers the context and notifies it if the pattern is later found. This handles left-recursive grammars.

  • Packrat parsing: If esrapy finds itself looking for something it has already found, it uses its cached result rather than re-parsing. This prevents exponential blowup on ambiguous grammars.

That's pretty much the whole of it.

Changes

Version 1.1 (2026-04-01)

  • Added custom param to the esrapy Pattern language compiler to allow injection of custom Pattern objects/classes into otherwise compiled Patterns.

Version 1.0 (2026-03-23)

  • Removed compact() code (default visitor provides ~identical functionality).
  • Defaulting compact=False in match() (*** API CHANGE ***)

Version 0.8.3 (2026-03-23)

  • Match.visit() now accepts None for the visitor, falling back to the same behavior as compact(all=False).

Version 0.8.2 (2026-03-21)

  • Misc tweaks.

Version 0.8.1 (2026-03-16)

  • Revised Sequence single-label passthrough: _ is now required to trigger passthrough (previously any single label triggered it).
  • Revised Match.visit() to apply default structural behavior per pattern type when no handler exists, rather than calling on_default().
  • Added Match.dispatch(), Match.visit_parts(), and visit_parts() to all Pattern subclasses.
  • Labeled OneOfSet alternatives now require an explicit handler.

Version 0.8 (2026-03-15)

  • Added visitor pattern support. Call pattern.match(compact=False) to get Match objects, then use m.visit(visitor) to dispatch to handler methods by pattern name.
  • Added Match.children (MatchView with attribute-style access to named children).
  • Added Match.value property for terminal match strings.
  • Extended OneOfSet alternative labels to support string names combinable with integer precedences as N,name;.

Version 0.7 (2023-03-20)

  • Added firstLine parameter to match()/parse() for better error messages.
  • Incorporated esrapyIndent for indentation-sensitive syntaxes; added >> operator for indented blocks.

Version 0.6.1 (2006-03-25)

  • Added catch of infinite repetitions of nil patterns in NtoM.
  • Syntax errors now raise SyntaxError instead of ValueError.

Version 0.6 (2006-03-16)

  • Added SPACE pattern for implicit whitespace.
  • Added skip expression to Regex Pattern.

Version 0.5 (2006-03-15)

  • Initial release.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

esrapy-1.1.tar.gz (29.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

esrapy-1.1-py2.py3-none-any.whl (27.3 kB view details)

Uploaded Python 2Python 3

File details

Details for the file esrapy-1.1.tar.gz.

File metadata

  • Download URL: esrapy-1.1.tar.gz
  • Upload date:
  • Size: 29.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for esrapy-1.1.tar.gz
Algorithm Hash digest
SHA256 37517d7b41303a37701f02ce888cf75a50dfa3cdce62441221b3a48b562ab111
MD5 d6958cec6c0047248c4d82346d92ca7f
BLAKE2b-256 9a5c14919bb45c8a3aacce2c4823d5db1f8142dd489674ce59cb25c737b97ac7

See more details on using hashes here.

File details

Details for the file esrapy-1.1-py2.py3-none-any.whl.

File metadata

  • Download URL: esrapy-1.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 27.3 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for esrapy-1.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 762460cb0870a6e2ba0da12c0f122721667b8a9e01082100088f2c4ae7078c4c
MD5 a8a12c7f2f933cf218de3cf787489e8e
BLAKE2b-256 b6208e4330b543c3d209384ebb389312d7f1d09d9507f5c9b03f1f3494e3e4b0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page