Dynamic parsing library
Reason this release was yanked:
Beta release
Project description
CodeWhisper
CodeWhisper is yet another parsing library, designed for rapid development and easy maintenance of complex parsers. Languages with over a thousand rules and dozens of token types should parse thousands of strings in under a minute, without being impossible to understand or debug.
The design is heavily based on Jared Forsyth's CodeTalker, but has been rebuilt from the ground up in Cython for Python 3, with a slightly different feature set. CodeWhisper is not a drop-in replacement, but even complex grammars should be relatively simple to adapt.
CodeWhisper takes a three-step approach to processing a string: tokenization, parsing, and translation.
Tokenization
Tokenization is the process of breaking a string down into atomic chunks. Not every grammar works better with an explicit tokenization step, but for many grammars, it speeds up the parsing process.
CodeWhisper tokens are defined as subclasses of the codewhisper.tokens.Token
class, with a match
method that determines whether and how much of the input
can be part of that token class. For example:
class CHARACTER(Token):
r'''A Token that matches any single character.
'''#"""#'''
@staticmethod
def match(text, position):
return 1
The CHARACTER
class matches any single character. Because it is generally
useful for many grammars, it is included in the codewhisper.tokens
module.
The codewhisper.tokens.CharacterToken
class offers a fast and convenient way
to match a single character from a limited set:
class SYMBOL(CharacterToken):
characters = ',:[]{}'
class WHITESPACE(CharacterToken):
characters = ' \n\r\t'
When you need to match sets of strings with more than one character, try the
codewhisper.tokens.StringToken
class:
class CONSTANT(StringToken):
strings = [
'true',
'false',
'null',
]
The codewhisper.tokens
module also contains DelimitedToken
for finding
strings that start and end with specific characters, PatternToken
for
finding strings that match a regular expression, and KeywordToken
for
finding any of a list of strings when they aren't followed by a letter or
number.
For more complex tokens, custom Token subclasses can override the match
method:
class INT(Token):
@classmethod
def match(cls, text, position):
if text[position] == "0":
return 1
result = None
for length, c in enumerate(text[position:]):
if not c.isdigit():
break
result = length + 1
return result
By convention, Token subclasses meant to be used by a grammar are given
ALL_CAPS
names, whereas those meant to be used as base classes are named
with CapitalizedWords
. This, however, is merely to make things easier for
the developer; CodeWhisper itself doesn't care what you name anything.
Parsing
Parsing is the process of determining which tokens relate to each other, and what they might mean. This is often the slowest of the three steps, because it may have to consider multiple options before it finds the right way to interpret the input. These options are defined by rule functions, using a syntax reminiscent of Wirth syntax, written in real Python code:
def value(rule):
rule | STRING | CONSTANT | NUMBER
rule | json_object | json_list
def json_object(rule):
rule | ('{', [commas(STRING, ':', value)], '}')
def json_list(rule):
rule | ('[', [commas(value)], ']')
Essentially, each grammar rule is defined by a function with a single parameter. That parameter should be piped with one or more elements, where each elements may be:
- A token class, matching any token of that class;
- A string, matching any token with that exact string as its value;
- A grammar rule, matching if and only if the rule does;
- A tuple of elements, matching if each of its elements does, in that exact sequence;
- A list of elements, matching each of its elements, or without consuming any tokens;
- A set of elements, matching if any one of its elements matches; or
- A
codewhisper.parsing.Parser
instance, for more complex behavior.
Notably, elements may be computed when the rule function is run. For example,
the commas
function used above, defined in codewhisper.special
, is a
normal Python function that that happens to return a tuple. Since the rule
function is only run when it first gets used by a Grammar or another rule
function, it can refer to itself, even indirectly, or to rules defined later.
Each rule tries to match the token string against each of its options, in order. Like a parsing expression grammar (PEG), if a rule is ambiguous, the first available option is used if possible. Unlike a PEG, if that option doesn't allow an outer rule to parse the remaining tokens, subsequent options that match a different number of tokens will be attempted. This makes CodeWhisper slightly slower than a good PEG implementation for unambiguous grammars, but makes accidental ambiguity much easier to handle.
Special caveats apply to sets. They offer a convenient syntax for including a group of options inside a larger sequence, but they don't define an order, and they can't include lists or tuples of lists. If order is important, or a choice includes optional items, use either a subrule or a Parser instance. For example:
def json_float(rule):
integral = {"0", (Notahead("0"), Plus(DIGIT))}
fractional = (".", Plus(DIGIT))
exponential = ({"e", "E"}, [{"-", "+"}], Plus(DIGIT))
rule | (["-"], integral, Parser((fractional, [exponential]), exponential))
Here, sets are used for the integral
and exponential
portions of the rule,
but since lists are unhashable, (fractional, [exponential])
uses an explicit
Parser. That it happens to always check for a decimal point before checking
for an exponent is a minor bonus.
That example also showcases Notahead
and Plus
, which use Parser subclasses
to match any number of digits, in one case only when the first digit is
non-zero. They're found in the codetalker.special
module, along with
Avoid
, Consume
, Lookahead
, Seek
, Sequence
, and Star
.
By convention, rule functions are given lower_case
names, to distinguish
them from token classes. Parser subclasses and functions that transform
parser sequences have been given TitleCase
names, though the fact that they
get called within a rule function also helps distinguish them from rules and
tokens, which can be important for translation purposes.
Translation
Parse trees by themselves aren't very useful, so the Grammar class offers a way to translate them into what you really want. By default, a token will be translated into its string value, but that can be overridden:
class NUMBER(PatternToken):
pattern = r"(?:0|[1-9][0-9]*)(?:\.[0-9]+|)(?:[eE][-+]?[0-9]+|)"
def translate(self, scope):
return float(self.value)
Rules are by default translated into a list of translated tokens and subrules.
To get a subset of those, assign a list of rules and/or token classes to
rule.result
:
def json_list(rule):
rule | ('[', [{commas(value), WHITESPACE}], ']')
rule.result = [value]
A single rule or token class can be selected by assigning just that item. If no such item exists in the final parse, it will end up as None instead. A single one of multiple rules or tokens can be chosen by assigning a set:
def value(rule):
val = {CONSTANT, json_object, json_list, json_int, json_float, json_string}
rule | ([WHITESPACE], val, [WHITESPACE])
rule.result = val
Multiple such items can be combined into a dictionary by assigning one to
rule.result
with such a list, set, rule, or token class as each value:
def expression(rule):
rule | (term, Star("+", term), ["-", term])
rule.result = {
'operators': [SYMBOL],
'terms': [term],
}
For more interesting translations, the result can be passed to a translator function:
def json_int(rule):
rule | (["-"], "0")
rule | (["-"], Notahead("0"), Plus(DIGIT))
def int_translator(node, scope):
return int("".join(token.value for token in node))
rule.translator = int_translator
Note, however, that the tokens and subrules are not translated before being
passed to that function. It can translate them by using scope.translate
with or without extra information:
def json_object(rule):
rule | ('{', [commas(json_string, ':', value)], '}')
rule.result = {'keys': [json_string], 'values': [value]}
def object_translator(node, scope):
return {
scope.translate(key): scope.translate(value)
for key, value in zip(node['keys'], node['values'])
}
rule.translator = object_translator
Grammar
Each of those steps are governed by the Grammar class:
tokens = [SYMBOL, WHITESPACE, CONSTANT, NUMBER, STRING]
grammar = Grammar(start=value, tokens=tokens, ignore=[WHITESPACE])
result = grammar.parse(text)
Each grammar's start
rule is function defining the language that the text
must match in order to parse successfully. It and any other rules it uses
will be evaluated as part of the Grammar creation, so they must already exist.
Each grammar rule will be evaluated only once, even if it is used in multiple
grammars.
The list of tokens
will be used to tokenize the text. Each token class will
be checked in order; if two or more would match at the same position in the
text, the first one in the list takes precedence.
Any tokens in the optional ignore
list will be ignored by grammar rules,
essentially as if they didn't exist. Instead, they will be added to the
ignored
attribute of the following token. As a convenience,
codewhisper.tokens.join_tokens
will include the value of any ignored tokens
within the sequence passed to it.
The parse
function runs unicode text through tokenization, parsing, and
translation. It is not intended to accept bytestrings, but pull requests that
can make it do so will be considered.
Examples
The examples
directory contains a few working grammars, including two
different ways to parse JavaScript Object Notation. These do not, however,
showcase the full range of the parser, much less the special functions.
Indeed, this parser is probably at its best with the kind of large, complex
system that wouldn't fit well in a single file.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.