A python parser that builds python ASTs in 502 lines of python without using modules
Project description
pymetaterp
This is a python AST builder that uses no Python modules. A longer stackless version is available for easier porting. single_file.py
is a stand-alone 502 lines script.
Its (also) just another parsing expression grammar (PEG) based parser with one major difference. The parsed grammar is interpreted instead of compiled. This makes it easy to modify the language (by editing its grammar) as well as the language that grammar is written in (and the language of that grammar).
[This is a pre-release of sorts. There are probably some errors and missing information.]
Download and run
git clone https://github.com/asrp/pymetaterp
cd pymetaterp
python single_file.py
or
python single_file.py filename.py
This will print out the AST of the given file (or single_file.py
's own AST). Sample beginning of the output:
file_input
regular_assign
testlist
NAME
str 'NAME'
NAME
str 'FLAGS'
To run files from the library
python test/boot_test.py
python test/python_parse_test.py
Files
single_file.py
is mainly for demonstration. This module is otherwise separated into files. There are many files but they are mostly separate. The import dependencies is
util.py
boot.py
boot_stackless.py
python.py
Other files have no imports. To get something useful, you'll have to import multiple files. See test/python_parse_test.py
and test/boot_test.py
for some examples.
Repl
An obvious thing missing is the grammar read-eval-print loop (repl) so the interpreter can be fed one rule at a time, parsing subsequence input using the rules seen so far.
Source reading order
I'd suggest reading boot.py
and bootstrap
in boot_grammar.py
first. The two form the core and together with boot_tree.py
, they can regenerate boot_tree
.
Then boot_stackless
is the same as boot.py
but doesn't use the Python call stack/recursion for parsing.
python.py
adds functionality to the boot.py
interpreter. diff
in boot_grammar.py
adds the syntax for those.
Finally, python_grammar.py
contains the python grammar to be finally parsed.
Python language parsed
The module builds the AST for Python 2.x programs. It is able to parse all of Python 2.x (in fact, it contains a slightly modified version of the Python 2.x grammar) but is less lenient with whitespaces. For example, parsing
from my_module import (var1, var2,
var3, var4)
gives an error.
Since this is a pre-release, there are likely bugs with parts of the language I don't use so often. It can build the AST for all files included here.
Gramamr language differences
The beginning of boot_grammar.py
self-describes the grammar. Its a PEG so all "or" (|
) returns the first match and "and" and "quantified" (*, +, ?
) are greedy.
name = (letter | '_') (letter | digit | '_')*
expr = apply | exactly | token | parenthesis | output
exactly! = "'" {(escaped_char | ~'\'' anything)*} "'"
token! = "\"" {(escaped_char | ~'"' anything)*} "\""
escaped_char! = '\\' {'n'|'r'|'t'|'b'|'f'|'"'|'\''|'\\'}
apply! = ('\t'|' ')* {name}
parenthesis = "(" {or} ")"
output! = "{" {or} "}"
not = "~" {expr=negation} | expr
quantified = not (('*' | '+' | '?')=quantifier)?
bound = quantified ('=' {name=inline})?
and = bound*
or = and ("|" {and})*
rule = spaces {name=rule_name '!'?=flags and=args ("=" {or})}
grammar = {rule*} spaces
The main difference from other PEG.
- output rule:
a {b c} d
will match the concatenation ofa b c d
but only return what matchedb c
. - quantifier collapse:
letter letter*
returns a list rather than a pair with the second element being a list matchingletter*
. - nested and collapse:
a (b (c d)) e
has the same output asa b c d e
(see inline below if some pairs need to be explicitly grouped). - node collapsing: nodes in the output with only one child are replaced by their parent, unless the
!
("don't collapse") flag is set for that node. - inline: shamelessly taken Ohm but with a slightly different interpretation.
expression=name
creates a node namedname
wrapping the output ofexpression
. - rule replacement: having a second
rule_name = expression
line replaces the first definition ofrule_name
(instead of appending into an or). - two basic tokens: there are two basic token types:
'a'
(single quote) and"a"
(double quote). The double quoted token allows whitespace before matching.
Regenerating boot_tree.py
Create some tree match_tree
using Interpreter.match
and call save
on the result.
match_tree.save("tree.py")
Left recursion
While "PEG/packrat parsers can support left-recursion", the tree output isn't the one we want. The python functions reformat_binary
and reformat_atom
fixes a parsed tree's ouput.
Source oddities
Two hard-coded rules
if root[NAME] == "anything":
return pop(self.input)
elif root[NAME] == "void":
return
Hard-coded semantics for tokens
if name == "token":
while pop(self.input) in self.whitespace:
if self.input[0][self.input[1]] == '\\':
pop(self.input)
self.input[1] -= 1
if name == "token" and root[0].isalpha():
top = pop(self.input)
if top.isalnum() or top == '_':
raise MatchError("Prefix matched but didn't end.")
self.input[1] -= 1
Optimization
Some effort were made to make these files short (especially single_file.py
) but not too much. There are still some asserts around and commented print statements that can be useful for debugging. The final goal is, of course, to reduce the program's complexity and verbosity, not its line count.
Missing features
Features/bloat from a longer version of this program not (yet?) moved over:
- Debugging tree of nodes visited and their input and output
- Function arguments (its in the grammar but not the interpreter)
- Nested list inputs (its also in the grammar but not the interpreter)
- name, args, flags, body as parameters instead of positional children
MemoizationMatched input start and end positions- Exact python expression matching for predicate, action and rule value.
balanced
is used as a simpler heuristic for now.
Removing features
To get a smaller file with just the basics.
patch -R pymetaterp/python.py < patches/python_pos.patch
patch -R pymetaterp/python.py < patches/python_memoizer.patch
patch -R pymetaterp/boot_stackless.py < patches/boot_pos.patch
patch -R pymetaterp/boot_stackless.py < patches/boot_memoizer.patch
Readings
- Ometa - Warth's thesis reads very well.
- PEG and packrat parser
- Packrat Parsers Can Support Left Recursion
Other similar projects
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file pymetaterp-1.1.tar.gz
.
File metadata
- Download URL: pymetaterp-1.1.tar.gz
- Upload date:
- Size: 15.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.19.1 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/2.7.10
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3bf783acb81b735367e33b79c7fe94d6babb7325621cc2cf43d4429c0ce3673b |
|
MD5 | b2c2f8aeacca24b6f4d9c5055c152ad1 |
|
BLAKE2b-256 | 19776456a03deb466bd253db3fe3982fbb53e54cb7736457744fd80620169b93 |
File details
Details for the file pymetaterp-1.1-py2-none-any.whl
.
File metadata
- Download URL: pymetaterp-1.1-py2-none-any.whl
- Upload date:
- Size: 20.1 kB
- Tags: Python 2
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.19.1 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/2.7.10
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3368cd8812adf7cfec0dcab54d334f2cd452ba090f39729e14a6016e800ef0d3 |
|
MD5 | 959edae9e9bf36216a550e0c2930de77 |
|
BLAKE2b-256 | 3360ef0cb3f6f9018fa566bde6812cf18790900a0a19078d60c2fe475cd2fdb1 |