compyler
Tools to make a compiler in Python
Requirements
Lexer
Registering tokens
A Lexer object can be used to register tokens.
>>> from compyler import Lexer
>>> lexer = Lexer()
>>> lexer.add_token(name='INT', regex=r'0|[1-9][0-9]*')
The tokens are registered in order of importance.
>>> from compyler import Lexer
>>> lexer = Lexer()
>>> lexer.add_token('ID', r'[a-zA-Z_$][a-zA-Z0-9_$]*')
>>> lexer.add_token('STRING', r'\"(.|[ \t])*\"')
In the example above, two tokens are registered. Even though text matched by ID can also appear inside a STRING, the lexer's greedy search looks for the biggest match first, so STRING: "spam" is caught before ID: eggs.
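This greedy behaviour also means that a string whose contents would match ID is still captured as a single STRING token. A small sketch, continuing the lexer above (Lexer.tokenize is covered in the next section, and the output shown is assumed to follow the same formatting as the later examples):
>>> lexer.tokenize('"spam eggs"')
[STRING: "spam eggs"]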
Scanning a string
To scan a string, the Lexer.tokenize method can be invoked.
>>> from compyler import Lexer
>>> lexer = Lexer()
>>> lexer.add_token(name='INT', regex=r'0|[1-9][0-9]*')
>>> lexer.add_token('ID', r'[a-zA-Z_$][a-zA-Z0-9_$]*')
>>> lexer.add_token('STRING', r'\"(.|[ \t])*\"')
>>> lexer.tokenize('123 "spam" eggs')
[INT: 123, STRING: "spam", ID: eggs]
Filtering
A tokenized string can also be filtered to remove unwanted tokens:
>>> from compyler import Lexer
>>> lexer = Lexer()
>>> lexer.add_token(name='INT', regex=r'0|[1-9][0-9]*')
>>> lexer.add_token('ID', r'[a-zA-Z_$][a-zA-Z0-9_$]*')
>>> lexer.add_token('STRING', r'\"(.|[ \t])*\"')
>>> lexer.add_token("COMMENT", r"#[^\n]*\n*$")
>>> buffer = lexer.tokenize('123 "spam" eggs')
>>> lexer.filter({"COMMENT"}, buffer)
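The input above contains no comment, so filtering it leaves the buffer unchanged. With a comment in the source, the COMMENT tokens are dropped; a minimal sketch, assuming filter returns the filtered buffer rather than modifying it in place:
>>> buffer = lexer.tokenize('123 # a comment')
>>> buffer = lexer.filter({"COMMENT"}, buffer)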
Shift Reduce Parser
Registering productions
Productions can be created and registered using the LALRParser class:
>>> from compyler import LALRParser
>>> lalr_parser = LALRParser()
>>> lalr_parser.add_production(
... "ProductionName",
... {
... ("Token1", "EOF"): (0,),
... ("Token1", "Token2", "EOF"): (0,1)
... }
... )
A production must also include the indices of the tokens or other productions that will be used as children in the AST. For example, if the production is:
Vardecl: ID EQ INT PLUS INT SEMICOLON
and the indices are (2, 4), the result in the AST would be:
Vardecl
| INT
| INT
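Registered through the same add_production call, the Vardecl production above would look like this (the ID, EQ, INT, PLUS and SEMICOLON token names are assumed to be registered on a lexer):
>>> lalr_parser.add_production(
...     "Vardecl",
...     {
...         ("ID", "EQ", "INT", "PLUS", "INT", "SEMICOLON"): (2, 4)
...     }
... )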
Registered productions can be inspected by indexing the parser. Going back to the ProductionName example above:
>>> from compyler import LALRParser
>>> lalr_parser = LALRParser()
>>> lalr_parser.add_production(
... "ProductionName",
... {
... ("Token1", "EOF"): (0,),
... ("Token1", "Token2", "EOF"): (0,1)
... }
... )
>>> lalr_parser[0]
ProductionName: Token1 EOF -> $0
| Token1 Token2 EOF -> $0 $1
Parsing a tokenized string
After registering the productions on the parser, a tokenized string can be parsed:
>>> from compyler import Lexer, LALRParser
>>> lexer = Lexer()
>>> lexer.add_token("ID", r"[a-zA-Z_$][a-zA-Z0-9_$]*")
>>> lexer.add_token("ASSIGN", r"[ \t]*=[ \t]")
>>> lexer.add_token("INT", r"0|[1-9][0-9]*")
>>> lexer.add_token("SEMICOLON", r"[ \t]*;")
>>> lalr_parser = LALRParser()
>>> lalr_parser.add_production(
... "VarDecl", {
... ("ID", "ASSIGN", "INT", "SEMICOLON"): (0,2)
... }
... )
>>> buffer = lexer.tokenize("var = 1;")
>>> lalr_parser.parse(buffer)
VarDecl
The result of the parsing process will either be an ASTNode object on success or None if the parsing fails.
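Because of this, the result can simply be tested against None before it is used. A small sketch, continuing the example above (the failing input is an assumption, chosen because it does not match the VarDecl production):
>>> lalr_parser.parse(lexer.tokenize("var = 1;")) is not None
True
>>> lalr_parser.parse(lexer.tokenize("var = 1")) is None
True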
Accessing an AST node's children
After parsing is complete and an ASTNode object has been generated, its children can be accessed by indexing the object.
Continuing the example above:
>>> parsed_ast = lalr_parser.parse(buffer)
>>> parsed_ast[0]
ID: var
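Since the VarDecl production keeps the tokens at indices 0 and 2, the second child should be the INT token (the output shown here is assumed to follow the same formatting as the other examples):
>>> parsed_ast[1]
INT: 1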
Getting the AST's representation
The parsed AST can also be shown using a basic text representation.
This is returned by calling the representation() method.
Continuing the example above:
>>> parsed_ast.representation()
VarDecl
| ID: var
| INT: 1
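Putting the pieces together, a minimal end-to-end sketch using the same grammar (the COMMENT token is optional and, as above, filter is assumed to return the filtered buffer; the output formatting follows the examples in this document):
>>> from compyler import Lexer, LALRParser
>>> lexer = Lexer()
>>> lexer.add_token("ID", r"[a-zA-Z_$][a-zA-Z0-9_$]*")
>>> lexer.add_token("ASSIGN", r"[ \t]*=[ \t]")
>>> lexer.add_token("INT", r"0|[1-9][0-9]*")
>>> lexer.add_token("SEMICOLON", r"[ \t]*;")
>>> lexer.add_token("COMMENT", r"#[^\n]*\n*$")
>>> parser = LALRParser()
>>> parser.add_production("VarDecl", {("ID", "ASSIGN", "INT", "SEMICOLON"): (0, 2)})
>>> buffer = lexer.filter({"COMMENT"}, lexer.tokenize("var = 1;# a note"))
>>> ast = parser.parse(buffer)
>>> ast.representation()
VarDecl
| ID: var
| INT: 1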