tokenstream

A versatile token stream for handwritten parsers.
```python
from tokenstream import TokenStream

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        brace, number, name = stream.expect(("brace", "("), "number", "name")
        if brace:
            return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
        elif number:
            return int(number.value)
        elif name:
            return name.value

print(parse_sexp(TokenStream("(hello (world 42))")))  # ['hello', ['world', 42]]
```
Introduction
Writing recursive-descent parsers by hand can be quite elegant but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. This package provides a powerful general-purpose token stream that addresses these issues and more.
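To give a sense of what such a token stream does under the hood, here is a minimal sketch of regex-based tokenization using named groups. This is an illustration only, not the package's actual implementation:

```python
import re

# Toy tokenizer: combine the token patterns into one regex with named
# groups, then scan the source and report which group matched.
patterns = {"number": r"\d+", "name": r"\w+", "whitespace": r"\s+"}
regex = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in patterns.items()))

def tokenize(source: str):
    """Yield (token_type, value) pairs, skipping whitespace."""
    for match in regex.finditer(source):
        if match.lastgroup != "whitespace":
            yield match.lastgroup, match.group()

print(list(tokenize("hello 42")))  # [('name', 'hello'), ('number', '42')]
```

Note that pattern order matters: `number` is tried before `name`, so digits are not swallowed by the more general `\w+` pattern.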
Features
- Define the set of recognizable tokens dynamically with regular expressions
- Transparently skip over irrelevant tokens
- Expressive API for matching, collecting, peeking, and expecting tokens
- Clean error reporting with line numbers and column numbers
- Contextual support for indentation-based syntax
- Checkpoints for backtracking parsers
- Works well with Python 3.10+ match statements
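As an illustration of the checkpoint idea, a backtracking parser can save the current position and rewind to it when an alternative fails. The sketch below uses a hypothetical toy stream class, not the library's actual API:

```python
from contextlib import contextmanager

class ToyStream:
    """Minimal token-list wrapper illustrating checkpoint-based backtracking."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.index = 0

    def next(self):
        token = self.tokens[self.index]
        self.index += 1
        return token

    @contextmanager
    def checkpoint(self):
        # Remember the current position and restore it if the block fails.
        saved = self.index
        try:
            yield
        except ValueError:
            self.index = saved

stream = ToyStream(["a", "b", "c"])
with stream.checkpoint():
    stream.next()
    raise ValueError("simulated parse failure")
print(stream.next())  # "a" again: the failed attempt was rolled back
```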
Installation

The package can be installed with `pip`.

```shell
pip install tokenstream
```
Getting started
You can define tokens with the `syntax()` method. The keyword arguments associate regular expression patterns with token types. The method returns a context manager during which the specified tokens will be recognized.
```python
from tokenstream import TokenStream

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print([token.value for token in stream])  # ['hello', 'world']
```
The token stream is iterable and will yield all the extracted tokens one after the other. You can also retrieve tokens from the token stream one at a time by using the `expect()` method.
```python
from tokenstream import TokenStream

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print(stream.expect().value)  # "hello"
    print(stream.expect().value)  # "world"
```
The `expect()` method lets you ensure that the extracted token matches a specified type and will raise an exception otherwise.
```python
from tokenstream import TokenStream

stream = TokenStream("hello world")

with stream.syntax(number=r"\d+", word=r"\w+"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("number").value)  # UnexpectedToken: Expected number but got word 'world'
```
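The line and column numbers in error messages like the one above can be derived from a token's character offset in the source. A rough sketch of that computation (the helper name is hypothetical, not part of the library):

```python
def line_and_column(source: str, offset: int) -> tuple[int, int]:
    """Return the 1-based (line, column) of a character offset in the source."""
    consumed = source[:offset]
    line = consumed.count("\n") + 1
    # Column counts from the character just after the last newline.
    column = offset - (consumed.rfind("\n") + 1) + 1
    return line, column

source = "hello\nworld"
print(line_and_column(source, 6))  # (2, 1): 'w' starts line 2, column 1
```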
Newlines and whitespace are ignored by default. You can reject interspersed whitespace by intercepting the built-in `newline` and `whitespace` tokens.
```python
from tokenstream import TokenStream

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"), stream.intercept("newline", "whitespace"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("word").value)  # UnexpectedToken: Expected word but got whitespace ' '
```
The opposite of the `intercept()` method is `ignore()`. It allows you to ignore tokens and handle comments pretty easily.
```python
from tokenstream import TokenStream

stream = TokenStream(
    """
    # this is a comment
    hello # also a comment
    world
    """
)

with stream.syntax(word=r"\w+", comment=r"#.+$"), stream.ignore("comment"):
    print([token.value for token in stream])  # ['hello', 'world']
```
To enable indentation you can use the `indent()` method. The stream will now yield balanced pairs of `indent` and `dedent` tokens when the indentation changes.
```python
from tokenstream import TokenStream

source = """
hello
    world
"""

stream = TokenStream(source)

with stream.syntax(word=r"\w+"), stream.indent():
    stream.expect("word")
    stream.expect("indent")
    stream.expect("word")
    stream.expect("dedent")
```
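A common way to implement this, as in Python's own tokenizer, is a stack of active indentation widths: a wider line pushes a level and emits `indent`, a narrower line pops levels and emits `dedent`. A simplified standalone sketch of the idea (not the library's implementation):

```python
def indentation_tokens(source: str):
    """Yield 'indent'/'dedent' markers and words from an indented source."""
    levels = [0]  # stack of active indentation widths
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines don't affect indentation
        width = len(line) - len(line.lstrip())
        while width < levels[-1]:
            levels.pop()
            yield "dedent"
        if width > levels[-1]:
            levels.append(width)
            yield "indent"
        yield line.strip()
    # Close any indentation still open at the end of the source.
    while levels[-1] > 0:
        levels.pop()
        yield "dedent"

print(list(indentation_tokens("hello\n    world\n")))
# ['hello', 'indent', 'world', 'dedent']
```

Popping in a loop is what keeps the pairs balanced: dedenting several levels at once emits one `dedent` per level that closes.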
Match statements
Match statements make it very intuitive to process tokens extracted from the token stream. If you're using Python 3.10+ give it a try and see if you like it.
```python
from tokenstream import TokenStream, Token

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser that uses Python 3.10+ match statements."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        match stream.expect_any(("brace", "("), "number", "name"):
            case Token(type="brace"):
                return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
            case Token(type="number") as number:
                return int(number.value)
            case Token(type="name") as name:
                return name.value
```
Contributing
Contributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses `poetry`.

```shell
$ poetry install
```

You can run the tests with `poetry run pytest`.

```shell
$ poetry run pytest
```

The project must type-check with `pyright`. If you're using VSCode the `pylance` extension should report diagnostics automatically. You can also install the type-checker locally with `npm install` and run it from the command-line.

```shell
$ npm run watch
$ npm run check
$ npm run verifytypes
```

The code follows the `black` code style. Import statements are sorted with `isort`.

```shell
$ poetry run isort tokenstream examples tests
$ poetry run black tokenstream examples tests
$ poetry run black --check tokenstream examples tests
```
License - MIT