tokenstream
A versatile token stream for handwritten parsers.
from tokenstream import TokenStream

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        brace, number, name = stream.expect(("brace", "("), "number", "name")
        if brace:
            return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
        elif number:
            return int(number.value)
        elif name:
            return name.value

print(parse_sexp(TokenStream("(hello (world 42))")))  # ['hello', ['world', 42]]
Introduction
Writing recursive-descent parsers by hand can be quite elegant, but it's often a bit more verbose than expected. In particular, handling indentation and reporting proper syntax errors can be pretty challenging. This package provides a powerful general-purpose token stream that addresses these issues and more.
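For example, indentation can be handled directly by the stream. The following is a minimal sketch, assuming an indent() context manager under which changes in indentation level show up as "indent" and "dedent" tokens; the method name and token types are assumptions based on the feature list below, not confirmed API.

from tokenstream import TokenStream

source = """
hello
    world
"""

stream = TokenStream(source)

# indent() is an assumption: a context manager under which changes in
# indentation level are reported as "indent" and "dedent" tokens.
with stream.syntax(word=r"[a-z]+"), stream.indent():
    stream.expect("word")    # 'hello'
    stream.expect("indent")  # indentation increased
    stream.expect("word")    # 'world'
    stream.expect("dedent")  # indentation decreased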
Features
- Define token types with regular expressions
- The set of recognizable tokens can be defined dynamically during parsing
- Transparently skip over irrelevant tokens
- Expressive API for matching, collecting, peeking, and expecting tokens
- Clean error reporting with line numbers and column numbers (see the sketch after this list)
- Natively understands indentation-based syntax
- Works well with Python 3.10+ match statements
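For instance, a failed expect() is meant to surface exactly where things went wrong. The snippet below is a minimal sketch of the idea; the UnexpectedToken exception name and the exact contents of the message are assumptions, not confirmed API.

from tokenstream import TokenStream, UnexpectedToken  # exception name is an assumption

stream = TokenStream("hello world")

with stream.syntax(number=r"\d+", word=r"[a-z]+"):
    try:
        # The first token is a word, so expecting a number should fail.
        stream.expect("number")
    except UnexpectedToken as exc:
        # The error message should include the offending token's line and
        # column, per the feature list above.
        print(exc)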
Installation
The package can be installed with pip.
pip install tokenstream
Getting started
You can define tokens with the syntax() method. The keyword arguments associate regular expression patterns to token types. The method returns a context manager during which the specified tokens will be recognized.
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print([token.value for token in stream])  # ['hello', 'world']
The token stream is iterable and will yield all the extracted tokens one after the other.
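Because syntax() is a context manager, the set of recognizable tokens can change in the middle of a parse. Here's a hedged sketch of that idea; whether nested syntax() calls extend or replace the enclosing patterns isn't stated above, so the inner call re-lists every pattern it needs to stay valid either way.

from tokenstream import TokenStream

stream = TokenStream("hello [1 2 3]")

with stream.syntax(bracket=r"\[|\]", word=r"[a-z]+"):
    print(stream.expect("word").value)  # 'hello'
    stream.expect(("bracket", "["))

    # Digits only become recognizable inside the brackets. Re-listing the
    # bracket pattern keeps the sketch correct whether nested syntax()
    # calls layer on top of or replace the enclosing ones (an assumption).
    with stream.syntax(bracket=r"\[|\]", number=r"\d+"):
        numbers = [
            int(stream.expect("number").value)
            for _ in stream.peek_until(("bracket", "]"))
        ]

print(numbers)  # [1, 2, 3]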
Match statements
Match statements make it very intuitive to process tokens extracted from the token stream. If you're using Python 3.10+, give it a try and see if you like it.
from tokenstream import TokenStream, Token

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser that uses Python 3.10+ match statements."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        match stream.expect_any(("brace", "("), "number", "name"):
            case Token(type="brace"):
                return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
            case Token(type="number") as number:
                return int(number.value)
            case Token(type="name") as name:
                return name.value
Contributing
Contributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses poetry.
$ poetry install
You can run the tests with poetry run pytest.
$ poetry run pytest
The project must type-check with pyright. If you're using VSCode, the pylance extension should report diagnostics automatically. You can also install the type-checker locally with npm install and run it from the command-line.
$ npm run watch
$ npm run check
$ npm run verifytypes
The code follows the black code style. Import statements are sorted with isort.
$ poetry run isort tokenstream examples tests
$ poetry run black tokenstream examples tests
$ poetry run black --check tokenstream examples tests
License - MIT