tokenstream

A versatile token stream for handwritten parsers.
```python
from tokenstream import TokenStream

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        brace, number, name = stream.expect(("brace", "("), "number", "name")
        if brace:
            return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
        elif number:
            return int(number.value)
        elif name:
            return name.value

print(parse_sexp(TokenStream("(hello (world 42))")))  # ['hello', ['world', 42]]
```
Introduction
Writing recursive-descent parsers by hand can be quite elegant but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. This package provides a powerful general-purpose token stream that addresses these issues and more.
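To give a sense of what such a token stream does under the hood, here is a minimal sketch of regex-based tokenization using named groups. This is an illustration only, not the package's actual implementation:

```python
import re

# Toy tokenizer: combine the token patterns into one regex with named
# groups, then scan the source and report which group matched.
patterns = {"number": r"\d+", "name": r"\w+", "whitespace": r"\s+"}
regex = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in patterns.items()))

def tokenize(source: str):
    """Yield (token_type, value) pairs, skipping whitespace."""
    for match in regex.finditer(source):
        if match.lastgroup != "whitespace":
            yield match.lastgroup, match.group()

print(list(tokenize("hello 42")))  # [('name', 'hello'), ('number', '42')]
```

Note that pattern order matters: `number` is tried before `name`, so digits are not swallowed by the more general `\w+` pattern.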
Features
- Define the set of recognizable tokens dynamically with regular expressions
- Transparently skip over irrelevant tokens
- Expressive API for matching, collecting, peeking, and expecting tokens
- Clean error reporting with line numbers and column numbers
- Contextual support for indentation-based syntax
- Checkpoints for backtracking parsers
- Works well with Python 3.10+ match statements
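As an illustration of the checkpoint idea, a backtracking parser can save the current position and rewind to it when an alternative fails. The sketch below uses a hypothetical toy stream class, not the library's actual API:

```python
from contextlib import contextmanager

class ToyStream:
    """Minimal token-list wrapper illustrating checkpoint-based backtracking."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.index = 0

    def next(self):
        token = self.tokens[self.index]
        self.index += 1
        return token

    @contextmanager
    def checkpoint(self):
        # Remember the current position and restore it if the block fails.
        saved = self.index
        try:
            yield
        except ValueError:
            self.index = saved

stream = ToyStream(["a", "b", "c"])
with stream.checkpoint():
    stream.next()
    raise ValueError("simulated parse failure")
print(stream.next())  # "a" again: the failed attempt was rolled back
```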
Installation

The package can be installed with `pip`.

```shell
pip install tokenstream
```
Getting started
You can define tokens with the `syntax()` method. The keyword arguments associate regular expression patterns with token types. The method returns a context manager during which the specified tokens will be recognized.
```python
from tokenstream import TokenStream

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print([token.value for token in stream])  # ['hello', 'world']
```
The token stream is iterable and will yield all the extracted tokens one after the other. You can also retrieve tokens from the token stream one at a time by using the `expect()` method.
```python
from tokenstream import TokenStream

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print(stream.expect().value)  # "hello"
    print(stream.expect().value)  # "world"
```
The `expect()` method lets you ensure that the extracted token matches a specified type and will raise an exception otherwise.
```python
from tokenstream import TokenStream

stream = TokenStream("hello world")

with stream.syntax(number=r"\d+", word=r"\w+"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("number").value)  # UnexpectedToken: Expected number but got word 'world'
```
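The line and column numbers in error messages like the one above can be derived from a token's character offset in the source. A rough sketch of that computation (the helper name is hypothetical, not part of the library):

```python
def line_and_column(source: str, offset: int) -> tuple[int, int]:
    """Return the 1-based (line, column) of a character offset in the source."""
    consumed = source[:offset]
    line = consumed.count("\n") + 1
    # Column counts from the character just after the last newline.
    column = offset - (consumed.rfind("\n") + 1) + 1
    return line, column

source = "hello\nworld"
print(line_and_column(source, 6))  # (2, 1): 'w' starts line 2, column 1
```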
Newlines and whitespace are ignored by default. You can reject interspersed whitespace by intercepting the built-in `newline` and `whitespace` tokens.
```python
from tokenstream import TokenStream

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"), stream.intercept("newline", "whitespace"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("word").value)  # UnexpectedToken: Expected word but got whitespace ' '
```
The opposite of the `intercept()` method is `ignore()`. It allows you to ignore tokens and handle comments pretty easily.
```python
from tokenstream import TokenStream

stream = TokenStream(
    """
    # this is a comment
    hello # also a comment
    world
    """
)

with stream.syntax(word=r"\w+", comment=r"#.+$"), stream.ignore("comment"):
    print([token.value for token in stream])  # ['hello', 'world']
```
To enable indentation you can use the `indent()` method. The stream will now yield balanced pairs of `indent` and `dedent` tokens when the indentation changes.
```python
from tokenstream import TokenStream

source = """
hello
    world
"""

stream = TokenStream(source)

with stream.syntax(word=r"\w+"), stream.indent():
    stream.expect("word")
    stream.expect("indent")
    stream.expect("word")
    stream.expect("dedent")
```
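A common way to implement this, as in Python's own tokenizer, is a stack of active indentation widths: a wider line pushes a level and emits `indent`, a narrower line pops levels and emits `dedent`. A simplified standalone sketch of the idea (not the library's implementation):

```python
def indentation_tokens(source: str):
    """Yield 'indent'/'dedent' markers and words from an indented source."""
    levels = [0]  # stack of active indentation widths
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines don't affect indentation
        width = len(line) - len(line.lstrip())
        while width < levels[-1]:
            levels.pop()
            yield "dedent"
        if width > levels[-1]:
            levels.append(width)
            yield "indent"
        yield line.strip()
    # Close any indentation still open at the end of the source.
    while levels[-1] > 0:
        levels.pop()
        yield "dedent"

print(list(indentation_tokens("hello\n    world\n")))
# ['hello', 'indent', 'world', 'dedent']
```

Popping in a loop is what keeps the pairs balanced: dedenting several levels at once emits one `dedent` per level that closes.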
Match statements
Match statements make it very intuitive to process tokens extracted from the token stream. If you're using Python 3.10+ give it a try and see if you like it.
```python
from tokenstream import TokenStream, Token

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser that uses Python 3.10+ match statements."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        match stream.expect_any(("brace", "("), "number", "name"):
            case Token(type="brace"):
                return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
            case Token(type="number") as number:
                return int(number.value)
            case Token(type="name") as name:
                return name.value
```
Contributing
Contributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses `poetry`.

```shell
$ poetry install
```

You can run the tests with `poetry run pytest`.

```shell
$ poetry run pytest
```

The project must type-check with `pyright`. If you're using VSCode the `pylance` extension should report diagnostics automatically. You can also install the type-checker locally with `npm install` and run it from the command-line.

```shell
$ npm run watch
$ npm run check
$ npm run verifytypes
```

The code follows the `black` code style. Import statements are sorted with `isort`.

```shell
$ poetry run isort tokenstream examples tests
$ poetry run black tokenstream examples tests
$ poetry run black --check tokenstream examples tests
```
License - MIT