Utility to incrementally parse text

Text Walker

License: MIT

Documentation

Getting Started

textwalker is a simple utility to incrementally parse (un|semi)structured text.

The textwalker API emulates how a complex regular expression is constructed iteratively. Typically, when constructing a regex, I'll construct a part of it, test it, and then build the next part.

Consider trying to parse an SQL table definition:

>>> text = """CREATE TABLE dbo.car_inventory
(
    cp_car_sk        integer               not null,
    cp_car_make_id   char(16)              not null,
)
WITH (OPTION (STATS = ON))"""

>>> from textwalker import TextWalker
>>> tw = TextWalker(text)

>>> tw.walk('CREATE')
>>> tw.walk('TABLE')

The TextWalker class is initialized with the text to parse. The walk(pattern) method consumes the text that matches the pattern and returns it. Here, the return value is the literal matched. The pattern is a string representing a:

  • literal, e.g. foo
  • character set, with character ranges and individual characters e.g. [a-z9]
  • grouping, e.g. (foo)+

See the supported grammar in the Grammar section below.

Internally, when walk is invoked the TextWalker tracks how much of the input text has been matched.

This is the key idea behind the design: making the text parsing stateful lets it be done incrementally. This reduces the complexity of each matching expression and allows parsing to be combined with Python's text processing capabilities.
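The idea of a stateful cursor can be sketched with the standard library alone. The class below is a hypothetical stand-in, not textwalker's implementation: MiniWalker and its pos attribute are names invented for illustration, and it uses Python's re syntax rather than the textwalker grammar.

```python
import re


class MiniWalker:
    """Hypothetical stand-in that illustrates cursor tracking; not textwalker's code."""

    def __init__(self, text):
        self.text = text
        self.pos = 0  # index of the first character not yet consumed

    def walk(self, pattern):
        """Match `pattern` (standard `re` syntax) at the cursor and advance past it."""
        match = re.compile(pattern).match(self.text, self.pos)
        if match is None:
            return None  # no match: the cursor stays where it was
        self.pos = match.end()
        return match.group(0)


mw = MiniWalker("CREATE TABLE dbo.car_inventory")
first = mw.walk(r"CREATE\s+")
second = mw.walk(r"TABLE\s+")
missing = mw.walk(r"INDEX")          # does not match at the cursor
name = mw.walk(r"dbo\.[a-z0-9_]+")   # still matches: the failed walk consumed nothing
print(first.strip(), second.strip(), missing, name)
```

Because each call only has to describe the next piece of text, every pattern stays short, and ordinary Python code can run between the calls.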

>>> table_name_match = tw.walk('dbo.[a-z0-9_]+')
>>> tablename = table_name_match.replace('dbo.', '')
>>> print(f'table name is {tablename}')

table name is car_inventory

>>> tw.walk('\\(')

# now print column names
>>> cols_text, _ = tw.walk_until('WITH')
>>> for col_def in cols_text.split(','):
...     col_name = col_def.strip().split(' ')[0]
...     print(f'column name is {col_name}')

column name is cp_car_sk
column name is cp_car_make_id
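The post-processing step is plain Python, so it can be tried without the library. The sketch below assumes cols_text holds the text between the opening parenthesis and WITH; that value is hard-coded here as an assumption rather than taken from a real walk_until call. It also skips the fragment left over after the trailing comma.

```python
# Assumed value: the text between "(" and "WITH" in the table definition above.
cols_text = """
    cp_car_sk        integer               not null,
    cp_car_make_id   char(16)              not null,
)
"""

col_names = []
for col_def in cols_text.split(","):
    fields = col_def.split()          # split on any run of whitespace
    if not fields or fields[0] == ")":
        continue                      # skip the fragment after the trailing comma
    col_names.append(fields[0])

print(col_names)
```

Splitting on commas is only safe while no column type contains a comma; a type such as decimal(10,2) would need a more careful split.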

Or consider parsing a phone number:

>>> from textwalker import TextWalker
>>> text = "(+1)123-456-7890"
>>> tw = TextWalker(text)
>>> area_code = tw.walk('(\\(\\+[0-9]+\\))?')
>>> print(f'area code is {area_code}')

Note, special characters need to be escaped in all contexts.

>>> steps = tw.walk_many(['[0-9]{3,3}', '\\-', '[0-9]{3,3}', '\\-', '[0-9]{4,4}'])
>>> print(f'first 3 digits are {steps[0]}; next 3 digits are {steps[2]}; last 4 digits are {steps[4]}')
first 3 digits are 123; next 3 digits are 456; last 4 digits are 7890

More Examples

See more examples in .\examples

Installation

Textwalker is available on PyPI:

python -m pip install textwalker

Grammar

Literals

  • Can be any literal string
foo
bar
123
  • Can have quantifiers
x?

Character Sets

  • A character set is defined within a pair of left and right square brackets, [...]
  • Can contain ranges, specified via a dash, [a-z] or individual chars [a-z8]
  • Support quantifiers, [0-9]{1,3}
  • NOTE: There are no predefined ranges!
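If a range is resolved by comparing characters the way Python compares strings (an assumption based on the notes further down, not on the library's source), membership reduces to a code point comparison. The helper in_range below is hypothetical and only illustrates that comparison.

```python
def in_range(ch, low, high):
    """True if the single character `ch` falls within low..high by code point."""
    return low <= ch <= high


lower_hit = in_range("m", "a", "z")
upper_miss = in_range("M", "a", "z")       # upper case sorts before lower case
underscore_hit = in_range("_", "A", "z")   # "_" sits between "Z" and "a"

# Characters that a range written as A-z would include besides letters.
between = [chr(c) for c in range(ord("Z") + 1, ord("a"))]
print(lower_hit, upper_miss, underscore_hit, between)
```

Under that assumption, a range that spans both cases, such as A-z, also admits the punctuation that lies between the two alphabets, so separate ranges like a-zA-Z are the safer choice.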

Groups

  • A group is defined with a pair of parentheses (...)
  • A group can contain Literals, Character Sets and arbitrarily nested Groups, (hello[a-zA-Z]+)*

Quantifiers

  • zero or more *
  • zero or one ?
  • one or more +
  • range {1,3}
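One way to think about the four quantifiers is as a pair of minimum and maximum repetition counts. The function quantifier_bounds below is a hypothetical helper written for illustration; it is not part of the textwalker API.

```python
import re


def quantifier_bounds(q):
    """Return (minimum, maximum) repetitions; a maximum of None means unbounded."""
    fixed = {"": (1, 1), "*": (0, None), "?": (0, 1), "+": (1, None)}
    if q in fixed:
        return fixed[q]
    m = re.fullmatch(r"\{(\d+),(\d+)\}", q)
    if m is None:
        raise ValueError(f"unsupported quantifier: {q!r}")
    return int(m.group(1)), int(m.group(2))


for q in ["", "*", "?", "+", "{1,3}"]:
    print(repr(q), quantifier_bounds(q))
```

The empty string stands for an element with no quantifier, which must match exactly once.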

Special Characters

  • The special characters below must be escaped in all contexts.
"(", ")", "[", "]", "{", "}", "-", "+", "*", "?"
  • To escape a special character, prefix it with a double backslash, e.g. a left parenthesis is written \\(
  • Two backslashes are needed because the Python interpreter treats a single \ in a string literal as an escape of the following character.
  • A special character must be escaped even where it could only be meant literally. For example, [*] could only mean "match the literal * character", yet it is an invalid expression.
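The effect of the double backslash can be checked in plain Python, without the library. The variable names below are only for illustration.

```python
escaped = "\\("   # two backslashes in the source become one backslash in the string
raw = r"\("       # raw string literal: backslashes are kept as written

print(len(escaped))      # the string holds two characters: a backslash and "("
print(escaped == raw)    # both literals produce the same string
print(list(escaped))
```

Since both literals produce the same string, a raw string is an equivalent way to write an escaped pattern.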

Limitations/Gotchas/Notes

  • A pattern must match fully to count as a match. For the walk methods, None means no match. This is different from a match of zero length, e.g. (foo)?
  • If no quantifier is specified, the element must match exactly once.
  • How character set ranges match depends on how lexical comparison is implemented in Python.
  • Only case-sensitive search is supported.
  • All operators are greedy. This is noteworthy because, in some cases, only a non-greedy match on a sub-group would let the whole expression match. For example, when matching (ab)*ab, the text abab is not a match, since the subexpression (ab)* consumes the entire text. This can be avoided by bounding the repetition, e.g. (ab){1,1}ab matches abab.
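The greedy behaviour described in the last note can be reproduced with a few lines of plain Python. This is a sketch of greedy matching that never gives repetitions back, under the assumption that this is what the note describes; match_greedy_no_backtrack is a hypothetical function, not textwalker code. Python's re module is shown for contrast because it does backtrack.

```python
import re


def match_greedy_no_backtrack(text, repeated, tail, min_count=0, max_count=None):
    """Consume `repeated` greedily, then require `tail`; never give repeats back."""
    pos = 0
    count = 0
    while text.startswith(repeated, pos) and (max_count is None or count < max_count):
        pos += len(repeated)
        count += 1
    if count < min_count:
        return False
    # The rest of the text must be exactly the tail.
    return text[pos:] == tail


unbounded = match_greedy_no_backtrack("abab", "ab", "ab")        # like (ab)*ab
bounded = match_greedy_no_backtrack("abab", "ab", "ab", 1, 1)    # like (ab){1,1}ab
backtracking = re.fullmatch(r"(ab)*ab", "abab") is not None      # re gives one "ab" back
print(unbounded, bounded, backtracking)
```

The unbounded repetition swallows both copies of ab and leaves nothing for the tail, while the bounded one stops after a single copy.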
