Utility to incrementally parse text
Text Walker
Documentation
Getting Started
textwalker is a simple utility to incrementally parse (un|semi)structured text.
The textwalker API emulates how a complex regular expression is iteratively constructed.
Typically, when constructing a regex, I'll construct a part of it, test it, and then build the next part.
Consider trying to parse an SQL table definition:
>>> text = """CREATE TABLE dbo.car_inventory
(
cp_car_sk integer not null,
cp_car_make_id char(16) not null,
)
WITH (OPTION (STATS = ON))"""
>>> from textwalker import TextWalker
>>> tw = TextWalker(text)
>>> tw.walk('CREATE')
>>> tw.walk('TABLE')
The TextWalker class is initialized with the text to parse.
The walk(pattern) method consumes the text that matches the pattern and returns it. Here, the return value is the literal matched.
This pattern can be a string representing a:
- literal, e.g. foo
- character set, with character ranges and individual characters, e.g. [a-z9]
- grouping, e.g. (foo)+
See the supported grammar in the Grammar section below.
Internally, when walk is invoked, the TextWalker tracks how much of the input text has been matched.
This is the key idea behind the design: making the text parsing stateful lets it be done incrementally, which reduces the complexity of each matching expression and allows it to be combined with Python's own text processing capabilities.
>>> table_name_match = tw.walk('dbo.[a-z0-9_]+')
>>> tablename = table_name_match.replace('dbo.', '')
>>> print(f'table name is {tablename}')
table name is car_inventory
>>> tw.walk('\\(')
>>> # now print column names
>>> cols_text, _ = tw.walk_until('WITH')
>>> for col_def in cols_text.split(','):
...     col_name = col_def.strip().split(' ')[0]
...     if col_name and col_name != ')':  # skip the fragment after the last comma
...         print(f'column name is {col_name}')
column name is cp_car_sk
column name is cp_car_make_id
Or consider parsing a phone number, e.g.
>>> from textwalker import TextWalker
>>> text = "(+1)123-456-7890"
>>> tw = TextWalker(text)
>>> area_code = tw.walk('(\\(\\+[0-9]+\\))?')
>>> print(f'area code is {area_code}')
Note, special characters need to be escaped in all contexts.
>>> steps = tw.walk_many(['[0-9]{3,3}', '\\-', '[0-9]{3,3}', '\\-', '[0-9]{4,4}'])
>>> print(f'first 3 digits are {steps[0]}; next 3 digits are {steps[2]}; last 4 digits are {steps[4]}')
first 3 digits are 123; next 3 digits are 456; last 4 digits are 7890
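The stateful idea behind these examples can be illustrated with Python's standard re module. The sketch below is not how textwalker is implemented: MiniWalker is a hypothetical class that only mimics the behaviour shown above (match a pattern at the current position, return the matched text, return None when nothing matches). Skipping whitespace before each match is an assumption inferred from the CREATE / TABLE example, and the patterns here use standard re syntax rather than textwalker's grammar.

```python
import re


class MiniWalker:
    """Hypothetical illustration of stateful, incremental matching."""

    def __init__(self, text):
        self.text = text
        self.pos = 0  # how much of the input has been consumed so far

    def walk(self, pattern):
        """Match `pattern` at the current position and advance past it.

        Returns the matched text, or None (leaving the position unchanged).
        """
        # Assumption: whitespace between tokens is skipped before matching.
        start = self.pos
        while start < len(self.text) and self.text[start].isspace():
            start += 1
        match = re.compile(pattern).match(self.text, start)
        if match is None:
            return None
        self.pos = match.end()
        return match.group(0)


mw = MiniWalker("CREATE TABLE dbo.car_inventory (")
first = mw.walk("CREATE")
second = mw.walk("TABLE")
name = mw.walk(r"dbo\.[a-z0-9_]+")
missing = mw.walk("WITH")  # WITH is not at the current position
print(first, second, name, missing)
```

Because the walker remembers its position, each call only has to describe the next small piece of the input, and ordinary string methods can be applied to each returned piece in between calls.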
More Examples
See more examples in .\examples
Installation
Textwalker is available on PyPI:
python -m pip install textwalker
Grammar
Literals
- Can be any literal string, e.g. foo, bar, 123
- Can have quantifiers, e.g. x?
Character Sets
- A character set is defined within a pair of left and right square brackets, [...]
- Can contain ranges, specified via a dash, [a-z], or individual chars, [a-z8]
- Supports quantifiers, [0-9]{1,3}
- NOTE: There are no predefined ranges!
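Since there are no predefined ranges, classes such as digits or word characters have to be spelled out explicitly. The sketch below uses a hypothetical helper, charset_contains, which is not part of textwalker; it only illustrates how a set body such as a-z8 can be read as ranges plus individual characters, with ranges compared by code point. Escaped characters are deliberately not handled.

```python
def charset_contains(body, ch):
    """Return True if `ch` is in the set described by `body`, e.g. 'a-z8'.

    Hypothetical helper for illustration; escapes are not handled.
    """
    i = 0
    while i < len(body):
        # A range looks like X-Y: three characters with a dash in the middle.
        if i + 2 < len(body) and body[i + 1] == "-":
            low, high = body[i], body[i + 2]
            if low <= ch <= high:  # compared by Unicode code point
                return True
            i += 3
        else:
            if body[i] == ch:
                return True
            i += 1
    return False


# Explicit spellings for classes that other regex dialects predefine.
DIGITS = "0-9"
WORD_CHARS = "a-zA-Z0-9_"

print(charset_contains("a-z8", "q"))  # inside the a-z range
print(charset_contains("a-z8", "8"))  # listed individually
print(charset_contains("a-z8", "9"))  # neither in the range nor listed
print(charset_contains("A-z", "_"))   # A-z also spans the punctuation between Z and a
```

The last call is a reminder to write A-Z and a-z as two separate ranges: a single range from A to z also covers the characters that sit between the two alphabets.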
Groups
- A group is defined with a pair of parentheses, (...)
- A group can contain Literals, Character Sets and arbitrarily nested Groups, e.g. (hello[a-zA-Z]+)*
Quantifiers
- zero or more: *
- zero or one: ?
- one or more: +
- range: {1,3}
Special Characters
- Special characters (below) need to be escaped in all contexts:
"(", ")", "[", "]", "{", "}", "-", "+", "*", "?"
- To escape a character, prefix it with a double backslash, e.g. a left parenthesis is written \\(
- This needs two backslashes because a single \ is treated by the Python interpreter as an escape on the following character.
- This holds even where a special character is unambiguously non-special. For example, [*] can only mean "match the literal * character", but the * must still be escaped; [*] is an invalid expression.
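The double backslash is a matter of Python string literals rather than of the pattern itself. As a quick check using only standard Python (no textwalker calls), the snippet below shows that the doubled form is a two-character string and that a raw string literal spells the same two characters. The escape_specials helper is hypothetical and handles only the ten characters listed above.

```python
escaped = "\\("  # backslash followed by a left parenthesis
raw = r"\("      # raw string literal: backslashes are kept as written

print(len(escaped))    # length of the pattern string actually passed along
print(escaped == raw)  # whether both literals produce the same string

# The special characters listed above.
SPECIAL = ["(", ")", "[", "]", "{", "}", "-", "+", "*", "?"]


def escape_specials(literal):
    """Hypothetical helper: backslash-escape every special character."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in literal)


print(escape_specials("(+1)"))
```

Whether to write patterns with doubled backslashes or as raw strings is a style choice; the examples in this document use the doubled form.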
Limitations/Gotchas/Notes
- The matching semantics are such that a pattern must fully match to be considered a match. For the walk methods, None means no match. This is different from a match of zero length, e.g. for (foo)?
- If a quantifier is not specified, the element must match exactly once.
- Character set ranges are matched using Python's own character comparison, i.e. by Unicode code point.
- Only case-sensitive matching is supported.
- All operators are greedy. This is noteworthy because, in some cases, a non-greedy match on a sub-group would allow the entire expression to match. For example, when matching (ab)*ab, the text abab will not match, since the subexpression (ab)* consumes the entire text. This can be avoided by bounding the group, e.g. (ab){1,1}ab matches abab.
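The greedy behaviour can be reproduced with a small hand-written matcher. The functions below are a hypothetical sketch, not textwalker code: repeat_literal consumes as many copies of a literal as its bounds allow and never gives any back, which is why the starred form fails on abab while the bounded form succeeds. For contrast, Python's re module backtracks and does match the starred pattern.

```python
import re


def repeat_literal(text, pos, literal, min_count, max_count=None):
    """Greedily consume `literal` between min_count and max_count times.

    Returns the new position, or None if fewer than min_count copies match.
    """
    count = 0
    while (max_count is None or count < max_count) and text.startswith(literal, pos):
        pos += len(literal)
        count += 1
    return pos if count >= min_count else None


def match_group_then_literal(text, min_count, max_count):
    """Emulate (ab){min,max}ab with no backtracking.

    Returns True only when the pattern matches and consumes all of `text`.
    """
    pos = repeat_literal(text, 0, "ab", min_count, max_count)
    if pos is None:
        return False
    pos = repeat_literal(text, pos, "ab", 1, 1)  # the trailing literal ab
    return pos == len(text)


star_result = match_group_then_literal("abab", 0, None)  # (ab)*ab
bounded_result = match_group_then_literal("abab", 1, 1)  # (ab){1,1}ab
backtracking_result = re.fullmatch(r"(ab)*ab", "abab") is not None

print(star_result, bounded_result, backtracking_result)
```

In the starred case the group takes both copies of ab, leaving nothing for the trailing literal; bounding the group to a single repetition leaves the second ab available.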