seq-re

2-dimensional Sequence Regular Expression (SEQ RE)

This module provides regular expression matching operations over a sequence of tuples (or a sequence of sequences). The data structure looks like the following:

seq_m_n = [[str_11, str_12, ... str_1n],
           [str_21, str_22, ... str_2n],
           ...,
           [str_m1, str_m2, ... str_mn]]

The sequence is a homogeneous 2D array, that is, a matrix with m rows and n columns. In practice, m may vary from sequence to sequence, while n is usually fixed.
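
As a concrete illustration of this layout, the sketch below builds a small sequence from plain Python lists and checks that every row has the same width n. The helper name `is_homogeneous` is hypothetical and is not part of seq_re.

```python
# A sequence of 3-tuples: (word, part-of-speech tag, entity tag).
seq = [
    ['Vincent van Gogh', 'NNP', 'PERSON'],
    ['was', 'VBD', 'O'],
    ['a', 'DT', 'O'],
]


def is_homogeneous(sequence, n):
    """Return True if every row has exactly n elements and all are strings."""
    return all(
        len(row) == n and all(isinstance(elem, str) for elem in row)
        for row in sequence
    )


m = len(seq)  # number of rows, varies from sequence to sequence
n = 3         # number of columns, fixed
print(m, is_homogeneous(seq, n))
```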

An element in a tuple of the sequence can be considered a string, a word, a phrase, a char, a flag, a token or a tag, and possibly a set of tags or values (multi-values) in the future.

To match a pattern over a sequence of tuples, a SEQ RE pattern is written like one of the following examples:

([;;PERSON]+) [was|has been] [an]? .{0,3} ([^painter|drawing artist|画家])

(?P<name@0,1,2>[;;PERSON]) [;VERB be;] [born] [on] (?P<birthday@0:3>([;;NUMBER|MONTH]|[-]){2,3})

1. The syntax of SEQ RE pattern

A SEQ RE pattern is very similar to the ordinary regular expression (RE) used in Python, except that the delimiters [...] are used to indicate a tuple, the second dimension of the sequence.

1.1 Inside [...]

[ and ]

are the beginning and end delimiters of the tuple, e.g. [...].
;

separates the elements that the tuple contains. Consecutive ; at the tail can be omitted, e.g. [A|B;X;;] can be written as [A|B;X].
|

separates the alternative values of one element, e.g. A|B. These values form a set, and any string in the set will be matched, e.g. A|B will match A or B.
^

When ^ is the first character of an element, any string that is not in the value set of this element will be matched. ^ has no special meaning if it is not the first character of the element. If ^ is the first character of an element but is part of a literal string, \^ should be used to escape it.
The priority of the above-mentioned operations:

[ ] < ; < ^ (not literal) < | < ^ (literal).
\

is the escape symbol placed before the aforementioned special characters. Characters other than ], ;, | or \ lose their special meaning inside [...]. To express ], ; or | literally, \ should be added before ], ; or |. Meanwhile, to represent a literal backslash \ before ], ; or |, \\ should be used in the plain text, that is to say '\\\\' must be written in a non-raw string literal in the Python code.
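
The rules above can be sketched as a small parser for the text between [ and ]. This is only an illustration of the described syntax, not the parser used by seq_re: the names `parse_tuple_pattern` and `tuple_matches` are hypothetical, and the sketch assumes that a backslash is an escape only when it precedes one of the special characters.

```python
SPECIAL = {']', ';', '|', '^', '\\'}


def parse_tuple_pattern(inner):
    """Parse the text between [ and ] into a list of (negated, values) pairs.

    An empty value set means that any string is accepted for that element.
    """
    elements, values, current = [], [], []
    negated = False
    at_start = True  # are we at the first character of an element?
    i = 0
    while i < len(inner):
        ch = inner[i]
        nxt = inner[i + 1] if i + 1 < len(inner) else ''
        if ch == '\\' and nxt in SPECIAL:
            current.append(nxt)  # escaped character, kept literally
            i += 1
        elif ch == '^' and at_start:
            negated = True       # complement the value set of this element
        elif ch == '|':
            values.append(''.join(current))
            current = []
        elif ch == ';':
            values.append(''.join(current))
            elements.append((negated, {v for v in values if v}))
            values, current, negated = [], [], False
        else:
            current.append(ch)
        at_start = (ch == ';')
        i += 1
    values.append(''.join(current))
    elements.append((negated, {v for v in values if v}))
    while elements and not elements[-1][1]:
        elements.pop()           # trailing empty elements may be omitted
    return elements


def tuple_matches(elements, tup):
    """Check one tuple of the sequence against the parsed elements."""
    for (negated, values), elem in zip(elements, tup):
        if values and ((elem in values) == negated):
            return False
    return True


print(parse_tuple_pattern('A|B;X;;') == parse_tuple_pattern('A|B;X'))
print(tuple_matches(parse_tuple_pattern(';;PERSON'),
                    ['Vincent van Gogh', 'NNP', 'PERSON']))
```

In this sketch each element becomes a value set plus a negation flag, so matching one tuple is a per-column membership test.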

1.2 Outside [...]

The special characters of the ordinary RE keep their special meanings here, with the limitations listed below.
1. [ and ] are not supported as special characters indicating a set of characters.
2. The following escaped special characters are not supported: \number, \A, \b, \B, \d, \D, \s, \S, \w, \W, \Z, \a, \b, \f, \n, \r, \t, \v, \x.
3. Ranges of characters, such as [0-9A-Za-z] or [\u4E00-\u9FBB\u3007] (Unihan and the Chinese character 〇) used in ordinary RE, are not supported.
4. Whitespace and non-special characters are ignored.
. is an abbreviation of an arbitrary tuple [] or [;].
The named groups in the pattern are very useful. As an extension, a format string starting with @ can follow the group name to describe which elements of the tuples belonging to this group will be output as the result. For example: (?P<name@d1,d2:d3>...), in which d1, d2 and d3 are all 0-based position indexes of elements in the tuple.
1. @0,2:4 means that only the 0th element and the 2nd to 3rd elements of the tuples in the matched result will be output.
2. @@ means that the pattern of the group itself will be output rather than the matched result. One can choose whether to include the group name and parentheses or not.
3. @ means that all elements of the tuples in the matched result will be output.
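
To make the index format concrete, the sketch below expands the text that follows a single @ into a list of 0-based indexes and applies it to matched tuples. The function names are hypothetical and this is not how seq_re implements the feature. The @@ form is not handled here.

```python
def parse_format_indices(fmt, n):
    """Expand a format such as '0,2:4' into a list of 0-based element indexes.

    fmt is the text after the @ sign; n is the number of elements per tuple.
    """
    if fmt == '':
        return list(range(n))  # a bare @ selects every element
    indices = []
    for part in fmt.split(','):
        if ':' in part:
            start, stop = part.split(':')
            # As in Python slices, the stop index is exclusive.
            indices.extend(range(int(start), int(stop)))
        else:
            indices.append(int(part))
    return indices


def project(tuples, indices):
    """Keep only the selected elements of every tuple."""
    return [[tup[i] for i in indices] for tup in tuples]


matched = [['Vincent van Gogh', 'NNP', 'PERSON', 'extra']]
print(project(matched, parse_format_indices('0,2:4', 4)))
```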

1.3 Boolean logic in the [...]

Given a sequence of 3-tuple [[s1, s2, s3], ... ],

AND

[X;;Y] will match s1 == X && s3 == Y. Its behavior looks like the ordinary RE pattern (?:X.Y).
OR

[X;;]|[;;Y] will match s1 == X || s3 == Y. Its behavior looks like the ordinary RE pattern (?:X..)|(?:..Y)
NOT

[;^P;] will match s2 != P. Its behavior looks like the ordinary RE pattern (?:.[^P].).

We can also use a negative lookahead assertion of the ordinary RE to negate the pattern that follows it, e.g. (?![;P;][Q])[;;][;;] <==> [;^P;][^Q;;], whose behavior looks like the ordinary RE pattern (?!(?:.P.)(?:Q..))....
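
The analogy with ordinary RE patterns can be tried out directly with Python's re module. In the sketch below each element of a 3-tuple is a single character, so that one RE character stands for one element. This encoding is an assumption made only for the illustration and does not involve seq_re itself.

```python
import re

# Ordinary RE counterparts of the three tuple patterns discussed above.
AND = re.compile(r'(?:X.Y)')         # like [X;;Y]
OR = re.compile(r'(?:X..)|(?:..Y)')  # like [X;;]|[;;Y]
NOT = re.compile(r'(?:.[^P].)')      # like [;^P;]


def matches(regex, tup):
    """Encode a 3-tuple of single characters as a string and match all of it."""
    return regex.fullmatch(''.join(tup)) is not None


print(matches(AND, ('X', 'a', 'Y')))  # s1 == X and s3 == Y
print(matches(OR, ('a', 'b', 'Y')))   # s1 == X or s3 == Y
print(matches(NOT, ('a', 'P', 'b')))  # s2 != P
```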

2. Notes

Comparing numeric values is not supported.

Multiple values for one element are not supported yet, but this feature may be added in the future.

Although SEQ RE has sufficient ability to express a pattern over sequences of tuples, it is still not a cascaded regular expression engine (see also: Stanford TokensRegex).

3. Examples

Usage of the seq_re module:

from __future__ import print_function
import seq_re

n = 3
pattern = ('(?P<name@0>[;;PERSON]+) [is|was|has been] [a|an]? '
           '(?P<attrib@0,1>.{0,3}) ([artist])')
seq = [['Vincent van Gogh', 'NNP', 'PERSON'],
       ['was', 'VBD', 'O'],
       ['a', 'DT', 'O'],
       ['Dutch', 'JJ', 'O'],
       ['Post-Impressionist', 'NN', 'O'],
       ['painter', 'NN', 'OCCUPATION'],
       ['who', 'WP', 'O'],
       ['is', 'VBZ', 'O'],
       ['among', 'IN', 'O'],
       ['the', 'DT', 'O'],
       ['most', 'RBS', 'O'],
       ['famous', 'JJ', 'O'],
       ['and', 'CC', 'O'],
       ['influential', 'JJ', 'O'],
       ['figures', 'NNS', 'O'],
       ['in', 'IN', 'O'],
       ['the', 'DT', 'O'],
       ['history', 'NN', 'O'],
       ['of', 'IN', 'O'],
       ['Western art', 'NNP', 'DOMAIN'],
       ['.', '.', 'O']]
placeholder_dict = {'artist': ['painter', 'drawing artist']}

sr = seq_re.SeqRegex(n).compile(pattern, **placeholder_dict)
match = sr.search(seq)
if match:
    for g in match.group_list:
        print(' '.join(['`'.join(tup) for tup in g[1]]))
    for name in sorted(match.named_group_dict,
                       key=lambda gn: match.named_group_dict[gn][0]):
        print(name, match.format_group_to_str(name, True))

Release history

0.2.1: Apr 25, 2017
0.2: Apr 19, 2017
