Skip to main content
Help us improve PyPI by participating in user testing. All experience levels needed!

Expands a regular expression to its possible matches

Project description

Quick Start

The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or count possible matches efficiently. It uses the parsed regular expression, so you get a much more accurate result than trying to just split strings.

>>> s = 'foo|ba[rz]'
>>> s.split('|')  # bad
['foo', 'ba[rz]']

>>> import sre_yield
>>> list(sre_yield.AllStrings(s))  # better
['foo', 'bar', 'baz']

It does this by walking the tree as constructed by sre_parse (same thing used internally by the re module), and constructing chained/repeating iterators as appropriate. There may be duplicate results, depending on your input string though – these are cases that sre_parse did not optimize.

>>> import sre_yield
>>> list(sre_yield.AllStrings('.|a', charset='ab'))
['a', 'b', 'a']

…and happens in simpler cases too:

>>> list(sre_yield.AllStrings('a|a'))
['a', 'a']
>>> list(sre_yield.AllStrings('[aa]'))
['a', 'a']

Quirks

The membership check, 'abc' in values_obj is by necessity fullmatch – it must cover the entire string. Imagine that it has ^(...)$ around it. Because re.search can match anywhere in an arbitrarily string, emulating this would produce a large number of junk matches – probably not what you want. (If that is what you want, add a .* on either side.)

Here’s a quick example, using the presidents regex from http://xkcd.com/1313/

>>> s = 'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls'

>>> import re
>>> re.search(s, 'kennedy') is not None  # note .search
True
>>> v = sre_yield.AllStrings(s)
>>> v.__len__()
23
>>> 'bu' in v
True
>>> v[:5]
['bu', 'rt', 'nt', 'ce', 'oe']

If you do want to emulate search, you end up with a large number of matches quickly. Limiting the repetition a bit helps, but it’s still a very large number.

>>> v2 = sre_yield.AllStrings('.{,30}(' + s + ').{,30}')
>>> v2.__len__()  # too big for int
57220492262913872576843611006974799576789176661653180757625052079917448874638816841926032487457234703154759402702651149752815320219511292208238103L
>>> 'kennedy' in v2
True

Capturing Groups

If you’re interested in extracting what would match during generation of a value, you can use AllMatches instead to get Match objects.

>>> v = sre_yield.AllMatches(r'a(\d)b')
>>> m = v[0]
>>> m.group(0)
'a0b'
>>> m.group(1)
'0'

This even works for simplistic backreferences, in this case to have matching quotes.

>>> v = sre_yield.AllMatches(r'(["\'])([01]{3})\1')
>>> m = v[0]
>>> m.group(0)
'"000"'
>>> m.groups()
('"', '000')
>>> m.group(1)
'"'
>>> m.group(2)
'000'

Reporting Bugs, etc.

We welcome bug reports – see our issue tracker on Google Code to see if it’s been reported before. If you’d like to discuss anything, we have a Google Group as well.

Differences between sre_yield and the re module

There are certainly valid regular expressions which sre_yield does not handle. These include things like lookarounds, backreferences, but also a few other exceptions:

  • The maximum value for repeats is system-dependant – CPython’s sre module there’s a special value which is treated as infinite (either 2**16-1 or 2**32-1 depending on build). In sre_yield, this is taken as a literal, rather than infinite, thus (on a 2**16-1 platform):

    >>> len(sre_yield.AllStrings('a*')[-1])
    65535
    >>> import re
    >>> len(re.match('.*', 'a' * 100000).group(0))
    100000
    
  • The re module docs say “Regular expression pattern strings may not contain null bytes” yet this appears to work fine.

  • Order does not depend on greediness.

  • The regex is treated as fullmatch.

  • sre_yield is confused by even the simplest of anchors:

    >>> list(sre_yield.AllStrings('foo$'))
    []
    

Project details


Release history Release notifications

This version
History Node

1.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
sre_yield-1.0.tar.gz (18.1 kB) Copy SHA256 hash SHA256 Source None Apr 14, 2014

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging CloudAMQP CloudAMQP RabbitMQ AWS AWS Cloud computing Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page