
Python wrapper for Google's RE2 using Cython


Summary

pyre2 is a Python extension that wraps Google’s RE2 regular expression library. The RE2 engine compiles (strictly) regular expressions to deterministic finite automata, which guarantees linear-time behavior.

pyre2 is intended as a drop-in replacement for re. Unicode is supported by encoding to UTF-8, and bytes strings are treated as UTF-8 when the UNICODE flag is given. For best performance, work with UTF-8 encoded bytes strings.

Backwards Compatibility

The stated goal of this module is to be a drop-in replacement for re, i.e.:

try:
    import re2 as re
except ImportError:
    import re

That being said, there are features of the re module that this module may never have; these are handled by falling back to the original re module:

  • lookahead assertions (?!...)

  • backreferences (e.g. \1 in the search pattern)

  • \W and \S are not supported inside character classes

On the other hand, unicode character classes are supported (e.g., \p{Greek}). Syntax reference: https://github.com/google/re2/wiki/Syntax

However, there are times when you may want to be notified of a failover. The function set_fallback_notification determines the behavior in these cases:

try:
    import re2 as re
except ImportError:
    import re
else:
    re.set_fallback_notification(re.FALLBACK_WARNING)

set_fallback_notification accepts one of three values: re.FALLBACK_QUIETLY (the default), re.FALLBACK_WARNING (raise a warning), and re.FALLBACK_EXCEPTION (raise an exception).
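
As a rough illustration of the fallback path, the sketch below searches with a backreference, one of the features listed above as unsupported by RE2. It assumes that FALLBACK_WARNING reports the failover through Python's standard warnings machinery, which is why warnings are recorded here; the pattern and sample text are made up for illustration. When pyre2 is not installed, the plain re module handles the pattern directly and no warning is recorded.

```python
import warnings

try:
    import re2 as re
except ImportError:
    import re  # pyre2 not available: use the standard library directly
else:
    # Ask pyre2 to warn whenever it has to hand a pattern over to re.
    re.set_fallback_notification(re.FALLBACK_WARNING)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # \1 is a backreference, which RE2 cannot compile.
    match = re.search(r"(\w+) \1", "say hello hello again")

print(match.group(0))
print(len(caught), "warning(s) recorded")
```

Either way the search itself succeeds; only the notification differs.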

Installation

Prerequisites:

  • The re2 library from Google

  • The Python development headers (e.g. sudo apt-get install python-dev)

  • A build environment with gcc or clang (e.g. sudo apt-get install build-essential)

  • Cython 0.20+ (pip install cython)

After the prerequisites are installed, install as follows (pip3 for python3):

$ pip install https://github.com/andreasvc/pyre2/archive/master.zip

For development, get the source:

$ git clone https://github.com/andreasvc/pyre2.git
$ cd pyre2
$ make install

(or make install3 for Python 3)

Documentation

Consult the docstrings in the source code, or read them interactively through ipython, pydoc re2, etc.

Unicode Support

Python bytes and unicode strings are fully supported, but note that RE2 works with UTF-8 encoded strings under the hood, which means that unicode strings need to be encoded and decoded back and forth. There are two important factors:

  • whether a unicode pattern and search string is used (will be encoded to UTF-8 internally)

  • the UNICODE flag: whether operators such as \w recognize Unicode characters.

To avoid the overhead of encoding and decoding to UTF-8, it is possible to pass UTF-8 encoded bytes strings directly but still treat them as unicode:

In [18]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
Out[18]: ['M', '\xc3\xb6', 't', 'l', 'e', 'y', 'C', 'r', '\xc3\xbc', 'e']
In [19]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'))
Out[19]: ['M', 't', 'l', 'e', 'y', 'C', 'r', 'e']

However, note that the indices in Match objects will refer to the bytes string. The indices of the match in the unicode string could be computed by decoding/encoding, but this is done automatically and more efficiently if you pass the unicode string:

>>> re2.search(u'ü'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
<re2.Match object; span=(10, 12), match='\xc3\xbc'>
>>> re2.search(u'ü', u'Mötley Crüe', flags=re2.UNICODE)
<re2.Match object; span=(9, 10), match=u'\xfc'>
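
For reference, the manual conversion mentioned above can be sketched as follows. The helper byte_span_to_char_span is a hypothetical name, not part of pyre2, and the standard re module is used to produce the bytes-based match so that the snippet runs without pyre2 installed.

```python
import re  # standard library, used only to produce a bytes-based match


def byte_span_to_char_span(data, span):
    """Map a (start, end) span in UTF-8 bytes to code point offsets."""
    start, end = span
    # Decoding the prefix counts the characters that precede each offset.
    return (len(data[:start].decode("utf-8")),
            len(data[:end].decode("utf-8")))


text = u"Mötley Crüe"
data = text.encode("utf-8")
match = re.search(u"ü".encode("utf-8"), data)

print(match.span())                                # (10, 12)
print(byte_span_to_char_span(data, match.span()))  # (9, 10)
```

Each conversion decodes a prefix of the input, so doing this for many matches repeats work; passing the unicode string directly avoids that.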

Finally, if you want to match bytes without regard for Unicode characters, pass bytes strings and leave out the UNICODE flag (this will cause Latin 1 encoding to be used with RE2 under the hood):

>>> re2.findall(br'.', b'\x80\x81\x82')
['\x80', '\x81', '\x82']

Performance

Performance is of course the point of this module, so it had better perform well. Regular expressions vary widely in complexity, and the salient feature of RE2 is that it behaves well asymptotically. That being said, for very simple substitutions, I’ve found that Python’s regular re module is occasionally slightly faster. However, when the re module gets slow, it gets really slow, while this module buzzes along.

In the example below, I’m running the tests against 8 MB of text from the colossal Wikipedia XML file. Each test is run multiple times and timed with the timeit module. For more details, please see the performance script.

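+-------------------+-----------------------------------------------------------------------------+--------------+-------------+--------------+-----------+----------------+--------------+
| Test              | Description                                                                 | # total runs | re time (s) | re2 time (s) | % re time | regex time (s) | % regex time |
+===================+=============================================================================+==============+=============+==============+===========+================+==============+
| Findall URI|Email | Find list of ‘([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)’ | 2            | 6.262       | 0.131        | 2.08%     | 5.119          | 2.55%        |
+-------------------+-----------------------------------------------------------------------------+--------------+-------------+--------------+-----------+----------------+--------------+
| Replace WikiLinks | This test replaces links of the form [[Obama|Barack_Obama]] to Obama.       | 100          | 4.374       | 0.815        | 18.63%    | 1.176          | 69.33%       |
+-------------------+-----------------------------------------------------------------------------+--------------+-------------+--------------+-----------+----------------+--------------+
| Remove WikiLinks  | This test splits the data by the <page> tag.                                | 100          | 4.153       | 0.225        | 5.43%     | 0.537          | 42.01%       |
+-------------------+-----------------------------------------------------------------------------+--------------+-------------+--------------+-----------+----------------+--------------+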

Feel free to add more speed tests to the bottom of the script and send a pull request my way!
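
To get a feel for what "really slow" means, here is a small sketch that is separate from the performance script. The nested-quantifier pattern and the input sizes are illustrative choices, not taken from the benchmarks above. With the standard re module the time should grow roughly exponentially with the input length; with pyre2 installed, the same search is expected to stay fast.

```python
import timeit

try:
    import re2 as re
except ImportError:
    import re


def time_search(n, repeats=3):
    """Best-of-`repeats` time in seconds to search 'a' * n + 'b'."""
    text = "a" * n + "b"
    # (a+)+$ cannot match because of the trailing 'b'; a backtracking
    # engine tries exponentially many ways of splitting the run of 'a's.
    return min(timeit.repeat(lambda: re.search(r"(a+)+$", text),
                             number=1, repeat=repeats))


for n in (10, 14, 18):
    print(n, "%.5f s" % time_search(n))
```

The sizes are kept small on purpose: with a backtracking engine, every additional 'a' roughly doubles the running time.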

Current Status

The tests show the following differences with Python’s re module:

  • The $ operator in Python’s re matches twice if the string ends with \n. This can be simulated using \n?$, except when doing substitutions.

  • pyre2 and Python’s re may behave differently with nested groups.

    See tests/emptygroups.txt for the examples.
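
The first difference is easiest to see from the standard library side. The sketch below uses only Python’s own re module (not pyre2) to show the double match of $ on a string ending in \n, and how it affects substitutions.

```python
import re  # the standard library module, not pyre2

text = "foo\n"

# $ matches both just before the trailing newline and at the very end.
positions = [m.start() for m in re.finditer(r"$", text)]
print(positions)  # [3, 4]

# Each of the two matches receives the replacement.
print(repr(re.sub(r"$", "X", text)))  # 'fooX\nX'
```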

Please report any further issues with pyre2.

Tests

If you would like to help, one thing that would be very useful is writing comprehensive tests for this. It’s actually really easy:

  • Come up with regular expression problems using the regular python ‘re’ module.

  • Write a session in Python traceback format.

  • Replace your import re with import re2 as re.

  • Save it as a .txt file in the tests directory. You can comment on it however you like and indent the code with 4 spaces.
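
As a sketch of what such a session might look like and how it can be checked locally, the snippet below runs a made-up example through the standard doctest module. It assumes the .txt files are doctest-style sessions; the session text and the name sample_session are invented for illustration, and it imports plain re so that it runs even without pyre2 installed.

```python
import doctest

# A made-up session in the style a tests/*.txt file might use.
SESSION = r"""
A comment describing what is being tested.

    >>> import re
    >>> re.findall(r'\d+', 'a1b22c333')
    ['1', '22', '333']
"""

parser = doctest.DocTestParser()
test = parser.get_doctest(SESSION, {}, "sample_session", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results)
```

For the real test, the import line would become import re2 as re, as described in the steps above.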

Credits

This code builds on the following projects (in chronological order):
