Python wrapper for Google's RE2 using Cython

Summary

pyre2 is a Python extension that wraps Google’s RE2 regular expression library. The RE2 engine compiles (strictly) regular expressions to deterministic finite automata, which guarantees matching time that is linear in the length of the input.

pyre2 is intended as a drop-in replacement for re. Unicode is supported by encoding to UTF-8, and bytes strings are treated as UTF-8 when the UNICODE flag is given. For best performance, work with UTF-8 encoded bytes strings.

Backwards Compatibility

The stated goal of this module is to be a drop-in replacement for re, i.e.:

try:
    import re2 as re
except ImportError:
    import re

That being said, there are features of the re module that this module may never have; these are handled by falling back to the original re module:

  • lookahead assertions (?!...)

  • backreferences (\\n in search pattern)

  • \W and \S not supported inside character classes

On the other hand, unicode character classes are supported (e.g., \p{Greek}). Syntax reference: https://github.com/google/re2/wiki/Syntax
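As a rough illustration of the Unicode class syntax, the sketch below matches Greek text with \p{Greek} when re2 can be imported, and otherwise approximates it with a code point range, since the standard re module does not accept \p{...}. The helper name find_greek and the fallback range (the Greek and Coptic block only) are assumptions made for this example; they are not part of pyre2.

```python
try:
    import re2 as re
    HAVE_RE2 = True
except ImportError:
    import re
    HAVE_RE2 = False


def find_greek(text):
    """Return runs of Greek letters found in text (hypothetical helper)."""
    if HAVE_RE2:
        # RE2 syntax: Unicode script class.
        return re.findall(r"\p{Greek}+", text)
    # Standard re has no \p{...}; approximate with the Greek and Coptic block.
    return re.findall(r"[\u0370-\u03ff]+", text)


print(find_greek("abc αβγ def"))
```

The range-based branch is only an approximation: it misses Greek Extended characters that the script class would cover.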

However, there are times when you may want to be notified of a failover. The function set_fallback_notification determines the behavior in these cases:

try:
    import re2 as re
except ImportError:
    import re
else:
    re.set_fallback_notification(re.FALLBACK_WARNING)

set_fallback_notification accepts one of three values: re.FALLBACK_QUIETLY (default), re.FALLBACK_WARNING (raise a warning), and re.FALLBACK_EXCEPTION (raise an exception).
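To see a fallback in practice, a pattern with a negative lookahead can be searched while warnings are recorded. This is a minimal sketch: the helper search_with_report is hypothetical, and it assumes that FALLBACK_WARNING is delivered through Python's warnings machinery. The hasattr guard keeps the code working when only the standard re module is available.

```python
import warnings

try:
    import re2 as re
except ImportError:
    import re

# set_fallback_notification exists only in pyre2, so guard the call
# in case the standard re module was imported instead.
if hasattr(re, "set_fallback_notification"):
    re.set_fallback_notification(re.FALLBACK_WARNING)


def search_with_report(pattern, text):
    """Run re.search and return (match, list of warning messages seen)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        match = re.search(pattern, text)
    return match, [str(w.message) for w in caught]


# A negative lookahead is not supported by RE2, so pyre2 is expected to
# hand this pattern over to the standard re module.
match, messages = search_with_report(r"foo(?!bar)", "foobaz")
print(match.group(0), messages)
```

With the standard re module the list of messages is normally empty; under pyre2 it should contain the fallback notice.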

Installation

Prerequisites:

  • The re2 library from Google

  • The Python development headers (e.g. sudo apt-get install python-dev)

  • A build environment with gcc or clang (e.g. sudo apt-get install build-essential)

  • Cython 0.20+ (pip install cython)

After the prerequisites are installed, install as follows (pip3 for python3):

$ pip install https://github.com/andreasvc/pyre2/archive/master.zip

For development, get the source:

$ git clone https://github.com/andreasvc/pyre2.git
$ cd pyre2
$ make install

(or make install3 for Python 3)

Documentation

Consult the docstrings in the source code, or view them interactively through ipython, pydoc re2, etc.

Unicode Support

Python bytes and unicode strings are fully supported, but note that RE2 works with UTF-8 encoded strings under the hood, which means that unicode strings need to be encoded and decoded back and forth. There are two important factors:

  • whether a unicode pattern and search string are used (they will be encoded to UTF-8 internally)

  • the UNICODE flag: whether operators such as \w recognize Unicode characters.

To avoid the overhead of encoding and decoding to UTF-8, it is possible to pass UTF-8 encoded bytes strings directly but still treat them as unicode:

In [18]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
Out[18]: ['M', '\xc3\xb6', 't', 'l', 'e', 'y', 'C', 'r', '\xc3\xbc', 'e']
In [19]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'))
Out[19]: ['M', 't', 'l', 'e', 'y', 'C', 'r', 'e']

However, note that the indices in Match objects will refer to the bytes string. The indices of the match in the unicode string could be computed by decoding/encoding, but this is done automatically and more efficiently if you pass the unicode string:

>>> re2.search(u'ü'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
<re2.Match object; span=(10, 12), match='\xc3\xbc'>
>>> re2.search(u'ü', u'Mötley Crüe', flags=re2.UNICODE)
<re2.Match object; span=(9, 10), match=u'\xfc'>
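For cases where bytes strings are searched but character offsets are still needed, the conversion mentioned above can be sketched as follows. The helper char_span is hypothetical (not part of pyre2), and it assumes that the byte span starts and ends on UTF-8 character boundaries.

```python
try:
    import re2 as re
except ImportError:
    import re


def char_span(data, byte_span):
    """Convert a (start, end) byte span in UTF-8 data to character offsets.

    Assumes both offsets fall on character boundaries; otherwise
    decoding raises UnicodeDecodeError.
    """
    start, end = byte_span
    char_start = len(data[:start].decode("utf-8"))
    char_end = char_start + len(data[start:end].decode("utf-8"))
    return char_start, char_end


text = "Mötley Crüe"
data = text.encode("utf-8")
m = re.search("ü".encode("utf-8"), data)
print(m.span())                   # offsets into the bytes string
print(char_span(data, m.span()))  # offsets into the unicode string
```

Decoding the prefix costs time proportional to the match position, which is why passing the unicode string directly is the more convenient route.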

Finally, if you want to match bytes without regard for Unicode characters, pass bytes strings and leave out the UNICODE flag (this will cause Latin 1 encoding to be used with RE2 under the hood):

>>> re2.findall(br'.', b'\x80\x81\x82')
['\x80', '\x81', '\x82']

Performance

Performance is of course the point of this module, so it had better perform well. Regular expressions vary widely in complexity, and the salient feature of RE2 is that it behaves well asymptotically. That said, for very simple substitutions, I’ve found that Python’s regular re module is occasionally slightly faster. However, when the re module gets slow, it gets really slow, while this module buzzes along.

In the example below, I ran the tests against 8MB of text from the colossal Wikipedia XML file, repeating each test multiple times and timing it with the timeit module. For more details, see the performance script.

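=================  ===========================================================================  ============  ==========  ===========  =========  =============  ============
Test               Description                                                                  # total runs  re time(s)  re2 time(s)  % re time  regex time(s)  % regex time
=================  ===========================================================================  ============  ==========  ===========  =========  =============  ============
Findall URI|Email  Find list of ‘([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)’  2             6.262       0.131        2.08%      5.119          2.55%
Replace WikiLinks  This test replaces links of the form [[Obama|Barack_Obama]] to Obama.       100           4.374       0.815        18.63%     1.176          69.33%
Remove WikiLinks   This test splits the data by the <page> tag.                                100           4.153       0.225        5.43%      0.537          42.01%
=================  ===========================================================================  ============  ==========  ===========  =========  =============  ============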

Feel free to add more speed tests to the bottom of the script and send a pull request my way!
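A new speed test could be structured roughly like the sketch below. It is not taken from the performance script; the function time_findall and the synthetic sample text are assumptions for illustration, and re2 is timed only if it can be imported.

```python
import re
import timeit

try:
    import re2
except ImportError:
    re2 = None


def time_findall(pattern, text, runs=3):
    """Return {module name: seconds taken by `runs` findall calls}."""
    modules = {"re": re}
    if re2 is not None:
        modules["re2"] = re2
    timings = {}
    for name, module in modules.items():
        compiled = module.compile(pattern)
        # timeit runs immediately, so the lambda sees the current `compiled`.
        timings[name] = timeit.timeit(lambda: compiled.findall(text), number=runs)
    return timings


# Small synthetic input; a real benchmark would use a much larger corpus.
sample = "see http://example.com/page or mail someone@example.org " * 1000
pattern = r"([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)"
for name, seconds in time_findall(pattern, sample).items():
    print(f"{name}: {seconds:.3f}s")
```

Compiling outside the timed call keeps pattern compilation out of the measurement, so the numbers reflect matching time only.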

Current Status

The tests show the following differences with Python’s re module:

  • The $ operator in Python’s re matches twice if the string ends with \n. This can be simulated using \n?$, except when doing substitutions.

  • pyre2 and Python’s re may behave differently with nested groups.

    See tests/emptygroups.txt for the examples.
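The first difference can be observed with the standard re module alone. The sketch below lists where $ matches in a string that ends with a newline, and compares it with \Z, which matches only at the very end of the string; as far as the RE2 syntax reference indicates, that is also how $ behaves under RE2 without the multiline flag.

```python
import re

text = "foo\n"

# Python's re: `$` matches just before the trailing newline and at the end.
dollar_positions = [m.start() for m in re.finditer(r"$", text)]
print(dollar_positions)  # [3, 4]

# `\Z` matches only at the very end of the string.
end_positions = [m.start() for m in re.finditer(r"\Z", text)]
print(end_positions)  # [4]
```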

Please report any further issues with pyre2.

Tests

If you would like to help, one thing that would be very useful is writing comprehensive tests for this. It’s actually really easy:

  • Come up with regular expression problems using the regular python ‘re’ module.

  • Write a session in Python traceback format (see the linked example).

  • Replace your import re with import re2 as re.

  • Save it as a .txt file in the tests directory. You can comment on it however you like and indent the code with 4 spaces.
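The steps above can be tried out locally with the standard doctest module, which reads interactive sessions from text files. This is a sketch under the assumption that the test files are plain doctest sessions; the session text, the file name example.txt, and the helper run_session are made up for illustration.

```python
import doctest
import os
import tempfile

# A small session: free-form commentary plus code indented by 4 spaces.
SESSION = '''\
Greedy repetition should return the longest run.

    >>> try:
    ...     import re2 as re
    ... except ImportError:
    ...     import re
    >>> re.search(r"a+", "caaat").group(0)
    'aaa'
'''


def run_session(text):
    """Write the session to a temporary .txt file and run it with doctest."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
        return doctest.testfile(path, module_relative=False)


results = run_session(SESSION)
print(results)
```

doctest.testfile returns the number of failed and attempted examples, so a failing session is easy to spot before sending it in.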

Credits

This code builds on the following projects (in chronological order):


