Python wrapper for Google's RE2 using Cython

Maintainer’s Note

This is an updated fork of [pyre2](https://github.com/andreasvc/pyre2). It provides prebuilt wheels for newer Python versions.

All docs below are taken from the pyre2 package.

Summary

pyre2 is a Python extension that wraps Google’s RE2 regular expression library. The RE2 engine compiles (strictly) regular expressions to deterministic finite automata, which guarantees linear-time behavior.

pyre2 is intended as a drop-in replacement for re. Unicode is supported by encoding to UTF-8, and bytes strings are treated as UTF-8 when the UNICODE flag is given. For best performance, work with UTF-8 encoded bytes strings.

Installation

Normal usage for Linux/Mac/Windows:

$ pip install pyre2-updated

Compiling from source

Requirements for building the C++ extension from the repo source:

  • A build environment with gcc or clang (e.g. sudo apt-get install build-essential)

  • Build tools and libraries: RE2, pybind11, and cmake installed in the build environment.

    • On Ubuntu/Debian: sudo apt-get install build-essential cmake ninja-build python3-dev cython3 pybind11-dev libre2-dev

    • On Gentoo, install dev-util/cmake, dev-python/pybind11, and dev-libs/re2

    • For a venv you can install the pybind11, cmake, and cython packages from PyPI

On macOS, use the brew package manager:

$ brew install -s re2 pybind11

On Windows, use the vcpkg package manager:

$ vcpkg install re2:x64-windows pybind11:x64-windows

You can set cmake environment variables to change the build type, pass a toolchain file (required on Windows), or select the cmake generator. For example:

$ CMAKE_GENERATOR="Unix Makefiles" CMAKE_TOOLCHAIN_FILE=clang_toolchain.cmake tox -e deploy

For development, get the source:

$ git clone https://github.com/tyteen4a03/pyre2.git
$ cd pyre2
$ make install

Platform-agnostic building with conda

An alternative to the above is provided via the conda recipe (use the miniconda installer if you don’t have conda installed already).

Backwards Compatibility

The stated goal of this module is to be a drop-in replacement for re, i.e.:

try:
    import re2 as re
except ImportError:
    import re

That being said, there are features of the re module that this module may never have; these will be handled through fallback to the original re module:

  • lookahead assertions (?!...)

  • backreferences (\N in the search pattern, where N is a group number)

  • \W and \S not supported inside character classes

On the other hand, Unicode character classes are supported (e.g., \p{Greek}). Syntax reference: https://github.com/google/re2/wiki/Syntax

However, there are times when you may want to be notified of a fallback. The function set_fallback_notification determines the behavior in these cases:

try:
    import re2 as re
except ImportError:
    import re
else:
    re.set_fallback_notification(re.FALLBACK_WARNING)

set_fallback_notification accepts one of three values: re.FALLBACK_QUIETLY (default), re.FALLBACK_WARNING (emit a warning), and re.FALLBACK_EXCEPTION (raise an exception).
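As a minimal sketch of the fallback behavior described above, the snippet below searches with a backreference, one of the constructs listed as unsupported by RE2. It guards the call to set_fallback_notification with hasattr because the standard re module has no such function. The sample text is made up for illustration, and how many warnings are recorded depends on which module was imported.

```python
import warnings

try:
    import re2 as re
except ImportError:
    import re

# set_fallback_notification exists only in re2; plain re does not have it.
if hasattr(re, "set_fallback_notification"):
    re.set_fallback_notification(re.FALLBACK_WARNING)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The backreference \1 should trigger a fallback to re when re2 is in use.
    match = re.search(r"(\w+) \1", "hello hello world")

print(match.group(0))
print("warnings recorded:", len(caught))
```

With the standard re module no warning is expected, since no fallback takes place.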

Documentation

Consult the docstrings in the source code, or view them interactively through IPython, pydoc re2, etc.

Unicode Support

Python bytes and unicode strings are fully supported, but note that RE2 works with UTF-8 encoded strings under the hood, which means that unicode strings need to be encoded and decoded back and forth. There are two important factors:

  • whether a unicode pattern and search string are used (they will be encoded to UTF-8 internally)

  • the UNICODE flag: whether operators such as \w recognize Unicode characters.

To avoid the overhead of encoding and decoding to UTF-8, it is possible to pass UTF-8 encoded bytes strings directly but still treat them as unicode:

In [18]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
Out[18]: ['M', '\xc3\xb6', 't', 'l', 'e', 'y', 'C', 'r', '\xc3\xbc', 'e']
In [19]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'))
Out[19]: ['M', 't', 'l', 'e', 'y', 'C', 'r', 'e']

However, note that the indices in Match objects will refer to the bytes string. The indices of the match in the unicode string could be computed by decoding/encoding, but this is done automatically and more efficiently if you pass the unicode string:

>>> re2.search(u'ü'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
<re2.Match object; span=(10, 12), match='\xc3\xbc'>
>>> re2.search(u'ü', u'Mötley Crüe', flags=re2.UNICODE)
<re2.Match object; span=(9, 10), match=u'\xfc'>
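The index difference shown above can be reproduced by hand. The sketch below searches the UTF-8 encoded bytes, converts the resulting byte offsets to character offsets by decoding the prefix, and compares them with the offsets obtained from searching the str directly. The helper byte_to_char_index is hypothetical and not part of re2; it only illustrates the decoding step that passing a unicode string makes unnecessary.

```python
try:
    import re2 as re
except ImportError:
    import re

text = "Mötley Crüe"
data = text.encode("utf8")

# Searching the encoded bytes yields offsets into the bytes string.
byte_span = re.search("ü".encode("utf8"), data).span()


def byte_to_char_index(encoded, byte_index):
    """Hypothetical helper: count the characters before a byte offset."""
    return len(encoded[:byte_index].decode("utf8"))


char_span = tuple(byte_to_char_index(data, i) for i in byte_span)

# Searching the str directly yields character offsets.
str_span = re.search("ü", text, flags=re.UNICODE).span()

print(byte_span, char_span, str_span)  # (10, 12) (9, 10) (9, 10)
```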

Finally, if you want to match bytes without regard for Unicode characters, pass bytes strings and leave out the UNICODE flag (this will cause Latin 1 encoding to be used with RE2 under the hood):

>>> re2.findall(br'.', b'\x80\x81\x82')
['\x80', '\x81', '\x82']

Performance

Performance is of course the point of this module, so it had better perform well. Regular expressions vary widely in complexity, and the salient feature of RE2 is that it behaves well asymptotically. That said, for very simple substitutions, I’ve found that Python’s regular re module is occasionally slightly faster. However, when the re module gets slow, it gets really slow, while this module buzzes along.
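To illustrate what "gets really slow" can mean, here is a small sketch that times a pattern with nested quantifiers against inputs that cannot match. The pattern and input sizes are chosen for illustration only and are not taken from the performance script. With a backtracking engine the time typically grows quickly with n, which is why n is kept small here; no timings are claimed.

```python
import time

try:
    import re2 as re
except ImportError:
    import re

# Nested quantifiers followed by a character that prevents a match.
PATTERN = r"(a+)+$"


def time_search(n):
    """Return the search result and elapsed seconds for an input of size n."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = re.search(PATTERN, subject)
    return result, time.perf_counter() - start


timings = {n: time_search(n) for n in (10, 14, 18)}
for n, (result, seconds) in timings.items():
    print(f"n={n:2d} match={result} seconds={seconds:.4f}")
```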

The table below shows tests run against 8 MB of text from the colossal Wikipedia XML file. Each test is run multiple times and measured with the timeit module. For more details, please see the performance script.

Test               # total runs  re time (s)  re2 time (s)  % re time  regex time (s)  % regex time
Findall URI|Email  2             6.262        0.131         2.08%      5.119           2.55%
Replace WikiLinks  100           4.374        0.815         18.63%     1.176           69.33%
Remove WikiLinks   100           4.153        0.225         5.43%      0.537           42.01%

Test descriptions:

  • Findall URI|Email: Find list of ‘([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)’

  • Replace WikiLinks: This test replaces links of the form [[Obama|Barack_Obama]] to Obama.

  • Remove WikiLinks: This test splits the data by the <page> tag.

Feel free to add more speed tests to the bottom of the script and send a pull request my way!
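As a rough idea of what an additional speed test might look like, the sketch below times the URI|Email pattern from the table with timeit. It does not follow the structure of the actual performance script, which is not shown here. The function name time_findall and the synthetic text are assumptions made for illustration; the real tests use Wikipedia data.

```python
import timeit

try:
    import re2 as re
except ImportError:
    import re

# The URI|Email pattern from the first row of the table.
PATTERN = r"([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)"

# Synthetic stand-in for the Wikipedia text (an assumption of this sketch).
TEXT = "see http://example.com/page or mail someone@example.org today " * 500


def time_findall(runs=2):
    """Hypothetical helper: total seconds for `runs` findall passes."""
    compiled = re.compile(PATTERN)
    return timeit.timeit(lambda: compiled.findall(TEXT), number=runs)


elapsed = time_findall()
print(f"{re.__name__}: {elapsed:.3f}s for 2 runs")
```

Compiling the pattern once outside the timed callable keeps compilation cost out of the measurement.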

Current Status

The tests show the following differences with Python’s re module:

  • The $ operator in Python’s re matches twice if the string ends with \n. This can be simulated using \n?$, except when doing substitutions.

  • The pyre2 module and Python’s re may behave differently with nested groups. See tests/test_emptygroups.txt for the examples.
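The first difference can be seen with the standard library alone. The sketch below uses only re, so it shows Python's behavior rather than pyre2's. It uses \Z, which matches only at the very end of the string, as a stand-in for an end-only $; that stand-in is an assumption made for illustration.

```python
import re  # standard library only; this shows re's behavior, not re2's

text = "ab\n"

# In re, "$" matches before the trailing newline and again at the very end.
dollar_positions = [m.start() for m in re.finditer(r"$", text)]

# "\Z" matches only at the very end of the string.
end_only_positions = [m.start() for m in re.finditer(r"\Z", text)]

# The workaround from the list above: allow an optional newline before the end.
workaround = re.search(r"b\n?\Z", text)

print(dollar_positions, end_only_positions, workaround is not None)
# [2, 3] [3] True
```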

Please report any further issues with pyre2.

Tests

If you would like to help, one thing that would be very useful is writing comprehensive tests for this. It’s actually really easy:

  • Come up with regular expression problems using the regular python ‘re’ module.

  • Write a session in Python traceback format (the files in the tests directory, such as tests/test_emptygroups.txt, serve as examples).

  • Replace your import re with import re2 as re.

  • Save it as test_<name>.txt in the tests directory. You can comment on it however you like and indent the code with 4 spaces.
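The steps above can be sketched as follows, assuming that "traceback format" refers to an interactive session of the kind the standard doctest module can check. That reading is an assumption, and the project's own test runner may work differently. The session text and the name test_example are hypothetical.

```python
import doctest

try:
    import re2 as re
except ImportError:
    import re

# A hypothetical test file body: free-form comments plus an indented session.
SESSION = r'''
Prose lines such as this one are treated as commentary.

    >>> m = re.search(r"(\d+)-(\d+)", "call 555-1234 now")
    >>> m.groups()
    ('555', '1234')
'''

parser = doctest.DocTestParser()
# Make the imported module available to the session under the name "re".
test = parser.get_doctest(SESSION, {"re": re}, "test_example", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

Running the same session against both re and re2 is what makes such a file useful as a compatibility test.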

Credits

This code builds on the following projects (in chronological order):
