Python wrapper for Google's RE2 using Cython
Summary
pyre2 is a Python extension that wraps Google’s RE2 regular expression library. The RE2 engine compiles (strictly) regular expressions to deterministic finite automata, which guarantees linear-time behavior.
pyre2 is intended as a drop-in replacement for re. Unicode is supported by encoding to UTF-8, and bytes strings are treated as UTF-8 when the UNICODE flag is given. For best performance, work with UTF-8 encoded bytes strings.
Installation
Normal usage for Linux/Mac/Windows:
$ pip install pyre2
Compiling from source
Requirements for building the C++ extension from the repo source:
A build environment with gcc or clang (e.g. sudo apt-get install build-essential)
Build tools and libraries: RE2, pybind11, and cmake installed in the build environment.
On Ubuntu/Debian: sudo apt-get install build-essential cmake ninja-build python3-dev cython3 pybind11-dev libre2-dev
On Gentoo, install dev-util/cmake, dev-python/pybind11, and dev-libs/re2
For a venv you can install the pybind11, cmake, and cython packages from PyPI
On macOS, use the brew package manager:
$ brew install -s re2 pybind11
On Windows, use the vcpkg package manager:
$ vcpkg install re2:x64-windows pybind11:x64-windows
You can set cmake environment variables to change the build type, pass a toolchain file (required on Windows), or select the cmake generator. For example:
$ CMAKE_GENERATOR="Unix Makefiles" CMAKE_TOOLCHAIN_FILE=clang_toolchain.cmake tox -e deploy
For development, get the source:
$ git clone git://github.com/andreasvc/pyre2.git
$ cd pyre2
$ make install
Platform-agnostic building with conda
An alternative to the above is provided via the conda recipe (use the miniconda installer if you don’t have conda installed already).
Backwards Compatibility
The stated goal of this module is to be a drop-in replacement for re, i.e.:
try:
    import re2 as re
except ImportError:
    import re
That being said, there are features of the re module that this module may never have; these will be handled through fallback to the original re module:
lookahead assertions (?!...)
backreferences (\n in the search pattern, where n is a group number)
\W and \S not supported inside character classes
On the other hand, unicode character classes are supported (e.g., \p{Greek}). Syntax reference: https://github.com/google/re2/wiki/Syntax
However, there are times when you may want to be notified of a failover. The function set_fallback_notification determines the behavior in these cases:
try:
    import re2 as re
except ImportError:
    import re
else:
    re.set_fallback_notification(re.FALLBACK_WARNING)
set_fallback_notification accepts one of three values: re.FALLBACK_QUIETLY (default), re.FALLBACK_WARNING (raise a warning), and re.FALLBACK_EXCEPTION (raise an exception).
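As a rough illustration of the notification setting, the sketch below compiles a pattern with a backreference, which RE2 does not support, and records any warning raised along the way. The helper find_doubled_letters and the HAVE_PYRE2 flag are names made up for this example, and the hasattr guard is an assumption to cope with other packages that also install a module named re2. If pyre2 is not installed, the code simply runs on the standard re module and no warning is recorded.

```python
import warnings

try:
    import re2 as re
    # Assumption: treat any re2 module without this function as "not pyre2".
    if not hasattr(re, "set_fallback_notification"):
        raise ImportError("re2 module found, but it is not pyre2")
except ImportError:
    import re
    HAVE_PYRE2 = False
else:
    HAVE_PYRE2 = True
    # Ask pyre2 to warn whenever it hands a pattern over to the re module.
    re.set_fallback_notification(re.FALLBACK_WARNING)


def find_doubled_letters(text):
    """Return (matches, number_of_warnings) for a backreference pattern."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # The backreference \1 is one of the features listed as unsupported.
        pattern = re.compile(r"(\w)\1")
        hits = [m.group(0) for m in pattern.finditer(text)]
    return hits, len(caught)


hits, n_warnings = find_doubled_letters("bookkeeper")
print(hits, n_warnings)
```

With pyre2 installed and FALLBACK_WARNING set, the warning count would be expected to be non-zero for this pattern; on plain re it stays at zero.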
Documentation
Consult the docstrings in the source code or interactively through ipython or pydoc re2 etc.
Unicode Support
Python bytes and unicode strings are fully supported, but note that RE2 works with UTF-8 encoded strings under the hood, which means that unicode strings need to be encoded and decoded back and forth. There are two important factors:
whether a unicode pattern and search string are used (they will be encoded to UTF-8 internally)
the UNICODE flag: whether operators such as \w recognize Unicode characters.
To avoid the overhead of encoding and decoding to UTF-8, it is possible to pass UTF-8 encoded bytes strings directly but still treat them as unicode:
In [18]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
Out[18]: ['M', '\xc3\xb6', 't', 'l', 'e', 'y', 'C', 'r', '\xc3\xbc', 'e']
In [19]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'))
Out[19]: ['M', 't', 'l', 'e', 'y', 'C', 'r', 'e']
However, note that the indices in Match objects will refer to the bytes string. The indices of the match in the unicode string could be computed by decoding/encoding, but this is done automatically and more efficiently if you pass the unicode string:
>>> re2.search(u'ü'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
<re2.Match object; span=(10, 12), match='\xc3\xbc'>
>>> re2.search(u'ü', u'Mötley Crüe', flags=re2.UNICODE)
<re2.Match object; span=(9, 10), match=u'\xfc'>
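To make the relation between the two spans concrete, here is a minimal sketch of the manual conversion from byte offsets to character offsets. The function byte_span_to_char_span is a hypothetical helper, not part of pyre2, and the standard re module is used for the search only so that the example runs without pyre2 installed. It assumes the offsets fall on character boundaries, which holds for matches of a valid UTF-8 pattern.

```python
import re  # standard library; used here only so the sketch runs without pyre2


def byte_span_to_char_span(data, span, encoding="utf-8"):
    """Map a (start, end) span in encoded bytes to character indices."""
    start, end = span
    # Decoding the prefix up to each offset counts the characters before it.
    char_start = len(data[:start].decode(encoding))
    char_end = len(data[:end].decode(encoding))
    return char_start, char_end


text = "Mötley Crüe"
data = text.encode("utf-8")

match = re.search("ü".encode("utf-8"), data)
byte_span = match.span()
char_span = byte_span_to_char_span(data, byte_span)
print(byte_span, char_span)  # (10, 12) (9, 10)
```

Each conversion decodes a prefix of the input, so doing this for many matches costs far more than passing the unicode string and letting the module track the indices.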
Finally, if you want to match bytes without regard for Unicode characters, pass bytes strings and leave out the UNICODE flag (this will cause Latin 1 encoding to be used with RE2 under the hood):
>>> re2.findall(br'.', b'\x80\x81\x82')
['\x80', '\x81', '\x82']
Performance
Performance is of course the point of this module, so it had better perform well. Regular expressions vary widely in complexity, and the salient feature of RE2 is that it behaves well asymptotically. That said, for very simple substitutions, I’ve found that Python’s regular re module is occasionally slightly faster. However, when the re module gets slow, it gets really slow, while this module buzzes along.
In the example below, I’m running the tests against 8 MB of text from the colossal Wikipedia XML file. Each test is run multiple times, timed with the timeit module. For more details, see the performance script.
Test | Description | # total runs | re time (s) | re2 time (s) | % re time | regex time (s) | % regex time
---|---|---|---|---|---|---|---
Findall URI\|Email | Find list of ‘([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?\|([^ @]+)@([^ @]+)’ | 2 | 6.262 | 0.131 | 2.08% | 5.119 | 2.55%
Replace WikiLinks | This test replaces links of the form [[Obama\|Barack_Obama]] to Obama. | 100 | 4.374 | 0.815 | 18.63% | 1.176 | 69.33%
Remove WikiLinks | This test splits the data by the <page> tag. | 100 | 4.153 | 0.225 | 5.43% | 0.537 | 42.01%
Feel free to add more speed tests to the bottom of the script and send a pull request my way!
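A timing comparison of this kind can be sketched with timeit as follows. This is not the project's performance script: the helper time_findall and the short synthetic sample text are assumptions made for illustration, and the re2 timing is only collected when a module named re2 can be imported.

```python
import re
import timeit

# Pattern taken from the "Findall URI|Email" row of the table above.
URI_EMAIL = r"([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)"


def time_findall(module, pattern, text, runs=5):
    """Return the total seconds taken by `runs` findall calls."""
    compiled = module.compile(pattern)  # compile once, outside the timed code
    return timeit.timeit(lambda: compiled.findall(text), number=runs)


# Synthetic stand-in for the Wikipedia dump used in the real benchmark.
sample = "visit http://example.com/page or mail bob@example.org now " * 200

timings = {"re": time_findall(re, URI_EMAIL, sample)}

try:
    import re2
except ImportError:
    re2 = None  # pyre2 not installed; only the stdlib timing is reported
if re2 is not None:
    timings["re2"] = time_findall(re2, URI_EMAIL, sample)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```

On such a small input the numbers say little; the gap reported in the table only shows up on large texts and patterns that make re backtrack heavily.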
Current Status
The tests show the following differences with Python’s re module:
The $ operator in Python’s re matches twice if the string ends with \n. This can be simulated using \n?$, except when doing substitutions.
The pyre2 module and Python’s re may behave differently with nested groups. See tests/test_emptygroups.txt for the examples.
Please report any further issues with pyre2.
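The first difference is easiest to see from the reference behavior of the standard re module. The sketch below only exercises re, so it runs without pyre2; dollar_positions is a helper name introduced for this example.

```python
import re  # standard library module, shown as the reference behavior


def dollar_positions(pattern, text):
    """Return the start offset of every match of `pattern` in `text`."""
    return [m.start() for m in re.finditer(pattern, text)]


# With a trailing newline, re reports `$` both before the newline and at the end.
print(dollar_positions(r"$", "abc\n"))  # [3, 4]

# Without a trailing newline there is a single match at the end.
print(dollar_positions(r"$", "abc"))  # [3]
```

Code that counts or replaces `$` matches is where the two modules can be expected to diverge, which is why substitutions are singled out above.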
Tests
If you would like to help, one thing that would be very useful is writing comprehensive tests for this. It’s actually really easy:
Come up with regular expression problems using the standard Python re module.
Write a session in Python traceback format (see the existing files in the tests directory for examples).
Replace your import re with import re2 as re.
Save it as test_<name>.txt in the tests directory. You can comment on it however you like and indent the code with 4 spaces.
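The steps above might look like this in practice, assuming the test files are doctest-style interpreter transcripts (the text suggests this but does not state it). The session content and the run_session helper are hypothetical, and plain re is imported inside the session so that the sketch runs without pyre2; a real test file would use import re2 as re.

```python
import doctest

# A hypothetical session, as it might appear in tests/test_<name>.txt.
# Free-form commentary is allowed; code is indented with 4 spaces.
SESSION = """
Repeated letters should be found as a single match.

    >>> import re
    >>> re.findall(r'a+', 'caaandy')
    ['aaa']
"""


def run_session(text, name="test_example.txt"):
    """Run a transcript through doctest and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, {}, name, name, 0)
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.summarize(verbose=False)


results = run_session(SESSION)
print(results.failed, results.attempted)  # 0 2
```

Running the same transcript once with re and once with re2 is a quick way to spot behavioral differences before submitting a test.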
Credits
This code builds on the following projects (in chronological order):
Google’s RE2 regular expression library: https://github.com/google/re2
Facebook’s pyre2 github repository: http://github.com/facebook/pyre2/
Mike Axiak’s Cython version of this: http://github.com/axiak/pyre2/ (seems not actively maintained)
This fork adds Python 3 support and other improvements.