multiregex
Quickly match many regexes against a string. Provides 2-10x speedups over naïve regex matching.
Introduction
See this introductory blog post.
Installation
You can install the package in development mode using:
git clone https://github.com/quantco/multiregex
cd multiregex
pixi run pre-commit-install
pixi run postinstall
pixi run test
Usage
import multiregex
# Create matcher from multiple regexes.
my_patterns = [r"\w+@\w+\.com", r"\w+\.com"]
matcher = multiregex.RegexMatcher(my_patterns)
# Run `re.search` for all regexes.
# Returns a list of matches as (re.Pattern, re.Match) tuples.
matcher.search("john.doe@example.com")
# => [(re.compile('\\w+@\\w+\\.com'), <re.Match ... 'doe@example.com'>),
# (re.compile('\\w+\\.com'), <re.Match ... 'example.com'>)]
# Same as above, but with `re.match`.
matcher.match(...)
# Same as above, but with `re.fullmatch`.
matcher.fullmatch(...)
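For comparison, the "naïve regex matching" that the matcher replaces can be sketched with the standard library alone. The helper name naive_search is hypothetical and not part of multiregex; it only illustrates the shape of the result (a list of (pattern, match) tuples) and the work done when every regex is run against every string.

```python
import re


def naive_search(patterns, string):
    """Run re.search for every pattern and collect (pattern, match) tuples."""
    compiled = [re.compile(p) for p in patterns]
    results = []
    for pattern in compiled:
        match = pattern.search(string)
        if match is not None:
            results.append((pattern, match))
    return results


hits = naive_search([r"\w+@\w+\.com", r"\w+\.com"], "john.doe@example.com")
for pattern, match in hits:
    print(pattern.pattern, "->", match.group())
# \w+@\w+\.com -> doe@example.com
# \w+\.com -> example.com
```

Every pattern is evaluated on every call here, which is the cost that prematching tries to avoid.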
Custom prematchers
To be able to quickly match many regexes against a string, multiregex uses "prematchers" under the hood. Prematchers are lists of non-regex strings of which at least one can be assumed to be present in the haystack if the corresponding regex matches. As an example, a valid prematcher of r"\w+\.com" could be [".com"] and a valid prematcher of r"(B|b)aNäNa" could be ["b"] or ["anäna"].
Note that prematchers must be all-lowercase (in order for multiregex to be able to support re.IGNORECASE).
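The idea can be illustrated with a minimal sketch that uses plain substring checks. This is a simplified illustration under stated assumptions, not the library's actual implementation: the function prematched_search and its (regex, prematchers) input format are hypothetical, and the haystack is lowercased before the substring test so that all-lowercase prematchers still apply to mixed-case input.

```python
import re


def prematched_search(entries, haystack):
    """entries: list of (regex string, list of lowercase prematcher strings)."""
    lowered = haystack.lower()
    results = []
    for regex, prematchers in entries:
        # Cheap substring test first; the regex only runs if a prematcher occurs.
        if any(p in lowered for p in prematchers):
            match = re.search(regex, haystack)
            if match is not None:
                results.append((regex, match))
    return results


entries = [(r"\w+\.com", [".com"]), (r"(B|b)aNäNa", ["anäna"])]
found = prematched_search(entries, "Visit example.com for a BaNäNa")
skipped = prematched_search(entries, "nothing to see here")

print([m.group() for _, m in found])  # ['example.com', 'BaNäNa']
print(skipped)                        # []
```

For the second haystack neither regex is evaluated at all, because no prematcher occurs in it.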
You will likely have to provide your own prematchers for all but the simplest regex patterns:
multiregex.RegexMatcher([r"\d+"])
# => ValueError: Could not generate prematcher : '\\d+'
To provide custom prematchers, pass (pattern, prematchers) tuples:
multiregex.RegexMatcher([(r"\d+", map(str, range(10)))])
To use a mixture of automatic and custom prematchers, pass None as the prematchers of each pattern that should use automatic ones:
matcher = multiregex.RegexMatcher([(r"\d+", map(str, range(10))), (r"\w+\.com", None)])
matcher.prematchers
# => {(re.compile('\\d+'), {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}),
# (re.compile('\\w+\\.com'), {'com'})}
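Because a custom prematcher is only valid if at least one of its strings occurs whenever the regex matches, it can help to check candidates against sample data before using them. The following sketch does this with the standard library only; prematchers_cover is a hypothetical helper, not a multiregex function, and it can only find counterexamples in the samples it is given.

```python
import re


def prematchers_cover(regex, prematchers, samples):
    """Return the samples that the regex matches but no prematcher catches."""
    missed = []
    for sample in samples:
        lowered = sample.lower()
        if re.search(regex, sample) and not any(p in lowered for p in prematchers):
            missed.append(sample)
    return missed


digits = [str(d) for d in range(10)]
samples = ["order 42", "no digits", "7"]

print(prematchers_cover(r"\d+", digits, samples))      # []
print(prematchers_cover(r"\d+", ["0", "1"], samples))  # ['order 42', '7']
```

An empty result means no counterexample was found in the samples; a non-empty result means the prematchers would wrongly skip the regex for those strings.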
Disabling prematchers
To disable prematching for a certain pattern entirely (i.e., always run the regex without first running any prematchers), pass an empty list of prematchers:
multiregex.RegexMatcher([(r"super complicated regex", [])])
Profiling prematchers
To check if your prematchers are effective, you can use the built-in prematcher "profiler":
yyyy_mm_dd = r"(19|20)\d\d-\d\d-\d\d" # Default prematchers: {'-'}
matcher = multiregex.RegexMatcher([yyyy_mm_dd], count_prematcher_false_positives=True)
for string in my_benchmark_dataset:
    matcher.search(string)
print(matcher.format_prematcher_false_positives())
# => For example:
# FP count | FP rate | Pattern / Prematchers
# ---------+---------+----------------------
# 137 | 0.72 | (19|20)\d\d-\d\d-\d\d / {'-'}
In this example, there were 137 input strings that were matched positive by the prematcher but negative by the regex. In other words, the prematcher failed to prevent slow regex evaluation in 72% of the cases.
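What counts as a false positive can be made concrete with a small standard-library sketch. The helper count_false_positives is hypothetical and only mirrors the definition given above (prematcher hit, regex miss); how the library normalizes its "FP rate" column is not restated here, so the sketch returns raw counts only.

```python
import re


def count_false_positives(regex, prematchers, strings):
    """Count strings where a prematcher occurs but the regex does not match."""
    prematcher_hits = 0
    false_positives = 0
    for s in strings:
        if any(p in s.lower() for p in prematchers):
            prematcher_hits += 1
            # The regex has to run here; a miss means the prematcher did not help.
            if re.search(regex, s) is None:
                false_positives += 1
    return false_positives, prematcher_hits


yyyy_mm_dd = r"(19|20)\d\d-\d\d-\d\d"
strings = ["born 1999-12-31", "well-known", "a - b", "no dash"]
fp, hits = count_false_positives(yyyy_mm_dd, ["-"], strings)
print(fp, hits)  # 2 3
```

A prematcher as unspecific as "-" lets many non-date strings through, which is the kind of pattern the profiler output is meant to reveal.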