multiregex

Quickly match many regexes against a string. Provides 2-10x speedups over naïve regex matching.

Introduction

See this introductory blog post.

Installation

You can install the package in development mode using:

git clone https://github.com/quantco/multiregex
cd multiregex

pixi run pre-commit-install
pixi run postinstall
pixi run test

Usage

import multiregex

# Create matcher from multiple regexes.
my_patterns = [r"\w+@\w+\.com", r"\w+\.com"]
matcher = multiregex.RegexMatcher(my_patterns)

# Run `re.search` for all regexes.
# Returns a list of matches as (re.Pattern, re.Match) tuples.
matcher.search("john.doe@example.com")
# => [(re.compile('\\w+@\\w+\\.com'), <re.Match ... 'doe@example.com'>),
#     (re.compile('\\w+\\.com'), <re.Match ... 'example.com'>)]

# Same as above, but with `re.match`.
matcher.match(...)
# Same as above, but with `re.fullmatch`.
matcher.fullmatch(...)
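
For comparison, the "naïve regex matching" that multiregex aims to speed up can be sketched with the standard library alone. The helper name naive_search is hypothetical and is not part of multiregex; it simply runs re.search once per pattern and collects the hits in the same (pattern, match) shape as above.

```python
import re


def naive_search(patterns, string):
    """Run re.search for every pattern and collect (pattern, match) pairs."""
    results = []
    for raw in patterns:
        pattern = re.compile(raw)
        match = pattern.search(string)
        if match is not None:
            results.append((pattern, match))
    return results


hits = naive_search([r"\w+@\w+\.com", r"\w+\.com"], "john.doe@example.com")
matched_texts = [match.group(0) for _, match in hits]
print(matched_texts)
```

Every pattern is evaluated on every input here, which is the cost that prematchers are meant to avoid.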

Custom prematchers

To quickly match many regexes against a string, multiregex uses "prematchers" under the hood. A prematcher is a list of plain (non-regex) strings, at least one of which must be present in the haystack whenever the corresponding regex matches. For example, a valid prematcher of r"\w+\.com" could be [".com"], and a valid prematcher of r"(B|b)aNäNa" could be ["b"] or ["anäna"]. Note that prematchers must be all-lowercase so that multiregex can support re.IGNORECASE.
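
The idea can be illustrated with a minimal sketch that uses only the standard library. It is not multiregex's actual implementation, which may search for the prematcher strings far more efficiently; the function name search_with_prematchers and the (pattern, prematchers) entry layout are assumptions made for this illustration.

```python
import re


def search_with_prematchers(entries, string):
    """entries: list of (compiled pattern, set of lowercase prematcher strings)."""
    haystack = string.lower()  # prematchers are lowercase, so compare in lowercase
    results = []
    for pattern, prematchers in entries:
        # Cheap substring test first: skip the regex if no prematcher occurs.
        # An empty prematcher set means "always run the regex".
        if prematchers and not any(p in haystack for p in prematchers):
            continue
        match = pattern.search(string)
        if match is not None:
            results.append((pattern, match))
    return results


entries = [
    (re.compile(r"\w+\.com"), {".com"}),
    (re.compile(r"(B|b)aNäNa"), {"anäna"}),
]
found = search_with_prematchers(entries, "Visit example.com")
print([match.group(0) for _, match in found])
```

For the input above, the second regex is never evaluated because "anäna" does not occur in the lowercased haystack.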

You will likely have to provide your own prematchers for all but the simplest regex patterns:

multiregex.RegexMatcher([r"\d+"])
# => ValueError: Could not generate prematcher : '\\d+'

To provide custom prematchers, pass (pattern, prematchers) tuples:

multiregex.RegexMatcher([(r"\d+", map(str, range(10)))])
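
A custom prematcher is only valid if it never rules out a string that the regex would match. One way to gain some confidence in a hand-written prematcher is to test it against sample strings, as in the sketch below. The helper check_prematchers is hypothetical and not part of multiregex, and passing the check on a sample does not prove the prematcher is valid for all inputs.

```python
import re


def check_prematchers(pattern, prematchers, samples):
    """Return the samples where the regex matches but no prematcher is found."""
    prematchers = list(prematchers)  # allow iterators such as map(...)
    if any(p != p.lower() for p in prematchers):
        raise ValueError("prematchers must be all-lowercase")
    regex = re.compile(pattern)
    return [
        sample
        for sample in samples
        if regex.search(sample) and not any(p in sample.lower() for p in prematchers)
    ]


samples = ["order 42", "no digits", "x9"]
missed_ok = check_prematchers(r"\d+", map(str, range(10)), samples)
missed_bad = check_prematchers(r"\d+", ["1"], samples)
print(missed_ok, missed_bad)
```

An empty result means no sample was wrongly filtered out; any returned string is a match that the prematcher would have hidden.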

To use a mixture of automatic and custom prematchers, pass prematchers=None:

matcher = multiregex.RegexMatcher([(r"\d+", map(str, range(10))), (r"\w+\.com", None)])
matcher.prematchers
# => {(re.compile('\\d+'), {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}),
#     (re.compile('\\w+\\.com'), {'com'})}

Disabling prematchers

To disable prematching for a certain pattern entirely (i.e., always run the regex without first running any prematchers), pass an empty list of prematchers:

multiregex.RegexMatcher([(r"super complicated regex", [])])

Profiling prematchers

To check if your prematchers are effective, you can use the built-in prematcher "profiler":

yyyy_mm_dd = r"(19|20)\d\d-\d\d-\d\d"  # Default prematchers: {'-'}
matcher = multiregex.RegexMatcher([yyyy_mm_dd], count_prematcher_false_positives=True)
for string in my_benchmark_dataset:
    matcher.search(string)
print(matcher.format_prematcher_false_positives())
# => For example:
# FP count | FP rate | Pattern / Prematchers
# ---------+---------+----------------------
#      137 |    0.72 | (19|20)\d\d-\d\d-\d\d / {'-'}

In this example, there were 137 input strings that were matched positive by the prematcher but negative by the regex. In other words, the prematcher failed to prevent slow regex evaluation in 72% of the cases.
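
To make the notion of a prematcher false positive concrete, the following standard-library sketch counts them by hand. It is an illustration only: the function name is hypothetical, and computing the rate as false positives divided by the total number of input strings is an assumption that may differ from how multiregex defines its "FP rate" column.

```python
import re


def prematcher_false_positives(pattern, prematchers, strings):
    """Count strings that pass the prematcher but do not match the regex."""
    regex = re.compile(pattern)
    fp_count = 0
    for string in strings:
        prematched = any(p in string.lower() for p in prematchers)
        if prematched and regex.search(string) is None:
            fp_count += 1
    # Assumption: rate is relative to all input strings.
    fp_rate = fp_count / len(strings) if strings else 0.0
    return fp_count, fp_rate


dataset = ["2021-05-17", "well-known", "e-mail", "plain text"]
fp_count, fp_rate = prematcher_false_positives(
    r"(19|20)\d\d-\d\d-\d\d", {"-"}, dataset
)
print(fp_count, fp_rate)
```

In this toy dataset, "well-known" and "e-mail" contain "-" but are not dates, so the prematcher {'-'} lets them through to the regex unnecessarily.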
