Quickly match many regexes against a string. Provides 2-10x speedups over naïve regex matching.

multiregex

Introduction

See this introductory blog post.

Installation

This project is managed by pixi. You can install the package in development mode using:

git clone https://github.com/quantco/multiregex
cd multiregex

pixi run pre-commit-install
pixi run postinstall
pixi run test

Usage

import multiregex

# Create matcher from multiple regexes.
my_patterns = [r"\w+@\w+\.com", r"\w+\.com"]
matcher = multiregex.RegexMatcher(my_patterns)

# Run `re.search` for all regexes.
# Returns a list of matches as (re.Pattern, re.Match) tuples.
matcher.search("john.doe@example.com")
# => [(re.compile('\\w+@\\w+\\.com'), <re.Match ... 'doe@example.com'>),
#     (re.compile('\\w+\\.com'), <re.Match ... 'example.com'>)]

# Same as above, but with `re.match`.
matcher.match(...)
# Same as above, but with `re.fullmatch`.
matcher.fullmatch(...)
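For comparison, the naïve approach that multiregex is meant to speed up can be sketched with the standard library alone. The helper name `naive_search` is hypothetical and not part of multiregex; it only illustrates the (pattern, match) result shape shown above.

```python
import re

# Naive baseline: compile every pattern and try each one in turn.
patterns = [re.compile(p) for p in (r"\w+@\w+\.com", r"\w+\.com")]


def naive_search(patterns, string):
    """Return (pattern, match) tuples for every pattern that matches."""
    results = []
    for pattern in patterns:
        match = pattern.search(string)
        if match is not None:
            results.append((pattern, match))
    return results


hits = naive_search(patterns, "john.doe@example.com")
print([m.group(0) for _, m in hits])
# => ['doe@example.com', 'example.com']
```

Every pattern is evaluated for every input string here, which is the cost that prematchers try to avoid.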

Custom prematchers

To be able to quickly match many regexes against a string, multiregex uses "prematchers" under the hood. Prematchers are lists of non-regex strings of which at least one can be assumed to be present in the lowercased haystack if the corresponding regex matches. As an example, a valid prematcher of r"\w+\.com" could be [".com"] and a valid prematcher of r"(B|b)aNäNa" could be ["b"] or ["anäna"]. Note that prematchers must be all-lowercase (in order for multiregex to be able to support re.IGNORECASE).
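The idea can be sketched in a few lines of plain Python. This is only an illustration of the prematching concept under simple assumptions (a linear scan with `in` checks), not the library's actual implementation; `search_with_prematchers` is a hypothetical name.

```python
import re


def search_with_prematchers(entries, string):
    """entries: list of (compiled pattern, list of lowercase prematchers)."""
    # Prematchers are lowercase, so lowercase the haystack once up front.
    haystack = string.lower()
    results = []
    for pattern, prematchers in entries:
        # Cheap substring test first: if no prematcher occurs in the
        # haystack, the regex cannot match and is skipped entirely.
        if not any(p in haystack for p in prematchers):
            continue
        match = pattern.search(string)
        if match is not None:
            results.append((pattern, match))
    return results


entries = [
    (re.compile(r"\w+\.com"), [".com"]),
    (re.compile(r"(B|b)aNäNa"), ["anäna"]),
]

print([m.group(0) for _, m in search_with_prematchers(entries, "visit example.com")])
# => ['example.com']
print([m.group(0) for _, m in search_with_prematchers(entries, "BaNäNa split")])
# => ['BaNäNa']
```

The regex itself still runs against the original string; only the prematcher test uses the lowercased copy.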

You will likely have to provide your own prematchers for all but the simplest regex patterns:

multiregex.RegexMatcher([r"\d+"])
# => ValueError: Could not generate prematcher : '\\d+'

To provide custom prematchers, pass (pattern, prematchers) tuples:

multiregex.RegexMatcher([(r"\d+", map(str, range(10)))])

To use a mixture of automatic and custom prematchers, pass prematchers=None:

matcher = multiregex.RegexMatcher([(r"\d+", map(str, range(10))), (r"\w+\.com", None)])
matcher.prematchers
# => {(re.compile('\\d+'), {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}),
#     (re.compile('\\w+\\.com'), {'com'})}
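Before relying on a hand-written prematcher, it can help to sanity-check it against strings that are known to match. The helper below is a hypothetical sketch using only the standard library (it is not part of multiregex): it reports samples that match the regex but contain none of the candidate prematchers, which would mean the prematcher is invalid.

```python
import re


def check_prematchers(pattern, prematchers, samples, flags=0):
    """Return samples that match `pattern` but contain no prematcher."""
    prematchers = list(prematchers)  # accept any iterable, e.g. map(...)
    if any(p != p.lower() for p in prematchers):
        raise ValueError("prematchers must be all-lowercase")
    regex = re.compile(pattern, flags)
    missed = []
    for sample in samples:
        haystack = sample.lower()
        if regex.search(sample) and not any(p in haystack for p in prematchers):
            missed.append(sample)
    return missed


# All ten ASCII digits cover these samples.
print(check_prematchers(r"\d+", map(str, range(10)), ["room 42", "no digits"]))
# => []

# An incomplete prematcher list misses "room 7".
print(check_prematchers(r"\d+", ["1", "2"], ["room 42", "room 7"]))
# => ['room 7']
```

An empty result only shows that the prematchers hold for the samples given, so the sample set should cover the variety of inputs expected in practice.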

Disabling prematchers

To disable prematching for a certain pattern entirely (i.e., always run the regex without first running any prematchers), pass an empty list of prematchers:

multiregex.RegexMatcher([(r"super complicated regex", [])])

Profiling prematchers

To check if your prematchers are effective, you can use the built-in prematcher "profiler":

yyyy_mm_dd = r"(19|20)\d\d-\d\d-\d\d"  # Default prematchers: {'-'}
matcher = multiregex.RegexMatcher([yyyy_mm_dd], count_prematcher_false_positives=True)
for string in my_benchmark_dataset:
    matcher.search(string)
print(matcher.format_prematcher_false_positives())
# => For example:
# FP count | FP rate | Pattern / Prematchers
# ---------+---------+----------------------
#      137 |    0.72 | (19|20)\d\d-\d\d-\d\d / {'-'}

In this example, there were 137 input strings that were matched positive by the prematcher but negative by the regex. In other words, the prematcher failed to prevent slow regex evaluation in 72% of the cases.
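The bookkeeping behind such a report can be approximated as follows. This is a sketch under stated assumptions, not the library's code: the function name is hypothetical, and the rate is assumed to be the false-positive count divided by the number of input strings, which is consistent with "72% of the cases" above but may differ from how multiregex computes it.

```python
import re


def prematcher_false_positives(pattern, prematchers, strings):
    """Count strings that pass the prematcher but fail the regex."""
    regex = re.compile(pattern)
    false_positives = 0
    total = 0
    for s in strings:
        total += 1
        prematched = any(p in s.lower() for p in prematchers)
        if prematched and regex.search(s) is None:
            false_positives += 1
    # Assumption: rate is relative to all inputs, not only prematched ones.
    rate = false_positives / total if total else 0.0
    return false_positives, rate


yyyy_mm_dd = r"(19|20)\d\d-\d\d-\d\d"
strings = ["2021-03-04", "well-known", "a - b", "no dash", "1999-12-31 is a date"]

count, rate = prematcher_false_positives(yyyy_mm_dd, ["-"], strings)
print(count, rate)
# => 2 0.4

# A more selective prematcher: every match must contain "19" or "20".
better_count, better_rate = prematcher_false_positives(yyyy_mm_dd, ["19", "20"], strings)
print(better_count, better_rate)
# => 0 0.0
```

On this toy dataset, the more selective prematchers ["19", "20"] avoid both wasted regex evaluations; whether they help on real data depends on how often those substrings occur in non-matching inputs.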

