Skip to main content

Search a string for multiple substrings at once

Project description

ahocorasick_rs: Quickly search for multiple substrings at once

ahocorasick_rs allows you to search for multiple substrings ("patterns") in a given string ("haystack") using variations of the Aho-Corasick algorithm.

In particular, it's implemented as a wrapper of the Rust aho-corasick library, and provides a faster alternative to the pyahocorasick library.

Found any problems or have any questions? File an issue on the GitHub project.

Quickstart

The ahocorasick_rs library allows you to search for multiple strings ("patterns") within a haystack. For example, let's install the library:

$ pip install ahocorasick-rs

Then, we can construct a AhoCorasick object:

>>> import ahocorasick_rs
>>> patterns = ["hello", "world", "fish"]
>>> haystack = "this is my first hello world. hello!"
>>> ac = ahocorasick_rs.AhoCorasick(patterns)

You can construct a AhoCorasick object from any iterable (including generators), not just lists:

>>> ac = ahocorasick_rs.AhoCorasick((p.lower() for p in patterns))

AhoCorasick.find_matches_as_indexes() returns a list of tuples, each tuple being:

  1. The index of the found pattern inside the list of patterns.
  2. The start index of the pattern inside the haystack.
  3. The end index of the pattern inside the haystack.
>>> ac.find_matches_as_indexes(haystack)
[(0, 17, 22), (1, 23, 28), (0, 30, 35)]
>>> patterns[0], patterns[1], patterns[0]
('hello', 'world', 'hello')
>>> haystack[17:22], haystack[23:28], haystack[30:35]
('hello', 'world', 'hello')

find_matches_as_strings() returns a list of found patterns:

>>> ac.find_matches_as_strings(haystack)
['hello', 'world', 'hello']

Choosing the matching algorithm

Match kind

There are three ways you can configure matching in cases where multiple patterns overlap. For a more in-depth explanation, see the underlying Rust library's documentation of matching.

Assume we have this starting point:

>>> from ahocorasick_rs import AhoCorasick, MatchKind

Standard (the default)

This returns the pattern that matches first, semantically-speaking. This is the default matching pattern.

>>> ac AhoCorasick(["disco", "disc", "discontent"])
>>> ac.find_matches_as_strings("discontent")
['disc']
>>> ac = AhoCorasick(["b", "abcd"])
>>> ac.find_matches_as_strings("abcdef")
['b']

In this case disc will match before disco or discontent.

Similarly, b will match before abcd because it ends earlier in the haystack than abcd does:

>>> ac = AhoCorasick(["b", "abcd"])
>>> ac.find_matches_as_strings("abcdef")
['b']

LeftmostFirst

This returns the leftmost-in-the-haystack matching pattern that appears first in the list of given patterns. That means the order of patterns makes a difference:

>>> ac = AhoCorasick(["disco", "disc"], matchkind=MatchKind.LeftmostFirst)
>>> ac.find_matches_as_strings("discontent")
['disco']
>>> ac = AhoCorasick(["disc", "disco"], matchkind=MatchKind.LeftmostFirst)
['disc']

Here we see abcd matched first, because it starts before b:

>>> ac = AhoCorasick(["b", "abcd"], matchkind=MatchKind.LeftmostFirst)
>>> ac.find_matches_as_strings("abcdef")
['abcd']
LeftmostLongest

This returns the leftmost-in-the-haystack matching pattern that is longest:

>>> ac = AhoCorasick(["disco", "disc", "discontent"], matchkind=MatchKind.LeftmostLongest)
>>> ac.find_matches_as_strings("discontent")
['discontent']

Overlapping matches

You can get all overlapping matches, instead of just one of them, but only if you stick to the default matchkind, MatchKind.Standard:

>>> from ahocorasick_rs import AhoCorasick
>>> patterns = ["winter", "onte", "disco", "discontent"]
>>> ac = AhoCorasick(patterns)
>>> ac.find_matches_as_strings("discontent", overlapping=True)
['disco', 'onte', 'discontent']

Additional configuration: speed and memory usage tradeoffs

Algorithm implementations: trading construction speed, memory, and performance

You can choose the type of underlying automaton to use, with different performance tradeoffs. The short version: if you want maximum matching speed, and you don't have too many patterns, try the Implementation.DFA implementation and see if it helps.

The underlying Rust library supports four choices, which are exposed as follows:

  • None uses a heuristic to choose the "best" Aho-Corasick implementation for the given patterns, balancing construction time, memory usage, and matching speed. This is the default.
  • Implementation.NoncontiguousNFA: A noncontiguous NFA is the fastest to be built, has moderate memory usage and is typically the slowest to execute a search.
  • Implementation.ContiguousNFA: A contiguous NFA is a little slower to build than a noncontiguous NFA, has excellent memory usage and is typically a little slower than a DFA for a search.
  • Implementation.DFA: A DFA is very slow to build, uses exorbitant amounts of memory, but will typically execute searches the fastest.
>>> from ahocorasick_rs import AhoCorasick, Implementation
>>> ac = AhoCorasick(["disco", "disc"], implementation=Implementation.DFA)

Trading memory for speed

If you use find_matches_as_strings(), there are two ways strings can be constructed: from the haystack, or by caching the patterns on the object. The former takes more work, the latter uses more memory if the patterns would otherwise have been garbage-collected. You can control the behavior by using the store_patterns keyword argument to AhoCorasick().

  • AhoCorasick(..., store_patterns=None): The default. Use a heuristic (currently, whether the total of pattern string lengths is less than 4096 characters) to decide whether to store patterns or not.
  • AhoCorasick(..., store_patterns=True): Keep references to the patterns, potentially speeding up find_matches_as_strings() at the cost of using more memory. If this uses large amounts of memory this might actually slow things down due to pressure on the CPU memory cache, and/or the performance benefit might be overwhelmed by the algorithm's search time.
  • AhoCorasick(..., store_patterns=False): Don't keep references to the patterns, saving some memory but potentially slowing down find_matches_as_strings(), especially when there are only a small number of patterns and you are searching a small haystack.

Implementation details

  • Matching releases the GIL, to enable concurrency.
  • Not all features from the underlying library are exposed; if you would like additional features, please file an issue or submit a PR.

Benchmarks

As with any benchmark, real-world results will differ based on your particular situation. If performance is important to your application, measure the alternatives yourself!

That being said, I've seen ahocorasick_rs run 1.5× to 7× as fast as pyahocorasick, depending on the options used. You can run the included benchmarks, if you want, to see some comparative results locally. Clone the repository, then:

pip install pytest-benchmark ahocorasick_rs pyahocorasick
pytest benchmarks/

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ahocorasick_rs-0.16.0.tar.gz (66.1 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

ahocorasick_rs-0.16.0-cp311-none-win_amd64.whl (248.3 kB view details)

Uploaded CPython 3.11Windows x86-64

ahocorasick_rs-0.16.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (727.1 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

ahocorasick_rs-0.16.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (693.8 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ ARM64

ahocorasick_rs-0.16.0-cp311-cp311-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (711.4 kB view details)

Uploaded CPython 3.11macOS 10.9+ universal2 (ARM64, x86-64)macOS 10.9+ x86-64macOS 11.0+ ARM64

ahocorasick_rs-0.16.0-cp311-cp311-macosx_10_7_x86_64.whl (380.5 kB view details)

Uploaded CPython 3.11macOS 10.7+ x86-64

ahocorasick_rs-0.16.0-cp310-none-win_amd64.whl (248.3 kB view details)

Uploaded CPython 3.10Windows x86-64

ahocorasick_rs-0.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (727.1 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

ahocorasick_rs-0.16.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (693.8 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ ARM64

ahocorasick_rs-0.16.0-cp310-cp310-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (711.4 kB view details)

Uploaded CPython 3.10macOS 10.9+ universal2 (ARM64, x86-64)macOS 10.9+ x86-64macOS 11.0+ ARM64

ahocorasick_rs-0.16.0-cp310-cp310-macosx_10_7_x86_64.whl (380.5 kB view details)

Uploaded CPython 3.10macOS 10.7+ x86-64

ahocorasick_rs-0.16.0-cp39-none-win_amd64.whl (248.5 kB view details)

Uploaded CPython 3.9Windows x86-64

ahocorasick_rs-0.16.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (727.6 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

ahocorasick_rs-0.16.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (694.3 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ ARM64

ahocorasick_rs-0.16.0-cp39-cp39-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (712.6 kB view details)

Uploaded CPython 3.9macOS 10.9+ universal2 (ARM64, x86-64)macOS 10.9+ x86-64macOS 11.0+ ARM64

ahocorasick_rs-0.16.0-cp39-cp39-macosx_10_7_x86_64.whl (381.2 kB view details)

Uploaded CPython 3.9macOS 10.7+ x86-64

ahocorasick_rs-0.16.0-cp38-none-win_amd64.whl (248.2 kB view details)

Uploaded CPython 3.8Windows x86-64

ahocorasick_rs-0.16.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (727.6 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

ahocorasick_rs-0.16.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (694.6 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ ARM64

ahocorasick_rs-0.16.0-cp38-cp38-macosx_10_7_x86_64.whl (381.4 kB view details)

Uploaded CPython 3.8macOS 10.7+ x86-64

File details

Details for the file ahocorasick_rs-0.16.0.tar.gz.

File metadata

  • Download URL: ahocorasick_rs-0.16.0.tar.gz
  • Upload date:
  • Size: 66.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for ahocorasick_rs-0.16.0.tar.gz
Algorithm Hash digest
SHA256 a9012c15e2d66fdbacf00660909a1776191bcd5f29efc38526d7d9b86a58f14c
MD5 196a850630d2d6fa8ed5a117ed405cf9
BLAKE2b-256 845cfea759e2d0572c6053d83ce23f69bfccf31f932161be0ccba84211cab65c

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp311-none-win_amd64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp311-none-win_amd64.whl
Algorithm Hash digest
SHA256 c6f97536184299cf2eac326a7c4edf94d40f2267b71e3d20fd29ce948aa54cc6
MD5 47236581ab609d56c73bada189fe3f83
BLAKE2b-256 c1a8f3f19356e9b4e1835e22b874a83c6bcd7b310e12107ae8fe19d2807feac5

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c2c8eca576b4911f37a99cabcc339debca21a28dc77ad9fc5b11cc06bf56646a
MD5 aca09c5f37e8823919c9367684adb17e
BLAKE2b-256 3c54266e8585b0ae3810e159e770077b8d6802f3fb8b72b97c70c26bafda1f16

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 c6ea363d9d8051291451d9d179071897e5530c252ca7defd35f75106385822af
MD5 2fa3e3078466c1708478e9d60eeda46d
BLAKE2b-256 bb191e268073e71c02a298c36b071331c08111565b533a59228d0f50857514c5

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp311-cp311-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp311-cp311-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl
Algorithm Hash digest
SHA256 ddc54abeb4bff92b97da91ee1e13b42dc406681ce5185de4dbda30af38aa1151
MD5 9fe599c96d206a060d9eb4829285d904
BLAKE2b-256 1bca9fc1f70c8e9caacc03c9ce59307d534e276494fb9122ec3242bd50c1965c

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp311-cp311-macosx_10_7_x86_64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp311-cp311-macosx_10_7_x86_64.whl
Algorithm Hash digest
SHA256 b53dcbd82860beff99052b6b7b615bfac953c96104c4b8c2f2728bc4917545b8
MD5 d28a0798d6788577287f91b52dfb22c4
BLAKE2b-256 2a3de30eab20c554155bf5c280dd5607b1d7d033bb506de7810133751c5d7247

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp310-none-win_amd64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp310-none-win_amd64.whl
Algorithm Hash digest
SHA256 80eb243e1216da60083db0c309012ff5a992fbb9c6ba981a1efcb5f078f6ac59
MD5 65f02b1822e207a48cf665f34cb9bf32
BLAKE2b-256 a67a5a53775c8a198ecbcb2ed3c6b43f195ab8a88b4ff5f2f4ca2bf3d9852178

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1cf0ac8f24abb754365f306255fbda43ffdeff1335075fa8533a9457fd5991b0
MD5 270875318827cdbd8d65419474e23601
BLAKE2b-256 fea22ac1fc9a7d1603cb7524b38d88da319d7e969de25f906e901620dea9b8a7

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 2a6a42d8dc429ef5755f16a79bd0f0f1dde7bf89c132474f075281f83c3f3b09
MD5 ddf024a9ced1f5c4c485d3b0d35ffbf1
BLAKE2b-256 d85743d9c17b60fae650a13ae81bd30c7638895decc65aa5577cb10205fa0882

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp310-cp310-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp310-cp310-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl
Algorithm Hash digest
SHA256 11dbf23d76950ddad3eae03d72cf74f8e708d105a399b091015d2e7c0b862a59
MD5 d76c74040b28d22143fecfa0580d1de1
BLAKE2b-256 e5a7cffe410ddb75c792fa94267e3d3a502772373583ad661342a73f0d06fe70

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp310-cp310-macosx_10_7_x86_64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp310-cp310-macosx_10_7_x86_64.whl
Algorithm Hash digest
SHA256 a36727e171641cbe29a76d1d1810bd76fdbac3740d165b61c877796b987ae896
MD5 e054b2d61d5df0c3b37ead12ee820811
BLAKE2b-256 d69bcc56edabde67b01e0b1a22b236eef01db0b51b495e103bb04d9a6a9e3ed8

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp39-none-win_amd64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp39-none-win_amd64.whl
Algorithm Hash digest
SHA256 45502664a663bfa9409afd51cfd2cf91b0585f65df6cee1856d260b89331050f
MD5 8c2920bf048a1f1f9d0a7d53d43cada3
BLAKE2b-256 76a2513c6cbdbbc2236fefbb91eee533714f7678878e24682325807189562978

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 422524709f5ae829db17fe8ba73f049d25b4ece3d3dc64cf0ff33833765b4d49
MD5 17a06fe9b5bc23ca2cfd0c557205861d
BLAKE2b-256 5afc3eb835e3793bd01b6d5aaf27cad4d3cae967b9da25ecda2694dcac854eaa

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 1c390d37f5a181b60ae7604e71a3707874e7882aecc2970cf82b44cbd6be278e
MD5 6830f3409d977d15930114aefd65f9ca
BLAKE2b-256 551bce160e54093dcb4410f2715acb1505653d332a961ba97a4f89b30d8c1b75

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp39-cp39-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp39-cp39-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl
Algorithm Hash digest
SHA256 ab50ddad2ce3254a1e5a1dfc292d511be2b3c8fb098f6dd40c1e67304b702029
MD5 de35140d13d462a7e14f29bd4ecae12d
BLAKE2b-256 602f511974d94ad8d6a036df9606693cbce23eec649c7f6e37bdbb501e2b47c1

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp39-cp39-macosx_10_7_x86_64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp39-cp39-macosx_10_7_x86_64.whl
Algorithm Hash digest
SHA256 2dbe338caae36d5204a3fbec5450b4686c83f1f683b912fdeb68139c4657ab7d
MD5 860b615f0d62b6c0d9c76edccd0547a9
BLAKE2b-256 8c31c572fce53a9b080637ae602849a3f97574647614339940a406195e2e699a

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp38-none-win_amd64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp38-none-win_amd64.whl
Algorithm Hash digest
SHA256 1f14073cda9d641a958f6c91b626fd71f7652bc6bc30b338a15a8a4fc896313c
MD5 b3209619efd02a6f270361d823ae2b39
BLAKE2b-256 40da72f8a84d5d3be7f296f819db2453e65b1cc6bab80ea97f6266d0af11830c

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 46bb56bb5438377ecbad9833df67f66c896c9d39a53e121dad87bdc06369c7fd
MD5 65fc184a66ccf19713dee4bf7e5065c7
BLAKE2b-256 dff060cea42a59cc29d833939eb9146966ead10f2595fbeddb47bedfe7df3e44

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 4059d4c84e06445cc5504930e51f72d264c9dfcd505ecc5473c424106191b6d6
MD5 f23c97ce28f3897556034ce2a8449a5e
BLAKE2b-256 86b7be0f3111483021eb88b7b504368437f41b46455a4225805e7c12f7ee220c

See more details on using hashes here.

File details

Details for the file ahocorasick_rs-0.16.0-cp38-cp38-macosx_10_7_x86_64.whl.

File metadata

File hashes

Hashes for ahocorasick_rs-0.16.0-cp38-cp38-macosx_10_7_x86_64.whl
Algorithm Hash digest
SHA256 7cd7b9b3ef97b8e0992258d53efc3a37e21ff986981fc0745ded3f0bd5fea49f
MD5 73364abcc00e8548fa17cbe7962f8c6e
BLAKE2b-256 75d6409f159e0a677dec7cd2def3b10611171e67aae707977c94ba9790ccdd82

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page