A convenient implementation of the Aho-Corasick algorithm to efficiently find multiple search patterns and process the matches
Introduction
Multimatcher is an implementation of the Aho-Corasick search algorithm (Aho & Corasick, 1975). It finds multiple keywords in a single pass over the input string, instead of looping over the string once per keyword.
The rationale behind Multimatcher is that we usually want to do something with the matches once they are found. To that end, it provides a flexible "replace" method that supports use cases such as:
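To make the single-pass idea concrete, here is a minimal, self-contained sketch of Aho-Corasick matching in plain Python. It is an illustration only, not Multimatcher's actual code: the names `build_automaton` and `find_all` are hypothetical, and the sketch assumes non-empty patterns and reports overlapping matches as (start index, pattern) pairs.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output lists (hypothetical helper)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix that is in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pat in patterns:
        if not pat:
            continue  # empty patterns are ignored in this sketch
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)

    # Breadth-first traversal so that shallower nodes are finished first.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every match, reading text once."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches


print(find_all("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The input is read exactly once; the failure links let the matcher fall back to the longest suffix that is still a prefix of some pattern, so it never has to restart from the beginning of the text.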
- find and delete
- find and replace
- tag with a global label (i.e. all matches get the same label)
- tag with custom label (i.e. each match gets its own label)
- count matches
When possible, it's recommended to set whole_words_only to True, which makes matching significantly faster.
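One plausible reason whole-word matching can be faster: if a match must coincide with a complete separator-delimited token, each token can be checked with a single set lookup rather than a character-by-character scan. The sketch below illustrates that idea in plain Python. It is an assumption about why the option helps, not a description of Multimatcher's internals; `count_whole_words` is a hypothetical helper and handles single-token patterns only.

```python
from collections import Counter


def count_whole_words(text, patterns, separator=" "):
    """Count tokens that are exactly equal to one of the patterns."""
    wanted = set(patterns)  # average O(1) membership test per token
    tokens = text.split(separator)
    return dict(Counter(tok for tok in tokens if tok in wanted))


print(count_whole_words("x a y b z c a", ["a", "b", "c"]))
# {'a': 2, 'b': 1, 'c': 1}
```

Note that with whole-word semantics the text "aa xx" contains no match for the pattern "a", whereas substring matching would find two.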
Examples
Find and delete matches
from multimatcher import Multimatcher
mm = Multimatcher(separator=' ')
mm.set_replacement_text("") # matches will be deleted
mm.set_search_patterns(['a', 'b', 'c'])
mm.replace("x a y b z c") # produces "x y z"
Find and transform matches
from multimatcher import Multimatcher
mm = Multimatcher(separator=' ')
mm.set_replacement_method(lambda x: x.capitalize()) # matches will be capitalized
mm.set_search_patterns(['a', 'b', 'c'])
mm.replace("x a y b z c") # produces "x A y B z C"
Find and replace matches with the same label
from multimatcher import Multimatcher
mm = Multimatcher(separator=' ')
mm.set_replacement_text("0") # all matches will be replaced with 0
mm.set_search_patterns(['a', 'b', 'c'])
mm.replace("x a y b z c") # produces "x 0 y 0 z 0"
Find and replace matches with custom labels
from multimatcher import Multimatcher
mm = Multimatcher(separator=' ')
mm.set_replacement_map({"a": "1", "b": "2", "c": "3"}) # replaces a > 1, b > 2, c > 3
mm.set_search_patterns(['a', 'b', 'c'])
mm.replace("x a y b z c") # produces "x 1 y 2 z 3"
Count matches
from multimatcher import Multimatcher
mm = Multimatcher(separator='')
mm.set_search_patterns(['a', 'b', 'c'])
mm.count("aa xx bb yy cc zz") # produces {'a': 2, 'b': 2, 'c': 2}
References
Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6), 333-340.
File details
Details for the file multimatcher-0.0.3.tar.gz.
File metadata
- Download URL: multimatcher-0.0.3.tar.gz
- Upload date:
- Size: 8.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | c46f4a2b00dafd8be3e72b76fca2f22bfa7d27edfcf6e058ce006627a502e96a
MD5 | bde7ed98f750bfe2330a32b235d6d42c
BLAKE2b-256 | 407ddfd5cc6533139bfb5fd493973fb6b869ecc5adfbb3d17cdd87d54649d92c
File details
Details for the file multimatcher-0.0.3-py3-none-any.whl.
File metadata
- Download URL: multimatcher-0.0.3-py3-none-any.whl
- Upload date:
- Size: 8.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1a57dcee52fade48764486534688a7d639f5e05eeb9cca77783bfaf7968e90f5
MD5 | 8e05ee01360403524bc4c35202e2ac97
BLAKE2b-256 | b88780e4ac82e3c4ee59d2ed13e41b665ca494f8b2e5acecb1ab83c311f018d0