Efficient Trie-based regex unions for blacklist/whitelist filtering and one-pass mapping-based string replacing
retrie
retrie offers fast methods to match and replace (sequences of) strings based on efficient Trie-based regex unions.
Trie
Instead of matching against a simple regex union, which becomes slow for large sets of words, a more efficient regex pattern can be compiled using a Trie structure:
from retrie.trie import Trie
trie = Trie()
assert trie.pattern() == ""
for term in ["abc", "foo", "abs"]:
    trie.add(term)
assert trie.pattern() == "(?:ab[cs]|foo)" # equivalent to but faster than "(?:abc|abs|foo)"
trie.add("absolute")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?)|foo)"
trie.add("abx")
assert trie.pattern() == "(?:ab(?:[cx]|s(?:olute)?)|foo)"
trie.add("abxy")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?|xy?)|foo)"
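To illustrate how a Trie can be turned into such a pattern, here is a minimal, self-contained sketch. It is not retrie's actual implementation: the names `build_trie` and `to_pattern` are hypothetical, and the ordering rules (sorted keys, single characters grouped first) are assumptions chosen so that the sketch reproduces the patterns shown above for these inputs.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the "" key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def to_pattern(node):
    """Recursively convert a trie node into a regex fragment."""
    optional = "" in node  # a word ends here, so the rest is optional
    chars, alts = [], []
    for ch in sorted(key for key in node if key):
        sub = to_pattern(node[ch])
        if sub:
            alts.append(re.escape(ch) + sub)
        else:
            chars.append(re.escape(ch))  # branch ends after this character
    atomic = not alts  # only single characters at this level
    if len(chars) == 1:
        alts.insert(0, chars[0])
    elif chars:
        alts.insert(0, "[" + "".join(chars) + "]")
    if not alts:
        return ""
    if len(alts) == 1 and (atomic or not optional):
        body = alts[0]
    else:
        body = "(?:" + "|".join(alts) + ")"
    return body + "?" if optional else body


words = ["abc", "foo", "abs"]
pattern = to_pattern(build_trie(words))
```

Shared prefixes are emitted only once, so the regex engine does not have to retry every word from the start at each position, which is where the speed-up over a flat union comes from.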
Installation
This pure-Python, OS-independent package is available on PyPI:
$ pip install retrie
Usage
The following objects are all subclasses of retrie.retrie.Retrie, which handles filling the Trie and compiling the corresponding regex pattern.
Blacklist
The Blacklist object can be used to filter out bad occurrences in a text or a sequence of strings:
from retrie.retrie import Blacklist
blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=False)
blacklist.compiled # re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good", "foobar")
assert blacklist.cleanse_text(("good abc foobar")) == "good  foobar"
blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=True)
blacklist.compiled # re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good",)
assert blacklist.cleanse_text(("good abc foobar")) == "good  bar"
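The effect of match_substrings can be reproduced with the standard re module alone, using the two compiled patterns shown in the comments above. This is only a sketch: the helpers `filter_out` and `cleanse` are hypothetical stand-ins, and it assumes that filtering drops any string in which the pattern is found and that cleansing substitutes matches with an empty string.

```python
import re

FLAGS = re.IGNORECASE | re.UNICODE

# Patterns copied from the comments above.
whole_words = re.compile(r"(?<=\b)(?:ab[cs]|foo)(?=\b)", FLAGS)
substrings = re.compile(r"(?:ab[cs]|foo)", FLAGS)


def filter_out(compiled, sequence):
    """Keep only the strings in which the pattern is not found."""
    return tuple(s for s in sequence if not compiled.search(s))


def cleanse(compiled, text):
    """Remove every match; surrounding whitespace is left untouched."""
    return compiled.sub("", text)


terms = ("good", "abc", "foobar")
kept_whole = filter_out(whole_words, terms)
kept_sub = filter_out(substrings, terms)
```

With word boundaries, "foobar" survives because "foo" is followed by a word character; without them, any occurrence of a listed term is enough to reject the string.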
Whitelist
Similar methods are available for the Whitelist object:
from retrie.retrie import Whitelist
whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=False)
whitelist.compiled # re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc",)
assert whitelist.cleanse_text(("good abc foobar")) == "abc"
whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=True)
whitelist.compiled # re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc", "foobar")
assert whitelist.cleanse_text(("good abc foobar")) == "abcfoo"
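The whitelist behaviour mirrors the blacklist one: matches are kept instead of removed. The following sketch uses only the standard re module; `keep_listed` and `extract` are hypothetical helpers, and it assumes that cleansing simply concatenates all matches, which is consistent with the "abcfoo" result above.

```python
import re

FLAGS = re.IGNORECASE | re.UNICODE

# Patterns copied from the comments above.
whole_words = re.compile(r"(?<=\b)(?:ab[cs]|foo)(?=\b)", FLAGS)
substrings = re.compile(r"(?:ab[cs]|foo)", FLAGS)


def keep_listed(compiled, sequence):
    """Keep only the strings in which the pattern is found."""
    return tuple(s for s in sequence if compiled.search(s))


def extract(compiled, text):
    """Concatenate all matches, discarding everything in between."""
    return "".join(compiled.findall(text))


terms = ("bad", "abc", "foobar")
listed_whole = keep_listed(whole_words, terms)
listed_sub = keep_listed(substrings, terms)
```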
Replacer
The Replacer object can search & replace occurrences of replacement_mapping.keys() with corresponding values.
from retrie.retrie import Replacer
replacer = Replacer(
    replacement_mapping=dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"])),
    match_substrings=True,
)
replacer.compiled # re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... new2bar"
replacer = Replacer(
    replacement_mapping=dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"])),
    match_substrings=False,
)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... foobar"
replacer = Replacer(
    replacement_mapping=dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"])),
    match_substrings=False,
    re_flags=None,
)
assert replacer.replace("ABS ...foo... foobar") == "ABS ...new2... foobar"
replacer = Replacer(
    replacement_mapping=dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"])),
    match_substrings=False,
    word_boundary=" ",
)
assert replacer.replace(". ABS ...foo... foobar") == ". new3 ...foo... foobar"
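One-pass mapping-based replacement can be sketched with re.sub and a callback that looks up each match in the mapping. This is an illustration under stated assumptions, not retrie's code: `make_replacer` is a hypothetical helper, it builds a plain regex union rather than a Trie-based pattern, and it assumes the boundary is wrapped in a lookbehind and a lookahead as in the compiled patterns shown earlier.

```python
import re


def make_replacer(mapping, boundary=None, flags=re.IGNORECASE | re.UNICODE):
    """Return a function replacing all mapping keys in a single pass."""
    # Longer keys first, so the longest alternative wins on overlap.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = "(?:" + "|".join(re.escape(k) for k in keys) + ")"
    if boundary is not None:
        pattern = "(?<=" + boundary + ")" + pattern + "(?=" + boundary + ")"
    compiled = re.compile(pattern, flags)

    ignore_case = bool(flags & re.IGNORECASE)
    if ignore_case:
        lookup = {k.lower(): v for k, v in mapping.items()}
    else:
        lookup = dict(mapping)

    def replace(text):
        def substitute(match):
            found = match.group()
            return lookup[found.lower() if ignore_case else found]

        return compiled.sub(substitute, text)

    return replace


mapping = dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"]))
text = "ABS ...foo... foobar"
replaced_substrings = make_replacer(mapping)(text)
replaced_words = make_replacer(mapping, boundary=r"\b")(text)
```

Because every key is matched by one compiled pattern, the text is scanned once regardless of how many replacements the mapping contains.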
Development
Create a virtual environment.
python -m venv .venv
source .venv/bin/activate
Get ready to develop:
make install
This is equivalent to the following steps:
- Install pre-commit and other continuous integration dependencies in order to make commits and run tests.
  pip install -r requirements/ci.txt
  pre-commit install
- With requirements installed, make lint and make test can now be run. There is also make clean, and make all, which runs all three.
- To import the package in the Python environment, install the package (-e for an editable installation: upon import, Python reads directly from the repository).
  pip install -e .