Efficient Trie-based regex unions for blacklist/whitelist filtering and one-pass mapping-based string replacing

retrie

retrie offers fast methods to match and replace (sequences of) strings based on efficient Trie-based regex unions.

Trie

Instead of matching against a simple regex union, which becomes slow for large sets of words, a more efficient regex pattern can be compiled using a Trie structure:

from retrie.trie import Trie


trie = Trie()

trie.add("abc", "foo", "abs")
assert trie.pattern() == "(?:ab[cs]|foo)"  # equivalent to but faster than "(?:abc|abs|foo)"

trie.add("absolute")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?)|foo)"

trie.add("abx")
assert trie.pattern() == "(?:ab(?:[cx]|s(?:olute)?)|foo)"

trie.add("abxy")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?|xy?)|foo)"
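
To illustrate the idea behind the compiled pattern, here is a minimal standard-library sketch of turning a character trie into a regex. It is not retrie's actual implementation: the names `build_trie` and `trie_to_pattern` are hypothetical, and the nested-dict layout with `""` as an end-of-word marker is an assumption made for this example.

```python
import re


def build_trie(words):
    """Store words in nested dicts; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = {}
    return root


def trie_to_pattern(node):
    """Return a regex for all suffixes below node, or None for a pure leaf."""
    if "" in node and len(node) == 1:
        return None  # nothing follows: the word simply ends here

    optional = "" in node  # a word ends here, but longer words continue
    branches = []          # alternatives that span more than one character
    chars = []             # single characters that end a word directly

    for char in sorted(key for key in node if key != ""):
        suffix = trie_to_pattern(node[char])
        if suffix is None:
            chars.append(re.escape(char))
        else:
            branches.append(re.escape(char) + suffix)

    only_chars = not branches
    if chars:
        # Collapse single-character alternatives into one character class.
        branches.append(chars[0] if len(chars) == 1 else "[" + "".join(chars) + "]")

    if len(branches) == 1:
        result = branches[0]
    else:
        result = "(?:" + "|".join(sorted(branches)) + ")"

    if optional:
        result = result + "?" if only_chars else "(?:" + result + ")?"
    return result


pattern = trie_to_pattern(build_trie(["abc", "foo", "abs"]))
print(pattern)  # (?:ab[cs]|foo)
assert re.fullmatch(pattern, "abs")
assert not re.fullmatch(pattern, "abx")
```

Because shared prefixes are factored out, the regex engine examines the prefix `ab` once rather than once per alternative, which is where the speed-up over a flat union comes from.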

A Trie may be populated with zero or more strings at instantiation or via Trie.add, which supports method chaining. Two instances can be merged with the + operator (returns a new instance) or the += operator (updates in place). Instances compare equal if their data dictionaries are equal.

trie = Trie()
trie += Trie("abc")
assert (
    trie + Trie().add("foo")
    == Trie("abc", "foo")
    == Trie(*["abc", "foo"])
    == Trie().add(*["abc", "foo"])
    == Trie().add("abc", "foo")
    == Trie().add("abc").add("foo")
)
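
The merge and equality semantics can be pictured with plain nested dictionaries. This is a sketch only: the helpers `insert` and `merge` are hypothetical, and the dictionary layout (one level per character, `""` as end-of-word marker) is an assumption rather than the library's documented internal format.

```python
def insert(trie, word):
    """Add word to a nested-dict trie in place and return the trie."""
    node = trie
    for char in word:
        node = node.setdefault(char, {})
    node[""] = {}  # end-of-word marker
    return trie


def merge(left, right):
    """Return a new trie holding every word of both inputs (inputs untouched)."""
    merged = {key: merge(child, {}) for key, child in left.items()}  # deep copy
    for key, child in right.items():
        merged[key] = merge(merged.get(key, {}), child)
    return merged


abc = insert({}, "abc")
foo = insert({}, "foo")

# Two tries built by different routes compare equal when their dicts are equal.
assert merge(abc, foo) == insert(insert({}, "foo"), "abc")
# merge returns a new structure, analogous to `+`; the operands are unchanged.
assert abc == insert({}, "abc")
```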

Installation

This pure-Python, OS-independent package is available on PyPI:

$ pip install retrie

Usage

For documentation, see retrie.readthedocs.io.

The following objects are all subclasses of retrie.retrie.Retrie, which handles filling the Trie and compiling the corresponding regex pattern.

Blacklist

The Blacklist object can be used to filter out bad occurrences in a text or a sequence of strings:

from retrie.retrie import Blacklist

# check out docstrings and methods
help(Blacklist)

blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=False)
blacklist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good", "foobar")
assert blacklist.cleanse_text(("good abc foobar")) == "good  foobar"

blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=True)
blacklist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good",)
assert blacklist.cleanse_text(("good abc foobar")) == "good  bar"
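
The effect of match_substrings can be reproduced with the standard `re` module and the two compiled patterns shown in the comments above. The helper functions below are hypothetical stand-ins written for this sketch, not retrie's implementation; they only mirror the documented results.

```python
import re

# The two patterns from the comments above, compiled directly.
WORD_PATTERN = re.compile(r"(?<=\b)(?:ab[cs]|foo)(?=\b)", re.IGNORECASE)
SUBSTRING_PATTERN = re.compile(r"(?:ab[cs]|foo)", re.IGNORECASE)


def is_blacklisted(pattern, text):
    """True if any blacklisted term occurs in text."""
    return pattern.search(text) is not None


def filter_allowed(pattern, items):
    """Keep only the items that contain no blacklisted term."""
    return [item for item in items if not is_blacklisted(pattern, item)]


def cleanse(pattern, text):
    """Delete every blacklisted occurrence from text."""
    return pattern.sub("", text)


# Whole-word matching: "foobar" is left alone.
assert filter_allowed(WORD_PATTERN, ["good", "abc", "foobar"]) == ["good", "foobar"]
assert cleanse(WORD_PATTERN, "good abc foobar") == "good  foobar"

# Substring matching: the "foo" inside "foobar" is also caught.
assert filter_allowed(SUBSTRING_PATTERN, ["good", "abc", "foobar"]) == ["good"]
assert cleanse(SUBSTRING_PATTERN, "good abc foobar") == "good  bar"
```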

Whitelist

Similar methods are available for the Whitelist object:

from retrie.retrie import Whitelist

# check out docstrings and methods
help(Whitelist)

whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=False)
whitelist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc",)
assert whitelist.cleanse_text(("bad abc foobar")) == "abc"

whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=True)
whitelist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc", "foobar")
assert whitelist.cleanse_text(("bad abc foobar")) == "abcfoo"
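
The whitelist results above are consistent with keeping only the matched spans and discarding everything else. The following standard-library sketch reproduces those outputs under that assumption; the helper names are hypothetical and the code is not taken from retrie.

```python
import re

WORD_PATTERN = re.compile(r"(?<=\b)(?:ab[cs]|foo)(?=\b)", re.IGNORECASE)
SUBSTRING_PATTERN = re.compile(r"(?:ab[cs]|foo)", re.IGNORECASE)


def is_whitelisted(pattern, text):
    """True if any whitelisted term occurs in text."""
    return pattern.search(text) is not None


def filter_whitelisted(pattern, items):
    """Keep only the items that contain a whitelisted term."""
    return [item for item in items if is_whitelisted(pattern, item)]


def keep_matches(pattern, text):
    """Concatenate the matched spans and drop all other characters."""
    return "".join(match.group(0) for match in pattern.finditer(text))


assert filter_whitelisted(WORD_PATTERN, ["bad", "abc", "foobar"]) == ["abc"]
assert keep_matches(WORD_PATTERN, "bad abc foobar") == "abc"

assert filter_whitelisted(SUBSTRING_PATTERN, ["bad", "abc", "foobar"]) == ["abc", "foobar"]
assert keep_matches(SUBSTRING_PATTERN, "bad abc foobar") == "abcfoo"
```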

Replacer

The Replacer object does a fast single-pass search & replace for occurrences of replacement_mapping.keys() with corresponding values.

from retrie.retrie import Replacer

# check out docstrings and methods
help(Replacer)

replacement_mapping = dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"]))

replacer = Replacer(replacement_mapping, match_substrings=True)
replacer.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... new2bar"

replacer = Replacer(replacement_mapping, match_substrings=False)
replacer.compiled
# re.compile(r'\b(?:ab[cs]|foo)\b', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... foobar"

replacer = Replacer(replacement_mapping, match_substrings=False, re_flags=None)
replacer.compiled  # on py3, re.UNICODE is always enabled
# re.compile(r'\b(?:ab[cs]|foo)\b')
assert replacer.replace("ABS ...foo... foobar") == "ABS ...new2... foobar"

replacer = Replacer(replacement_mapping, match_substrings=False, word_boundary=" ")
replacer.compiled
# re.compile(r'(?<= )(?:ab[cs]|foo)(?= )', re.IGNORECASE|re.UNICODE)
assert replacer.replace(". ABS ...foo... foobar") == ". new3 ...foo... foobar"
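
For comparison, single-pass mapping-based replacement can be sketched with `re.sub` and a callback. Note that this sketch swaps in a plain regex union (longest keys first) instead of a Trie-based pattern, and that `make_replacer` and its parameters are hypothetical names chosen for illustration. It also assumes keys for which `str.lower()` agrees with the regex engine's case-insensitive matching.

```python
import re


def make_replacer(mapping, match_substrings=True, ignore_case=True):
    """Return a function that replaces all mapping keys in a single pass."""
    # Longest keys first, so the union prefers the longest alternative.
    keys = sorted(mapping, key=len, reverse=True)
    union = "(?:" + "|".join(re.escape(key) for key in keys) + ")"
    if not match_substrings:
        union = r"\b" + union + r"\b"
    pattern = re.compile(union, re.IGNORECASE if ignore_case else 0)

    # Normalize keys so a match such as "ABS" finds the entry for "abs".
    if ignore_case:
        lookup = {key.lower(): value for key, value in mapping.items()}
    else:
        lookup = dict(mapping)

    def replace(text):
        def substitute(match):
            found = match.group(0)
            return lookup[found.lower() if ignore_case else found]

        return pattern.sub(substitute, text)

    return replace


mapping = {"abc": "new1", "foo": "new2", "abs": "new3"}

replace = make_replacer(mapping)
assert replace("ABS ...foo... foobar") == "new3 ...new2... new2bar"

replace = make_replacer(mapping, match_substrings=False)
assert replace("ABS ...foo... foobar") == "new3 ...new2... foobar"

replace = make_replacer(mapping, match_substrings=False, ignore_case=False)
assert replace("ABS ...foo... foobar") == "ABS ...new2... foobar"
```

Every occurrence is resolved during one scan of the text, so a replacement value is never re-scanned and replaced again, unlike chained `str.replace` calls.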

Development

Run make help for options like installing for development, linting and testing.
