Utilities for efficient glob matching using tries

glob-tries

Description

glob-tries provides two classes, GlobTrie and PathTrie, which use slightly modified trie data structures to efficiently store and query collections of globs and paths. They are useful for indexing and matching file trees when multiple glob patterns might match the same file. The library also applies consistent precedence rules when more than one pattern matches.

Installation

pip install glob-tries
poetry add glob-tries

Usage

import glob_tries

GlobTrie

GlobTrie can be thought of as a dict whose keys are shell-style wildcard patterns: objects are stored under glob patterns and looked up by concrete path.

This is helpful when you must sort file paths, or file-path-like strings, into groups based on a set of glob patterns. For example, say you have the following rules:

  • All files in foo/bar/baz are in group baz
  • All .yaml, .yml, or .json files in foo/*/baz are in group config
  • All other files in foo are in group foo
  • All .txt files not otherwise covered by another rule should be in group text

You can express this with:

from glob_tries import GlobTrie

trie = GlobTrie()

trie.augment("foo/bar/baz/**", "baz")
trie.augment("foo/*/baz/**/*.json", "config")
trie.augment("foo/*/baz/**/*.yaml", "config")
trie.augment("foo/*/baz/**/*.yml", "config")
trie.augment("foo/**", "foo")
trie.augment("**/*.txt", "text")

Calling trie.get with a path that matches these rules returns the group of the highest-precedence matching rule. Precedence is based on how "precise" a pattern is: matching proceeds left to right, trying more specific checks (single letters) before less specific ones (**). The order of evaluation is:

  1. Single letters, as well as [abc]-type groups
  2. [!abc]-type negative groups
  3. ? single-character wildcards
  4. * single-folder wildcards
  5. ** recursive wildcards

GlobTrie supports *, **, ?, [abc], and [!abc]-style shell globbing.

from glob_tries import GlobTrie

trie = GlobTrie()

trie.augment("foo", 1)
trie.augment("foo/*/bar", 2)
trie.augment("ba[rz]", 3)
trie.augment("ba[!m]", 4)
trie.augment("qu?z", 5)
trie.augment("spam/**/obj", 6)

trie.get("foo") # 1
trie.get("foobar") # None

trie.get("foo/baz/bar") # 2
trie.get("foo/egg/bar") # 2
trie.get("foo/egg/spam/bar") # None

trie.get("bar") # 3
trie.get("baz") # 3
trie.get("bam") # None
trie.get("bax") # 4

trie.get("quzz") # 5
trie.get("quaz") # 5
trie.get("quoz") # 5

trie.get("spam/obj") # 6
trie.get("spam/eggs/obj") # 6
trie.get("spam/ham/eggs/obj") # 6
trie.get("spam/ham/eggs/notobj") # None
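
To make the precedence idea concrete, here is a simplified, self-contained sketch of a glob trie that works on whole path segments only. It is not the glob-tries implementation: the class name SegmentGlobTrie and its internals are hypothetical, it supports only literal segments, * and **, and it assumes that ** may match zero segments (as the "spam/obj" example above suggests). At each node it tries the literal child first, then *, then **, and backtracks when a more specific branch fails further down.

```python
from typing import Any, Dict, List, Optional

_MISS = object()  # sentinel meaning "no value stored" / "no match found"


class SegmentGlobTrie:
    """Toy segment-level glob trie (hypothetical; not glob_tries.GlobTrie)."""

    def __init__(self) -> None:
        self.literals: Dict[str, "SegmentGlobTrie"] = {}
        self.star: Optional["SegmentGlobTrie"] = None      # child for "*"
        self.globstar: Optional["SegmentGlobTrie"] = None  # child for "**"
        self.value: Any = _MISS

    def _child(self, segment: str) -> "SegmentGlobTrie":
        # Return the child node for one pattern segment, creating it if needed.
        if segment == "*":
            if self.star is None:
                self.star = SegmentGlobTrie()
            return self.star
        if segment == "**":
            if self.globstar is None:
                self.globstar = SegmentGlobTrie()
            return self.globstar
        return self.literals.setdefault(segment, SegmentGlobTrie())

    def augment(self, pattern: str, value: Any) -> None:
        node = self
        for segment in pattern.split("/"):
            node = node._child(segment)
        node.value = value

    def get(self, path: str) -> Any:
        found = self._match(path.split("/"), 0)
        return None if found is _MISS else found

    def _match(self, parts: List[str], i: int) -> Any:
        if i == len(parts):
            if self.value is not _MISS:
                return self.value
            # A trailing "**" may match zero segments.
            return self.globstar._match(parts, i) if self.globstar else _MISS
        # 1. Most specific: an exact literal segment.
        child = self.literals.get(parts[i])
        if child is not None:
            found = child._match(parts, i + 1)
            if found is not _MISS:
                return found
        # 2. "*": exactly one segment, whatever its name.
        if self.star is not None:
            found = self.star._match(parts, i + 1)
            if found is not _MISS:
                return found
        # 3. Least specific: "**" consumes zero or more segments.
        if self.globstar is not None:
            for j in range(i, len(parts) + 1):
                found = self.globstar._match(parts, j)
                if found is not _MISS:
                    return found
        return _MISS


trie = SegmentGlobTrie()
trie.augment("foo/bar/baz/**", "baz")
trie.augment("foo/*/baz/**", "star-baz")
trie.augment("foo/**", "foo")

print(trie.get("foo/bar/baz/x.txt"))  # baz      (literal "bar" beats "*")
print(trie.get("foo/qux/baz/x.txt"))  # star-baz ("*" beats "**")
print(trie.get("foo/bar/readme"))     # foo      (backtracks to "foo/**")
print(trie.get("other/x"))            # None
```

The third lookup shows why backtracking matters: the literal branch foo/bar is tried first, fails at "readme", and the search falls back to the less specific foo/** rule.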

PathTrie

PathTrie is the inverse of GlobTrie. It stores a list of files in a directory, or strings that are arranged like files in a directory, and lets you efficiently list all files that match an arbitrary glob pattern. PathTrie supports the same set of characters and operators as GlobTrie. Note that the in-memory representation is somewhat inefficient: each "node" in the trie is a Python object, so storing many paths as a trie can take more memory than storing them as a plain list. Queries, however, are computationally much more efficient.

from glob_tries import PathTrie

trie = PathTrie()

trie.augment("foo.py")
trie.augment("bar.py")
trie.augment("baz.py")
trie.augment("folder1/foo.py")
trie.augment("folder1/foo.yaml")
trie.augment("folder1/subfolder/foo.yaml")
trie.augment("folder2/foo.yaml")

trie.get_all_matches("foo.py")
# ["foo.py"]
trie.get_all_matches("ba[rz].py")
# ["bar.py", "baz.py"]
trie.get_all_matches("folder1/*")
# ["folder1/foo.py", "folder1/foo.yaml"]
trie.get_all_matches("folder1/**")
# ["folder1/foo.py", "folder1/foo.yaml", "folder1/subfolder/foo.yaml"]
trie.get_all_matches("folder1/**/*.yaml")
# ["folder1/foo.yaml", "folder1/subfolder/foo.yaml"]
trie.get_all_matches("**/*.yaml")
# ["folder1/foo.yaml", "folder2/foo.yaml", "folder1/subfolder/foo.yaml"]
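
For intuition about how such a structure can answer glob queries, the following is a minimal, self-contained sketch of a path trie keyed on path segments. It is an illustration under stated assumptions, not the library's code: the class name SegmentPathTrie is hypothetical, each segment is matched with the standard library's fnmatch.fnmatchcase, ** is treated as "zero or more segments", and results are returned sorted, so their order may differ from what PathTrie returns.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Set


class SegmentPathTrie:
    """Toy segment-level path trie (hypothetical; not glob_tries.PathTrie)."""

    def __init__(self) -> None:
        self.children: Dict[str, "SegmentPathTrie"] = {}
        self.is_path = False  # True if a stored path ends at this node

    def augment(self, path: str) -> None:
        node = self
        for segment in path.split("/"):
            node = node.children.setdefault(segment, SegmentPathTrie())
        node.is_path = True

    def get_all_matches(self, pattern: str) -> List[str]:
        found: Set[str] = set()
        self._walk(pattern.split("/"), 0, [], found)
        return sorted(found)

    def _walk(self, pat: List[str], i: int, prefix: List[str], found: Set[str]) -> None:
        if i == len(pat):
            # Pattern exhausted: report this node only if a stored path ends here.
            if self.is_path:
                found.add("/".join(prefix))
            return
        if pat[i] == "**":
            # "**" matches zero segments: move on to the next pattern segment...
            self._walk(pat, i + 1, prefix, found)
            # ...or one more segment: descend while staying on "**".
            for name, child in self.children.items():
                child._walk(pat, i, prefix + [name], found)
            return
        # Ordinary segment: only descend into children whose name matches.
        for name, child in self.children.items():
            if fnmatchcase(name, pat[i]):
                child._walk(pat, i + 1, prefix + [name], found)


trie = SegmentPathTrie()
for p in ["foo.py", "bar.py", "baz.py", "folder1/foo.py", "folder1/foo.yaml",
          "folder1/subfolder/foo.yaml", "folder2/foo.yaml"]:
    trie.augment(p)

print(trie.get_all_matches("ba[rz].py"))
# ['bar.py', 'baz.py']
print(trie.get_all_matches("folder1/*"))
# ['folder1/foo.py', 'folder1/foo.yaml']
print(trie.get_all_matches("**/*.yaml"))
# ['folder1/foo.yaml', 'folder1/subfolder/foo.yaml', 'folder2/foo.yaml']
```

Because a query only descends into children whose names match the current pattern segment, whole subtrees are skipped without being examined, which is where the query-time saving over scanning a flat list comes from.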

Contributing

We welcome contributions from the open-source community. See CONTRIBUTING.md for details.

The project currently has exhaustive test coverage. New additions should include similarly exhaustive coverage. Bugfixes should include a test that catches the bug condition. Unit tests can be run with pytest:

pytest

There are multiple pre-commit hooks that enforce typechecking, code style guidelines, and linter guidelines. Install them before development:

poetry run pre-commit install

License

This library is licensed under the BSD 3-Clause license.

