A tree-based extension to pathlib

pathtreelib

The pathtreelib module aims to provide simple tools for analysing files and directories. Files and directories are represented by nodes that are linked together in a tree structure.

Getting started

The main class is PathTree, which provides five categories of functionality:

  • Generate the structure of PathNodes (at tree initialization)

  • Different types of iteration on the nodes (e.g. depth-first, breadth-first)

  • Computation of different types of custom properties (e.g. bottom-up, top-down)

  • Pruning of part of the trees

  • Exporting in csv and Excel

For the sake of the following examples, suppose the directories are organized as follows.

[Figure: example directory tree rooted at gordon (docs/img/ptl_example.drawio.png)]

Generate the tree

The initialization of the tree generates all the nodes and requires only the root.

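>>> # PathNode and PathTreeProperty are used by later examples (assumed importable from the package root)
>>> from pathtreelib import PathNode, PathTree, PathTreeProperty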
>>> tree = PathTree("gordon")
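
As a rough illustration of what tree generation involves, the sketch below walks a directory with pathlib and links every path to its parent and children. It is not pathtreelib's implementation: `SketchNode` and `build_tree` are hypothetical names introduced here, and the directory is a throwaway one created with tempfile.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchNode:
    """Hypothetical stand-in for a tree node; not the library's PathNode."""
    path: Path
    parent: SketchNode | None = None
    children: list[SketchNode] = field(default_factory=list)


def build_tree(path: Path, parent: SketchNode | None = None) -> SketchNode:
    """Recursively create one node per file or directory under `path`."""
    node = SketchNode(path, parent)
    if path.is_dir():
        # Sort the entries so the resulting tree is deterministic.
        for child in sorted(path.iterdir()):
            node.children.append(build_tree(child, node))
    return node


# Build a small throwaway directory to try the function on.
with tempfile.TemporaryDirectory() as tmp:
    root_dir = Path(tmp) / "gordon"
    (root_dir / "pictures" / "holiday").mkdir(parents=True)
    (root_dir / "black mesa").mkdir()
    (root_dir / "black mesa" / "xen.txt").write_text("notes")
    root = build_tree(root_dir)
    child_names = [child.path.name for child in root.children]
    print(child_names)  # ['black mesa', 'pictures']
```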

Node iterators

Classic breadth-first and depth-first iterators are implemented. Breadth-first is used as default iterator in __iter__.

>>> for node in tree.breadth_first_iter():
...     print(node.path.as_posix())
    gordon
    gordon/black mesa
    gordon/pictures
    gordon/black mesa/research.pdf
    gordon/black mesa/xen.txt
    gordon/pictures/equipment
    gordon/pictures/holiday
    gordon/pictures/equipment/crowbar.jpg
    gordon/pictures/equipment/gravitygun.png
>>> for node in tree.depth_first_iter():
...     print(node.path.as_posix())
    gordon
    gordon/black mesa
    gordon/black mesa/research.pdf
    gordon/black mesa/xen.txt
    gordon/pictures
    gordon/pictures/equipment
    gordon/pictures/equipment/crowbar.jpg
    gordon/pictures/equipment/gravitygun.png
    gordon/pictures/holiday
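
The two visiting orders can be reproduced with a queue and a stack. The following is a minimal sketch over a plain dictionary of children (the `CHILDREN` map and both function names are made up for this illustration and are not part of pathtreelib).

```python
from collections import deque

# Hypothetical child map mirroring the example directories (leaves have no entry).
CHILDREN = {
    "gordon": ["gordon/black mesa", "gordon/pictures"],
    "gordon/black mesa": [
        "gordon/black mesa/research.pdf",
        "gordon/black mesa/xen.txt",
    ],
    "gordon/pictures": ["gordon/pictures/equipment", "gordon/pictures/holiday"],
    "gordon/pictures/equipment": [
        "gordon/pictures/equipment/crowbar.jpg",
        "gordon/pictures/equipment/gravitygun.png",
    ],
}


def breadth_first(root):
    """Yield nodes level by level using a FIFO queue."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(CHILDREN.get(node, []))


def depth_first(root):
    """Yield a node, then each of its subtrees in turn, using a LIFO stack."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push the children reversed so the first child is visited first.
        stack.extend(reversed(CHILDREN.get(node, [])))


bfs_order = list(breadth_first("gordon"))
dfs_order = list(depth_first("gordon"))
```

Both lists contain the same nine paths; only the order differs, matching the two listings above.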

An additional iterator, based on breadth-first, allows skipping subtrees that do not satisfy a specific condition.

>>> valid_func = lambda node: node.property[PathTreeProperty.DEPTH] < 3
>>> for node in tree.validated_iter(valid_func):
...     print(node.path.as_posix())
    gordon
    gordon/black mesa
    gordon/black mesa/research.pdf
    gordon/black mesa/xen.txt
    gordon/pictures
    gordon/pictures/equipment
    gordon/pictures/holiday
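
One way to picture subtree skipping is a breadth-first loop that never enqueues the children of a node that fails the check. The sketch below assumes that behaviour; the names `CHILDREN`, `validated_breadth_first` and `depth` are hypothetical, and the library's own iterator may differ in detail.

```python
from collections import deque

# Hypothetical child map for part of the example tree.
CHILDREN = {
    "gordon": ["gordon/pictures"],
    "gordon/pictures": ["gordon/pictures/equipment", "gordon/pictures/holiday"],
    "gordon/pictures/equipment": ["gordon/pictures/equipment/crowbar.jpg"],
}


def validated_breadth_first(root, is_valid):
    """Breadth-first traversal that drops a node and its subtree when is_valid fails."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if not is_valid(node):
            continue  # The children are never enqueued, so the subtree is skipped.
        yield node
        queue.extend(CHILDREN.get(node, []))


def depth(node):
    """Depth of a node, counted as the number of separators in its path."""
    return node.count("/")


visited = list(validated_breadth_first("gordon", lambda node: depth(node) < 3))
```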

Property computation

The supported properties are classified by the type of algorithm used for the computation. Currently the library supports three types of properties: bottom-up properties, top-down properties, and individual properties. The properties are computed on all the nodes and require computation functions.

Bottom-up properties

Bottom-up properties are computed on the children first and then on the parent, hence they are computed from the leaves upward. An example of this kind of property is the height.

The computation requires the function to compute the property on a leaf

>>> def leaf_func(leaf:PathNode) -> int:
...     return 0

and the function to compute the property on an inode, assuming it is already computed on its children,

>>> def inode_func(inode:PathNode, children:list[PathNode]) -> int:
...     # Height is the longest downward path, so take the tallest child
...     children_max_height = max(child.property["height"] for child in children)
...     return children_max_height + 1

then the property computation function ensures that the property is computed on the nodes in the correct order.

>>> tree.compute_bottom_up_property("height", leaf_func, inode_func)
>>> print(tree.root.property["height"])
    3

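
The ordering guarantee described above amounts to a post-order traversal. The sketch below shows the idea on a plain dictionary, computing the number of leaves in each subtree. `CHILDREN` and `compute_bottom_up` are hypothetical names, and for brevity the inner-node function receives the children's values rather than the child nodes, which differs from the library's signature.

```python
# Hypothetical child map mirroring the example directories (leaves have no entry).
CHILDREN = {
    "gordon": ["black mesa", "pictures"],
    "black mesa": ["research.pdf", "xen.txt"],
    "pictures": ["equipment", "holiday"],
    "equipment": ["crowbar.jpg", "gravitygun.png"],
}


def compute_bottom_up(node, leaf_func, inode_func, values):
    """Post-order traversal: fill `values` for the children before the node itself."""
    children = CHILDREN.get(node, [])
    for child in children:
        compute_bottom_up(child, leaf_func, inode_func, values)
    if children:
        values[node] = inode_func(node, [values[child] for child in children])
    else:
        values[node] = leaf_func(node)
    return values


# Property: number of leaves in each subtree.
num_leaves = compute_bottom_up(
    "gordon",
    leaf_func=lambda leaf: 1,
    inode_func=lambda inode, child_values: sum(child_values),
    values={},
)
print(num_leaves["gordon"])  # 5
```
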
Top-down properties

Top-down properties are computed on the parent first and then on the children, hence are computed from the root downward. An example of this kind of property is the depth.

The computation requires the function to compute the property on the root

>>> def root_func(root:PathNode) -> int:
...     return 0

and the function to compute the property on all other nodes, assuming it is already computed on its parent,

>>> def notroot_func(node:PathNode, parent:PathNode) -> int:
...     return 1 + parent.property["depth"]

then the property computation function ensures that the property is computed on the nodes in the correct order.

>>> tree.compute_top_down_property("depth", root_func, notroot_func)
>>> print(tree.root.property["depth"])
    0

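
A top-down computation can be sketched as a traversal that stores the parent's value before visiting its children. As before, `CHILDREN` and `compute_top_down` are hypothetical names for illustration only, and the non-root function here takes the parent's value instead of the parent node.

```python
# Hypothetical child map mirroring the example directories (leaves have no entry).
CHILDREN = {
    "gordon": ["black mesa", "pictures"],
    "black mesa": ["research.pdf", "xen.txt"],
    "pictures": ["equipment", "holiday"],
    "equipment": ["crowbar.jpg", "gravitygun.png"],
}


def compute_top_down(root, root_func, notroot_func):
    """Compute a value on the root first, then on every other node from its parent."""
    values = {root: root_func(root)}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in CHILDREN.get(parent, []):
            # The parent's value is already stored when the child is processed.
            values[child] = notroot_func(child, values[parent])
            stack.append(child)
    return values


depths = compute_top_down(
    "gordon",
    root_func=lambda root: 0,
    notroot_func=lambda node, parent_value: parent_value + 1,
)
print(depths["crowbar.jpg"])  # 3
```
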
Individual properties

Individual properties are computed on each node independently of the others, hence the order of computation is irrelevant. An example of this kind of property is a flag indicating whether the node’s path is a directory.

The computation requires the function to compute the property on a generic node

>>> def property_func(node) -> bool:
...     return node.path.is_dir()

then the property computation function applies the property on all the nodes.

>>> tree.compute_individual_property("is_dir", property_func)
>>> print(tree.root.property["is_dir"])
    True

Property removal

The properties can also be removed from all nodes of the tree.

>>> tree.remove_property("height")

Pruning

Pruning reduces the number of nodes in the tree by removing the subtrees that do not satisfy a specific condition. The removal can be physical (the links between nodes are removed) or logical (a property marking the node as pruned is added).

An example condition is a limit on the depth.

>>> keep_condition = lambda node: node.property[PathTreeProperty.DEPTH] <= 2

Logical pruning adds the pruning property but does not change the structure of the tree, so it is reversible.

>>> tree.logical_pruning(keep_condition=keep_condition)
>>> print(len(list(node for node in tree)))
    9
>>> print(len(list(node for node in tree if not node.property[PathTreeProperty.PRUNED])))
    7

Physical pruning removes the links between nodes, so the structure of the tree changes irreversibly.

>>> tree.physical_pruning(keep_condition=keep_condition)
>>> print(len(list(node for node in tree)))
    7
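
The difference between the two kinds of pruning can be illustrated on a tiny hand-built tree. This is a sketch under assumptions, not the library's code: `SketchNode`, `logical_prune` and `physical_prune` are hypothetical, and the logical variant is assumed to flag everything below a failing node.

```python
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Hypothetical node type used only for this illustration."""
    name: str
    depth: int
    children: list = field(default_factory=list)
    pruned: bool = False


def iter_nodes(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def logical_prune(node, keep, inherited=False):
    """Flag nodes failing `keep`, and everything below them, without touching links."""
    node.pruned = inherited or not keep(node)
    for child in node.children:
        logical_prune(child, keep, node.pruned)


def physical_prune(node, keep):
    """Unlink children failing `keep`; their subtrees become unreachable."""
    node.children = [child for child in node.children if keep(child)]
    for child in node.children:
        physical_prune(child, keep)


crowbar = SketchNode("crowbar.jpg", 3)
equipment = SketchNode("equipment", 2, [crowbar])
pictures = SketchNode("pictures", 1, [equipment])
root = SketchNode("gordon", 0, [pictures])


def keep(node):
    """Keep condition: a limit on the depth."""
    return node.depth <= 2


logical_prune(root, keep)
total_after_logical = len(list(iter_nodes(root)))  # still 4, nothing unlinked
flagged = [node.name for node in iter_nodes(root) if node.pruned]

physical_prune(root, keep)
total_after_physical = len(list(iter_nodes(root)))  # 3, crowbar.jpg is gone
```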

Export

The tree can be exported to CSV and Excel, with a few customization options. By default the export includes the path and the properties, but it is possible to filter both the properties and the nodes.

>>> tree.to_csv(csvfile="test.csv", properties=["height", "depth", "is_dir"])
>>> tree.to_excel("test.xlsx", properties=["height", "depth", "is_dir"])
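
To make the two filters concrete, here is a small sketch of a CSV export written with the standard csv module. It only mimics the idea of selecting properties and nodes; `export_csv`, its parameters and the sample data are assumptions made for this example and say nothing about how pathtreelib writes its files.

```python
import csv
import io

# Hypothetical per-node properties, keyed by path.
nodes = {
    "gordon": {"depth": 0, "is_dir": True},
    "gordon/pictures": {"depth": 1, "is_dir": True},
    "gordon/black mesa/xen.txt": {"depth": 2, "is_dir": False},
}


def export_csv(stream, nodes, properties, node_condition=lambda path, props: True):
    """Write one row per selected node: its path followed by the chosen properties."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["path", *properties])
    for path, props in nodes.items():
        if node_condition(path, props):
            writer.writerow([path, *(props[name] for name in properties)])


# Export only the depth column, and only for directories.
buffer = io.StringIO()
export_csv(buffer, nodes, ["depth"], node_condition=lambda path, props: props["is_dir"])
csv_text = buffer.getvalue()
```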

Example

The following script analyses the whole C volume in Windows to find the folders with holiday pictures (folders with a large number of jpg files) that take up the most space.

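# PathTree and PathTreeProperty are assumed to be importable from the package root
from pathtreelib import PathTree, PathTreeProperty

# Generate the whole tree (could take a while if the volume is large)
tree = PathTree("C:/")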

# Compute on each node the number of pictures in its subtree
tree.compute_bottom_up_property(
   "num_of_jpg",
   leaf_func=lambda leaf: 1 if leaf.path.suffix == '.jpg' else 0,
   inode_func=lambda inode, children: sum([child.property["num_of_jpg"] for child in children])
)

# Keep only the subtrees of interest (that have more than 10 pics)
tree.physical_pruning(keep_condition=lambda node: node.property["num_of_jpg"] > 10)

# The nodes that are now leaves are the photo directories we are looking for
tree.compute_individual_property("photo_dir", lambda node: len(node.children) == 0)

# Find the 3 largest photo directories
photo_dirs = [
   node
   for node in tree
   if node.property["photo_dir"]
]
photo_dirs.sort(key=lambda node: node.property[PathTreeProperty.SIZE], reverse=True)
print("The 3 largest photo directories:", photo_dirs[:3])

# Export all the photo directories with the main information to csv
tree.to_csv(
   "large_photo_dirs",
   properties=["num_of_jpg", PathTreeProperty.SIZE],
   node_condition=lambda node: node.property["photo_dir"]
)
