Utility to create dict representations of file system trees.
Quick Overview
Create a dictionary representation of a filesystem hierarchy.
Optionally report some stats on the hierarchy (esp size of each directory).
Overview
Given an <inputDir>, pftree will perform a recursive walk down the directory tree. For each directory that contains files, pftree will create a dictionary key of the directory path, and will store a list of filenames for the key value.
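The mapping described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not pftree's own implementation; the function name build_tree_dict is hypothetical.

```python
import os


def build_tree_dict(input_dir):
    """Map each directory that contains files to its list of filenames."""
    tree = {}
    # os.walk performs the recursive descent from input_dir.
    for dirpath, _dirnames, filenames in os.walk(input_dir):
        if filenames:  # only directories that actually contain files get a key
            tree[dirpath] = sorted(filenames)
    return tree
```

Directories that hold only subdirectories produce no key, which matches the "for each directory that contains files" wording.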
pftree in and of itself does not really do any work. It is a class/module that abstracts the internals for representing file system hierarchies in dictionary form, for use by other modules. As a convenience, however, the --stats and --statsReverse flags provide a useful analog for sorted directory usage down a file system tree.
Several simple file and directory name filters can be applied to target very specific elements in a file system tree.
The core of the class is the tree_analysisApply() method, which accepts various kwargs. When called, this method loops over the dictionary and, for each key (i.e. ‘path’), executes a callback method. This callback is passed the dictionary value at that key (i.e. usually just the list of files) as well as all the kwargs passed to tree_analysisApply().
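The callback pattern can be illustrated in isolation. The sketch below is an assumption-laden stand-in and does not reproduce the real tree_analysisApply() signature: apply_over_tree and count_matching are hypothetical names, and the callback is assumed to receive the file list plus the forwarded kwargs.

```python
def apply_over_tree(tree, callback, **kwargs):
    """Call callback(files, **kwargs) for every path key and collect the results."""
    results = {}
    for path, files in tree.items():
        # The dictionary value (the file list) and all kwargs go to the callback.
        results[path] = callback(files, **kwargs)
    return results


def count_matching(files, suffix=""):
    """Example callback: count the files in one directory with a given suffix."""
    return sum(1 for name in files if name.endswith(suffix))


tree = {"/data/a": ["1.dcm", "2.dcm", "notes.txt"], "/data/b": ["readme.txt"]}
counts = apply_over_tree(tree, count_matching, suffix=".dcm")
# counts == {"/data/a": 2, "/data/b": 0}
```

Keeping the traversal separate from the per-directory work is what lets other modules reuse the same walk with different callbacks.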
Installation
Dependencies
The following dependencies must be installed on your host system or python3 virtual env (they are installed automatically if pftree is pulled from PyPI):
pfmisc (various misc modules and classes for the pf* family of objects)
tqdm (console prettiness for progress bars)
Using PyPI
The best method of installing this script and all of its dependencies is to fetch it from PyPI:
pip3 install pftree
Command line arguments
--inputDir <inputDir>
Input directory to examine. The downstream nested structure of this
directory is examined and recreated in the <outputDir>.
[--outputDir <outputDir>]
The directory to contain a tree structure identical to the input
tree structure, and which contains all output files from the
per-input-dir processing.
[--maxdepth <dirDepth>]
The maximum depth to descend relative to the <inputDir>. Note that
this counts from zero! The default of '-1' implies traversing the entire
directory tree.
[--relativeDir]
A flag argument. If passed (i.e. True), then the dictionary key values
are taken to be relative to the <inputDir>, i.e. the key values
will not contain the <inputDir>; otherwise the key values will
contain the <inputDir>.
[--inputFile <inputFile>]
An optional <inputFile> specified relative to the <inputDir>. If
specified, then do not perform a directory walk, but target this
specific file.
[--fileFilter <someFilter1,someFilter2,...>]
An optional comma-delimited string to filter out files of interest
from the <inputDir> tree. Each token in the expression is applied in
turn over the space of files in a directory location according to a
logical operation, and only files that contain this token string in
their filename are preserved.
[--fileFilterLogic AND|OR]
The logical operator to apply across the fileFilter operation. Default
is OR.
[--dirFilter <someFilter1,someFilter2,...>]
An additional filter that will further limit any files to process to
only those files that exist in leaf directory nodes that have some
substring of each of the comma separated <someFilter> in their
directory name.
[--dirFilterLogic AND|OR]
The logical operator to apply across the dirFilter operation. Default
is OR.
[--outputLeafDir <outputLeafDirFormat>]
If specified, will apply the <outputLeafDirFormat> to the output
directories containing data. This is useful to blanket describe
final output directories with some descriptive text, such as
'anon' or 'preview'.
This is a formatting spec, so
--outputLeafDir 'preview-%%s'
where %%s is the original leaf directory node, will prefix each
final directory containing output with the text 'preview-' which
can be useful in describing some features of the output set.
[--threads <numThreads>]
If specified, break the innermost analysis loop into <numThreads>
threads. Please note the following caveats:
* Only thread if you have a high CPU analysis loop. Note that
the input file read and output file write loops are not
threaded -- only the analysis loop is threaded. Thus, if the
bulk of execution time is in file IO, threading will not
really help.
* Threading will change the nature of the innermost looping
across the problem domain, with the result that *all* of the
problem data will be read into memory! That means potentially
all the target input file data across the entire input directory
tree.
[--json]
If specified, do a JSON dump of the entire return payload.
[--followLinks]
If specified, follow symbolic links.
[--overwrite]
If specified, allow for overwriting of existing files.
[--man]
Show full help.
[--synopsis]
Show brief help.
[--verbosity <level>]
Set the app verbosity level. This ranges from 0...<N> where internal
log messages with a level=<M> will only display if M <= N. In this
manner increasing the level here can be used to show more and more
debugging info, assuming that debug messages in the code have been
tagged with a level.
[--stats | --statsReverse | --du | --duf]
If specified, return some stats to caller. The amount of information
returned depends on the --verbosity.
For --stats (and --statsReverse):
* --verbosity 0: return only a final summary of group statistics
* --verbosity 1: in addition, return a sorted (by size) list of
subdirectories in the search tree
* --verbosity >1: same as above, but provide probing status updates.
NOTE: this incurs a significant performance penalty!
For --du | --duf
similar to '--stats' but return directory lists in a fashion similar
to the GNU 'du' tool. Both of these set default verbosity values so that
* --du : only provide a summary
* --duf: provide the (full) sorted list as well
[--3D]
A "toy" flag that simply shows the final stats report with an ASCII
3D effect.
[--jsonStats]
If specified, do a JSON dump of the stats.
[--syslog]
If specified, prepend output 'log' messages in syslog style.
[--test <analysisDelayLength[:<type>]>]
If specified, perform a test/dummy run through the
- read
- analyze
- write
callbacks. The <analysisDelayLength> denotes time (in seconds)
to delay in the analysis loop -- useful for testing threading
performance.
An optional [:<type>] can be specified.
:0 - write the 'l_file' to each outputdir, i.e. a simple 'ls'
analog
:1 - write only the number of files analyzed to each outputdir,
i.e. a summary.
For large trees, ':0' can take a significantly longer time than
':1'.
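The substring filtering described for --fileFilter and its AND|OR logic can be sketched as follows. This is an illustration of the documented behaviour under stated assumptions, not pftree's code: filter_files is a hypothetical helper, and the tokens are assumed to be matched as plain substrings of each filename.

```python
def filter_files(filenames, tokens, logic="OR"):
    """Keep filenames that contain the tokens, combined with AND or OR."""
    if not tokens:
        return list(filenames)
    # AND requires every token to appear in the name; OR requires at least one.
    combine = all if logic.upper() == "AND" else any
    return [name for name in filenames if combine(tok in name for tok in tokens)]


files = ["scan-T1.dcm", "scan-T2.dcm", "report-T1.txt"]
tokens = "T1,txt".split(",")  # comma-delimited filter string
either = filter_files(files, tokens, logic="OR")
both = filter_files(files, tokens, logic="AND")
# either == ["scan-T1.dcm", "report-T1.txt"]
# both == ["report-T1.txt"]
```

The same shape would apply to directory names for --dirFilter, with the match performed against the directory name instead of the filename.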
Examples
stats
Run on a target tree and output some detail and stats
pftree --inputDir /var/www/html \
--printElapsedTime \
--stats --verbosity 0
Increasing the verbosity produces more output on the console. Passing --json returns a highly detailed JSON payload with considerable information. Passing --jsonStats returns only a summary of the final stats on the filesystem probed. Note that the --verbosity flag is ignored if either --json or --jsonStats is also present.
pftree --duf --inputDir /var/www/html
This is a simpler CLI that displays less progress information but responds faster.
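As a rough picture of what a sorted directory-usage report computes, the sketch below sums file sizes per directory from a {path: [filenames]} dictionary. It is an approximation for illustration only: directory_sizes is a hypothetical function, and its reverse argument is a stand-in for the two sort orders, without claiming which order --stats or --statsReverse actually uses.

```python
import os


def directory_sizes(tree, reverse=False):
    """Return (path, total_bytes) pairs sorted by size for a {path: [files]} dict."""
    sizes = []
    for path, files in tree.items():
        # Sum the on-disk size of every file listed for this directory.
        total = sum(os.path.getsize(os.path.join(path, name)) for name in files)
        sizes.append((path, total))
    return sorted(sizes, key=lambda item: item[1], reverse=reverse)
```

Only files directly inside each directory are counted here; sizes are not accumulated up into parent directories.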
test
Run a test down a target tree:
pftree --inputDir /etc \
--outputDir /tmp/test \
--verbosity 1 --relativeDir \
--outputLeafDir 'preview-%%s' \
--test 0
which will “copy” the input tree to the output, and save a file-ls.txt in each directory where necessary. Note the --relativeDir flag for ‘relative’ directory specification and the --outputLeafDir spec.
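The effect of mirroring the input tree while renaming leaf directories can be sketched as below. This is a hedged illustration, not pftree's logic: output_dir_for is a hypothetical helper, and it assumes plain Python %-formatting with a single %s. How pftree itself interprets the '%%s' form shown in the command is not verified here.

```python
import os


def output_dir_for(input_root, output_root, dirpath, leaf_format=None):
    """Mirror dirpath under output_root, optionally reformatting the leaf name."""
    rel = os.path.relpath(dirpath, input_root)
    if rel == ".":
        return output_root  # the input root maps straight onto the output root
    parent, leaf = os.path.split(rel)
    if leaf_format:
        leaf = leaf_format % leaf  # e.g. 'preview-%s' turns 'keys' into 'preview-keys'
    return os.path.join(output_root, parent, leaf)
```

For example, output_dir_for("/etc", "/tmp/test", "/etc/ssh/keys", "preview-%s") yields the path "/tmp/test/ssh/preview-keys" on a POSIX system.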
_-30-_