
Run arbitrary CLI on each nested dir of an inputdir

Project description


Overview

pfdo_run provides a mechanism for exploring an input directory tree for files and directories of interest, and applying a user-specified CLI command at each hit. Outputs are typically saved in an output tree that mirrors the input tree.

Internally, pfdo_run leverages the pftree infrastructure to perform the space exploration and allows for callback methods to be applied at each stage of read, analyze and write for valid target hits.

Additionally, pfdo_run can apply some further functions to its hits, such as md5 hashing, string replacement, extension removal and more. See below for more detail.

Installation

Dependencies

The following dependencies must be present on your host system or python3 virtual env (they are installed automatically when pfdo_run is pulled from PyPI):

  • pfmisc (various misc modules and classes for the pf* family of objects)

  • pftree (create a dictionary representation of a filesystem hierarchy)

  • pfdo (the base module that does the core interfacing with pftree)

Using PyPI

The best method of installing this script and all of its dependencies is by fetching it from PyPI:

pip3 install pfdo_run

CLI specification

Any text in the CLI prefixed with a percent char % is interpreted in one of two ways.

First, any CLI argument to pfdo_run itself can be accessed via %. For example, %outputDir in the --exec string is expanded to the outputDir of the pfdo_run call.

Secondly, three internal ‘%’ variables are available:

  • %inputWorkingDir - the current input tree working directory

  • %outputWorkingDir - the current output tree working directory

  • %inputWorkingFile - the current file being processed

These internal variables allow for contextual specification of values. For example, a simple CLI touch command could be specified as

--exec "touch %outputWorkingDir/%inputWorkingFile"

or a command to convert an input png to an output jpg using the ImageMagick convert utility

--exec "convert %inputWorkingDir/%inputWorkingFile
                %outputWorkingDir/%inputWorkingFile.jpg"
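
To make the expansion concrete, here is a minimal Python sketch of %-tag substitution. It is not pfdo_run's actual implementation: the helper name expand_tags, the regular expression and the example paths are all assumptions made for illustration.

```python
import re

def expand_tags(template, tags):
    """Replace each %name in template with tags[name].

    Hypothetical helper: names that are not in `tags` are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        return str(tags.get(name, match.group(0)))
    # A tag is '%' followed by a letter and then letters, digits or underscores.
    return re.sub(r"%([A-Za-z]\w*)", substitute, template)

# Example values only; a real run derives these from the tree traversal.
tags = {
    "outputDir": "/tmp/out",
    "inputWorkingDir": "/data/in/subj1",
    "outputWorkingDir": "/tmp/out/subj1",
    "inputWorkingFile": "scan.png",
}

cmd = expand_tags(
    "convert %inputWorkingDir/%inputWorkingFile "
    "%outputWorkingDir/%inputWorkingFile.jpg",
    tags,
)
print(cmd)
```

Note that the trailing .jpg is not part of the tag name, so the output file in this sketch is named scan.png.jpg.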

Special Functions

Furthermore, pfdo_run offers the ability to apply some internal functions to a tag. The template for specifying a function to apply is:

%_<functionName>[|arg1|arg2|...]_<tag>

Thus, a function is identified by a <functionName> that is prefixed and suffixed by an underscore _ and appears in front of the tag to process. Any args to the <functionName> are separated by pipe | characters.

For example a string snippet that contains

%_strrepl|.|-_inputWorkingFile.txt

will replace all occurrences of . in the %inputWorkingFile with -. Note that the trailing .txt is not part of the tag and is preserved in the final result.

The following functions are available:

%_md5[|<len>]_<tagName>
Apply an 'md5' hash to the value referenced by <tagName> and optionally
return only the first <len> characters.

%_strmsk|<mask>_<tagName>
Apply a simple mask pattern to the value referenced by <tagName>. Chars
that are "*" in the mask are passed through unchanged. The mask and its
target should be the same length.

%_strrepl|<target>|<replace>_<tagName>
Replace the string <target> with <replace> in the value referenced by
<tagName>.

%_rmext_<tagName>
Remove the "extension" of the value referenced by <tagName>. This
of course only makes sense if the <tagName> denotes something with
an extension!

%_name_<tag>
Replace the value referenced by <tag> with a name generated by the
faker module.

Functions cannot currently be nested.
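
The behaviour described for md5, strmsk, strrepl and rmext can be approximated in a few lines of Python. This is only a sketch and not the code pfdo_run runs: the names fn_md5, fn_strmsk, fn_strrepl, fn_rmext and apply_functions are hypothetical, the strmsk rule (non-"*" mask characters overwrite the value) is an assumed reading of the description above, and the faker-based name function is omitted.

```python
import hashlib
import os
import re

def fn_md5(value, length=None):
    # Hex md5 digest, optionally truncated to the first <len> characters.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return digest if length is None else digest[:int(length)]

def fn_strmsk(value, mask):
    # Assumption: '*' keeps the original char, any other mask char replaces it.
    # zip() stops at the shorter string, so mask and value should match in length.
    return "".join(v if m == "*" else m for v, m in zip(value, mask))

def fn_strrepl(value, target, replace):
    return value.replace(target, replace)

def fn_rmext(value):
    # Drops only the last extension, e.g. 'a.tar.gz' -> 'a.tar'.
    return os.path.splitext(value)[0]

FUNCTIONS = {"md5": fn_md5, "strmsk": fn_strmsk,
             "strrepl": fn_strrepl, "rmext": fn_rmext}

# %_<functionName>[|arg1|arg2|...]_<tag>; in this sketch args may not
# contain '|' or '_'.
TAG_FN = re.compile(r"%_([A-Za-z0-9]+)((?:\|[^|_]*)*)_([A-Za-z]\w*)")

def apply_functions(template, tags):
    def substitute(match):
        name, raw_args, tag = match.groups()
        if name not in FUNCTIONS or tag not in tags:
            return match.group(0)  # leave unknown constructs untouched
        args = raw_args.split("|")[1:] if raw_args else []
        return FUNCTIONS[name](tags[tag], *args)
    return TAG_FN.sub(substitute, template)

tags = {"inputWorkingFile": "brain.scan.mgz"}  # example value only
print(apply_functions("%_strrepl|.|-_inputWorkingFile.txt", tags))  # brain-scan-mgz.txt
print(apply_functions("%_rmext_inputWorkingFile.png", tags))        # brain.scan.png
```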

Command line arguments

-I|--inputDir <inputDir>
Input base directory to traverse.

-O|--outputDir <outputDir>
The output root directory that will contain a tree structure identical
to the input directory, and each "leaf" node will contain the analysis
results.

--exec <CLIcmdToExec>
The command line expression to apply at each directory node of the
input tree. See the CLI SPECIFICATION section for more information.

[-i|--inputFile <inputFile>]
An optional <inputFile> specified relative to the <inputDir>. If
specified, then do not perform a directory walk, but convert only
this file.

[-f|--fileFilter <someFilter1,someFilter2,...>]
An optional comma-delimited string that selects files of interest
from the <inputDir> tree. Each token in the expression is applied in
turn over the space of files in a directory location, and only files
that contain this token string in their filename are preserved.
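
As an illustration of substring-based filtering, the following Python sketch keeps a filename if it contains at least one of the comma-separated tokens. The helper name filter_files is hypothetical, and treating multiple tokens as alternatives (a logical OR) is an assumption; the description above does not say how pfdo_run combines them.

```python
def filter_files(filenames, file_filter):
    """Keep filenames that contain any token of a comma-delimited filter.

    Assumption: tokens are alternatives (OR); an empty filter keeps everything.
    """
    tokens = [t for t in file_filter.split(",") if t]
    if not tokens:
        return list(filenames)
    return [f for f in filenames if any(t in f for t in tokens)]

files = ["a.jpg", "b.png", "c.jpeg", "notes.txt"]  # example listing only
print(filter_files(files, "jpg,png"))
```

Because the match is a plain substring test, a filter such as jpg also hits names like myjpgs.txt.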

[-d|--dirFilter <someFilter1,someFilter2,...>]
An additional filter that will further limit any files to process to
only those files that exist in leaf directory nodes that have some
substring of each of the comma separated <someFilter> in their
directory name.

[--analyzeFileIndex <someIndex>]
An optional string to control which file(s) in a given directory the
analysis is applied to. The default is "-1", which implies *ALL* files
in the directory. Valid values of <someIndex> are:

    "m":   only the "middle" file in the returned file list
    "f":   only the first file in the returned file list
    "l":   only the last file in the returned file list
    "<N>": the file at index N in the file list. If this index
           is out of bounds, no analysis is performed.
    "-1":  all files.

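The index rules above can be sketched in Python as follows. The helper name select_files is hypothetical, and picking the element at position len(files) // 2 as the "middle" file is an assumption, since the text does not define the middle of an even-length list.

```python
def select_files(files, index="-1"):
    """Return the subset of `files` chosen by an analyzeFileIndex-style string."""
    if index == "-1":
        return list(files)              # all files
    if not files:
        return []
    if index == "f":
        return [files[0]]               # first file
    if index == "l":
        return [files[-1]]              # last file
    if index == "m":
        return [files[len(files) // 2]] # assumed definition of "middle"
    try:
        n = int(index)
    except ValueError:
        return []                       # unrecognised index: nothing selected
    # Out-of-bounds indices select nothing, so no analysis is performed.
    return [files[n]] if 0 <= n < len(files) else []

files = ["a", "b", "c", "d", "e"]  # example file list only
print(select_files(files, "m"))
print(select_files(files, "7"))
```
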
[--outputLeafDir <outputLeafDirFormat>]
If specified, will apply the <outputLeafDirFormat> to the output
directories containing data. This is useful to blanket describe
final output directories with some descriptive text, such as
'anon' or 'preview'.

This is a formatting spec, so

    --outputLeafDir 'preview-%s'

where %s is the original leaf directory node, will prefix each
final directory containing output with the text 'preview-' which
can be useful in describing some features of the output set.
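
A small Python sketch of how such a format spec could be applied to the final component of an output path. The helper name format_leaf_dir and the example path are hypothetical; the sketch only illustrates the %s substitution and is not taken from pfdo_run itself.

```python
import posixpath

def format_leaf_dir(output_dir, leaf_format):
    """Apply a '%s'-style format to the last component of output_dir."""
    parent, leaf = posixpath.split(output_dir.rstrip("/"))
    # '%s' in leaf_format is replaced by the original leaf directory name.
    return posixpath.join(parent, leaf_format % leaf)

print(format_leaf_dir("/out/subj1/series3", "preview-%s"))
# /out/subj1/preview-series3
```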

[--threads <numThreads>]
If specified, break the innermost analysis loop into <numThreads>
threads.

[--noJobLogging]
If specified, then suppress the logging of per-job output. Usually
each job that is run will have, in the output directory, three
additional files:

        %inputWorkingFile-returncode
        %inputWorkingFile-stderr
        %inputWorkingFile-stdout

By specifying this option, the above files are not recorded.
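
When job logging is left on, the per-job files can be scanned after a run to find jobs that did not succeed. The Python sketch below assumes that each -returncode file contains the exit status as plain text and that "0" means success; neither detail is stated above, and the helper name failed_jobs is hypothetical.

```python
import os
import tempfile

SUFFIX = "-returncode"

def failed_jobs(output_dir):
    """List (job file path, return code text) for jobs with a non-zero code."""
    failures = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            if not name.endswith(SUFFIX):
                continue
            path = os.path.join(root, name)
            with open(path) as fh:
                text = fh.read().strip()
            # Assumption: the file holds the exit status as plain text.
            if text != "0":
                failures.append((path[: -len(SUFFIX)], text))
    return sorted(failures)

# Self-contained demo on a throwaway directory with made-up job logs.
with tempfile.TemporaryDirectory() as tmp:
    for fname, code in (("a.png", "0"), ("b.png", "1")):
        with open(os.path.join(tmp, fname + SUFFIX), "w") as fh:
            fh.write(code + "\n")
    failed = [(os.path.basename(p), rc) for p, rc in failed_jobs(tmp)]

print(failed)
```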

[-x|--man]
Show full help.

[-y|--synopsis]
Show brief help.

[--json]
If specified, output a JSON dump of final return.

[--followLinks]
If specified, follow symbolic links.

-v|--verbosity <level>
Set the app verbosity level.

    0: No internal output;
    1: Run start / stop output notification;
    2: As with level '1' but with simpleProgress bar in 'pftree';
    3: As with level '2' but with list of input dirs/files in 'pftree';
    5: As with level '3' but with explicit file logging for
            - read
            - analyze
            - write

Examples

Perform a pfdo_run down some input directory and convert all input jpg files to png in the output tree:

pfdo_run                                                \
    -I /var/www/html/data --fileFilter jpg              \
    -O /var/www/html/png                                \
    --exec "convert %inputWorkingDir/%inputWorkingFile
    %outputWorkingDir/%_rmext_inputWorkingFile.png"     \
    --threads 0 --printElapsedTime

The above finds all files in the tree rooted at /var/www/html/data whose filenames contain the string jpg. For each file found, the ImageMagick convert utility is called, storing the converted file in the same tree location in the output directory as the original input.

Note the special construct %_rmext_inputWorkingFile.png: the %_rmext_ prefix designates a built-in function to apply to the tag value, in this case to "remove the extension" from the %inputWorkingFile string.

Consider an example where only one file in a branched inputdir space is to be preserved:

pfdo_run                                                \
    -I $(pwd)/raw -O $(pwd)/out                         \
    -d 100307 -f " "                                    \
    --exec "cp %inputWorkingDir/brain.mgz
    %outputWorkingDir/brain.mgz"                        \
    --threads 0 --verbosity 3 --noJobLogging

Here, the input directory space is pruned for a directory leaf node that contains the string 100307. The exec command essentially copies the file brain.mgz in that target directory to the corresponding location in the output tree.

Finally the elapsed time and a JSON output are printed.
