fswalk

An efficient multiprocessing directory walk and search tool

Introduction

fswalk is a simple Python script that recursively walks a filesystem directory to gather file metadata and collect it into a JSON file or an Elasticsearch database. It runs several processes, each responsible for listing the files contained in a subdirectory. The collected metadata are filename, path, uid, gid, size and atime. The output is either JSON streamed to stdout on the fly, or documents indexed into Elasticsearch. A simple search option retrieves files by owner, group or part of the name.
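
To illustrate the general idea of a multiprocess walk, here is a minimal sketch. It is not fswalk's actual code: the function names `scan_dir` and `walk`, the record keys and the level-by-level scheduling are all assumptions made for the example. Each worker lists one directory and returns the file records plus the subdirectories still to be visited.

```python
import os
import re
import sys
from functools import partial
from multiprocessing import Pool


def scan_dir(path, exclude=None):
    """List one directory: return (file records, subdirectories to visit)."""
    records, subdirs = [], []
    try:
        entries = list(os.scandir(path))
    except OSError:
        return records, subdirs  # unreadable directory: skip it
    for entry in entries:
        if exclude and re.search(exclude, entry.path):
            continue  # path matches the exclusion regular expression
        try:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
                continue
            st = entry.stat(follow_symlinks=False)
        except OSError:
            continue  # entry vanished or is not accessible
        records.append({
            "name": entry.name, "path": entry.path,
            "uid": st.st_uid, "gid": st.st_gid,
            "size": st.st_size, "atime": st.st_atime,
        })
    return records, subdirs


def walk(root, map_fn=map, exclude=None):
    """Walk level by level; map_fn decides whether a level runs in parallel."""
    pending, found = [root], []
    while pending:
        results = map_fn(partial(scan_dir, exclude=exclude), pending)
        pending = []
        for records, subdirs in results:
            found.extend(records)
            pending.extend(subdirs)
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    with Pool(8) as pool:  # 8 worker processes, comparable to `-n 8`
        print(len(walk(sys.argv[1], pool.map)), "files found")
```

Passing `pool.map` spreads the directories of each level over the worker processes, while the default built-in `map` runs the same logic serially, which is convenient for testing.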

The script also provides an option to run a quick analysis of the resulting output file.

Warning: when the results are sent to stdout, the JSON output contains an extra trailing comma, which may break strict JSON parsers. This is a side effect of multiprocessing, accepted so as not to slow down the walk. The pyjson5 Python library can read such non-standard JSON.
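
If pyjson5 is not available, a standard-library fallback is possible. The sketch below assumes that the output is a single JSON document whose only defect is one trailing comma just before the final closing bracket; the exact layout of fswalk's output is not described here, so treat this as an assumption. The helper names `loads_lenient` and `load_walk_output` are made up for the example.

```python
import gzip
import json
import re


def loads_lenient(text):
    """Parse JSON, tolerating one trailing comma before the final bracket."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Only the very end of the document is touched, so commas inside
        # string values elsewhere in the file are never altered.
        fixed = re.sub(r",\s*([\]}])\s*$", r"\1", text)
        return json.loads(fixed)


def load_walk_output(path):
    """Read a plain or gzip-compressed output file."""
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"  # gzip magic number
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return loads_lenient(fh.read())
```

This reads the whole file into memory, which may not suit very large scans.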

Installation

Requirements:

  • python >= 3.5
  • python packages: requests, pyjson5, elasticsearch

Installing the current stable release:

$ pip install fswalk

Installing the latest devel snapshot:

$ pip install git+https://github.com/bzizou/fs_walk.git

Example

Start a walk of the /home/bzizou directory with 8 processes, excluding the .snapshot subdirectory and getting the result as a gzipped JSON file:

bzizou@f-dahu:~/git/fs_walk$ fswalk -p /home/bzizou -x '^/home/bzizou/\.snapshot/' -n 8 |gzip > /tmp/out.gz    

Analyze the output from the resulting file:

bzizou@f-dahu:~/git/fs_walk$ fswalk -a /tmp/out.gz
User                                       Size            Count
=================================================================
bzizou                               2749804131            11125
root                                 1030651826             1351
1000                                  390705282              476
11610                                    726417                7

Group                                      Size            Count
=================================================================
realuser                             2749795275            11119
root                                 1030660332             1356
1000                                  390705282              476
2222                                     726417                7
staff                                       350                1

TOTAL SIZE: 4171887656
TOTAL FILES: 12959
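
A summary of this kind can be computed with a simple aggregation over the file records. The following sketch is only an illustration, not the script's own implementation; the record keys (`owner`, `size`) and the function names are hypothetical.

```python
from collections import defaultdict


def summarize(records, key):
    """Return {key value: [total size, file count]} for the given field."""
    totals = defaultdict(lambda: [0, 0])
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += rec["size"]
        bucket[1] += 1
    return dict(totals)


def format_table(title, totals):
    """Render rows sorted by decreasing size, in a layout like the one above."""
    row = "{:<25}{:>22}{:>17}"
    lines = [row.format(title, "Size", "Count"), "=" * 64]
    for name, (size, count) in sorted(totals.items(), key=lambda kv: -kv[1][0]):
        lines.append(row.format(str(name), size, count))
    return "\n".join(lines)
```

Calling `summarize(records, "owner")` and then `format_table("User", ...)` on the result would give a per-user table; the same two calls with a group field would give the per-group one.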

Same directory scan, but indexing the results into an Elasticsearch database:

bzizou@f-dahu:~/git/fs_walk$ fswalk -p /home/bzizou -x '^/home/bzizou/\.snapshot/' -n 8 --elastic-host=http://localhost:9200 --elastic-index=fs_walk_home -g
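
Indexing in bulks means grouping many documents into one request. As a rough sketch of what such grouping can look like, the function below builds newline-delimited bodies in the format expected by the Elasticsearch `_bulk` endpoint. The function name and the default of 1000 documents per bulk are assumptions for the example, not values taken from fswalk.

```python
import json


def bulk_bodies(records, index, bulk_size=1000):
    """Yield newline-delimited _bulk request bodies of at most bulk_size docs."""
    lines = []
    for count, rec in enumerate(records, start=1):
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(rec))                           # document line
        if count % bulk_size == 0:
            yield "\n".join(lines) + "\n"  # the body must end with a newline
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"
```

Each body could then be sent with an HTTP POST to the `_bulk` endpoint of the server, using the `application/x-ndjson` content type.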

Search for all files with the "povray" string in their path name and belonging to the user whose uid is 10000:

bzizou@f-dahu:~/git/fs_walk$ fswalk --elastic-host=http://localhost:9200 --elastic-index=fs_walk_home --search="10000:*:povray:*"
/home/bzizou/povray/OAR.cigri.14068.1251218.stderr
/home/bzizou/povray/OAR.cigri.14068.1251220.stderr
/home/bzizou/povray/OAR.cigri.14068.1251224.stderr
/home/bzizou/povray/OAR.cigri.14068.1251231.stderr
/home/bzizou/povray/OAR.cigri.14068.1251231.stdout
/home/bzizou/povray/OAR.cigri.14068.1251233.stderr
/home/bzizou/povray/OAR.cigri.14068.1251233.stdout
/home/bzizou/povray/OAR.cigri.14068.1251234.stderr
/home/bzizou/povray/OAR.cigri.14068.1251234.stdout
/home/bzizou/povray/OAR.cigri.14068.1251237.stderr
/home/bzizou/povray/OAR.cigri.14068.1251237.stdout
/home/bzizou/povray/OAR.cigri.14068.1251238.stderr
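
The search string has four colon-separated fields. The sketch below shows one possible way to interpret it when filtering records in memory. It assumes that an empty field or `*` means "any value" and that the path field matches anywhere in the path, as the example above suggests; the function names and record keys are hypothetical, and the real tool may translate the search differently, especially when querying Elasticsearch.

```python
from fnmatch import fnmatchcase


def parse_search(expr):
    """Split 'uid:gid:path_glob:hostname' into a dict; '' or '*' means any."""
    parts = expr.split(":")
    if len(parts) != 4:
        raise ValueError("expected uid:gid:path_glob:hostname")
    keys = ("uid", "gid", "path", "hostname")
    return {k: (None if v in ("", "*") else v) for k, v in zip(keys, parts)}


def matches(record, query):
    """Check one record (hypothetical field names) against a parsed query."""
    for key in ("uid", "gid", "hostname"):
        if query[key] is not None and str(record.get(key)) != query[key]:
            return False
    if query["path"] is not None:
        # Match the glob anywhere in the path.
        return fnmatchcase(record["path"], "*" + query["path"] + "*")
    return True
```

Because the fields are split on `:`, this simple parser does not handle path patterns that themselves contain a colon.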

Usage

Usage: fswalk [options]

Options:
  -h, --help            show this help message and exit
  -p PATH, --path=PATH  Path to scan
  -n NPROC, --nproc=NPROC
                        Number of process to launch
  -x EXCLUDE_EXPR, --exclude=EXCLUDE_EXPR
                        Regular expression for path exclusion
  -a ANALYZE_FILE, --analyze=ANALYZE_FILE
                        Creates a summary based on a previously generated json
                        file
  -s SEARCH_STRING, --search=SEARCH_STRING
                        Search a subset of files with syntax:
                        [uid]:[gid]:[path_glob]:[hostname] (--analyze or
                        --elastic-host needed)
  --numeric             Output numeric uid/gid instead of names
  --hostname=HOSTNAME   Overwrite the value of the hostname string. Defaults
                        to local hostname.
  -e ELASTIC_HOST, --elastic-host=ELASTIC_HOST
                        Use an elasticsearch server for output. 'Ex:
                        http://localhost:9200'
  --elastic-index=ELASTIC_INDEX
                        Name of the elasticsearch index
  --elastic-bulk-size=MAX_BULK_SIZE
                        Size of the elastic indexing bulks
  -g, --elastic-purge-index
                        Purge the elasticsearch index before indexing

The ANALYZE_FILE parameter may be a gzip compressed json file or a plain-text json file.
