Python module and CLI for hashing of file system directories.

dirhash

A lightweight Python module and tool for computing the hash of any directory based on its files' structure and content.

  • Supports any hashing algorithm of Python's built-in hashlib module
  • .gitignore style "wildmatch" patterns for expressive filtering of files to include/exclude.
  • Multiprocessing for up to 6x speed-up

Installation

git clone git@github.com:andhus/dirhash.git
pip install dirhash/

Usage

Python module:

from dirhash import dirhash

dirpath = 'path/to/directory'
dir_md5          = dirhash(dirpath, 'md5')
filtered_sha1    = dirhash(dirpath, 'sha1', ignore=['.*', '.*/', '*.pyc'])
pyfiles_sha3_512 = dirhash(dirpath, 'sha3_512', match=['*.py'])

CLI:

dirhash path/to/directory -a md5
dirhash path/to/directory -a sha1 -i ".*  .*/  *.pyc"
dirhash path/to/directory -a sha3_512 -m "*.py"

Why?

If you (or your application) need to verify the integrity of a set of files as well as their name and location, you might find this useful. Use-cases range from verification of your image classification dataset (before spending GPU-$$$ on training your fancy Deep Learning model) to validation of generated files in regression-testing.

There isn't really a standard way of doing this. There are plenty of recipes out there (see e.g. these SO-questions for Linux and Python) but I couldn't find one that is properly tested (there are some gotchas to cover!) and documented with a compelling user interface. dirhash was created with this as the goal.

checksumdir is another Python module/tool with similar intent (that inspired this project) but it lacks much of the functionality offered here (most notably including file names/structure in the hash) and lacks tests.

Performance

The Python hashlib implementations of common hashing algorithms are highly optimised. dirhash mainly parses the file tree, pipes data to hashlib and combines the output. Reasonable measures have been taken to minimize the overhead, and for common use-cases the majority of time is spent reading data from disk and executing hashlib code.
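To make the "parse the tree, pipe to hashlib, combine" flow concrete, here is a minimal sketch using only the standard library. It is not the scheme dirhash itself uses: the function names `file_digest` and `simple_dirhash`, the NUL-separated combination format and the `sha256` default are all assumptions chosen for illustration, and the digest it produces will not match dirhash's output.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=2**20):
    """Hash one file's content, reading in chunks to bound memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def simple_dirhash(root, algorithm="sha256"):
    """Combine relative paths and file digests into one directory digest.

    Illustrative only: this is NOT the combination scheme used by dirhash.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Normalise separators so the result is the same on every OS.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, file_digest(full, algorithm)))
    entries.sort()  # deterministic order, independent of os.walk ordering

    hasher = hashlib.new(algorithm)
    for rel, digest in entries:
        # NUL separators keep path and digest boundaries unambiguous.
        hasher.update(rel.encode("utf-8") + b"\0")
        hasher.update(digest.encode("ascii") + b"\0")
    return hasher.hexdigest()
```

Because the relative path goes into the combined hash, renaming or moving a file changes the result even when no content changes. This sketch ignores empty directories and does no filtering, so it is a starting point for understanding rather than a replacement for the module.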

The main effort to boost performance is support for multiprocessing, where the reading and hashing are parallelized over individual files.
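The idea of parallelizing over individual files can be sketched as follows. Note the substitution: dirhash uses multiprocessing, whereas this sketch uses a thread pool from `concurrent.futures` so that it runs without a `__main__` guard; threads can still overlap work here because CPython's hashlib releases the GIL while hashing larger buffers. The names `file_digest` and `parallel_file_digests` and the `workers` parameter are hypothetical, not part of dirhash's API.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def file_digest(path, algorithm="sha256", chunk_size=2**20):
    """Hash one file's content, reading in chunks to bound memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def parallel_file_digests(root, algorithm="sha256", workers=8):
    """Return {relative_path: digest}, hashing files concurrently.

    Each file is an independent unit of work, so the files can be
    read and hashed in any order and the results collected afterwards.
    """
    paths = sorted(
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with `paths`.
        digests = list(pool.map(lambda p: file_digest(p, algorithm), paths))
    return {os.path.relpath(p, root): d for p, d in zip(paths, digests)}
```

The per-file digests are independent of the worker count, so any scheme that combines them in a fixed order gives the same directory hash whether it runs with one worker or eight.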

As a reference, let's compare the performance of the dirhash CLI with the shell command:

find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5

which is the top answer for the SO-question: Linux: compute a single hash for a given folder & contents? Results for two test cases are shown below. Both have 1 GiB of random data: in "flat_1k_1MB", split into 1k files (1 MiB each) in a flat structure, and in "nested_32k_32kB", into 32k files (32 KiB each) spread over the 256 leaf directories in a binary tree of depth 8.

Implementation        Test Case          Time (s)   Speed-up
shell reference       flat_1k_1MB        2.29       1.0 (reference)
dirhash               flat_1k_1MB        1.67       1.36
dirhash (8 workers)   flat_1k_1MB        0.48       4.73
shell reference       nested_32k_32kB    6.82       1.0 (reference)
dirhash               nested_32k_32kB    3.43       2.00
dirhash (8 workers)   nested_32k_32kB    1.14       6.00

The benchmark was run on a MacBook Pro (2018); further details and source code are available in the project repository.

Documentation

Please refer to dirhash -h and the Python source code.
