Python module and CLI for hashing of file system directories.

dirhash

A lightweight Python module and CLI for computing the hash of any directory based on the structure and content of its files.

  • Supports all hashing algorithms of Python's built-in hashlib module.
  • Glob/wildcard (".gitignore style") path matching for expressive filtering of files to include/exclude.
  • Multiprocessing for up to 6x speed-up.

The hash is computed according to the Dirhash Standard, which is designed to allow for consistent and collision resistant generation/verification of directory hashes across implementations.
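To make "based on its files' structure and content" concrete, here is a minimal sketch that uses only the standard library. It is not the Dirhash Standard and will not reproduce dirhash's output; the names `file_digest` and `simple_dirhash` are hypothetical, and the way paths and digests are combined is an assumption chosen for illustration. The sketch also ignores empty directories, symlinks and file properties.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=2**20):
    """Hash one file's content in chunks to keep memory use bounded."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def simple_dirhash(root, algorithm="sha256"):
    """Illustrative directory hash. NOT the Dirhash Standard."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Use "/"-separated relative paths so the result does not
            # depend on where the directory lives or on the OS separator.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, file_digest(full, algorithm)))

    hasher = hashlib.new(algorithm)
    # Sort so the result does not depend on file system listing order.
    for rel, digest in sorted(entries):
        # Mix in both the relative path and the content digest, so that
        # renaming or moving a file changes the result.
        hasher.update(f"{rel}\0{digest}\n".encode("utf-8"))
    return hasher.hexdigest()
```

Two directories with the same relative paths and file contents give the same value, while renaming a single file changes it. The gotchas that a standard has to pin down (encoding, ordering, empty directories, cyclic links) are exactly what this sketch glosses over.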

Installation

From PyPI:

pip install dirhash

Or directly from source:

git clone git@github.com:andhus/dirhash-python.git
pip install dirhash-python/

Usage

Python module:

from dirhash import dirhash

dirpath = "path/to/directory"
dir_md5 = dirhash(dirpath, "md5")
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])

CLI:

dirhash path/to/directory -a md5
dirhash path/to/directory -a md5 --match "*.py"
dirhash path/to/directory -a sha1 --ignore ".*" ".*/"
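The match and ignore arguments in the examples above act as include and exclude filters. As a rough mental model only, the sketch below applies the same idea with the standard library's `fnmatch` module. The helper `select_paths` is hypothetical and deliberately simplified: it tests patterns against the file name alone, so it does not handle directory patterns such as ".*/" and is not how dirhash implements ".gitignore style" matching.

```python
from fnmatch import fnmatchcase


def select_paths(paths, match=("*",), ignore=()):
    """Keep paths whose file name matches any `match` pattern
    and none of the `ignore` patterns.

    Simplification (assumption): only the last path component is
    tested, unlike gitignore-style matching on the full path.
    """
    selected = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        included = any(fnmatchcase(name, pattern) for pattern in match)
        excluded = any(fnmatchcase(name, pattern) for pattern in ignore)
        if included and not excluded:
            selected.append(path)
    return selected


paths = ["a.py", "pkg/b.py", "README.md", ".env"]
print(select_paths(paths, match=["*.py"]))  # ['a.py', 'pkg/b.py']
print(select_paths(paths, ignore=[".*"]))   # ['a.py', 'pkg/b.py', 'README.md']
```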

Why?

If you (or your application) need to verify the integrity of a set of files as well as their name and location, you might find this useful. Use-cases range from verification of your image classification dataset (before spending GPU-$$$ on training your fancy Deep Learning model) to validation of generated files in regression-testing.

There isn't really a standard way of doing this. There are plenty of recipes out there (see e.g. these SO-questions for Linux and Python) but I couldn't find one that is properly tested (there are some gotchas to cover!) and documented with a compelling user interface. dirhash was created with this as the goal.

checksumdir is another Python module/tool with similar intent (it inspired this project), but it lacks much of the functionality offered here (most notably, including file names/structure in the hash) and it lacks tests.

Performance

The Python hashlib implementations of common hashing algorithms are highly optimised. dirhash mainly parses the file tree, pipes data to hashlib and combines the output. Reasonable measures have been taken to minimize the overhead, and for common use-cases the majority of time is spent reading data from disk and executing hashlib code.

The main effort to boost performance is support for multiprocessing, where the reading and hashing is parallelized over individual files.
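The idea of parallelizing over individual files can be sketched as follows. Note the swap: dirhash uses multiprocessing, whereas this sketch uses a thread pool from `concurrent.futures` to stay short and portable. That is a reasonable stand-in here because file reads block on I/O and CPython's hashlib documentation notes that the GIL is released during large hash updates. The function names and the default of 8 workers are assumptions for illustration, not dirhash's API.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, algorithm="sha256", chunk_size=2**20):
    """Read one file in chunks and return its hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def hash_files_parallel(paths, workers=8, algorithm="sha256"):
    """Return {path: hex digest}, hashing the files concurrently.

    Each file is an independent unit of work; results are collected
    in input order, so the output does not depend on scheduling.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = pool.map(lambda p: hash_file(p, algorithm), paths)
        return dict(zip(paths, digests))
```

Because each file's digest is computed independently, the per-file results can be combined afterwards in a fixed order, which keeps the final directory hash deterministic regardless of which worker finishes first.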

As a reference, let's compare the performance of the dirhash CLI with the shell command:

find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5

which is the top answer for the SO-question: Linux: compute a single hash for a given folder & contents? Results for two test cases are shown below. Both have 1 GiB of random data: in "flat_1k_1MB", split into 1k files (1 MiB each) in a flat structure, and in "nested_32k_32kB", into 32k files (32 KiB each) spread over the 256 leaf directories in a binary tree of depth 8.

Implementation      Test Case        Time (s)  Speed up
shell reference     flat_1k_1MB      2.29      -> 1.0
dirhash             flat_1k_1MB      1.67      1.36
dirhash(8 workers)  flat_1k_1MB      0.48      4.73
shell reference     nested_32k_32kB  6.82      -> 1.0
dirhash             nested_32k_32kB  3.43      2.00
dirhash(8 workers)  nested_32k_32kB  1.14      6.00

The benchmark was run on a MacBook Pro (2018); further details and source code here.

Documentation

Please refer to dirhash -h, the Python source code and the Dirhash Standard.
