Stable, fast hash of the content of a collection of files and directories, optionally including permissions, dates, etc.
file-collection-hash: Generate stable hash of a directory or collection of files
A Python commandline tool and callable function that can efficiently compute a repeatable hash string for the content of a directory or a collection of files.
Table of contents
- Introduction
- Details
- Installation
- Usage
- Known issues and limitations
- Getting help
- Contributing
- License
- Authors and history
Introduction
The Python package file-collection-hash provides a command-line tool as well as a runtime function to efficiently generate a stable content hash for a directory or collection of files. In general, a directory copied with rsync -a old_dir/ new_dir/ will produce the same hash as the original. The hash includes the data of all files, so it is reliable regardless of file timestamps, etc.
Files within a directory are processed in alphabetically sorted order, so that hashes remain stable across directory reconstruction.
Relative pathnames are included in the hash, so that if a file is renamed, the hash will change.
By default, file modify timestamps, file owner/UID, and file group/GID are ignored for the purposes of hashing, so that directories cloned onto different systems will hash the same even if a different user owns the directory or UID/GID mappings are different. Options are provided to enable inclusion of these properties in the hash.
By default, file permission/mode bits (e.g., Read, Write, Execute) are included in the hash; this allows applications to recognize chmod operations as significant and requiring update.
In general, the default options produce a hash that changes under conditions similar to those in which git status would show a change.
The hashing function can be any filter command that takes a byte stream as input and produces a whitespace-free textual hash as output. Any output from the first whitespace on is stripped.
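The filter contract described above can be sketched in a few lines of Python. This is an illustration only, not the package's implementation: run_hash_filter and PY_SHA256_FILTER are hypothetical names, and the stand-in filter uses the Python interpreter to mimic the "digest, whitespace, filename" output style of sha256sum so that the sketch runs without external tools.

```python
import subprocess
import sys
from typing import Sequence

# Hypothetical stand-in filter: reads stdin, prints "<hex digest>  -",
# imitating the output layout of sha256sum.
PY_SHA256_FILTER = [
    sys.executable,
    "-c",
    "import hashlib, sys; "
    "print(hashlib.sha256(sys.stdin.buffer.read()).hexdigest() + '  -')",
]


def run_hash_filter(data: bytes, filter_cmd: Sequence[str]) -> str:
    """Feed data to filter_cmd on stdin and return the first token of its stdout."""
    result = subprocess.run(
        filter_cmd, input=data, stdout=subprocess.PIPE, check=True
    )
    # Keep only the text before the first whitespace (leading whitespace,
    # if any, is also skipped by split()).
    fields = result.stdout.decode("utf-8").split(maxsplit=1)
    if not fields:
        raise ValueError("hash filter produced no output")
    return fields[0]


print(run_hash_filter(b"hello", PY_SHA256_FILTER))
```

Any other command that reads a byte stream and prints a hash first could be passed as filter_cmd in the same way.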
file-collection-hash delegates all of the heavy lifting to two highly optimized native external commands, piped together:
- tar is used to render all included files and directories into a repeatable byte stream. Command options on tar are used to sort the input files and to hide variations in owner, group, modify timestamps, and permission bits as required. The output of tar is piped directly into the hashing filter.
- The hashing filter command (by default sha256sum) has its stdin piped directly from the tar output.
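To make the idea concrete, here is a pure-Python analogue of the pipeline, using the standard tarfile and hashlib modules instead of the external tar and sha256sum commands. It is a sketch under stated assumptions and will not produce the same hash values as file-collection-hash: sketch_tree_hash and normalize are hypothetical names, only regular files are archived (directories themselves are skipped), and the whole archive is buffered in memory for simplicity.

```python
import hashlib
import io
import os
import tarfile


def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Blank out owner, group, and mtime; permission bits are left intact."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mtime = 0
    return info


def sketch_tree_hash(root: str) -> str:
    """Hash a directory tree via a sorted, normalized in-memory tar stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # in-place sort fixes the order os.walk descends in
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # The relative path is stored in the archive, so renames
                # change the resulting hash.
                rel = os.path.relpath(full, root)
                tar.add(full, arcname=rel, recursive=False, filter=normalize)
    return hashlib.sha256(buf.getvalue()).hexdigest()
```

Two copies of a tree with different timestamps or owners hash the same under this sketch, while a rename, a content change, or a chmod changes the result. The real tool streams through native commands, which avoids holding the archive in memory.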
This package was originally developed as part of a solution to update .tar.gz files, triggering dependent actions, only when there is a material change in the content being bundled, ignoring differences in timestamp and file owner/group settings.
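That rebuild-only-on-change pattern might look roughly like the following. This is a hedged sketch, not code from the package: rebuild_if_changed, the stamp file, and the callback parameters are all hypothetical, and the hash source is passed in as a callable so the logic does not depend on any particular hashing function.

```python
from pathlib import Path
from typing import Callable


def rebuild_if_changed(
    compute_hash: Callable[[], str],
    stamp_file: Path,
    rebuild: Callable[[], None],
) -> bool:
    """Run rebuild() only when the current hash differs from the stored one."""
    current = compute_hash()
    previous = stamp_file.read_text().strip() if stamp_file.exists() else None
    if current == previous:
        return False  # nothing material changed; skip the dependent action
    rebuild()
    # Record the new hash only after rebuild() returned without raising.
    stamp_file.write_text(current + "\n")
    return True
```

With the package installed, compute_hash could be something like lambda: file_collection_hash('scripts'), and rebuild would be whatever regenerates the .tar.gz bundle.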
Installation
Prerequisites
Python: Python 3.7+ is required. See your OS documentation for instructions.
From PyPI
The current released version of file-collection-hash can be installed with
pip3 install file-collection-hash
From GitHub
Poetry is required; it can be installed with:
curl -sSL https://install.python-poetry.org | python3 -
Clone the repository and install file-collection-hash into a private virtualenv with:
cd <parent-folder>
git clone https://github.com/sammck/file-collection-hash.git
cd file-collection-hash
poetry install
You can then launch a bash shell with the virtualenv activated using:
poetry shell
Usage
Command Line
Example usage:
$ file-collection-hash --exclude=.git --exclude=.venv
a25f091c7de730931480a97243a15cfce7cd0fe07eee925749e5dc37a573237e
$ file-collection-hash -C scripts
f039c1016394986afb86436e58a3708fcd375789f95f178c7c340e29f01cf637
$ file-collection-hash -C scripts --no-ignore-owner --no-ignore-group
bb6d86071992c01336eaaa05cf2fdb64896b339f4fcf048cda45fa2c12aa7db6
$ cd scripts
$ file-collection-hash
f039c1016394986afb86436e58a3708fcd375789f95f178c7c340e29f01cf637
API
#!/usr/bin/env python3
import os
from file_collection_hash import file_collection_hash
print(file_collection_hash(exclude=['.git', '.venv']))
print(file_collection_hash('scripts'))
print(file_collection_hash('scripts', ignore_owner=False, ignore_group=False))
os.chdir('scripts')
print(file_collection_hash())
Known issues and limitations
- TBD.
Getting help
Please report any problems/issues here.
Contributing
Pull requests welcome.
License
file-collection-hash is distributed under the terms of the MIT License. The license applies to this file and other files in the GitHub repository hosting this file.
Authors and history
The author of file-collection-hash is Sam McKelvie.