Stable, fast hash of the content of a collection of files and directories, optionally including permissions, dates, etc.

file-collection-hash: Generate stable hash of a directory or collection of files

License: MIT

A Python command-line tool and callable function that can efficiently compute a repeatable hash string for the content of a directory or a collection of files.

Introduction

The Python package file-collection-hash provides a command-line tool as well as a runtime function to efficiently generate a stable content hash for a directory or collection of files. In general, a directory created with rsync -a old_dir/ new_dir/ will produce the same hash as the original. The hash includes the data of all files, so it is reliable regardless of file timestamps, etc.

Files within a directory are processed in alphabetically sorted order, so that hashes remain stable across directory reconstruction.

Relative pathnames are included in the hash, so that if a file is renamed, the hash will change.

By default, file modify timestamps, file owner/UID, and file group/GID are ignored for the purposes of hashing, so that directories cloned onto different systems will hash the same even if a different user owns the directory or UID/GID mappings are different. Options are provided to enable inclusion of these properties in the hash.

By default, file permission/mode bits (e.g., Read, Write, Execute) are included in the hash; this allows applications to recognize chmod operations as significant and requiring update.

In general, the default options produce a hash that changes under similar conditions to when git status would show a change.
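To make the default rules above concrete, here is a minimal pure-Python sketch of a hash with the same sensitivities. It is not how file-collection-hash is implemented (the package pipes tar into a hashing filter), and sketch_tree_hash is a hypothetical name used only for illustration. Its output will not match the tool's output.

```python
import hashlib
import os
import stat


def sketch_tree_hash(root: str, include_mode: bool = True) -> str:
    """Illustrative only: hash relative paths, mode bits, and file data in sorted order."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            # The relative path is hashed, so renaming a file changes the result.
            digest.update(rel.encode("utf-8") + b"\0")
            if include_mode:
                # Permission bits are hashed, so chmod changes the result.
                mode = stat.S_IMODE(os.stat(full).st_mode)
                digest.update(f"{mode:o}".encode("ascii") + b"\0")
            with open(full, "rb") as handle:
                digest.update(handle.read())
            digest.update(b"\0")
            # Timestamps, UID, and GID are never read, so they cannot affect the result.
    return digest.hexdigest()
```

Touching a file (changing only its modify time) leaves this sketch's result unchanged, while renaming it, editing it, or changing its mode bits produces a different hash.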

The hashing function can be any filter command that takes a byte stream as input and produces a whitespace-free textual hash as output. Any output from the first whitespace on is stripped.
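The stripping rule can be sketched in a few lines. The helper name extract_hash is hypothetical and the package's own code may differ; the sketch only shows the effect on output such as that of sha256sum, which prints the digest followed by whitespace and a file name marker.

```python
def extract_hash(filter_output: str) -> str:
    """Keep only the text before the first whitespace character."""
    parts = filter_output.split(None, 1)
    if not parts:
        raise ValueError("hashing filter produced no output")
    return parts[0]
```

For example, given the text "abc123  -" followed by a newline, the helper returns "abc123".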

file-collection-hash delegates all of the heavy lifting to two highly optimized native external commands, piped together:

  1. tar is used to render all included files and directories into a repeatable byte stream. Command options on tar are used to sort the input files and to hide variations in owner, group, modify timestamps, and permission bits as required. The output of tar is piped directly into the hashing filter.
  2. The hashing filter command (by default sha256sum) has its stdin piped directly from the tar output.
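The two-stage pipeline above might look roughly like the following sketch. The exact tar options that file-collection-hash passes are not listed in this document, so the flags below are an assumption: they are standard GNU tar options commonly used for reproducible archives. The function names build_tar_command and hash_directory are hypothetical.

```python
import subprocess
from typing import List, Sequence


def build_tar_command(directory: str = ".", exclude: Sequence[str] = ()) -> List[str]:
    """Build a GNU tar command that writes a repeatable archive to stdout."""
    cmd = [
        "tar",
        "--sort=name",      # stable member order (GNU tar 1.28 or later)
        "--owner=0",        # hide the file owner
        "--group=0",        # hide the file group
        "--numeric-owner",  # do not record user or group names
        "--mtime=@0",       # hide modify timestamps
    ]
    cmd += [f"--exclude={pattern}" for pattern in exclude]
    cmd += ["-C", directory, "-cf", "-", "."]
    return cmd


def hash_directory(
    directory: str = ".",
    exclude: Sequence[str] = (),
    hash_command: Sequence[str] = ("sha256sum",),
) -> str:
    """Pipe tar's output straight into the hashing filter and return the digest."""
    with subprocess.Popen(
        build_tar_command(directory, exclude), stdout=subprocess.PIPE
    ) as tar:
        hasher = subprocess.run(
            list(hash_command), stdin=tar.stdout, stdout=subprocess.PIPE, check=True
        )
    if tar.returncode != 0:
        raise RuntimeError(f"tar exited with status {tar.returncode}")
    # Keep only the text before the first whitespace, e.g. drop the trailing "  -".
    return hasher.stdout.decode().split(None, 1)[0]
```

Because tar's stdout is handed directly to the filter's stdin, the archive bytes never pass through Python, which is the reason for delegating to external commands in the first place.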

This package was originally developed as part of a solution to update .tar.gz files, triggering dependent actions, only when there is a material change in the content being bundled, ignoring differences in timestamp and file owner/group settings.
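As a rough illustration of that use case, the sketch below stores the last seen hash in a small stamp file and reports whether it has changed. The helper content_changed and the stamp file are hypothetical and are not part of this package.

```python
from pathlib import Path


def content_changed(current_hash: str, stamp_file: Path) -> bool:
    """Return True and record the new hash when it differs from the stored one."""
    previous = stamp_file.read_text().strip() if stamp_file.exists() else None
    if previous == current_hash:
        return False
    stamp_file.write_text(current_hash + "\n")
    return True
```

A build script could call content_changed(file_collection_hash('scripts'), Path('scripts.hash')) and regenerate the .tar.gz only when the result is True.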

Installation

Prerequisites

Python: Python 3.7+ is required. See your OS documentation for instructions.

From PyPI

The current released version of file-collection-hash can be installed with

pip3 install file-collection-hash

From GitHub

Poetry is required; it can be installed with:

curl -sSL https://install.python-poetry.org | python3 -

Clone the repository and install file-collection-hash into a private virtualenv with:

cd <parent-folder>
git clone https://github.com/sammck/file-collection-hash.git
cd file-collection-hash
poetry install

You can then launch a bash shell with the virtualenv activated using:

poetry shell

Usage

Command Line

Example usage:

$ file-collection-hash --exclude=.git --exclude=.venv
a25f091c7de730931480a97243a15cfce7cd0fe07eee925749e5dc37a573237e
$ file-collection-hash -C scripts
f039c1016394986afb86436e58a3708fcd375789f95f178c7c340e29f01cf637
$ file-collection-hash -C scripts --no-ignore-owner --no-ignore-group
bb6d86071992c01336eaaa05cf2fdb64896b339f4fcf048cda45fa2c12aa7db6
$ cd scripts
$ file-collection-hash
f039c1016394986afb86436e58a3708fcd375789f95f178c7c340e29f01cf637

API

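#!/usr/bin/env python3

import os
from file_collection_hash import file_collection_hash

# Hash the current directory, excluding .git and .venv
print(file_collection_hash(exclude=['.git', '.venv']))
# Hash the scripts subdirectory
print(file_collection_hash('scripts'))
# Same, but include file owner and group in the hash
print(file_collection_hash('scripts', ignore_owner=False, ignore_group=False))
# With no arguments, the current working directory is hashed
os.chdir('scripts')
print(file_collection_hash())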

Known issues and limitations

  • TBD.

Getting help

Please report any problems/issues on the project's GitHub repository: https://github.com/sammck/file-collection-hash

Contributing

Pull requests welcome.

License

file-collection-hash is distributed under the terms of the MIT License. The license applies to this file and other files in the GitHub repository hosting this file.

Authors and history

The author of file-collection-hash is Sam McKelvie.
