hashdir

A command line tool to calculate hashes of directory trees using various hash algorithms.

Installing

Warning: Due to an administrative issue with the PyPI account, the version currently hosted on PyPI is outdated. To get the latest version (0.25.0+), install directly from source or use the Docker image (ozancivaner/hashdir).

The recommended way to install hashdir is via pipx, which keeps the dependencies isolated.

From Source (Latest)

  1. Clone the repository: git clone https://github.com/user/hashdir.git && cd hashdir
  2. Install using pipx:
    • Linux: pipx install .
    • macOS: brew install pipx && pipx install .
    • Windows: pip install pipx && pipx install .

Optional Dependencies

To use the imohash algorithm (constant-time hashing for large files), install the optional extra:

pipx install ".[imohash]"

Using Docker

To use hashdir as a Docker container, pull the image:

docker pull ozancivaner/hashdir

Or, to build the image from source, run the following in the repository root directory:

docker build . --tag hashdir

To use, mount a local directory as a volume:

docker run -v "/path/to/local/dir:/data" ozancivaner/hashdir:latest /data/

Usage

usage: hashdir [-h] [-a {md5,sha1,imohash}] [--exclude EXCLUDE]
               [--log-level {debug,info,error}] [-q] [-v]
               [directory_or_file ...]

A command line tool to calculate hashes of directory trees using various hash
algorithms.

positional arguments:
  directory_or_file     directories or files to hash

options:
  -h, --help            show this help message and exit
  -a, --algorithm {md5,sha1,imohash}
                        the hashing algorithm for files. 'imohash' is optional
                        and provides constant-time hashing for large files,
                        but produces approximate results. See documentation
                        for installation.
  --exclude EXCLUDE     exclude a pattern, like .git/* or *.log
  --log-level {debug,info,error}
                        set the logging level.
  -q, --quiet           only output the final hash value.
  -v, --version         show program's version number and exit
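
When hashdir is driven from a script, it can help to assemble the command line in one place. The following is a minimal sketch that only uses the flags listed in the usage text above (-a, --exclude, -q); the helper name build_hashdir_command is hypothetical and not part of hashdir.

```python
from typing import List, Optional, Sequence

# Algorithm names taken from the usage text above.
SUPPORTED_ALGORITHMS = ("md5", "sha1", "imohash")


def build_hashdir_command(
    paths: Sequence[str],
    algorithm: str = "md5",
    exclude: Optional[str] = None,
    quiet: bool = False,
) -> List[str]:
    """Build an argv list for the hashdir CLI (hypothetical helper)."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    command = ["hashdir", "-a", algorithm]
    if exclude is not None:
        # A single pattern, such as ".git/*" or "*.log".
        command += ["--exclude", exclude]
    if quiet:
        # Only the final hash value is printed.
        command.append("-q")
    command.extend(paths)
    return command


print(build_hashdir_command(["./my_data"], exclude="*.log", quiet=True))
```

The resulting list can be handed to subprocess.run(command, capture_output=True, text=True, check=True), provided hashdir is installed and on the PATH.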

Using hashdir As a Library

You can use hashdir as a library by importing the hash_paths function. This is useful for verifying data integrity within automated workflows or larger applications.

from hashdir.core import hash_paths
from hashdir.algorithms import HashAlgorithm

# The hash_paths function returns a HashdirResult object
result = hash_paths(
    paths=["./my_data"],          # List of directory or file paths
    algorithm=HashAlgorithm.MD5,  # The algorithm for individual files (MD5, SHA1, or IMOHASH)
    excluded=["*.tmp", ".git/*"]  # Optional list of glob patterns to exclude
)

# Access the aggregate SHA1 hash of the entire path set
print(f"Aggregate Hash: {result.aggregate_hash}")

# Iterate over individual file results
# Entries are sorted primarily by hash and secondarily by path
for entry in result.entries:
    print(f"Path: {entry.rel_path}")
    print(f"Hash: {entry.file_hash}")
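
For integrity checks it is often useful to know which files differ between two runs, not just that the aggregate hash changed. The sketch below compares two lists of entries by the rel_path and file_hash attributes shown above. The function changed_paths is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real entries so the example runs without hashdir installed.

```python
from types import SimpleNamespace
from typing import Dict, Iterable, List


def changed_paths(old_entries: Iterable, new_entries: Iterable) -> Dict[str, List[str]]:
    """Report added, removed, and modified paths between two entry lists."""
    old = {entry.rel_path: entry.file_hash for entry in old_entries}
    new = {entry.rel_path: entry.file_hash for entry in new_entries}
    return {
        "added": sorted(path for path in new if path not in old),
        "removed": sorted(path for path in old if path not in new),
        "modified": sorted(
            path for path in new if path in old and old[path] != new[path]
        ),
    }


# Stand-in entries; with hashdir installed these would come from result.entries.
before = [
    SimpleNamespace(rel_path="a.txt", file_hash="111"),
    SimpleNamespace(rel_path="b.txt", file_hash="222"),
]
after = [
    SimpleNamespace(rel_path="a.txt", file_hash="999"),
    SimpleNamespace(rel_path="c.txt", file_hash="333"),
]
print(changed_paths(before, after))
```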

Algorithm

hashdir performs the following steps to produce a deterministic and stable aggregate hash:

  • Path Discovery & Normalization: Resolves input paths to absolute paths and prunes redundant entries (e.g., when both a directory and a file inside it are provided). All discovered paths and provided parameters are normalized to the NFKD Unicode form so that filenames are handled consistently across operating systems and locales.
  • Filtering: Recursively scans directories using os.scandir while skipping symlinks and applying exclusion patterns to both files and directories.
  • Hashing: Computes the hash for each file using the selected algorithm (md5 by default).
  • Summary Generation: Creates a summary string where each line contains the POSIX-style relative path and the file's hash, separated by a space.
  • Sorting: Entries are sorted primarily by their hash value and secondarily by their relative path. This ensures consistency regardless of filesystem traversal order.
  • Aggregation: Computes the final aggregate directory hash by applying the SHA1 algorithm to the entire summary string.
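
The steps above can be sketched with the standard library alone. This is an illustration of the described technique, not hashdir's implementation: the exact summary format (line separator, trailing newline), the pattern matching via fnmatch, and the names sketch_aggregate_hash, _walk, and _file_md5 are all assumptions, so the output is not expected to match hashdir's aggregate hash.

```python
import fnmatch
import hashlib
import os
import unicodedata
from typing import Iterable, List, Tuple


def _file_md5(path: str) -> str:
    """Hash one file with md5, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _walk(root: str, top: str, excluded: List[str]) -> List[Tuple[str, str]]:
    """Collect (file_hash, relative_path) pairs, skipping symlinks and exclusions."""
    entries = []
    with os.scandir(root) as scanner:
        for entry in scanner:
            if entry.is_symlink():
                continue  # symlinks are ignored during traversal
            # POSIX-style relative path, normalized to NFKD.
            rel = os.path.relpath(entry.path, top).replace(os.sep, "/")
            rel = unicodedata.normalize("NFKD", rel)
            if any(fnmatch.fnmatch(rel, pattern) for pattern in excluded):
                continue
            if entry.is_dir(follow_symlinks=False):
                entries.extend(_walk(entry.path, top, excluded))
            elif entry.is_file(follow_symlinks=False):
                entries.append((_file_md5(entry.path), rel))
    return entries


def sketch_aggregate_hash(directory: str, excluded: Iterable[str] = ()) -> str:
    """Return a SHA1 over the sorted 'path hash' summary (assumed format)."""
    top = os.path.abspath(directory)
    # Tuples sort by hash first, then by relative path.
    entries = sorted(_walk(top, top, list(excluded)))
    summary = "\n".join(f"{rel} {file_hash}" for file_hash, rel in entries)
    return hashlib.sha1(summary.encode("utf-8")).hexdigest()
```

Because the entries are sorted before the summary is built, the result does not depend on the order in which os.scandir returns directory entries.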

Assumptions and Limitations

  • Symlinks:
    • Direct Arguments: Passing a symlink as a direct input path is not supported and will raise a ValueError.
    • Traversal: hashdir ignores symlinks (both files and directories) encountered during recursive scanning to prevent infinite loops and ensure the hash reflects actual content.
  • Hardlinks: The tool does not perform inode-based deduplication. If multiple hardlinks to the same file exist within the scanned paths, each will be treated and hashed as a separate file entry.
  • Filesystems: hashdir assumes a standard local filesystem (POSIX or Windows). On specialized, virtual, or network filesystems (such as NFS or SMB), results may be unexpected if the filesystem handles metadata or traversal in non-standard ways.
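
Since symlinks are silently ignored during traversal, a pre-flight check can make it visible which paths will not contribute to the hash. The sketch below is one way to do that with the standard library; check_input_path and list_skipped_symlinks are hypothetical helpers, not part of hashdir, and the error message is illustrative.

```python
import os
from typing import List


def check_input_path(path: str) -> str:
    """Reject a symlink given as a direct argument, mirroring the documented rule."""
    if os.path.islink(path):
        raise ValueError(f"symlink input is not supported: {path}")
    return os.path.abspath(path)


def list_skipped_symlinks(directory: str) -> List[str]:
    """List symlinks (files and directories) found below a directory."""
    skipped = []
    # followlinks=False: symlinked directories are listed but never descended into.
    for root, dirs, files in os.walk(directory, followlinks=False):
        for name in dirs + files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                rel = os.path.relpath(full, directory).replace(os.sep, "/")
                skipped.append(rel)
    return sorted(skipped)
```

Running list_skipped_symlinks on a directory before hashing shows which paths are left out, which helps explain why two trees that look different through their links can still produce the same aggregate hash.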

Compatibility and Versioning

hashdir aims for deterministic and stable aggregate hashes for a given directory structure and content. However, changes in the underlying algorithm or implementation details can lead to different aggregate hashes across versions.

  • Pinned Regression Tests: The project maintains a suite of "pinned regression tests" (tests/test_core.py) that assert specific aggregate hash values for known directory structures and content. These tests serve as a contract, ensuring that future changes do not inadvertently alter the hash output for these specific scenarios.
  • Backwards Compatibility:
    • Version 0.25.0 vs. 0.24: The aggregate hash output generated by hashdir version 0.25.0 is not backwards compatible with version 0.24. This change was due to refinements in path normalization and file discovery logic to improve determinism and handle edge cases more robustly.
    • Users relying on hashdir for integrity checks across different versions should be aware of these potential breaking changes. It is recommended to re-baseline expected hashes when upgrading to a new major or minor version.
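
One way to act on this recommendation is to store the hashdir version next to each expected hash and compare hashes only within the same major.minor series. The following is a minimal sketch under that assumption; the baseline layout (a dict with "version" and "aggregate_hash" keys) and the names minor_series and check_baseline are hypothetical.

```python
from typing import Dict, Tuple


def minor_series(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '0.25.0' or '0.24'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_baseline(
    baseline: Dict[str, str], current_version: str, aggregate_hash: str
) -> str:
    """Return 'match', 'mismatch', or 'rebaseline' for a stored expected hash."""
    if minor_series(baseline["version"]) != minor_series(current_version):
        # Hashes from a different major or minor version are not comparable.
        return "rebaseline"
    if baseline["aggregate_hash"] == aggregate_hash:
        return "match"
    return "mismatch"


baseline = {"version": "0.24", "aggregate_hash": "abc123"}
print(check_baseline(baseline, "0.25.0", "abc123"))
```

With this check, an upgrade across a minor version is reported as a need to re-baseline rather than as an integrity failure.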

Contributing

Contributions are welcome! To set up your development environment and ensure code quality, follow these steps:

Setup

  1. Virtual Environment: Create and activate a virtual environment to isolate dependencies.
    python3 -m venv venv
    source venv/bin/activate  # On Windows use: venv\Scripts\activate
    
  2. Install Dependencies: Install the package in editable mode along with development tools.
    pip install -e ".[dev]"
    

Linting and Testing

Before submitting a Pull Request, please run the linting tools and tests:

  • Linting: Use the Makefile to run ruff.
    • Check for issues: make check
    • Automatically apply formatting fixes: make format
  • Testing: Run the test suite to verify your changes.
    make test
    
