hashio

Custom file and directory checksum and verification tool.

Features

  • multiple hash algos: c4, crc32, md5, sha256, sha512, xxh64
  • supports multiple output options: json, txt and mhl
  • recursively runs checksums on files in directory trees
  • ignores predefined file name patterns
  • collects important file stat metadata
  • supports optional caching for repeated runs and snapshots

Installation

The easiest way to install:

$ pip install -U hashio

Note: Starting with hashio 0.4.0, hashio includes an optional SQLite-backed cache. If your Python installation does not include sqlite3, either:

  • Rebuild Python with SQLite support (e.g. libsqlite3-dev, sqlite-devel)
  • Or install an earlier version:
$ pip install 'hashio<0.4.0'

Usage

Recursively checksum and gather metadata for all the files in a directory tree, and write the results to a hash.json file:

$ hashio <PATH> -o hash.json [--algo ALGO]
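As a rough illustration of what a recursive checksum run involves, the following is a minimal sketch using only the Python standard library. It is not hashio's implementation: the function names hash_file and hash_tree and the shape of the resulting dict are assumptions made for this example.

```python
import hashlib
import json
import os


def hash_file(path, algo="sha256", buf_size=65536):
    """Return the hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.new(algo)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(buf_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root, algo="sha256"):
    """Walk root recursively and map each file path to its digest and size."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            manifest[path] = {
                algo: hash_file(path, algo),
                "size": os.path.getsize(path),
            }
    return manifest


def write_manifest(root, outfile, algo="sha256"):
    """Serialize the manifest as JSON (hypothetical layout, not hashio's)."""
    with open(outfile, "w") as out:
        json.dump(hash_tree(root, algo), out, indent=2)
```

Calling write_manifest("some/dir", "hash.json") would produce a JSON file loosely comparable in spirit to the command above, without the extra metadata, ignore rules, or parallelism that hashio provides.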

hashio supports .json, .txt or .mhl output formats:

$ hashio <PATH> -o hash.txt

If no output file is specified, hashio computes checksums without writing a manifest. Enable --cache to also store results in the configured cache at ${HASHIO_DB}.

Quick usage with uvx

You can run hashio instantly using uvx:

$ uvx hashio <PATH>

This downloads and runs hashio in a temporary, isolated environment, so there is nothing to install, no virtualenv to manage, and nothing to clean up afterwards. It is well suited to quick hash verification tasks on large directories.

Ignorable files

Note that files matching patterns defined in config.IGNORABLE will be skipped, unless using the --force option:

$ hashio .git
path is ignorable: .git
$ hashio .git --force
hashing: 894kB [00:02, 758kB/s, files/s=412.80]
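The skip-unless-forced behavior shown above can be approximated with glob-style matching. This is a sketch under stated assumptions: the pattern list below is made up for illustration (the real list lives in config.IGNORABLE), and is_ignorable and should_hash are hypothetical helpers, not hashio functions.

```python
import fnmatch
import os

# Hypothetical patterns for illustration only; see config.IGNORABLE for the real list.
IGNORABLE = [".git", "*.pyc", "__pycache__", ".DS_Store"]


def is_ignorable(path, patterns=IGNORABLE):
    """Return True if any component of path matches an ignore pattern."""
    parts = os.path.normpath(path).split(os.sep)
    return any(fnmatch.fnmatch(part, pat) for part in parts for pat in patterns)


def should_hash(path, force=False):
    """Mimic a --force style override: forced paths are never skipped."""
    return force or not is_ignorable(path)
```

With these definitions, should_hash(".git") is False while should_hash(".git", force=True) is True, mirroring the two commands above.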

Verify the paths in a previously generated hash file by comparing stored mtimes (if available), or by regenerating hash values if the mtimes are missing or different:

$ hashio --verify hash.json
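The mtime-first, hash-second check described above might look roughly like the sketch below. The entry layout ({"mtime": ..., "sha256": ...}) and the verify_entry helper are assumptions for illustration and may not match hashio's manifest format or internals.

```python
import hashlib
import os


def sha256_of(path):
    """Hash a file's bytes with SHA-256, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entry(path, entry):
    """Return True if path still matches a stored manifest entry.

    entry is a hypothetical dict such as {"mtime": 1700000000.0, "sha256": "..."}.
    """
    if not os.path.exists(path):
        return False
    stored_mtime = entry.get("mtime")
    # Fast path: an unchanged mtime is taken as evidence the file is unchanged.
    if stored_mtime is not None and stored_mtime == os.stat(path).st_mtime:
        return True
    # Slow path: mtime is missing or different, so recompute and compare the hash.
    return sha256_of(path) == entry.get("sha256")
```

The trade-off in this design is speed against strictness: trusting a matching mtime avoids rereading the file, at the cost of missing a change that preserved the timestamp.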

To hash the decompressed contents of .gz files instead of the archive bytes, use --uncompress. In this mode, manifest entries are written using the uncompressed filename:

$ hashio sample.txt.gz -o hash.json --uncompress
$ hashio --verify hash.json --uncompress

Note: --uncompress currently supports .gz files only, and bypasses the hash cache so compressed-byte hashes are not mixed with decompressed-content hashes.
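To see why the two kinds of hash must not be mixed, the sketch below hashes the same .gz file both ways using the standard gzip and hashlib modules. The helper names are hypothetical, and this is not how hashio itself is implemented.

```python
import gzip
import hashlib


def _digest(fh, algo):
    """Consume a binary file object in chunks and return its hex digest."""
    digest = hashlib.new(algo)
    for chunk in iter(lambda: fh.read(65536), b""):
        digest.update(chunk)
    return digest.hexdigest()


def hash_raw_bytes(path, algo="sha256"):
    """Hash the archive bytes exactly as stored on disk."""
    with open(path, "rb") as fh:
        return _digest(fh, algo)


def hash_gz_contents(path, algo="sha256"):
    """Hash the decompressed contents of a .gz file."""
    with gzip.open(path, "rb") as fh:
        return _digest(fh, algo)


def uncompressed_name(path):
    """sample.txt.gz -> sample.txt; other names are returned unchanged."""
    return path[:-3] if path.endswith(".gz") else path
```

The two digests differ for the same file, and the compressed-byte digest can also change when the same content is recompressed with different settings, which is one reason to hash the decompressed contents.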

Portability

To make a portable hash file, use -or to make the paths relative to the hash.json file:

$ hashio <DIR> -or hash.json

Or use --start to make them relative to the <START> value:

$ hashio <DIR> -o hash.json --start <START>
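The two ways of making paths portable can be pictured with os.path.relpath. This is only a sketch of the idea; the function names are hypothetical and hashio may compute relative paths differently.

```python
import os


def relative_to_manifest(path, manifest_file):
    """Express path relative to the directory containing the manifest file."""
    base = os.path.dirname(os.path.abspath(manifest_file))
    return os.path.relpath(os.path.abspath(path), base)


def relative_to_start(path, start):
    """Express path relative to an explicit start directory."""
    return os.path.relpath(os.path.abspath(path), os.path.abspath(start))


def resolve(rel_path, start):
    """Rebuild a usable path from a stored relative path and a new start dir."""
    return os.path.normpath(os.path.join(start, rel_path))
```

Because only relative paths are stored, the data can be moved or mounted elsewhere and resolved against a new start directory at verification time.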

To verify the data in the hash file, run hashio from the parent dir of the data, or set --start to the parent dir:

$ hashio --verify hash.json

Environment

The following environment variables are supported, and default settings are in the config.py module.

Variable          Description
BUF_SIZE          chunk size in bytes when reading files
HASHIO_ALGO       default hashing algorithm to use
HASHIO_DB         hashio cache db location
HASHIO_FILE       default hash file location
HASHIO_USE_CACHE  enable cache lookups/writes during hashing
HASHIO_IGNORABLE  comma-separated list of ignorable file patterns
LOG_LEVEL         logging level to use (DEBUG, INFO, etc.)
MAX_PROCS         max number of hash processes to spawn
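For readers wiring these variables into their own scripts, the sketch below shows one way such settings could be read. The fallback values and the accepted truthy strings for HASHIO_USE_CACHE are placeholders chosen for this example, not hashio's actual defaults or parsing rules (those are in config.py), and load_settings is a hypothetical helper.

```python
import os


def load_settings(env=os.environ):
    """Read hashio-style settings from a mapping of environment variables.

    All fallback values below are placeholders, not hashio's real defaults.
    """
    ignorable = env.get("HASHIO_IGNORABLE", "")
    return {
        "algo": env.get("HASHIO_ALGO", "xxh64"),
        # Assumption: treat 1/true/yes as enabled.
        "use_cache": env.get("HASHIO_USE_CACHE", "0").lower() in ("1", "true", "yes"),
        # Split the comma-separated pattern list, dropping empty items.
        "ignorable": [p.strip() for p in ignorable.split(",") if p.strip()],
        "max_procs": int(env.get("MAX_PROCS", os.cpu_count() or 1)),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```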

Optionally, modify the hashio.env file if using envstack, or create a new env file:

$ cp hashio.env debug.env
$ vi debug.env  # make edits
$ ./debug.env -- hashio

Metadata

By default hashio collects the following file metadata:

Key    Value
name   file name
atime  file access time (st_atime)
ctime  file creation time (st_ctime)
mtime  file modify time (st_mtime)
ino    file inode (st_ino)
dev    filesystem device (st_dev)
size   file size in bytes
type   path type - (f)ile or (d)irectory
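The keys in this table map directly onto fields of os.stat. As a hedged illustration (collect_metadata is a hypothetical helper, not part of hashio's API), the same record could be gathered like this:

```python
import os
import stat


def collect_metadata(path):
    """Return a dict with the same keys as the table above, built from os.stat."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "atime": st.st_atime,
        "ctime": st.st_ctime,
        "mtime": st.st_mtime,
        "ino": st.st_ino,
        "dev": st.st_dev,
        "size": st.st_size,
        "type": "d" if stat.S_ISDIR(st.st_mode) else "f",
    }
```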

To walk a directory and collect metadata without performing file checksums, use the "null" hash algo:

$ hashio <DIR> -a null

To make "null" the default, update ${HASHIO_ALGO} in the environment or the hashio.env file.

$ export HASHIO_ALGO=null

Cache File and Snapshots

hashio can maintain a local SQLite cache file (by default at ~/.cache/hashio/hash.sql) to store previously computed file hashes, metadata, and snapshot history. This speeds up repeated runs by reusing previously computed hashes, and makes it possible to diff snapshots.

Caching is disabled by default for hashing runs. Enable it with either:

$ hashio --cache <PATH>

or:

$ export HASHIO_USE_CACHE=1

Use --no-cache to override the environment and force it off for a run.
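The general idea behind a hash cache can be sketched with the standard sqlite3 module: reuse a stored hash when a file's path, mtime, and size are unchanged, and recompute otherwise. The table schema and function names below are invented for this example and are not hashio's actual cache schema.

```python
import hashlib
import os
import sqlite3


def open_cache(db_path=":memory:"):
    """Open (or create) a tiny cache database with a hypothetical schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT, mtime REAL, size INTEGER, algo TEXT, hash TEXT, "
        "PRIMARY KEY (path, algo))"
    )
    return conn


def cached_hash(conn, path, algo="sha256"):
    """Return (hash, was_cache_hit) for path, hashing only on a cache miss."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT hash FROM files WHERE path=? AND algo=? AND mtime=? AND size=?",
        (path, algo, st.st_mtime, st.st_size),
    ).fetchone()
    if row:
        return row[0], True
    # Cache miss: read the whole file at once for brevity in this sketch.
    with open(path, "rb") as fh:
        value = hashlib.new(algo, fh.read()).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
        (path, st.st_mtime, st.st_size, algo, value),
    )
    conn.commit()
    return value, False
```

On a second call for an unmodified file, the hash comes from the database and the file is not read again, which is where the time savings on repeated runs come from.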

Snapshots

Snapshots are point-in-time views. You can optionally record a snapshot of the current file state using:

$ hashio --snapshot SNAPSHOT_NAME

This links all scanned files to a snapshot named SNAPSHOT_NAME, allowing you to:

  • Track changes over time
  • Compare file states across points in time
  • Build file history for audit/debugging
  • Generate change reports (diffs)

Each snapshot is stored in the cache and contains only links to file metadata entries, so no file data is duplicated.

Diffing Snapshots

You can compare snapshots using:

$ hashio --diff SNAP1 SNAP2 [--start PATH]

This prints a summary of file-level changes between two snapshots:

+ file was added
- file was removed
~ file was modified
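The +, - and ~ markers correspond to a plain set comparison between two snapshots. In the sketch below each snapshot is modeled as a dict mapping path to hash, which is an assumption for illustration and not how hashio stores snapshots; diff_snapshots is a hypothetical helper.

```python
def diff_snapshots(old, new):
    """Compare two {path: hash} mappings and return a list of (marker, path).

    "+" means the path exists only in new, "-" only in old,
    and "~" means it exists in both with different hashes.
    """
    changes = []
    for path in sorted(set(old) | set(new)):
        if path not in old:
            changes.append(("+", path))
        elif path not in new:
            changes.append(("-", path))
        elif old[path] != new[path]:
            changes.append(("~", path))
    return changes
```

Unchanged files produce no entry, so an empty result means the two snapshots describe the same file state.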

Read Buffer Size Optimization

By default, hashio uses a fixed read buffer size set in config.py. This conservative default works well across most systems, but it can be suboptimal on filesystems with large block sizes (e.g. network-attached storage).

Optional: Enable Dynamic Buffer Sizing

Dynamic read buffer sizing can be enabled to reduce IOPS and improve performance on sequential reads. When enabled:

  • The filesystem block size is determined via os.statvfs(path).f_frsize.
  • If the block size is large (≥ 128 KiB), it is used directly.
  • If the block size is small, it is scaled up and clamped.
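The three rules above can be sketched as follows. Only the 128 KiB threshold comes from the text; the multiplier and the clamp bounds are hypothetical values chosen for this example, and the function names are not hashio's.

```python
import os

LARGE_BLOCK = 128 * 1024  # threshold named in the text above
SCALE = 16                # hypothetical multiplier for small block sizes
MIN_BUF = 64 * 1024       # hypothetical lower clamp
MAX_BUF = 1024 * 1024     # hypothetical upper clamp


def scale_block_size(block_size):
    """Turn a filesystem block size into a read buffer size."""
    if block_size >= LARGE_BLOCK:
        # Large blocks are used directly.
        return block_size
    # Small blocks are scaled up, then clamped into [MIN_BUF, MAX_BUF].
    return max(MIN_BUF, min(block_size * SCALE, MAX_BUF))


def dynamic_buf_size(path):
    """Look up the block size for path (os.statvfs is Unix-only) and scale it."""
    return scale_block_size(os.statvfs(path).f_frsize)
```

With these example constants, a typical 4 KiB block size yields a 64 KiB buffer, while a 256 KiB block size is used unchanged.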

Configuration

You can configure buffer size using the BUF_SIZE environment variable:

  • If BUF_SIZE is set to a positive integer, that value is used
  • If BUF_SIZE is 0 or a negative value, dynamic buffer sizing is enabled
  • If BUF_SIZE is unset, the default fixed size from config.py is used

This can be configured either:

  • In your hashio.env file (if using envstack), or
  • Directly in the environment

Examples

# use a fixed 512 KiB read buffer
export BUF_SIZE=524288

# enable dynamic sizing based on block size
export BUF_SIZE=0
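The BUF_SIZE rules listed under Configuration reduce to a small decision function. In this sketch DEFAULT_BUF_SIZE is a placeholder rather than the value in config.py, and resolve_buf_size is a hypothetical helper that returns None to signal that dynamic sizing should be used.

```python
import os

DEFAULT_BUF_SIZE = 65536  # placeholder; the real default is set in config.py


def resolve_buf_size(env=os.environ):
    """Return a fixed buffer size in bytes, or None to request dynamic sizing."""
    raw = env.get("BUF_SIZE")
    if raw is None:
        # Unset: fall back to the fixed default.
        return DEFAULT_BUF_SIZE
    value = int(raw)
    # Positive: use as given. Zero or negative: enable dynamic sizing.
    return value if value > 0 else None
```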

Python API

Generate a hash.json file for a given path (the default is the current working directory):

from hashio.worker import HashWorker

path = "."  # directory tree to hash
worker = HashWorker(path, outfile="hash.json")
worker.run()

Verify pre-generated checksums stored in a hash.json file:

from hashio.encoder import verify_checksums
for algo, value, miss in verify_checksums("hash.json"):
    print(f"{algo} {miss}")

Generate a checksum of a folder:

from hashio.encoder import checksum_folder, XXH64Encoder

folder = "."  # folder to checksum
encoder = XXH64Encoder()
value = checksum_folder(folder, encoder)
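For intuition about what a single checksum for a folder can mean, the sketch below folds each file's relative path and content digest into one outer digest, visiting files in a deterministic order. This is one common approach and an assumption on our part; it is not necessarily how checksum_folder or any hashio encoder works, and folder_digest is a hypothetical name.

```python
import hashlib
import os


def folder_digest(folder, algo="sha256"):
    """Combine relative paths and file digests into one digest for a folder."""
    outer = hashlib.new(algo)
    for dirpath, dirnames, filenames in os.walk(folder):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Use a separator-independent relative path so results are portable.
            rel = os.path.relpath(path, folder).replace(os.sep, "/")
            inner = hashlib.new(algo)
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(65536), b""):
                    inner.update(chunk)
            outer.update(rel.encode("utf-8"))
            outer.update(inner.digest())
    return outer.hexdigest()
```

Two folders with identical relative paths and contents produce the same value under this scheme, regardless of where they live on disk.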

Cache

Run the following command to safely apply the latest schema updates and create any missing indexes:

$ hashio --update-cache
