A command and utility functions for making listings of file content hashcodes and manipulating directory trees based on such a hash index.

Latest release 20250528:

  • New hashindex_map(dirpath) function exposing the code to make a hashcode->[fspath,...] mapping.
  • New remote_rearrange(rhost,dstdir,fspaths_by_hashcode) function to rearrange a remote directory.
  • HashIndexCommand: new cmd_rsync() to rearrange a target then rsync to it.
  • HashIndexCommand.cmd_rearrange: honour "-" as the refdir to read the hash index from standard input.
  • HashIndexCommand.cmd_rearrange: default to move mode, change the CLI options to have --ln instead of --mv.
  • HashIndexCommand.cmd_rearrange: new -1 (once) option to only do a single file rename, handy for testing.
  • Redo almost the entire merge() function for clearer logic.

This largely exists to solve my "what has changed remotely?" or "what has been filed where?" problems by comparing file trees using the files' content hashcodes.

This does require reading every file once to compute its hashcode, but the hashcodes (and file sizes and mtimes when read) are stored beside the file in .fstags files (see the cs.fstags module), so that a file does not need to be reread on subsequent comparisons.
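The caching idea can be illustrated with a minimal sketch. It is not the cs.fstags implementation: the in-memory dict cache, the function names content_hash and cached_hash, and the use of sha256 from hashlib (instead of the blake3 default) are all assumptions made to keep the example standard-library only.

```python
import hashlib
import os


def content_hash(fspath, hashname="sha256", bufsize=1 << 20):
    """Hash the file contents in chunks so large files need not fit in memory."""
    h = hashlib.new(hashname)
    with open(fspath, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()


def cached_hash(fspath, cache, hashname="sha256"):
    """Return the content hash of fspath, rereading the file only if its
    size or mtime differ from the values recorded in cache (a plain dict)."""
    st = os.stat(fspath)
    key = (hashname, os.path.abspath(fspath))
    entry = cache.get(key)
    if entry and entry["size"] == st.st_size and entry["mtime"] == st.st_mtime:
        return entry["hash"]  # unchanged since last scan: no reread
    digest = content_hash(fspath, hashname)
    cache[key] = {"size": st.st_size, "mtime": st.st_mtime, "hash": digest}
    return digest
```

A persistent version would save the cache beside the files, which is the role the .fstags files play here.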

hashindex knows how to invoke itself remotely using ssh (this does require hashindex to be installed on the remote host) and can thus be used to compare a local and remote tree, for example:

hashindex comm -1 localtree remotehost:remotetree

When you point hashindex at a remote tree, it uses ssh to run hashindex on the remote host, so all the content hashing is done locally to the remote host instead of copying files over the network.
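A rough sketch of how such a remote invocation could be assembled follows. The helper name remote_ls_argv and its defaults are hypothetical; the remote command shape (hashindex ls -h hashname -r rdirpath) is taken from the read_remote_hashindex description in the module contents. Because ssh passes a single command string to the remote shell, each word is quoted with shlex.quote.

```python
import shlex


def remote_ls_argv(rhost, rdirpath, hashname, *, relative=False,
                   ssh_exe="ssh", hashindex_exe="hashindex"):
    """Build the local argv which runs `hashindex ls` on rhost over ssh.
    Hypothetical helper, not part of the module."""
    remote_argv = [hashindex_exe, "ls", "-h", hashname]
    if relative:
        remote_argv.append("-r")
    remote_argv.append(rdirpath)
    # The remote shell reparses the command string, so quote every word.
    remote_command = " ".join(shlex.quote(arg) for arg in remote_argv)
    return [ssh_exe, rhost, remote_command]
```

The resulting list can be handed to subprocess.run() and the remote listing read from its standard output.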

You can also use it to rearrange a tree based on the locations of corresponding files in another tree. Consider a media tree replicated between 2 hosts. If the source tree gets rearranged, the destination can be equivalently rearranged without copying the files, for example:

hashindex rearrange sourcehost:sourcetree localtree

If fstags mv was used to do the original rearrangement then the hashcodes will be copied to the new locations, saving a rescan of the source file. I keep a shell alias mv="fstags mv" so this is routine for me.
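The planning step behind a rearrange can be sketched as below, assuming both trees are already described by hashcode->[relpaths] mappings. This is a simplified illustration, not the module's rearrange() logic: the function name plan_moves is hypothetical, and it only pairs misplaced files with unfilled reference locations without touching the filesystem.

```python
def plan_moves(src_paths_by_hash, ref_paths_by_hash):
    """Return a list of (current_relpath, wanted_relpath) pairs which would
    make the source tree resemble the reference tree."""
    moves = []
    for hashcode, src_paths in src_paths_by_hash.items():
        ref_paths = ref_paths_by_hash.get(hashcode, [])
        # Files already where the reference has them stay put.
        misplaced = [p for p in src_paths if p not in ref_paths]
        # Reference locations for this content which are not yet populated.
        missing = [p for p in ref_paths if p not in src_paths]
        moves.extend(zip(misplaced, missing))
    return moves
```

Content which the reference tree does not know about produces no moves, so such files are left where they are.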

A common "backup to remote" use case of mine is addressed by:

hashindex rsync src dst

which rearranges dst based on src, then uses rsync(1) to update dst.

I have a backup script histbackup which works by making a hard link tree of the previous backup and rsyncing into it. It has long been subject to huge transfers if the source tree gets rearranged. Now it has a --hashindex option to get it to run a hashindex rearrange between the hard linking to the new backup tree and the rsync from the source to the new tree.

If network bandwidth is limited or subject to a quota, you can use the comparison function to prepare a list of files missing from the remote location and copy them to a transfer drive to carry to the remote site when convenient. Example:

hashindex comm -1 -o '{fspath}' src rhost:dst \
| rsync -a --files-from=- src/ xferdir/

I've got a script pref-xfer which does this with some conveniences and sanity checks.
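The comparison itself amounts to set operations on hashcodes. The sketch below assumes each tree is available as a list of (hashcode,fspath) 2-tuples, the shape yielded by hashindex(); the function name compare_by_content is hypothetical. Entries with a None hashcode (inaccessible files) are ignored, which is an assumption of this sketch.

```python
def compare_by_content(index1, index2):
    """Return (only1, only2, both) lists of paths, comparing by hashcode."""
    index1, index2 = list(index1), list(index2)
    hashes1 = {h for h, _ in index1 if h is not None}
    hashes2 = {h for h, _ in index2 if h is not None}
    # Paths whose content appears in only one of the trees.
    only1 = [p for h, p in index1 if h is not None and h not in hashes2]
    only2 = [p for h, p in index2 if h is not None and h not in hashes1]
    # Paths in the first tree whose content also exists in the second.
    both = [p for h, p in index1 if h in hashes2]
    return only1, only2, both
```

In the transfer example above, the only1 list corresponds to the files present in src but missing from rhost:dst.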

Short summary:

  • dir_filepaths: Generator yielding the filesystem paths of the files in dirpath.
  • dir_remap: Generator yielding (srcpath,[remapped_paths]) 2-tuples based on the hashcodes keying fspaths_by_hashcode.
  • file_checksum: Return the hashcode for the contents of the file at fspath. Warn and return None on OSError.
  • hashindex: Generator yielding (hashcode,filepath) 2-tuples for the files in src, which may be a file or a RemotePath or a (host,fspath) 2-tuple or a filesystem path. Note that this yields (None,filepath) for files which cannot be accessed.
  • hashindex_map: Construct a mapping of hashcodes to filesystem paths by walking dirpath.
  • HashIndexCommand: A tool to generate and use indices of file content hashcodes.
  • localpath: Return a filesystem path modified so that it cannot be misinterpreted as a remote path such as user@host:path.
  • main: Commandline implementation.
  • merge: Merge srcpath to dstpath, preserving dstpath if present. Return True if something was done, False if this was a no-op.
  • paths_remap: Generator yielding (srcpath,fspaths) 2-tuples.
  • read_hashindex: A generator which reads lines from the file f and yields (hashcode,fspath) 2-tuples for each line. If there are parse errors the hashcode or fspath may be None.
  • read_remote_hashindex: A generator which reads a hashindex of a remote directory. This runs: hashindex ls -h hashname -r rdirpath on the remote host. It yields (hashcode,fspath) 2-tuples.
  • rearrange: Rearrange the files in srcdirpath according to the hashcode->[relpaths] mapping rfspaths_by_hashcode.
  • remote_rearrange: Rearrange a remote directory srcdir on rhost into dstdir on rhost according to the hashcode mapping fspaths_by_hashcode.
  • run_remote_hashindex: Run a remote hashindex command. Return the CompletedProcess result or None if doit is false. Note that as with cs.psutils.run, the arguments are resolved via cs.psutils.prep_argv.

Module contents:

  • dir_filepaths(dirpath: str, *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x10fa9c0e0>): Generator yielding the filesystem paths of the files in dirpath.

  • dir_remap(srcdirpath: str, fspaths_by_hashcode: Mapping[cs.hashutils.BaseHashCode, List[str]], *, hashname: str): Generator yielding (srcpath,[remapped_paths]) 2-tuples based on the hashcodes keying fspaths_by_hashcode.

  • file_checksum(fspath: str, hashname: str = 'blake3', *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x10fa9c0e0>) -> Optional[cs.hashutils.BaseHashCode]: Return the hashcode for the contents of the file at fspath. Warn and return None on OSError.

  • hashindex(src: Union[io.TextIOBase, cs.fs.RemotePath, str, Tuple[Optional[str], str]], *, hashname: str, relative: bool = False, runstate: Optional[cs.resources.RunState] = <function uses_runstate.<locals>.<lambda> at 0x10fddc2c0>, **kw) -> Iterable[Tuple[Optional[cs.hashutils.BaseHashCode], Optional[str]]]: Generator yielding (hashcode,filepath) 2-tuples for the files in src, which may be a file or a RemotePath or a (host,fspath) 2-tuple or a filesystem path. Note that this yields (None,filepath) for files which cannot be accessed.

  • hashindex_map(dirpath: str, *, hashname: str, relative=False) -> dict[cs.hashutils.BaseHashCode, list[str]]: Construct a mapping of hashcodes to filesystem paths by walking dirpath.
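A minimal sketch of building such a hashcode->[fspath,...] mapping is shown here. It is an illustration only: the name hash_map is hypothetical, sha256 stands in for the blake3 default so that only the standard library is needed, and no .fstags caching is done. Dot files, symlinks and nonregular files are skipped, following the dir_filepaths behaviour noted in the release log.

```python
import hashlib
import os
from collections import defaultdict


def hash_map(dirpath, hashname="sha256", relative=False):
    """Walk dirpath and return a dict mapping content hash -> list of paths."""
    paths_by_hash = defaultdict(list)
    for dirname, dirnames, filenames in os.walk(dirpath):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            if name.startswith("."):
                continue  # skip dot files, including .fstags
            fspath = os.path.join(dirname, name)
            if os.path.islink(fspath) or not os.path.isfile(fspath):
                continue  # skip symlinks and nonregular files
            h = hashlib.new(hashname)
            with open(fspath, "rb") as f:
                while chunk := f.read(1 << 20):
                    h.update(chunk)
            if relative:
                fspath = os.path.relpath(fspath, dirpath)
            paths_by_hash[h.hexdigest()].append(fspath)
    return dict(paths_by_hash)
```

Files with identical content end up in the same list, which is what makes duplicate detection and tree rearrangement possible.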

  • Class HashIndexCommand(cs.cmdutils.BaseCommand): A tool to generate and use indices of file content hashcodes.

    Usage summary:

    Usage: hashindex [common-options...] subcommand [options...]
      A tool to generate and use indices of file content hashcodes.
      Subcommands:
        comm [common-options...] {-1|-2|-3|-r} {path1|-} {path2|-}
          Compare the filepaths in path1 and path2 by content.
        help [common-options...] [-l] [-s] [subcommand-names...]
          Print help for subcommands.
          This outputs the full help for the named subcommands,
          or the short help for all subcommands if no names are specified.
        info [common-options...] [field-names...]
          Recite general information.
          Explicit field names may be provided to override the default listing.
        ls [common-options...] [options...] [[host:]path...]
          Walk filesystem paths and emit a listing.
          The default path is the current directory.
          In quiet mode (-q) the hash indices are just updated
          and nothing is printed.
        rearrange [common-options...] {[[user@]host:]refdir|-} [[user@]rhost:]srcdir [dstdir]
          Rearrange files from srcdir into dstdir based on their positions in refdir.
          Arguments:
            refdir    The reference directory, which may be local or remote
                      or "-" indicating that a hash index will be read from
                      standard input.
            srcdir    The directory containing the files to be rearranged,
                      which may be local or remote.
            dstdir    Optional destination directory for the rearranged files.
                      Default is the srcdir.
        repl [common-options...]
          Run a REPL (Read Evaluate Print Loop), an interactive Python prompt.
        rsync [common-options...] [options] srcdir dstdir
          Rearrange dstdir according to srcdir then rsync srcdir into dstdir.
        shell [common-options...]
          Run a command prompt via cmd.Cmd using this command's subcommands.
    

HashIndexCommand.Options

HashIndexCommand.cmd_comm(self, argv, *, runstate: Optional[cs.resources.RunState] = <function uses_runstate.<locals>.<lambda> at 0x10fd9b560>): Usage: {cmd} {{-1|-2|-3|-r}} {{path1|-}} {{path2|-}} Compare the filepaths in path1 and path2 by content.

HashIndexCommand.cmd_ls(self, argv, *, runstate: Optional[cs.resources.RunState] = <function uses_runstate.<locals>.<lambda> at 0x10fd9b9c0>): Usage: {cmd} [options...] [[host:]path...] Walk filesystem paths and emit a listing. The default path is the current directory. In quiet mode (-q) the hash indices are just updated and nothing is printed.

HashIndexCommand.cmd_rearrange(self, argv): Usage: {cmd} {{[[user@]host:]refdir|-}} [[user@]rhost:]srcdir [dstdir] Rearrange files from srcdir into dstdir based on their positions in refdir. Arguments: refdir The reference directory, which may be local or remote or "-" indicating that a hash index will be read from standard input. srcdir The directory containing the files to be rearranged, which may be local or remote. dstdir Optional destination directory for the rearranged files. Default is the srcdir.

HashIndexCommand.cmd_rsync(self, argv, *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x10fa9c0e0>): Usage: {cmd} [options] srcdir dstdir Rearrange dstdir according to srcdir then rsync srcdir into dstdir.

HashIndexCommand.poppathspec(argv: List[str], name: str = 'dirspec', check_isdir=False) -> cs.fs.RemotePath: Pop a leading dirspec from argv, a filesystem path with an optional leading [user@]rhost: prefix. Return a (host,fspath) 2-tuple being the remote host (None if omitted) and the filesystem path. Raises GetoptError on a missing or invalid argument.

HashIndexCommand.run_context(self, *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x10fa9c0e0>, **kw): Sanity check the hashname, open the fstags.

  • localpath(fspath: str) -> str: Return a filesystem path modified so that it cannot be misinterpreted as a remote path such as user@host:path.

    If fspath contains no colon (:) or is an absolute path or starts with ./ then it is returned unchanged. Otherwise a leading ./ is prepended.
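The rule just described is small enough to restate as code. This is a sketch written from the description, not the module's source, and the name localpath_sketch is chosen to make that clear.

```python
import os.path


def localpath_sketch(fspath):
    """Return fspath in a form which cannot be read as [user@]host:path."""
    if ":" not in fspath or os.path.isabs(fspath) or fspath.startswith("./"):
        # No colon, or the path already starts with "/" or "./":
        # it cannot be mistaken for a remote path.
        return fspath
    # A relative path containing a colon: make it explicitly local.
    return "./" + fspath
```

For example, a local file literally named user@host:path becomes ./user@host:path, while plain/path and ./a:b are returned unchanged.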

  • main(argv=None): Commandline implementation.

  • merge(srcpath: str, dstpath: str, *, opname=None, hashname: str, move_mode: bool = False, symlink_mode=False, doit=False, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x10fa9c0e0>, quiet: bool) -> bool: Merge srcpath to dstpath, preserving dstpath if present. Return True if something was done, False if this was a no-op.

    This is aimed at situations such as merging downloads with an existing corpus, which might have hard links etc, so dstpath is the important half of the pair.

    NB: symlink_mode is currently disabled.

    If dstpath does not exist, move/link/symlink srcpath to dstpath. This also merges the fstags from srcpath to dstpath.

    If dstpath exists, checksum the contents of both files and raise FileExistsError if they differ. Otherwise the files have the same content, and the merge preserves dstpath.
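The decision logic can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's merge(): the name merge_sketch is hypothetical, sha256 stands in for the configured hash, fstags merging and symlink mode are omitted, and removing a duplicate srcpath in move mode is this sketch's reading of "preserving dstpath". With doit false, the return value reports what would have been done.

```python
import hashlib
import os
import shutil


def _digest(fspath, hashname):
    """Return the hex digest of the file contents, read in chunks."""
    h = hashlib.new(hashname)
    with open(fspath, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    return h.hexdigest()


def merge_sketch(srcpath, dstpath, *, hashname="sha256",
                 move_mode=False, doit=False):
    """Merge srcpath into dstpath, never replacing an existing dstpath."""
    if os.path.exists(dstpath):
        if os.path.samefile(srcpath, dstpath):
            return False  # already the same file: no-op
        if _digest(srcpath, hashname) != _digest(dstpath, hashname):
            raise FileExistsError(
                f"{dstpath!r} exists with different content from {srcpath!r}")
        # Same content: dstpath wins.
        if not move_mode:
            return False  # link mode: nothing needs to change
        if doit:
            os.remove(srcpath)  # assumption: drop the now redundant source
        return True
    if doit:
        os.makedirs(os.path.dirname(dstpath) or ".", exist_ok=True)
        if move_mode:
            shutil.move(srcpath, dstpath)
        else:
            os.link(srcpath, dstpath)  # hard link, leaving srcpath in place
    return True
```

Checking content before doing anything is what makes it safe to merge downloads into an existing corpus: a differing dstpath raises an error and is never overwritten.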

  • paths_remap(srcpaths: Iterable[str], fspaths_by_hashcode: Mapping[cs.hashutils.BaseHashCode, List[str]], *, hashname: str): Generator yielding (srcpath,fspaths) 2-tuples.

  • read_hashindex(f, start=1, *, hashname: str) -> Iterable[Tuple[Optional[cs.hashutils.BaseHashCode], Optional[str]]]: A generator which reads lines from the file f and yields (hashcode,fspath) 2-tuples for each line. If there are parse errors the hashcode or fspath may be None.

  • read_remote_hashindex(rhost: str, rdirpath: str, *, hashname: str, quiet=True, ssh_exe: str, hashindex_exe: str, relative: bool = False) -> Iterable[Tuple[Optional[cs.hashutils.BaseHashCode], Optional[str]]]: A generator which reads a hashindex of a remote directory. This runs: hashindex ls -h hashname -r rdirpath on the remote host. It yields (hashcode,fspath) 2-tuples.

    Parameters:

    • rhost: the remote host, or user@host
    • rdirpath: the remote directory path
    • hashname: the file content hash algorithm name
    • ssh_exe: optional ssh command
    • hashindex_exe: the remote hashindex executable
    • relative: optional flag, default False; if true pass '-r' to the remote hashindex ls command
    • check: whether to check that the remote command has a 0 return code, default True

  • rearrange(srcdirpath: str, rfspaths_by_hashcode, dstdirpath: str | None = None, *, hashname: str, move_mode: bool = False, once: bool = False, symlink_mode=False, doit: bool, fstags: cs.fstags.FSTags, runstate: Optional[cs.resources.RunState] = <function uses_runstate.<locals>.<lambda> at 0x10fddcf40>, quiet: bool): Rearrange the files in srcdirpath according to the hashcode->[relpaths] mapping rfspaths_by_hashcode.

    Parameters:

    • srcdirpath: the directory whose files are to be rearranged
    • rfspaths_by_hashcode: a mapping of hashcode to relative pathname to which the original file is to be moved
    • dstdirpath: optional target directory for the rearranged files; defaults to srcdirpath, rearranging the files in place
    • hashname: the file content hash algorithm name
    • move_mode: move files instead of linking them
    • symlink_mode: symlink files instead of linking them
    • doit: if true do the link/move/symlink, otherwise just print

  • remote_rearrange(rhost: str, srcdir: str, dstdir: str, fspaths_by_hashcode: Mapping[cs.hashutils.BaseHashCode, List[str]], *, doit: bool, hashindex_exe: str, hashname: str, move_mode: bool, once: bool, quiet: bool, symlink_mode: bool): Rearrange a remote directory srcdir on rhost into dstdir on rhost according to the hashcode mapping fspaths_by_hashcode.

  • run_remote_hashindex(rhost: str, argv, *, hashindex_exe: str, **subp_options): Run a remote hashindex command. Return the CompletedProcess result or None if doit is false. Note that as with cs.psutils.run, the arguments are resolved via cs.psutils.prep_argv.

    Parameters:

    • rhost: the remote host, or user@host
    • argv: the command line arguments to be passed to the remote hashindex command
    • check: whether to check that the remote command has a 0 return code, default True

    Other keyword parameters are passed through to cs.psutils.run.

Release Log

Release 20250528:

  • New hashindex_map(dirpath) function exposing the code to make a hashcode->[fspath,...] mapping.
  • New remote_rearrange(rhost,dstdir,fspaths_by_hashcode) function to rearrange a remote directory.
  • HashIndexCommand: new cmd_rsync() to rearrange a target then rsync to it.
  • HashIndexCommand.cmd_rearrange: honour "-" as the refdir to read the hash index from standard input.
  • HashIndexCommand.cmd_rearrange: default to move mode, change the CLI options to have --ln instead of --mv.
  • HashIndexCommand.cmd_rearrange: new -1 (once) option to only do a single file rename, handy for testing.
  • Redo almost the entire merge() function for clearer logic.

Release 20241207: Mostly CLI usage improvements.

Release 20241007: Small internal changes.

Release 20240709:

  • Require blake3 and use it as the default hash algorithm.
  • Some internal improvements.

Release 20240623: hashindex: plumb hashname to file_checksum.

Release 20240412:

  • file_checksum: skip any nonregular file, only use run_task when checksumming more than 1MiB.
  • HashIndexCommand.cmd_rearrange: run the refdir index in relative mode.
  • Small fixes.

Release 20240317:

  • HashIndexCommand.cmd_ls: default to listing the current directory.
  • HashIndexCommand: new -o output_format to allow outputting only hashcodes or fspaths.
  • HashIndexCommand.cmd_comm: new -r (relative) option.

Release 20240316: Fixed release upload artifacts.

Release 20240305:

  • HashIndexCommand.cmd_ls: support rhost:rpath paths, honour interrupts in the remote mode.
  • HashIndexCommand.cmd_rearrange: new optional dstdir command line argument, passed to rearrange.
  • merge: symlink_mode: leave identical symlinks alone, just merge tags.
  • rearrange: new optional dstdirpath parameter, default srcdirpath.

Release 20240216:

  • HashIndexCommand.cmdlinkto,cmd_rearrange: run the link/mv stuff with sys.stdout in line buffered mode.
  • Do not get hashcodes from symlinks.
  • HashIndexCommand.cmd_ls: ignore None hashcodes, do not set xit=1.
  • New run_remote_hashindex() and read_remote_hashindex() functions.
  • dir_filepaths: skip dot files, the fstags .fstags file and nonregular files.

Release 20240211.1: Better module docstring.

Release 20240211: Initial PyPI release: "hashindex" command and utility functions for listing file hashcodes and rearranging trees based on a hash index.
