A command and utility functions for making listings of file content hashcodes and manipulating directory trees based on such a hash index.
Project description
Latest release 20240412:
- file_checksum: skip any nonregular file, only use run_task when checksumming more than 1MiB.
- HashIndexCommand.cmd_rearrange: run the refdir index in relative mode.
- Small fixes.
This largely exists to solve my "what has changed remotely?" or "what has been filed where?" problems by comparing file trees using the files' content hashcodes.
This does require reading every file once to compute its hashcode, but the hashcodes (and file sizes and mtimes when read) are stored beside the file in .fstags files (see the cs.fstags module), so that a file does not need to be reread on subsequent comparisons.
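The caching idea can be sketched roughly as below. This is only an illustration and not the cs.fstags implementation: the function name cached_sha256 and the in-memory cache dict are hypothetical stand-ins for the data which the real module keeps in .fstags files.

```python
import hashlib
import os


def cached_sha256(fspath, cache):
    """Return the SHA-256 hex digest of the file at fspath.

    `cache` is a hypothetical dict mapping path -> (size, mtime, digest).
    The file is only reread if its size or mtime no longer match.
    """
    st = os.stat(fspath)
    entry = cache.get(fspath)
    if entry is not None and entry[:2] == (st.st_size, st.st_mtime):
        # Size and mtime unchanged: trust the stored digest.
        return entry[2]
    h = hashlib.sha256()
    with open(fspath, "rb") as f:
        # Read in 1MiB chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    cache[fspath] = (st.st_size, st.st_mtime, digest)
    return digest
```

The point of keeping the size and mtime alongside the digest is that a single stat call is enough to decide whether the expensive read can be skipped.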
hashindex knows how to invoke itself remotely using ssh (this does require hashindex to be installed on the remote host) and can thus be used to compare a local and remote tree, for example:
hashindex comm -1 localtree remotehost:remotetree
When you point hashindex at a remote tree, it uses ssh to run hashindex on the remote host, so all the content hashing is done locally to the remote host instead of copying files over the network.
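Once both listings are in hand, the comparison itself amounts to set operations on the hashcodes. A minimal sketch follows, assuming each listing is an iterable of (hashcode, fspath) pairs. The names index_by_hashcode and comm_by_hashcode are hypothetical, and listing the first tree's paths in mode 3 is an assumption made for the sketch.

```python
from collections import defaultdict


def index_by_hashcode(pairs):
    """Group (hashcode, fspath) pairs into {hashcode: [fspath, ...]},
    ignoring entries whose hashcode is None (unreadable files)."""
    index = defaultdict(list)
    for hashcode, fspath in pairs:
        if hashcode is not None:
            index[hashcode].append(fspath)
    return index


def comm_by_hashcode(pairs1, pairs2, mode):
    """Yield (hashcode, fspath) pairs according to mode:
    1: content only in pairs1, 2: content only in pairs2,
    3: content in both (paths taken from pairs1, an assumption)."""
    idx1 = index_by_hashcode(pairs1)
    idx2 = index_by_hashcode(pairs2)
    if mode == 1:
        source, keys = idx1, idx1.keys() - idx2.keys()
    elif mode == 2:
        source, keys = idx2, idx2.keys() - idx1.keys()
    elif mode == 3:
        source, keys = idx1, idx1.keys() & idx2.keys()
    else:
        raise ValueError("mode must be 1, 2 or 3")
    for hashcode in sorted(keys):
        for fspath in source[hashcode]:
            yield hashcode, fspath
```

Because the match is by content, a file which has merely been renamed or moved on one side counts as present on both sides.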
You can also use it to rearrange a tree based on the locations of corresponding files in another tree. Consider a media tree replicated between 2 hosts. If the source tree gets rearranged, the destination can be equivalently rearranged without copying the files, for example:
hashindex rearrange sourcehost:sourcetree localtree
If fstags mv was used to do the original rearrangement then the hashcodes will be copied to the new locations, saving a rescan of the source file. I keep a shell alias mv="fstags mv" so this is routine for me.
I have a backup script histbackup which works by making a hard link tree of the previous backup and rsyncing into it. It has long been subject to huge transfers if the source tree gets rearranged. Now it has a --hashindex option to get it to run a hashindex rearrange between the hard linking to the new backup tree and the rsync from the source to the new tree.
If network bandwidth is limited or subject to a quota, you can use the comparison function to prepare a list of files missing from the remote location and copy them to a transfer drive for carrying to the remote site when opportune. Example:
hashindex comm -1 -o '{fspath}' src rhost:dst \
| rsync -a --files-from=- src/ xferdir/
I've got a script pref-xfer which does this with some conveniences and sanity checks.
Function dir_filepaths(dirpath: str, *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x101bcd510>)
Generator yielding the filesystem paths of the files in dirpath.
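As a rough picture of such a walk, here is a sketch using only os.walk. It follows the release log's note that dot files, the .fstags file and nonregular files are skipped. The name walk_regular_files is hypothetical, and pruning dot directories is an assumption of the sketch rather than documented behaviour.

```python
import os


def walk_regular_files(dirpath):
    """Yield the paths of the regular files under dirpath.

    Names starting with a dot are skipped (which also skips .fstags),
    as are symlinks and anything else which is not a regular file.
    """
    for subdir, dirnames, filenames in os.walk(dirpath):
        # Assumption: do not descend into dot directories either.
        dirnames[:] = sorted(n for n in dirnames if not n.startswith("."))
        for name in sorted(filenames):
            if name.startswith("."):
                continue
            fspath = os.path.join(subdir, name)
            if os.path.islink(fspath) or not os.path.isfile(fspath):
                continue
            yield fspath
```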
Function dir_remap(srcdirpath: str, fspaths_by_hashcode: Mapping[cs.hashutils.BaseHashCode, List[str]], *, hashname: str)
Generator yielding (srcpath,[remapped_paths]) 2-tuples based on the hashcodes keying fspaths_by_hashcode.
Function file_checksum(fspath: str, hashname: str = 'sha256', *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x101bcd510>) -> Optional[cs.hashutils.BaseHashCode]
Return the hashcode for the contents of the file at fspath. Warn and return None on OSError.
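The "warn and return None" contract can be illustrated with plain hashlib. This is a sketch only: checksum_or_none is a hypothetical name, it returns a hex string where the real function returns a BaseHashCode instance, and the warning is just a line on standard error.

```python
import hashlib
import os
import stat
import sys


def checksum_or_none(fspath, hashname="sha256"):
    """Return the hex digest of the regular file at fspath,
    or None if it is not a regular file or cannot be read."""
    try:
        st = os.stat(fspath)
        if not stat.S_ISREG(st.st_mode):
            # Directories, devices and so on have no content hash here.
            return None
        h = hashlib.new(hashname)
        with open(fspath, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    except OSError as e:
        print(f"warning: {fspath}: {e}", file=sys.stderr)
        return None
    return h.hexdigest()
```

Returning None instead of raising lets a tree scan carry on past unreadable files.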
Function get_fstags_hashcode(fspath: str, hashname: str, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x101bcd510>) -> Tuple[Optional[cs.hashutils.BaseHashCode], Optional[os.stat_result]]
Obtain the hashcode cached in the fstags if still valid.
Return a 2-tuple of (hashcode,stat_result) where hashcode is a BaseHashCode subclass instance if valid, or None if missing or no longer valid, and stat_result is the current os.stat result for fspath.
Function hashindex(fspath: Union[str, io.TextIOBase, Tuple[Optional[str], str]], *, hashname: str, hashindex_exe: str, ssh_exe: str, relative: bool = False, **kw) -> Iterable[Tuple[Optional[cs.hashutils.BaseHashCode], Optional[str]]]
Generator yielding (hashcode,filepath) 2-tuples for the files in fspath, which may be a file or directory path.
Note that it yields (None,filepath) for files which cannot be accessed.
Class HashIndexCommand(cs.cmdutils.BaseCommand)
A tool to generate indices of file content hashcodes and to link or otherwise rearrange files to destinations based on their hashcode.
Command line usage:
Usage: hashindex subcommand...
Generate or process file content hash listings.
Subcommands:
comm {-1|-2|-3} {path1|-} {path2|-}
Compare the filepaths in path1 and path2 by content.
-1 List hashes and paths only present in path1.
-2 List hashes and paths only present in path2.
-3 List hashes and paths present in path1 and path2.
-e ssh_exe Specify the ssh executable.
-h hashname Specify the file content hash algorithm name.
-H hashindex_exe
Specify the remote hashindex executable.
-o output_format Default: '{hashcode} {fspath}'.
-r Emit relative paths in the listing.
help [-l] [subcommand-names...]
Print help for subcommands.
This outputs the full help for the named subcommands,
or the short help for all subcommands if no names are specified.
-l Long help even if no subcommand-names provided.
ls [options...] [[host:]path...]
Walk filesystem paths and emit a listing.
The default path is the current directory.
Options:
-e ssh_exe Specify the ssh executable.
-h hashname Specify the file content hash algorithm name.
-H hashindex_exe
Specify the remote hashindex executable.
-o output_format Default: '{hashcode} {fspath}'.
-r Emit relative paths in the listing.
This requires each path to be a directory.
rearrange [options...] {[[user@]host:]refdir|-} [[user@]rhost:]targetdir [dstdir]
Rearrange files from targetdir into dstdir based on their positions in refdir.
Options:
-e ssh_exe Specify the ssh executable.
-h hashname Specify the file content hash algorithm name.
-H hashindex_exe
Specify the remote hashindex executable.
--mv Move mode.
-n No action, dry run.
-o output_format Default: '{hashcode} {fspath}'.
-s Symlink mode.
Other arguments:
refdir The reference directory, which may be local or remote
or "-" indicating that a hash index will be read from
standard input.
targetdir The directory containing the files to be rearranged,
which may be local or remote.
dstdir Optional destination directory for the rearranged files.
Default is the targetdir.
It is taken to be on the same host as targetdir.
shell
Run a command prompt via cmd.Cmd using this command's subcommands.
HashIndexCommand.Options
Method HashIndexCommand.cmd_comm(self, argv, *, runstate: Optional[cs.resources.RunState] = <function <lambda> at 0x101a85d80>):
Usage: comm {-1|-2|-3} {path1|-} {path2|-}
Compare the filepaths in path1 and path2 by content.
-1 List hashes and paths only present in path1.
-2 List hashes and paths only present in path2.
-3 List hashes and paths present in path1 and path2.
-e ssh_exe Specify the ssh executable.
-h hashname Specify the file content hash algorithm name.
-H hashindex_exe
Specify the remote hashindex executable.
-o output_format Default: '{hashcode} {fspath}'.
-r Emit relative paths in the listing.
Method HashIndexCommand.cmd_ls(self, argv, *, runstate: Optional[cs.resources.RunState] = <function <lambda> at 0x101a85d80>):
Usage: ls [options...] [[host:]path...]
Walk filesystem paths and emit a listing.
The default path is the current directory.
Options:
-e ssh_exe Specify the ssh executable.
-h hashname Specify the file content hash algorithm name.
-H hashindex_exe
Specify the remote hashindex executable.
-o output_format Default: '{hashcode} {fspath}'.
-r Emit relative paths in the listing.
This requires each path to be a directory.
Method HashIndexCommand.cmd_rearrange(self, argv):
Usage: rearrange [options...] {[[user@]host:]refdir|-} [[user@]rhost:]targetdir [dstdir]
Rearrange files from targetdir into dstdir based on their positions in refdir.
Options:
-e ssh_exe Specify the ssh executable.
-h hashname Specify the file content hash algorithm name.
-H hashindex_exe
Specify the remote hashindex executable.
--mv Move mode.
-n No action, dry run.
-o output_format Default: '{hashcode} {fspath}'.
-s Symlink mode.
Other arguments:
refdir The reference directory, which may be local or remote
or "-" indicating that a hash index will be read from
standard input.
targetdir The directory containing the files to be rearranged,
which may be local or remote.
dstdir Optional destination directory for the rearranged files.
Default is the targetdir.
It is taken to be on the same host as targetdir.
Function localpath(fspath: str) -> str
Return a filesystem path modified so that it cannot be misinterpreted as a remote path such as user@host:path.
If fspath contains no colon (:) or is an absolute path or starts with ./ then it is returned unchanged.
Otherwise a leading ./ is prepended.
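The rule as described is small enough to restate directly in code. The following is a re-expression of the description above under the hypothetical name localpath_sketch, not the module's own source.

```python
import os.path


def localpath_sketch(fspath):
    """Return fspath, prefixed with "./" if it could otherwise be
    mistaken for a remote host:path specification."""
    if ":" not in fspath or os.path.isabs(fspath) or fspath.startswith("./"):
        # No colon, absolute, or already anchored: unambiguous.
        return fspath
    return "./" + fspath
```

For example, a local file literally named host:file becomes ./host:file, which still names the same file but no longer looks like a remote path.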
Function main(argv=None)
Commandline implementation.
Function merge(srcpath: str, dstpath: str, *, opname=None, hashname: str, move_mode: bool = False, symlink_mode=False, doit=False, quiet=False, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x101bcd510>)
Merge srcpath to dstpath.
If dstpath does not exist, move/link/symlink srcpath to dstpath.
Otherwise checksum their contents and raise FileExistsError if they differ.
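The decision logic described above might look something like this sketch. The name merge_sketch is hypothetical, the fstags handling is left out entirely, and doing nothing further when the contents already match is a simplification of mine.

```python
import hashlib
import os


def _digest(fspath):
    """Return the SHA-256 digest of the file's contents."""
    with open(fspath, "rb") as f:
        return hashlib.sha256(f.read()).digest()


def merge_sketch(srcpath, dstpath, *, move_mode=False, symlink_mode=False):
    """Place srcpath at dstpath, or verify that dstpath already
    holds the same content."""
    if not os.path.exists(dstpath):
        if move_mode:
            os.rename(srcpath, dstpath)
        elif symlink_mode:
            os.symlink(os.path.abspath(srcpath), dstpath)
        else:
            os.link(srcpath, dstpath)  # default: hard link
        return
    if _digest(srcpath) != _digest(dstpath):
        raise FileExistsError(f"{dstpath} exists with different content")
    # Same content already in place: nothing to do in this sketch.
```

Raising on a content mismatch means an existing destination file is never silently overwritten.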
Function paths_remap(srcpaths: Iterable[str], fspaths_by_hashcode: Mapping[cs.hashutils.BaseHashCode, List[str]], *, hashname: str)
Generator yielding (srcpath,fspaths) 2-tuples.
Function read_hashindex(f, start=1, *, hashname: str) -> Iterable[Tuple[Optional[cs.hashutils.BaseHashCode], Optional[str]]]
A generator which reads lines from the file f and yields (hashcode,fspath) 2-tuples for each line.
If there are parse errors the hashcode or fspath may be None.
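Given the default output format '{hashcode} {fspath}', parsing a listing comes down to splitting each line at the first space. A minimal sketch, assuming that format: parse_hashindex_lines is a hypothetical name, the hashcode is kept as a plain string instead of being turned into a BaseHashCode, and the hashcode strings in any example input are placeholders.

```python
def parse_hashindex_lines(lines):
    """Yield (hashcode_text, fspath) for each line of a listing.

    Either element is None when the corresponding part of the line
    is missing, loosely mirroring the documented parse-error behaviour.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Split at the first space only, so fspath may contain spaces.
        hashcode_text, _, fspath = line.partition(" ")
        yield (hashcode_text or None), (fspath or None)
```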
Function read_remote_hashindex(rhost: str, rdirpath: str, *, hashname: str, ssh_exe=None, hashindex_exe=None, relative: bool = False, check=True) -> Iterable[Tuple[Optional[cs.hashutils.BaseHashCode], Optional[str]]]
A generator which reads a hashindex of a remote directory.
This runs: hashindex ls -h hashname -r rdirpath on the remote host.
It yields (hashcode,fspath) 2-tuples.
Parameters:
- rhost: the remote host, or user@host
- rdirpath: the remote directory path
- hashname: the file content hash algorithm name
- ssh_exe: the ssh executable, default SSH_EXE_DEFAULT: 'ssh'
- hashindex_exe: the remote hashindex executable, default HASHINDEX_EXE_DEFAULT: 'hashindex'
- relative: optional flag, default False; if true pass '-r' to the remote hashindex ls command
- check: whether to check that the remote command has a 0 return code, default True
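To make the remote invocation concrete, here is a sketch of how such an ssh command line could be assembled. The helper remote_ls_argv is hypothetical, and the quoting approach is my own choice and not necessarily what the module does. ssh joins its command arguments with spaces and hands the result to the remote shell, so each remote argument is quoted first.

```python
import shlex


def remote_ls_argv(rhost, rdirpath, *, hashname,
                   ssh_exe="ssh", hashindex_exe="hashindex",
                   relative=False):
    """Return an argv list which runs `hashindex ls` on rhost via ssh."""
    remote_argv = [hashindex_exe, "ls", "-h", hashname]
    if relative:
        remote_argv.append("-r")
    remote_argv.append(rdirpath)
    # Quote each argument for the remote shell, then pass one string.
    remote_cmd = " ".join(shlex.quote(arg) for arg in remote_argv)
    return [ssh_exe, rhost, remote_cmd]
```

The resulting list could be handed to subprocess.Popen with stdout piped, and the output lines parsed as a hash index.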
Function rearrange(srcdirpath: str, rfspaths_by_hashcode, dstdirpath=None, *, hashname: str, move_mode: bool = False, symlink_mode=False, doit: bool, quiet: bool = False, fstags: cs.fstags.FSTags, runstate: Optional[cs.resources.RunState] = <function <lambda> at 0x101a85d80>)
Rearrange the files in srcdirpath according to the hashcode->[relpaths] mapping rfspaths_by_hashcode.
Parameters:
- srcdirpath: the directory whose files are to be rearranged
- rfspaths_by_hashcode: a mapping of hashcode to relative pathname to which the original file is to be moved
- dstdirpath: optional target directory for the rearranged files; defaults to srcdirpath, rearranging the files in place
- hashname: the file content hash algorithm name
- move_mode: move files instead of linking them
- symlink_mode: symlink files instead of linking them
- doit: if true do the link/move/symlink, otherwise just print
- quiet: default False; if true do not print
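The overall shape of a rearrangement can be sketched in two steps: work out where each file should go, then act on that plan (or only print it when doit is false). The names plan_rearrange and apply_plan and the hashcode_of callable are hypothetical. The sketch leaves existing destinations alone where the real function merges, and it assumes each hashcode maps to a single relative path when moving.

```python
import os


def plan_rearrange(srcdirpath, rfspaths_by_hashcode, hashcode_of,
                   dstdirpath=None):
    """Return a list of (srcpath, dstpath) pairs for the files under
    srcdirpath whose hashcode appears in rfspaths_by_hashcode."""
    if dstdirpath is None:
        dstdirpath = srcdirpath  # rearrange in place
    plan = []
    for subdir, _dirnames, filenames in os.walk(srcdirpath):
        for name in sorted(filenames):
            srcpath = os.path.join(subdir, name)
            for relpath in rfspaths_by_hashcode.get(hashcode_of(srcpath), ()):
                dstpath = os.path.join(dstdirpath, relpath)
                if os.path.abspath(dstpath) != os.path.abspath(srcpath):
                    plan.append((srcpath, dstpath))
    return plan


def apply_plan(plan, *, move_mode=False, doit=False):
    """Print each planned action and, if doit is true, perform it."""
    for srcpath, dstpath in plan:
        print("mv" if move_mode else "ln", srcpath, dstpath)
        if not doit or os.path.exists(dstpath):
            continue
        os.makedirs(os.path.dirname(dstpath), exist_ok=True)
        if move_mode:
            os.replace(srcpath, dstpath)
        else:
            os.link(srcpath, dstpath)
```

Building the whole plan before touching the filesystem avoids walking a tree while it is being modified.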
Function run_remote_hashindex(rhost: str, argv, *, ssh_exe=None, hashindex_exe=None, check: bool = True, doit: bool = None, quiet: Optional[bool] = None, options: Optional[cs.cmdutils.BaseCommandOptions] = <function uses_cmd_options.<locals>.<lambda> at 0x101c29bd0>, **subp_options)
Run a remote hashindex command.
Return the CompletedProcess result or None if doit is false.
Note that as with cs.psutils.run, the arguments are resolved via cs.psutils.prep_argv.
Parameters:
- rhost: the remote host, or user@host
- argv: the command line arguments to be passed to the remote hashindex command
- ssh_exe: the ssh executable, default SSH_EXE_DEFAULT: 'ssh'
- hashindex_exe: the remote hashindex executable, default HASHINDEX_EXE_DEFAULT: 'hashindex'
- check: whether to check that the remote command has a 0 return code, default True
- doit: whether to actually run the command, default True
Other keyword parameters are passed through to cs.psutils.run.
Function set_fstags_hashcode(fspath: str, hashcode, S: os.stat_result, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x101bcd510>)
Record hashcode against fspath.
Release Log
Release 20240412:
- file_checksum: skip any nonregular file, only use run_task when checksumming more than 1MiB.
- HashIndexCommand.cmd_rearrange: run the refdir index in relative mode.
- Small fixes.
Release 20240317:
- HashIndexCommand.cmd_ls: default to listing the current directory.
- HashIndexCommand: new -o output_format to allow outputting only hashcodes or fspaths.
- HashIndexCommand.cmd_comm: new -r (relative) option.
Release 20240316: Fixed release upload artifacts.
Release 20240305:
- HashIndexCommand.cmd_ls: support rhost:rpath paths, honour interrupts in the remote mode.
- HashIndexCommand.cmd_rearrange: new optional dstdir command line argument, passed to rearrange.
- merge: symlink_mode: leave identical symlinks alone, just merge tags.
- rearrange: new optional dstdirpath parameter, default srcdirpath.
Release 20240216:
- HashIndexCommand.cmdlinkto,cmd_rearrange: run the link/mv stuff with sys.stdout in line buffered mode.
- Do not get hashcodes from symlinks.
- HashIndexCommand.cmd_ls: ignore None hashcodes, do not set xit=1.
- New run_remote_hashindex() and read_remote_hashindex() functions.
- dir_filepaths: skip dot files, the fstags .fstags file and nonregular files.
Release 20240211.1: Better module docstring.
Release 20240211: Initial PyPI release: "hashindex" command and utility functions for listing file hashcodes and rearranging trees based on a hash index.
File details
Details for the file cs.hashindex-20240412.tar.gz.
File metadata
- Download URL: cs.hashindex-20240412.tar.gz
- Upload date:
- Size: 16.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 38c5db075c97d2e27f87aa83c9c7e0ef47be5d258c56675333b9dbec36aacb8d
MD5 | ec798ef74557a57ef59a1a25ec0fb42e
BLAKE2b-256 | c5f1ac0e2d60c387f05c03cfb8918535f72d1f5923bd08b3f3c68d0e170d42f1
File details
Details for the file cs.hashindex-20240412-py3-none-any.whl.
File metadata
- Download URL: cs.hashindex-20240412-py3-none-any.whl
- Upload date:
- Size: 15.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 552a47f73639eb6a88bfa7a57673544752a5ad54e87bc23fa5303135dba5eaf2
MD5 | aab6775d211abe8e48d26dc2833509f3
BLAKE2b-256 | fff1d7933a17717c80f844ee160ca529467525726ace463147b293f811df069d