Highly Parallel CoPy / HPC coPy: a simple script optimized for distributed file stores and NVMe/SSD storage media, for use in High Performance Computing environments.
hpcp
A simple script that can issue multiple cp -af commands simultaneously on a local system.
Optimized for use in HPC scenarios and featuring auto-tuning for files-per-process.
Includes an adaptive progress bar for copy tasks, provided by multiCMD.
Tested on a Lustre filesystem with 1.5 PB capacity running on 180 HDDs. Compared to using tar, hpcp reduced the time for tarball/image release from over 8 hours to under 10 minutes.
Development status
Basic functionality (parallel copy) should be stable.
Imaging functionality (source/destination as .img files) will be extended with differential image support (differential backups). Imaging is only available on Linux; it is similar to tar, but uses disk images.
Block-image functionality is in beta. Only available on Linux. Possible use case: cloning a currently running OS without mounting / as read-only.
hpcp.bat is available on GitHub: a simple, old Tk-based GUI intended for basic Windows functionality.
Important Implementation Detail
By default, hpcp only checks:
- The file’s relative path/name is identical.
- The file mtime is identical.
- The last -hs --hash_size bytes (defaults to 65536) are identical.
Although in most cases these checks should confirm that both files are identical, in certain scenarios (like bit rot), corrupted files might not be detected. If you need to verify file integrity rather than perform a quick sync, it is recommended to use the -fh --full_hash option.
Setting -hs --hash_size to 0 disables hash checks entirely. This can be helpful on HDDs, as they usually have suboptimal seek performance. However, HDDs are also more prone to bit rot. If the operator can accept that risk, it is possible to rely solely on mtime checks for file comparison by setting hash_size to 0. (Though on a single HDD, the standard cp command is already well-optimized.)
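The default check described above can be sketched in Python. This is illustrative only: the function name `quick_compare` is not part of hpcp, and the hash algorithm hpcp uses internally may differ.

```python
import hashlib
import os

def quick_compare(src: str, dst: str, hash_size: int = 65536) -> bool:
    """Approximate hpcp's default file comparison: identical mtime,
    then an identical hash of the last `hash_size` bytes.
    hash_size=0 skips the hash check entirely (mtime only)."""
    if os.stat(src).st_mtime != os.stat(dst).st_mtime:
        return False
    if hash_size == 0:
        return True

    def tail_digest(path: str) -> bytes:
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            f.seek(max(0, size - hash_size))  # hash only the tail
            return hashlib.blake2b(f.read()).digest()

    return tail_digest(src) == tail_digest(dst)
```

Reading only the tail keeps seeks to a minimum, which is why disabling the hash (or keeping it small) matters mainly on HDDs.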
Installation
pipx install hpcp
or
pip install hpcp
After installation, the hpcp command is available. You can check its version and bundled libraries via:
hpcp -V
It is recommended to install via pip, but hpcp can also function as a single script using Python’s default libraries.
Note:
- Installing via pip will optionally install the hashing library xxhash, which can reduce CPU usage for partial hashing and increase performance when using -fh --full_hash. pip also installs multiCMD, which is used to issue commands and provide helper functions. If multiCMD is not available, hpcp.py will fall back to its built-in multiCMD interface, which is more limited, has lower performance, and may have issues with files containing spaces. Please install multiCMD if possible.
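The optional xxhash dependency follows a common try-import-with-fallback pattern. A minimal sketch, assuming a hypothetical helper named `fast_hash` (not hpcp's actual API):

```python
import hashlib

# Prefer xxhash for speed; fall back to a stdlib hash when it is not
# installed. Both branches return a 16-character hex digest.
try:
    import xxhash

    def fast_hash(data: bytes) -> str:
        return xxhash.xxh64(data).hexdigest()
except ImportError:
    def fast_hash(data: bytes) -> str:
        return hashlib.blake2b(data, digest_size=8).hexdigest()
```

Either branch yields a deterministic digest, so the choice only affects speed, not correctness of the comparison.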
Disk Imaging Feature Note
Only available on Linux currently!
-dd --disk_dump mode differs from the standard Linux dd program. hpcp will try to mount the block device/image file to a temporary directory and perform a file-based copy to an identically-created image file specified with -di --dest_image. This functionality is implemented crudely and is still an alpha feature. It works on basic partition types (it does not work with LVM) with GPT partition tables and has been proven able to clone live running system disks to disk images, which can then be booted without issues.
The created disk image can be resized using the -ddr --dd_resize option to the desired size. (This feature is provided so that you can shrink the raw size of the resulting image and provides some shrink capability for XFS.)
For partitions for which hpcp cannot create a separate unique mount point, hpcp will fall back to using the Linux program dd to clone the drive. Note that this can be risky and can lead to broken filesystems if the drive is actively being written to. (However, since you generally cannot mount that partition on the current OS, the real-world scenarios for this remain limited.)
Remove Extra Feature Note
-rme --remove_extra: Especially when combined with -rf, PLEASE PAY CLOSE ATTENTION TO YOUR TARGET DIRECTORY!
--remove_extra will remove all files that are not in the source path. When you are copying a file into a folder, you almost certainly do not want to use this!
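Conceptually, --remove_extra deletes whatever exists under the destination but not under the source, compared by relative path. A sketch of how that set could be computed (illustrative only; `extra_files` is not part of hpcp):

```python
import os

def extra_files(src_root: str, dest_root: str) -> set:
    """Files present under dest_root but absent under src_root,
    keyed by relative path -- the set --remove_extra would delete."""
    def rel_paths(root: str) -> set:
        return {
            os.path.relpath(os.path.join(dirpath, name), root)
            for dirpath, _, names in os.walk(root)
            for name in names
        }
    return rel_paths(dest_root) - rel_paths(src_root)
```

This also shows why copying a single file into a populated folder with --remove_extra is dangerous: every other file in that folder is "extra" by this definition.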
Remove Feature Note
-rm --remove can remove files in bulk. This might be helpful on distributed file systems like Lustre, as it only gathers the file list once and performs bulk deletion rather than the default recursive deletion in the Linux rm program.
-rf --remove_force implies --remove. Use with care! This skips the interactive check requiring user confirmation before removing. If hpcp did not generate the correct file list from the specified source paths, hopefully you have fast enough reflexes to press Ctrl + C repeatedly to stop all parallel deletion processes when you realize the mistake.
-b --batch: Using -b with -rm will gather the file list for all source_paths first, then issue the remove command. This can be helpful because hpcp will tune its -f --files_per_job parameter accordingly for each task, and running one large remove job might be faster than running many small ones. This is especially useful when working with glob patterns like *.
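The gather-once-then-delete-in-bulk idea can be sketched as follows. This is an illustrative simplification, not hpcp's implementation; symlinked directories are not handled, and `bulk_remove` is a hypothetical name.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def bulk_remove(paths, max_workers=8):
    """Gather the full file list once, delete files in parallel,
    then remove the now-empty directories, deepest first."""
    files, dirs = [], []
    for root in paths:
        if os.path.isfile(root) or os.path.islink(root):
            files.append(root)
            continue
        for dirpath, _, filenames in os.walk(root):
            dirs.append(dirpath)
            files.extend(os.path.join(dirpath, name) for name in filenames)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(os.remove, files))
    # A child path is always longer than its parent path, so sorting
    # longest-first removes directories bottom-up.
    for d in sorted(dirs, key=len, reverse=True):
        os.rmdir(d)
```

On a distributed filesystem like Lustre, listing once and issuing deletions in parallel batches avoids the per-directory metadata round trips of a recursive rm.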
$ hpcp -h
usage: hpcp.py [-h] [-s] [-j MAX_WORKERS] [-b | -nb] [-v] [-do] [-nds] [-fh] [-hs HASH_SIZE] [-fpj FILES_PER_JOB] [-sfl SOURCE_FILE_LIST]
[-fl TARGET_FILE_LIST] [-cfl] [-dfl [DIFF_FILE_LIST]] [-tdfl] [-nhfl] [-rm] [-rf] [-rme] [-e EXCLUDE] [-x EXCLUDE_FILE]
[-nlt] [-V] [-pfl] [-si SRC_IMAGE] [-siff LOAD_DIFF_IMAGE] [-d DEST_PATH] [-rds] [-di DEST_IMAGE] [-dis DEST_IMAGE_SIZE]
[-diff] [-dd] [-ddr DD_RESIZE] [-L RATE_LIMIT] [-F FILE_RATE_LIMIT] [-tfs TARGET_FILE_SYSTEM] [-ncd]
[-ctl COMMAND_TIMEOUT_LIMIT] [-enes]
[src_path ...]
Copy files from source to destination
positional arguments:
src_path Source Path
options:
-h, --help show this help message and exit
-s, --single_thread Use serial processing
-j, -m, -t, --max_workers MAX_WORKERS
Max workers for parallel processing. Default is 4 * CPU count. Use negative numbers to indicate {n} * CPU count, 0
means 1/2 CPU count.
-b, --batch Batch mode, process all files in one go
-nb, --no_batch, --sequential
Do not use batch mode
-v, --verbose Verbose output
-do, --directory_only
Only copy directory structure
-nds, --no_directory_sync
Do not sync directory metadata, useful for verification
-fh, --full_hash Checks the full hash of files
-hs, --hash_size HASH_SIZE
Hash size in bytes, default is 65536. This means hpcp will only check the last 64 KiB of the file.
-fpj, --files_per_job FILES_PER_JOB
Base number of files per job, will be adjusted dynamically. Default is 1
-sfl, -lfl, --source_file_list SOURCE_FILE_LIST
Load source file list from file. The list is treated as raw paths, meaning files / folders are not expanded; entries are separated
by newlines. If --compare_file_list is specified, it will be used as the source for comparison
-fl, -tfl, --target_file_list TARGET_FILE_LIST
Specify the file_list file to store list of files in src_path to. If --compare_file_list is specified, it will be
used as targets for compare
-cfl, --compare_file_list
Only compare file lists. Use --file_list to specify an existing file list or specify the dest_path to compare
src_path with. When not used with a file list, hashes will be compared.
-dfl, --diff_file_list [DIFF_FILE_LIST]
Implies --compare_file_list, specify a file name to store the diff file list to or omit the value to auto-
determine.
-tdfl, --tar_diff_file_list
Generate a tar compatible diff file list. ( update / new files only )
-nhfl, --no_hash_file_list
Do not append hash to file list
-rm, --remove Remove all files and folders specified in src_path
-rf, --remove_force Remove all files without prompt
-rme, --remove_extra Remove all files and folders in dest_path that are not in src_path
-e, --exclude EXCLUDE
Exclude source files matching the pattern
-x, --exclude_file EXCLUDE_FILE
Exclude source files matching the pattern in the file
-nlt, --no_link_tracking
Do not copy files that symlinks point to.
-V, --version show program's version number and exit
-pfl, --parallel_file_listing
Use parallel processing for file listing
-si, --src_image SRC_IMAGE
Source Image, mount the image and copy the files from it.
-siff, --load_diff_image LOAD_DIFF_IMAGE
Not implemented. Load diff images and apply the changes to the destination.
-d, -C, --dest_path DEST_PATH
Destination Path
-rds, --random_dest_selection
Randomly select the destination path from the list of destination paths instead of filling round-robin. Can speed up
transfers if destinations are on different devices. Warning: large files may no longer fit if destinations are filled
up by smaller files.
-di, --dest_image DEST_IMAGE
Base name for destination image; creates an image file and copies the files into it.
-dis, --dest_image_size DEST_IMAGE_SIZE
Destination Image Size, specify the size of the destination image to split into. Default is 0 (No split). Example:
{10TiB} or {1G}
-diff, --get_diff_image
Not implemented. Compare the source and destination file lists and create a diff image that will update the
destination to match the source.
-dd, --disk_dump Disk-to-disk mirror; use this if you are backing up / deploying an OS from / to a disk. Requires one source (either 1
src_path or 1 -si src_image) and one -di dest_image. Note: dd is only actually used if hpcp is unable to mount / create
a partition.
-ddr, --dd_resize DD_RESIZE
Resize the destination image to the specified size with -dd. Applies to the biggest partition first. Specify multiple
-ddr to resize subsequently sized partitions. Example: {100GiB} or {200G}
-L, -rl, --rate_limit RATE_LIMIT
Approximately rate-limit the copy speed in bytes/second. Example: 10M for 10 MB/s, 1Gi for 1 GiB/s. Note: does not
work in single-thread mode. Default is 0: no rate limit.
-F, -frl, --file_rate_limit FILE_RATE_LIMIT
Approximately rate-limit the copy speed in files/second. Example: 10K for 10240 files/s, 1Mi for 1024*1024
files/s. Note: does not work in serial mode. Default is 0: no rate limit.
-tfs, --target_file_system TARGET_FILE_SYSTEM
Specify the target file system type. Will abort if the target file system type does not match. Example: ext4, xfs,
ntfs, fat32, exfat. Default is None: do not check target file system type.
-ncd, --no_create_dir
Ignore any destination folder that does not already exist. ( Will still copy if dest is a file )
-ctl, --command_timeout_limit COMMAND_TIMEOUT_LIMIT
Set the command timeout limit in seconds for external commands ( ex. cp / dd ). Default is 0: no timeout.
-enes, --exit_not_enough_space
Exit if there is not enough space on the destination instead of continuing. (Note: the default is to continue, since
a copy onto a compressed filesystem can succeed even if the source is bigger than the free space.)