
Simple dataset management

shelephant

Command-line arguments with a memory (stored in YAML-files).

Documentation: https://shelephant.readthedocs.io

Overview

Hallmark feature: Copy with restart

shelephant presents you with a way to copy files (from a remote, using SSH) in two steps:

  1. Collect a list of files that should be copied in a YAML-file, allowing you to review and customise the copy operation (e.g. by changing the order and making last-minute manual changes).
  2. Perform the copy, efficiently skipping files that are identical.
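As an illustration of step 1, the following is a minimal Python sketch of collecting filenames into a YAML list and choosing their order before anything is copied. It is not shelephant's implementation: the function name dump_file_list and the "smallest file first" ordering are hypothetical choices made for this example.

```python
from pathlib import Path


def dump_file_list(directory, pattern, output):
    """Write the files matching `pattern` as a YAML sequence, smallest file first.

    Hypothetical helper for illustration only; not part of shelephant.
    """
    directory = Path(directory)
    files = [p for p in directory.glob(pattern) if p.is_file()]
    # Customise the order: here, small files are listed (and later copied) first.
    files.sort(key=lambda p: (p.stat().st_size, p.name))
    names = [p.name for p in files]
    # One "- name" entry per line is a valid YAML sequence for simple filenames;
    # names with special characters would need quoting (not handled here).
    Path(output).write_text("".join(f"- {name}\n" for name in names))
    return names
```

Because the list is a plain text file, it can still be reviewed and edited by hand before the copy is started.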

Typical workflow:

# Collect files to copy & compute their checksum (e.g. on remote system)
# - creates "shelephant_dump.yaml"
shelephant_dump *.hdf5
# - reads "shelephant_dump.yaml"
# - creates "shelephant_checksum.yaml"
shelephant_checksum

# Combine all needed info (locally)
# - reads "shelephant_dump.yaml" and "shelephant_checksum.yaml"
# - creates "shelephant_hostinfo.yaml"
shelephant_hostinfo --host myhost --prefix /some/path --files --checksum

# Copy from remote (can be restarted at any time, existing files are skipped)
# - reads "shelephant_hostinfo.yaml"
shelephant_get

  • The filenames can be customised.
  • To copy to a remote system use shelephant_send.
  • Get details in the help of the respective commands, e.g. shelephant_dump --help.
  • shelephant works for both local and remote copy actions.

Command-line tools

File information

  • shelephant_dump: list filenames in a YAML file.
  • shelephant_checksum: get the checksums of files listed in a YAML file.
  • shelephant_hostinfo: collect host information (from a remote system).

File operations

  • shelephant_get: copy from remote, based on earlier stored information.
  • shelephant_send: copy to remote, based on earlier stored information.
  • shelephant_rm: remove files listed in a YAML file.
  • shelephant_cp: copy files listed in a YAML file.
  • shelephant_mv: move files listed in a YAML file.

YAML file operations

  • shelephant_extract: isolate a (number of) field(s) in a (new) YAML file.
  • shelephant_merge: merge two YAML-files.
  • shelephant_parse: parse a YAML-file and print it to the screen.

Disclaimer

This library is free to use under the MIT license. Any additions are very much appreciated, in terms of suggested functionality, code, documentation, testimonials, word-of-mouth advertisement, etc. Bug reports or feature requests can be filed on GitHub. As always, the code comes with no guarantee. None of the developers can be held responsible for possible mistakes.

Download: .zip file | .tar.gz file.

(c - MIT) T.W.J. de Geus (Tom) | tom@geus.me | www.geus.me | github.com/tdegeus/shelephant

Getting shelephant

Using conda

conda install -c conda-forge shelephant

This will also download and install all necessary dependencies.

Using PyPI

pip install shelephant

This will also download and install the necessary Python modules.

From source

# Download shelephant
git clone https://github.com/tdegeus/shelephant.git
cd shelephant

# Install
python -m pip install .

This will also download and install the necessary Python modules.

Detailed examples

Get files from remote, allowing restarts

Suppose that we want to copy all *.txt files from a certain directory /path/where/files/are/stored/on/remote on a remote host hostname.

First step, collect information on the host:

# connect to the host
ssh hostname

# go to the relevant location on the host
cd "/path/where/files/are/stored/on/remote"

# list files to copy
shelephant_dump -o files_to_copy.yaml *.txt

# optional but useful, get the checksum of the files to copy
shelephant_checksum -o files_checksum.yaml files_to_copy.yaml

# disconnect
exit # or press Ctrl + D

Second step, copy files to the local system, collecting everything in a single place:

# go to the relevant location on the local system
# (often this is a new directory)
cd "/path/where/to/copy/to"

# get the file-information compiled on the host
# and store in a (temporary) local file
# note that all paths are on the remote system,
# and that they are now copied using secure-copy (scp)
shelephant_hostinfo \
    -o remote_info.yaml \
    --host "hostname" \
    --prefix "/path/where/files/are/stored/on/remote" \
    --files "files_to_copy.yaml" \
    --checksum "files_checksum.yaml"

# finally, get the files using secure copy
# (the files are stored relative to the path of 'remote_info.yaml',
# identically to how they are relative to 'files_to_copy.yaml' on remote)
shelephant_get remote_info.yaml

If you use the default filenames for shelephant_dump (shelephant_dump.yaml) and shelephant_checksum (shelephant_checksum.yaml) remotely, you can also specify --files and --checksum without an argument.

A useful benefit of computing the checksums on the host is that shelephant_get can be stopped and restarted: only files that do not exist locally, or that were only partially copied (their checksum does not match the remotely computed checksum), are copied; all fully copied files are skipped.
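The restart logic described above can be sketched in Python as follows. This is a minimal illustration, not shelephant's actual code: the helper names sha256_of and needs_copy are hypothetical, and the use of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_copy(local_dir: Path, remote_checksums: dict[str, str]) -> list[str]:
    """List files that are missing locally or whose checksum differs.

    `remote_checksums` maps filename -> checksum computed on the host
    (a hypothetical in-memory form of the information in the YAML files).
    """
    todo = []
    for name, remote_sum in remote_checksums.items():
        local_path = local_dir / name
        # Missing file, or partial copy (checksum mismatch): copy (again).
        if not local_path.is_file() or sha256_of(local_path) != remote_sum:
            todo.append(name)
    return todo
```

Reading in chunks keeps memory use bounded for large files; only the names returned by needs_copy would have to be transferred on a restart.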

Let's further illustrate with a complete example. On the host, suppose that we have

/path/where/files/are/stored/on/remote
- foo.txt
- bar.txt

This gives files_to_copy.yaml:

- foo.txt
- bar.txt

files_checksum.yaml (for example):

- 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
- fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9

This information is collected in remote_info.yaml:

host: hostname
root: /path/where/files/are/stored/on/remote
files:
    - foo.txt
    - bar.txt
checksum:
    - 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
    - fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9

shelephant_get will now copy foo.txt and bar.txt relative to the directory of remote_info.yaml (in this case into the same folder as remote_info.yaml). It skips any local file whose filename and checksum match those of the corresponding remote file.
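To make the structure of remote_info.yaml concrete, the sketch below writes it as a Python dict and pairs each filename with the checksum at the same position in the list. Pairing by position is inferred from the example. The helper matches is hypothetical. The use of SHA-256 is an assumption: the 64-character digests are consistent with it, and they coincide with the SHA-256 digests of the strings foo and bar, so the sketch assumes the files contain exactly those strings.

```python
import hashlib

# Contents of remote_info.yaml from the example, written as a Python dict.
remote_info = {
    "host": "hostname",
    "root": "/path/where/files/are/stored/on/remote",
    "files": ["foo.txt", "bar.txt"],
    "checksum": [
        "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae",
        "fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9",
    ],
}

# The two lists are parallel: entry i of "checksum" belongs to entry i of "files".
if len(remote_info["files"]) != len(remote_info["checksum"]):
    raise ValueError("'files' and 'checksum' must have the same length")
expected = dict(zip(remote_info["files"], remote_info["checksum"]))


def matches(name: str, content: bytes) -> bool:
    """Check whether local content has the checksum recorded for `name`."""
    return hashlib.sha256(content).hexdigest() == expected[name]


print(matches("foo.txt", b"foo"))  # complete copy (assumed file content)
print(matches("bar.txt", b"ba"))   # truncated copy
```

A file for which matches returns True would be skipped on a restart; a truncated file fails the check and would be copied again.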

Avoid recomputing checksums

Suppose that we want to restart multiple times, or that we update the files present on the remote after copying them initially. In that case, we can use previously computed checksums to avoid recomputing them (which can be costly for large files).
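The idea of reading precomputed values where possible can be sketched in Python as below. This is an illustration under assumptions, not shelephant's implementation: the function checksums_with_cache is hypothetical, SHA-256 is assumed, and a cached value is trusted whenever the filename is known, so a file modified after its checksum was stored would not be detected by this sketch.

```python
import hashlib
from pathlib import Path


def checksums_with_cache(directory, names, precomputed):
    """Return ({name: checksum}, [names that had to be hashed]).

    `precomputed` maps filename -> previously computed checksum
    (a hypothetical in-memory form of a stored YAML file).
    """
    result = {}
    computed = []
    for name in names:
        if name in precomputed:
            # Reuse the stored value: no need to read the file again.
            result[name] = precomputed[name]
        else:
            data = (Path(directory) / name).read_bytes()
            result[name] = hashlib.sha256(data).hexdigest()
            computed.append(name)
    return result, computed
```

Only files that are new since the previous run are read from disk, which is where the saving for large files comes from.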

First step, update information on the host:

# connect to the host
ssh hostname

# go to the relevant location on the host
cd "/path/where/files/are/stored/on/remote"

# collect the previously computed information
shelephant_hostinfo -o precomputed_checksums.yaml -f files_to_copy.yaml -c files_checksum.yaml

# list files to copy
shelephant_dump -o files_to_copy.yaml *.txt

# get the checksum of the files to copy, where possible reading precomputed values
shelephant_checksum -o files_checksum.yaml files_to_copy.yaml -l precomputed_checksums.yaml

# disconnect
exit # or press Ctrl + D

Second step, copy files to the local system, collecting everything in a single place:

# go to the relevant location on the local system
# (often this is a new directory)
cd "/path/where/to/copy/to"

# collect the previously computed information
shelephant_hostinfo -o precomputed_checksums.yaml -f files_present.yaml -c files_checksum.yaml

# list files currently present locally
shelephant_dump -o files_present.yaml *.txt

# get the checksum of the files present locally, where possible reading precomputed values
shelephant_checksum -o files_checksum.yaml files_present.yaml -l precomputed_checksums.yaml

# combine local files and checksums
shelephant_hostinfo -o precomputed_checksums.yaml -f files_present.yaml -c files_checksum.yaml

# get the file-information compiled on the host [as before]
shelephant_hostinfo \
    -o remote_info.yaml \
    --host "hostname" \
    --prefix "/path/where/files/are/stored/on/remote" \
    --files "files_to_copy.yaml" \
    --checksum "files_checksum.yaml"

# get the files using secure copy
# use the precomputed checksums instead of computing them
shelephant_get remote_info.yaml --local "precomputed_checksums.yaml"

Send files to host

Basic copy

Suppose that we want to copy all *.txt files from a certain local directory /path/where/files/are/stored/locally, to a remote host hostname.

First, we will collect information locally:

# go to the relevant location (locally)
cd /path/where/files/are/stored/locally

# list files to copy
shelephant_dump -o files_to_copy.yaml *.txt

Then, we specify some basic information about the host:

# specify basic information about the host
# and store in a (temporary) local file
shelephant_hostinfo \
    -o remote_info.yaml \
    --host "hostname" \
    --prefix "/path/where/to/copy/to/on/remote"

Now we can copy the files:

shelephant_send files_to_copy.yaml remote_info.yaml

Restart

Suppose that copying was interrupted before completing. We can avoid recopying by again using the checksums. To do so, we need to know which files are already present remotely and what their checksums are:

# connect to the host
ssh hostname

# go to the relevant location on the host
cd "/path/where/to/copy/to/on/remote"

# list files already present on the remote
shelephant_dump -o files_to_copy.yaml *.txt

# get the checksum of those files
shelephant_checksum -o files_checksum.yaml files_to_copy.yaml

# disconnect
exit # or press Ctrl + D

Now we will complement the basic host-info:

shelephant_hostinfo \
    -o remote_info.yaml \
    --host "hostname" \
    --prefix "/path/where/to/copy/to/on/remote" \
    --files "files_to_copy.yaml" \
    --checksum "files_checksum.yaml"

And restart the partial copy:

shelephant_send files_to_copy.yaml remote_info.yaml
