Provides a set of utilities for comparing and backing up data on different filesystems
Snapshooter (fsspec folder backup and restore tooling)
Provides a set of utilities for diffing and syncing files between two fsspec file systems and performing efficient incremental backups.
Installation
pip install snapshooter
Usage
import fsspec
from snapshooter import Snapshooter
# Create a snapshooter object
snapshooter = Snapshooter(
    src_fs=fsspec.filesystem("file"),
    src_root="./data/restored",
    snap_fs=fsspec.filesystem("file"),
    snap_root="./data/snap",
    heap_fs=fsspec.filesystem("file"),
    heap_root="./data/heap",
)
# Generate a snapshot of the current state of the source file system
snapshot, timestamp = snapshooter.make_snapshot()
# As a result, the files are copied from the source file system to the heap file system and the snapshot is created in memory
# Save the snapshot to the snapshot file system
snapshooter._save_snapshot(snapshot, timestamp)
# Restore the snapshot from the snapshot file system to the source file system
snapshooter.restore_snapshot(snapshot)
with the following parameters:
- src_fs: The file system to be backed up.
- src_root: The root folder in the source file system to be backed up.
- snap_fs: The file system to store the snapshots in. The snapshots store the file information as provided by the fsspec file system. Two changes are applied:
  - The file name is changed to be relative to the src_root folder.
  - An additional md5 field is added, containing the md5 hash of the file contents. This allows for efficient diffing of files.
- snap_root: The root folder in the snapshot file system to store the snapshots in.
- heap_fs: The file system to store the heap in. The heap stores the contents of each file in a file named after the md5 hash of those contents. This allows for efficient deduplication of files.
- heap_root: The root folder in the heap file system to store the heap in.
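The two changes applied to each snapshot entry can be sketched as follows. This is a minimal illustration and not snapshooter's actual code: the helper name `to_snapshot_entry` is hypothetical, and a plain dict stands in for the info record an fsspec file system would return.

```python
import hashlib
import posixpath


def to_snapshot_entry(info: dict, src_root: str, content: bytes) -> dict:
    """Hypothetical helper: turn an fsspec-style info dict into a snapshot entry."""
    entry = dict(info)  # keep the fields reported by the file system
    # Change 1: make the file name relative to src_root.
    entry["name"] = posixpath.relpath(info["name"], src_root)
    # Change 2: add the md5 hash of the file contents.
    entry["md5"] = hashlib.md5(content).hexdigest()
    return entry


# Example info record (assumed shape, for illustration only).
info = {"name": "/data/restored/docs/a.txt", "size": 5, "type": "file"}
entry = to_snapshot_entry(info, "/data/restored", b"hello")
print(entry["name"])  # docs/a.txt
print(entry["md5"])   # 5d41402abc4b2a76b9719d911017c592
```

Because the name is relative and the md5 depends only on the contents, two entries can be compared without knowing where either tree is rooted.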
Supported file systems
The current version has been developed and tested with local and Azure file systems. If you have a use case for another file system, please look at get_md5_getter in fsspec_utils.py: you will need to implement a new FSSpecMD5Getter function for your file system and add it to the md5_getter_by_fs_protocol dictionary. Pull requests are welcome.
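As a rough idea of what such a getter might look like, here is a sketch that streams a file through md5 and registers the function under a protocol name. The signature `(fs, path) -> str`, the protocol name `"myfs"`, and the registry name `example_md5_getter_by_fs_protocol` are all assumptions made for illustration; the real FSSpecMD5Getter type and md5_getter_by_fs_protocol dictionary in fsspec_utils.py may look different, so check the source before copying this.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical getter type: (file system, path) -> hex md5.
# The real FSSpecMD5Getter signature in snapshooter may differ.
MD5Getter = Callable[[object, str], str]


def streaming_md5_getter(fs, path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 by reading the file in chunks through fs.open()."""
    digest = hashlib.md5()
    with fs.open(path, "rb") as f:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Illustrative registry keyed by protocol name ("myfs" is a made-up protocol).
example_md5_getter_by_fs_protocol: Dict[str, MD5Getter] = {
    "myfs": streaming_md5_getter,
}
```

Reading the whole file is the slowest possible getter; a file system that exposes a content hash in its metadata could return that instead.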
About the delta implementation
snapshooter.generate_snapshot tries to avoid copying files from the source file system to the heap file system by checking whether a file with the same md5 hash already exists in the heap file system. If such a file is found, the source file is not copied to the heap file system. This allows for efficient incremental backups.
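The check described above can be sketched with local paths. This is an illustration only: `store_in_heap` is a hypothetical name, it uses pathlib instead of fsspec, and it reads each file fully into memory, which the real implementation may well avoid.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Tuple


def store_in_heap(src_path: Path, heap_dir: Path) -> Tuple[str, bool]:
    """Copy src_path into heap_dir under its md5 name unless it is already there.

    Returns (md5_hex, copied), where copied is False when the heap already
    held a file with the same md5.
    """
    md5_hex = hashlib.md5(src_path.read_bytes()).hexdigest()
    heap_path = heap_dir / md5_hex
    if heap_path.exists():
        # Same contents are already stored: skip the copy.
        return md5_hex, False
    heap_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src_path, heap_path)
    return md5_hex, True
```

Two source files with identical contents end up as a single heap file, and a second backup run over unchanged files copies nothing.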
In the Azure file system, the md5 hash of the file contents is not always available. In that case, the file is downloaded the first time, and the md5 hash is calculated and stored in the snapshot. Subsequent calls to snapshooter.generate_snapshot then use the etag attribute of the file (which is always available) and compare it with the value in the previous snapshot: if the etag matches, the file is not downloaded and the md5 hash from the previous snapshot is reused. This keeps incremental backups efficient.
In the local file system, the file system does not provide an md5 hash at all. The same reuse of the previous snapshot applies here, but the mtime attribute is used instead of the etag attribute.
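The reuse rule for both cases can be sketched as a single function that takes the name of the change marker (etag or mtime). The function name `resolve_md5` and the dict shapes are assumptions for illustration, not snapshooter's API.

```python
import hashlib
from typing import Callable, Optional


def resolve_md5(
    info: dict,
    previous_entry: Optional[dict],
    change_key: str,
    compute_md5: Callable[[], str],
) -> str:
    """Reuse the previous md5 when the change marker (etag or mtime) is unchanged."""
    if (
        previous_entry is not None
        and change_key in info
        and previous_entry.get(change_key) == info[change_key]
    ):
        return previous_entry["md5"]
    # No previous entry, or the marker changed: read the file and hash it.
    return compute_md5()


calls = []


def expensive_md5() -> str:
    calls.append(1)  # record each time the contents would have to be read
    return hashlib.md5(b"new content").hexdigest()


previous = {"name": "a.txt", "etag": "0x1", "md5": "abc"}
unchanged = resolve_md5({"name": "a.txt", "etag": "0x1"}, previous, "etag", expensive_md5)
changed = resolve_md5({"name": "a.txt", "etag": "0x2"}, previous, "etag", expensive_md5)
```

Passing `"mtime"` as `change_key` gives the local file system variant; the expensive hash runs only for files whose marker differs from the previous snapshot.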
Hashes for snapshooter-0.0.27-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6e7e7c13ba027908754633795533cbfa2b78832a30bffce0e7bbd159ea1679d5
MD5 | a0a1307a3e56f767b21ca3242eab4798
BLAKE2b-256 | 38b55d587b12ad4ac83b42b1bc41d02ba3e889273ee1b1a7cc98992c02e8a6b2