Provides a set of utilities for comparing and backing up data on different filesystems

Snapshooter (fsspec folder backup and restore tooling)

Snapshooter is a tool to back up and restore a folder on an fsspec file system. It is built on the fsspec file system library, which provides a unified interface to various file systems (e.g. local, Azure, S3, ...).

Key features:

  • The backup side is split into two parts: the snapshots and the heap
    • Snapshots: A snapshot stores the file information as reported by the fsspec file system, with the following transformations:
      • The metadata object is converted to JSON
      • The name of each file is made relative to the src_root folder
      • An additional md5 field is added, containing the MD5 hash of the file contents. This allows files to be diffed efficiently, by comparing hashes rather than contents.
    • Heap: The heap folder stores each file's contents in a file named after the MD5 hash of those contents. Identical contents are therefore stored only once (deduplication).
  • Efficient incremental backups: Only files whose contents are not yet known to the heap are copied to the heap file system.
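
The snapshot and heap layout described above can be illustrated with a minimal sketch. It uses only the standard library on local paths, not the Snapshooter or fsspec API; the helper names (md5_of, make_snapshot) and the record fields other than name and md5 are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def make_snapshot(src_root: Path, heap_root: Path) -> list:
    """Copy unknown contents to the heap and return one record per file."""
    heap_root.mkdir(parents=True, exist_ok=True)
    records = []
    for path in sorted(p for p in src_root.rglob("*") if p.is_file()):
        digest = md5_of(path)
        heap_file = heap_root / digest
        if not heap_file.exists():
            # Incremental step: contents already in the heap are not copied again.
            heap_file.write_bytes(path.read_bytes())
        records.append({
            "name": path.relative_to(src_root).as_posix(),  # relative to src_root
            "size": path.stat().st_size,
            "md5": digest,
        })
    return records


with tempfile.TemporaryDirectory() as tmp:
    src, heap = Path(tmp, "src"), Path(tmp, "heap")
    (src / "sub").mkdir(parents=True)
    (src / "a.txt").write_text("hello")
    (src / "sub" / "b.txt").write_text("hello")  # same contents as a.txt
    (src / "c.txt").write_text("world")

    snapshot = make_snapshot(src, heap)
    heap_names = sorted(p.name for p in heap.iterdir())
    print(len(snapshot), "files recorded,", len(heap_names), "heap entries")
```

Three files are recorded, but the heap holds only two entries because a.txt and sub/b.txt share the same contents. Running make_snapshot a second time on an unchanged folder would copy nothing.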

Installation

pip install snapshooter

Usage with CLI

make a snapshot

snapshooter \
  --file-root tests/unit_test_data/sample_src \
  --heap-root tests/temp/sample_heap \
  --snap-root tests/temp/sample_snap \
  make-snapshot

restore the latest snapshot

snapshooter \
  --file-root tests/unit_test_data/restored_src \
  --heap-root tests/temp/sample_heap \
  --snap-root tests/temp/sample_snap \
  restore-snapshot
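
Conceptually, restoring amounts to recreating each recorded file from the heap entry named after its MD5 hash. The sketch below shows that idea on local paths with the standard library only. The restore function and the record layout are hypothetical, and it does not cover how the actual command treats files that already exist in the target folder.

```python
import hashlib
import tempfile
from pathlib import Path


def restore(records: list, heap_root: Path, file_root: Path) -> None:
    """Recreate each recorded file from the heap entry named after its MD5."""
    for record in records:
        target = file_root / record["name"]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes((heap_root / record["md5"]).read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    heap, restored = Path(tmp, "heap"), Path(tmp, "restored")
    heap.mkdir()
    content = b"hello"
    digest = hashlib.md5(content).hexdigest()
    (heap / digest).write_bytes(content)

    # Two snapshot records that point at the same heap entry.
    records = [
        {"name": "a.txt", "md5": digest},
        {"name": "sub/b.txt", "md5": digest},
    ]
    restore(records, heap, restored)
    restored_files = sorted(
        p.relative_to(restored).as_posix() for p in restored.rglob("*") if p.is_file()
    )
    restored_content = (restored / "sub" / "b.txt").read_bytes()
    print(restored_files)
```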

restore the latest snapshot before or at a given timestamp

snapshooter \
  --file-root tests/unit_test_data/restored_src \
  --heap-root tests/temp/sample_heap \
  --snap-root tests/temp/sample_snap \
  restore-snapshot \
  --latest 2021-09-01T00:00:00  

restore a specific snapshot

snapshooter \
  --file-root tests/unit_test_data/restored_src \
  --heap-root tests/temp/sample_heap \
  --snap-root tests/temp/sample_snap \
  restore-snapshot \
  --path tests/temp/sample_snap/2024/04/2024-04-10_14-30-40_086601Z.jsonl.gz  
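
Selecting the latest snapshot at or before a timestamp can be sketched as follows. It assumes that snapshot file names encode a UTC timestamp in the form seen in the path above (2024-04-10_14-30-40_086601Z.jsonl.gz); the format string and both helper functions are assumptions, not the tool's implementation. The second and third paths in the example are made up.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional

# Assumed from the example file name; not confirmed by the project.
FILENAME_FORMAT = "%Y-%m-%d_%H-%M-%S_%fZ"
SUFFIX = ".jsonl.gz"


def snapshot_time(path: str) -> datetime:
    """Parse the UTC timestamp encoded in a snapshot file name."""
    name = PurePosixPath(path).name
    stem = name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name
    return datetime.strptime(stem, FILENAME_FORMAT).replace(tzinfo=timezone.utc)


def latest_at_or_before(paths: list, limit: datetime) -> Optional[str]:
    """Return the newest snapshot whose timestamp is <= limit, or None."""
    candidates = [p for p in paths if snapshot_time(p) <= limit]
    return max(candidates, key=snapshot_time, default=None)


paths = [
    "tests/temp/sample_snap/2024/04/2024-04-10_14-30-40_086601Z.jsonl.gz",
    "tests/temp/sample_snap/2024/03/2024-03-01_08-15-00_000000Z.jsonl.gz",  # made up
    "tests/temp/sample_snap/2024/04/2024-04-12_09-00-00_000000Z.jsonl.gz",  # made up
]
limit = datetime(2024, 4, 11, tzinfo=timezone.utc)
chosen = latest_at_or_before(paths, limit)
print(chosen)
```

With this limit the snapshot from 2024-04-10 is chosen, since the one from 2024-04-12 is too recent.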

list snapshots

snapshooter \
  --file-root tests/unit_test_data/restored_src \
  --heap-root tests/temp/sample_heap \
  --snap-root tests/temp/sample_snap \
  list-snapshots

compare snapshots

snapshooter \
  --file-root tests/unit_test_data/sample_src \
  --heap-root tests/temp/sample_heap \
  --snap-root tests/temp/sample_snap \
  compare-snapshots \
  --path1 tests/temp/sample_snap/2024/04/2024-04-10_14-30-40_086601Z.jsonl.gz

The two snapshots to compare are identified by the options path1, latest1, path2, and latest2, which work in the same way as path and latest in the restore-snapshot command.
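
Because each snapshot record carries an md5 field, two snapshots can be compared without reading any file contents. The sketch below assumes that a snapshot file is gzip-compressed JSON Lines, as the .jsonl.gz extension suggests, with one record per line containing at least name and md5. The helper names are hypothetical, and the output format of the actual compare-snapshots command may differ.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_snapshot(path: Path, records: list) -> None:
    """Write records as gzip-compressed JSON Lines (one record per line)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_snapshot(path: Path) -> list:
    """Read a gzip-compressed JSON Lines file into a list of records."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def diff_snapshots(old: list, new: list) -> dict:
    """Classify file names as added, removed, or changed by comparing MD5s."""
    old_md5 = {r["name"]: r["md5"] for r in old}
    new_md5 = {r["name"]: r["md5"] for r in new}
    common = old_md5.keys() & new_md5.keys()
    return {
        "added": sorted(new_md5.keys() - old_md5.keys()),
        "removed": sorted(old_md5.keys() - new_md5.keys()),
        "changed": sorted(n for n in common if old_md5[n] != new_md5[n]),
    }


with tempfile.TemporaryDirectory() as tmp:
    first, second = Path(tmp, "first.jsonl.gz"), Path(tmp, "second.jsonl.gz")
    # Placeholder hash values, for illustration only.
    write_snapshot(first, [{"name": "a.txt", "md5": "aaa"}, {"name": "b.txt", "md5": "bbb"}])
    write_snapshot(second, [{"name": "a.txt", "md5": "a2a"}, {"name": "c.txt", "md5": "ccc"}])
    diff = diff_snapshots(read_snapshot(first), read_snapshot(second))
    print(diff)
```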

support for storage options (e.g. to pass credentials)

Example: Azure file system. See the accepted syntax in the adlfs documentation: https://pypi.org/project/adlfs/.

az login
snapshooter \
  --file-root az://file-container/file-root \
  --heap-root az://heap-container/heap-root \
  --snap-root az://snap-container/snap-root \
  --file-storage-options '{"account_name": "fileaccountname"}' \
  --heap-storage-options '{"account_name": "heapaccountname"}' \
  --snap-storage-options '{"account_name": "snapaccountname"}' \
  make-snapshot
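
Each storage-options value is a JSON object. The sketch below shows one plausible way such a value and its root could be interpreted before being handed to fsspec. It is not the tool's code: split_root and parse_storage_options are hypothetical helpers, and the sketch stops short of importing fsspec. In fsspec itself, a dictionary like this is typically passed as keyword arguments, for example fsspec.filesystem(protocol, **options).

```python
import json
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_root(root: str) -> Tuple[str, str]:
    """Split a root such as 'az://container/path' into (protocol, path).

    Roots without a scheme are treated as local paths (an assumption).
    """
    parts = urlsplit(root)
    if not parts.scheme:
        return "file", root
    return parts.scheme, parts.netloc + parts.path


def parse_storage_options(raw: Optional[str]) -> dict:
    """Parse a JSON object given on the command line into a dict."""
    options = json.loads(raw) if raw else {}
    if not isinstance(options, dict):
        raise ValueError("storage options must be a JSON object")
    return options


protocol, path = split_root("az://file-container/file-root")
options = parse_storage_options('{"account_name": "fileaccountname"}')
print(protocol, path, options)
```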

Usage with Python

See the CLI implementation in the project repository (jeromerg/snapshooter) for an example of how to use the Snapshooter class.

Supported file systems

The current version has been developed and tested with local and Azure file systems.

To support another file system, implement a single function and add it to the md5_getter_by_fs_protocol dictionary in fsspec_utils.py. The function takes the current metadata of a file and the latest snapshot as input, and returns the MD5 hash of the file contents if it can be determined without downloading the file. If the hash cannot be determined this way, the function should return None, and the file will be downloaded to compute it.
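
A getter for a new file system might look roughly like the sketch below. The exact signature expected by fsspec_utils.py is not documented here, so everything in it is an assumption: that the latest snapshot is available as a dictionary keyed by relative file name, that records carry size and mtime fields, and the protocol name myfs. The dictionary defined at the end is a local stand-in for the one in fsspec_utils.py.

```python
from typing import Callable, Dict, Optional

# Assumed shapes: a file's metadata is a dict, the snapshot maps name -> record.
Md5Getter = Callable[[dict, Dict[str, dict]], Optional[str]]


def md5_from_unchanged_file(file_info: dict, latest_snapshot: Dict[str, dict]) -> Optional[str]:
    """Reuse the MD5 from the latest snapshot if size and mtime are unchanged.

    Returns None when the hash cannot be determined without downloading.
    """
    previous = latest_snapshot.get(file_info["name"])
    if previous is None or file_info.get("mtime") is None:
        return None
    unchanged = (
        previous.get("size") == file_info.get("size")
        and previous.get("mtime") == file_info.get("mtime")
    )
    return previous.get("md5") if unchanged else None


# Local stand-in for the dictionary of the same name in fsspec_utils.py.
md5_getter_by_fs_protocol: Dict[str, Md5Getter] = {
    "myfs": md5_from_unchanged_file,  # "myfs" is a hypothetical protocol name
}

latest = {"a.txt": {"name": "a.txt", "size": 5, "mtime": 100.0, "md5": "abc"}}
same = md5_getter_by_fs_protocol["myfs"]({"name": "a.txt", "size": 5, "mtime": 100.0}, latest)
modified = md5_getter_by_fs_protocol["myfs"]({"name": "a.txt", "size": 6, "mtime": 200.0}, latest)
print(same, modified)
```

Where the remote file system already exposes a content MD5 in its metadata, the getter could return that value directly instead of comparing against the previous snapshot.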

Pull requests are welcome.
