
gs_fastcopy (python)

Optimized file copying & compression for large files on Google Cloud Storage.

TLDR:

import gs_fastcopy
import numpy as np

with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

Provides file-like interfaces for:

  • Parallel, XML multipart uploads to Cloud Storage.
  • Parallel, sliced downloads from Cloud Storage using gcloud storage.
  • Parallel (de)compression using pigz and unpigz if available, with fallback to standard gzip and gunzip (see the sketch below).

Together, these provided a ~70% improvement when uploading a 1.2 GB file, and a ~40% improvement when downloading the same.

[!Note]

This benchmark is being tested more rigorously; stay tuned.
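
The pigz fallback mentioned in the feature list might look roughly like this. This is a minimal sketch, not gs_fastcopy's actual internals; the helper name and paths are made up.

import shutil
import subprocess

def compress_file(src_path, dest_path):
    # Prefer pigz (parallel gzip) when it's on PATH; otherwise fall back to gzip.
    tool = "pigz" if shutil.which("pigz") else "gzip"
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        # Both tools read stdin and write gzip-compatible output to stdout with -c.
        subprocess.run([tool, "-c"], stdin=src, stdout=dest, check=True)

Decompression is symmetric: swap in unpigz or gunzip (or pass -d).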

Examples

gs_fastcopy is easy to use for reading and writing files.

Without compression:

with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))
    
with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

With compression: note that we don't use savez_compressed; the .gz extension is what triggers gs_fastcopy's compression:

with gs_fastcopy.write('gs://my-bucket/my-file.npz.gz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))
    
with gs_fastcopy.read('gs://my-bucket/my-file.npz.gz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

Caveats & limitations

  • You need a filesystem.

    Because gs_fastcopy uses tools that work with files, it must be able to read/write files, in particular temporary files as set up by tempfile.TemporaryDirectory().

    This is surprisingly versatile: even "very" serverless environments like Cloud Functions present an in-memory file system.

  • You need the gcloud SDK on your path.

    Or, at least the gcloud storage component of the SDK.

    gs_fastcopy uses gcloud to download files.

    Issue #2 considers falling back to Python API downloads.

  • You need enough disk space for the compressed & uncompressed files, together.

    Because gs_fastcopy writes the (un)compressed file to disk while (de)compressing it, the file system needs to accommodate both files before the operation completes (see the preflight sketch after this list).
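
If you want to fail fast, a preflight check along these lines covers all three caveats. This is a hedged sketch; gs_fastcopy doesn't expose such a helper, and the function name and threshold are arbitrary.

import shutil
import tempfile

def preflight_check(required_free_bytes):
    # gcloud (or at least its storage component) must be on PATH for downloads.
    if shutil.which('gcloud') is None:
        raise RuntimeError('gcloud CLI not found on PATH')

    # Temporary files are staged via tempfile; make sure that filesystem
    # has room for the compressed and uncompressed copies together.
    with tempfile.TemporaryDirectory() as tmp_dir:
        free = shutil.disk_usage(tmp_dir).free
        if free < required_free_bytes:
            raise RuntimeError(f'only {free} bytes free under {tmp_dir}')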

Why gs_fastcopy

APIs for Google Cloud Storage (GCS) typically present file-like interfaces which read/write data sequentially. For example: open a stream, then write bytes to it until done. Data is streamed between cloud storage and memory. It's easy to use stream-based compression like gzip along the way.

Libraries like smart_open add yet more convenience, providing a unified interface for reading/writing local files and several cloud providers, with transparent compression for .gz files. Quite delightful!
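
For contrast, the conventional streaming approach looks roughly like this (a sketch using smart_open with its GCS extra installed; the bucket and object names are placeholders):

from smart_open import open  # pip install smart_open[gcs]

# Sequential, single-threaded upload; the .gz suffix enables
# smart_open's transparent gzip compression along the way.
with open('gs://my-bucket/my-file.txt.gz', 'wb') as f:
    f.write(b'hello from a single stream\n')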

Unfortunately, these approaches are single-threaded. We noticed that transfer throughput for files sized in the hundreds of MBs was lower than expected. @lynnlangit pointed me toward the composite upload feature in gcloud storage cp. A "few" hours later, gs_fastcopy came to be.

Why both gcloud and XML multi-part

I'm glad you asked! I initially implemented this just with gcloud's composite uploads. But the documentation gave a few warnings about composite uploads.

[!Warning]

Parallel composite uploads involve deleting temporary objects shortly after upload. Keep in mind the following:

  • Because other storage classes are subject to early deletion fees, you should always use Standard storage for temporary objects. Once the final object is composed, you can change its storage class.
  • You should not use parallel composite uploads when uploading to a bucket that has a retention policy, because the temporary objects can't be deleted until they meet the retention period.
  • If the bucket you upload to has default object holds enabled, you must release the hold from each temporary object before you can delete it.

Basically, composite uploads are stitched together from ordinary API operations, whereas XML multi-part is a dedicated API that understands that the chunk files on GCS are special.

On the other hand, the XML multi-part API does require some permissions. (We may need to fall back to gcloud in that case!)

On top of being "weird", composite uploads are actually slower. I found this wonderful benchmarking by Christopher Madden: High throughput file transfers with Google Cloud Storage (GCS). TLDR: gcloud sliced downloads outperform the Python API, but for writes the XML multi-part API is best (by far, if many cores are available).
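
To make the comparison concrete, here is roughly what the two fast paths look like when driven by hand. This is a sketch, not gs_fastcopy's internals: it assumes google-cloud-storage's transfer_manager module for the XML multi-part upload, and the bucket, object, and local paths are placeholders.

import subprocess
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket('my-bucket').blob('my-file.npz.gz')

# Upload: XML multi-part API, splitting the local file into chunks
# that worker processes upload in parallel.
transfer_manager.upload_chunks_concurrently(
    '/tmp/my-file.npz.gz', blob, max_workers=8
)

# Download: let `gcloud storage cp` perform a parallel, sliced download.
subprocess.run(
    ['gcloud', 'storage', 'cp',
     'gs://my-bucket/my-file.npz.gz', '/tmp/my-file.npz.gz'],
    check=True,
)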

