
gs_fastcopy (python)

Optimized file copying & compression for large files on Google Cloud Storage.

TLDR:

import gs_fastcopy
import numpy as np

with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

Provides file-like interfaces for:

  • Parallel, XML multipart uploads to Cloud Storage.
  • Parallel, sliced downloads from Cloud Storage using gcloud storage.
  • Parallel (de)compression using pigz and unpigz if available, with fallback to standard gzip and gunzip (see the sketch below).

Together, these provided a ~70% improvement when uploading a 1.2 GB file, and a ~40% improvement when downloading the same.

[!Note]

This benchmark is being tested more rigorously; stay tuned.
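
The pigz fallback mentioned in the feature list might look roughly like this. This is a minimal sketch, not gs_fastcopy's actual internals; the helper name and paths are made up.

import shutil
import subprocess

def compress_file(src_path, dest_path):
    # Prefer pigz (parallel gzip) when it's on PATH; otherwise fall back to gzip.
    tool = "pigz" if shutil.which("pigz") else "gzip"
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        # Both tools read stdin and write gzip-compatible output to stdout with -c.
        subprocess.run([tool, "-c"], stdin=src, stdout=dest, check=True)

Decompression is symmetric: swap in unpigz or gunzip (or pass -d).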

Examples

gs_fastcopy is easy to use for reading and writing files.

Without compression:

with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))
    
with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

With compression: note that we don't use savez_compressed; the .gz extension is what triggers gs_fastcopy's compression:

with gs_fastcopy.write('gs://my-bucket/my-file.npz.gz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))
    
with gs_fastcopy.read('gs://my-bucket/my-file.npz.gz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

Caveats & limitations

  • You need a filesystem.

    Because gs_fastcopy uses tools that work with files, it must be able to read/write files, in particular temporary files as set up by tempfile.TemporaryDirectory().

    This is surprisingly versatile: even "very" serverless environments like Cloud Functions present an in-memory file system.

  • You need the gcloud SDK on your path.

    Or, at least the gcloud storage component of the SDK.

    gs_fastcopy uses gcloud to download files.

    Issue #2 considers falling back to Python API downloads.

  • You need enough disk space for the compressed & uncompressed files, together.

    Because gs_fastcopy writes the (un)compressed file to disk while (de)compressing it, the file system needs to accommodate both files before the operation completes (see the preflight sketch after this list).
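
If you want to fail fast, a preflight check along these lines covers all three caveats. This is a hedged sketch; gs_fastcopy doesn't expose such a helper, and the function name and threshold are arbitrary.

import shutil
import tempfile

def preflight_check(required_free_bytes):
    # gcloud (or at least its storage component) must be on PATH for downloads.
    if shutil.which('gcloud') is None:
        raise RuntimeError('gcloud CLI not found on PATH')

    # Temporary files are staged via tempfile; make sure that filesystem
    # has room for the compressed and uncompressed copies together.
    with tempfile.TemporaryDirectory() as tmp_dir:
        free = shutil.disk_usage(tmp_dir).free
        if free < required_free_bytes:
            raise RuntimeError(f'only {free} bytes free under {tmp_dir}')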

Why gs_fastcopy

APIs for Google Cloud Storage (GCS) typically present file-like interfaces which read/write data sequentially. For example: open a stream, then write bytes to it until done. Data is streamed between cloud storage and memory. It's easy to use stream-based compression like gzip along the way.

Libraries like smart_open add yet more convenience, providing a unified interface for reading/writing local files and several cloud providers, with transparent compression for .gz files. Quite delightful!
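
For contrast, the conventional streaming approach looks roughly like this (a sketch using smart_open with its GCS extra installed; the bucket and object names are placeholders):

from smart_open import open  # pip install smart_open[gcs]

# Sequential, single-threaded upload; the .gz suffix enables
# smart_open's transparent gzip compression along the way.
with open('gs://my-bucket/my-file.txt.gz', 'wb') as f:
    f.write(b'hello from a single stream\n')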

Unfortunately, these approaches are single-threaded. We noticed that transfer throughput for files sized in the hundreds of MBs was lower than expected. @lynnlangit pointed me toward the composite upload feature in gcloud storage cp. A "few" hours later, gs_fastcopy came to be.

Why both gcloud and XML multi-part

I'm glad you asked! I initially implemented this just with gcloud's composite uploads. But the documentation gave a few warnings about composite uploads.

[!Warning]

Parallel composite uploads involve deleting temporary objects shortly after upload. Keep in mind the following:

  • Because other storage classes are subject to early deletion fees, you should always use Standard storage for temporary objects. Once the final object is composed, you can change its storage class.
  • You should not use parallel composite uploads when uploading to a bucket that has a retention policy, because the temporary objects can't be deleted until they meet the retention period.
  • If the bucket you upload to has default object holds enabled, you must release the hold from each temporary object before you can delete it.

Basically, composite uploads are stitched together from ordinary API operations, whereas XML multi-part is a dedicated API that understands that the chunk files on GCS are special.

On the other hand, the XML multi-part API does require some permissions. (We may need to fall back to gcloud in that case!)

On top of being "weird", composite uploads are actually slower. I found this wonderful benchmarking by Christopher Madden: High throughput file transfers with Google Cloud Storage (GCS). TLDR: gcloud sliced downloads outperform the Python API, but for writes the XML multi-part API is best (by far, if many cores are available).
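
To make the comparison concrete, here is roughly what the two fast paths look like when driven by hand. This is a sketch, not gs_fastcopy's internals: it assumes google-cloud-storage's transfer_manager module for the XML multi-part upload, and the bucket, object, and local paths are placeholders.

import subprocess
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket('my-bucket').blob('my-file.npz.gz')

# Upload: XML multi-part API, splitting the local file into chunks
# that worker processes upload in parallel.
transfer_manager.upload_chunks_concurrently(
    '/tmp/my-file.npz.gz', blob, max_workers=8
)

# Download: let `gcloud storage cp` perform a parallel, sliced download.
subprocess.run(
    ['gcloud', 'storage', 'cp',
     'gs://my-bucket/my-file.npz.gz', '/tmp/my-file.npz.gz'],
    check=True,
)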

