

gs_fastcopy (python)

Optimized file copying & compression for large files on Google Cloud Storage.

TLDR:

import gs_fastcopy
import numpy as np

with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

Provides file-like interfaces for:

  • Parallel, XML multipart uploads to Cloud Storage.
  • Parallel, sliced downloads from Cloud Storage using gcloud storage.
  • Parallel (de)compression using pigz and unpigz if available (with fallback to standard gzip and gunzip). A rough sketch of this plumbing follows the note below.

Together, these provided a ~70% improvement when uploading a 1.2 GB file, and a ~40% improvement when downloading the same.

[!Note]

This benchmark is being tested more rigorously; stay tuned.
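
The bullets above are the moving parts. As a rough sketch of how they fit together (illustrative only, not the library's actual internals), the gcloud-based download and the pigz fallback might look like this:

import shutil
import subprocess

def download(gs_uri, local_path):
    # 'gcloud storage cp' slices large objects and fetches the slices in parallel.
    subprocess.run(['gcloud', 'storage', 'cp', gs_uri, local_path], check=True)

def decompress(gz_path):
    # Prefer pigz's parallel decompressor; fall back to standard gunzip.
    tool = 'unpigz' if shutil.which('unpigz') else 'gunzip'
    subprocess.run([tool, gz_path], check=True)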

Examples

gs_fastcopy is easy to use for reading and writing files.

You can use it without compression:

with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))
    
with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

gs_fastcopy also handles gzip compression transparently, keyed off the .gz suffix. Note that we don't use numpy's savez_compressed: gs_fastcopy compresses the file itself (in parallel, when pigz is available):

with gs_fastcopy.write('gs://my-bucket/my-file.npz.gz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))
    
with gs_fastcopy.read('gs://my-bucket/my-file.npz.gz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

Caveats & limitations

  • You need a __main__ guard.

    Subprocesses spawned during parallel processing re-interpret the main script. This is bad if the main script then spawns its own subprocesses… (see the sketch after this list).

    See also gs-fastcopy-python#5 with a further note on freezing scripts into executables.

  • You need a filesystem.

    Because gs_fastcopy uses tools that work with files, it must be able to read/write a filesystem, in particular temporary files as set up by tempfile.TemporaryDirectory().

    This is surprisingly versatile: even "very" serverless environments like Cloud Functions present an in-memory file system.

  • You need the gcloud SDK on your path.

    Or, at least the gcloud storage component of the SDK.

    gs_fastcopy uses gcloud to download files.

    Issue #2 considers falling back to Python API downloads if the specialized tools aren't available.

  • You need enough disk space for the compressed & uncompressed files, together.

    Because gs_fastcopy writes the (un)compressed file to disk while (de)compressing it, the file system needs to accommodate both files while the operation is in progress.

    • Reads from Cloud: (1) fetch to temp file; (2) decompress if gzipped; (3) stream temp file to Python app via read(); (4) delete the temp file
    • Writes to Cloud: (1) app writes to temp file; (2) compress if gzipped; (3) upload temp file to Google Cloud; (4) delete the temp file
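
For the first caveat, here's a minimal sketch of the guard, reusing the example from the top (bucket and file names are just placeholders):

import gs_fastcopy
import numpy as np

def main():
    # Keep all gs_fastcopy calls inside main(), not at module level.
    with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
        np.savez(f, a=np.zeros(12), b=np.ones(23))

if __name__ == '__main__':
    # Worker subprocesses re-interpret this script; the guard keeps
    # them from re-running the upload themselves.
    main()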

Why gs_fastcopy

APIs for Google Storage (GS) typically present file-like interfaces which read/write data sequentially. For example: open a stream, then write bytes to it until done. Data is streamed between cloud storage and memory. It's easy to use stream-based compression like gzip along the way.

Libraries like smart_open add yet more convenience, providing a unified interface for reading/writing local files and several cloud providers, with transparent compression for .gz files. Quite delightful!

Unfortunately, these approaches are single-threaded. We noticed that transfer speed for files sized many 100s of MBs was lower than expected. @lynnlangit pointed me toward the composite upload feature in gcloud storage cp. A "few" hours later, gs_fastcopy came to be.

Why both gcloud and XML multi-part

I'm glad you asked! I initially implemented this just with gcloud's composite uploads. But the documentation gives a few warnings about them:

[!Warning]

Parallel composite uploads involve deleting temporary objects shortly after upload. Keep in mind the following:

  • Because other storage classes are subject to early deletion fees, you should always use Standard storage for temporary objects. Once the final object is composed, you can change its storage class.
  • You should not use parallel composite uploads when uploading to a bucket that has a retention policy, because the temporary objects can't be deleted until they meet the retention period.
  • If the bucket you upload to has default object holds enabled, you must release the hold from each temporary object before you can delete it.

Basically, composite uploads leverage independent API functions, whereas XML multi-part is a managed operation. The managed operation plays more nicely with other features like retention policies. On the other hand, because it's separate, the XML multi-part API needs additional permissions. (We may need to fall back to gcloud in that case!)

On top of being "weird" in these ways, composite uploads are actually slower. I found this wonderful benchmarking by Christopher Madden: High throughput file transfers with Google Cloud Storage (GCS). TLDR, gcloud sliced downloads outperform the Python API, but for writes the XML multi-part API is best. (By far, if many cores are available.)
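
For completeness, the XML multi-part route is also reachable from Python via google-cloud-storage's transfer_manager. A minimal sketch (illustrative only; not necessarily how gs_fastcopy drives it, and the bucket and paths are placeholders):

from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket('my-bucket').blob('my-file.npz.gz')

# Splits the local file into chunks, uploads them in parallel via the
# XML multipart API, then completes the multipart upload.
transfer_manager.upload_chunks_concurrently(
    '/tmp/my-file.npz.gz',  # local path, illustrative
    blob,
    max_workers=8,
)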

