# gs_fastcopy (python)

Optimized file copying & compression for large files on Google Cloud Storage.
TL;DR:

```python
import gs_fastcopy
import numpy as np

with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']
```
Provides file-like interfaces for:

- Parallel, XML multipart uploads to Cloud Storage.
- Parallel, sliced downloads from Cloud Storage using `gcloud storage`.
- Parallel (de)compression using `pigz` and `unpigz` if available (with fallback to standard `gzip` and `gunzip`).
Together, these yielded a ~70% improvement uploading a 1.2 GB file, and a ~40% improvement downloading the same.
> [!NOTE]
> This benchmark is being tested more rigorously; stay tuned.
## Examples

`gs_fastcopy` is easy to use for reading and writing files.
Without compression:
with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
np.savez(f, a=np.zeros(12), b=np.ones(23))
with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
npz = np.load(f)
a = npz['a']
b = npz['b']
With compression (note that we don't use `savez_compressed`):

```python
with gs_fastcopy.write('gs://my-bucket/my-file.npz.gz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz.gz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']
```
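The `.gz` suffix is what triggers (de)compression, preferring `pigz` over stdlib `gzip` when it's installed. A minimal sketch of that fallback logic (the function name and signature here are illustrative, not `gs_fastcopy`'s actual API):

```python
import gzip
import shutil
import subprocess

def compress_file(path: str) -> str:
    """Compress `path` to `path + '.gz'`, using pigz when available.

    Illustrative only: mirrors the described pigz-with-gzip-fallback
    behavior, not the library's real internals.
    """
    out_path = path + ".gz"
    if shutil.which("pigz"):
        # pigz compresses on all cores; -c writes gzip data to stdout.
        with open(out_path, "wb") as out:
            subprocess.run(["pigz", "-c", path], stdout=out, check=True)
    else:
        # Single-threaded fallback using the stdlib gzip module.
        with open(path, "rb") as src, gzip.open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path
```

Either branch produces standard gzip output, so readers don't care which tool did the work.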
## Caveats & limitations

- **You need a filesystem.** Because `gs_fastcopy` uses tools that work with files, it must be able to read/write files, in particular temporary files as set up by `tempfile.TemporaryDirectory()`. This is surprisingly versatile: even "very" serverless environments like Cloud Functions present an in-memory file system.

- **You need the `gcloud` SDK on your path.** Or, at least the `gcloud storage` component of the SDK. `gs_fastcopy` uses `gcloud` to download files. #2 considers falling back to Python API downloads.

- **You need enough disk space for the compressed & uncompressed files, together.** Because `gs_fastcopy` writes the (un)compressed file to disk while (de)compressing it, the file system needs to accommodate both files before the operation completes.
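To make the last caveat concrete, a preflight check before decompressing might look like this. The helper and the 5x expansion guess are hypothetical, not part of `gs_fastcopy`:

```python
import os
import shutil

def has_room_to_decompress(gz_path: str, tmp_dir: str, est_ratio: float = 5.0) -> bool:
    """Return True if tmp_dir likely has space for the compressed file
    plus an estimated uncompressed copy, side by side.

    est_ratio is a guessed expansion factor; the real ratio depends
    entirely on the data being decompressed.
    """
    compressed_size = os.path.getsize(gz_path)
    needed = compressed_size + int(compressed_size * est_ratio)
    return shutil.disk_usage(tmp_dir).free >= needed
```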
## Why gs_fastcopy

APIs for Google Storage (GS) typically present file-like interfaces which read/write data sequentially. For example: open up a stream, then write bytes to it until done. Data is streamed between cloud storage and memory. It's easy to use stream-based compression like `gzip` along the way.
Libraries like `smart_open` add yet more convenience, providing a unified interface for reading/writing local files and several cloud providers, with transparent compression for `.gz` files. Quite delightful!
Unfortunately, these approaches are single-threaded. We noticed that transfer times for files sized many 100s of MBs were higher than expected. @lynnlangit pointed me toward the composite upload feature in `gcloud storage cp`. A "few" hours later, `gs_fastcopy` came to be.
## Why both `gcloud` and XML multi-part

I'm glad you asked! I initially implemented this just with `gcloud`'s composite uploads. But the documentation gave a few warnings about composite uploads:
> [!WARNING]
> Parallel composite uploads involve deleting temporary objects shortly after upload. Keep in mind the following:
>
> - Because other storage classes are subject to early deletion fees, you should always use Standard storage for temporary objects. Once the final object is composed, you can change its storage class.
> - You should not use parallel composite uploads when uploading to a bucket that has a retention policy, because the temporary objects can't be deleted until they meet the retention period.
> - If the bucket you upload to has default object holds enabled, you must release the hold from each temporary object before you can delete it.
Basically, composite uploads are assembled from standard API pieces, whereas XML multi-part is a dedicated operation that understands the chunk files on GCS are special. On the other hand, the XML multi-part API does require some permissions. (We may need to fall back to `gcloud` in that case!)
On top of being "weird", composite uploads are actually slower. I found this wonderful benchmarking by Christopher Madden: High throughput file transfers with Google Cloud Storage (GCS). TL;DR: `gcloud` sliced downloads outperform the Python API, but for writes the XML multi-part API is best (by far, if many cores are available).