gs_fastcopy (python)
Optimized file copying & compression for large files on Google Cloud Storage.
TLDR:

```python
import gs_fastcopy
import numpy as np

with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']
```
Provides file-like interfaces for:

- Parallel, XML multipart uploads to Cloud Storage.
- Parallel, sliced downloads from Cloud Storage using `gcloud storage`.
- Parallel (de)compression using `pigz` and `unpigz` if available (with fallback to standard `gzip` and `gunzip`).
Together, these provided a ~70% improvement when uploading a 1.2 GB file, and ~40% when downloading the same.
> [!NOTE]
> This benchmark is being tested more rigorously; stay tuned.
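Under the hood, the read path is roughly: a sliced download via `gcloud storage cp`, then parallel decompression via `unpigz` when the object name ends in `.gz`. Below is a minimal sketch of that idea using `subprocess`; it is not the library's actual implementation, and the paths shown are illustrative placeholders.

```python
import os
import subprocess
import tempfile

def fetch_and_decompress(gs_uri: str) -> str:
    """Illustrative sketch only: assumes `gs_uri` ends in '.gz' and that
    `gcloud` and `unpigz` are on the PATH. Error handling and the
    gunzip fallback used by the real library are omitted."""
    tmp_dir = tempfile.mkdtemp()
    local_gz = os.path.join(tmp_dir, os.path.basename(gs_uri))

    # `gcloud storage cp` performs sliced (parallel) downloads for large objects.
    subprocess.run(['gcloud', 'storage', 'cp', gs_uri, local_gz], check=True)

    # Decompress with pigz's `unpigz`; like gunzip, it replaces the `.gz`
    # file with the decompressed file.
    subprocess.run(['unpigz', local_gz], check=True)

    return local_gz[:-len('.gz')]
```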
Examples

`gs_fastcopy` is easy to use for reading and writing files.
You can use it without compression:

```python
with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']
```
`gs_fastcopy` also handles gzip compression transparently. Note that we don't use numpy's `savez_compressed`:

```python
with gs_fastcopy.write('gs://my-bucket/my-file.npz.gz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz.gz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']
```
Caveats & limitations

- **You need a `__main__` guard.**
  Subprocesses spawned during parallel processing re-interpret the main script. This is bad if the main script then spawns its own subprocesses… (see the sketch after this list).
  See also gs-fastcopy-python#5 with a further note on freezing scripts into executables.
- **You need a filesystem.**
  Because `gs_fastcopy` uses tools that work with files, it must be able to read/write a filesystem, in particular temporary files as set up by `tempfile.TemporaryDirectory()` [python docs].
  This is surprisingly versatile: even "very" serverless environments like Cloud Functions present an in-memory file system.
- **You need the `gcloud` SDK on your path.**
  Or, at least the `gcloud storage` component of the SDK. `gs_fastcopy` uses `gcloud` to download files.
  Issue #2 considers falling back to Python API downloads if the specialized tools aren't available.
- **You need enough disk space for the compressed & uncompressed files, together.**
  Because `gs_fastcopy` writes the (un)compressed file to disk while (de)compressing it, the file system needs to accommodate both files while the operation is in progress.
  - Reads from Cloud: (1) fetch to temp file; (2) decompress if gzipped; (3) stream temp file to Python app via `read()`; (4) delete the temp file.
  - Writes to Cloud: (1) app writes to temp file; (2) compress if gzipped; (3) upload temp file to Google Cloud; (4) delete the temp file.
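To illustrate the first caveat, a script using `gs_fastcopy` should keep its work behind a `__main__` guard, since the worker processes re-import the main module. A minimal sketch (the bucket and file names are placeholders):

```python
import gs_fastcopy
import numpy as np

def main():
    # The upload runs in parallel worker processes; those workers re-import
    # this module, so the work must not run at import time.
    with gs_fastcopy.write('gs://my-bucket/my-file.npz.gz') as f:
        np.savez(f, a=np.zeros(12), b=np.ones(23))

if __name__ == '__main__':
    main()
```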
Why gs_fastcopy

APIs for Google Storage (GS) typically present `File`-like interfaces which read/write data sequentially. For example: open up a stream, then write bytes to it until done. Data is streamed between cloud storage and memory. It's easy to use stream-based compression like `gzip` along the way.

Libraries like `smart_open` add yet more convenience, providing a unified interface for reading/writing local files and several cloud providers, with transparent compression for `.gz` files. Quite delightful!
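For context, that conventional streaming approach looks roughly like this with `smart_open` (assuming the `smart_open[gcs]` extra is installed; the bucket and file names are placeholders):

```python
from smart_open import open as smart_open

# Write a gzip-compressed object by streaming bytes sequentially;
# compression is inferred from the `.gz` extension.
with smart_open('gs://my-bucket/my-file.txt.gz', 'w') as f:
    f.write('hello, world\n')

# Read it back; decompression is likewise transparent.
with smart_open('gs://my-bucket/my-file.txt.gz', 'r') as f:
    print(f.read())
```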
Unfortunately, these approaches are single-threaded. We noticed that transfer times for files weighing many hundreds of megabytes were longer than expected. @lynnlangit pointed me toward the composite upload feature in `gcloud storage cp`. A "few" hours later, `gs_fastcopy` came to be.
Why both gcloud and XML multi-part

I'm glad you asked! I initially implemented this just with `gcloud`'s composite uploads, but the documentation gave a few warnings about them:
> [!WARNING]
> Parallel composite uploads involve deleting temporary objects shortly after upload. Keep in mind the following:
>
> - Because other storage classes are subject to early deletion fees, you should always use Standard storage for temporary objects. Once the final object is composed, you can change its storage class.
> - You should not use parallel composite uploads when uploading to a bucket that has a retention policy, because the temporary objects can't be deleted until they meet the retention period.
> - If the bucket you upload to has default object holds enabled, you must release the hold from each temporary object before you can delete it.
Basically, composite uploads leverage independent API functions, whereas XML multi-part is a managed operation. The managed operation plays more nicely with other features like retention policies. On the other hand, because it's separate, the XML multi-part API needs additional permissions. (We may need to fall back to `gcloud` in that case!)

On top of being "weird" in these ways, composite uploads are actually slower. I found this wonderful benchmarking by Christopher Madden: High throughput file transfers with Google Cloud Storage (GCS). TL;DR: `gcloud` sliced downloads outperform the Python API, but for writes the XML multi-part API is best (by far, if many cores are available).
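For the curious, recent versions of the `google-cloud-storage` library expose both fast paths through its `transfer_manager` module. The sketch below assumes that module is available and uses placeholder bucket and file names; it is not how `gs_fastcopy` itself is implemented (which shells out to `gcloud storage` for downloads):

```python
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.npz.gz')

# Parallel upload via the XML multipart API: the local file is split into
# chunks uploaded by a pool of workers. Note: process-based workers also
# require a __main__ guard, as discussed in the caveats above.
transfer_manager.upload_chunks_concurrently('/tmp/my-file.npz.gz', blob, max_workers=8)

# Parallel ("sliced") download: byte ranges are fetched concurrently and
# written into a single local file.
transfer_manager.download_chunks_concurrently(blob, '/tmp/my-file.npz.gz', max_workers=8)
```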