CloudFiles: Fast access to cloud storage and local FS.
from cloudfiles import CloudFiles

cf = CloudFiles('gs://bucket/') # Google Cloud Storage
cf = CloudFiles('s3://bucket/') # Amazon S3
cf = CloudFiles('file:///home/coolguy/') # local filesystem
cf = CloudFiles('https://website.com/coolguy/') # arbitrary web server

# more options
cf = CloudFiles(
    's3://bucket/',
    num_threads=20,
    progress=True, # display progress bar
    secrets=credential_json, # provide your own secrets
    green=False, # whether to use green threads
)
cf.get('filename')
cf.get([ 'filename_1', 'filename_2' ]) # threaded automatically
cf.put('filename', content)
cf.put_json('filename', content)
cf.puts([{
    'path': 'filename',
    'content': content,
}, ... ]) # automatically threaded
cf.put_jsons(...) # same as puts
cf.list()
cf.delete('filename')
cf.delete([ 'filename_1', 'filename_2', ... ]) # threaded
cf.exists('filename')
cf.exists([ 'filename_1', ... ]) # threaded
CloudFiles is a pure Python client for accessing cloud storage or the local file system in a threaded fashion without hassle.
Highlights
- Fast file access due to transparent threading.
- Supports Google Cloud Storage, Amazon S3, local filesystems, and arbitrary web servers with a uniform file access interface, making hybrid or multi-cloud configurations easy.
- Robust to flaky network connections. Retries using an exponential random window to avoid network collisions when working in a large cluster.
- Supports gzip and brotli* compression.
- Supports HTTP Range reads.
- Supports green threads, which are important for achieving maximum performance on virtualized servers.
* Brotli is not supported on Google Cloud Storage.
Installation
pip install cloud-files
You may wish to install credentials under ~/.cloudvolume/secrets. See the CloudVolume documentation for details. CloudFiles is descended from CloudVolume, and for now we'll leave the same configuration structure in place.
Documentation
Note that the "Cloud Costs" mentioned below are current as of June 2020 and are subject to change. As of this writing, S3 and Google use identical cost structures for these operations.
Constructor
# import gevent.monkey
# gevent.monkey.patch_all(thread=False)
from cloudfiles import CloudFiles
cf = CloudFiles(
    cloudpath, progress=False,
    green=False, secrets=None, num_threads=20
)
- cloudpath: The path to the bucket you are accessing, formatted as $PROTOCOL://BUCKET/PATH. Files will then be accessed relative to the path. The supported protocols are gs (GCS), s3 (AWS S3), file (local FS), and http/https.
- progress: Whether to display a progress bar when processing multiple items simultaneously.
- green: Use green threads. For this to work properly, you must uncomment the top two lines.
- secrets: Provide secrets dynamically rather than fetching them from the credentials directory $HOME/.cloudvolume/secrets (see the sketch after this list).
- num_threads: Number of simultaneous requests to make. Usually 20 per core is pretty close to optimal unless file sizes are extreme.
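As a sketch of the secrets parameter, the snippet below passes credentials dynamically. The key names follow the AWS secret file convention and, like the placeholder values, are assumptions; check your own secrets file for the exact fields.

from cloudfiles import CloudFiles

# hypothetical credentials dict with placeholder values
s3_secrets = {
    'AWS_ACCESS_KEY_ID': '...',
    'AWS_SECRET_ACCESS_KEY': '...',
}
cf = CloudFiles('s3://bucket/', secrets=s3_secrets, num_threads=20)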
get / get_json
binary = cf.get('filename')
>> b'...'
binaries = cf.get(['filename1', 'filename2'])
>> [
  { 'path': 'filename1', 'content': b'...', 'byte_range': (None, None), 'error': None },
  { 'path': 'filename2', 'content': b'...', 'byte_range': (None, None), 'error': None },
]
binary = cf.get({ 'path': 'filename', 'start': 0, 'end': 1024 })
>> b'...' # represents byte range 0-1024 of filename
get supports several different styles of input. The simplest takes a scalar filename and returns the contents of that file. However, you can also specify lists of filenames, a byte range request, and lists of byte range requests. You can provide a generator or iterator as input as well.

When more than one file is provided at once, the download will be threaded using preemptive or cooperative (green) threads depending on the green setting. If progress is set to true, a progress bar will be displayed that counts down the number of files to download.

get_json is the same as get, but it parses the returned binary as UTF-8 encoded JSON and returns a dictionary.
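For illustration, here is a minimal sketch of the multi-file form with error checking. The chunk_ filenames are invented; the 'path' and 'error' fields mirror the example output above.

from cloudfiles import CloudFiles

cf = CloudFiles('gs://bucket/', progress=True)
# generators work as input too; the downloads are threaded automatically
results = cf.get(( f'chunk_{i}' for i in range(100) ))
failed = [ res['path'] for res in results if res['error'] is not None ]
if failed:
    print('failed to download:', failed)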
Cloud Cost: Usually about $0.40 per million requests.
put / puts / put_json / put_jsons
cf.put('filename', b'content')
cf.put_json('digits', [1,2,3,4,5])
cf.puts([{
    'path': 'filename',
    'content': b'...',
    'content_type': 'application/octet-stream',
    'compress': 'gzip',
    'compression_level': 6, # parameter for gzip or brotli compressor
    'cache_control': 'no-cache',
}])
cf.puts([ (path, content), (path, content) ], compress='gzip')
cf.put_jsons(...)
# Definition of put; put_json is identical
def put(
    self, path, content,
    content_type=None, compress=None,
    compression_level=None, cache_control=None
)

# Definition of puts; put_jsons is identical
def puts(
    self, files,
    content_type=None, compress=None,
    compression_level=None, cache_control=None
)
The PUT operation is the most complex operation because it's so configurable. Sometimes you want one file, sometimes many. Sometimes you want to configure each file individually, sometimes you want to standardize a bulk upload. Sometimes it's binary data, but oftentimes it's JSON. We therefore provide a simpler interface, put and put_json (singular), for uploading a single file, versus puts and put_jsons (plural) for uploading possibly many files.
In order to upload many files at once (which is much faster due to threading), you need to minimally provide the path and content for each file. This can be done either as a dict containing those fields or as a tuple (path, content). If dicts are used, the fields (if present) specified in the dict take precedence over the parameters of the function. You can mix tuples with dicts. The input to puts can be a scalar (a single dict or tuple) or an iterable such as a list, iterator, or generator. A sketch mixing the two forms appears below.
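As a hedged illustration (the file names are invented), this sketch bulk uploads with a function-level gzip default while one dict entry overrides it:

from cloudfiles import CloudFiles

cf = CloudFiles('s3://bucket/')
files = [
    ('image_1', b'...'), # (path, content) tuples use the function-level defaults
    ('image_2', b'...'),
    # dict fields take precedence over the function parameters, so this
    # file is stored uncompressed (assuming an explicit None means no compression)
    { 'path': 'labels.json', 'content': b'[]', 'compress': None },
]
cf.puts(files, compress='gzip')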
Cloud Cost: Usually about $5 per million files.
delete
cf.delete('filename')
cf.delete([ 'file1', 'file2', ... ])
This will issue a delete request for each file specified in a threaded fashion.
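One handy pattern, sketched below on an invented tmp/ prefix, pairs delete with list to clear out a prefix. This assumes delete accepts any iterable of paths, as get does.

from cloudfiles import CloudFiles

cf = CloudFiles('gs://bucket/')
cf.delete(cf.list(prefix='tmp/'))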
Cloud Cost: Usually free.
exists
cf.exists('filename')
>> True # or False
cf.exists([ 'file1', 'file2', ... ])
>> { 'file1': True, 'file2': False, ... }
Scalar input results in a simple boolean output while iterable input returns a dictionary of input paths mapped to whether they exist. In iterable mode, a progress bar may be displayed and threading is utilized to improve performance.
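For example, a minimal sketch that uses the dictionary output to find missing files before a bulk download:

from cloudfiles import CloudFiles

cf = CloudFiles('gs://bucket/')
status = cf.exists([ 'file1', 'file2', 'file3' ])
missing = [ path for path, found in status.items() if not found ]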
Cloud Cost: Usually about $0.40 per million requests.
list
cf.list()
cf.list(prefix="abc")
cf.list(prefix="abc", flat=True)
Recall that in object storage, directories do not really exist; file paths are really a key-value mapping. The list operator will list everything under the cloudpath given in the constructor. The prefix argument allows you to efficiently filter some of the results. If flat is specified, the results will be filtered to return only a single "level" of the "directory" even though directories are fake. The entire set of all subdirectories will still need to be fetched.
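Since listing proceeds as a series of paged requests, iterating over the results as they arrive keeps memory usage flat on large buckets. A minimal sketch, assuming list yields paths as they are fetched and using an invented build/ prefix:

from cloudfiles import CloudFiles

cf = CloudFiles('gs://bucket/')
for path in cf.list(prefix='build/'):
    print(path)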
Cloud Cost: Usually about $5 per million requests, but each request might list 1000 files. The list operation will continuously issue list requests until all files are listed.
Credits
CloudFiles is derived from the CloudVolume.Storage system.
Storage was initially created by William Silversmith and Ignacio Tartavull. It was maintained and improved by William Silversmith and includes improvements by Nico Kemnitz (extremely fast exists) and Ben Falk (brotli).