File cache for files retrieved from the cloud
Introduction
filecache is a Python module that abstracts away the location where files used or
generated by a program are stored. Files can be on the local file system, in Google Cloud
Storage, on Amazon Web Services S3, or on a webserver. When files to be read are on the
local file system, they are simply accessed in-place. Otherwise, they are downloaded from
the remote source to a local temporary directory. When files to be written are on the
local file system, they are simply written in-place. Otherwise, they are written to a
local temporary directory and then uploaded to the remote location (it is not possible to
upload to a webserver). When a cache is no longer needed, it is deleted from the local
disk.
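The rules above can be summarized in a small sketch. This is not the filecache implementation; `describe_location` and `LocationInfo` are hypothetical names introduced here only to illustrate how a path or URL maps onto the behavior described in the text, assuming the storage type is recognized from the URL scheme.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class LocationInfo(NamedTuple):
    kind: str             # Human-readable description of the storage type
    needs_download: bool  # True if reads go through a local temporary copy
    writable: bool        # True if the text above allows writing there


def describe_location(location: str) -> LocationInfo:
    """Classify a path or URL by its scheme (hypothetical helper)."""
    scheme = urlparse(location).scheme.lower()
    if scheme == 'gs':
        return LocationInfo('Google Cloud Storage', True, True)
    if scheme == 's3':
        return LocationInfo('Amazon Web Services S3', True, True)
    if scheme in ('http', 'https'):
        # Webservers can be read from but not uploaded to
        return LocationInfo('webserver', True, False)
    # Anything else is treated as a local path and accessed in place
    return LocationInfo('local file system', False, True)


for loc in ('/data/input.bin',
            'gs://rms-filecache-tests/subdir1',
            's3://rms-filecache-tests/subdir1/subdir2a',
            'https://pds-rings.seti.org/holdings'):
    print(loc, describe_location(loc))
```

Keeping the classification in one place makes the asymmetry visible: every remote source needs a local temporary copy, but only the cloud buckets accept uploads.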
filecache is a product of the PDS Ring-Moon Systems Node.
Installation
The filecache module is available via the rms-filecache package on PyPI and can be
installed with:
pip install rms-filecache
Getting Started
The top-level file organization is provided by the FileCache class. A FileCache
instance is used to specify a particular sharing policy and lifetime. For example,
a cache could be private to the current process and group a set of files that all have the
same basic purpose. Once these files have been (downloaded and) read, they are deleted as
a group. Another cache could be shared among all processes on the current machine and
group a set of files that are needed by multiple processes, thus allowing them to be
downloaded from a remote source only one time, saving time and bandwidth.
A FileCache contains one or more FileCachePrefix instances that each define access to
a local or remote source/destination for files. For example, one instance could be
used to access the local filesystem, while another could be used to access a particular
AWS S3 bucket.
Usage examples:
from filecache import FileCache

with FileCache() as fc:  # Context manager
    # Use GS by specifying the bucket name and one directory level
    pfx1 = fc.new_prefix('gs://rms-filecache-tests/subdir1')
    # Use S3 by specifying the bucket name and two directory levels
    pfx2 = fc.new_prefix('s3://rms-filecache-tests/subdir1/subdir2a')
    # Access GS using a directory + filename (since only one directory level
    # was specified by the prefix)
    with pfx1.open('subdir2a/binary1.bin', 'rb') as fp:
        bin1 = fp.read()
    # Access S3 using a filename only (since two directory levels were already
    # specified by the prefix)
    with pfx2.open('binary1.bin', 'rb') as fp:
        bin2 = fp.read()
    assert bin1 == bin2
# Cache automatically deleted here

# Same as above example but not using context managers for FileCache
fc = FileCache()
pfx1 = fc.new_prefix('gs://rms-filecache-tests/subdir1')
pfx2 = fc.new_prefix('s3://rms-filecache-tests/subdir1/subdir2a')
path1 = pfx1.retrieve('subdir2a/binary1.bin')
with open(path1, 'rb') as fp:
    bin1 = fp.read()
path2 = pfx2.retrieve('binary1.bin')
with open(path2, 'rb') as fp:
    bin2 = fp.read()
fc.clean_up()  # Cache manually deleted here
assert bin1 == bin2

# Write a file to a bucket and read it back
with FileCache() as fc:
    pfx = fc.new_prefix('gs://my-writable-bucket')
    with pfx.open('output.txt', 'w') as fp:
        fp.write('A')
# The cache will be deleted here so the file will have to be downloaded
with FileCache() as fc:
    pfx = fc.new_prefix('gs://my-writable-bucket')
    with pfx.open('output.txt', 'r') as fp:
        print(fp.read())
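The first two examples rely on a prefix and a relative path combining into one full location. A minimal sketch of that relationship follows; `full_location` is a hypothetical helper written for illustration, and the simple slash-joining it performs is an assumption about how the pieces fit together, not a description of filecache internals.

```python
def full_location(prefix: str, relative_path: str) -> str:
    """Join a prefix and a relative path with exactly one slash (hypothetical)."""
    return prefix.rstrip('/') + '/' + relative_path.lstrip('/')


# One directory level in the prefix, the rest in the relative path
loc1 = full_location('gs://rms-filecache-tests/subdir1', 'subdir2a/binary1.bin')
# Two directory levels in the prefix, only the filename in the relative path
loc2 = full_location('s3://rms-filecache-tests/subdir1/subdir2a', 'binary1.bin')

print(loc1)
print(loc2)

# Apart from the scheme, both refer to the same bucket and object path
assert loc1.split('://', 1)[1] == loc2.split('://', 1)[1]
```

How many directory levels go into the prefix is a convenience choice: a deeper prefix shortens every later path, while a shallower one lets a single prefix reach more of the bucket.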
A benefit of the abstraction is that different environments can access the same files in
different ways without needing to change the program code. For example, consider a program
that needs to access the file COISS_2xxx/COISS_2001/voldesc.cat from the NASA PDS
archives. This file might be stored on the local disk in the user's home directory in a
subdirectory called pds3-holdings. Or, if the user does not have a local copy, it is
accessible from a webserver at
https://pds-rings.seti.org/holdings/volumes/COISS_2xxx/COISS_2001/voldesc.cat.
Finally, it could be accessible from Google Cloud Storage from the rms-node-holdings
bucket at
gs://rms-node-holdings/pds3-holdings/volumes/COISS_2xxx/COISS_2001/voldesc.cat.
Before running the program, an environment variable could be set to one of these values:
$ export PDS3_HOLDINGS_DIR="$HOME/pds3-holdings"
$ export PDS3_HOLDINGS_DIR="https://pds-rings.seti.org/holdings"
$ export PDS3_HOLDINGS_DIR="gs://rms-node-holdings/pds3-holdings"
Then the program could be written as:
from filecache import FileCache
import os

with FileCache() as fc:
    pfx = fc.new_prefix(os.getenv('PDS3_HOLDINGS_DIR'))
    with pfx.open('volumes/COISS_2xxx/COISS_2001/voldesc.cat', 'r') as fp:
        contents = fp.read()
# Cache automatically deleted here
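The program above assumes that PDS3_HOLDINGS_DIR is always set. One possible way to make it more forgiving is sketched below. The `holdings_root` helper is hypothetical, and both the choice of the webserver as the fallback and the explicit expansion of a leading `~` (which a shell does not expand inside double quotes) are assumptions made for this example.

```python
import os

# Fallback used when the variable is unset (an assumption for this sketch)
DEFAULT_ROOT = 'https://pds-rings.seti.org/holdings'


def holdings_root(default: str = DEFAULT_ROOT) -> str:
    """Return the holdings root from the environment (hypothetical helper)."""
    root = os.environ.get('PDS3_HOLDINGS_DIR', default)
    if '://' not in root:
        # Local path: expand a leading ~ to the user's home directory
        root = os.path.expanduser(root)
    return root


print(holdings_root())
```

The returned string could then be passed to `fc.new_prefix()` in place of the bare `os.getenv()` call.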
If the program will be run several times in a row, or if multiple copies will run
simultaneously, marking the cache as shared lets all of the processes use the same local
copy, so each file is downloaded only once no matter how many times the program runs:
from filecache import FileCache
import os

with FileCache(shared=True) as fc:
    pfx = fc.new_prefix(os.getenv('PDS3_HOLDINGS_DIR'))
    with pfx.open('volumes/COISS_2xxx/COISS_2001/voldesc.cat', 'r') as fp:
        contents = fp.read()
# Cache not deleted here; must be deleted manually using fc.clean_up(final=True)
# If not deleted manually, the shared cache will persist until the temporary
# directory is purged by the operating system (which may be never)
Details of each class are available in the module documentation.
Contributing
Information on contributing to this package can be found in the Contributing Guide.
Licensing
This code is licensed under the Apache License v2.0.