
File cache for files retrieved from the cloud


Introduction

filecache is a Python module that abstracts away the location where files used or generated by a program are stored. Files can be on the local file system, in Google Cloud Storage, on Amazon Web Services S3, or on a webserver. When files to be read are on the local file system, they are simply accessed in-place. Otherwise, they are downloaded from the remote source to a local temporary directory. When files to be written are on the local file system, they are simply written in-place. Otherwise, they are written to a local temporary directory and then uploaded to the remote location (it is not possible to upload to a webserver). When a cache is no longer needed, it is deleted from the local disk.
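The read-side behavior described above can be illustrated with a minimal sketch that uses only the standard library. It is not filecache's implementation: the names `resolve_for_read` and `REMOTE_PREFIXES`, the `download` callback, and the bucket URL are hypothetical, and a fake download function stands in for a real network transfer.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical set of prefixes treated as remote in this sketch.
REMOTE_PREFIXES = ('gs://', 's3://', 'http://', 'https://')


def resolve_for_read(source: str, cache_dir: Path,
                     download: Callable[[str, Path], None]) -> Path:
    """Return a local path from which `source` can be read."""
    if not source.startswith(REMOTE_PREFIXES):
        # Local files are accessed in place; nothing is copied.
        return Path(source)
    # Remote files are mirrored under the cache directory.
    scheme, rest = source.split('://', 1)
    local_path = cache_dir / scheme / rest
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        download(source, local_path)  # Happens only on the first request
    return local_path


calls = []


def fake_download(url: str, dest: Path) -> None:
    # Stand-in for a real network transfer.
    calls.append(url)
    dest.write_bytes(b'data')


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    url = 'gs://some-bucket/dir/file.bin'
    first = resolve_for_read(url, cache, fake_download)
    second = resolve_for_read(url, cache, fake_download)
    first_bytes = first.read_bytes()
    local = resolve_for_read('/etc/hosts', cache, fake_download)
# The temporary directory, and the cached copy inside it, is gone here.
```

The second request for the same URL finds the cached copy and does not invoke the download callback again, while the local path is returned unchanged.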

filecache is a product of the PDS Ring-Moon Systems Node.

Installation

The filecache module is available via the rms-filecache package on PyPI and can be installed with:

pip install rms-filecache

Getting Started

The top-level file organization is provided by the FileCache class. A FileCache instance is used to specify a particular sharing policy and lifetime. For example, a cache could be private to the current process and group a set of files that all have the same basic purpose. Once these files have been (downloaded and) read, they are deleted as a group. Another cache could be shared among all processes on the current machine and group a set of files that are needed by multiple processes, thus allowing them to be downloaded from a remote source only one time, saving time and bandwidth.

A FileCache can be instantiated either directly or as a context manager. When instantiated directly, the programmer is responsible for calling FileCache.delete_cache to delete the cache when finished (although a non-shared cache is deleted automatically on program exit). When instantiated as a context manager, a non-shared cache is deleted on exit from the context. See the class documentation for full details.

Usage examples:

from filecache import FileCache
# Create a cache with a unique name that will be deleted on exit
with FileCache(None) as fc:  # Use as context manager
    # Also use open() as a context manager
    with fc.open('gs://rms-filecache-tests/subdir1/subdir2a/binary1.bin', 'rb',
                 anonymous=True) as fp:
        bin1 = fp.read()
    with fc.open('s3://rms-filecache-tests/subdir1/subdir2a/binary1.bin', 'rb',
                 anonymous=True) as fp:
        bin2 = fp.read()
    assert bin1 == bin2
# Cache automatically deleted here

fc = FileCache(None)  # Use without context manager
# Also retrieve file without using open context manager
path1 = fc.retrieve('gs://rms-filecache-tests/subdir1/subdir2a/binary1.bin',
                    anonymous=True)
with open(path1, 'rb') as fp:
    bin1 = fp.read()
path2 = fc.retrieve('s3://rms-filecache-tests/subdir1/subdir2a/binary1.bin',
                    anonymous=True)
with open(path2, 'rb') as fp:
    bin2 = fp.read()
fc.delete_cache()  # Cache manually deleted here
assert bin1 == bin2

# Write a file to a bucket and read it back
with FileCache(None) as fc:
    with fc.open('gs://my-writable-bucket/output.txt', 'w') as fp:
        fp.write('A')
# The cache will be deleted here so the file will have to be downloaded
with FileCache(None) as fc:
    with fc.open('gs://my-writable-bucket/output.txt', 'r') as fp:
        print(fp.read())

The FCPath class is a reimplementation of the Python Path class to support remote access using an associated FileCache. Like Path, an FCPath instance can contain any part of a URI, but only an absolute URI can be used when actually accessing the file specified by the FCPath. In addition, an FCPath can encapsulate various arguments such as anonymous and time_out so that they do not need to be specified to each access method. Thus, using this class can simplify the use of a FileCache by allowing the user to operate on paths using the simpler syntax provided by Path, and to not specify various other parameters at each method call site. If an FCPath instance is created without an explicitly associated FileCache, then the default FileCache() is used, which specifies a shared cache named "global" that will persist after the program exits.

Compare this example to the one above:

from filecache import FileCache, FCPath
# Create a cache with a unique name that will be deleted on exit
with FileCache(None) as fc:  # Use as context manager
    # Use GS by specifying the bucket name and one directory level
    p1 = fc.new_path('gs://rms-filecache-tests/subdir1', anonymous=True)
    # Use S3 by specifying the bucket name and two directory levels
    # Alternative creation method
    p2 = FCPath('s3://rms-filecache-tests/subdir1/subdir2a', filecache=fc,
                anonymous=True)
    # Access GS using a directory + filename (since only one directory level
    # was specified by the FCPath)
    # The additional directory and filename are specified as an argument to open()
    # Also use open() as a context manager
    with p1.open('subdir2a/binary1.bin', 'rb') as fp:
        bin1 = fp.read()
    # Access S3 using a filename only (since two directory levels were already
    # specified by the FCPath)
    # The additional filename is specified by using the / operator to create a new
    # FCPath instance; anonymous=True is inherited
    with (p2 / 'binary1.bin').open(mode='rb') as fp:
        bin2 = fp.read()
    assert bin1 == bin2
# Cache automatically deleted here

A benefit of the abstraction is that different environments can access the same files in different ways without needing to change the program code. For example, consider a program that needs to access the file COISS_2xxx/COISS_2001/voldesc.cat from the NASA PDS archives. This file might be stored on the local disk in the user's home directory in a subdirectory called pds3-holdings. Or if the user does not have a local copy, it is accessible from a webserver at https://pds-rings.seti.org/holdings/volumes/COISS_2xxx/COISS_2001/voldesc.cat. Finally, it could be accessible from Google Cloud Storage from the rms-node-holdings bucket at gs://rms-node-holdings/pds3-holdings/volumes/COISS_2xxx/COISS_2001/voldesc.cat. Before running the program, an environment variable could be set to one of these values:

$ export PDS3_HOLDINGS_SRC="$HOME/pds3-holdings"
$ export PDS3_HOLDINGS_SRC="https://pds-rings.seti.org/holdings"
$ export PDS3_HOLDINGS_SRC="gs://rms-node-holdings/pds3-holdings"

Then the program could be written as:

from filecache import FileCache
import os
with FileCache(None) as fc:
    p = fc.new_path(os.getenv('PDS3_HOLDINGS_SRC'))
    voldesc_path = p / 'volumes/COISS_2xxx/COISS_2001/voldesc.cat'
    contents = voldesc_path.read_text()
# Cache automatically deleted here
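To see why the program above needs no changes across environments, it may help to spell out the location that each setting of PDS3_HOLDINGS_SRC resolves to. The sketch below uses plain string handling rather than FCPath; the helper `join_source` and the local directory `/home/user/pds3-holdings` are hypothetical and serve only as illustration.

```python
RELATIVE_PATH = 'volumes/COISS_2xxx/COISS_2001/voldesc.cat'


def join_source(prefix: str, relative: str) -> str:
    """Join a source prefix and a relative path with exactly one slash."""
    return prefix.rstrip('/') + '/' + relative.lstrip('/')


# One entry per possible value of PDS3_HOLDINGS_SRC; the local directory
# is a hypothetical example of a user's home directory.
prefixes = [
    '/home/user/pds3-holdings',
    'https://pds-rings.seti.org/holdings',
    'gs://rms-node-holdings/pds3-holdings',
]
locations = [join_source(p, RELATIVE_PATH) for p in prefixes]
for location in locations:
    print(location)
```

Only the prefix differs between environments; the relative path used by the program stays the same.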

If the program were going to be run multiple times in a row, or if multiple copies were going to be run simultaneously, a shared cache would allow all of the processes to share the same copy, requiring only a single download no matter how many times the program was run. A shared cache is indicated by giving the cache a name (or no argument, which defaults to "global"). FCPath also defaults to using the global cache if no FileCache is specified. This results in the simplest form of the program:

from filecache import FCPath
import os
p = FCPath(os.getenv('PDS3_HOLDINGS_SRC'))
voldesc_path = p / 'volumes/COISS_2xxx/COISS_2001/voldesc.cat'
contents = voldesc_path.read_text()
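Because os.getenv returns None when a variable is not set, a program that relies on PDS3_HOLDINGS_SRC may want to fail with a clear message before constructing any paths. The following is a minimal sketch of such a guard; the helper name `holdings_source` is hypothetical and not part of filecache.

```python
import os


def holdings_source(var: str = 'PDS3_HOLDINGS_SRC') -> str:
    """Return the value of the environment variable or raise a clear error."""
    value = os.environ.get(var)
    if not value:
        raise RuntimeError(f'Environment variable {var} is not set')
    return value


# Demonstration only: set the variable inside the process, then read it back.
os.environ['PDS3_HOLDINGS_SRC'] = 'gs://rms-node-holdings/pds3-holdings'
source = holdings_source()
```

The returned string could then be passed to FCPath or FileCache.new_path in place of the direct os.getenv call.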

Finally, there are four classes that allow direct access to the four possible storage locations without invoking any caching behavior: FileCacheSourceLocal, FileCacheSourceHTTP, FileCacheSourceGS, and FileCacheSourceS3:

from filecache import FileCacheSourceGS
src = FileCacheSourceGS('gs://rms-filecache-tests', anonymous=True)
src.retrieve('subdir1/subdir2a/binary1.bin', 'local_file.bin')

Details of each class are available in the module documentation.

Contributing

Information on contributing to this package can be found in the Contributing Guide.


Licensing

This code is licensed under the Apache License v2.0.
