Connect to read/write data to S3

Project description

PWBM_Cloud_Utils

Introduction

This Python module provides a convenient interface for handling input/output configurations, reading from different sources (local or cloud), and writing data to cloud storage (Amazon S3) or locally. It is designed to be flexible, supporting various data formats and compression options.
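
As a quick orientation, a typical round trip with the classes documented below looks like the following minimal sketch (the file path is illustrative):

from PWBM_Cloud_Utils import IOConfig, IOReader, IOWriter

# Create a config and the reader/writer built on top of it
config = IOConfig()
writer = IOWriter(config)
reader = IOReader(config)

# Write a small text file, then read it back (local or S3, depending on the config)
writer.write("examples/hello.json", '{"Hello": ["World"]}')
json_string = reader.read("examples/hello.json")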

Note that the package officially supports pyarrow as the engine for working with parquet files, since pandas has been adopting pyarrow for more efficient memory representation as of version 2.0. The code may still work with fastparquet, but fastparquet is not officially supported.
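
For example, reading a parquet file directly with pandas and the pyarrow engine looks like the sketch below (pyarrow must be installed; the path is illustrative). This matches the engine the package officially supports.

import pandas as pd

# pandas delegates parquet parsing to pyarrow when engine="pyarrow"
df = pd.read_parquet("path/to/data.parquet", engine="pyarrow")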

Installation

To use this module, ensure that you have the required dependencies installed. You can install them using the following command:

pip install PWBM_Cloud_Utils

or

pipenv install PWBM_Cloud_Utils

Configuring AWS Access

You need to set up AWS credentials to interact with cloud storage such as S3.

  • Follow the operating system-specific instructions on AWS's website to install the AWS CLI (Command Line Interface). On Windows, the easiest option is the Microsoft Software Installer (MSI) file linked on that page. You will most likely need to do this from your installation account (i.e. .\<PennKey>-install) and then log back in to your regular account.
  • Reach out to Yunye Jiang (yunyej@wharton.upenn.edu) to get an AWS access key ID and the associated AWS secret access key.
  • Run aws configure and enter that information. Use us-east-2 as the default AWS region; you can skip (press Enter) the default output format. The credentials are then stored under ~/.aws/credentials and managed by the AWS CLI. Please do NOT share this information with anyone else. A quick way to verify the setup is shown in the sketch below.
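
As a quick sanity check (not part of the official setup), you can confirm that the stored credentials are picked up by boto3, which this package uses for S3 interactions. This is a minimal sketch; the region shown matches the default recommended above.

import boto3

# boto3 reads the credentials written by `aws configure` (~/.aws/credentials).
# If this call succeeds, your AWS access is configured correctly.
sts = boto3.client("sts", region_name="us-east-2")
identity = sts.get_caller_identity()
print("Authenticated as:", identity["Arn"])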

Local Environment Setup

An environment file will be added automatically when running on the cloud, so the setup below is only relevant when running locally.

from PWBM_Cloud_Utils import IOConfig

# Create config
config = IOConfig()

You can also override all settings in the config object other than the AWS secrets. This may be helpful as you start to move files to AWS.

from PWBM_Cloud_Utils import IOConfig

# Default config

config = IOConfig(
    local_path = "//hpc3-fs.wharton.upenn.edu/PWBM", 
    region_name = None, 
    aws_model_bucket = None,
    cloud_data = None
)

# Example config with overrides

config = IOConfig(
    local_path = ".",
    region_name = "us-east-1", 
    aws_model_bucket = "some-bucket",
    cloud_data = True
)

PWBM_Cloud_Utils Module Instructions

To integrate the PWBM_Cloud_Utils module into your project, follow these steps:

Step 1: Create main.py

Your project should have a main.py file located in the root directory. This main.py file will be executed when running on AWS.

Step 2: Import CloudUtils Functions

Import several functions from CloudUtils for reading, writing, and loading parameters:

# Read and Write functions

from PWBM_Cloud_Utils import IOConfig
from PWBM_Cloud_Utils import IOReader
from PWBM_Cloud_Utils import IOWriter

# Load parameters from UI

from PWBM_Cloud_Utils import CloudUtils
from PWBM_Cloud_Utils import CloudMain

You can also import all of the above functions by importing PWBM_Cloud_Utils.

import PWBM_Cloud_Utils as utils

Step 3: Define Main Function

Define a main() function in your main.py file to handle different execution environments (cloud or local):

import io
import json

import pandas as pd

import PWBM_Cloud_Utils as utils

def main():
    # Create config from your secrets credentials
    
    config = utils.IOConfig()
    
    # Note when cloud_data=False, the local_path used for io will default to HPC drive "//hpc3-fs.wharton.upenn.edu/PWBM"
    # if you want to set a different local_path you can do that as follows
    # config = utils.IOConfig(local_path = "some_other_folder/data")

    
    # parse arguments from command line
    
    args = utils.CloudUtils.parse_args()

    if args.policy_id is not None:
        
        # Cloud version code

        cloud_main = utils.CloudMain(run_id=args.run_id, policy_id=args.policy_id)

        # Load data from the database
        
        NAME = cloud_main.Name
        OUTPUT_PATH = cloud_main.Output_Path # path to use when writing output
        RUNTIME_OPTIONS = cloud_main.Input_Config # includes "stacking_order", a list of policy_id in batch run
        POLICY_FILES = cloud_main.Policy_Files # gives you a list of dictionaries that contain file data

        # make list of policy files into a dictionary with full file name as key
        # Note: you don't need to do this, but it makes the demo more readable
        
        file_dict = {}
        for f in POLICY_FILES:
            file_dict[f"{f['name']}.{f['file_type']}"] = f

        # how to load a json parameter file into a dictionary
        
        json_dict = json.loads(file_dict["runtime1.json"]['data'])

        # how to load a csv parameter file into a pandas Dataframe
        
        csv_obj = io.StringIO(file_dict["parameters.csv"]['data'])
        csv_df = pd.read_csv(csv_obj)

        # how to access csv cells directly
        
        csv_list = []
        csv_rows = file_dict["parameters.csv"]['data'].split("\r\n")
        for row in csv_rows:
            csv_list.append([])
            items = row.split(",")
            for item in items:
                csv_list[len(csv_list) - 1].append(item)

        # alternatively, if you would like all the parameter files written to a local folder, 
        # you can call cloud_main.write_parameter_files(destination_path)
        
        cloud_main.write_parameter_files("local/path")
    else:
        # Local version code

        # output path will not be automatically generated, so you should specify it when running locally. 
        # the path should be relative to the local_path set in config which defaults to the 
        # HPC drive "//hpc3-fs.wharton.upenn.edu/PWBM" (or the bucket if CloudData=TRUE in .env). 
        # this means if the full output path was "//hpc3-fs.wharton.upenn.edu/PWBM/Model/interfaces/2024-01-01 test", 
        # OUTPUT_PATH would be "Model/interfaces/2024-01-01 test"
        
        OUTPUT_PATH = ""

    # The rest of your model code goes here.

if __name__ == "__main__":
    main()

Step 4: Reading Data

The IOReader class allows you to read data from either cloud storage (Amazon S3) or a local file, depending on the configuration. This is how you would read in output produced by other components.

import io
import json

import pandas as pd

from PWBM_Cloud_Utils import IOConfig
from PWBM_Cloud_Utils import IOReader

# Create an IOConfig instance

config = IOConfig()

# Note when cloud_data=False, the local_path used for io will default to HPC drive "//hpc3-fs.wharton.upenn.edu/PWBM"
# if you want to set a different local_path you can do that as follows
# config = IOConfig(local_path = "some_other_folder/data")
# See above for more config options.


# Create an IOReader instance with config
reader = IOReader(config)

# read contents of file at specified path as a string
# Note: this will only work with text files like .csv or .json
# Note: By default bucket_name="" which means the bucket associated with your model (aka in IOConfig) will be used. local_path and cloud_data also default to the settings in IOConfig.
# Note: If the errors arg is set to "warn", a warning message is printed and None is returned. If set to "ignore", no message is printed and None is returned. Any other string causes errors to be raised. The default is "error", so errors will be raised.

json_string = reader.read(
    "path/to/file/json file.json", # Required
    decompress=False, 
    bucket_name="", 
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# how to load a json string into a dictionary

json_dict = json.loads(json_string)

# if file is compressed with gzip (will end in .gz), use decompress argument

json_string = reader.read("path/to/file/json file.json.gz", decompress=True)

# by default, reader uses the bucket associated with your model (aka in IOConfig) when reading from the cloud, 
# but you can read from other buckets with the bucket_name argument

json_string = reader.read("path/to/file/json file.json", bucket_name="another-model.pwbm-data")

# by default, reader uses HPC drive "//hpc3-fs.wharton.upenn.edu/PWBM" when cloud_data=False, 
# but you can read from local locations with the local_path argument
# the example below would read from "./path/to/file/json file.json" 
# rather than "//hpc3-fs.wharton.upenn.edu/PWBM/path/to/file/json file.json"

json_string = reader.read("path/to/file/json file.json", local_path=".")

# by default, reader's cloud_data is set by the IOConfig used in the IOReader constructor, 
# but you can also override the config cloud_data at the read level.
# The example below would use cloud_data=True regardless of whether you are running locally or on the cloud.

json_string = reader.read("path/to/file/json file.json", cloud_data=True)

# by default, the read will throw errors if something goes wrong for whatever reason, 
# but you can suppress these errors by setting errors to either "warn" or "ignore"

json_string = reader.read("path/to/file/json file.json", errors="warn")
json_string = reader.read("path/to/file/json file.json", errors="ignore")

# You can also define the path using the S3 URI. The bucket and path will be automatically parsed.
# So, in the example below, path="path/to/file/json file.json" and bucket_name="some-bucket"

json_string = reader.read("s3://some-bucket/path/to/file/json file.json")

# read contents of file at specified path as a bytes string
# Note: this will work with any file type.
# Note: decompress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as reader.read()

image_bytes = reader.read_bytes(
    "path/to/file/image.jpeg", # Required
    decompress=False,
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# read contents of the csv at specified path as a list of strings
# Note: decompress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as reader.read()

csv_row_list = reader.read_csv(
    "path/to/file/csv file.csv", # Required
    decompress=False, 
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# read pickle at specified path and unpickle.
# Note: the class(es) of the object(s) stored in the pickle file must be importable, otherwise this will error.
# Note: decompress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as reader.read()

pickle_obj = reader.read_pickle(
    "path/to/file/pickle file.pkl", # Required
    decompress=False, 
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# read file at specified path as a pandas Dataframe
# Note: this will only work with csv, pickle, and parquet files. for other file types, see read_bytes example below.
# Note: decompress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as reader.read()
# Note: pandas_args is a dict of args that will be added as named args on the pandas function

csv_df = reader.read_df(
    "path/to/file/csv file.csv", # Required
    decompress=False, 
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error",
    pandas_args={}
)

pkl_df = reader.read_df(
    "path/to/file/pickled df.pkl", # Required
    decompress=False, 
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error",
    pandas_args={}
)

parquet_df = reader.read_df(
    "path/to/file/parquet df.parquet", # Required
    decompress=False, 
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error",
    pandas_args={}
)

parquet_df_2 = reader.read_df(
    "path/to/file/pqt df.pqt", # Required
    decompress=False, 
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error",
    pandas_args={}
)

# to read a df from a different file type, use read_bytes and io.BytesIO. 
# this strategy will work with any file type for which pandas has a read function
# Note: may require installing optional dependencies
# Note: if this strategy does not work, you can use reader.read_file which 
# will copy the file to a local location where it can be read in with pandas as you would normally.

excel_bytes = reader.read_bytes("path/to/file/excel file.xlsx")

excel_df = pd.read_excel(io.BytesIO(excel_bytes))

# copy the file from the specified src path to specified dest path.
# Note: dest must always be local. only src impacted by cloud_data.
# Note: Use "." if located in the root directory
# Note: this will work with any file type.
# Note: decompress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as reader.read()

success = reader.read_file(
    "src/path", # Required
    "dest/path", # Required
    "some file.txt", # Required
    decompress=False, 
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# copy contents of directory (folder) at src path to specified dest path
# Note: dest must always be local. only src impacted by cloud_data.
# Note: Use "." if located in the root directory
# Note: bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as reader.read()

success = reader.read_directory(
    "src/path", # Required
    "dest/path", # Required
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# read zipped directory (aka archive) at specified src path, unpack it, and copy it to the dest path.
# Note: the file you are unpacking must have a file extension matching your selected archive format, 
# but do not include the extension when specifying folder_archive name.
# Note: contents will be put in a folder with the same name as the archive. 
# Meaning files in cloud/path/folder_archive.zip will be copied to local/path/folder_archive
# Note: format_archive is the format of archived directory. 
# Possible values are: "zip", "tar", "gztar", "bztar", and "xztar". By default, "zip".
# Note: dest must always be local. only src impacted by cloud_data.
# Note: Use "." if located in the root directory
# Note: bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as reader.read()

success = reader.read_zip_directory(
    "src/path", # Required
    "dest/path", # Required
    "folder_archive", # Required
    format_archive="zip", 
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# check if file or folder exists at specified path.
# Note: Use "." if checking in the root directory/bucket
# Note: bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as reader.read()

exists = reader.exists(
    "path/to/file/json file.json", # Required
    is_folder=False,
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

exists = reader.exists(
    "path/to/folder", # Required
    is_folder=True,
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# list all files in the directory at the specified path, including those in subfolders.
# Note: only files included in returned list.
# Note: only results that match regex pattern will be included.
# Note: Use "." if checking in the root directory/bucket
# Note: bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as reader.read()

file_list = reader.list_directory(
    "path/to/folder", # Required
    search_regex="",
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

Step 5: Writing Data

The IOWriter class enables you to write data to cloud storage (Amazon S3) or a local file, depending on the configuration. This is how you would write output to be read by other components.

You can use IOWriter to write to any bucket, but if you are writing output, make sure to get Output_Path from CloudMain.

import io
import os

import pandas as pd

from PWBM_Cloud_Utils import IOConfig
from PWBM_Cloud_Utils import IOWriter
from PWBM_Cloud_Utils import CloudUtils
from PWBM_Cloud_Utils import CloudMain

# parse arguments from command line

args = CloudUtils.parse_args()

if args.policy_id is not None:
    
    # Cloud version code

    cloud_main = CloudMain(run_id=args.run_id, policy_id=args.policy_id)

    OUTPUT_PATH = cloud_main.Output_Path # path to use when writing output

else:
    # Local version code

    # output path will not be automatically generated, so you should specify it when running locally. 
    # the path should be relative to the local_path set in config which 
    # defaults to the HPC drive "//hpc3-fs.wharton.upenn.edu/PWBM" (or the bucket if CloudData=TRUE in .env). 
    # this means if the full output path was "//hpc3-fs.wharton.upenn.edu/PWBM/Model/interfaces/2024-01-01 test", 
    # OUTPUT_PATH would be "Model/interfaces/2024-01-01 test"
    
    OUTPUT_PATH = ""

# Create an IOConfig instance

config = IOConfig()

# Note when cloud_data=False, the local_path used for io will default to HPC drive "//hpc3-fs.wharton.upenn.edu/PWBM"
# if you want to set a different local_path you can do that as follows
# config = IOConfig(local_path = "some_other_folder/data")
# See above for more config options.


# Create an IOWriter instance with config

writer = IOWriter(config)

# write string contents to file at specified path
# Note: this will only work with text files like .csv or .json
# Note: By default bucket_name="" which means the bucket associated with your model (aka in IOConfig) will be used. local_path and cloud_data also default to the settings in IOConfig.
# Note: If the errors arg is set to "warn", a warning message is printed and False is returned. If set to "ignore", no message is printed and False is returned. Any other string causes errors to be raised. The default is "error", so errors will be raised.

json_string = '{"Hello":["World"]}'

success = writer.write(
    os.path.join(OUTPUT_PATH, "path/to/file/json file.json"), # Required
    json_string, # Required
    compress=False, 
    bucket_name = "", 
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# if you would like to compress the file with gzip (will end in .gz), use the compress argument

success = writer.write(os.path.join(OUTPUT_PATH, "path/to/file/json file.json"), json_string, compress=True)

# by default, writer uses the bucket associated with your model (aka in IOConfig), 
# but you can write to other buckets with the bucket_name argument

success = writer.write(os.path.join(OUTPUT_PATH, "path/to/file/json file.json"), json_string, bucket_name="another-model.pwbm-data")

# by default, writer uses HPC drive "//hpc3-fs.wharton.upenn.edu/PWBM" when cloud_data=False, 
# but you can write to local locations with the local_path argument
# the example below would write to "./path/to/file/json file.json" 
# rather than "//hpc3-fs.wharton.upenn.edu/PWBM/path/to/file/json file.json"

success = writer.write(os.path.join(OUTPUT_PATH, "path/to/file/json file.json"), json_string, local_path=".")

# by default, writer's cloud_data is set by the IOConfig used in the IOWriter constructor, 
# but you can also override the config cloud_data at the write level.
# The example below would use cloud_data=True regardless of whether you are running locally or on the cloud.

success = writer.write(os.path.join(OUTPUT_PATH, "path/to/file/json file.json"), json_string, cloud_data=True)

# by default, the write will throw errors if something goes wrong for whatever reason, 
# but you can suppress these errors by setting errors to either "warn" or "ignore"

success = writer.write(os.path.join(OUTPUT_PATH, "path/to/file/json file.json"), json_string, errors="warn")
success = writer.write(os.path.join(OUTPUT_PATH, "path/to/file/json file.json"), json_string, errors="ignore")

# You can also define the path using the S3 URI. The bucket and path will be automatically parsed.
# So, in the example below, path="path/to/file/json file.json" and bucket_name="some-bucket"

success = writer.write("s3://some-bucket/path/to/file/json file.json", json_string)

# write bytes string contents to file at specified path
# Note: this will work with any file type.
# Note: compress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as writer.write()

json_bytes = b'{"Hello":["World"]}'

success = writer.write_bytes(
    os.path.join(OUTPUT_PATH, "path/to/file/json file.json"), # Required
    json_bytes, # Required
    compress=False, 
    bucket_name = "", 
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# write list of row strings to the csv at specified path
# Note: compress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as writer.write()

csv_row_list = ["h1,h2,h3", "1,2,3", "4,5,6"]

success = writer.write_csv(
    os.path.join(OUTPUT_PATH, "path/to/file/csv file.csv"), # Required
    csv_row_list, # Required
    compress=False, 
    bucket_name = "", 
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# write obj to the pickle at specified path
# Note: compress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as writer.write()

pickle_obj = "any obj"

success = writer.write_pickle(
    os.path.join(OUTPUT_PATH, "path/to/file/pickle file.pkl"), # Required
    pickle_obj, # Required
    compress=False, 
    bucket_name = "", 
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# write pandas Dataframe to file at specified path
# Note: this will only work with csv, pickle, and parquet files. for other file types, see write_bytes example below.
# Note: compress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as writer.write()
# Note: pandas_args is a dict of args that will be added as named args on the pandas function

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

success = writer.write_df(
    os.path.join(OUTPUT_PATH, "path/to/file/csv df.csv"), # Required
    df, # Required
    compress=False, 
    bucket_name="", 
    local_path="", 
    cloud_data=None, 
    errors="error",
    pandas_args={}
)

success = writer.write_df(
    os.path.join(OUTPUT_PATH, "path/to/file/pickled df.pkl"), # Required
    df, # Required
    compress=False, 
    bucket_name="", 
    local_path="", 
    cloud_data=None, 
    errors="error",
    pandas_args={}
)

success = writer.write_df(
    os.path.join(OUTPUT_PATH, "path/to/file/parquet df.parquet"), # Required
    df, # Required
    compress=False, 
    bucket_name="", 
    local_path="", 
    cloud_data=None, 
    errors="error",
    pandas_args={}
)

success = writer.write_df(
    os.path.join(OUTPUT_PATH, "path/to/file/pqt df.pqt"), # Required
    df, # Required
    compress=False, 
    bucket_name="", 
    local_path="", 
    cloud_data=None, 
    errors="error",
    pandas_args={}
)

# to write a df to a different file type, use write_bytes and io.BytesIO. 
# this strategy will work with any file type for which pandas has a to_format function
# Note: may require installing optional dependencies
# Note: if this strategy does not work, you can use writer.write_file which will copy
# the file from a local location, so you can write file with pandas as you would normally.

excel_bytes = io.BytesIO()
df.to_excel(excel_bytes)
excel_bytes = excel_bytes.getvalue()

success = writer.write_bytes(os.path.join(OUTPUT_PATH, "path/to/file/excel file.xlsx"), excel_bytes)

# copy the file from the specified src path to specified dest path.
# Note: this will work with any file type.
# Note: src must always be local. only dest impacted by cloud_data.
# Note: Use "." if located in the root directory
# Note: compress, bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as writer.write()

success = writer.write_file(
    os.path.join(OUTPUT_PATH, "dest/path"), # Required
    "src/path", # Required
    "some file.txt", # Required
    compress=False, 
    bucket_name="", 
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# copy contents of directory (folder) at src path to specified dest path
# Note: src must always be local. only dest impacted by cloud_data.
# Note: Use "." if located in the root directory
# Note: bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as writer.write()

success = writer.write_directory(
    os.path.join(OUTPUT_PATH, "dest/path"), # Required
    "src/path", # Required
    bucket_name="", 
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# archive (aka zip) specified directory at the src path and copy to the dest path
# Note: src must always be local. only dest impacted by cloud_data.
# Note: the archive will have the same name as the folder. 
# Meaning files in local/path/folder_archive will be copied to cloud/path/folder_archive.zip
# Note: format_archive is the format of archived directory. 
# Possible values are: "zip", "tar", "gztar", "bztar", and "xztar". By default, "zip".
# Note: Use "." if located in the root directory
# Note: bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as writer.write()

success = writer.write_zip_directory(
    os.path.join(OUTPUT_PATH, "dest/path"), # Required
    "src/path", # Required
    "folder_archive", # Required
    format_archive="zip", 
    bucket_name="", 
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# check if file or folder exists at specified path.
# Note: Use "." if checking in the root directory/bucket
# Note: bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as writer.write()

exists = writer.exists(
    os.path.join(OUTPUT_PATH, "path/to/file/json file.json"), # Required
    is_folder=False,
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

exists = writer.exists(
    os.path.join(OUTPUT_PATH, "path/to/folder"), # Required
    is_folder=True,
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

# list all files in the directory at the specified path, including those in subfolders.
# Note: only files included in returned list.
# Note: only results that match regex pattern will be included.
# Note: Use "." if checking in the root directory/bucket
# Note: bucket_name, local_path, cloud_data, errors, 
# and S3 URI path option work the same as writer.write()

file_list = writer.list_directory(
    os.path.join(OUTPUT_PATH, "path/to/folder"), # Required
    search_regex="",
    bucket_name="",
    local_path="", 
    cloud_data=None, 
    errors="error"
)

Step 6: Caching Data Between Runs

You can cache data between batch runs using reader.read_in_cache and writer.write_out_cache.

Please note that because batch runs are executed in parallel on the cloud, runs will not necessarily have access to the cache output of other runs in the same batch. To ensure the cache is available, we recommend triggering a run list with a single policy (typically baseline), waiting for it to complete, and then kicking off any runs that should use that run's cache.

Also, please note that reading, and especially writing out, a large cache can take a long time. If your project typically carries a large cache, we recommend calling writer.write_out_cache as infrequently as possible (for example, only for the baseline run).

If running locally with cloud_data=False in IOConfig, reader.read_in_cache and writer.write_out_cache do nothing. If running locally with cloud_data=True, we recommend disabling reader.read_in_cache and writer.write_out_cache, since your local version of the code likely does not match the cloud version.

Finally, the stored cache will be cleared every time you deploy your model to AWS.

from PWBM_Cloud_Utils import IOConfig
from PWBM_Cloud_Utils import IOReader
from PWBM_Cloud_Utils import IOWriter

# Create an IOConfig instance

config = IOConfig()

# Note when cloud_data=False, the local_path used for io will default to HPC drive "//hpc3-fs.wharton.upenn.edu/PWBM"
# if you want to set a different local_path you can do that as follows
# config = IOConfig(local_path = "some_other_folder/data")
# See above for more config options.

reader = IOReader(config)
writer = IOWriter(config)

# read in cache from previous runs
# Usage: reader.read_in_cache(cache_folder_path, cache_folder_name)
# so the following would put the cache files in local_folder/.cache 
# if cache located in root use "." as cache_folder_path
# Note: caches are cleared after model is redeployed
# Note: the same model can have multiple caches but they must have unique names.

reader.read_in_cache("local_folder", ".cache", cloud_data=None)

# write out cache to use in subsequent runs
# Usage: writer.write_out_cache(cache_folder_path, cache_folder_name)
# so the following would use local_folder/.cache as the folder to cache
# if cache located in root use "." as cache_folder_path
# Note: caches are cleared after model is redeployed
# Note: the same model can have multiple caches but they must have unique names.

writer.write_out_cache("local_folder", ".cache", cloud_data=None)

Notes

Ensure that your environment file (.env) contains the necessary variables, such as Region_Name, AWS_ACCESS_KEY_ID, and AWS_ACCESS_KEY_SECRET. The module uses the boto3 library for Amazon S3 interactions.
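
For reference, a minimal local .env might look like the sketch below. The variable names are the ones referenced in this document (including the CloudData flag mentioned earlier); the values are placeholders, and you should never commit real credentials.

# .env (placeholder values only)
Region_Name=us-east-2
AWS_ACCESS_KEY_ID=<your access key id>
AWS_ACCESS_KEY_SECRET=<your secret access key>
CloudData=FALSE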
