A Virtualitics S3 Utility Library with Local File System Mirror.

Virt-S3 🪣

A Virtualitics utility package to handle file I/O with Object Storage Systems like AWS S3 and Minio.

With versatility in mind, virt-s3 was designed as a relatively lightweight package that can be used either independently or in conjunction with the larger Virtualitics AI platform. The virt-s3 module includes two primary submodules, s3 and fs. Each one implements every API function of the virt-s3 module for its target system: S3 and S3-like object stores for s3, and local file systems for fs.
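To make the "one API, two backends" idea concrete, here is a minimal sketch of how a flag such as LOCAL_FS could select between an object-store backend and a local file system mirror. This is not virt-s3's implementation: the class names, the select_backend helper, and the in-memory stand-in for S3 are all hypothetical, and no network calls are made.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Mapping, Protocol


class Backend(Protocol):
    """Hypothetical interface shared by both backends."""

    def upload_data(self, data: bytes, key: str) -> str: ...

    def get_file(self, key: str) -> bytes: ...


@dataclass
class FakeObjectStore:
    """Stand-in for an S3-like backend; keeps objects in a dict, no network."""

    objects: dict[str, bytes] = field(default_factory=dict)

    def upload_data(self, data: bytes, key: str) -> str:
        self.objects[key] = bytes(data)
        return key

    def get_file(self, key: str) -> bytes:
        return self.objects[key]


@dataclass
class LocalDirStore:
    """Stand-in for a local file system mirror rooted at `root`."""

    root: Path

    def upload_data(self, data: bytes, key: str) -> str:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(bytes(data))
        return key

    def get_file(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def select_backend(env: Mapping[str, str], root: Path) -> Backend:
    """Pick a backend from a LOCAL_FS-style flag ('1' = local, otherwise object store)."""
    if env.get("LOCAL_FS", "0") == "1":
        return LocalDirStore(root)
    return FakeObjectStore()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for flag in ("0", "1"):
            backend = select_backend({"LOCAL_FS": flag}, Path(tmp))
            key = backend.upload_data(b"a,b\n1,2\n", "fixture/data/data.csv")
            print(type(backend).__name__, backend.get_file(key))
```

The calling code stays the same for both flag values, which is the property the s3 and fs submodules are described as providing.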

We hope that you can use it, break it, and even help us improve it!

Table of Contents

  1. Prerequisites
  2. Example Usage
  3. Architecture
  4. Getting Started
  5. Documentation

Prerequisites

  • Requires python>=3.11
  • Local File System features currently support only POSIX-style / pathing (Linux, macOS, etc.)
    • Support for Windows \ pathing [Coming Soon]
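S3 keys already use / as a separator, which is why a POSIX-only mirror is a natural first step. The sketch below shows one way a bucket and key could be mapped onto a POSIX path. The root/bucket/key layout and the key_to_local_path helper are assumptions made for illustration, not the layout virt-s3 is confirmed to use.

```python
from pathlib import PurePosixPath


def key_to_local_path(root_dir: str, bucket: str, key: str) -> PurePosixPath:
    """Map an S3-style key to a POSIX path under root_dir/bucket (assumed layout)."""
    key_path = PurePosixPath(key)
    # Reject absolute keys and '..' segments so the result stays under the bucket dir.
    if key_path.is_absolute() or ".." in key_path.parts:
        raise ValueError(f"unsafe key: {key!r}")
    return PurePosixPath(root_dir) / bucket / key_path


if __name__ == "__main__":
    print(key_to_local_path("/data", "test-bucket", "fixture/data/data.csv"))
    # prints: /data/test-bucket/fixture/data/data.csv
```

PurePosixPath is used on purpose: it applies POSIX rules on any host OS, so the mapping behaves the same everywhere even though Windows \ pathing is not handled.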

Example Usage

Writing a File

import virt_s3
import pandas as pd
from io import BytesIO

# ENV variable `LOCAL_FS` selects the backend: '1' = local file system, '0' = S3
params = virt_s3.get_default_params()

# serialize a small DataFrame to CSV bytes in memory
df = pd.DataFrame([{'a': 1, 'b': 2}])
buffer = BytesIO()
df.to_csv(buffer, index=False)

# use context manager to manage session scope
with virt_s3.SessionManager(params=params) as session:
    virt_s3.create_bucket('test-bucket', params=params, client=session)
    path = "fixture/data/data.csv"
    # keep the returned key so the file can be read back later
    saved_key = virt_s3.upload_data(buffer.getbuffer(), path, params=params, client=session)

Reading a File

import virt_s3
import pandas as pd

# ENV variable `LOCAL_FS` selects the backend: '1' = local file system, '0' = S3
params = virt_s3.get_default_params()

# `saved_key` is the key returned by virt_s3.upload_data() in "Writing a File"
# use context manager to manage session scope
with virt_s3.SessionManager(params=params) as session:
    data = virt_s3.get_file(saved_key, bytes_io=True, params=params, client=session)
    df = pd.read_csv(data)
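If pandas is not installed (for example, when the dataframe extra is skipped), the same in-memory buffer pattern works with the standard library alone. The sketch below leaves out the virt_s3 calls so that it runs standalone. It assumes, as the examples above suggest, that upload_data accepts a bytes-like payload and that get_file(..., bytes_io=True) yields a file-like object. The helper names are hypothetical.

```python
import csv
from io import BytesIO, TextIOWrapper


def rows_to_csv_buffer(rows: list[dict]) -> BytesIO:
    """Serialize a non-empty list of dicts to UTF-8 CSV bytes held in memory."""
    buffer = BytesIO()
    # TextIOWrapper lets the text-based csv module write into a binary buffer.
    text = TextIOWrapper(buffer, encoding="utf-8", newline="")
    writer = csv.DictWriter(text, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    text.flush()
    text.detach()  # keep `buffer` open once the wrapper is discarded
    buffer.seek(0)
    return buffer


def csv_buffer_to_rows(buffer: BytesIO) -> list[dict]:
    """Parse CSV bytes from an in-memory buffer back into a list of dicts."""
    text = TextIOWrapper(buffer, encoding="utf-8", newline="")
    rows = list(csv.DictReader(text))
    text.detach()
    return rows


if __name__ == "__main__":
    buffer = rows_to_csv_buffer([{"a": 1, "b": 2}])
    payload = buffer.getbuffer()  # bytes-like view, the same shape passed to upload_data above
    print(len(payload), csv_buffer_to_rows(BytesIO(bytes(payload))))
```

Note that the csv module reads every value back as a string, so type conversion is left to the caller.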

Architecture

Virt-S3 can run on a local machine or inside a Docker container. It offers several ways to interact with Object Storage Systems like AWS S3 and Minio across different hosting environments, and it also supports local file system access, whether on the host machine or from within a Docker container.

This versatility along with its lightweight set of dependencies allows virt-s3 to be easily installed and used in various types of environments.

Getting Started

  1. Create a fresh virtual environment with python >= 3.11

  2. Install the necessary dependencies

Basic Install (No Extras)

$ pip install virt-s3

Install with Single Extra

$ pip install "virt-s3[s3]"

Install with Multiple Extras

$ pip install "virt-s3[s3,dataframe,image]"
  • The following extras are available:

    • s3 = installs dependencies required to interact with object stores like Minio/S3 (primarily relying on boto3)
    • dataframe = installs dependencies required for using numpy, pandas, and pyarrow dataframe/parquet operations
    • image = installs dependencies required to utilize image operations (e.g. get file as an image)
  • e.g. If you want to use virt_s3 but can't install pandas or pyarrow in a restricted environment, install virt_s3 without the dataframe extra. You won't be able to use virt_s3.extras.CSVFileValidator, virt_s3.extras.ParquetFileValidator, read_parquet_file_df, or write_parquet_file_df, but these are not core functions of the library (hence extras).
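Before calling an extras-only function, it can help to check which optional dependencies are importable in the current environment. The following is a minimal sketch using only the standard library. The mapping of extras to modules follows the list above for s3 and dataframe, while "PIL" for the image extra is a guess; the authoritative list lives in the package metadata.

```python
from importlib.util import find_spec

# Extra name -> importable modules it is described as covering.
# "PIL" for the image extra is an assumption, not taken from the package metadata.
EXTRA_MODULES = {
    "s3": ["boto3"],
    "dataframe": ["numpy", "pandas", "pyarrow"],
    "image": ["PIL"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the modules for `extra` that cannot be imported in this environment."""
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


if __name__ == "__main__":
    for extra in EXTRA_MODULES:
        missing = missing_modules(extra)
        status = "ok" if not missing else f"missing {', '.join(missing)}"
        print(f"{extra}: {status}")
```

find_spec only looks the module up and does not import it, so the check stays cheap even for heavy packages like pandas.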

  3. Make sure the following environment variables are set
#########################################
# Required Custom Environment Variables #
#########################################
LOCAL_FS_USER=<your username>
# use the local fs mirror or s3/minio: 1 = True, 0 = False
LOCAL_FS=0
LOCAL_FS_ROOT_DIR=</path/to/your/data/dir/>

########################################################
# Required Virtualitics Platform Environment Variables #
########################################################
# e.g. http://mock-s3:9000 or http://localhost:9000
S3_URL=<your s3/minio url>
S3_DEFAULT_BUCKET=<your bucket name>
AWS_SECRET_ACCESS_KEY=<your aws secret access key>
AWS_ACCESS_KEY_ID=<your aws access key id>
# e.g. us-east-1
AWS_REGION=<your aws region>
  • Note: S3_URL can be replaced with a localhost url (e.g. http://localhost:9000) if not being run within a docker container
  4. Run the above example usage
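A quick pre-flight check can catch a missing variable before the example fails at connection time. The sketch below validates the environment variables listed above. The check_env helper is hypothetical and not part of virt_s3, and it treats every listed variable as required regardless of which backend LOCAL_FS selects.

```python
import os
from typing import Mapping, Optional

# Variable names copied from the environment block above.
REQUIRED_VARS = [
    "LOCAL_FS_USER",
    "LOCAL_FS",
    "LOCAL_FS_ROOT_DIR",
    "S3_URL",
    "S3_DEFAULT_BUCKET",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_REGION",
]


def check_env(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return human-readable problems; an empty list means nothing is missing."""
    if env is None:
        env = os.environ
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    flag = env.get("LOCAL_FS")
    if flag and flag not in ("0", "1"):
        problems.append(f"LOCAL_FS should be '0' or '1', got {flag!r}")
    return problems


if __name__ == "__main__":
    for problem in check_env():
        print(problem)
```

Passing a plain dict instead of os.environ makes the check easy to exercise in tests without touching the real environment.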

Code Documentation

API                           Description
PredictConnectionStoreParams  Dataclass for Predict Connection Store Parameters
S3Params                      Dataclass for S3 Boto3 Connection Parameters
SessionManager                General Session Context Manager for virt_s3 repo
TransferConfig                boto3.s3.transfer.TransferConfig used to configure higher throughput upload/download functions
LocalFSParams                 Dataclass for Using Local File System for all S3 Calls
ImageFormatType               Enum class type for Image Format Types
get_default_params()          Function to get default parameters to use for all functions (default behavior is based off of ENV variables)
get_session_client()          Function to get session client based on passed in S3Params or LocalFSParams
create_bucket()               Function to create a bucket to read and write from
get_file_chunked()            Function to get a file using a chunking loop. This can be useful when trying to retrieve very large files
get_file()                    Function to retrieve specified file as in-memory object
get_image()                   Function to get image from either s3 or local file system
get_files_generator()         Generator function to quickly loop through reading a list of keys or file paths
get_files_batch()             Function to get list of file paths or key paths in batch
list_dirs()                   Function to list valid 'folders' within 'bucket'
get_valid_file_paths()        Function to get list of valid file paths or keys within particular directory of bucket
file_exists()                 Function to see if key or file path exists in bucket
upload_data()                 Function to upload data to S3 or local file system
delete_file()                 Function to delete a file from s3 or local file system
delete_files_by_dir()         Function to delete all files and subdirectories, etc. in a given folder
archive_zip_as_buffer()       Function to create a zip archive from dictionary of expected archive filepaths and data bytes
extract_compressed_file()     Function to extract zip file contents into bucket
format_bytes()                Function to take as input a number of bytes and return a formatted string for B, KB, MB, GB
read_parquet_file_df()        Convenience function to read parquet file as pandas DataFrame
write_parquet_file_df()       Convenience function to write pandas DataFrame to parquet file
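The table gives descriptions but no signatures, so the sketch below only illustrates the ideas behind two of the entries using the standard library: building a zip archive in memory from a dictionary of archive paths and bytes, and formatting a byte count as B, KB, MB, or GB. The function names, argument shapes, and the 1024-based units are assumptions for illustration, not the behavior of archive_zip_as_buffer() or format_bytes().

```python
import zipfile
from io import BytesIO


def zip_to_buffer(files: dict[str, bytes]) -> BytesIO:
    """Build a zip archive in memory from {archive path: file bytes}."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for arcname, data in files.items():
            archive.writestr(arcname, data)
    buffer.seek(0)  # rewind so the caller can read or upload from the start
    return buffer


def human_size(num_bytes: int) -> str:
    """Format a byte count as B, KB, MB, or GB using 1024-based units."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"


if __name__ == "__main__":
    archive = zip_to_buffer({"fixture/data/data.csv": b"a,b\n1,2\n"})
    print(human_size(len(archive.getvalue())))
```

Because the archive never touches disk, the resulting buffer can be handed to any function that accepts bytes-like data.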
