A Python-to-S3 interface with added convenience features.
rivet
A user-friendly Python-to-S3 interface. Adds quality of life and convenience features around boto3, including the handling of reading and writing to files in proper formats. While there is nothing that you can do with rivet that you can't do with boto3, rivet's primary focus is ease-of-use. By handling lower-level operations such as client establishment and default argument specification behind the scenes, the cost of entry to interacting with cloud storage from within Python is lowered.
It also enforces good practice in S3 naming conventions.
Usage
rivet acts as an abstraction around the S3 functionality of Amazon's boto3 package.
Although boto3 is very powerful, the expansive functionality it boasts can be overwhelming
and often results in users sifting through a lot of documentation to find the subset of
functionality that they need. In order to make use of this package, you will need to have
the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY configured
for the buckets you wish to interact with.
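A missing credential variable tends to surface only as an authentication error at call time, so it can help to check for both variables up front. The following is a minimal sketch; the helper name `missing_aws_credentials` is hypothetical and is not part of rivet.

```python
import os

# Environment variables the text above says must be configured.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(environ=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; it is not part of rivet.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        print("Missing credentials: " + ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```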
General
- Because S3 allows for almost anything to be used as an S3 key, it can be very easy to lose track of what exactly you have saved in the cloud. A very important example of this is filetype: without a file extension at the end of the S3 key, it is entirely possible to lose track of what format a file is saved as. rivet enforces file extensions in the objects it reads and writes.
  - Currently supported formats are: CSV, JSON, Avro, Feather, Parquet, Pickle
  - Accessible in a Python session via rivet.supported_formats
- A default S3 bucket can be set up as an environment variable, removing the requirement to provide it to each function call. The name of this environment variable is RV_DEFAULT_S3_BUCKET.
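The two conventions above (a required file extension and an optional default bucket) can be sketched roughly as follows. This is an illustration under stated assumptions, not rivet's implementation: the helper names `check_extension` and `resolve_bucket` are hypothetical, and the extension set contains only the extensions that appear in this document's examples.

```python
import os

# Extensions that appear in this document's examples (an assumption for the
# sketch); the set rivet really accepts is exposed as rivet.supported_formats.
EXAMPLE_EXTENSIONS = frozenset({"csv", "pkl", "pq"})


def check_extension(key, allowed=EXAMPLE_EXTENSIONS):
    """Return the key's extension, or raise if it is missing or unknown."""
    name = key.rsplit("/", 1)[-1]
    if "." not in name:
        raise ValueError(f"S3 key {key!r} has no file extension")
    ext = name.rsplit(".", 1)[-1].lower()
    if ext not in allowed:
        raise ValueError(f"Unsupported file extension {ext!r} in {key!r}")
    return ext


def resolve_bucket(bucket=None, environ=None):
    """Use the explicit bucket if given, else fall back to RV_DEFAULT_S3_BUCKET."""
    env = os.environ if environ is None else environ
    resolved = bucket or env.get("RV_DEFAULT_S3_BUCKET")
    if not resolved:
        raise ValueError("No bucket given and RV_DEFAULT_S3_BUCKET is not set")
    return resolved
```

Passing `environ` explicitly keeps the sketch easy to try without touching the real environment.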
Reading
Reading in rivet only requires two things: a key, and a bucket.
import rivet as rv
df = rv.read('test_path/test_key.csv', 'test_bucket')
The file will be downloaded from S3 to a temporary file on your machine, and based on the file extension at the end of the S3 key, the proper file reading function will be used to read the object into the Python session.
Not every team follows good practice, however, so the read_badpractice
function allows reading files that do not have a file extension (or that
do not follow enforced key-writing practices). In addition to a key and
bucket, this function requires that a storage format is provided.
import rivet as rv
obj = rv.read_badpractice('test_path/bad_key', 'test_bucket', filetype='pkl')
Both the read and read_badpractice functions accept additional arguments
for the underlying file reading functions. So, if a user is familiar with
those functions, they can customize how files are read.
import rivet as rv
df = rv.read('test_path/test_key.csv', 'test_bucket', delimiter='|')
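The reading behavior described in this section (choosing a reader from the file extension, supplying an explicit filetype for keys without one, and forwarding extra keyword arguments) can be sketched locally without S3. Everything here is an assumption for illustration: `read_local` and the standard-library readers are stand-ins, and a temporary file plays the role of the downloaded object. It is not rivet's actual code.

```python
import csv
import json
import os
import pickle
import tempfile


def _read_csv(path, **kwargs):
    # Keyword arguments (for example delimiter='|') go straight to the reader.
    with open(path, newline="") as f:
        return list(csv.DictReader(f, **kwargs))


def _read_json(path, **kwargs):
    with open(path) as f:
        return json.load(f, **kwargs)


def _read_pickle(path, **kwargs):
    with open(path, "rb") as f:
        return pickle.load(f, **kwargs)


# Extension -> reader. Stand-ins for whatever readers rivet really uses.
READERS = {"csv": _read_csv, "json": _read_json, "pkl": _read_pickle}


def read_local(path, filetype=None, **kwargs):
    """Pick a reader from the extension, or from an explicit filetype."""
    ext = filetype or os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in READERS:
        raise ValueError(f"Cannot determine a reader for {path!r}")
    return READERS[ext](path, **kwargs)


# Usage: temporary files play the role of objects downloaded from S3.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "test_key.csv")
    with open(csv_path, "w", newline="") as f:
        f.write("col1|col2\n1|4\n2|5\n")
    rows = read_local(csv_path, delimiter="|")

    bad_path = os.path.join(tmp, "bad_key")  # no extension
    with open(bad_path, "wb") as f:
        pickle.dump({"a": 1}, f)
    obj = read_local(bad_path, filetype="pkl")

print(rows)
print(obj)
```

Keeping the dispatch in a dictionary means supporting another format only requires registering one more reader.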
Writing
Writing is handled almost identically to reading, with the additional
parameter of the object to be uploaded. write returns the full path to
the object written to S3, including bucket name, without the s3:// prefix.
import pandas as pd
import rivet as rv
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
rv.write(df, 'test_path/test_key.csv', 'test_bucket')
Similar to the read functionality, write determines which underlying write
function to use based on the file extension in the S3 key provided. It can
accept additional arguments to be passed to those functions, exactly like
in the reading functions. However, unlike the reading functions, there is
no 'bad practice' writing functionality. The rivet developers understand that
its users can't control the practices of other teams, but as soon as writing
begins, the package will ensure that best practice is being followed.
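As a small illustration of the return value described above, the sketch below builds a path that includes the bucket name but not the s3:// prefix. The exact string rivet returns is assumed from that description, and `written_path` is a hypothetical helper rather than a rivet function.

```python
def written_path(key, bucket):
    """Build 'bucket/key', the form the text says write returns (no s3://)."""
    return f"{bucket.strip('/')}/{key.lstrip('/')}"


path = written_path("test_path/test_key.csv", "test_bucket")
print(path)            # test_bucket/test_path/test_key.csv
print("s3://" + path)  # prepend the scheme when a full URI is needed
```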
Other operations
- Listing
rivet can list the files that are present at a given location in S3, with two options available for how to do so: include_prefix and recursive.
We will be using the following example S3 bucket structure:
test_bucket
|---- test_key_0.csv
|---- folder0/
|     |---- test_key_1.pq
|---- folder1/
|     |---- test_key_2.pkl
|     |---- subfolder0/
|           |---- test_key_3.pkl
|---- folder2/
|     |---- test_key_4.csv
- rv.list would behave as follows with default behavior:
  import rivet as rv
  rv.list(path='', bucket='test_bucket')
  Output: ['test_key_0.csv', 'folder0/', 'folder1/', 'folder2/']
  rv.list(path='folder1/', bucket='test_bucket')
  Output: ['test_key_2.pkl', 'subfolder0/']
- The include_prefix option will result in the full S3 key up to the current folder being included in the returned list of keys.
  import rivet as rv
  rv.list_objects(path='folder1/', bucket='test_bucket', include_prefix=True)
  Output: ['folder1/test_key_2.pkl', 'folder1/subfolder0/']
- The recursive option will result in objects stored in nested folders being returned as well.
  import rivet as rv
  rv.list(path='folder1', bucket='test_bucket', recursive=True)
  Output: ['test_key_2.pkl', 'subfolder0/test_key_3.pkl']
- include_prefix and recursive can be used simultaneously.
- Regular expression matching on keys can be performed with the matches parameter.
  - You can account for your key prefix:
    - In the path argument (highly encouraged for the above reasons): rv.list_objects(path='folder0/')
    - Hard-coded as part of the regular expression in your matches argument: rv.list_objects(matches='folder0/.*')
    - Or by accounting for it in the matching logic of your regular expression: rv.list_objects(matches='f.*der0/.*')
  - When you are using both path and matches parameters, however, there is one situation you need to be cautious of:
    - Hard-coding the path in path and using matches to match on anything that comes after the path works great: rv.list_objects(path='folder0/', matches='other_.*.csv')
    - Hard-coding the path in path and including the hard-coded path in matches works fine, but is discouraged for a number of reasons: rv.list_objects(path='folder0/', matches='folder0/other_.*.csv')
    - What will not work is hard-coding the path in path and dynamically matching it in matches: rv.list_objects(path='folder0/', matches='f.*der0/other_.*.csv')
      - This is because including the path in the regular expression interferes with the logic of the function. When you provide the hard-coded path both in path and in the beginning of matches, it can be detected and removed from the regular expression, but there is no definitive way to do this when you are matching on it.
  - So, in general, try to keep path and matches entirely separate if at all possible.
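The listing options and the path/matches interaction described above can be emulated on an in-memory list of keys. This is a sketch under assumptions, not rivet's implementation: `list_keys` and `match_keys` are hypothetical names, the regular expression is assumed to be applied as a full match to the part of the key after the path, and the `other_a.csv` key is invented for the matching example.

```python
import re

# Keys from the example bucket structure above.
KEYS = [
    "test_key_0.csv",
    "folder0/test_key_1.pq",
    "folder1/test_key_2.pkl",
    "folder1/subfolder0/test_key_3.pkl",
    "folder2/test_key_4.csv",
]


def list_keys(keys, path="", include_prefix=False, recursive=False):
    """Emulate the include_prefix and recursive options on a list of keys."""
    results = []
    for key in keys:
        if not key.startswith(path):
            continue
        rest = key[len(path):]
        if not recursive and "/" in rest:
            # Collapse nested objects into their top-level "folder".
            rest = rest.split("/", 1)[0] + "/"
        entry = path + rest if include_prefix else rest
        if entry not in results:
            results.append(entry)
    return results


def match_keys(keys, path="", matches=".*"):
    """Apply a regular expression to the part of each key after `path`."""
    if path and matches.startswith(path):
        # A literal copy of the path at the start of the pattern can be
        # detected and stripped; a pattern such as 'f.*der0/' cannot.
        matches = matches[len(path):]
    pattern = re.compile(matches)
    return [k for k in keys
            if k.startswith(path) and pattern.fullmatch(k[len(path):])]


print(list_keys(KEYS))
print(list_keys(KEYS, path="folder1/", include_prefix=True))
print(list_keys(KEYS, path="folder1/", recursive=True))

# Hypothetical keys for the matching cases.
MATCH_KEYS = ["folder0/other_a.csv", "folder0/test_key_1.pq"]
print(match_keys(MATCH_KEYS, path="folder0/", matches="other_.*.csv"))
print(match_keys(MATCH_KEYS, path="folder0/", matches="folder0/other_.*.csv"))
print(match_keys(MATCH_KEYS, path="folder0/", matches="f.*der0/other_.*.csv"))
```

In this sketch the last call returns an empty list: the pattern still expects a folder prefix, but it is compared against text from which the path has already been removed.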
- Existence checks
As an extension of listing operations, rivet can check if an object exists at a specific S3 key. Note that for existence to be True, there must be an exact match with the key provided.
Using the following bucket structure:
test_bucket
|---- test_key_0.csv
import rivet as rv
rv.exists('test_key_0.csv', bucket='test_bucket')
Output: True
rv.exists('test_key_1.csv', bucket='test_bucket')
Output: False
rv.exists('test_key_.csv', bucket='test_bucket')
Output: False
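One way an exact-match existence check could be built on a listing call is sketched below. It assumes a client that behaves like boto3.client("s3") and uses its list_objects_v2 call; whether rivet does this internally is not stated in this document. `key_exists` and `FakeS3Client` are hypothetical names, and the fake client exists only so the sketch runs without AWS access.

```python
def key_exists(client, key, bucket):
    """Return True only if an object with exactly this key exists.

    `client` is expected to behave like boto3.client("s3").
    """
    response = client.list_objects_v2(Bucket=bucket, Prefix=key)
    # A prefix listing may also return longer keys; require an exact match.
    return any(obj["Key"] == key for obj in response.get("Contents", []))


class FakeS3Client:
    """In-memory stand-in so the sketch runs without AWS access."""

    def __init__(self, keys):
        self.keys = keys

    def list_objects_v2(self, Bucket, Prefix):
        found = [{"Key": k} for k in self.keys if k.startswith(Prefix)]
        return {"Contents": found} if found else {}


client = FakeS3Client(["test_key_0.csv"])
print(key_exists(client, "test_key_0.csv", "test_bucket"))
print(key_exists(client, "test_key_", "test_bucket"))  # a prefix is not a match
```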
- Copying
It is possible to copy a file from one location in S3 to another using rivet. This function is not configurable - it only takes a source and destination key and bucket.
import rivet as rv
rv.copy(source_path='test_path/df.csv',
        dest_path='test_path_destination/df.csv',
        source_bucket='test_bucket',
        dest_bucket='test_bucket_destination')
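For comparison, a copy with the same four arguments can be expressed with a boto3-style S3 client through its copy_object call. This is a sketch of an equivalent operation, not a description of what rivet does internally; `copy_s3_object` and `RecordingClient` are hypothetical names, and the recording client only stands in for boto3.client("s3") so the example runs offline.

```python
def copy_s3_object(client, source_path, dest_path, source_bucket, dest_bucket):
    """Copy one object using a client that behaves like boto3.client("s3")."""
    client.copy_object(
        Bucket=dest_bucket,
        Key=dest_path,
        CopySource={"Bucket": source_bucket, "Key": source_path},
    )


class RecordingClient:
    """Stand-in client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def copy_object(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
copy_s3_object(client,
               source_path="test_path/df.csv",
               dest_path="test_path_destination/df.csv",
               source_bucket="test_bucket",
               dest_bucket="test_bucket_destination")
print(client.calls[0])
```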
Session-Level Configuration
rivet outputs certain messages to the screen to help interactive users
maintain awareness of what is being performed behind-the-scenes. If this
is not desirable (as may be the case for notebooks, pipelines, usage of
rivet within other packages, etc.), all non-logging output can be
disabled with rv.set_option('verbose', False).