Storage tools
The goal of this project is to make it easy to work with local or cloud storage as part of a data science workflow.
Key Features
You can access a dataset with a single command ...
e.g. download_dataset azure_demo mnist/hand_drawn_digits
... or a few lines of Python
Just import the storage tools core module, create a client and download your dataset.
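For example, a minimal sketch (it assumes new_storage_client is exported from storage_tools.core, the module where the storage client classes live):

# Download the latest version of a dataset with a few lines of Python.
from storage_tools.core import new_storage_client

storage_client = new_storage_client('azure_demo')  # reads secrets/settings.ini
storage_client.download_dataset('mnist/hand_drawn_digits')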
You don't put secret keys in your code
Secret keys live in their own files and storage tools knows how to find them.
Datasets are version controlled
Storage tools makes it easy to
- manage multiple versions of a dataset and
- know which version of the dataset you are working with locally.
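For example, publishing a new version and fetching the latest one again uses only the two calls shown in the walkthrough below (assuming a storage_client created as in the snippet above):

# Publish a new major version of the local copy of a dataset ...
storage_client.upload_dataset('mnist/hand_drawn_digits', 'major')
# ... and fetch the latest version, e.g. on another machine.
storage_client.download_dataset('mnist/hand_drawn_digits')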
Install
pip install storage_tools
Storage tools conventions
We recommend that you ...
Use forward slashes on all platforms
Use forward slashes when specifying files, paths and dataset names.
Avoid whitespace and special characters in file and path names
Use folders for secrets and data
project_root
∟ data
∟ secrets
  ∟ settings.ini
Add the following to .gitignore
secrets/
data/
Add the following to settings.ini
[DEFAULT]
local_path=data
Running storage_client from your project root will read project_root/secrets/settings.ini
and save all local data to project_root/data
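For illustration only, the convention resolves roughly like this (a sketch using the standard library, not storage_tools internals):

from configparser import ConfigParser
from pathlib import Path

# Run from the project root: secrets live in secrets/, data lives in data/.
settings = ConfigParser()
settings.read(Path('secrets') / 'settings.ini')
local_path = Path(settings['DEFAULT']['local_path'])  # -> project_root/data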
How to use Azure storage
If we follow the above conventions and have a project folder containing
project_root
∟ data
  ∟ mnist
    ∟ hand_drawn_digits
      ∟ digit0.png
      ∟ digit1.png
      ∟ ...
∟ secrets
  ∟ settings.ini
∟ main.py
where settings.ini
contains
[DEFAULT]
local_path=data
[azure_demo]
storage_client=storage_tools.core.AzureStorageClient
conn_str=<A connection string to an Azure Storage account without credential>
credential=<The credentials with which to authenticate>
container=<The name of a storage container>
We can use main.py to:
Create a new storage_client
storage_client=new_storage_client('azure_demo')
List files in the Azure container (remote)
storage_client.ls()
List files in data (local)
storage_client.ls('local_path')
Create a new version of a dataset
storage_client.upload_dataset('mnist/hand_drawn_digits','major')
Note: If you run storage_client.ls() again, you'll see the new file in the Azure container.
Download the latest version of a dataset
Feel free to delete your local copy of this dataset (from data) so that it is downloaded from Azure storage.
storage_client.download_dataset('mnist/hand_drawn_digits')
Note: If you run storage_client.ls('local_path')
again, you'll see the dataset in data.
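Putting the steps together, main.py could look something like this (a sketch; as above, it assumes new_storage_client is imported from storage_tools.core):

# main.py -- run from the project root so secrets/settings.ini is found
from storage_tools.core import new_storage_client

storage_client = new_storage_client('azure_demo')

print(storage_client.ls())              # files in the Azure container (remote)
print(storage_client.ls('local_path'))  # files in data (local)

# Publish a new major version of the local dataset ...
storage_client.upload_dataset('mnist/hand_drawn_digits', 'major')

# ... and pull the latest version back down.
storage_client.download_dataset('mnist/hand_drawn_digits')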
See the BlobServiceClient docs for more details on the settings used in settings.ini: from_connection_string (conn_str and credential) and get_container_client (container).
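Roughly, those settings map onto the Azure SDK like this (a sketch of the underlying azure-storage-blob calls, not storage_tools code):

from azure.storage.blob import BlobServiceClient

# Values from the [azure_demo] section of settings.ini
conn_str = '<A connection string to an Azure Storage account without credential>'
credential = '<The credentials with which to authenticate>'
container = '<The name of a storage container>'

blob_service = BlobServiceClient.from_connection_string(conn_str, credential=credential)
container_client = blob_service.get_container_client(container)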
How to use AWS storage
It's the same as Azure except settings.ini
contains
[DEFAULT]
local_path=data
[aws_demo]
storage_client=storage_tools.core.AwsStorageClient
service_name=s3
aws_access_key_id=<An AWS access key ID>
aws_secret_access_key=<An AWS access key>
bucket=<The name of an AWS bucket that the access key is allowed to read from and write to>
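Again roughly, those settings map onto boto3 like this (a sketch of the underlying calls, not storage_tools code):

import boto3

# Values from the [aws_demo] section of settings.ini
aws_access_key_id = '<An AWS access key ID>'
aws_secret_access_key = '<An AWS access key>'
bucket = '<The name of an AWS bucket that the access key is allowed to read from and write to>'

s3 = boto3.client('s3',
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)
response = s3.list_objects_v2(Bucket=bucket)  # e.g. list objects in the bucket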
Developers
git config --global core.autocrlf input
conda create -n storage_tools python==3.8 -y
conda activate storage_tools
pip install fastcore nbdev jupyter
pip install boto3 azure-storage-blob
Type checking with mypy
pip install mypy
Then from the storage_tools project folder
nbdev_build_lib
mypy storage_tools/core.py --ignore-missing-imports
For now, I'm ignoring the "Skipping analyzing 'azure': found module but no type hints or library stubs" error