etcetera
Dataset sharing via cloud storage (S3, Google Storage)
Mental model
A dataset is an immutable collection of files organized in directories (e.g. `train/`, `val/`).
A dataset can have a `meta.json` file, which is a collection of arbitrary key/value pairs.
A dataset can be local or remote. Local datasets are kept in `~/.etc/`. Remote datasets are `.tgz` files stored in cloud storage.
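To make the `meta.json` idea concrete, here is a sketch that writes one with the standard library. The keys shown are hypothetical (etcetera allows arbitrary key/value pairs), and a temporary directory stands in for a dataset root under `~/.etc/`:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical metadata; any key/value pairs are allowed.
meta = {"source": "field survey 2021", "license": "CC-BY-4.0", "records": 1500}

# A temp dir stands in for a local dataset root like ~/.etc/flower.
root = Path(tempfile.mkdtemp())
(root / "meta.json").write_text(json.dumps(meta, indent=2))

# Reading it back yields the same dictionary.
print(json.loads((root / "meta.json").read_text())["records"])  # → 1500
```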
The PyPI package `etcetera` provides:
- a command-line utility `etc`
- a Python package `etcetera`
Using the command line

etc -h
usage: etc [-h] {ls,register,pull,push,purge} ...

etcetera: managing cloud-hosted datasets

positional arguments:
  {ls,register,pull,push,purge}
                        command
    ls                  List datasets
    register            Register directory as a dataset
    pull                Pull dataset from repository
    push                Push dataset to the repository
    purge               Purge local dataset

optional arguments:
  -h, --help            show this help message and exit
Using Python
import etcetera as etc
dataset = etc.dataset('flower', auto_install=True)
dataset.keys()
>> { 'test', 'train' }
for filename in dataset['train'].iterdir():
    print(filename)
>> "~/.etc/flower/train/data00001.txt"
>> "~/.etc/flower/train/data00002.txt"
dataset.meta
>> {}
dataset.root
>> "~/.etc/flower"
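The partitions returned by `dataset[...]` behave like `pathlib.Path` directories (note the `.iterdir()` call above), so standard file handling applies. The following sketch reads JSON records from a partition; since it cannot assume a dataset is installed, a temporary directory stands in for `~/.etc/flower/train` (a real run would use `etc.dataset('flower', auto_install=True)['train']` instead):

```python
import json
import tempfile
from pathlib import Path

# Stand-in for ~/.etc/flower/train; the file names mirror the example above.
train = Path(tempfile.mkdtemp()) / "flower" / "train"
train.mkdir(parents=True)
for i in range(1, 4):
    (train / f"data{i:05d}.json").write_text(json.dumps({"id": i}))

# iterdir() yields the partition's files, as in the README example.
records = [json.loads(p.read_text()) for p in sorted(train.iterdir())]
print([r["id"] for r in records])  # → [1, 2, 3]
```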
Installing
pip install 'etcetera[s3]'
Installs `etcetera` with support for S3 cloud storage.
Configuration
`~/.etc.toml` contains configuration for the service in TOML format. Example:
url = "s3://my-bucket"
Another example:
url = "s3://my-bucket"
public = false
aws_access_key_id = "Axxxx"
aws_secret_access_key = "Kxxx"
endpoint_url = "https://s3.amazonaws.com"
A configuration file is required for remote operations (`pull`, `push`, `ls -r`). It is not required for local operations (`ls`, `register`).
In the configuration file, the `url` value is required. The rest is optional.

- `url`: URL of the remote repository. For example, `s3://my-bucket`.
- `public`: set to `true` if you want `push` to create publicly-readable cloud files. Default is `false`.
- `aws_access_key_id`, `aws_secret_access_key`, `endpoint_url`: credentials and endpoint for the AWS API. If not set, the defaults from the global AWS config are used.
Command-line example
etc ls
etc ls -r
etc pull MNIST
etc register <directory> as SuperMNIST
Creating a dataset
A dataset must have:

- a `data` directory (non-empty)
- the `data` directory must not contain files directly, only sub-directories (we call them "partitions")

Optional:

- `meta.json`
- `README.md`
- other sub-directories, for example `assets/`
A minimal dataset example
sample/
  data/
    train/
      data00001.json
      data00002.json
      data00003.json
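A layout like this can be built with plain `pathlib` before registering it. A sketch (the directory location and file contents are arbitrary):

```python
import tempfile
from pathlib import Path

# Build the minimal layout: sample/data/train/ with three files.
sample = Path(tempfile.mkdtemp()) / "sample"
train = sample / "data" / "train"
train.mkdir(parents=True)
for i in range(1, 4):
    (train / f"data{i:05d}.json").write_text("{}")

# data/ holds only partitions; the files live inside them.
print(sorted(p.name for p in train.iterdir()))
# → ['data00001.json', 'data00002.json', 'data00003.json']
```

Once created, the directory could be registered locally with `etc register <directory> as sample`, per the CLI section above.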
A general dataset example
sample/
  README.md
  meta.json
  assets/
    Analysis.ipynb
    DataCleanup.ipynb
  data/
    train/
      data00001.json
      ...
    test/
      test00001.json
      ...
    val/
      val00001.json
      ...