# etcetera

Dataset sharing via cloud storage (S3, Google Storage)
## Mental model

A dataset is an immutable collection of files organized in directories (e.g. `train/`, `val/`).

A dataset can have a `meta.json` file, which is a collection of arbitrary key/value pairs.

A dataset can be local or remote. Local datasets are kept in `~/.etc/`. Remote datasets are `tgz` files stored in cloud storage.
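To make the layout concrete, here is a minimal sketch that builds such a directory tree by hand. The `flower` name and the `meta.json` keys are invented for illustration, and nothing here uses the package itself:

```python
import json
import tempfile
from pathlib import Path

# Throwaway stand-in for ~/.etc/ (we avoid touching the real home directory)
local_root = Path(tempfile.mkdtemp())

# A dataset is just files organized in directories such as train/ and val/
for partition in ("train", "val"):
    (local_root / "flower" / partition).mkdir(parents=True)
    (local_root / "flower" / partition / "data00001.txt").write_text("sample")

# meta.json holds arbitrary key/value pairs (these keys are made up)
meta = {"source": "example", "license": "CC0"}
(local_root / "flower" / "meta.json").write_text(json.dumps(meta))

print(sorted(p.name for p in (local_root / "flower").iterdir()))
# ['meta.json', 'train', 'val']
```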
The PyPI package `etcetera` provides:

- a command-line utility `etc`
- a Python package `etcetera`
## Using the command line

```shell
$ etc -h
usage: etc [-h] {ls,register,pull,push,purge} ...

etcetera: managing cloud-hosted datasets

positional arguments:
  {ls,register,pull,push,purge}
                        command
    ls        List datasets
    register  Register directory as a dataset
    pull      Pull dataset from repository
    push      Push dataset to the repository
    purge     Purge local dataset

optional arguments:
  -h, --help  show this help message and exit
```
## Using Python

```python
import etcetera as etc

dataset = etc.dataset('flower', auto_install=True)

dataset.keys()
>> { 'test', 'train' }

for filename in dataset['train'].iterdir():
    print(filename)
>> "~/.etc/flower/train/data00001.txt"
>> "~/.etc/flower/train/data00002.txt"

dataset.meta
>> {}

dataset.root
>> "~/.etc/flower"
```
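The package itself is needed for the session above, but the iteration pattern can be sketched with just `pathlib`. The `iterdir()` call suggests partitions behave like `pathlib.Path` directories; that mapping is an assumption, and the directory here is hand-built rather than installed:

```python
import tempfile
from pathlib import Path

# Hand-built stand-in for an installed dataset with 'train' and 'test' partitions
root = Path(tempfile.mkdtemp()) / "flower"
for partition in ("train", "test"):
    (root / partition).mkdir(parents=True)
(root / "train" / "data00001.txt").write_text("x")
(root / "train" / "data00002.txt").write_text("y")

# dataset.keys() corresponds to the set of partition directory names
partitions = {p.name: p for p in root.iterdir() if p.is_dir()}
print(set(partitions))  # {'train', 'test'} (set order not guaranteed)

# dataset['train'].iterdir() corresponds to iterating one partition's files
for filename in sorted(partitions["train"].iterdir()):
    print(filename)
```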
## Installing

```shell
pip install 'etcetera[s3]'
```

This installs `etcetera` with support for S3 storage.
## Configuration

The file `~/.etc.toml` contains configuration for the service in TOML format. Example:

```toml
url = "s3://my-bucket"
```

Another example:

```toml
url = "s3://my-bucket"
public = false
aws_access_key_id = "Axxxx"
aws_secret_access_key = "Kxxx"
endpoint_url = "https://s3.amazonaws.com"
```
A configuration file is required for remote operations (`pull`, `push`, `ls -r`). It is not required for local operations (`ls`, `register`).

In the configuration file, only the `url` value is required; the rest are optional.
- `url`: URL of the remote repository. For example, `s3://my-bucket`.
- `public`: set to `true` if you want `push` to create publicly-readable cloud files. Default is `false`.
- `aws_access_key_id`, `aws_secret_access_key`, `endpoint_url`: values used to access the AWS API. If not set, the defaults from the global AWS config are used.
## Command-line examples

```shell
etc ls
etc ls -r
etc pull MNIST
etc register <directory> as SuperMNIST
```
## Creating a dataset

A dataset must have:

- a non-empty `data` directory
- the `data` directory must not contain any files, only sub-directories (we call these "partitions")

Optional:

- `meta.json`
- `README.md`
- other sub-directories, for example `assets/`
A minimal dataset example:

```
sample/
    data/
        train/
            data00001.json
            data00002.json
            data00003.json
```
A general dataset example:

```
sample/
    README.md
    meta.json
    assets/
        Analysis.ipynb
        DataCleanup.ipynb
    data/
        train/
            data00001.json
            ...
        test/
            test00001.json
            ...
        val/
            val00001.json
            ...
```
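The layout rules above can be checked mechanically. Below is a sketch of a validator; `validate_dataset` is a hypothetical helper, not part of etcetera, and only encodes the rules stated in this section:

```python
import tempfile
from pathlib import Path

def validate_dataset(root: Path) -> list:
    """Return a list of layout problems; an empty list means the tree is valid.

    Hypothetical helper: checks for a non-empty data/ directory that
    contains only sub-directories (partitions).
    """
    errors = []
    data = root / "data"
    if not data.is_dir():
        return ["missing data/ directory"]
    entries = list(data.iterdir())
    if not entries:
        errors.append("data/ directory is empty")
    for entry in entries:
        if not entry.is_dir():
            errors.append(f"data/ must contain only partitions, found file: {entry.name}")
    return errors

# Build the minimal example from above and validate it
root = Path(tempfile.mkdtemp()) / "sample"
(root / "data" / "train").mkdir(parents=True)
for name in ("data00001.json", "data00002.json", "data00003.json"):
    (root / "data" / "train" / name).write_text("{}")

print(validate_dataset(root))  # []
```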