Sharing datasets via cloud storage
etcetera
Dataset sharing via cloud storage (S3, Google Storage)
Mental model
A dataset is an immutable collection of files organized in directories (e.g. train/, val/).
A dataset can have a meta.json file, which is a collection of arbitrary key/value pairs.
A dataset can be local or remote. Local datasets are kept in ~/.etc/. Remote datasets are tgz files stored in cloud storage.
The PyPI package etcetera provides:
- a command-line utility, etc
- a Python package, etcetera
Using Command Line
etc ls
    lists local datasets.
etc ls --remote
    lists remote datasets.
etc pull <DATASET> [-f/--force]
    downloads a remote dataset and installs it locally.
etc push <DATASET> [-f/--force]
    packages a local dataset and uploads it to cloud storage.
etc register <LOCAL_DIR> <DATASET> [-f/--force]
    validates a dataset and registers it as a local dataset.
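Since remote datasets are tgz files, the packaging half of push and the install half of pull can be pictured with the standard tarfile module. This is only a sketch of the idea, not the etcetera implementation: the function names package_dataset and unpack_dataset are hypothetical, and the upload and download steps are left out entirely.

```python
import tarfile
from pathlib import Path


def package_dataset(dataset_dir: Path, out_dir: Path) -> Path:
    """Pack dataset_dir into <out_dir>/<name>.tgz and return the archive path.

    Hypothetical helper for illustration; not part of the etcetera API.
    """
    archive = out_dir / f"{dataset_dir.name}.tgz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname stores paths relative to the dataset name,
        # not the full local path of the source directory
        tar.add(dataset_dir, arcname=dataset_dir.name)
    return archive


def unpack_dataset(archive: Path, install_root: Path) -> Path:
    """Extract a .tgz archive under install_root and return the dataset directory.

    Hypothetical helper for illustration; only extract archives you trust.
    """
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(install_root)
    # "flower.tgz" unpacks to <install_root>/flower
    return install_root / archive.stem
```

Under these assumptions, package_dataset(Path("flower"), Path("/tmp")) would produce /tmp/flower.tgz, and unpack_dataset would restore the same tree under the chosen install root.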
Using Python
import etcetera as etc
dataset = etc.dataset('flower', auto_install=True)
dataset.keys()
>> { 'test', 'train' }
for filename in dataset['train'].iterdir():
    print(filename)
>> "~/.etc/flower/train/data00001.txt"
>> "~/.etc/flower/train/data00002.txt"
dataset.meta
>> {}
dataset.root
>> "~/.etc/flower"
Configuration
~/.etc.yaml contains configuration for the service. Example:
url: "s3://my-bucket"
aws_access_key_id: Axxxx
aws_secret_access_key: Axxx
public: false
Command-line example
etc ls
etc ls -r
etc pull MNIST
etc register <directory> SuperMNIST
Creating a dataset
A dataset must have:
- a data directory (non-empty)
- the data directory must not have any files, only sub-directories (we call them "partitions")
Optional:
- meta.json
- README.md
- other sub-directories, for example assets/
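The layout rules above can be checked with a few lines of pathlib. The following is a minimal sketch of such a check, assuming the rules exactly as listed; validate_dataset is a hypothetical name, and the validation that etc register actually performs may differ.

```python
from pathlib import Path
from typing import List


def validate_dataset(root: Path) -> List[str]:
    """Return a list of layout problems; an empty list means the layout looks valid.

    Hypothetical helper for illustration; not part of the etcetera API.
    """
    data = root / "data"
    if not data.is_dir():
        return ["missing data/ directory"]

    entries = sorted(data.iterdir())
    problems = []
    if not entries:
        problems.append("data/ is empty")
    for entry in entries:
        # data/ may contain only sub-directories ("partitions"), no files
        if not entry.is_dir():
            problems.append(f"data/{entry.name} is a file; only partitions are allowed")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.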
A minimal dataset example
sample/
    data/
        train/
            data00001.json
            data00002.json
            data00003.json
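To try the minimal layout out, the directory tree can be generated on disk with pathlib. This is a sketch under stated assumptions: make_sample_dataset is a hypothetical helper, and the {"id": ...} record written to each file is placeholder content, since the source does not specify what the JSON files contain.

```python
import json
from pathlib import Path


def make_sample_dataset(parent: Path, n_files: int = 3) -> Path:
    """Create <parent>/sample/data/train/ with n_files JSON files and return the root.

    Hypothetical helper for illustration; not part of the etcetera API.
    """
    root = parent / "sample"
    train = root / "data" / "train"
    train.mkdir(parents=True)
    for i in range(1, n_files + 1):
        # placeholder record; real content is up to the dataset author
        (train / f"data{i:05d}.json").write_text(json.dumps({"id": i}))
    return root
```

The returned directory is the kind of path that would be passed as <LOCAL_DIR> to etc register.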
A general dataset example
sample/
    README.md
    meta.json
    assets/
        Analysis.ipynb
        DataCleanup.ipynb
    data/
        train/
            data00001.json
            ...
        test/
            test00001.json
            ...
        val/
            val00001.json
            ...