
Sharing datasets via cloud storage


etcetera


Dataset sharing via cloud storage (S3, Google Storage)

Mental model

A dataset is an immutable collection of files organized in directories (e.g., train/, val/).

A dataset can have a meta.json file, which is a collection of arbitrary key/value pairs.

A dataset can be local or remote. Local datasets are kept in ~/.etc/. Remote datasets are tgz files stored in cloud storage.
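To make the "remote datasets are tgz files" idea concrete, here is a minimal sketch that packs a local dataset directory into a gzip-compressed tar archive using only the standard library. It is an illustration, not the package's actual implementation: pack_dataset and list_archive are hypothetical helper names, and the archive layout (one top-level directory named after the dataset) is an assumption.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_dataset(src_dir: Path, archive_path: Path) -> Path:
    """Pack src_dir into a .tgz archive (hypothetical helper, not part of etcetera)."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the dataset name instead of the full local path.
        tar.add(src_dir, arcname=src_dir.name)
    return archive_path


def list_archive(archive_path: Path) -> list[str]:
    """Return the sorted member names stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Build a tiny throwaway dataset with one partition and pack it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "flower"
    (root / "train").mkdir(parents=True)
    (root / "train" / "data00001.txt").write_text("example")
    archive = pack_dataset(root, Path(tmp) / "flower.tgz")
    members = list_archive(archive)
    print(members)
```

An archive like this could then be uploaded to a bucket with any S3 or Google Storage client; how etcetera itself names and uploads archives is not covered here.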

The PyPI package etcetera provides:

  • a command-line utility etc
  • Python package etcetera

Using Command Line

etc -h
usage: etc [-h] {ls,register,pull,push,purge} ...

etcetera: managing cloud-hosted datasets

positional arguments:
  {ls,register,pull,push,purge}
                        command
    ls                  List datasets
    register            Register directory as a dataset
    pull                Pull dataset from repository
    push                Push dataset to the repository
    purge               Purge local dataset

optional arguments:
  -h, --help            show this help message and exit

Using Python

import etcetera as etc

dataset = etc.dataset('flower', auto_install=True)
dataset.keys()
>> { 'test', 'train' }

for filename in dataset['train'].iterdir():
    print(filename)
>> "~/.etc/flower/train/data00001.txt"
>> "~/.etc/flower/train/data00002.txt"

dataset.meta
>> {}
dataset.root
>> "~/.etc/flower"
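Once a dataset is installed locally, its partitions are ordinary directories, so standard pathlib code can work with them. The sketch below counts files per partition for any dataset root. It assumes the layout suggested by the output above (partition directories directly under the root); files_per_partition is a hypothetical helper, and the temporary directory stands in for a real ~/.etc/flower.

```python
import tempfile
from pathlib import Path


def files_per_partition(root: Path) -> dict[str, int]:
    """Map each partition (sub-directory of root) to the number of files it holds."""
    counts = {}
    for partition in sorted(root.iterdir()):
        if partition.is_dir():
            counts[partition.name] = sum(1 for f in partition.iterdir() if f.is_file())
    return counts


# Stand-in for an installed dataset: two files in train/, one in test/.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "flower"
    for part, n in (("train", 2), ("test", 1)):
        (root / part).mkdir(parents=True)
        for i in range(n):
            (root / part / f"data{i + 1:05d}.txt").write_text("x")
    counts = files_per_partition(root)
    print(counts)
```

With a real dataset, the same helper could presumably be called on the expanded dataset.root path, assuming that attribute is a string or path-like value as the output above suggests.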

Configuration

~/.etc.toml contains configuration for the service in TOML format. Example:

url = "s3://my-bucket"

Another example:

url = "s3://my-bucket"
public = false
aws_access_key_id = "Axxxx"
aws_secret_access_key = "Axxx"
endpoint_url = "https://s3.amazonaws.com"

A configuration file is required for remote operations (pull, push, ls -r).

The url value is required. All other keys are optional.

  • url: URL of the remote repository. For example, s3://my-bucket.
  • public: set to true if you want push to create publicly-readable cloud files. Default is false.
  • aws_access_key_id, aws_secret_access_key, endpoint_url: settings used to access the AWS API. If not set, the defaults from the global AWS configuration are used.

Command-line example

etc ls
etc ls -r
etc pull MNIST
etc register <directory> as SuperMNIST

Creating a dataset

A dataset must have:

  1. a non-empty data directory
  2. only sub-directories directly inside the data directory, no files (we call these sub-directories "partitions")

Optional:

  1. meta.json
  2. README.md
  3. other sub-directories, for example assets/

A minimal dataset example

sample/
    data/
        train/
            data00001.json
            data00002.json
            data00003.json

A general dataset example

sample/
    README.md
    meta.json
    assets/
        Analysis.ipynb
        DataCleanup.ipynb
    data/
        train/
            data00001.json
            ...
        test/
            test00001.json
            ...
        val/
            val00001.json
            ...
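The two requirements for a dataset directory can be checked before registering it. The following sketch is one possible check, not the validation etcetera performs: validate_dataset is a hypothetical helper that returns a list of problems, empty when the layout satisfies both rules.

```python
import tempfile
from pathlib import Path


def validate_dataset(root: Path) -> list[str]:
    """Return layout problems for a candidate dataset directory (empty list if none)."""
    data = root / "data"
    if not data.is_dir():
        return ["missing data/ directory"]
    problems = []
    entries = sorted(data.iterdir())
    if not entries:
        problems.append("data/ is empty")
    for entry in entries:
        # Only partitions (sub-directories) may sit directly under data/.
        if not entry.is_dir():
            problems.append(f"data/{entry.name} is a file; only partitions are allowed")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample"
    (sample / "data" / "train").mkdir(parents=True)
    (sample / "data" / "train" / "data00001.json").write_text("{}")
    ok_problems = validate_dataset(sample)

    # A stray file directly under data/ violates the second rule.
    (sample / "data" / "stray.txt").write_text("oops")
    bad_problems = validate_dataset(sample)
    print(ok_problems, bad_problems)
```

Optional items such as meta.json, README.md, and assets/ live next to data/ and are ignored by this check.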

