
Sharing datasets via cloud storage


etcetera


Dataset sharing via cloud storage (S3, Google Storage)

Mental model

A dataset is an immutable collection of files organized in directories (e.g. train/, val/).

A dataset can have a meta.json file, which is a collection of arbitrary key/value pairs.

A dataset can be local or remote. Local datasets are kept in ~/.etc/. Remote datasets are tgz files stored in cloud storage.
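
To make the model concrete, the sketch below mimics what installing a remote dataset amounts to: unpacking a tgz archive into a directory under a local root and reading the optional meta.json. It uses only the standard library. The helper names `unpack_dataset` and `read_meta`, and the assumption that meta.json sits at the top of the archive, are illustrative and not taken from the etcetera source.

```python
import json
import tarfile
from pathlib import Path


def unpack_dataset(archive: Path, local_root: Path, name: str) -> Path:
    """Extract a .tgz archive into local_root/name and return that directory.

    Hypothetical helper: etcetera's own pull logic may differ.
    """
    target = local_root / name
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members (absolute paths, links out)
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)
    return target


def read_meta(dataset_dir: Path) -> dict:
    """Return the key/value pairs from meta.json, or {} if the file is absent."""
    meta_file = dataset_dir / "meta.json"
    if not meta_file.is_file():
        return {}
    return json.loads(meta_file.read_text())
```

Returning an empty dict when meta.json is missing matches the idea that the file is optional.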

The PyPI package etcetera provides:

  • a command-line utility, etc
  • a Python package, etcetera

Using Command Line

etc -h
usage: etc [-h] {ls,register,pull,push,purge} ...

etcetera: managing cloud-hosted datasets

positional arguments:
  {ls,register,pull,push,purge}
                        command
    ls                  List datasets
    register            Register directory as a dataset
    pull                Pull dataset from repository
    push                Push dataset to the repository
    purge               Purge local dataset

optional arguments:
  -h, --help            show this help message and exit

Using Python

import etcetera as etc

dataset = etc.dataset('flower', auto_install=True)
dataset.keys()
# >> { 'test', 'train' }

for filename in dataset['train'].iterdir():
    print(filename)
# >> "~/.etc/flower/train/data00001.txt"
# >> "~/.etc/flower/train/data00002.txt"

dataset.meta
# >> {}
dataset.root
# >> "~/.etc/flower"

Configuration

~/.etc.toml contains configuration for the service in TOML format. Example:

url = "s3://my-bucket"

Another example:

url = "s3://my-bucket"
public = false
aws_access_key_id = "Axxxx"
aws_secret_access_key = "Axxx"
endpoint_url = "https://s3.amazonaws.com"

A configuration file is required for remote operations (pull, push, ls -r).

The url value is required. All other keys are optional.

  • url: URL of the remote repository. For example, s3://my-bucket.
  • public: set to true if you want push to create publicly-readable cloud files. Default is false.
  • aws_access_key_id, aws_secret_access_key, endpoint_url: settings used to access the AWS API. If not set, the defaults from the global AWS config will be used.
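
The rules above can be sketched as a small loader that parses the TOML text, insists on url, and fills in the documented default for public. This is an illustration only: `load_config` and `DEFAULTS` are hypothetical names, and etcetera may read its configuration differently. Note that TOML requires string values to be quoted. The parser used here is the standard-library tomllib (Python 3.11+), with the third-party tomli package as a fallback on older interpreters.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli is installed on older Pythons

# Documented default: pushed files are not publicly readable
DEFAULTS = {"public": False}


def load_config(text: str) -> dict:
    """Parse configuration text and apply the documented rules.

    Hypothetical helper, not part of the etcetera API.
    """
    config = tomllib.loads(text)
    if "url" not in config:
        raise ValueError("configuration must define 'url'")
    # Values from the file override the defaults
    return {**DEFAULTS, **config}


config = load_config('url = "s3://my-bucket"\n')
print(config["url"], config["public"])
```

The AWS keys are left out of `DEFAULTS` on purpose, since the text says unset values fall back to the global AWS configuration.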

Command-line example

etc ls
etc ls -r
etc pull MNIST
etc register <directory> as SuperMNIST

Creating a dataset

A dataset must have:

  1. a non-empty data directory
  2. only sub-directories (we call them "partitions") directly inside the data directory, with no files at that level

Optional:

  1. meta.json
  2. README.md
  3. other sub-directories, for example assets/

A minimal dataset example

sample/
    data/
        train/
            data00001.json
            data00002.json
            data00003.json

A general dataset example

sample/
    README.md
    meta.json
    assets/
        Analysis.ipynb
        DataCleanup.ipynb
    data/
        train/
            data00001.json
            ...
        test/
            test00001.json
            ...
        val/
            val00001.json
            ...
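
Before registering a directory, the layout requirements can be checked with a few lines of standard-library Python. The sketch below is an assumption about how such a check might look; `validate_dataset` is a hypothetical name, and the checks etcetera itself performs during register may differ.

```python
from pathlib import Path


def validate_dataset(root: Path) -> list:
    """Return a list of layout problems; an empty list means the layout looks valid.

    Hypothetical helper that encodes the two required rules:
    data/ exists and is non-empty, and it holds only sub-directories.
    """
    data = root / "data"
    if not data.is_dir():
        return [f"{root} has no data/ directory"]

    problems = []
    entries = sorted(data.iterdir())
    if not entries:
        problems.append("data/ is empty")
    for entry in entries:
        if not entry.is_dir():
            problems.append(
                f"data/{entry.name} is a file, but only partition sub-directories are allowed"
            )
    return problems
```

Optional items such as meta.json, README.md, and assets/ are deliberately ignored, since their absence is not an error.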

