Sharing datasets via cloud storage

Environment
- Web Environment
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python
- Python :: 3

Project description

etcetera

Dataset sharing via cloud storage (S3, Google Storage)

Mental model

A dataset is an immutable collection of files organized in directories (e.g., train/, val/).

A dataset can have a meta.json file, which is a collection of arbitrary key/value pairs.

A dataset can be local or remote. Local datasets are kept in ~/.etc/. Remote datasets are tgz files stored in cloud storage.
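To make the "remote datasets are tgz files" idea concrete, here is a minimal sketch that packs a dataset directory into a gzip-compressed tar archive using only the standard library. It is an illustration, not the code etcetera itself uses: the function names pack_dataset and list_members are hypothetical, and the archive layout (paths stored relative to the dataset name) is an assumption.

```python
import tarfile
from pathlib import Path


def pack_dataset(root: Path, archive: Path) -> Path:
    """Pack the directory `root` into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        # Assumption: store paths relative to the dataset name,
        # e.g. "flower/train/data00001.txt".
        tar.add(root, arcname=root.name)
    return archive


def list_members(archive: Path) -> list[str]:
    """Return the sorted paths of the regular files stored in the archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

Because a dataset is immutable, an archive like this can be built once and uploaded as a single object; readers then download and unpack it as a whole.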

The PyPI package etcetera provides:

a command-line utility, etc
a Python package, etcetera

Using Command Line

etc -h
usage: etc [-h] {ls,register,pull,push,purge} ...

etcetera: managing cloud-hosted datasets

positional arguments:
  {ls,register,pull,push,purge}
                        command
    ls                  List datasets
    register            Register directory as a dataset
    pull                Pull dataset from repository
    push                Push dataset to the repository
    purge               Purge local dataset

optional arguments:
  -h, --help            show this help message and exit

Using Python

import etcetera as etc

dataset = etc.dataset('flower', auto_install=True)
dataset.keys()
>> { 'test', 'train' }

for filename in dataset['train'].iterdir():
    print(filename)
>> "~/.etc/flower/train/data00001.txt"
>> "~/.etc/flower/train/data00002.txt"

dataset.meta
>> {}
dataset.root
>> "~/.etc/flower"

Installing

pip install 'etcetera[s3]'

Installs etcetera with support for S3 cloud storage.

Configuration

~/.etc.toml contains configuration for the service in TOML format. Example:

url = "s3://my-bucket"

Another example:

url = "s3://my-bucket"
public = false
aws_access_key_id = "Axxxx"
aws_secret_access_key = "Kxxx"
endpoint_url = "https://s3.amazonaws.com"

A configuration file is required for remote operations (pull, push, ls -r). It is not required for local operations (ls, register).

In the configuration file, the url value is required. All other keys are optional.

url: URL of the remote repository. For example, s3://my-bucket.
public: set to true if you want push to create publicly readable cloud files. The default is false.
aws_access_key_id, aws_secret_access_key, endpoint_url: settings used to access the AWS API. If not set, the defaults from the global AWS configuration are used.
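The rules above (url required, public defaulting to false) can be sketched as a small parser. This is a hypothetical helper, not etcetera's own loader: parse_config is an invented name, and the sketch assumes Python 3.11+ for the standard tomllib module, falling back to the third-party tomli package on older versions.

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API


def parse_config(text: str) -> dict:
    """Parse TOML text and apply the rules described above."""
    config = tomllib.loads(text)
    if "url" not in config:
        raise ValueError("configuration must define 'url'")
    # 'public' defaults to false when it is omitted.
    config.setdefault("public", False)
    return config


example = 'url = "s3://my-bucket"\n'
print(parse_config(example))
```

The AWS keys are deliberately left untouched here: when they are absent, the text above says the global AWS configuration supplies the defaults, so there is nothing to fill in at this level.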

Command-line example

etc ls
etc ls -r
etc pull MNIST
etc register <directory> as SuperMNIST

Creating a dataset

A dataset must have:

a non-empty data directory
no files directly inside the data directory, only sub-directories (we call them "partitions")

Optional:

meta.json
README.md
other sub-directories, for example assets/
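Before registering a directory, it can help to check it against the requirements listed above. The following is a minimal sketch of such a check; check_dataset is a hypothetical helper and not part of the etcetera package, and it only covers the two mandatory rules (the optional files are not inspected).

```python
from pathlib import Path


def check_dataset(root: Path) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks valid."""
    data = root / "data"
    if not data.is_dir():
        return ["missing data directory"]
    problems = []
    entries = list(data.iterdir())
    if not entries:
        problems.append("data directory is empty")
    for entry in entries:
        # Only partition sub-directories are allowed directly under data/.
        if not entry.is_dir():
            problems.append(f"data/{entry.name} is a file, expected a partition directory")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every layout problem in one pass.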

A minimal dataset example

sample/
    data/
        train/
            data00001.json
            data00002.json
            data00003.json
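
The tree above can be created on disk with a few lines of standard-library Python. This is only a sketch for experimenting with the layout: make_sample is a hypothetical helper, and the JSON record written to each file ({"id": ...}) is placeholder content, since the original does not specify what the files contain.

```python
import json
from pathlib import Path


def make_sample(parent: Path, name: str = "sample", count: int = 3) -> Path:
    """Create a minimal dataset layout under `parent` and return its root."""
    root = parent / name
    train = root / "data" / "train"
    train.mkdir(parents=True)
    for i in range(1, count + 1):
        # File names follow the data00001.json pattern from the tree above;
        # the record itself is placeholder content.
        (train / f"data{i:05d}.json").write_text(json.dumps({"id": i}))
    return root
```

The resulting directory has a single partition, train, which is the smallest layout that satisfies the requirements for a dataset.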

A general dataset example

sample/
    README.md
    meta.json
    assets/
        Analysis.ipynb
        DataCleanup.ipynb
    data/
        train/
            data00001.json
            ...
        test/
            test00001.json
            ...
        val/
            val00001.json
            ...

Release history

0.0.12 (this version): Jun 22, 2020
0.0.11: Jun 22, 2020
0.0.10: Jun 21, 2020
0.0.8: May 19, 2020
0.0.7: May 19, 2020
0.0.6: May 19, 2020
0.0.5: May 19, 2020
0.0.4: May 19, 2020
0.0.2: May 19, 2020
0.0.1: May 19, 2020

Built Distribution

etcetera-0.0.12-py3-none-any.whl (9.9 kB), uploaded Jun 22, 2020, Python 3

Hashes for etcetera-0.0.12-py3-none-any.whl

Algorithm	Hash digest
SHA256	`b19fafbe904d5f131e60fa68005b935050a71505622dcae062a6f331ed3d487a`
MD5	`64a7fc3b461ab489e80a9264327e49bb`
BLAKE2b-256	`acd6a3f1c4e0555f9d02b6cc390ec99d67e33d4fdec2961a59ef42ffb9647729`