"Dataset management command line and API"

Introduction

Full documentation can be found at http://datamaestro.rtfd.io

This project aims to group utilities for dealing with the numerous and heterogeneous datasets available on the Web. It aims to be:

  1. a reference for available resources, listing datasets
  2. a tool to automatically download and process resources (when freely available)
  3. an integration with the experimaestro experiment manager
  4. (planned) a tool for copying data from one computer to another

Each dataset is uniquely identified by a qualified name such as com.lecun.mnist, which is usually the reversed domain name of the website associated with the dataset, followed by the dataset name.

The main repository only deals with very generic processing (downloading, basic pre-processing, and data types). Plugins can then be registered to provide access to domain-specific datasets.
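For instance, the image datasets used in the examples below are provided by the datamaestro_image plugin.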

List of repositories

Command line interface (CLI)

The command line interface allows you to interact with the datasets. The commands are listed below; help can be obtained by typing datamaestro COMMAND --help:

  • search searches for datasets by name, tags and/or tasks
  • download downloads files (if accessible on the Internet) or asks for a download path otherwise
  • prepare downloads the dataset files and outputs a JSON document containing paths and other dataset information
  • repositories lists the available repositories
  • orphans lists data directories that do not correspond to any registered dataset (and allows cleaning them up)
  • create-dataset creates a dataset definition

Example (CLI)

Retrieve and download

The command line interface can automatically download the different resources. Datamaestro extensions can provide additional processing tools.

$ datamaestro search tag:image
[image] com.lecun.mnist

$ datamaestro prepare com.lecun.mnist
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-labels-idx1-ubyte
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-images-idx3-ubyte
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-images-idx3-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz: 32.8kB [00:00, 92.1kB/s]
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-images-idx3-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz: 9.92MB [00:00, 10.6MB/s]
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-images-idx3-ubyte
...JSON...

The previous command also outputs a JSON document on standard output:

{
  "train": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/train_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/train_labels.idx"
    }
  },
  "test": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/test_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/test_labels.idx"
    }
  },
  "id": "com.lecun.mnist"
}
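Since the JSON document is what prepare writes to standard output, it can also be consumed programmatically. Below is a minimal sketch of calling the CLI from Python; it assumes that the INFO log messages above go to standard error (Python's logging default), so that standard output contains only the JSON.

import json
import subprocess

# Run "datamaestro prepare" and parse the JSON it writes to stdout.
# Assumption: log messages go to stderr, so stdout is pure JSON.
result = subprocess.run(
    ["datamaestro", "prepare", "com.lecun.mnist"],
    capture_output=True,
    text=True,
    check=True,
)
info = json.loads(result.stdout)
print(info["id"])                       # com.lecun.mnist
print(info["train"]["images"]["path"])  # path to the prepared IDX file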

For those using Python, this is even better since the IDX format is directly supported:

In [1]: from datamaestro import prepare_dataset
In [2]: ds = prepare_dataset("com.lecun.mnist")
In [3]: ds.train.images.data().dtype, ds.train.images.data().shape
Out[3]: (dtype('uint8'), (60000, 28, 28))
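Since data() returns a NumPy array, the dataset can be fed straight into a NumPy-based pipeline. The sketch below assumes that the labels expose the same data() accessor as the images (both are IDX-backed), which is not shown in the session above.

import numpy as np
from datamaestro import prepare_dataset

ds = prepare_dataset("com.lecun.mnist")
images = ds.train.images.data()  # uint8 array of shape (60000, 28, 28)
labels = ds.train.labels.data()  # assumption: same accessor as the images

# Per-digit counts, and a flattened float view ready for a learning pipeline
print(np.bincount(labels))
x = images.reshape(len(images), -1).astype(np.float32) / 255.0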

Python definition of datasets

Each dataset (or set of related datasets) is described in Python using a mix of declarative and imperative statements. This makes it possible to quickly define how to download a dataset using the datamaestro declarative API; the imperative part is used when creating the JSON output, and is integrated with experimaestro.

Its syntax is described in the documentation.

For MNIST, this corresponds to:

from datamaestro_image.data import ImageClassification, LabelledImages, Base, IDXImage
from datamaestro.download.single import filedownloader
from datamaestro.definitions import argument, datatasks, datatags, dataset
from datamaestro.data.tensor import IDX


@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
@dataset(
  ImageClassification,
  url="http://yann.lecun.com/exdb/mnist/",
)
def MNIST(train_images, train_labels, test_images, test_labels):
  """The MNIST database

  The MNIST database of handwritten digits, available from this page, has a
  training set of 60,000 examples, and a test set of 10,000 examples. It is a
  subset of a larger set available from NIST. The digits have been
  size-normalized and centered in a fixed-size image.
  """
  return {
    "train": LabelledImages(
      images=IDXImage(path=train_images),
      labels=IDX(path=train_labels)
    ),
    "test": LabelledImages(
      images=IDXImage(path=test_images),
      labels=IDX(path=test_labels)
    ),
  }
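Note how the dictionary returned by MNIST mirrors the JSON document shown earlier: the train and test entries each hold images and labels whose paths are filled in once the corresponding files have been downloaded.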

Changelog

0.8.0

  • Integration with other repositories: abstracting away the notion of dataset
  • Repository prefix
  • Set sub-datasets IDs automatically

0.7.3

  • Updates for new experimaestro (0.8.5)
  • Search types with "type:..."

0.6.17

  • Allow remote access through rpyc

0.6.9

  • Added the version command
