Skip to main content

Add your description here

Project description

PyPI version pre-commit DOI

Introduction

Full documentation can be found at http://datamaestro.rtfd.io

This projects aims at grouping utilities to deal with the numerous and heterogenous datasets present on the Web. It aims at being

  1. a reference for available resources, listing datasets
  2. a tool to automatically download and process resources (when freely available)
  3. integration with the experimaestro experiment manager.
  4. (planned) a tool that allows to copy data from one computer to another

Each datasets is uniquely identified by a qualified name such as com.lecun.mnist, which is usually the inversed path to the domain name of the website associated with the dataset.

The main repository only deals with very generic processing (downloading, basic pre-processing and data types). Plugins can then be registered that provide access to domain specific datasets.

List of repositories

Command line interface (CLI)

The command line interface allows to interact with the datasets. The commands are listed below, help can be found by typing datamaestro COMMAND --help:

  • search search dataset by name, tags and/or tasks
  • download download files (if accessible on Internet) or ask for download path otherwise
  • prepare download dataset files and outputs a JSON containing path and other dataset information
  • repositories list the available repositories
  • orphans list data directories that do no correspond to any registered dataset (and allows to clean them up)
  • create-dataset creates a dataset definition

Example (CLI)

Retrieve and download

The commmand line interface allows to download automatically the different resources. Datamaestro extensions can provide additional processing tools.

$ datamaestro search tag:image
[image] com.lecun.mnist

$ datamaestro prepare com.lecun.mnist
INFO:root:Materializing 4 resources
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz into .../datamaestro/store/com/lecun/train_images.idx
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz into .../datamaestro/store/com/lecun/test_images.idx
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz into .../datamaestro/store/com/lecun/test_labels.idx

The previous command also returns a JSON on standard output

{
  "train": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/train_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/train_labels.idx"
    }
  },
  "test": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/test_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/test_labels.idx"
    }
  },
  "id": "com.lecun.mnist"
}

For those using Python, this is even better since the IDX format is supported

In [1]: from datamaestro import prepare_dataset
In [2]: ds = prepare_dataset("com.lecun.mnist")
In [3]: ds.train.images.data().dtype, ds.train.images.data().shape
Out[3]: (dtype('uint8'), (60000, 28, 28))

Python definition of datasets

Each dataset (or a set of related datasets) is described in Python using a mix of declarative and imperative statements. This allows to quickly define how to download dataset using the datamaestro declarative API; the imperative part is used when creating the JSON output, and is integrated with experimaestro.

Its syntax is described in the documentation.

For instance, the MNIST dataset can be described by the following

from datamaestro import dataset
from datamaestro.download.single import download_file
from datamaestro_image.data import ImageClassification, LabelledImages, IDXImage


@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
@dataset(
  ImageClassification,
  url="http://yann.lecun.com/exdb/mnist/",
)

    return ImageClassification(
        train=LabelledImages(
            images=IDXImage(path=train_images), labels=IDXImage(path=train_labels)
        ),
        test=LabelledImages(
            images=IDXImage(path=test_images), labels=IDXImage(path=test_labels)
        ),
    )

When building dataset modules, some extra documentation can be provided:

  ids: [com.lecun.mnist]
  entry_point: "datamaestro_image.config.com.lecun:mnist"
  title: The MNIST database
  url: http://yann.lecun.com/exdb/mnist/
  groups: [image-classification]
  description: |
    The MNIST database of handwritten digits, available from this page,
    has a training set of 60,000 examples, and a test set of 10,000
    examples. It is a subset of a larger set available from NIST. The
    digits have been size-normalized and centered in a fixed-size image.

This will allow to

  1. Document the dataset
  2. Allow to use the command line interface to manipulate it (download resources, etc.)

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datamaestro-1.5.2.tar.gz (127.8 kB view details)

Uploaded Source

Built Distribution

datamaestro-1.5.2-py3-none-any.whl (62.8 kB view details)

Uploaded Python 3

File details

Details for the file datamaestro-1.5.2.tar.gz.

File metadata

  • Download URL: datamaestro-1.5.2.tar.gz
  • Upload date:
  • Size: 127.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for datamaestro-1.5.2.tar.gz
Algorithm Hash digest
SHA256 0d7899d9aec7dffadef94254fa9028a9e70400555770c8fcdf2944eaaa7f6b71
MD5 b013ab96433be6d5e21ebd9c0986c9aa
BLAKE2b-256 e7c8a61a0138b4e0f8db20f44441fd92bb4b6a063d86ce000c0c949df70d4ec9

See more details on using hashes here.

File details

Details for the file datamaestro-1.5.2-py3-none-any.whl.

File metadata

  • Download URL: datamaestro-1.5.2-py3-none-any.whl
  • Upload date:
  • Size: 62.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for datamaestro-1.5.2-py3-none-any.whl
Algorithm Hash digest
SHA256 6834d4e8e50d0a6731c29c7504282d9b724c6e7ac1c4b99ff63dc097c08d8aab
MD5 7506a157e9e135ee5cc3e6643f9e91e5
BLAKE2b-256 b40805989dd23bf586e2944fe2042c52f9a095e5fc04279c603aa793bc4ea7ca

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page