Introduction

Full documentation can be found at http://datamaestro.rtfd.io

This project aims to group utilities for dealing with the numerous and heterogeneous datasets available on the Web. It is intended to be:

  1. a reference for available resources, listing datasets
  2. a tool to automatically download and process resources (when freely available)
  3. an integration with the experimaestro experiment manager
  4. (planned) a tool for copying data from one computer to another

Each dataset is uniquely identified by a qualified name such as com.lecun.mnist, which is usually the reversed domain name of the website associated with the dataset, followed by the dataset name.
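The naming convention can be illustrated with a minimal sketch. The helper qualified_name below is hypothetical and not part of datamaestro; real identifiers are chosen by the dataset authors and may drop subdomains, as com.lecun.mnist does for a dataset hosted at yann.lecun.com.

```python
from urllib.parse import urlparse


def qualified_name(url: str, dataset: str) -> str:
    """Build a reversed-domain identifier (hypothetical helper, not a datamaestro API)."""
    host = urlparse(url).hostname or ""
    labels = [label for label in host.split(".") if label]
    # Reverse the domain labels ("lecun.com" becomes "com.lecun"),
    # then append the dataset name.
    return ".".join(list(reversed(labels)) + [dataset])


print(qualified_name("http://lecun.com/", "mnist"))  # prints com.lecun.mnist
```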

The main repository only deals with very generic processing (downloading, basic pre-processing and data types). Plugins can then be registered to provide access to domain-specific datasets.

List of repositories

Command line interface (CLI)

The command line interface lets you interact with the datasets. The commands are listed below; help for each one is available by typing datamaestro COMMAND --help:

  • search: search datasets by name, tags and/or tasks
  • download: download files (if accessible on the Internet), or ask for a download path otherwise
  • prepare: download dataset files and output a JSON document containing paths and other dataset information
  • repositories: list the available repositories
  • orphans: list data directories that do not correspond to any registered dataset (and optionally clean them up)
  • create-dataset: create a dataset definition

Example (CLI)

Retrieve and download

The command line interface can automatically download the different resources. Datamaestro extensions can provide additional processing tools.

$ datamaestro search tag:image
[image] com.lecun.mnist

$ datamaestro prepare com.lecun.mnist
INFO:root:Materializing 4 resources
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz into .../datamaestro/store/com/lecun/train_images.idx
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz into .../datamaestro/store/com/lecun/test_images.idx
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz into .../datamaestro/store/com/lecun/test_labels.idx

The previous command also prints a JSON document to standard output:

{
  "train": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/train_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/train_labels.idx"
    }
  },
  "test": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/test_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/test_labels.idx"
    }
  },
  "id": "com.lecun.mnist"
}
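A script that consumes this output could walk the JSON and collect the file paths. The following is a minimal sketch: it embeds the sample shown above as a string (with the elided paths left as they appear) rather than invoking the CLI, and collect_paths is a hypothetical helper, not part of datamaestro.

```python
import json

# Sample copied from the example output above; paths are elided in the source.
PREPARE_OUTPUT = """
{
  "train": {
    "images": {"path": ".../data/image/com/lecun/mnist/train_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/train_labels.idx"}
  },
  "test": {
    "images": {"path": ".../data/image/com/lecun/mnist/test_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/test_labels.idx"}
  },
  "id": "com.lecun.mnist"
}
"""


def collect_paths(node, prefix=""):
    """Recursively gather every "path" entry, keyed by its dotted location."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "path" and isinstance(value, str):
                found[prefix] = value
            else:
                location = f"{prefix}.{key}" if prefix else key
                found.update(collect_paths(value, location))
    return found


info = json.loads(PREPARE_OUTPUT)
paths = collect_paths(info)
print(info["id"])
for name, path in sorted(paths.items()):
    print(name, "->", path)
```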

For Python users, this is even more convenient, since the IDX format is supported directly:

In [1]: from datamaestro import prepare_dataset
In [2]: ds = prepare_dataset("com.lecun.mnist")
In [3]: ds.train.images.data().dtype, ds.train.images.data().shape
Out[3]: (dtype('uint8'), (60000, 28, 28))
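For readers unfamiliar with IDX, the layout is simple: a 4-byte header (two zero bytes, a data-type code, and the number of dimensions), followed by each dimension as a big-endian 32-bit integer, followed by the raw values. The sketch below parses an uncompressed buffer of unsigned bytes using only the standard library. It is an illustration of the file format, not datamaestro's own reader, and parse_idx_uint8 is a hypothetical name.

```python
import struct


def parse_idx_uint8(raw: bytes):
    """Parse an uncompressed IDX buffer of unsigned bytes into (shape, flat values)."""
    # Header: two zero bytes, one data-type byte, one dimension-count byte.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("only unsigned-byte IDX data is handled in this sketch")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    shape = struct.unpack(f">{ndim}I", raw[4 : 4 + 4 * ndim])
    count = 1
    for dim in shape:
        count *= dim
    data = raw[4 + 4 * ndim :]
    if len(data) != count:
        raise ValueError("payload size does not match the declared shape")
    return shape, list(data)


# Synthetic 2 x 3 array holding the values 0..5
sample = struct.pack(">HBB", 0, 0x08, 2) + struct.pack(">2I", 2, 3) + bytes(range(6))
shape, values = parse_idx_uint8(sample)
print(shape, values)  # prints (2, 3) [0, 1, 2, 3, 4, 5]
```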

Python definition of datasets

Datasets are defined as Python classes with resource attributes that describe how to download and process data. The framework automatically builds a dependency graph and handles downloads with two-path safety and state tracking.

from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import FileDownloader
from datamaestro.definitions import Dataset, dataset


@dataset(url="http://yann.lecun.com/exdb/mnist/")
class MNIST(Dataset):
    """The MNIST database of handwritten digits."""

    TRAIN_IMAGES = FileDownloader(
        "train_images.idx",
        "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
    )
    TRAIN_LABELS = FileDownloader(
        "train_labels.idx",
        "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    )
    TEST_IMAGES = FileDownloader(
        "test_images.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
    )
    TEST_LABELS = FileDownloader(
        "test_labels.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    )

    def config(self) -> ImageClassification:
        return ImageClassification.C(
            train=LabelledImages(
                images=IDX(path=self.TRAIN_IMAGES.path),
                labels=IDX(path=self.TRAIN_LABELS.path),
            ),
            test=LabelledImages(
                images=IDX(path=self.TEST_IMAGES.path),
                labels=IDX(path=self.TEST_LABELS.path),
            ),
        )

Its syntax is described in the documentation.
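The idea of ordering resources through a dependency graph can be pictured with the standard library's graphlib module. This is only a conceptual sketch under assumed names: the resource names below are hypothetical, and datamaestro's actual graph construction and state tracking are not shown here.

```python
from graphlib import TopologicalSorter

# Hypothetical resources: each key depends on the resources listed in its set.
dependencies = {
    "archive": set(),
    "extracted": {"archive"},
    "train_split": {"extracted"},
    "test_split": {"extracted"},
}

# static_order() yields every resource after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any valid order starts with the archive, then the extracted files, then the two splits in either order.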
