These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python
- Python :: 3
Topic
- Software Development :: Libraries :: Python Modules

Project description

Introduction

Full documentation can be found at http://datamaestro.rtfd.io

This project groups utilities for dealing with the numerous and heterogeneous datasets available on the Web. It aims to be:

- a reference for available resources, listing datasets
- a tool to automatically download and process resources (when freely available)
- an integration with the experimaestro experiment manager
- (planned) a tool for copying data from one computer to another

Each dataset is uniquely identified by a qualified name such as com.lecun.mnist, which is usually built from the reversed domain name of the website associated with the dataset.
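
The naming convention above can be illustrated with a minimal sketch. The helper qualified_name is hypothetical and is not part of the datamaestro API; it only shows how reversing the labels of a domain and appending a dataset name yields an identifier of the form com.lecun.mnist.

```python
def qualified_name(domain: str, dataset: str) -> str:
    """Reverse the labels of a domain name and append the dataset name.

    Hypothetical helper for illustration only; datamaestro may build
    its identifiers differently.
    """
    labels = [label for label in domain.lower().split(".") if label]
    return ".".join(reversed(labels)) + "." + dataset


print(qualified_name("lecun.com", "mnist"))  # com.lecun.mnist
```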

The main repository only deals with very generic processing (downloading, basic pre-processing and data types). Plugins that provide access to domain-specific datasets can then be registered.

List of repositories

- Information Retrieval and NLP datasets: Natural Language Processing (e.g. Sentiment101) and Information access (e.g. TREC) datasets
- image-related dataset: Image-related datasets (e.g. MNIST)
- machine learning: Generic machine learning datasets

Command line interface (CLI)

The command line interface lets you interact with the datasets. The commands are listed below; help for each one is available by typing datamaestro COMMAND --help:

- search: search datasets by name, tags and/or tasks
- download: download files (if accessible on the Internet) or ask for the download path otherwise
- prepare: download dataset files and output a JSON object containing paths and other dataset information
- repositories: list the available repositories
- orphans: list data directories that do not correspond to any registered dataset (and allow cleaning them up)
- create-dataset: create a dataset definition

Example (CLI)

Retrieve and download

The command line interface can download the different resources automatically. Datamaestro extensions can provide additional processing tools.

$ datamaestro search tag:image
[image] com.lecun.mnist

$ datamaestro prepare com.lecun.mnist
INFO:root:Materializing 4 resources
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz into .../datamaestro/store/com/lecun/train_images.idx
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz into .../datamaestro/store/com/lecun/test_images.idx
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz into .../datamaestro/store/com/lecun/test_labels.idx

The previous command also prints a JSON object on standard output:

{
  "train": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/train_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/train_labels.idx"
    }
  },
  "test": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/test_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/test_labels.idx"
    }
  },
  "id": "com.lecun.mnist"
}
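
A script that consumes this output could flatten it into a mapping from dataset part to file path. The sketch below is only an illustration: the function collect_paths is hypothetical, and the sample string reproduces the JSON shown above with its elided paths kept as-is. In practice the text would be whatever datamaestro prepare prints on standard output, assuming the JSON is the only content there.

```python
import json


def collect_paths(node, prefix=""):
    """Return {dotted.key: path} for every "path" entry in nested JSON data."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "path" and isinstance(value, str):
                # Record the path under the name of its enclosing object.
                found[prefix] = value
            else:
                name = f"{prefix}.{key}" if prefix else key
                found.update(collect_paths(value, name))
    return found


# Sample copied from the output above (paths are elided in the source).
sample = """
{
  "train": {
    "images": {"path": ".../data/image/com/lecun/mnist/train_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/train_labels.idx"}
  },
  "test": {
    "images": {"path": ".../data/image/com/lecun/mnist/test_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/test_labels.idx"}
  },
  "id": "com.lecun.mnist"
}
"""

info = json.loads(sample)
paths = collect_paths(info)
for name, path in sorted(paths.items()):
    print(name, "->", path)
```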

For Python users, access is even more direct, since datamaestro supports the IDX format:

In [1]: from datamaestro import prepare_dataset
In [2]: ds = prepare_dataset("com.lecun.mnist")
In [3]: ds.train.images.data().dtype, ds.train.images.data().shape
Out[3]: (dtype('uint8'), (60000, 28, 28))
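
For readers unfamiliar with IDX, the sketch below parses the header of an uncompressed IDX byte string using only the standard library. It is not datamaestro's implementation; parse_idx and IDX_TYPES are names chosen for illustration. It follows the usual IDX layout: two zero bytes, a type code, the number of dimensions, then one big-endian 32-bit size per dimension, followed by the data.

```python
import struct

# IDX type codes mapped to struct format characters (big-endian data).
IDX_TYPES = {0x08: "B", 0x09: "b", 0x0B: "h", 0x0C: "i", 0x0D: "f", 0x0E: "d"}


def parse_idx(blob: bytes):
    """Return (shape, flat_values) for an uncompressed IDX byte string."""
    zero1, zero2, type_code, ndim = struct.unpack_from(">BBBB", blob, 0)
    if zero1 != 0 or zero2 != 0 or type_code not in IDX_TYPES:
        raise ValueError("not an IDX byte string")
    # One unsigned 32-bit big-endian size per dimension.
    shape = struct.unpack_from(f">{ndim}I", blob, 4)
    count = 1
    for size in shape:
        count *= size
    values = struct.unpack_from(
        f">{count}{IDX_TYPES[type_code]}", blob, 4 + 4 * ndim
    )
    return shape, values


# Tiny in-memory example: a 2x3 array of unsigned bytes.
blob = bytes([0, 0, 0x08, 2]) + struct.pack(">II", 2, 3) + bytes(range(6))
shape, values = parse_idx(blob)
print(shape, values)  # (2, 3) (0, 1, 2, 3, 4, 5)
```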

Python definition of datasets

Datasets are defined as Python classes with resource attributes that describe how to download and process data. The framework automatically builds a dependency graph and handles downloads with two-path safety and state tracking.

from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import FileDownloader
from datamaestro.definitions import Dataset, dataset


@dataset(url="http://yann.lecun.com/exdb/mnist/")
class MNIST(Dataset):
    """The MNIST database of handwritten digits."""

    TRAIN_IMAGES = FileDownloader(
        "train_images.idx",
        "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
    )
    TRAIN_LABELS = FileDownloader(
        "train_labels.idx",
        "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    )
    TEST_IMAGES = FileDownloader(
        "test_images.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
    )
    TEST_LABELS = FileDownloader(
        "test_labels.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    )

    def config(self) -> ImageClassification:
        return ImageClassification.C(
            train=LabelledImages(
                images=IDX(path=self.TRAIN_IMAGES.path),
                labels=IDX(path=self.TRAIN_LABELS.path),
            ),
            test=LabelledImages(
                images=IDX(path=self.TEST_IMAGES.path),
                labels=IDX(path=self.TEST_LABELS.path),
            ),
        )

Its syntax is described in the documentation.
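
The idea of ordering resources through a dependency graph can be sketched with the standard library. This is only an illustration of the general technique (a topological sort), not datamaestro's actual mechanism; the resource names and the processing_order helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical resources: each name maps to the resources it depends on.
dependencies = {
    "archive": set(),
    "extracted": {"archive"},
    "index": {"extracted"},
    "labels": set(),
}


def processing_order(graph):
    """Return an order in which every resource comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = processing_order(dependencies)
print(order)
```

Several valid orders exist here, since labels is independent of the other resources; the only guarantee is that archive precedes extracted, which precedes index.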

Release history

- 1.15.1: Jul 1, 2026 (this version)
- 1.15.0: Jun 18, 2026
- 1.14.0: May 21, 2026
- 1.13.0: May 21, 2026
- 1.12.0: May 21, 2026
- 1.11.3: May 4, 2026
- 1.11.2: Apr 23, 2026
- 1.11.0: Apr 23, 2026
- 1.10.4: Apr 23, 2026
- 1.10.3: Apr 23, 2026
- 1.10.2: Apr 23, 2026
- 1.10.1: Apr 22, 2026
- 1.10.0: Apr 22, 2026
- 1.9.10: Apr 10, 2026
- 1.9.9: Apr 10, 2026
- 1.9.8: Apr 10, 2026
- 1.9.7: Apr 10, 2026
- 1.9.6: Apr 3, 2026
- 1.9.5: Apr 3, 2026
- 1.9.4: Apr 3, 2026
- 1.9.3: Mar 14, 2026
- 1.9.2: Mar 13, 2026
- 1.9.1: Mar 13, 2026
- 1.9.0: Mar 12, 2026
- 1.8.3: Mar 9, 2026
- 1.8.2: Mar 9, 2026
- 1.8.1: Mar 9, 2026
- 1.8.0: Feb 3, 2026
- 1.7.4: Jan 29, 2026
- 1.7.3: Jan 29, 2026
- 1.7.2: Jan 29, 2026
- 1.7.1: Jan 29, 2026
- 1.7.0: Jan 29, 2026
- 1.6.2: Jan 1, 2026
- 1.6.1: Dec 24, 2025
- 1.6.0: Dec 21, 2025
- 1.5.2: Jul 10, 2025
- 1.5.1: Jul 7, 2025
- 1.5.0: Jul 6, 2025
- 1.4.5: Jul 6, 2025
- 1.4.4: Jul 6, 2025
- 1.4.3: Jun 11, 2025
- 1.4.2: May 12, 2025
- 1.4.1: Apr 3, 2025
- 1.4.0: Mar 30, 2025
- 1.3.2: Mar 27, 2025
- 1.3.1: Mar 27, 2025
- 1.3.0: Mar 27, 2025
- 1.2.1: May 31, 2024
- 1.2.0: May 31, 2024
- 1.1.0: Mar 6, 2024
- 1.0.6: Mar 6, 2024
- 1.0.5: Mar 4, 2024
- 1.0.4: Mar 4, 2024
- 1.0.3: Mar 1, 2024
- 1.0.2: Feb 29, 2024
- 1.0.1: Feb 28, 2024
- 1.0.0: Feb 26, 2024
- 0.8.16: Apr 5, 2023
- 0.8.15: Feb 3, 2023
- 0.8.14: Jan 26, 2023
- 0.8.13: Jan 20, 2023
- 0.8.12: Jan 17, 2023
- 0.8.11: Jan 16, 2023
- 0.8.10: Jan 16, 2023
- 0.8.9: Oct 18, 2022
- 0.8.8: Oct 14, 2022
- 0.8.7: May 17, 2022
- 0.8.6: May 12, 2022
- 0.8.5: Feb 10, 2022
- 0.8.4: Feb 10, 2022
- 0.8.3: Nov 19, 2021
- 0.8.1: Jul 20, 2021
- 0.8.0: Jul 19, 2021
- 0.7.4: May 23, 2021
- 0.7.3: Mar 18, 2021
- 0.7.2: Jan 29, 2021
- 0.7.1: Jan 28, 2021
- 0.7.0: Jan 27, 2021
- 0.6.24: Dec 15, 2020
- 0.6.23: Oct 18, 2020
- 0.6.22: Oct 9, 2020
- 0.6.21: Sep 24, 2020
- 0.6.20: Sep 18, 2020
- 0.6.19: Sep 11, 2020
- 0.6.18: Sep 11, 2020
- 0.6.17: Jul 9, 2020
- 0.6.16: May 22, 2020
- 0.6.15: May 22, 2020
- 0.6.13: Feb 25, 2020
- 0.6.12: Feb 24, 2020
- 0.6.11: Feb 18, 2020
- 0.6.10: Jan 16, 2020
- 0.6.9: Jan 13, 2020
- 0.6.8: Jan 13, 2020
- 0.6.6: Jan 7, 2020
- 0.6.5: Jan 7, 2020
- 0.6.4: Jan 7, 2020
- 0.6.3: Dec 20, 2019
- 0.6.2: Dec 20, 2019
- 0.6.1: Dec 20, 2019
- 0.6.0: Dec 20, 2019
- 0.5.8: Dec 19, 2019
- 0.5.7: Dec 4, 2019
- 0.5.2: Dec 4, 2019
- 0.5.0: Nov 30, 2019
- 0.2.8: Nov 25, 2019
- 0.2.5: Sep 23, 2019
- 0.2.4: Sep 17, 2019
- 0.2.3: Sep 13, 2019
- 0.2.2: Sep 13, 2019
- 0.2.1: Sep 13, 2019
- 0.2: Sep 10, 2019
- 0.1: Nov 14, 2018

Download files

Download the file for your platform.

Source Distribution

datamaestro-1.15.1.tar.gz (299.8 kB)

Uploaded Jul 1, 2026 Source

Built Distribution

datamaestro-1.15.1-py3-none-any.whl (141.3 kB)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file datamaestro-1.15.1.tar.gz.

File metadata

Download URL: datamaestro-1.15.1.tar.gz
Upload date: Jul 1, 2026
Size: 299.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datamaestro-1.15.1.tar.gz
Algorithm	Hash digest
SHA256	`d7da8099019a162bdc5be4f9ee7d86379b1f9fbe86cc802eb82478fb013c1656`
MD5	`ace8cecdcaeff361d032c214b6031acb`
BLAKE2b-256	`faad24c5e96bf1b9bbe3fd91a7547eaf5e7c74a9cb2f8bf8ea777a26f633f6d5`
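
A downloaded file can be checked against the SHA256 digest listed above. The sketch below uses only the standard library hashlib module; the helper names sha256_of and matches are chosen for illustration, and the file path passed in is whatever local copy you want to verify.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be matches("datamaestro-1.15.1.tar.gz", "d7da8099019a162bdc5be4f9ee7d86379b1f9fbe86cc802eb82478fb013c1656"), assuming the archive sits in the current directory.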

Provenance

The following attestation bundles were made for datamaestro-1.15.1.tar.gz:

Publisher: python-publish.yml on experimaestro/datamaestro

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datamaestro-1.15.1.tar.gz
- Subject digest: d7da8099019a162bdc5be4f9ee7d86379b1f9fbe86cc802eb82478fb013c1656
- Sigstore transparency entry: 2035837200
- Sigstore integration time: Jul 1, 2026
Source repository:
- Permalink: experimaestro/datamaestro@9c37e59711c667d1a0c7945296e28d9b8e16f888
- Branch / Tag: refs/tags/v1.15.1
- Owner: https://github.com/experimaestro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@9c37e59711c667d1a0c7945296e28d9b8e16f888
- Trigger Event: release

File details

Details for the file datamaestro-1.15.1-py3-none-any.whl.

File metadata

Download URL: datamaestro-1.15.1-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 141.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datamaestro-1.15.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ce1424746dc2b6ab8e1442aff61e7293ec528aeac1f79854b65d3bb7b9b6dd44`
MD5	`7d68cd9420d711046ea6ce3313d07eb3`
BLAKE2b-256	`cddacc18a811fe9b45e59a7cdb6f58659270bc6ff51ca618f3539a60ceb855b9`

Provenance

The following attestation bundles were made for datamaestro-1.15.1-py3-none-any.whl:

Publisher: python-publish.yml on experimaestro/datamaestro

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datamaestro-1.15.1-py3-none-any.whl
- Subject digest: ce1424746dc2b6ab8e1442aff61e7293ec528aeac1f79854b65d3bb7b9b6dd44
- Sigstore transparency entry: 2035837306
- Sigstore integration time: Jul 1, 2026
Source repository:
- Permalink: experimaestro/datamaestro@9c37e59711c667d1a0c7945296e28d9b8e16f888
- Branch / Tag: refs/tags/v1.15.1
- Owner: https://github.com/experimaestro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@9c37e59711c667d1a0c7945296e28d9b8e16f888
- Trigger Event: release

datamaestro 1.15.1
