Introduction
Full documentation can be found at http://datamaestro.rtfd.io
This project groups utilities for dealing with the numerous and heterogeneous datasets available on the Web. It aims to be:
- a reference for available resources, listing datasets
- a tool to automatically download and process resources (when freely available)
- an integration point with the experimaestro experiment manager
- (planned) a tool to copy data from one computer to another
Each dataset is uniquely identified by a qualified name such as com.lecun.mnist, which is usually the reversed domain name of the website associated with the dataset.
The main repository only deals with very generic processing (downloading, basic pre-processing and data types). Plugins can then be registered to provide access to domain-specific datasets.
List of repositories
- NLP and information access: Natural Language Processing (e.g. Sentiment101) and information access (e.g. TREC) datasets
- Images: image-related datasets (e.g. MNIST)
- Machine learning: generic machine learning datasets
Command line interface (CLI)
The command-line interface allows interacting with the datasets. The available commands are listed below; help can be obtained by typing datamaestro COMMAND --help:
- search: search for datasets by name, tags and/or tasks
- download: download files (if accessible on the Internet), or ask for a download path otherwise
- prepare: download the dataset files and output a JSON document containing paths and other dataset information
- repositories: list the available repositories
- orphans: list data directories that do not correspond to any registered dataset (and allow cleaning them up)
- create-dataset: create a dataset definition
Example (CLI)
Retrieve and download
The command-line interface can automatically download the different resources. Datamaestro extensions can provide additional processing tools.
$ datamaestro search tag:image
[image] com.lecun.mnist
$ datamaestro prepare com.lecun.mnist
INFO:root:Materializing 4 resources
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz into .../datamaestro/store/com/lecun/train_images.idx
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz into .../datamaestro/store/com/lecun/test_images.idx
INFO:root:Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz into .../datamaestro/store/com/lecun/test_labels.idx
The previous command also outputs a JSON document on standard output:
{
  "train": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/train_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/train_labels.idx"
    }
  },
  "test": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/test_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/test_labels.idx"
    }
  },
  "id": "com.lecun.mnist"
}
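Scripts that shell out to prepare can parse this JSON directly; a minimal sketch using the sample output above (the truncated ".../" path prefixes are kept as printed):

```python
import json

# Sample output of `datamaestro prepare com.lecun.mnist`, copied from above;
# in practice this string would come from the command's standard output.
prepared = json.loads("""
{
  "train": {
    "images": {"path": ".../data/image/com/lecun/mnist/train_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/train_labels.idx"}
  },
  "test": {
    "images": {"path": ".../data/image/com/lecun/mnist/test_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/test_labels.idx"}
  },
  "id": "com.lecun.mnist"
}
""")

# Navigate the split/resource/path structure
train_images = prepared["train"]["images"]["path"]
print(prepared["id"], "->", train_images)
```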
For those using Python, this is even better, since the IDX format is directly supported:
In [1]: from datamaestro import prepare_dataset
In [2]: ds = prepare_dataset("com.lecun.mnist")
In [3]: ds.train.images.data().dtype, ds.train.images.data().shape
Out[3]: (dtype('uint8'), (60000, 28, 28))
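Since the data is exposed as plain NumPy arrays, the usual preprocessing steps apply. A sketch of a typical follow-up (not datamaestro-specific; the array here is simulated random data with the dtype and image shape shown above, at a smaller sample size):

```python
import numpy as np

# Stand-in for ds.train.images.data(): uint8 images of shape (N, 28, 28)
images = np.random.default_rng(0).integers(0, 256, size=(100, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into a 784-dimensional vector and scale to [0, 1],
# the usual input format for a simple classifier
x = images.reshape(len(images), -1).astype(np.float32) / 255.0
print(x.shape, x.dtype)
```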
Python definition of datasets
Datasets are defined as Python classes with resource attributes that describe how to download and process data. The framework automatically builds a dependency graph and handles downloads with two-path safety and state tracking.
from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import FileDownloader
from datamaestro.definitions import Dataset, dataset


@dataset(url="http://yann.lecun.com/exdb/mnist/")
class MNIST(Dataset):
    """The MNIST database of handwritten digits."""

    TRAIN_IMAGES = FileDownloader(
        "train_images.idx",
        "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
    )
    TRAIN_LABELS = FileDownloader(
        "train_labels.idx",
        "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    )
    TEST_IMAGES = FileDownloader(
        "test_images.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
    )
    TEST_LABELS = FileDownloader(
        "test_labels.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    )

    def config(self) -> ImageClassification:
        return ImageClassification.C(
            train=LabelledImages(
                images=IDX(path=self.TRAIN_IMAGES.path),
                labels=IDX(path=self.TRAIN_LABELS.path),
            ),
            test=LabelledImages(
                images=IDX(path=self.TEST_IMAGES.path),
                labels=IDX(path=self.TEST_LABELS.path),
            ),
        )
This syntax is described in full in the documentation.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file datamaestro-1.8.0.tar.gz.
File metadata
- Download URL: datamaestro-1.8.0.tar.gz
- Upload date:
- Size: 203.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1b4e71fdaead644e3fb03b97543e50105d30e4074975ccd214741b439f0196c |
| MD5 | 35071633c6e585e3caa74deb6b99a9ad |
| BLAKE2b-256 | aed11a53ea31ebcbfd8a8aa7d14e779ebb9de3ce63b70a0450e75cea5aebb17d |
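The published digests can be checked against a local download; a minimal sketch using Python's standard hashlib (the helper name is ours):

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so arbitrarily large files fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Compare against the SHA256 digest published above, e.g.:
# sha256_of("datamaestro-1.8.0.tar.gz") == "c1b4e71fdaead644e3fb03b97543e50105d30e4074975ccd214741b439f0196c"
```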
Provenance
The following attestation bundles were made for datamaestro-1.8.0.tar.gz:
Publisher: python-publish.yml on experimaestro/datamaestro
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datamaestro-1.8.0.tar.gz
- Subject digest: c1b4e71fdaead644e3fb03b97543e50105d30e4074975ccd214741b439f0196c
- Sigstore transparency entry: 908194564
- Sigstore integration time:
- Permalink: experimaestro/datamaestro@84f70dd799e9a0081dbb47f6a8c09cb864f45599
- Branch / Tag: refs/tags/v1.8.0
- Owner: https://github.com/experimaestro
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@84f70dd799e9a0081dbb47f6a8c09cb864f45599
- Trigger Event: release
File details
Details for the file datamaestro-1.8.0-py3-none-any.whl.
File metadata
- Download URL: datamaestro-1.8.0-py3-none-any.whl
- Upload date:
- Size: 85.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5408bd4f7d392609a6b772ba926e9255335f3d1e91a46413c2003f365d62ba33 |
| MD5 | 580643279deec76b6406c797717564fc |
| BLAKE2b-256 | 5b6c1dfa691e5f4b904813e7685a8f73b1d83153825399e28c43f2cb82f08f9d |
Provenance
The following attestation bundles were made for datamaestro-1.8.0-py3-none-any.whl:
Publisher: python-publish.yml on experimaestro/datamaestro
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datamaestro-1.8.0-py3-none-any.whl
- Subject digest: 5408bd4f7d392609a6b772ba926e9255335f3d1e91a46413c2003f365d62ba33
- Sigstore transparency entry: 908194565
- Sigstore integration time:
- Permalink: experimaestro/datamaestro@84f70dd799e9a0081dbb47f6a8c09cb864f45599
- Branch / Tag: refs/tags/v1.8.0
- Owner: https://github.com/experimaestro
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@84f70dd799e9a0081dbb47f6a8c09cb864f45599
- Trigger Event: release