Public datasets loaders

Public datasets plugin

This project implements the public datasets (CIFAR10 / SVHN / MNIST) plugin for the DEEL dataset manager.

A DEEL dataset plugin is an extension of the Dataset class defined in the DEEL dataset manager project. It gives access to specific dataset files through the load method and any other modes it defines.

The public datasets (CIFAR10 / SVHN / MNIST) plugin uses the default mode, path, to load the following files:

  • MNIST:

    • train-images-idx3-ubyte.gz,
    • train-labels-idx1-ubyte.gz,
    • t10k-images-idx3-ubyte.gz,
    • t10k-labels-idx1-ubyte.gz,
  • CIFAR10:

    • cifar-10-python.tar.gz,
  • SVHN:

    • housenumbers/train.tar.gz,
    • housenumbers/test.tar.gz,
    • housenumbers/extra.tar.gz,

The files are fetched using the HTTP protocol.

Installation

The latest release can be installed from PyPI. All required Python packages are installed as dependencies.

pip install public-datasets

Otherwise, you can install directly from the Git repository over SSH or HTTPS, although you may have to enter your credentials manually:

# SSH version (with proper SSH key setup):
pip install git+ssh://git@github.com/deel-ai/public_datasets.git

# HTTPS version:
pip install git+https://github.com/deel-ai/public_datasets.git

Note:

  • CIFAR10 dataset loading name is cifra10,
  • SVHN dataset loading name is svhn,
  • MNIST dataset loading name is mnist.

Examples of usage

Basic usage

To load one of the public datasets (CIFAR10 / SVHN / MNIST), you can simply do:

import deel.datasets

# Load the default mode of mnist dataset:
mnist_data_path = deel.datasets.load("mnist")

# Load the default mode of svhn dataset:
svhn_data_path = deel.datasets.load("svhn")

# Load the default mode of cifra10 dataset:
cifra10_data_path = deel.datasets.load("cifra10")

The deel.datasets.load function is the basic entry point for accessing the datasets. Passing with_info=True retrieves extra information as a Python dictionary. This information is not standardized, so each dataset may provide different entries. The mode argument can be used to load a different "version" of the dataset. By default, only the path mode is available; it returns the path to the local folder containing the dataset.
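Because the path mode returns a folder rather than in-memory arrays, a natural first step is to inspect what was downloaded. The sketch below is an illustration only: summarize_dataset_dir is a hypothetical helper, not part of deel.datasets, and it assumes nothing about the file layout the plugin produces. It simply walks whatever folder it is given.

```python
from pathlib import Path
from typing import Dict, Union


def summarize_dataset_dir(root: Union[str, Path]) -> Dict[str, int]:
    """Map each file under `root` (relative POSIX path) to its size in bytes.

    Hypothetical helper for illustration; it is not part of deel.datasets.
    """
    root = Path(root)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    # Walk the tree recursively and keep regular files only.
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

With the plugin installed, this could be called as summarize_dataset_dir(deel.datasets.load("mnist")), assuming the value returned in path mode is accepted by pathlib.Path.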

Command line utilities

The deel-datasets package comes with some command line utilities that can be accessed using:

python -m deel.datasets ARGS...

The --help option can be used to view the full capabilities of the command line program. By default, the program uses the configuration at $HOME/.deel/config.yml, but the -c argument can be used to specify a custom configuration file.

The following commands are available (not exhaustive):

  • list: List the available datasets. If the configuration specifies a remote provider (e.g., WebDAV), this lists the datasets available remotely. To list the datasets already downloaded, use the --local option.
$ python -m deel.datasets list
Listing datasets at https://datasets.deel.ai:
  dataset-a: 3.0.1 [latest], 3.0.0
  dataset-b: 1.0 [latest]
  dataset-c: 1.0 [latest]
$ python -m deel.datasets list --local
Listing datasets at /opt/datasets:
  dataset-a: 3.0.1 [latest], 3.0.0
  dataset-c: 1.0 [latest]
  • download NAME[:VERSION]: Download the specified dataset. If the configuration does not specify a remote provider, this does nothing except output some information. The :VERSION can be omitted, in which case :latest is implied. To force the re-download of a dataset, use the --force option.

MNIST example

$ python -m deel.datasets download mnist
Fetching mnist...
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-j83d2nc3 because the default path (/home/<user>/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
train-images-idx3-ubyte.gz: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.45M/9.45M [00:00<00:00, 11.2Mbytes/s]
Extracting train-images-idx3-ubyte.gz: 44.9Mbytes [00:00, 248Mbytes/s]
train-labels-idx1-ubyte.gz: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.2k/28.2k [00:00<00:00, 13.4Mbytes/s]
Extracting train-labels-idx1-ubyte.gz: 58.6kbytes [00:00, 132Mbytes/s]
t10k-images-idx3-ubyte.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.57M/1.57M [00:00<00:00, 9.04Mbytes/s]
Extracting t10k-images-idx3-ubyte.gz: 7.48Mbytes [00:00, 246Mbytes/s]
t10k-labels-idx1-ubyte.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.44k/4.44k [00:00<00:00, 55.2Mbytes/s]
Extracting t10k-labels-idx1-ubyte.gz: 9.77kbytes [00:00, 59.0Mbytes/s]
convert train images: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60000/60000 [00:10<00:00, 5554.88it/s]
convert test images: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 5526.01it/s]
Dataset mnist loaded and stored at '/home/<user>/.deel/datasets/mnist/1.0.0'.

SVHN example

$ python -m deel.datasets download svhn
Fetching svhn...
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-gl2vzgmi because the default path (/home/<user>/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
train_32x32.mat: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 174M/174M [00:47<00:00, 3.84Mbytes/s]
test_32x32.mat: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61.3M/61.3M [00:25<00:00, 2.50Mbytes/s]
extra_32x32.mat: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.24G/1.24G [05:32<00:00, 4.00Mbytes/s]
.....
Dataset svhn loaded and stored at '/home/<user>/.deel/datasets/svhn/1.0.0'.
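The NAME[:VERSION] convention and the download options described above can also be driven from a script. The following is a minimal sketch, assuming the CLI accepts --force after the dataset argument (the description above names the option but not its position). parse_dataset_spec and build_download_command are hypothetical helpers, not part of the deel-datasets package.

```python
import sys
from typing import List, Tuple


def parse_dataset_spec(spec: str) -> Tuple[str, str]:
    """Split "NAME[:VERSION]" into (name, version).

    When the version is omitted, "latest" is implied, as described above.
    """
    name, sep, version = spec.partition(":")
    if not name:
        raise ValueError(f"missing dataset name in {spec!r}")
    if sep and not version:
        raise ValueError(f"empty version in {spec!r}")
    return name, version or "latest"


def build_download_command(spec: str, force: bool = False) -> List[str]:
    """Build the argument list for `python -m deel.datasets download ...`."""
    name, version = parse_dataset_spec(spec)
    cmd = [sys.executable, "-m", "deel.datasets", "download", f"{name}:{version}"]
    if force:
        # Assumption: --force is accepted after the dataset argument.
        cmd.append("--force")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) in an environment where the plugin is installed.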
