Public datasets loaders
Public datasets plugin
This project implements the public datasets (CIFAR10 / SVHN / MNIST) plugin for the DEEL dataset manager.
A DEEL dataset plugin is an extension of the Dataset class defined in the DEEL dataset manager project.
It provides access to specific dataset files through the load method and its defined modes.
The public datasets (CIFAR10 / SVHN / MNIST) plugin uses the default mode, path, to load the following files over the HTTP protocol:
- MNIST: train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz
- CIFAR10: cifar-10-python.tar.gz
- SVHN: housenumbers/train.tar.gz, housenumbers/test.tar.gz, housenumbers/extra.tar.gz
Installation
The latest release can be installed from PyPI. All required Python packages are installed as dependencies.
pip install public-datasets
Otherwise, the SSH or HTTPS version should work, but you will have to enter your credentials manually:
# SSH version (with proper SSH key setup):
pip install git+ssh://git@github.com/deel-ai/public_datasets.git
# HTTPS version:
pip install git+https://github.com/deel-ai/public_datasets.git
Note:
- CIFAR10 dataset loading name is cifra10
- SVHN dataset loading name is svhn
- MNIST dataset loading name is mnist
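Because the CIFAR10 loading name is spelled cifra10, it can help to look names up rather than type them by hand. The following is a minimal sketch: the loading names come from the note above, while LOADING_NAMES and loading_name are hypothetical helpers that are not part of the plugin.

```python
# Loading names as stated in the note above; "cifra10" is the spelling
# this README documents, so it is kept verbatim.
LOADING_NAMES = {
    "CIFAR10": "cifra10",
    "SVHN": "svhn",
    "MNIST": "mnist",
}


def loading_name(dataset: str) -> str:
    """Return the documented loading name for a dataset (hypothetical helper)."""
    # Normalise spellings such as "cifar-10" or " Mnist " before the lookup.
    key = dataset.strip().upper().replace("-", "")
    try:
        return LOADING_NAMES[key]
    except KeyError:
        known = ", ".join(sorted(LOADING_NAMES))
        raise ValueError(
            f"unknown dataset {dataset!r}; expected one of: {known}"
        ) from None
```

The returned string is what would be passed as the first argument of deel.datasets.load.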
Examples of usage
Basic usage
To load one of the public datasets (CIFAR10 / SVHN / MNIST), you can simply do:
import deel.datasets
# Load the default mode of mnist dataset:
mnist_data_path = deel.datasets.load("mnist")
# Load the default mode of svhn dataset:
svhn_data_path = deel.datasets.load("svhn")
# Load the default mode of cifra10 dataset:
cifra10_data_path = deel.datasets.load("cifra10")
The deel.datasets.load function is the basic entry point for accessing the datasets.
By passing with_info=True, extra information can be retrieved as a Python
dictionary. This information is not standardized, so each dataset may provide
different entries.
The mode argument can be used to load different "versions" of the dataset. By default,
only the path mode is available; it returns the path to the local folder
containing the dataset.
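The with_info and mode arguments described above can be combined in one call. The sketch below assumes that with_info=True makes the loader return a (data, info) pair; that return shape is an assumption, not something this README states. To keep the example runnable without the package installed, the loader is passed in as a parameter and a made-up fake_loader stands in for deel.datasets.load.

```python
from typing import Any, Callable, Dict, List, Tuple


def describe_dataset(name: str, loader: Callable[..., Any]) -> Tuple[Any, List[str]]:
    """Load `name` in path mode and return (path, sorted info keys).

    `loader` is expected to behave like deel.datasets.load.
    ASSUMPTION: with_info=True makes it return a (data, info) pair.
    """
    path, info = loader(name, mode="path", with_info=True)
    return path, sorted(info)


def fake_loader(name: str, mode: str = "path", with_info: bool = False):
    """Stand-in for deel.datasets.load so the sketch runs without the package."""
    path = f"/tmp/datasets/{name}"
    info: Dict[str, Any] = {"name": name, "mode": mode}  # made-up keys
    return (path, info) if with_info else path


if __name__ == "__main__":
    print(describe_dataset("mnist", fake_loader))
```

With the package installed, describe_dataset("mnist", deel.datasets.load) would exercise the real loader, provided the assumption about the return value holds.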
Command line utilities
The deel-datasets package comes with some command line utilities that can be accessed using:
python -m deel.datasets ARGS...
The --help option can be used to view the full capabilities of the command line program.
By default, the program uses the configuration at $HOME/.deel/config.yml, but the -c
argument can be used to specify a custom configuration file.
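When the command line program is driven from a script, the argument list can be assembled programmatically. This is a sketch only: default_config_path and build_command are hypothetical helpers, and placing -c before the sub-command is an assumption that should be checked against --help.

```python
import sys
from pathlib import Path
from typing import List, Optional


def default_config_path() -> Path:
    """Default configuration location named in the text above."""
    return Path.home() / ".deel" / "config.yml"


def build_command(*args: str, config: Optional[Path] = None) -> List[str]:
    """Build the argument list for `python -m deel.datasets ARGS...`.

    ASSUMPTION: the -c option is placed before the sub-command.
    """
    cmd = [sys.executable, "-m", "deel.datasets"]
    if config is not None:
        cmd += ["-c", str(config)]
    cmd += list(args)
    return cmd


if __name__ == "__main__":
    print(build_command("list", "--local"))
    print(build_command("list", config=default_config_path()))
```

The resulting list can be handed to subprocess.run once the package is installed.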
The following commands are available (not exhaustive):
list: List the available datasets. If the configuration specifies a remote provider (e.g., WebDAV), this lists the datasets available remotely. To list the datasets already downloaded, use the --local option.
$ python -m deel.datasets list
Listing datasets at https://datasets.deel.ai:
dataset-a: 3.0.1 [latest], 3.0.0
dataset-b: 1.0 [latest]
dataset-c: 1.0 [latest]
$ python -m deel.datasets list --local
Listing datasets at /opt/datasets:
dataset-a: 3.0.1 [latest], 3.0.0
dataset-c: 1.0 [latest]
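The listing shown above is easy to turn into a dictionary when a script needs to know which versions are present. The sketch below infers the line format purely from that sample output, so the real program may print something it does not handle; parse_listing is a hypothetical helper.

```python
from typing import Dict, List

# Copied from the sample output above.
SAMPLE = """\
Listing datasets at /opt/datasets:
dataset-a: 3.0.1 [latest], 3.0.0
dataset-c: 1.0 [latest]
"""


def parse_listing(text: str) -> Dict[str, List[str]]:
    """Map each dataset name to its versions, in the order printed."""
    datasets: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, echoed commands and the "Listing datasets at ...:" header.
        if not line or line.startswith("$") or line.endswith(":"):
            continue
        name, _, versions = line.partition(":")
        datasets[name.strip()] = [
            v.replace("[latest]", "").strip() for v in versions.split(",")
        ]
    return datasets


if __name__ == "__main__":
    print(parse_listing(SAMPLE))
```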
download NAME[:VERSION]: Download the specified dataset. If the configuration does not specify a remote provider, this does nothing except print some information. The :VERSION can be omitted, in which case :latest is implied. To force the re-download of a dataset, the --force option can be used.
MNIST case:
$ python -m deel.datasets download mnist
Fetching mnist...
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-j83d2nc3 because the default path (/home/<user>/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
train-images-idx3-ubyte.gz: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.45M/9.45M [00:00<00:00, 11.2Mbytes/s]
Extracting train-images-idx3-ubyte.gz: 44.9Mbytes [00:00, 248Mbytes/s]
train-labels-idx1-ubyte.gz: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.2k/28.2k [00:00<00:00, 13.4Mbytes/s]
Extracting train-labels-idx1-ubyte.gz: 58.6kbytes [00:00, 132Mbytes/s]
t10k-images-idx3-ubyte.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.57M/1.57M [00:00<00:00, 9.04Mbytes/s]
Extracting t10k-images-idx3-ubyte.gz: 7.48Mbytes [00:00, 246Mbytes/s]
t10k-labels-idx1-ubyte.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.44k/4.44k [00:00<00:00, 55.2Mbytes/s]
Extracting t10k-labels-idx1-ubyte.gz: 9.77kbytes [00:00, 59.0Mbytes/s]
convert train images: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60000/60000 [00:10<00:00, 5554.88it/s]
convert test images: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 5526.01it/s]
Dataset mnist loaded and stored at '/home/<user>/.deel/datasets/mnist/1.0.0'.
SVHN case:
$ python -m deel.datasets download svhn
Fetching svhn...
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-gl2vzgmi because the default path (/home/<user>/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
train_32x32.mat: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 174M/174M [00:47<00:00, 3.84Mbytes/s]
test_32x32.mat: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61.3M/61.3M [00:25<00:00, 2.50Mbytes/s]
extra_32x32.mat: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.24G/1.24G [05:32<00:00, 4.00Mbytes/s]
.....
Dataset svhn loaded and stored at '/home/<user>/.deel/datasets/svhn/1.0.0'.
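The NAME[:VERSION] argument of the download command follows a simple rule: when the version is omitted, latest is implied. A minimal sketch of that rule is shown here; parse_spec is a hypothetical helper and does not reflect how the program itself parses the argument.

```python
from typing import Tuple


def parse_spec(spec: str) -> Tuple[str, str]:
    """Split NAME[:VERSION]; a missing version means "latest"."""
    name, sep, version = spec.partition(":")
    if not name:
        raise ValueError(f"missing dataset name in {spec!r}")
    return name, (version if sep and version else "latest")


if __name__ == "__main__":
    for spec in ("mnist", "svhn:1.0.0"):
        print(spec, "->", parse_spec(spec))
```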