Public datasets loaders
Public datasets plugin
This project implements the public datasets (CIFAR10 / SVHN / MNIST) plugin for the DEEL dataset manager.
A DEEL dataset plugin is an extension of the Dataset class defined in the DEEL dataset manager project.
It gives access to specific dataset files through the load method and the modes the plugin defines.
The public datasets (CIFAR10 / SVHN / MNIST) plugin uses the default mode, path, to load the following files over HTTP:
- MNIST: train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz
- CIFAR10: cifar-10-python.tar.gz
- SVHN: housenumbers/train.tar.gz, housenumbers/test.tar.gz, housenumbers/extra.tar.gz
Installation
The latest release can be installed from PyPI. All required Python packages are installed as dependencies.
pip install public-datasets
Otherwise, the SSH or HTTPS version should work, but you will have to enter your credentials manually:
# SSH version (with proper SSH key setup):
pip install git+ssh://git@github.com/deel-ai/public_datasets.git
# HTTPS version:
pip install git+https://github.com/deel-ai/public_datasets.git
Note:
- The CIFAR10 dataset loading name is cifra10.
- The SVHN dataset loading name is svhn.
- The MNIST dataset loading name is mnist.
Examples of usage
Basic usage
To load one of the public datasets (CIFAR10 / SVHN / MNIST), you can simply do:
import deel.datasets
# Load the default mode of mnist dataset:
mnist_data_path = deel.datasets.load("mnist")
# Load the default mode of svhn dataset:
svhn_data_path = deel.datasets.load("svhn")
# Load the default mode of cifra10 dataset:
cifra10_data_path = deel.datasets.load("cifra10")
The deel.datasets.load function is the basic entry point for accessing the datasets.
By passing with_info=True, extra information can be retrieved as a Python dictionary. This information is not standardized, so each dataset may provide different entries.
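As an illustration of with_info=True, the sketch below wraps the call in a small helper. It assumes that load returns a (data, info) pair when with_info=True is passed, which is not stated above. load_with_info and describe are hypothetical helper names, and the loader is passed in as an argument so that the sketch does not depend on an installed plugin.

```python
from typing import Any, Callable, Dict, Tuple


def load_with_info(load: Callable[..., Any], name: str) -> Tuple[Any, Dict[str, Any]]:
    """Load `name` in its default mode and return (data, info).

    Assumption: `load(name, with_info=True)` returns a 2-tuple whose second
    element is the information dictionary.
    """
    data, info = load(name, with_info=True)
    # Copy the mapping so callers can modify it without touching the original.
    return data, dict(info)


def describe(info: Dict[str, Any]) -> str:
    """Format whatever keys the dataset provides, one per line, sorted by key."""
    return "\n".join(f"{key}: {info[key]}" for key in sorted(info))
```

With the plugin installed, this could be called as load_with_info(deel.datasets.load, "mnist"). Because the information is not standardized, describe prints whatever keys are present rather than expecting specific ones.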
The mode argument can be used to load different versions of the dataset. By default, only the path mode is available; it returns the path to the local folder containing the dataset.
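Since the path mode returns a local folder, a common next step is to inspect what that folder contains. The sketch below is one way to do so. list_dataset_files is a hypothetical helper, and because the layout of the folder is not described here, it makes no assumption about file names.

```python
from pathlib import Path
from typing import List, Union


def list_dataset_files(dataset_dir: Union[str, Path]) -> List[str]:
    """Return the files under `dataset_dir` as sorted POSIX-style relative paths."""
    root = Path(dataset_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Example (requires the plugin and a downloaded dataset):
# import deel.datasets
# mnist_dir = deel.datasets.load("mnist", mode="path")
# print(list_dataset_files(mnist_dir))
```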
Command line utilities
The deel-datasets package comes with some command line utilities that can be accessed using:
python -m deel.datasets ARGS...
The --help option can be used to view the full capabilities of the command line program.
By default, the program uses the configuration at $HOME/.deel/config.yml, but the -c argument can be used to specify a custom configuration file.
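When the command line program has to be driven from a script, the invocation can be assembled programmatically. The sketch below only builds the argument list. It assumes that -c is accepted before the sub-command, which is not specified above, and build_cli_command is a hypothetical helper.

```python
import sys
from typing import List, Optional, Sequence


def build_cli_command(args: Sequence[str], config: Optional[str] = None) -> List[str]:
    """Build the argv for `python -m deel.datasets [-c CONFIG] ARGS...`."""
    command = [sys.executable, "-m", "deel.datasets"]
    if config is not None:
        # Assumption: the custom configuration file is given before ARGS.
        command += ["-c", config]
    command += list(args)
    return command


# To actually run it (requires the deel-datasets package):
# import subprocess
# subprocess.run(build_cli_command(["--help"]), check=True)
```

Using sys.executable keeps the call inside the same Python environment as the calling script, which matters when the package is installed in a virtual environment.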
The following commands are available (not exhaustive):
list: List the available datasets. If the configuration specifies a remote provider (e.g., WebDAV), this lists the datasets available remotely. To list the datasets already downloaded, use the --local option.
$ python -m deel.datasets list
Listing datasets at https://datasets.deel.ai:
dataset-a: 3.0.1 [latest], 3.0.0
dataset-b: 1.0 [latest]
dataset-c: 1.0 [latest]
$ python -m deel.datasets list --local
Listing datasets at /opt/datasets:
dataset-a: 3.0.1 [latest], 3.0.0
dataset-c: 1.0 [latest]
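The listing above has a regular shape (a header line, then one NAME: VERSION [latest], VERSION line per dataset), so it can be turned into a dictionary when a script needs it. The parser below is a sketch based only on the sample output shown here; the real format may differ, and parse_listing is a hypothetical helper.

```python
from typing import Dict, List


def parse_listing(output: str) -> Dict[str, List[str]]:
    """Map each dataset name to its versions, in the order they are listed."""
    datasets: Dict[str, List[str]] = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, shell prompts and the "Listing datasets at ..." header.
        if not line or line.startswith("$") or line.startswith("Listing datasets at"):
            continue
        name, sep, versions = line.partition(":")
        if not sep:
            continue
        # Drop the "[latest]" marker and keep only the version strings.
        datasets[name.strip()] = [
            v.replace("[latest]", "").strip() for v in versions.split(",") if v.strip()
        ]
    return datasets


sample = """Listing datasets at /opt/datasets:
dataset-a: 3.0.1 [latest], 3.0.0
dataset-c: 1.0 [latest]"""
print(parse_listing(sample))
```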
download NAME[:VERSION]: Download the specified dataset. If the configuration does not specify a remote provider, this does nothing except output some information. The :VERSION part can be omitted, in which case :latest is implied. To force the re-download of a dataset, use the --force option.
MNIST example:
$ python -m deel.datasets download mnist
Fetching mnist...
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-j83d2nc3 because the default path (/home/<user>/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
train-images-idx3-ubyte.gz: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.45M/9.45M [00:00<00:00, 11.2Mbytes/s]
Extracting train-images-idx3-ubyte.gz: 44.9Mbytes [00:00, 248Mbytes/s]
train-labels-idx1-ubyte.gz: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.2k/28.2k [00:00<00:00, 13.4Mbytes/s]
Extracting train-labels-idx1-ubyte.gz: 58.6kbytes [00:00, 132Mbytes/s]
t10k-images-idx3-ubyte.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.57M/1.57M [00:00<00:00, 9.04Mbytes/s]
Extracting t10k-images-idx3-ubyte.gz: 7.48Mbytes [00:00, 246Mbytes/s]
t10k-labels-idx1-ubyte.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.44k/4.44k [00:00<00:00, 55.2Mbytes/s]
Extracting t10k-labels-idx1-ubyte.gz: 9.77kbytes [00:00, 59.0Mbytes/s]
convert train images: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60000/60000 [00:10<00:00, 5554.88it/s]
convert test images: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 5526.01it/s]
Dataset mnist loaded and stored at '/home/<user>/.deel/datasets/mnist/1.0.0'.
SVHN example:
$ python -m deel.datasets download svhn
Fetching svhn...
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-gl2vzgmi because the default path (/home/<user>/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
train_32x32.mat: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 174M/174M [00:47<00:00, 3.84Mbytes/s]
test_32x32.mat: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61.3M/61.3M [00:25<00:00, 2.50Mbytes/s]
extra_32x32.mat: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.24G/1.24G [05:32<00:00, 4.00Mbytes/s]
.....
Dataset svhn loaded and stored at '/home/<user>/.deel/datasets/svhn/1.0.0'.
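The NAME[:VERSION] convention used by download can be mirrored in a script that prepares several downloads. This is a minimal sketch of that convention, not the package's own parser; parse_dataset_spec is a hypothetical helper.

```python
from typing import Tuple


def parse_dataset_spec(spec: str) -> Tuple[str, str]:
    """Split "NAME[:VERSION]" into (name, version); version defaults to "latest"."""
    name, _, version = spec.partition(":")
    if not name:
        raise ValueError(f"missing dataset name in {spec!r}")
    return name, version or "latest"


for spec in ("mnist", "svhn:1.0.0"):
    print(parse_dataset_spec(spec))
```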