ACRV Datasets: dataset integration for Best of ACRV projects
*Please note this is only a beta release at this stage*
Note: support will be added soon for datasets that require end-users to accept a licensing agreement
The ACRV Datasets package is a light wrapper for generically managing datasets. The package supports any dataset, as long as it has a public URL. We emphasise that we do not own the datasets accessed through this package; we simply provide easy access and integration for projects like the Best of ACRV codebases.
Datasets are defined in a YAML file, with full support for grouping sub-datasets together. For example, 'coco' can be used to refer to 13 different COCO datasets with a single identifier. You can also easily add your own datasets simply by editing the same datasets YAML file. Once added, datasets can be downloaded and accessed from Python with simple function calls.
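As a minimal sketch of what this looks like in practice (using the get_datasets() call described in detail below), a single call with a group identifier fetches every sub-dataset in that group:

import acrv_datasets as ad

# 'coco' is a group identifier, so this gathers all of its sub-datasets
coco_location = ad.get_datasets(['coco'])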
Our code is free to use, and licensed under BSD-3. If you use any datasets in your work, you must appropriately reference the original dataset authors! Please see dataset references below.
Installing the ACRV Datasets package
We offer the following methods for installing the ACRV Datasets package:
- Through our Conda and Pip packages: a single command installs the package and its Python dependencies (the two are equivalent, as there are no system dependencies)
- Directly from source: allows easy editing and extension of our code, but you take care of the build and all dependencies yourself
Conda and Pip
The ACRV Datasets package has no system dependencies, so installation is the same for both Conda & Pip package management systems.
For Pip, simply install via:
u@pc:~$ pip install acrv-datasets
Installation via Conda is the same once you have Conda installed on your system and are inside a Conda environment. From there, simply run:
u@pc:~$ conda install acrv-datasets
From source
Installing from source is very similar to the pip method above, as the package contains only Python code. Simply clone the repository, enter the directory, and install via pip:
u@pc:~$ pip install -e .
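For reference, the full clone-and-install sequence might look like the following (the repository URL here is an assumption; substitute the actual location of the source):

u@pc:~$ git clone https://github.com/best-of-acrv/acrv_datasets
u@pc:~$ cd acrv_datasets
u@pc:~$ pip install -e .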
Note: the editable mode flag (-e) is optional, but allows you to immediately use any changes you make to the code in your local Python ecosystem.
Downloading & accessing datasets
This package exposes a simple Python interface that automatically handles downloading, extracting, and accessing datasets. All of this complexity is hidden behind a single user action: getting datasets. For example, to "get" the NYU dataset:
import acrv_datasets as ad
nyu_location = ad.get_datasets(['nyu'])
When calling get_datasets(), the dataset will be downloaded and extracted if it doesn't already exist. For example, the exact same call above works even if you don't yet have the 'nyu' dataset; it will simply block and report progress while it gathers the dataset.
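As a sketch of how the returned value might be used (we assume here, based on the example above, that get_datasets() returns the filesystem location of the requested data):

import os

import acrv_datasets as ad

# Fetch the dataset (downloaded and extracted on first use)
nyu_location = ad.get_datasets(['nyu'])

# List the extracted contents (assumes the returned value is a path)
print(os.listdir(nyu_location))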
Datasets are stored in a default directory, which can be configured via the following code:
import acrv_datasets as ad
ad.set_datasets_directory('/mnt/hdd/acrv_datasets')
From this point on, all dataset operations will be performed in the /mnt/hdd/acrv_datasets directory. If no location has been set, a default will be used, and it is printed in yellow before all operations. You can also explicitly override the dataset directory for single operations:
import acrv_datasets as ad
ad.get_datasets(['nyu'], '/mnt/hdd2/other_location')
You can see a live list of supported datasets, and access a dictionary containing each dataset's details, with the following code:
import acrv_datasets as ad
details = ad.supported_datasets()
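A minimal sketch of inspecting the result (the exact structure of the returned dictionary is an assumption; print details to confirm):

import acrv_datasets as ad

details = ad.supported_datasets()

# Each top-level key should be a dataset identifier
for identifier in sorted(details):
    print(identifier)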
The module can also be accessed directly from the command line using the python3 -m acrv_datasets ... syntax. Equivalent commands for the above Python calls are shown below:
u@pc:~$ python3 -m acrv_datasets --datasets nyu
u@pc:~$ python3 -m acrv_datasets --set-default-datasets-directory /mnt/hdd/acrv_datasets
u@pc:~$ python3 -m acrv_datasets --datasets nyu --datasets-directory /mnt/hdd/acrv_datasets
u@pc:~$ python3 -m acrv_datasets --supported-datasets
There is also a help flag which documents the supported syntax:
u@pc:~$ python3 -m acrv_datasets --help
Adding your own datasets
New datasets can be added by making additions to the 'datasets.yaml' file. All that is needed is a unique dataset identifier and a public URL.
A detailed description of the syntax for adding new datasets is provided at the top of the file:
Datasets are listed in named groups. The group name is the top level key, the dataset name is the second level key, and the public URL is the second level value. The group name & dataset name combine to form a unique dataset identifier.
For example, the following would specify a 2014 & 2021 version of my dataset called 'my_dataset' (with the unique identifiers 'my_dataset/2014' & 'my_dataset/2021' respectively):
my_dataset:
  2014: https://my_dataset.hosting/2014.tgz
  2021: https://my_dataset.hosting/2021.tgz
For brevity, the dataset name can be omitted if there is only one dataset in a group. For example, the following gives a dataset with the identifier 'my_simple_dataset':
my_simple_dataset: https://my_dataset.hosting/simply.tgz
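Once added, new entries behave like any built-in dataset. A minimal sketch, assuming the 'my_dataset/2021' identifier from the example above has been added to datasets.yaml:

import acrv_datasets as ad

# Download, extract, and locate the newly added dataset
location = ad.get_datasets(['my_dataset/2021'])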
Dataset references
We again emphasise that you are required to meet all of the licensing terms of the specific dataset if you wish to use the dataset in your own work (we merely provide simplified access).
Below is a list of all dataset identifiers currently available, grouped by their owner. Please follow the owner's citation instructions if using their datasets in your research:
- NYUv2: nyu
- Pascal VOC: voc
- SBD: sbd
- COCO: coco, coco/train2014, coco/val2014, coco/annotations_trainval2014, coco/test2015, coco/train2017, coco/val2017, coco/annotations_trainval2017, coco/captions, coco/vqa_questions_train, coco/vqa_questions_val, coco/vqa_questions_test, coco/vqa_annotations_train, coco/vqa_annotations_val
- GloVe: glove
- TODO???: trainval36