
Dataset Librarian is a tool that downloads the supported datasets and applies the preprocessing they need.

Dataset Librarian

Installation Process

python -m pip install dataset-librarian

For more information, see the dataset-librarian package on PyPI.

Datasets

| Dataset name | Description | Download | Preprocessing | Command |
|---|---|---|---|---|
| brca | Breast Cancer dataset that contains categorized contrast enhanced mammography data and radiologists’ notes. | supported | A prerequisite: Use a browser, download the Low Energy and Subtracted images, then provide the path to the directory that contains the downloaded images using --directory argument. | python -m dataset_librarian.dataset -n brca --download --preprocess -d <path to the dataset directory> --split_ratio 0.1 |
| tabformer | Credit card data for TabFormer | supported | not supported | python -m dataset_librarian.dataset -n tabformer --download |
| dureader-vis | DuReader-vis for document automation. Chinese Open-domain Document Visual Question Answering (Open-Domain DocVQA) dataset, containing about 15K question-answering pairs and 158K document images from the Baidu search engine. | supported | not supported | python -m dataset_librarian.dataset -n dureader-vis --download |
| msmarco | MS MARCO is a collection of datasets focused on deep learning in search | supported | not supported | python -m dataset_librarian.dataset -n msmarco --download |
| mvtec-ad | MVTEC Anomaly Detection DATASET for industrial inspection. It contains over 5000 high-resolution images divided into fifteen different object and texture categories. | supported | supported | python -m dataset_librarian.dataset -n mvtec-ad --download --preprocess -d <path to the dataset directory> |
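
The commands in the table share one pattern, so they can be assembled programmatically. The sketch below is a minimal illustration that uses only the flags documented here; `build_command` and `PREPROCESS_SUPPORTED` are hypothetical names, not part of dataset-librarian, and the set of datasets that accept `--preprocess` is read off the table rather than queried from the tool.

```python
import sys

# Datasets whose table row includes --preprocess in the command (assumption
# taken from the table, not queried from dataset-librarian itself).
PREPROCESS_SUPPORTED = {"brca", "mvtec-ad"}


def build_command(name, directory=None, preprocess=False, split_ratio=None):
    """Return the argv list for a dataset_librarian download command.

    Hypothetical helper for illustration; it only assembles arguments and
    does not run anything.
    """
    cmd = [sys.executable, "-m", "dataset_librarian.dataset", "-n", name, "--download"]
    if preprocess:
        if name not in PREPROCESS_SUPPORTED:
            raise ValueError(f"preprocessing is not listed as supported for {name!r}")
        cmd.append("--preprocess")
    if directory is not None:
        cmd += ["-d", str(directory)]
    if split_ratio is not None:
        cmd += ["--split_ratio", str(split_ratio)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the package is installed. Building a list rather than a single string avoids shell quoting problems when the directory path contains spaces.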

Command-line Interface

| Input argument | Description |
|---|---|
| --list (-l) | List the supported datasets. |
| --name (-n) | Dataset name. |
| --directory (-d) | Directory where the raw dataset will be saved on your system. The preprocessed dataset files are also written here. If not set, a directory with the dataset name will be created. |
| --download | Download the specified dataset. |
| --preprocess | Preprocess the dataset, if supported. |
| --split_ratio | Split ratio of the test data. The default value is 0.1. |
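
To make the meaning of --split_ratio concrete, here is a minimal sketch of a test split. It assumes the ratio is the fraction of samples held out for testing, so 0.1 keeps 10% as test data. `split_samples` is a hypothetical function written for illustration; the splitting logic inside dataset-librarian may differ (for example, it may split per class).

```python
import random


def split_samples(samples, split_ratio=0.1, seed=0):
    """Hold out a split_ratio fraction of samples as the test set.

    Illustrative only: not the dataset-librarian implementation.
    """
    if not 0.0 <= split_ratio <= 1.0:
        raise ValueError("split_ratio must be between 0 and 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * split_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_samples(range(100), split_ratio=0.1)
print(len(train), len(test))  # prints "90 10"
```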

Python API

from dataset_librarian.dataset_api.download import download_dataset
from dataset_librarian.dataset_api.preprocess import preprocess_dataset

# Replace this placeholder with the path to the raw dataset directory
raw_dataset_dir = "<path to the raw dataset directory>"

# Download the dataset
download_dataset('brca', raw_dataset_dir)

# Preprocess the dataset
preprocess_dataset('brca', raw_dataset_dir)
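
The two calls above can be wrapped so that preprocessing runs only for datasets that support it. The sketch below is one possible arrangement, not part of the package: `prepare_dataset` is a hypothetical helper, and it assumes both functions take the dataset name and the directory as shown in the calls above. The functions are passed in as arguments so the helper can be exercised without the package installed.

```python
def prepare_dataset(name, directory, download_fn, preprocess_fn=None):
    """Download a dataset, then preprocess it if a preprocess function is given.

    Hypothetical helper: download_fn and preprocess_fn are expected to accept
    (dataset_name, directory). Returns the list of steps that were run.
    """
    steps = []
    download_fn(name, directory)
    steps.append("download")
    if preprocess_fn is not None:
        preprocess_fn(name, directory)
        steps.append("preprocess")
    return steps


# Example usage with the real API (requires dataset-librarian to be installed):
# from dataset_librarian.dataset_api.download import download_dataset
# from dataset_librarian.dataset_api.preprocess import preprocess_dataset
# prepare_dataset("mvtec-ad", "data/mvtec-ad", download_dataset, preprocess_dataset)
```

For datasets listed as "not supported" for preprocessing, such as tabformer, omit `preprocess_fn` so only the download step runs.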

Building from source

Clone the Model Zoo for Intel® Architecture repository and navigate to the dataset_api directory.

git clone https://github.com/IntelAI/models.git
cd models/datasets/dataset_api
python -m pip install --upgrade pip build setuptools wheel
python -m pip install .
