Dataset Librarian is a tool that downloads the supported datasets and applies the preprocessing they need.

Dataset Librarian

Installation Process

python -m pip install dataset-librarian

For more information, see the dataset-librarian PyPI package.
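To confirm that the install succeeded, one option is to ask the standard library for the installed distribution's version. This is only a minimal sketch: the helper name `installed_version` is hypothetical and is not part of dataset-librarian.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("dataset-librarian")
    print(found if found else "dataset-librarian is not installed")
```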

Datasets

| Dataset name | Description | Download | Preprocessing | Command |
| --- | --- | --- | --- | --- |
| brca | Breast cancer dataset that contains categorized contrast-enhanced mammography data and radiologists' notes. | supported | Prerequisite: using a browser, download the Low Energy and Subtracted images, then provide the path to the directory that contains the downloaded images with the --directory argument. | python -m dataset_librarian.dataset -n brca --download --preprocess -d <path to the dataset directory> --split_ratio 0.1 |
| tabformer | Credit card data for TabFormer. | supported | not supported | python -m dataset_librarian.dataset -n tabformer --download |
| dureader-vis | DuReader-vis for document automation. A Chinese Open-Domain Document Visual Question Answering (Open-Domain DocVQA) dataset containing about 15K question-answering pairs and 158K document images from the Baidu search engine. | supported | not supported | python -m dataset_librarian.dataset -n dureader-vis --download |
| msmarco | MS MARCO is a collection of datasets focused on deep learning in search. | supported | not supported | python -m dataset_librarian.dataset -n msmarco --download |
| mvtec-ad | MVTec Anomaly Detection dataset for industrial inspection. It contains over 5000 high-resolution images divided into fifteen different object and texture categories. | supported | supported | python -m dataset_librarian.dataset -n mvtec-ad --download --preprocess -d <path to the dataset directory> |

Command-line Interface

| Input argument | Description |
| --- | --- |
| --list (-l) | List the supported datasets. |
| --name (-n) | Dataset name. |
| --directory (-d) | Directory where the raw dataset will be saved on your system. The preprocessed dataset files are written there as well. If not set, a directory with the dataset name will be created. |
| --download | Download the specified dataset. |
| --preprocess | Preprocess the dataset, if supported. |
| --split_ratio | Split ratio of the test data. The default value is 0.1. |
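When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags listed above and the `python -m dataset_librarian.dataset` entry point from the Datasets table. The helper name `build_command` and the `data/mvtec-ad` path are hypothetical, chosen for illustration.

```python
import sys


def build_command(name, directory=None, download=False, preprocess=False, split_ratio=None):
    """Assemble the argument list for `python -m dataset_librarian.dataset`."""
    cmd = [sys.executable, "-m", "dataset_librarian.dataset", "-n", name]
    if download:
        cmd.append("--download")
    if preprocess:
        cmd.append("--preprocess")
    if directory is not None:
        # Omitting -d leaves the choice of directory to the tool.
        cmd += ["-d", str(directory)]
    if split_ratio is not None:
        cmd += ["--split_ratio", str(split_ratio)]
    return cmd


if __name__ == "__main__":
    # Print the command rather than run it; data/mvtec-ad is a made-up path.
    command = build_command("mvtec-ad", directory="data/mvtec-ad", download=True, preprocess=True)
    print(" ".join(command))
```

To actually run the command, the returned list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues with paths that contain spaces.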

Python API

from dataset_librarian.dataset_api.download import download_dataset
from dataset_librarian.dataset_api.preprocess import preprocess_dataset

# Replace the placeholder with the path to the raw dataset directory
dataset_dir = "<path to the raw dataset directory>"

# Download the dataset
download_dataset('brca', dataset_dir)

# Preprocess the dataset
preprocess_dataset('brca', dataset_dir)
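Since only some datasets support preprocessing, a small wrapper can decide which of the two calls to make. The sketch below is an illustration under assumptions: `PREPROCESS_SUPPORTED`, `plan_steps`, and `fetch` are hypothetical names, and the support flags are transcribed by hand from the Datasets table rather than queried from the library. It treats brca as preprocessable because its documented command passes --preprocess, and it does not handle the manual brca image download that the table lists as a prerequisite.

```python
# Preprocessing support per dataset, transcribed from the Datasets table.
PREPROCESS_SUPPORTED = {
    "brca": True,  # requires the manually downloaded images to be in place
    "tabformer": False,
    "dureader-vis": False,
    "msmarco": False,
    "mvtec-ad": True,
}


def plan_steps(name):
    """Return the ordered steps to run for a dataset, based on the table."""
    if name not in PREPROCESS_SUPPORTED:
        raise ValueError(f"unsupported dataset: {name}")
    if PREPROCESS_SUPPORTED[name]:
        return ["download", "preprocess"]
    return ["download"]


def fetch(name, directory):
    """Download a dataset, then preprocess it when the table marks that as supported."""
    # Imported lazily so plan_steps can be used without the package installed.
    from dataset_librarian.dataset_api.download import download_dataset
    from dataset_librarian.dataset_api.preprocess import preprocess_dataset

    for step in plan_steps(name):
        if step == "download":
            download_dataset(name, directory)
        else:
            preprocess_dataset(name, directory)
```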

Building from source

Clone the Model Zoo for Intel® Architecture repository and navigate to the dataset_api directory.

git clone https://github.com/IntelAI/models.git
cd models/datasets/dataset_api
python -m pip install --upgrade pip build setuptools wheel
python -m pip install .
