Dataset Librarian is a tool that downloads the supported datasets and applies the preprocessing they need.
Dataset Librarian
Installation Process
python -m pip install dataset-librarian
For more information, see the dataset-librarian package page on PyPI.
Datasets
Dataset name | Description | Download | Preprocessing | Command |
---|---|---|---|---|
brca | Breast cancer dataset that contains categorized contrast-enhanced mammography data and radiologists' notes. | supported | Prerequisite: using a browser, download the Low Energy and Subtracted images, then pass the path of the directory that contains the downloaded images with the --directory argument. | python -m dataset_librarian.dataset -n brca --download --preprocess -d <path to the dataset directory> --split_ratio 0.1 |
tabformer | Credit card data for TabFormer. | supported | not supported | python -m dataset_librarian.dataset -n tabformer --download |
dureader-vis | DuReader-vis for document automation. A Chinese Open-domain Document Visual Question Answering (Open-Domain DocVQA) dataset, containing about 15K question-answering pairs and 158K document images from the Baidu search engine. | supported | not supported | python -m dataset_librarian.dataset -n dureader-vis --download |
msmarco | MS MARCO is a collection of datasets focused on deep learning in search. | supported | not supported | python -m dataset_librarian.dataset -n msmarco --download |
mvtec-ad | MVTec Anomaly Detection dataset for industrial inspection. It contains over 5000 high-resolution images divided into fifteen different object and texture categories. | supported | supported | python -m dataset_librarian.dataset -n mvtec-ad --download --preprocess -d <path to the dataset directory> |
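The commands in the table follow one pattern, which can be sketched as a small helper that assembles the argument list for a given dataset. This is only an illustration: `build_command` and `PREPROCESSING_SUPPORTED` are hypothetical names that are not part of dataset-librarian, and the support values are copied from the table above. The table only shows `--split_ratio` together with `brca`, so whether other datasets accept it is an assumption left to the caller.

```python
import sys

# Preprocessing support per dataset, copied from the table above.
PREPROCESSING_SUPPORTED = {
    "brca": True,
    "tabformer": False,
    "dureader-vis": False,
    "msmarco": False,
    "mvtec-ad": True,
}


def build_command(name, directory=None, split_ratio=None):
    """Return the argument list for the documented CLI call (hypothetical helper)."""
    if name not in PREPROCESSING_SUPPORTED:
        raise ValueError(f"unsupported dataset: {name!r}")
    # Run the module with the current interpreter instead of a bare "python".
    cmd = [sys.executable, "-m", "dataset_librarian.dataset", "-n", name, "--download"]
    if PREPROCESSING_SUPPORTED[name]:
        cmd.append("--preprocess")
    if directory is not None:
        cmd += ["-d", str(directory)]
    if split_ratio is not None:
        cmd += ["--split_ratio", str(split_ratio)]
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it, so nothing is downloaded here.
    print(" ".join(build_command("mvtec-ad", "./mvtec-ad")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the package is installed.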
Command-line Interface
Input Arguments | Description |
---|---|
--list (-l) | List the supported datasets. |
--name (-n) | Name of the dataset. |
--directory (-d) | Directory where the raw dataset is saved on your system. The preprocessed dataset files are written there as well. If not set, a directory with the dataset name is created. |
--download | Download the specified dataset. |
--preprocess | Preprocess the dataset, if supported. |
--split_ratio | Split ratio for the test data. The default value is 0.1. |
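To make the flag semantics concrete, the sketch below mirrors the documented options with `argparse`. It is not the package's actual parser: the function names, the argument types, and the choice to treat `--list`, `--download` and `--preprocess` as boolean switches are assumptions. The fallback to a directory named after the dataset follows the table, but where that directory is created is not documented, so a relative path is assumed.

```python
import argparse


def make_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="dataset_librarian.dataset")
    parser.add_argument("--list", "-l", action="store_true",
                        help="list the supported datasets")
    parser.add_argument("--name", "-n", help="dataset name")
    parser.add_argument("--directory", "-d", default=None,
                        help="where raw and preprocessed files are stored")
    parser.add_argument("--download", action="store_true",
                        help="download the specified dataset")
    parser.add_argument("--preprocess", action="store_true",
                        help="preprocess the dataset if supported")
    parser.add_argument("--split_ratio", type=float, default=0.1,
                        help="split ratio for the test data")
    return parser


def resolve_directory(args):
    """Fall back to a directory named after the dataset when -d is not given."""
    return args.directory if args.directory is not None else args.name


if __name__ == "__main__":
    parsed = make_parser().parse_args(["-n", "brca", "--download", "--preprocess"])
    print(parsed.name, resolve_directory(parsed), parsed.split_ratio)
```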
Python API
from dataset_librarian.dataset_api.download import download_dataset
from dataset_librarian.dataset_api.preprocess import preprocess_dataset
# Download the dataset into the given directory
download_dataset('brca', '<path to the raw dataset directory>')
# Preprocess the dataset stored in the same directory
preprocess_dataset('brca', '<path to the raw dataset directory>')
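The two calls above can be combined into one step. The sketch below is a hypothetical wrapper, not part of the package: `fetch_and_prepare` is an invented name, and it takes the download and preprocess functions as parameters so that it runs without dataset-librarian installed. It assumes both functions accept `(name, directory)` positionally, as in the example above, and it creates the target directory up front, which may be redundant if the library already does so.

```python
from pathlib import Path
from typing import Callable, Optional

# Any callable taking (dataset name, directory path), e.g. download_dataset.
DatasetStep = Callable[[str, str], object]


def fetch_and_prepare(
    name: str,
    directory: str,
    download: DatasetStep,
    preprocess: Optional[DatasetStep] = None,
) -> Path:
    """Download a dataset, then optionally preprocess it in the same directory."""
    target = Path(directory).expanduser().resolve()
    # Make sure the directory exists before handing it to the library calls.
    target.mkdir(parents=True, exist_ok=True)
    download(name, str(target))
    if preprocess is not None:
        preprocess(name, str(target))
    return target
```

With the package installed, a call might look like `fetch_and_prepare('mvtec-ad', './mvtec-ad', download_dataset, preprocess_dataset)`. For datasets whose preprocessing is listed as not supported, leave out the last argument.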
Building from source
Clone the Model Zoo for Intel® Architecture repository and navigate to the dataset_api directory.
git clone https://github.com/IntelAI/models.git
cd models/datasets/dataset_api
python -m pip install --upgrade pip build setuptools wheel
python -m pip install .
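After installing, a quick check from Python can confirm that the module is visible to the interpreter. This is a minimal sketch using only the standard library. `is_importable` is a hypothetical helper name, and `dataset_librarian` is the import name used in the Python API section.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("dataset_librarian"):
        print("dataset_librarian is installed")
    else:
        print("dataset_librarian was not found in this environment")
```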