Dataset Librarian is a tool that downloads the supported datasets and applies the preprocessing they need.
Dataset Librarian
Installation Process
python -m pip install dataset-librarian
For more information, see the dataset-librarian package on PyPI.
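To confirm the installation from Python, one option is to query the installed distribution's metadata. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of dataset-librarian.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="dataset-librarian"):
    """Return the installed version string of dist_name, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "dataset-librarian is not installed")
```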
Datasets
Dataset name | Description | Download | Preprocessing | Command |
---|---|---|---|---|
brca | Breast Cancer dataset that contains categorized contrast enhanced mammography data and radiologists’ notes. | supported | A prerequisite: Use a browser, download the Low Energy and Subtracted images, then provide the path to the directory that contains the downloaded images using --directory argument. | python -m dataset_librarian.dataset -n brca --download --preprocess -d <path to the dataset directory> |
tabformer | Credit card data for TabFormer | supported | not supported | python -m dataset_librarian.dataset -n tabformer --download |
dureader-vis | DuReader-vis for document automation. Chinese Open-domain Document Visual Question Answering (Open-Domain DocVQA) dataset, containing about 15K question-answering pairs and 158K document images from the Baidu search engine. | supported | not supported | python -m dataset_librarian.dataset -n dureader-vis --download |
msmarco | MS MARCO is a collection of datasets focused on deep learning in search | supported | not supported | python -m dataset_librarian.dataset -n msmarco --download |
mvtec-ad | MVTEC Anomaly Detection DATASET for industrial inspection. It contains over 5000 high-resolution images divided into fifteen different object and texture categories. | supported | supported | python -m dataset_librarian.dataset -n mvtec-ad --download --preprocess -d <path to the dataset directory> |
Command-line Interface
Input Arguments | Description |
---|---|
--list (-l) | List the supported datasets. |
--name (-n) | Name of the dataset to download or preprocess. |
--directory (-d) | Directory where the raw dataset is saved on your system. The preprocessed dataset files are written to the same location. If not set, a directory with the dataset name is created. |
--download | Download the specified dataset. |
--preprocess | Preprocess the dataset, if preprocessing is supported for it. |
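When the command line is driven from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags listed in the table above; `build_command` is a hypothetical helper, not part of dataset-librarian, and it only builds the command without running it.

```python
import sys


def build_command(name, directory=None, download=False, preprocess=False):
    """Build the argument list for `python -m dataset_librarian.dataset`."""
    cmd = [sys.executable, "-m", "dataset_librarian.dataset", "-n", name]
    if download:
        cmd.append("--download")
    if preprocess:
        cmd.append("--preprocess")
    if directory is not None:
        cmd += ["-d", str(directory)]
    return cmd


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run(cmd, check=True)
    # to actually execute it (requires dataset-librarian to be installed).
    cmd = build_command("mvtec-ad", "./mvtec-ad", download=True, preprocess=True)
    print(" ".join(cmd))
```

Returning a list rather than a single string avoids shell quoting problems when the directory path contains spaces.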
Python API
from dataset_librarian.dataset_api.download import download_dataset
from dataset_librarian.dataset_api.preprocess import preprocess_dataset
# Replace this placeholder with the path to the raw dataset directory
dataset_dir = "<path to the raw dataset directory>"
# Download the dataset
download_dataset('brca', dataset_dir)
# Preprocess the dataset
preprocess_dataset('brca', dataset_dir)
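Since only some datasets support preprocessing, a small wrapper can skip that step where it does not apply. The following is a sketch under stated assumptions: `prepare_dataset` and `PREPROCESS_SUPPORTED` are hypothetical names, the set is copied by hand from the Datasets table, and the download and preprocess functions are passed in as arguments so the example runs without the package installed.

```python
# Datasets whose preprocessing the Datasets table marks as available
# (assumption: brca is included, though it needs manually downloaded images first).
PREPROCESS_SUPPORTED = {"brca", "mvtec-ad"}


def prepare_dataset(name, directory, download_fn, preprocess_fn):
    """Download `name` into `directory`, then preprocess it if supported.

    Returns the list of steps that were run.
    """
    steps = []
    download_fn(name, directory)
    steps.append("download")
    if name in PREPROCESS_SUPPORTED:
        preprocess_fn(name, directory)
        steps.append("preprocess")
    return steps


if __name__ == "__main__":
    # Stand-in callables for demonstration; in real use, pass
    # download_dataset and preprocess_dataset from dataset_librarian instead.
    def fake_download(name, directory):
        print(f"download {name} -> {directory}")

    def fake_preprocess(name, directory):
        print(f"preprocess {name} in {directory}")

    print(prepare_dataset("tabformer", "./tabformer", fake_download, fake_preprocess))
```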
Building from source
Clone the Model Zoo for Intel® Architecture repository and navigate to the dataset_api directory.
git clone https://github.com/IntelAI/models.git
cd models/datasets/dataset_api
python -m pip install --upgrade pip build setuptools wheel
python -m pip install .