
Clean datasets for computer vision.


Open, Clean Datasets for Computer Vision


🔥 We use fastdup, a free tool, to clean all the datasets shared in this repo.
Explore the docs »
Report Issues · Read Blog · Get In Touch · About Us


What?

This repo shares clean versions of publicly available computer vision datasets.

Why?

Even with the success of generative models, data quality remains a largely overlooked issue. Training models on erroneous data hurts model accuracy and incurs costs in time, storage, and compute.

How?

In this repo we share clean versions of various computer vision datasets.

The datasets are cleaned using fastdup, a free tool we released.

We hope this effort will also help the community train better models and mitigate various model biases.

The cleaned image dataset should be free from most, if not all, of the following issues:

  • Duplicates.
  • Broken images.
  • Outliers.
  • Low-information images (dark, bright, or blurry).
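Some of these checks are easy to prototype without any heavy tooling. Below is a minimal, dependency-free sketch (not fastdup's actual implementation) that flags exact duplicates and low-information images from raw grayscale pixel values; a real pipeline such as fastdup uses learned embeddings, which also catch near-duplicates and outliers.

```python
from statistics import mean, pvariance

def flag_issues(images, dark_thresh=20, bright_thresh=235, var_thresh=100):
    """Flag exact duplicates and low-information images.

    `images` maps a filename to a flat list of grayscale pixel values
    in [0, 255]. Outlier and near-duplicate detection would require
    image embeddings and is out of scope for this sketch.
    """
    issues = {name: [] for name in images}
    seen = set()
    for name, pixels in images.items():
        key = tuple(pixels)                 # exact-duplicate fingerprint
        if key in seen:
            issues[name].append("duplicate")
        seen.add(key)
        m = mean(pixels)
        if m < dark_thresh:
            issues[name].append("dark")
        elif m > bright_thresh:
            issues[name].append("bright")
        if pvariance(pixels) < var_thresh:  # near-uniform: low information
            issues[name].append("low_information")
    return issues

imgs = {
    "a.png": [10, 12, 11, 9],    # very dark and near-uniform
    "b.png": [10, 12, 11, 9],    # exact duplicate of a.png
    "c.png": [0, 255, 128, 64],  # healthy spread of values
}
print(flag_issues(imgs))
# → {'a.png': ['dark', 'low_information'],
#    'b.png': ['duplicate', 'dark', 'low_information'],
#    'c.png': []}
```

In practice a blur check would look at gradient energy (e.g. variance of a Laplacian filter response) rather than raw pixel variance, but the filtering logic is the same: score each image, then exclude those past a threshold.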

Datasets

Here are some of the datasets we are currently working on.

| Dataset | Duplicates | Outliers | Broken | Blur | Dark | Bright |
| --- | --- | --- | --- | --- | --- | --- |
| Food-101 | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Oxford Pets | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Imagenette | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Laion 1B | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Imagenet-21k | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Imagenet-1k | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| KITTI | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| DeepFashion | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Places365-standard | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| CelebA-HQ | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| ADE20K | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| COCO | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |

Getting Started

Install the vl-datasets package from PyPI:

pip install vl-datasets

Import the clean version of a dataset:

from vl_datasets import CleanFood101

Create the train and validation datasets, excluding the problematic images listed in the analysis CSV:

train_dataset = CleanFood101('./', split='train', exclude_csv='food_101_vl-datasets_analysis.csv', transform=train_transform)
valid_dataset = CleanFood101('./', split='test', exclude_csv='food_101_vl-datasets_analysis.csv', transform=valid_transform)
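The exclude_csv mechanism is simple to emulate: the CSV produced by the analysis lists problematic files, and the dataset wrapper skips them. Here is a hypothetical, stdlib-only sketch of that idea (the real CleanFood101 wraps the actual dataset, and the CSV column name here is an assumption; check the analysis CSV for the real schema):

```python
import csv
import io

def load_excluded(csv_text, column="filename"):
    """Read the set of files to exclude from an analysis CSV.

    The `filename` column name is an assumption for illustration only.
    """
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}

class CleanDataset:
    """Minimal stand-in for a Clean* dataset: a file list minus exclusions."""

    def __init__(self, files, exclude_csv_text):
        excluded = load_excluded(exclude_csv_text)
        self.files = [f for f in files if f not in excluded]

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return self.files[idx]

csv_text = "filename\napple_pie/001.jpg\nwaffles/042.jpg\n"
ds = CleanDataset(
    ["apple_pie/001.jpg", "apple_pie/002.jpg", "waffles/042.jpg"],
    csv_text,
)
print(len(ds), ds[0])
# → 1 apple_pie/002.jpg
```

Because the result still implements `__len__` and `__getitem__`, it drops into a PyTorch DataLoader unchanged, which is what lets the Clean* classes act as transparent replacements for the originals.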

Now you can pass the datasets to a PyTorch DataLoader and use them in a training loop. Refer to our sample training notebooks for details.

Disclaimer

You are bound to the usage license of the original dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. We provide no warranty or guarantee of accuracy or completeness.

