
Clean datasets for computer vision.


Open, Clean Datasets for Computer Vision


🔥 We use fastdup, a free tool, to clean all the datasets shared in this repo.
Explore the docs »
Report Issues · Read Blog · Get In Touch · About Us


What?

This repo shares clean versions of publicly available computer vision datasets.

Why?

Even with the success of generative models, data quality remains a largely overlooked issue. Training models on erroneous data hurts model accuracy and incurs costs in time, storage, and compute.

How?

In this repo we share clean versions of various computer vision datasets.

The datasets are cleaned using fastdup, a free tool we released.

We hope this effort will also help the community train better models and mitigate various model biases.

The cleaned image dataset should be free from most, if not all, of the following issues:

  • Duplicates.
  • Broken images.
  • Outliers.
  • Low-information images (dark, bright, or blurry).
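Some of these checks are easy to prototype without any heavy tooling. Below is a minimal, dependency-free sketch (not fastdup's actual implementation) that flags exact duplicates and low-information images from raw grayscale pixel values; a real pipeline such as fastdup uses learned embeddings, which also catch near-duplicates and outliers.

```python
from statistics import mean, pvariance

def flag_issues(images, dark_thresh=20, bright_thresh=235, var_thresh=100):
    """Flag exact duplicates and low-information images.

    `images` maps a filename to a flat list of grayscale pixel values
    in [0, 255]. Outlier and near-duplicate detection would require
    image embeddings and is out of scope for this sketch.
    """
    issues = {name: [] for name in images}
    seen = set()
    for name, pixels in images.items():
        key = tuple(pixels)                 # exact-duplicate fingerprint
        if key in seen:
            issues[name].append("duplicate")
        seen.add(key)
        m = mean(pixels)
        if m < dark_thresh:
            issues[name].append("dark")
        elif m > bright_thresh:
            issues[name].append("bright")
        if pvariance(pixels) < var_thresh:  # near-uniform: low information
            issues[name].append("low_information")
    return issues

imgs = {
    "a.png": [10, 12, 11, 9],    # very dark and near-uniform
    "b.png": [10, 12, 11, 9],    # exact duplicate of a.png
    "c.png": [0, 255, 128, 64],  # healthy spread of values
}
print(flag_issues(imgs))
# → {'a.png': ['dark', 'low_information'],
#    'b.png': ['duplicate', 'dark', 'low_information'],
#    'c.png': []}
```

In practice a blur check would look at gradient energy (e.g. variance of a Laplacian filter response) rather than raw pixel variance, but the filtering logic is the same: score each image, then exclude those past a threshold.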

Datasets

Here are some of the datasets we are currently working on.

| Dataset | Duplicates | Outliers | Broken | Blur | Dark | Bright |
| --- | --- | --- | --- | --- | --- | --- |
| Food-101 | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Oxford Pets | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Imagenette | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Laion 1B | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Imagenet-21k | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Imagenet-1k | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| KITTI | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| DeepFashion | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| Places365-standard | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| CelebA-HQ | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| ADE20K | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |
| COCO | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) |

Getting Started

Install the vl-datasets package from PyPI:

pip install vl-datasets

Import the clean version of a dataset:

from vl_datasets import CleanFood101

Create the train and validation datasets, excluding the problematic images listed in the analysis CSV:

train_dataset = CleanFood101('./', split='train', exclude_csv='food_101_vl-datasets_analysis.csv', transform=train_transform)
valid_dataset = CleanFood101('./', split='test', exclude_csv='food_101_vl-datasets_analysis.csv', transform=valid_transform)
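The exclude_csv mechanism is simple to emulate: the CSV produced by the analysis lists problematic files, and the dataset wrapper skips them. Here is a hypothetical, stdlib-only sketch of that idea (the real CleanFood101 wraps the actual dataset, and the CSV column name here is an assumption; check the analysis CSV for the real schema):

```python
import csv
import io

def load_excluded(csv_text, column="filename"):
    """Read the set of files to exclude from an analysis CSV.

    The `filename` column name is an assumption for illustration only.
    """
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}

class CleanDataset:
    """Minimal stand-in for a Clean* dataset: a file list minus exclusions."""

    def __init__(self, files, exclude_csv_text):
        excluded = load_excluded(exclude_csv_text)
        self.files = [f for f in files if f not in excluded]

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return self.files[idx]

csv_text = "filename\napple_pie/001.jpg\nwaffles/042.jpg\n"
ds = CleanDataset(
    ["apple_pie/001.jpg", "apple_pie/002.jpg", "waffles/042.jpg"],
    csv_text,
)
print(len(ds), ds[0])
# → 1 apple_pie/002.jpg
```

Because the result still implements `__len__` and `__getitem__`, it drops into a PyTorch DataLoader unchanged, which is what lets the Clean* classes act as transparent replacements for the originals.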

Now you can pass the datasets to a PyTorch DataLoader and use them in a training loop. Refer to our sample training notebooks for details.

Disclaimer

You are bound to the usage license of the original dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. We provide no warranty or guarantee of accuracy or completeness.

