
Clean datasets for computer vision.




Open, Clean Datasets for Computer Vision


🔥 We use fastdup, a free tool, to clean all datasets shared in this repo.
Explore the docs »
Report Issues · Read Blog · Get In Touch · About Us


What?

This repo shares clean versions of publicly available computer vision datasets.

Why?

Even with the success of generative models, data quality remains a largely overlooked issue. Training models on erroneous data hurts accuracy and incurs costs in time, storage, and computational resources.

How?

In this repo we share clean versions of various computer vision datasets.

The datasets are cleaned using fastdup, a free tool we released.

We hope this effort will also help the community train better models and mitigate various model biases.

Each cleaned dataset should be free from most, if not all, of the following issues:

  • Duplicates.
  • Broken images.
  • Outliers.
  • Dark/Bright/Blurry images.

Datasets

Here are some of the datasets we are currently working on.

Dataset            | Duplicates     | Outliers    | Broken      | Blur        | Dark        | Bright
Food-101           | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
Oxford Pets        | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
Imagenette         | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
Laion 1B           | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
Imagenet-21k       | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
Imagenet-1k        | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
KITTI              | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
DeepFashion        | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
Places365-standard | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
CelebA-HQ          | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
ADE20K             | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)
COCO               | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456)

Setting Up

Prerequisites

Supported Python versions: 3.9 and 3.10 (matching the wheels published for this release).

Supported operating systems: Ubuntu.

Installation

Option 1 - Install the vl-datasets package from PyPI:

pip install vl-datasets

Option 2 - Install the bleeding-edge version from GitHub:

pip install git+https://github.com/visual-layer/vl-datasets.git@master --upgrade

Getting Started

Import the clean version of the dataset:

from vl_datasets import CleanFood101

Create the train and validation datasets, excluding the flagged images listed in the analysis CSV:

# train_transform / valid_transform are transforms you define beforehand
train_dataset = CleanFood101('./', split='train',
                             exclude_csv='food_101_vl-datasets_analysis.csv',
                             transform=train_transform)
valid_dataset = CleanFood101('./', split='test',
                             exclude_csv='food_101_vl-datasets_analysis.csv',
                             transform=valid_transform)
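The exclude_csv argument points at fastdup's analysis output, which lists the images to drop. As a rough sketch of that filtering step using only the standard library (the filename column and both helper functions are our assumptions, not the package's actual internals):

```python
import csv

def load_excluded(csv_path, column="filename"):
    """Read the analysis CSV and collect the filenames to drop."""
    with open(csv_path, newline="") as f:
        return {row[column] for row in csv.DictReader(f)}

def filter_samples(samples, excluded):
    """Keep only samples whose filename is not on the exclusion list."""
    return [s for s in samples if s not in excluded]
```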

Now you can use the dataset in a PyTorch training loop. Refer to our sample training notebooks for details.


Disclaimer

You are bound by the usage license of the original dataset. It is your responsibility to determine whether you have permission to use the dataset under its license. We provide no warranty or guarantee of accuracy or completeness.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

vl_datasets-0.0.3-py3.10-none-any.whl (10.8 kB, Python 3)

vl_datasets-0.0.3-py3.9-none-any.whl (10.8 kB, Python 3)

File details

Details for the file vl_datasets-0.0.3-py3.10-none-any.whl.

File metadata

File hashes

Hashes for vl_datasets-0.0.3-py3.10-none-any.whl
Algorithm Hash digest
SHA256 65b22338dfcc2994aa9604a948cca4d9630b1447d3f796b2eb244dc9bcb1877d
MD5 e25457ac0e75198972aba3a47775f0c5
BLAKE2b-256 f753fbd3ee29542271815fcf2a1640679735c699bf59f78dad332ef548182982


File details

Details for the file vl_datasets-0.0.3-py3.9-none-any.whl.

File metadata

File hashes

Hashes for vl_datasets-0.0.3-py3.9-none-any.whl
Algorithm Hash digest
SHA256 0847584ae0f2df3f43820342d15d011d14cbb95f3831b54d01c4ec74d5447321
MD5 4dd67a98de5f98619206251e0f1f7bbd
BLAKE2b-256 b35fec7ab97a9eb74a0da9967d3d2e4d87b820ae4ba580437f6213a7b64ca01a

