
Clean datasets for computer vision.




Open, Clean Datasets for Computer Vision


🔥 We use fastdup, a free tool, to clean all datasets shared in this repo.

What?

This repo shares clean versions of publicly available computer vision datasets.

Why?

Even with the success of generative models, data quality remains a largely overlooked issue. Training models with erroneous data hurts model accuracy and wastes time, storage, and computational resources.

How?

In this repo we share clean versions of various computer vision datasets.

The datasets are cleaned using fastdup, a free tool we released.

We hope this effort will also help the community train better models and mitigate various model biases.

The cleaned image dataset should be free from most if not all of the following issues:

  • Duplicates.
  • Broken images.
  • Outliers.
  • Low-information images (dark/bright/blurry images).
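
To make the first item concrete, here is a minimal sketch of the simplest case: finding exact duplicates by hashing file contents. This is not how fastdup works and it will miss near-duplicates (resized or re-encoded copies). The function name `find_exact_duplicates` and the extension list are assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(image_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Group image files under image_dir whose bytes are identical.

    Hypothetical helper for illustration; it only catches byte-for-byte copies.
    """
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            # Identical content always produces the same SHA-256 digest.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("./images")` returns a list of groups, each a list of paths with identical content. From each group you would typically keep one file and drop the rest.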

Datasets

Here are some of the datasets we are currently working on.

Dataset Issues
Food-101
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
Oxford Pets
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
Imagenette
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
Laion 1B
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
Imagenet-21k
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
Imagenet-1k
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
KITTI
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
DeepFashion
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
Places365-standard
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
CelebA-HQ
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
ADE20K
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)
COCO
  • Duplicates - 0.24% (12,345)
  • Outliers - 0.85% (456)
  • Broken - 0.85% (456)
  • Blur - 0.85% (456)
  • Dark - 0.85% (456)
  • Bright - 0.85% (456)

Getting Started

Install the vl_datasets package from PyPI.

pip install vl-datasets

Import the clean version of the dataset.

from vl_datasets import CleanFood101

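Create the training and validation datasets so they can be loaded into a PyTorch DataLoader. Here train_transform and valid_transform are transforms you define beforehand, and exclude_csv points to the analysis file listing the images to leave out.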

train_dataset = CleanFood101('./', split='train', exclude_csv='food_101_vl-datasets_analysis.csv', transform=train_transform)
valid_dataset = CleanFood101('./', split='test', exclude_csv='food_101_vl-datasets_analysis.csv', transform=valid_transform)
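
The `exclude_csv` argument suggests that flagged images are filtered out using a CSV file. The sketch below illustrates that idea in plain Python. It is not the package's implementation, and the CSV layout is not documented here: the `filename` and `reason` columns, the sample rows, and both helper functions are hypothetical.

```python
import csv
import io


def load_excluded(csv_file, column="filename"):
    """Return the set of file names listed in one column of a CSV.

    The column name is an assumption, not the documented schema.
    """
    return {row[column] for row in csv.DictReader(csv_file)}


def filter_samples(samples, excluded):
    """Keep (path, label) pairs whose path is not in the excluded set."""
    return [(path, label) for path, label in samples if path not in excluded]


# Hypothetical CSV contents and samples, for illustration only.
example_csv = io.StringIO(
    "filename,reason\n"
    "pizza/001.jpg,duplicate\n"
    "sushi/042.jpg,blur\n"
)
samples = [("pizza/001.jpg", 0), ("pizza/002.jpg", 0), ("sushi/042.jpg", 1)]

clean = filter_samples(samples, load_excluded(example_csv))
print(clean)  # [('pizza/002.jpg', 0)]
```

Using a set for the excluded names keeps each membership check constant-time, which matters once the exclusion list grows to many thousands of entries.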

Now you can use the dataset in a PyTorch training loop. Refer to our sample training notebooks for details.

Sample training notebooks:

Disclaimer

You are bound to the usage license of the original dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. We provide no warranty or guarantee of accuracy or completeness.

