Open, Clean Datasets for Computer Vision.

🔥 We use fastdup - a free tool to clean all datasets shared in this repo.
Explore the docs »
Report Issues · Read Blog · Get In Touch · About Us

Description

vl-datasets is a collection of clean computer vision datasets, carefully analyzed and processed to avoid common image dataset issues such as:

  • Duplicates.
  • Broken images.
  • Outliers.
  • Dark/Bright/Blurry images.

For each dataset in this repo, we provide a .csv file that lists the problematic images in that dataset. You can use the images listed in the .csv to improve your model, either by re-labeling them or by simply removing them from the dataset.
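
As a rough illustration of the removal route, the sketch below reads a list of flagged images from a .csv and drops them from a list of image paths. The column names `filename` and `reason`, the helper names, and the sample rows are all assumptions made for this example; check the header of the .csv you actually download, since its layout may differ.

```python
import csv
import io


def load_flagged(csv_file, reasons=None):
    """Return the set of filenames flagged in an open CSV file.

    Assumes hypothetical columns `filename` and `reason`.
    If `reasons` is given, only rows with a matching reason are kept.
    """
    flagged = set()
    for row in csv.DictReader(csv_file):
        if reasons is None or row["reason"] in reasons:
            flagged.add(row["filename"])
    return flagged


def drop_flagged(image_paths, flagged):
    """Keep only the image paths that were not flagged."""
    return [path for path in image_paths if path not in flagged]


# Made-up sample data standing in for a downloaded .csv.
sample_csv = io.StringIO(
    "filename,reason\n"
    "images/a.jpg,duplicate\n"
    "images/c.jpg,blur\n"
)
all_images = ["images/a.jpg", "images/b.jpg", "images/c.jpg"]

flagged = load_flagged(sample_csv)
clean_images = drop_flagged(all_images, flagged)
print(clean_images)  # ['images/b.jpg']
```

Passing something like `reasons={"blur"}` would restrict removal to one issue type, leaving the other flagged images available for re-labeling instead.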

Why?

Computer vision is an exciting and rapidly advancing field, with new techniques and models emerging regularly. However, developing and evaluating these models requires reliable, standardized datasets.

Even with the recent success of generative models, data quality remains a largely overlooked issue. Training models with erroneous data hurts model accuracy and incurs costs in time, storage, and computational resources.

We believe that access to clean, high-quality computer vision datasets leads to accurate, unbiased, and efficient models. By providing public access to vl-datasets, we hope to help advance the field of computer vision.

Datasets & Access

We're a startup, and we'd like to offer as much free access to the datasets as we can afford. To do so, we also need your support.

We're offering select .csv files completely free with no strings attached. For access to our complete dataset and exclusive beta features, all we ask is that you sign up to be a beta tester – it's completely free and your feedback will help shape the future of our platform.

Join us in unlocking the full potential of our data and revolutionizing the industry!

Here is a table of widely used computer vision datasets, issues we found and a link to access the .csv file.

Issue figures are a work in progress (WIP).

Dataset | Duplicates | Outliers | Broken | Blur | Dark | Bright | CSV
Food-101 | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Download here.
Oxford-IIIT Pet | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Download here.
Imagenette | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Download here.
LAION-1B | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Sign up here.
Imagenet-21k | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Sign up here.
Imagenet-1k | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Sign up here.
KITTI | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Sign up here.
DeepFashion | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Sign up here.
Places365 | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Sign up here.
CelebA-HQ | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Sign up here.
ADE20K | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Sign up here.
COCO | 0.24% (12,345) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | 0.85% (456) | Sign up here.

Installation

Option 1 - Install the vl-datasets package from PyPI.

pip install vl-datasets

Option 2 - Install the bleeding-edge version from GitHub.

pip install git+https://github.com/visual-layer/vl-datasets.git@main --upgrade

Usage

To start using vl-datasets, you can import the clean version of the dataset with:

from vl_datasets import CleanFood101

This should import the clean version of the Food101 dataset.

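Next, create the training and validation datasets; each can then be wrapped in a PyTorch DataLoader. The train_transform and valid_transform arguments are transforms you define beforehand.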

train_dataset = CleanFood101('./', split='train', exclude_csv='food_101_vl-datasets_analysis.csv', transform=train_transform)
valid_dataset = CleanFood101('./', split='test', exclude_csv='food_101_vl-datasets_analysis.csv', transform=valid_transform)

Now you can use the dataset in a PyTorch training loop. Refer to our sample training notebooks for details.
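
The calls above pass an exclude_csv argument, but how vl-datasets applies it internally is not described here. The sketch below shows one generic way to get a similar effect: wrapping a map-style dataset so that flagged samples are skipped. `ToyDataset`, `FilteredDataset`, and the sample filenames are hypothetical names used only for illustration. The wrapper relies solely on `__len__` and `__getitem__`, the protocol that PyTorch map-style datasets follow, so it does not need to import torch.

```python
class ToyDataset:
    """Stand-in for an image dataset: holds (filename, label) pairs."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


class FilteredDataset:
    """Expose only the samples whose filename is not in `excluded`."""

    def __init__(self, base, filenames, excluded):
        self.base = base
        # Precompute the indices of the base dataset that survive filtering.
        self.keep = [i for i, name in enumerate(filenames) if name not in excluded]

    def __len__(self):
        return len(self.keep)

    def __getitem__(self, idx):
        # Map the filtered index back to the original dataset index.
        return self.base[self.keep[idx]]


base = ToyDataset([("a.jpg", 0), ("b.jpg", 1), ("c.jpg", 0)])
filenames = [name for name, _ in base.samples]

clean = FilteredDataset(base, filenames, excluded={"b.jpg"})
print(len(clean))  # 2
print(clean[1])    # ('c.jpg', 0)
```

Because the wrapper behaves like any other map-style dataset, an object of this kind could be handed to a DataLoader in place of the unfiltered one.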

Learn from Examples

  • Dataset: CleanFood101
  • Framework: PyTorch
  • Description: Train a simple PyTorch model with the CleanFood101 dataset.

  • Dataset: CleanPets
  • Framework: fast.ai
  • Description: Train a simple TIMM model using fastai.

License

vl-datasets is licensed under the Apache 2.0 License. See LICENSE.

However, you are bound to the usage license of the original dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. We provide no warranty or guarantee of accuracy or completeness.

Built Distributions

vl_datasets-0.0.4-py3.10-none-any.whl (11.8 kB view details)

Uploaded Python 3

vl_datasets-0.0.4-py3.9-none-any.whl (11.8 kB view details)

Uploaded Python 3

File details

Details for the file vl_datasets-0.0.4-py3.10-none-any.whl.

File metadata

File hashes

Hashes for vl_datasets-0.0.4-py3.10-none-any.whl
Algorithm Hash digest
SHA256 ba9014cc1e68060320af7d3b18f18c112280feec1167c5afafaf310cc96dbc19
MD5 1bfe8de50fda42cd21c60732f3cf2a1d
BLAKE2b-256 ea729ab4843a5a55772fa4752cbca4fdcd359f4b2e5aa09397b6ae2382e0fa1d

File details

Details for the file vl_datasets-0.0.4-py3.9-none-any.whl.

File metadata

File hashes

Hashes for vl_datasets-0.0.4-py3.9-none-any.whl
Algorithm Hash digest
SHA256 8677d17a9dac8884ba7016b370a2f49dee67d95222dc5b29fafa4b6f9e4fee53
MD5 7655b6c7e286ed6468ff434aa91920a0
BLAKE2b-256 a5f4067f4683fd35b5e1b1de0cd8dd2c7189fe985a9558384d1823f47e4b504f
