Open, Clean Datasets for Computer Vision.

VL-Datasets

Open, Clean, Curated Datasets for Computer Vision


🔥 We use fastdup - a free tool to clean all datasets shared in this repo.

Description

vl-datasets is a Python package that provides access to clean computer vision datasets with only 2 lines of code.

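For example, to get access to the clean version of the Food-101 dataset, import VLFood101 from vl_datasets and instantiate it with a root directory and a split, as shown in the Usage section.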

We support some of the most widely used computer vision datasets. Let us know if you would like us to support an additional dataset.

All the datasets are analyzed for issues such as:

  • Duplicates.
  • Near Duplicates.
  • Broken images.
  • Outliers.
  • Dark/Bright/Blurry images.
  • Mislabels.

Why?

Computer vision is an exciting and rapidly advancing field, with new techniques and models emerging all the time. However, to develop and evaluate these models, it's essential to have reliable and standardized datasets to work with.

Even with the recent success of generative models, data quality remains a largely overlooked issue. Training models with erroneous data hurts model accuracy and incurs costs in time, storage, and computational resources.

We believe that access to clean, high-quality computer vision datasets leads to accurate, unbiased, and efficient models. By providing public access to vl-datasets, we hope to help advance the field of computer vision.

Datasets & Access

vl-datasets provides a convenient way to access the cleaned version of the datasets in Python.

Alternatively, for each dataset in this repo, we provide a .csv file that lists the problematic images from the dataset.

You can use the images listed in the .csv to improve your model, either by re-labeling them or by simply removing them from the dataset.
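As a rough illustration of the removal route, the sketch below reads a list of problematic images from a .csv file and filters them out of a list of image paths. The column name `filename`, the matching on base file names, and both helper functions are assumptions made for this example; check the header of the .csv you download and adjust accordingly.

```python
import csv
from pathlib import Path


def load_flagged_filenames(csv_path, column="filename"):
    """Return the set of non-empty values found in `column` of the CSV file.

    Assumption: the CSV has a header row and a column named `filename`.
    """
    with open(csv_path, newline="") as f:
        return {row.get(column) for row in csv.DictReader(f) if row.get(column)}


def drop_flagged(image_paths, flagged):
    """Keep only the paths whose base file name is not in `flagged`."""
    return [p for p in image_paths if Path(p).name not in flagged]
```

If the .csv stores relative paths rather than bare file names, compare against the full path string instead of `Path(p).name`.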

We're a startup and we'd like to offer free access to the datasets as much as we can afford to. But in doing so, we'd also need your support.

We're offering select .csv files completely free with no strings attached. For access to our complete dataset and exclusive beta features, all we ask is that you sign up to be a beta tester – it's completely free and your feedback will help shape the future of our platform.

Here is a table of widely used computer vision datasets, issues we found and a link to access the .csv file.

Dataset: Food-101
  • Duplicates - 0.233 % (235)
  • Outliers - 0.076 % (77)
  • Blur - 0.183 % (185)
  • Dark - 0.043 % (43)
  • Total - 0.535 % (540)
CSV: Download here.
Usage: from vl_datasets import VLFood101

Dataset: Oxford-IIIT Pet
  • Duplicates - 1.021 % (75)
  • Outliers - 0.095 % (7)
  • Dark - 0.054 % (4)
  • Total - 1.170 % (86)
CSV: Download here.
Usage: from vl_datasets import VLOxfordIIITPet

Dataset: LAION-1B
  • Duplicates - WIP % (WIP)
  • Outliers - WIP % (WIP)
  • Broken - WIP % (WIP)
  • Blur - WIP % (WIP)
  • Dark - WIP % (WIP)
  • Bright - WIP % (WIP)
CSV: Request access here.
Usage: WIP

Dataset: ImageNet-21K
  • Duplicates - 11.853 % (1,559,120)
  • Outliers - 0.085 % (11,119)
  • Blur - 0.292 % (38,458)
  • Dark - 0.179 % (23,574)
  • Bright - 0.431 % (56,754)
  • Mislabels - 3.064 % (402,963)
  • Total - 15.904 % (2,091,988)
CSV: Request access here.
Usage: WIP

Dataset: ImageNet-1K
  • Duplicates - 0.520 % (6,660)
  • Outliers - 0.090 % (1,150)
  • Blur - 0.200 % (2,554)
  • Dark - 0.234 % (2,997)
  • Bright - 0.058 % (746)
  • Mislabels - 0.119 % (1,518)
  • Total - 1.221 % (15,625)
CSV: Request access here.
Usage: WIP

Dataset: KITTI
  • Duplicates - 15.294 % (2,294)
  • Outliers - 0.107 % (16)
  • Total - 15.401 % (2,310)
CSV: Request access here.
Usage: WIP

Dataset: DeepFashion
  • Duplicates - WIP % (WIP)
  • Outliers - WIP % (WIP)
  • Broken - WIP % (WIP)
  • Blur - WIP % (WIP)
  • Dark - WIP % (WIP)
  • Bright - WIP % (WIP)
CSV: Request access here.
Usage: WIP

Dataset: Places365
  • Duplicates - WIP % (WIP)
  • Outliers - WIP % (WIP)
  • Broken - WIP % (WIP)
  • Blur - WIP % (WIP)
  • Dark - WIP % (WIP)
  • Bright - WIP % (WIP)
CSV: Request access here.
Usage: WIP

Dataset: CelebA-HQ
  • Duplicates - 1.673 % (3,389)
  • Outliers - 0.077 % (157)
  • Blur - 0.512 % (1,037)
  • Dark - 0.009 % (18)
  • Mislabels - 0.006 % (13)
  • Total - 2.277 % (4,614)
CSV: Request access here.
Usage: WIP

Dataset: ADE20K
  • Duplicates - WIP % (WIP)
  • Outliers - WIP % (WIP)
  • Broken - WIP % (WIP)
  • Blur - WIP % (WIP)
  • Dark - WIP % (WIP)
  • Bright - WIP % (WIP)
CSV: Request access here.
Usage: WIP

Dataset: COCO
  • Duplicates - 0.123 % (201)
  • Outliers - 0.087 % (143)
  • Blur - 0.029 % (47)
  • Dark - 0.106 % (174)
  • Bright - 0.013 % (21)
  • Total - 0.358 % (586)
CSV: Request access here.
Usage: WIP
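
The percentages in the table can be reproduced from the raw counts. The sketch below does this for Food-101, assuming the full dataset holds 101,000 images (101 classes with 1,000 images each); the helper name `issue_percentages` is made up for this example.

```python
# Issue counts for Food-101, copied from the table.
FOOD101_ISSUES = {"Duplicates": 235, "Outliers": 77, "Blur": 185, "Dark": 43}

# Assumption: Food-101 contains 101 classes x 1,000 images = 101,000 images.
FOOD101_SIZE = 101_000


def issue_percentages(counts, dataset_size):
    """Map each issue type to its share of the dataset, in percent."""
    return {name: 100 * n / dataset_size for name, n in counts.items()}


total = sum(FOOD101_ISSUES.values())
print(total)  # 540
print(round(100 * total / FOOD101_SIZE, 3))  # 0.535
```

The same calculation can be used to sanity-check any other row once the size of the corresponding dataset is known.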

Learn more about how we clean the datasets using our profiling tool here.

Installation

Option 1 - Install the vl-datasets package from PyPI:

pip install vl-datasets

Option 2 - Install the bleeding edge version from GitHub:

pip install git+https://github.com/visual-layer/vl-datasets.git@main --upgrade
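
To confirm which version ended up in your environment, a small check along these lines may help. It relies only on the standard library (Python 3.8+); the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="vl-datasets"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is missing.
print(installed_version())
```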

Usage

To start using vl-datasets, import the clean version of the dataset with:

from vl_datasets import VLFood101

This imports the clean version of the Food-101 dataset.

Next, you can load the dataset as a PyTorch Dataset.

train_dataset = VLFood101('./', split='train')
valid_dataset = VLFood101('./', split='test')

If you have a custom .csv file you can optionally pass in the file:

train_dataset = VLFood101('./', split='train', exclude_csv='my-file.csv')

The filenames listed in the .csv will be excluded from the dataset.

Next, you can load the train and validation datasets in a PyTorch training loop.

See the Learn from Examples section to learn more.

NOTE: Sign up here for free to become a beta tester and get full access to all the .csv files for the datasets listed in this repo.

With the datasets loaded, you can train a model using a standard PyTorch training loop.
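
A minimal sketch of such a loop is shown below. It is not taken from the package: `train_one_epoch` is a hypothetical helper, and it only assumes a DataLoader whose batches are (image tensor, integer label) pairs. How the images are converted to tensors is left to your dataset setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader


def train_one_epoch(model, loader, optimizer, loss_fn, device="cpu"):
    """Run one pass over `loader` and return the mean training loss.

    Assumption: each batch is a pair (images, labels) of tensors.
    """
    model.to(device).train()
    total_loss, n_samples = 0.0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()                   # clear gradients from the last step
        loss = loss_fn(model(images), labels)   # forward pass
        loss.backward()                         # backward pass
        optimizer.step()                        # parameter update
        total_loss += loss.item() * labels.size(0)
        n_samples += labels.size(0)
    return total_loss / max(n_samples, 1)
```

A `DataLoader` built from `train_dataset` could be passed as `loader`, for example `DataLoader(train_dataset, batch_size=32, shuffle=True)`, provided the dataset items are already tensors of a common shape.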

Learn from Examples

  • Dataset: VLFood101
  • Framework: PyTorch.
  • Description: Load a dataset and train a PyTorch model.

  • Dataset: VLOxfordIIITPet
  • Framework: fast.ai.
  • Description: Fine-tune a pretrained TIMM model using fastai.

License

vl-datasets is licensed under the Apache 2.0 License. See LICENSE.

However, you are bound to the usage license of the original dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. We provide no warranty or guarantee of accuracy or completeness.

Usage Tracking

This repository incorporates usage tracking using Sentry.io to monitor and collect valuable information about the usage of the application.

Usage tracking allows us to gain insights into how the application is being used in real-world scenarios. It provides us with valuable information that helps in understanding user behavior, identifying potential issues, and making informed decisions to improve the application.

We DO NOT collect folder names, user names, image names, image content, or other personally identifiable information.

What data is tracked?

  • Errors and Exceptions: Sentry captures errors and exceptions that occur in the application, providing detailed stack traces and relevant information to help diagnose and fix issues.
  • Performance Metrics: Sentry collects performance metrics, such as response times, latency, and resource usage, enabling us to monitor and optimize the application's performance.

Read more on Sentry's official webpage.

Getting Help

Get help from the Visual Layer team or community members via the following channels -

About Visual-Layer

Visual Layer was founded by the authors of XGBoost, Apache TVM & Turi Create - Danny Bickson, Carlos Guestrin and Amir Alush.

Learn more about Visual Layer here.
