Find issues in image datasets
CleanVision automatically detects potential issues in image datasets, such as images that are blurry, under/over-exposed, (near) duplicates, etc. This data-centric AI package is a quick first step for any computer vision project: it finds problems in the dataset that you want to address before applying machine learning. CleanVision is simple to use: the same few lines of Python code audit any image dataset!
Installation
pip install cleanvision
Quickstart
- Download an example dataset (optional). Or just use any collection of image files you have.
wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'
- Run CleanVision to audit the images.
from cleanvision import Imagelab
# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")
# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()
# Produce a neat report of the issues found in your dataset
imagelab.report()
- CleanVision diagnoses many types of issues, but you can also check for only specific issues.
issue_types = {"dark": {}, "blurry": {}}
imagelab.find_issues(issue_types=issue_types)
# Produce a report with only the specified issue_types
imagelab.report(issue_types=issue_types)
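The `issue_types` argument is a dictionary keyed by issue key, with a (possibly empty) settings dictionary per key. As a minimal sketch, the helper below builds such a dictionary from a list of keys and rejects typos before any images are processed. `build_issue_types` and `KNOWN_ISSUE_KEYS` are hypothetical names for illustration, not part of CleanVision; the keys are copied from the issue table in this README.

```python
# Issue keys as listed in this README's issue table.
KNOWN_ISSUE_KEYS = frozenset({
    "exact_duplicates", "near_duplicates", "blurry", "low_information",
    "dark", "light", "grayscale", "odd_aspect_ratio", "odd_size",
})


def build_issue_types(keys):
    """Map each requested issue key to an empty settings dict.

    Hypothetical helper (not part of CleanVision): raises ValueError on
    unknown keys so that typos fail early.
    """
    keys = list(keys)
    unknown = sorted(set(keys) - KNOWN_ISSUE_KEYS)
    if unknown:
        raise ValueError(f"Unknown issue keys: {unknown}")
    return {key: {} for key in keys}


issue_types = build_issue_types(["dark", "blurry"])
print(issue_types)  # {'dark': {}, 'blurry': {}}
```

The resulting dictionary can be passed as `imagelab.find_issues(issue_types=issue_types)` in the same way as the hand-written one.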
More resources
- Tutorial
- Documentation
- Blog
- Run CleanVision on a HuggingFace dataset
- Run CleanVision on a Torchvision dataset
- Example script that can be run with: `python examples/run.py --path <FOLDER_WITH_IMAGES>`
- Additional example notebooks
- FAQ
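The example script listed under More resources is invoked with a `--path` argument. A minimal sketch of a comparable command-line wrapper follows. It is not the repository's actual `examples/run.py`; `parse_args` and `main` are hypothetical names, and the only CleanVision calls used are the ones shown in the Quickstart.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --path is the folder of images to audit."""
    parser = argparse.ArgumentParser(
        description="Audit a folder of images with CleanVision."
    )
    parser.add_argument(
        "--path", required=True, type=Path,
        help="Folder containing the image files",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    if not args.path.is_dir():
        raise SystemExit(f"Not a directory: {args.path}")

    # Imported lazily so that argument errors are reported even when
    # cleanvision is not installed.
    from cleanvision import Imagelab

    imagelab = Imagelab(data_path=str(args.path))
    imagelab.find_issues()
    imagelab.report()
```

Saved as a script, this would end with a `main()` call under the usual `if __name__ == "__main__":` guard.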
Clean your data for better Computer Vision
The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. CleanVision helps you automatically identify common types of data issues lurking in image datasets.
This package currently detects issues in the raw images themselves, making it a useful tool for any computer vision task such as: classification, segmentation, object detection, pose estimation, keypoint detection, generative modeling, etc. To detect issues in the labels of your image data, you can instead use the cleanlab package.
In any collection of image files (most formats supported), CleanVision can detect the following types of issues:
| | Issue Type | Description | Issue Key |
|---|---|---|---|
| 1 | Exact Duplicates | Images that are identical to each other | exact_duplicates |
| 2 | Near Duplicates | Images that are visually almost identical | near_duplicates |
| 3 | Blurry | Images where details are fuzzy (out of focus) | blurry |
| 4 | Low Information | Images lacking content (little entropy in pixel values) | low_information |
| 5 | Dark | Irregularly dark images (underexposed) | dark |
| 6 | Light | Irregularly bright images (overexposed) | light |
| 7 | Grayscale | Images lacking color | grayscale |
| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide) | odd_aspect_ratio |
| 9 | Odd Size | Images that are abnormally large or small compared to the rest of the dataset | odd_size |
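To make the table more concrete, here is a rough sketch of how four of these checks could be approximated with only the Python standard library. This is not CleanVision's implementation: all function names are hypothetical, byte-level hashing is a stricter notion of "identical" than comparing decoded pixels, and the pixel helpers assume the image has already been decoded into a flat list of 8-bit grayscale values.

```python
import hashlib
import math
from collections import Counter, defaultdict
from pathlib import Path


def group_exact_duplicates(folder):
    """Group files whose bytes are identical, using SHA-256 digests."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    # Keep only digests shared by more than one file.
    return [names for names in groups.values() if len(names) > 1]


def mean_brightness(pixels):
    """Mean of 8-bit grayscale values scaled to [0, 1]; low means dark."""
    return sum(pixels) / (255 * len(pixels))


def pixel_entropy(pixels):
    """Shannon entropy in bits of the pixel-value histogram."""
    counts = Counter(pixels)
    total = len(pixels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def aspect_ratio_score(width, height):
    """Shorter side divided by longer side: 1.0 is square, near 0 is skinny."""
    return min(width, height) / max(width, height)


print(mean_brightness([5, 10, 0, 20]) < 0.1)  # True
print(pixel_entropy([0, 0, 255, 255]))        # 1.0
print(aspect_ratio_score(1000, 100))          # 0.1
```

Any cut-off applied to these scores (for example, treating a mean brightness below 0.1 as dark) is an assumption for illustration and would need tuning per dataset.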
CleanVision supports Linux, macOS, and Windows and runs on Python 3.10+. Learn more from our blog.
Community
- Interested in contributing? See the contributing guide. An easy starting point is to consider issues marked good first issue.
- Ready to start adding your own code? See the development guide.
- Have an issue? Search existing issues or submit a new issue.