🧼🔎 SelfClean
A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates, and label errors.
Publications: SelfClean Paper (NeurIPS24) | Data Cleaning Protocol Paper (ML4H23@NeurIPS)
NOTE: Make sure to have git-lfs installed before pulling the repository to ensure the pre-trained models are pulled correctly (git-lfs install instructions).
This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.
Installation
Install SelfClean via PyPI:
# upgrade pip to its latest version
pip install -U pip
# install selfclean
pip install selfclean
# Alternatively, use explicit python version (XX)
python3.XX -m pip install selfclean
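To confirm that the install landed in the interpreter you intend to use, a small check like the following can help. This is a minimal sketch that relies only on the standard library; the helper name `installed_version` is hypothetical and not part of SelfClean.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("selfclean")
    if version is None:
        print("selfclean is not installed in this environment")
    else:
        print(f"selfclean version: {version}")
```

Run it with the same interpreter you used for `pip install` (for example `python3.XX`), since each interpreter has its own set of installed packages.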
Getting Started
You can run SelfClean in a few lines of code:
import copy

from selfclean import SelfClean

selfclean = SelfClean(
    # displays the top-7 images from each error type
    # this option is disabled by default
    plot_top_N=7,
)

# run on a PyTorch dataset (`dataset` is your own, already constructed dataset)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# or run on an image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)

# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
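Once you have a ranking, one possible next step is to exclude the highest-ranked candidates from the dataset. The sketch below shows that idea on plain Python lists. It assumes the ranking is ordered with the most likely issues first; the helper `indices_to_keep` and the example ranking are hypothetical and not part of SelfClean.

```python
from typing import List, Sequence


def indices_to_keep(
    n_samples: int, ranked_indices: Sequence[int], top_k: int
) -> List[int]:
    """Drop the `top_k` highest-ranked candidates and return the remaining indices.

    Assumes `ranked_indices` is ordered with the most likely issues first.
    """
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    flagged = set(ranked_indices[:top_k])
    return [i for i in range(n_samples) if i not in flagged]


# Hypothetical ranking for a dataset of 6 samples (most suspicious first).
ranked = [4, 1, 5, 0, 2, 3]
kept = indices_to_keep(n_samples=6, ranked_indices=ranked, top_k=2)
print(kept)  # [0, 2, 3, 5]
```

To feed a real ranking into this helper, extract the sample indices from the returned DataFrame. The column that holds them is not documented here, so inspect `df_irrelevants.columns` before relying on a name. Reviewing the top candidates by eye before removing anything is usually the safer choice.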
Examples:
In examples/, we've provided some example notebooks in which you will learn how to analyze and clean datasets using SelfClean.
These examples analyze different benchmark datasets such as:
- Imagenette 🖼️ (Open in NBViewer | GitHub | Colab)
- Oxford-IIIT Pet 🐶 (Open in NBViewer | GitHub | Colab)
Also, check out our Kaggle notebook to see an illustration of how to get a gold medal for cleaning a competition dataset.
Development Environment
Run make for a list of possible targets.
Run these commands to install the requirements for the development environment:
make init
make install
To run linters on all files:
pre-commit run --all-files
We use the following packages for code and test conventions:
- black for code style
- isort for import sorting
- pytest for running tests
Download files
Source Distribution
Details for the file selfclean-0.0.31.tar.gz.
File metadata
- Download URL: selfclean-0.0.31.tar.gz
- Size: 108.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.20
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 8881ae96986d8753e790f24ea5649fff8f0f7e7fa1e1673828cca4470096f8a9 |
MD5 | db9c33392dca7b596216d99a2dd1d0d9 |
BLAKE2b-256 | 6fd63d57f7f5a4f4f4ca714d1c0dbb0f3c5ff6283889484c8f8ad0a6c1097354 |
Built Distribution
Details for the file selfclean-0.0.31-py3-none-any.whl.
File metadata
- Download URL: selfclean-0.0.31-py3-none-any.whl
- Size: 172.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.20
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 7e375279be98c754633a53fed71eeeba1d62b7b8e419094ec389fc262f902d44 |
MD5 | ea65f86b0501dd3d30e44566c65f6cdc |
BLAKE2b-256 | 1de9dda926e9ba672b1076404518911fd9c63b40206f3ab594d51e630428806d |
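If you download one of these files manually, you can compare its SHA256 digest against the value listed above. The following is a minimal sketch using only the standard library; the helper names `sha256_of_file` and `matches_digest` are hypothetical, and the local file path is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals `expected_hex`."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example usage, assuming the source distribution sits in the current directory:
# matches_digest(
#     Path("selfclean-0.0.31.tar.gz"),
#     "8881ae96986d8753e790f24ea5649fff8f0f7e7fa1e1673828cca4470096f8a9",
# )
```

Reading in chunks keeps memory use flat regardless of file size. Note that `pip` can also enforce hashes during installation when they are pinned in a requirements file.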