🧼🔎 SelfClean
A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates, and label errors.
Publications: SelfClean Paper (NeurIPS24) | Data Cleaning Protocol Paper (ML4H23@NeurIPS)
NOTE: Make sure git-lfs is installed before pulling the repository so that the pre-trained models are pulled correctly (git-lfs install instructions).
This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.
Installation
Install SelfClean via PyPI:
# upgrade pip to its latest version
pip install -U pip
# install selfclean
pip install selfclean
# Alternatively, use explicit python version (XX)
python3.XX -m pip install selfclean
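To confirm that the install landed in the interpreter you intend to use, a minimal sketch with the standard library's `importlib.metadata` may help. The helper name `installed_version` is hypothetical and not part of SelfClean.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the installed version string, or None if the package is missing.
    print(installed_version("selfclean"))
```

If this prints `None`, the package was probably installed for a different Python interpreter than the one running the script.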
Getting Started
You can run SelfClean in a few lines of code:
import copy

from selfclean import SelfClean

selfclean = SelfClean()

# run on a PyTorch dataset (`dataset` is your own, already-constructed dataset)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# or run on an image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)
# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
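Once you have a ranking, a common next step is to review the top candidates by hand and drop the confirmed ones. The sketch below illustrates that step on plain Python lists. It assumes the ranking is ordered with the most suspicious samples first and that each entry is a position in the dataset. The helpers `top_k` and `keep_indices` and the example values are hypothetical and not part of SelfClean. How to pull indices out of the returned DataFrames is not shown, since the column names are not stated here.

```python
from typing import Iterable, List, Sequence


def top_k(ranked_indices: Sequence[int], k: int) -> List[int]:
    """Return the first k entries of a ranking for manual review."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return list(ranked_indices[:k])


def keep_indices(n_samples: int, confirmed_issues: Iterable[int]) -> List[int]:
    """Return the dataset positions that remain after dropping confirmed issues."""
    drop = set(confirmed_issues)
    return [i for i in range(n_samples) if i not in drop]


if __name__ == "__main__":
    # Made-up ranking for illustration only (assumed: most suspicious first).
    ranking = [7, 2, 9, 4, 0]
    to_review = top_k(ranking, 2)
    # Suppose both reviewed samples were confirmed as issues.
    cleaned = keep_indices(10, to_review)
    print(to_review, cleaned)
```

The resulting list of positions could then be passed to something like `torch.utils.data.Subset` to build a cleaned view of a PyTorch dataset.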
Examples:
In examples/, we've provided some example notebooks in which you will learn how to analyze and clean datasets using SelfClean.
These examples analyze different benchmark datasets such as:
- Imagenette 🖼️ (Open in NBViewer | GitHub | Colab)
- Oxford-IIIT Pet 🐶 (Open in NBViewer | GitHub | Colab)
Also, check out our Kaggle notebook to see an illustration of how to get a gold medal for cleaning a competition dataset.
Development Environment
Run make for a list of possible targets.
Run these commands to install the requirements for the development environment:
make init
make install
To run linters on all files:
pre-commit run --all-files
We use the following packages for code and test conventions:
- black for code style
- isort for import sorting
- pytest for running tests