🧼🔎 SelfClean
A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates, and label errors.
Publications: SelfClean Paper | Data Cleaning Protocol Paper (ML4H24@NeurIPS)
NOTE: Make sure to have git-lfs installed before pulling the repository to ensure the pre-trained models are pulled correctly (git-lfs install instructions).
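If the repository was cloned without git-lfs, large files are typically left behind as small text stubs (pointer files) rather than the real model weights. The following is a minimal sketch of how one might check for that from Python. The helper names `git_lfs_available` and `is_lfs_pointer` are made up for this illustration and are not part of SelfClean; the pointer prefix is the first line that the Git LFS pointer format uses.

```python
import shutil
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def git_lfs_available() -> bool:
    """Return True if a git-lfs executable is found on PATH (hypothetical helper)."""
    return shutil.which("git-lfs") is not None


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file looks like an un-downloaded Git LFS pointer stub."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


if __name__ == "__main__":
    print("git-lfs found on PATH:", git_lfs_available())
```

Calling `is_lfs_pointer` on a model file in your local clone and getting `True` would suggest that the file was not fetched; running `git lfs pull` after installing git-lfs normally resolves this.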
This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.
Installation
Install SelfClean via PyPI:
# upgrade pip to its latest version
pip install -U pip
# install selfclean
pip install selfclean
# Alternatively, use explicit python version (XX)
python3.XX -m pip install selfclean
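To confirm that the installation landed in the interpreter you intend to use, a quick check along these lines may help. This is only a sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption for illustration and is not provided by SelfClean.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("selfclean")
print(f"selfclean version: {found}" if found else "selfclean is not installed")
```

If the package is reported as missing even though `pip install` succeeded, `pip` and `python` probably point at different environments, which is the situation the explicit `python3.XX -m pip` form is meant to avoid.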
Getting Started
You can run SelfClean in a few lines of code:
import copy

from selfclean import SelfClean

selfclean = SelfClean()
# run on pytorch dataset (`dataset` is a PyTorch dataset you have already created)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# run on image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)
# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
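One plausible way to act on such a ranking is to inspect only its first few entries by hand and then exclude the samples that were confirmed as problematic. The sketch below illustrates that idea with plain Python. The `(index, score)` tuples, their ordering, and the helper names `top_candidates` and `drop_confirmed` are assumptions for illustration; the structure actually returned by `get_issues` is not shown here, so the field access would need to be adapted to it.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical stand-in for a ranking: (sample index, score) pairs,
# assumed to be ordered so that the most suspicious samples come first.
Ranking = Sequence[Tuple[int, float]]


def top_candidates(ranking: Ranking, k: int) -> List[int]:
    """Return the sample indices of the first k entries of a ranking."""
    return [idx for idx, _score in ranking[:k]]


def drop_confirmed(indices: Iterable[int], confirmed: Iterable[int]) -> List[int]:
    """Return the indices with manually confirmed problem samples removed."""
    confirmed_set = set(confirmed)
    return [i for i in indices if i not in confirmed_set]


# Made-up ranking for a six-sample dataset (illustrative values only).
ranking = [(4, 0.02), (1, 0.10), (5, 0.35), (0, 0.60), (2, 0.80), (3, 0.95)]

to_review = top_candidates(ranking, k=2)
# Suppose a human reviewer confirms that only sample 4 is a real issue.
clean_indices = drop_confirmed(range(6), confirmed=[4])
print(to_review, clean_indices)
```

The remaining indices could then be passed to `torch.utils.data.Subset` to build a cleaned view of the original dataset without modifying the files on disk.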
Examples:
In examples/, we've provided some example notebooks in which you will learn how to analyze and clean datasets using SelfClean.
These examples analyze different benchmark datasets such as:
- Imagenette 🖼️ (Open in NBViewer | GitHub | Colab)
- Oxford-IIIT Pet 🐶 (Open in NBViewer | GitHub | Colab)
Development Environment
Run make for a list of possible targets.
Run these commands to install the requirements for the development environment:
make init
make install
To run linters on all files:
pre-commit run --all-files
We use the following packages for code and test conventions:
- black for code style
- isort for import sorting
- pytest for running tests