🧼🔎 SelfClean
A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates, and label errors.
Publications: SelfClean Paper | Data Cleaning Protocol Paper (ML4H24@NeurIPS)
NOTE: Make sure git-lfs is installed before pulling the repository so that the pre-trained models are pulled correctly (git-lfs install instructions).
This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.
Installation
Install SelfClean via PyPI:
# upgrade pip to its latest version
pip install -U pip
# install selfclean
pip install selfclean
# Alternatively, use explicit python version (XX)
python3.XX -m pip install selfclean
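If several Python versions are installed, it can help to confirm that the package landed in the interpreter you intend to use. The following is a minimal sketch using only the standard library's importlib.metadata; the helper name installed_version is illustrative and not part of SelfClean.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "selfclean") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("selfclean is not installed in this environment")
    else:
        print(f"selfclean version: {found}")
```

Run it with the same interpreter you used for the install (for example python3.XX) so that the check and the install refer to the same environment.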
Getting Started
You can run SelfClean in a few lines of code:
import copy

from selfclean import SelfClean

selfclean = SelfClean()

# run on pytorch dataset (`dataset` is a PyTorch dataset you have already created)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# run on image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)
# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
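A typical next step after obtaining a ranking is to review only the top candidates by hand and leave the rest untouched. The sketch below illustrates that idea with stand-in data rather than real SelfClean output. It assumes the ranking has already been reduced to a plain list of sample indices ordered with the most suspicious samples first; that ordering, the helper name split_ranking, and the example indices are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_ranking(
    ranked_indices: Sequence[int], top_k: int
) -> Tuple[List[int], List[int]]:
    """Split a ranking into the first `top_k` indices and the remainder.

    Hypothetical helper: assumes `ranked_indices` is ordered with the most
    suspicious samples first. It does not call SelfClean itself.
    """
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    to_review = list(ranked_indices[:top_k])
    remaining = list(ranked_indices[top_k:])
    return to_review, remaining


# Stand-in ranking of sample indices (NOT real SelfClean output).
example_ranking = [7, 2, 9, 0, 4, 1, 8, 3, 6, 5]

to_review, remaining = split_ranking(example_ranking, top_k=3)
print("review by hand:", to_review)
print("left untouched:", remaining)
```

Choosing top_k is a judgment call: a small value keeps manual review cheap, while a larger one catches more issues at the cost of inspecting more clean samples.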
Examples:
In examples/, we've provided some example notebooks in which you will learn how to analyze and clean datasets using SelfClean.
These examples analyze different benchmark datasets such as:
- Imagenette 🖼️ (Open in NBViewer | GitHub | Colab)
- Oxford-IIIT Pet 🐶 (Open in NBViewer | GitHub | Colab)
Development Environment
Run make for a list of possible targets.
Run these commands to install the requirements for the development environment:
make init
make install
To run linters on all files:
pre-commit run --all-files
We use the following packages for code and test conventions:
- black for code style
- isort for import sorting
- pytest for running tests