🧼🔎 SelfClean


A holistic self-supervised data cleaning strategy to detect off-topic samples, near duplicates, and label errors.

Publications: SelfClean Paper (NeurIPS24) | Data Cleaning Protocol Paper (ML4H23@NeurIPS)

NOTE: Make sure to have git-lfs installed before pulling the repository to ensure the pre-trained models are pulled correctly (git-lfs install instructions).

This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.


Installation

Install SelfClean via PyPI:

# upgrade pip to its latest version
pip install -U pip

# install selfclean
pip install selfclean

# Alternatively, use explicit python version (XX)
python3.XX -m pip install selfclean
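To confirm that the installation succeeded, the installed version can be queried from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of SelfClean.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if selfclean is installed, otherwise a hint.
version = installed_version("selfclean")
print(version if version is not None else "selfclean is not installed")
```

If the hint is printed, check that `pip` and `python` point to the same interpreter, for example by installing with `python3.XX -m pip install selfclean` as shown above.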

Getting Started

You can run SelfClean in a few lines of code:

import copy

from selfclean import SelfClean

selfclean = SelfClean(
    # displays the top-7 images from each error type
    # this option is disabled by default
    plot_top_N=7,
)

# option 1: run on a PyTorch dataset
# (`dataset` is your own, already constructed dataset object)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# option 2: run on an image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)

# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_off_topic_samples = issues.get_issues("off_topic_samples", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
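Once a ranking is available, a typical next step is to review the top candidates and exclude the confirmed ones. The following is a rough sketch of that step, not SelfClean's own API. It assumes the ranking has been converted to a list of records (for example with pandas' `DataFrame.to_dict("records")`), that it is sorted with the most suspicious samples first, and that each record carries the sample index under a key named `indices`. That key name, the example data, and both helper functions are hypothetical.

```python
from typing import Dict, List, Sequence


def top_candidates(
    records: Sequence[Dict], k: int, index_key: str = "indices"
) -> List[int]:
    """Return the sample indices of the first k entries of a sorted ranking."""
    return [int(record[index_key]) for record in records[:k]]


def indices_to_keep(n_samples: int, flagged: Sequence[int]) -> List[int]:
    """Return all dataset indices except the flagged ones, in ascending order."""
    flagged_set = set(flagged)
    return [i for i in range(n_samples) if i not in flagged_set]


# Hypothetical ranking for a dataset with six samples (made-up values).
ranking = [
    {"indices": 4, "scores": 0.01},
    {"indices": 1, "scores": 0.07},
    {"indices": 3, "scores": 0.42},
]

# Inspect the two most suspicious samples; here we pretend both were
# confirmed as issues after manual review.
to_review = top_candidates(ranking, k=2)
kept = indices_to_keep(n_samples=6, flagged=to_review)
print(to_review, kept)
```

The resulting `kept` list can then be used to build a cleaned view of the data, for instance with `torch.utils.data.Subset(dataset, kept)`. Reviewing candidates manually before dropping them keeps a person in control of what is removed.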

Examples: In examples/, we provide example notebooks that show how to analyze and clean datasets using SelfClean. These examples analyze several benchmark datasets.

Also, check out our Kaggle notebook to see an illustration of how to get a gold medal for cleaning a competition dataset.

More Resources

Reference

If you find this repository useful for your research, please cite the following work.

@article{groger_selfclean_2024,
  title        = {{Intrinsic Self-Supervision for Data Quality Audits}},
  shorttitle   = {{SelfClean}},
  author       = {Gr\"oger, Fabian and Lionetti, Simone and Gottfrois, Philippe and Gonzalez-Jimenez, Alvaro and Amruthalingam, Ludovic and Consortium, Labelling and Groh, Matthew and Navarini, Alexander A. and Pouly, Marc},
  year         = 2024,
  month        = 12,
  journal      = {Advances in Neural Information Processing Systems (NeurIPS)},
}

Development Environment

Run make for a list of possible targets.

Run these commands to install the requirements for the development environment:

make init
make install

To run linters on all files:

pre-commit run --all-files

We use the following packages for code and test conventions:

  • black for code style
  • isort for import sorting
  • pytest for running tests
