🧼🔎 SelfClean

A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates, and label errors.

Publications: SelfClean Paper (NeurIPS24) | Data Cleaning Protocol Paper (ML4H23@NeurIPS)

NOTE: Make sure to have git-lfs installed before pulling the repository to ensure the pre-trained models are pulled correctly (git-lfs install instructions).

This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.

Installation

Install SelfClean via PyPI:

# upgrade pip to its latest version
pip install -U pip

# install selfclean
pip install selfclean

# Alternatively, use explicit python version (XX)
python3.XX -m pip install selfclean
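
To confirm that the install landed in the interpreter you intend to use, you can query the package metadata from Python. This is a minimal sketch that uses only the standard library; the helper name `installed_version` is made up for illustration and is not part of SelfClean.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not present in the current environment.
        return None


if __name__ == "__main__":
    found = installed_version("selfclean")
    if found is None:
        print("selfclean is not installed in this environment")
    else:
        print(f"selfclean {found} is installed")
```

Run it with the same interpreter you used for `pip` (for example `python3.XX`), since each Python installation keeps its own set of packages.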

Getting Started

You can run SelfClean in a few lines of code:

import copy

from selfclean import SelfClean

selfclean = SelfClean(
    # display the top-7 images for each issue type
    # (plotting is disabled by default)
    plot_top_N=7,
)

# run on a PyTorch dataset (`dataset` must already be defined)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# or run on an image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)

# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
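
Once you have a ranking, a common next step is to shortlist the most suspicious samples for manual review. The sketch below shows one way to do that on plain Python records standing in for the rows of a returned DataFrame. The field names `index` and `score`, the convention that a lower score means a more likely issue, and the helper `top_candidates` are all assumptions made for illustration, not SelfClean's documented schema; inspect the columns of the actual DataFrame before adapting it.

```python
from typing import Any, Dict, List


def top_candidates(
    records: List[Dict[str, Any]], k: int, score_key: str = "score"
) -> List[Dict[str, Any]]:
    """Return the k records with the lowest value under `score_key`.

    Assumes (hypothetically) that lower scores mark more likely issues.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return sorted(records, key=lambda record: record[score_key])[:k]


# Stand-in rows; with a real DataFrame these could come from
# df.to_dict("records"), provided the assumed columns exist.
records = [
    {"index": 3, "score": 0.02},
    {"index": 7, "score": 0.40},
    {"index": 1, "score": 0.85},
    {"index": 9, "score": 0.10},
]

shortlist = top_candidates(records, k=2)
print([record["index"] for record in shortlist])
```

Reviewing a fixed-size shortlist keeps the manual effort bounded; you can raise `k` step by step until the inspected samples stop looking problematic.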

Examples: The notebooks in examples/ show how to analyze and clean datasets using SelfClean, applied to several benchmark datasets.

Also, check out our Kaggle notebook to see an illustration of how to get a gold medal for cleaning a competition dataset.

Development Environment

Run make for a list of possible targets.

Run these commands to install the requirements for the development environment:

make init
make install

To run linters on all files:

pre-commit run --all-files

We use the following packages for code and test conventions:

  • black for code style
  • isort for import sorting
  • pytest for running tests
