The standard package for datacentric AI, machine learning with label errors, and automatically finding and fixing dataset issues in Python.
Project description
cleanlab automatically finds and fixes errors in any ML dataset. This datacentric AI package facilitates machine learning with messy, realworld data by providing clean labels during training.
# Cleanlab works with **any classifier**. Yup, you can use sklearn/PyTorch/TensorFlow/XGBoost/etc. cl = cleanlab.classification.CleanLearning(sklearn.YourFavoriteClassifier()) # cleanlab finds data and label issues in **any dataset**... in ONE line of code! label_issues = cl.find_label_issues(data, labels) # cleanlab trains a robust version of your model that works more reliably with noisy data. cl.fit(data, labels) # cleanlab estimates the predictions you would have gotten if you had trained with *no* label issues. cl.predict(test_data) # A true datacentric AI package, cleanlab quantifies classlevel issues and overall data quality, for any dataset. cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)
Get started with: documentation, tutorials, examples, and blogs.
 Learn how to run cleanlab on your own data in just 5 minutes!
 Quickstart with 5minute tutorials for classification with: image, text, audio, and tabular data.
News! (2022)  cleanlab made accessible for everybody, not just ML researchers (click to learn more)
 April 2022 📖 cleanlab 2.0.0 released! Lays foundations for this library to grow into a generalpurpose datacentric AI toolkit.
 March 2022 📖 Documentation migrated to new website: docs.cleanlab.ai with quickstart tutorials for image/text/audio/tabular data.
 Feb 2022 💻 APIs simplified to make cleanlab accessible for everybody, not just ML researchers
News! (2021)  cleanlab finds pervasive label errors in the most common ML datasets (click to learn more)
 Dec 2021 🎉 NeurIPS published the label errors paper (Northcutt, Athalye, & Mueller, 2021).
 Apr 2021 🎉 Journal of AI Research published the confident learning paper (Northcutt, Jiang, & Chuang, 2021).
 Mar 2021 😲 cleanlab used to find and fix label issues in 10 of the most common ML benchmark datasets, published in: NeurIPS 2021. Along with the paper (Northcutt, Athalye, & Mueller, 2021), the authors launched labelerrors.com where you can view the label issues in these datasets.
News! (2020)  cleanlab adds support for all OS, achieves stateoftheart, supports coteaching and more (click to learn more)
 Dec 2020 🎉 cleanlab supports NeurIPS workshop paper (Northcutt, Athalye, & Lin, 2020).
 Dec 2020 🤖 cleanlab supports PU learning.
 Feb 2020 🤖 cleanlab now natively supports Mac, Linux, and Windows.
 Feb 2020 🤖 cleanlab now supports CoTeaching (Han et al., 2018).
 Jan 2020 🎉 cleanlab achieves stateoftheart on CIFAR10 with noisy labels. Code to reproduce: examples/cifar10. This is a great place to see how to use cleanlab on real datasets (with predicted probabilities from trained model already precomputed for you).
Release notes for past versions are available here. Details behind certain updates are explained in our blog.
Longtime cleanlab user?
 Here's a guide on how to migrate to cleanlab 2.0.0.
So fresh, so cleanlab
cleanlab cleans your data's labels via stateoftheart confident learning algorithms, published in this paper and blog. See datasets cleaned with cleanlab at labelerrors.com. This package helps you find all the label issues lurking in your data and train more reliable ML models.
cleanlab is:
 backed by theory
 with provable guarantees of exact noise estimation and label error finding in realistic cases with imperfect models.
 fast
 Code is optimized and parallelthreaded (< 1 second to find label issues in ImageNet with precomputed probabilities).
 easytouse
 Find label issues or train noiserobust models in one line of code. By default, cleanlab requires no hyperparameters.
 general
 Works with any dataset and any model, e.g., TensorFlow, PyTorch, sklearn, xgboost, etc.
Examples of incorrect given labels in various image datasets found and corrected using cleanlab.
Run cleanlab
cleanlab supports Linux, macOS, and Windows and runs on Python 3.6+.
 Get started here! Install via
pip
orconda
as described here.  Developers who install the bleedingedge master branch from source should refer to this master version of documentation.
cleanlab core package components (click to learn more)
Many methods have default parameters not covered here. Check out the documentation for the master branch version
cleanlab Core Package Components
 cleanlab/classification.py  CleanLearning() class for learning with noisy labels.
 cleanlab/count.py  Estimates and fully characterizes all variants of label noise.
 cleanlab/filter.py  Finds the examples with label issues in a dataset.
 cleanlab/rank.py  Rank every example in a dataset with various label quality scores.
 cleanlab.dataset.py  Provides datasetlevel and classlevel overviews of issues in your dataset.
 cleanlab/benchmarking/noise_generation.py  Generate noisy labels for benchmarking, reproduction, and ML research.
Use cleanlab with any model (TensorFlow, PyTorch, sklearn, xgboost, etc.)
All features of cleanlab work with any dataset and any model. Yes, any model: scikitlearn, PyTorch, Tensorflow, Keras, JAX, HuggingFace, MXNet, XGBoost, etc. If you use a sklearncompatible classifier, cleanlab methods work outofthebox.
It’s also easy to use your favorite nonsklearncompatible model (click to learn more)
There's nothing you need to do if your model already has .fit()
, .predict()
, and .predict_proba()
methods.
Otherwise, just wrap your custom model into a Python class that inherits the sklearn.base.BaseEstimator
:
from sklearn.base import BaseEstimator class YourFavoriteModel(BaseEstimator): # Inherits sklearn base classifier def __init__(self, ): pass # ensure this reinitializes parameters for neural net models def fit(self, X, y, sample_weight=None): pass def predict(self, X): pass def predict_proba(self, X): pass def score(self, X, y, sample_weight=None): pass
This inheritance allows to apply a wide range of sklearn functionality like hyperparameteroptimization to your custom model. Now you can use your model with every method in cleanlab. Here's one example:
from cleanlab.classification import CleanLearning cl = CleanLearning(clf=YourFavoriteModel()) # has all the same methods of YourFavoriteModel cl.fit(train_data, train_labels_with_errors) cl.predict(test_data)
Want to see a working example? Here’s a compliant PyTorch MNIST CNN class
More details are provided in documentation of cleanlab.classification.CleanLearning.
Note, some libraries exist to give you sklearncompatibility for free. For PyTorch, check out the skorch Python library which will wrap your PyTorch model into a sklearncompatible model (example). For TensorFlow/Keras, check out SciKeras (example). Many libraries also already offer a special scikitlearn API, for example: XGBoost or LightGBM.
Cool cleanlab applications
Reproducing results in Confident Learning paper (click to learn more)
For additional details, check out the: confidentlearningreproduce repository.
State of the Art Learning with Noisy Labels in CIFAR
A stepbystep guide to reproduce these results is available here. This guide is also a good tutorial for using cleanlab on any large dataset. You'll need to git clone
confidentlearningreproduce which contains the data and files needed to reproduce the CIFAR10 results.
Comparison of confident learning (CL), as implemented in cleanlab, versus seven recent methods for learning with noisy labels in CIFAR10. Highlighted cells show CL robustness to sparsity. The five CL methods estimate label issues, remove them, then train on the cleaned data using CoTeaching.
Observe how cleanlab (i.e. the CL method) is robust to large sparsity in label noise whereas prior art tends to reduce in performance for increased sparsity, as shown by the red highlighted regions. This is important because realworld label noise is often sparse, e.g. a tiger is likely to be mislabeled as a lion, but not as most other classes like airplane, bathtub, and microwave.
Find label issues in ImageNet
Use cleanlab to identify ~100,000 label errors in the 2012 ILSVRC ImageNet training dataset: examples/imagenet.
Label issues in ImageNet train set found via cleanlab. Label Errors are boxed in red. Ontological issues in green. Multilabel images in blue.
Find Label Errors in MNIST
Use cleanlab to identify ~50 label errors in the MNIST dataset: examples/mnist.
Top 24 leastconfident labels in the original MNIST train dataset, algorithmically identified via cleanlab. Examples are ordered leftright, topdown by increasing selfconfidence (predicted probability that the given label is correct), denoted conf in teal. The mostlikely correct label (with largest predicted probability) is in green. Overt label errors highlighted in red.
cleanlab performance across 4 data distributions and 9 classifiers (click to learn more)
cleanlab is a general tool that can learn with noisy labels regardless of dataset distribution or classifier type: examples/classifier_comparison.
Each subfigure above depicts the decision boundary learned using cleanlab.classification.CleanLearning in the presence of extreme (~35%) label errors (circled in green). Label noise is classconditional (not uniformly random). Columns are organized by the classifier used, except the leftmost column which depicts the groundtruth data distribution. Rows are organized by dataset.
Each subfigure depicts accuracy scores on a test set (with correct nonnoisy labels) as decimal values:
 LEFT (in black): The classifier test accuracy trained with perfect labels (no label errors).
 MIDDLE (in blue): The classifier test accuracy trained with noisy labels using cleanlab.
 RIGHT (in white): The baseline classifier test accuracy trained with noisy labels.
As an example, the table below is the noise matrix (noisy channel) *P(s  y) characterizing the label noise for the first dataset row in the figure. s represents the observed noisy labels and y represents the latent, true labels. The trace of this matrix is 2.6. A trace of 4 implies no label noise. A cell in this matrix is read like: "Around 38% of true underlying '3' labels were randomly flipped to '2' labels in the observed dataset."
p(label︱y) 
y=0  y=1  y=2  y=3 

label=0  0.55  0.01  0.07  0.06 
label=1  0.22  0.87  0.24  0.02 
label=2  0.12  0.04  0.64  0.38 
label=3  0.11  0.08  0.05  0.54 
ML research using cleanlab (click to learn more)
Researchers may find some components of this package useful for evaluating algorithms for ML with noisy labels. For additional details/notation, refer to the Confident Learning paper.
Methods to Standardize Research with Noisy Labels
cleanlab supports a number of functions to generate noise for benchmarking and standardization in research. This next example shows how to generate valid, classconditional, uniformly random noisy channel matrices:
# Generate a valid (necessary conditions for learnability are met) noise matrix for any trace > 1 from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace noise_matrix=generate_noise_matrix_from_trace( K=number_of_classes, trace=float_value_greater_than_1_and_leq_K, py=prior_of_y_actual_labels_which_is_just_an_array_of_length_K, frac_zero_noise_rates=float_from_0_to_1_controlling_sparsity, ) # Check if a noise matrix is valid (necessary conditions for learnability are met) from cleanlab.benchmarking.noise_generation import noise_matrix_is_valid is_valid=noise_matrix_is_valid( noise_matrix, prior_of_y_which_is_just_an_array_of_length_K, )
For a given noise matrix, this example shows how to generate noisy labels. Methods can be seeded for reproducibility.
# Generate noisy labels using the noise_marix. Guarantees exact amount of noise in labels. from cleanlab.benchmarking.noise_generation import generate_noisy_labels s_noisy_labels = generate_noisy_labels(y_hidden_actual_labels, noise_matrix) # This package is a full of other useful methods for learning with noisy labels. # The tutorial stops here, but you don't have to. Inspect method docstrings for full docs.
cleanlab for advanced users (click to learn more)
Many methods and their default parameters are not covered here. Check out the documentation for the master branch version for the full suite of features supported by the cleanlab API.
Use any custom model's predicted probabilities to find label errors in 1 line of code
pred_probs (num_examples x num_classes matrix of predicted probabilities) should already be computed on your own, with any classifier. pred_probs must be obtained in a holdout/outofsample manner (e.g. via crossvalidation).
 cleanlab can do this for you via
cleanlab.count.estimate_cv_predicted_probabilities
]  Tutorial with more info: [here]
 Examples how to compute pred_probs with: [CNN image classifier (PyTorch)], [NN text classifier (TensorFlow)]
# label issues are ordered by likelihood of being an error. First index is most likely error. from cleanlab.filter import find_label_issues ordered_label_issues = find_label_issues( # One line of code! labels=numpy_array_of_noisy_labels, pred_probs=numpy_array_of_predicted_probabilities, return_indices_ranked_by='normalized_margin', # Orders label issues )
Precomputed outofsample predicted probabilities for CIFAR10 train set are available: here.
Fully characterize label noise and uncertainty in your dataset.
s denotes a random variable that represents the observed, noisy label and y denotes a random variable representing the hidden, actual labels. Both s and y take any of the m classes as values. The cleanlab package supports different levels of granularity for computation depending on the needs of the user. Because of this, we support multiple alternatives, all no more than a few lines, to estimate these latent distribution arrays, enabling the user to reduce computation time by only computing what they need to compute, as seen in the examples below.
Throughout these examples, you’ll see a variable called confident_joint. The confident joint is an m x m matrix (m is the number of classes) that counts, for every observed, noisy class, the number of examples that confidently belong to every latent, hidden class. It counts the number of examples that we are confident are labeled correctly or incorrectly for every pair of observed and unobserved classes. The confident joint is an unnormalized estimate of the completeinformation latent joint distribution, Ps,y.
The label flipping rates are denoted P(s  y), the inverse rates are P(y  s), and the latent prior of the unobserved, true labels, p(y).
Most of the methods in the cleanlab package start by first estimating the confident_joint. You can learn more about this in the confident learning paper.
Option 1: Compute the confident joint and predicted probs first. Stop if that’s all you need.
from cleanlab.count import estimate_latent from cleanlab.count import estimate_confident_joint_and_cv_pred_proba # Compute the confident joint and the n x m predicted probabilities matrix (pred_probs), # for n examples, m classes. Stop here if all you need is the confident joint. confident_joint, pred_probs = estimate_confident_joint_and_cv_pred_proba( X=X_train, labels=train_labels_with_errors, clf=logreg(), # default, you can use any classifier ) # Estimate latent distributions: p(y) as est_py, P(sy) as est_nm, and P(ys) as est_inv est_py, est_nm, est_inv = estimate_latent( confident_joint, labels=train_labels_with_errors, )
Option 2: Estimate the latent distribution matrices in a single line of code.
from cleanlab.count import estimate_py_noise_matrices_and_cv_pred_proba est_py, est_nm, est_inv, confident_joint, pred_probs = estimate_py_noise_matrices_and_cv_pred_proba( X=X_train, labels=train_labels_with_errors, )
Option 3: Skip computing the predicted probabilities if you already have them.
# Already have pred_probs? (n x m matrix of predicted probabilities) # For example, you might get them from a pretrained model (like resnet on ImageNet) # With the cleanlab package, you estimate directly with pred_probs. from cleanlab.count import estimate_py_and_noise_matrices_from_probabilities est_py, est_nm, est_inv, confident_joint = estimate_py_and_noise_matrices_from_probabilities( labels=train_labels_with_errors, pred_probs=pred_probs, )
Completely characterize label noise in a dataset:
The joint probability distribution of noisy and true labels, P(s,y), completely characterizes label noise with a classconditional m x m matrix.
from cleanlab.count import estimate_joint joint = estimate_joint( labels=noisy_labels, pred_probs=probabilities, confident_joint=None, # Provide if you have it already )
PositiveUnlabeled learning with cleanlab (click to learn more)
PositiveUnlabeled (PU) learning (in which your data only contains a few positively labeled examples with the rest unlabeled) is just a special case of CleanLearning when one of the classes has no error. P
stands for the positive class and is assumed to have zero label errors and U
stands for unlabeled data, but in practice, we just assume the U
class is a noisy negative class that actually contains some positive examples. Thus, the goal of PU learning is to (1) estimate the proportion of negatively labeled examples that actually belong to the positive class (seefraction\_noise\_in\_unlabeled\_class
in the last example), (2) find the errors (see last example), and (3) train on clean data (see first example below). cleanlab does all three, taking into account that there are no label errors in whichever class you specify as positive.
There are two ways to use cleanlab for PU learning. We'll look at each here.
Method 1. If you are using the cleanlab classifier CleanLearning(), and your dataset has exactly two classes (positive = 1, and negative = 0), PU learning is supported directly in cleanlab. You can perform PU learning like this:
from cleanlab.classification import CleanLearning from sklearn.linear_model import LogisticRegression # Wrap around any classifier. Yup, you can use sklearn/pyTorch/TensorFlow/FastText/etc. pu_class = 0 # Should be 0 or 1. Label of class with NO ERRORS. (e.g., P class in PU) cl = CleanLearning(clf=LogisticRegression(), pulearning=pu_class) cl.fit(X=X_train_data, labels=train_noisy_labels) # Estimate the predictions you would have gotten by training with *no* label errors. predicted_test_labels = cl.predict(X_test)
Method 2. However, you might be using a more complicated classifier that doesn't work well with CleanLearning (see this example for CIFAR10). Or you might have 3 or more classes. Here's how to use cleanlab for PU learning in this situation. To let cleanlab know which class has no error (in standard PU learning, this is the P class), you need to set the threshold for that class to 1 (1 means the probability that the labels of that class are correct is 1, i.e. that class has no error). Here's the code:
import numpy as np # K is the number of classes in your dataset # pred_probs are the crossvalidated predicted probabilities. # s is the array/list/iterable of noisy labels # pu_class is a 0based integer for the class that has no label errors. thresholds = np.asarray([np.mean(pred_probs[:, k][s == k]) for k in range(K)]) thresholds[pu_class] = 1.0
Now you can use cleanlab however you were before. Just be sure to pass in thresholds
as a parameter wherever it applies. For example:
# Uncertainty quantification (characterize the label noise # by estimating the joint distribution of noisy and true labels) cj = compute_confident_joint(s, pred_probs, thresholds=thresholds, ) # Now the noise (cj) has been estimated taking into account that some class(es) have no error. # We can use cj to find label errors like this: indices_of_label_issues = find_label_issues(s, pred_probs, confident_joint=cj, ) # In addition to label issues, cleanlab can find the fraction of noise in the unlabeled class. # First we need the inv_noise_matrix which contains P(ys) (proportion of mislabeling). _, _, inv_noise_matrix = estimate_latent(confident_joint=cj, labels=s, ) # Because inv_noise_matrix contains P(ys), p (y = anything  labels = pu_class) should be 0 # because the prob(true label is something else  example is in pu_class) is 0. # What's more interesting is p(y = anything  s is not put_class), or in the binary case # this translates to p(y = pu_class  s = 1  pu_class) because pu_class is 0 or 1. # So, to find the fraction_noise_in_unlabeled_class, for binary, you just compute: fraction_noise_in_unlabeled_class = inv_noise_matrix[pu_class][1  pu_class]
Now that you have indices_of_label_errors
, you can remove those label issues and train on clean data (or only remove some of the label issues and iteratively use confident learning / cleanlab to improve results).
Citation and related publications
cleanlab is based on peerreviewed research. Here are the relevant papers to cite if you use this package:
Confident Learning (JAIR '21) (click to show bibtex)
@article{northcutt2021confidentlearning,
title={Confident Learning: Estimating Uncertainty in Dataset Labels},
author={Curtis G. Northcutt and Lu Jiang and Isaac L. Chuang},
journal={Journal of Artificial Intelligence Research (JAIR)},
volume={70},
pages={13731411},
year={2021}
}
Rank Pruning (UAI '17) (click to show bibtex)
@inproceedings{northcutt2017rankpruning,
author={Northcutt, Curtis G. and Wu, Tailin and Chuang, Isaac L.},
title={Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels},
booktitle = {Proceedings of the ThirtyThird Conference on Uncertainty in Artificial Intelligence},
series = {UAI'17},
year = {2017},
location = {Sydney, Australia},
numpages = {10},
url = {http://auai.org/uai2017/proceedings/papers/35.pdf},
publisher = {AUAI Press},
}
Other resources
Join our community

The best place to learn is our Slack community.

Have ideas for the future of cleanlab? How are you using cleanlab? Join the discussion.

Have code improvements for cleanlab? See the development guide and submit a pull request.

Have an issue with cleanlab? Search existing issues or submit a new issue.
License
Copyright (c) 20172022 Cleanlab Inc.
cleanlab is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
cleanlab is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See GNU Affero General Public LICENSE for details.
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for cleanlab2.0.0py2.py3noneany.whl
Algorithm  Hash digest  

SHA256  4c1314c2fcaac181a4343984bdc8a91366ae6167120b660595d5b734aef9e872 

MD5  e702793c33c29d6ad4f65cf9cf934c00 

BLAKE2256  e4bab040cab0e7b6d84614b934650fe34de8346158ccfb925577db6316d4496a 