Find label errors in datasets, weak supervision, and learning with noisy labels. Works for all datasets and models.
cleanlab is a machine learning Python package for learning with noisy labels and finding label errors in datasets. cleanlab CLEANs LABels. It is powered by the theory of confident learning, published in this paper and explained in this blog. Using the confidentlearning-reproduce repo, cleanlab v0.1.0 reproduces results in the CL paper.
cleanlab documentation is available in this blog post.
Past release notes and planned future features are available here.
So fresh, so cleanlab
cleanlab finds and cleans label errors in any dataset using state-of-the-art algorithms to find label errors, characterize label noise, and learn in spite of it. cleanlab is fast: it’s built on optimized algorithms and parallelized across CPU threads automatically. cleanlab implements the family of theory and algorithms called confident learning, with provable guarantees of exact noise estimation and label error finding in realistic cases, even when model output probabilities are noisy or imperfect. By default, cleanlab requires no hyperparameters.
How does confident learning work? See: TUTORIAL: confident learning with just numpy and for-loops.
cleanlab supports multi-label, multi-class, sparse matrices, and more.
cleanlab is:
- fast: Single-shot, non-iterative, parallelized algorithms (e.g. < 1 second to find label errors in ImageNet)
- robust: Provable generalization and risk minimization guarantees, including imperfect probability estimation.
- general: Works with any probabilistic classifier: PyTorch, Tensorflow, MXNet, Caffe2, scikit-learn, etc.
- unique: The only package for multi-class learning with noisy labels or finding label errors for any dataset / classifier.
Find label errors with PyTorch, Tensorflow, MXNet, etc. in 1 line of code.
    # Compute psx (n x m matrix of predicted probabilities) on your own, with any classifier.
    # Be sure you compute probs in a holdout/out-of-sample manner (e.g. cross-validation).
    # Now getting label errors is trivial with cleanlab... it's one line of code.
    # Label errors are ordered by likelihood of being an error. First index is most likely error.
    from cleanlab.pruning import get_noise_indices

    ordered_label_errors = get_noise_indices(
        s=numpy_array_of_noisy_labels,
        psx=numpy_array_of_predicted_probabilities,
        sorted_index_method='normalized_margin',  # Orders label errors
    )
Precomputed out-of-sample predicted probabilities for the CIFAR-10 train set are available here: [LINK].
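If you need to compute out-of-sample predicted probabilities yourself, K-fold cross-validation is one common approach. The sketch below uses scikit-learn's `cross_val_predict` with `method='predict_proba'`; the synthetic dataset, the choice of `LogisticRegression`, and the variable names are illustrative assumptions and are not part of cleanlab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic 3-class data standing in for your features and (possibly noisy) labels.
X, s = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Each row of psx is predicted by a model that never saw that example during training.
psx = cross_val_predict(
    LogisticRegression(max_iter=1000), X, s, cv=5, method='predict_proba'
)
print(psx.shape)  # (300, 3)
```

The resulting `psx` array has one row per example and one column per class, which is the n x m shape described above.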
Learning with noisy labels in 3 lines of code!
    from cleanlab.classification import LearningWithNoisyLabels
    from sklearn.linear_model import LogisticRegression

    # Wrap around any classifier. Yup, you can use sklearn/PyTorch/Tensorflow/FastText/etc.
    lnl = LearningWithNoisyLabels(clf=LogisticRegression())
    lnl.fit(X=X_train_data, s=train_noisy_labels)
    # Estimate the predictions you would have gotten by training with *no* label errors.
    predicted_test_labels = lnl.predict(X_test)
Check out these examples and tests (includes how to use PyTorch, FastText, etc.).
Installation
Python 2.7, 3.4, 3.5, and 3.6 are supported.
Stable release:
$ pip install cleanlab
Developer (unstable) release:
$ pip install git+https://github.com/cgnorthcutt/cleanlab.git
To install the codebase (enabling you to make modifications):
$ conda update pip # if you use conda
$ git clone https://github.com/cgnorthcutt/cleanlab.git
$ cd cleanlab
$ pip install -e .
Reproducing Results in confident learning paper
See cleanlab/examples. You’ll need to git clone confidentlearning-reproduce, which contains the data and files needed to reproduce the CIFAR-10 results.
cleanlab: Find Label Errors in ImageNet
Use cleanlab to identify ~100,000 label errors in the 2012 ImageNet training dataset.
Top label issues in the 2012 ILSVRC ImageNet train set identified using cleanlab. Label errors are boxed in red. Ontological issues in green. Multi-label images in blue.
cleanlab: Find Label Errors in MNIST
Use cleanlab to identify ~50 label errors in the MNIST dataset.
Label errors of the original MNIST train dataset identified algorithmically using cleanlab. Depicts the 24 least confident labels, ordered left-right, top-down by increasing self-confidence (probability of belonging to the given label), denoted conf in teal. The label with the largest predicted probability is in green. Overt errors are in red.
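The ordering described in the caption can be illustrated with a few lines of numpy. This is a minimal sketch on made-up values: `s`, `psx`, and the variable names are hypothetical, and it only shows the self-confidence ranking, not how cleanlab selects the label errors themselves.

```python
import numpy as np

# Hypothetical noisy labels and out-of-sample predicted probabilities (4 examples, 3 classes).
s = np.array([0, 1, 2, 1])
psx = np.array([
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],  # labeled 1, but the model prefers class 0
    [0.05, 0.15, 0.80],
    [0.10, 0.55, 0.35],
])

# Self-confidence: predicted probability of each example's given label.
self_confidence = psx[np.arange(len(s)), s]

# Indices sorted from least to most self-confident.
least_confident_first = np.argsort(self_confidence)
print(least_confident_first)  # [1 3 0 2]
```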
cleanlab Generality: View performance across 4 distributions and 9 classifiers.
Use cleanlab to learn with noisy labels regardless of dataset distribution or classifier.
Each subfigure in the figure above depicts the decision boundary learned using cleanlab.classification.LearningWithNoisyLabels in the presence of extreme (~35%) label errors. Label errors are circled in green. Label noise is class-conditional (not simply uniformly random). Columns are organized by the classifier used, except the leftmost column, which depicts the ground-truth dataset distribution. Rows are organized by dataset used.
The code to reproduce this figure is available here.
Each figure depicts accuracy scores on a test set as decimal values:
- LEFT (in black): Test accuracy of the classifier trained with perfect labels (no label errors).
- MIDDLE (in blue): Test accuracy of the classifier trained with noisy labels using cleanlab.
- RIGHT (in white): Test accuracy of the baseline classifier trained with noisy labels.
As an example, this is the noise matrix (noisy channel) P(s | y) characterizing the label noise for the first dataset row in the figure. s represents the observed noisy labels and y represents the latent, true labels. The trace of this matrix is 2.6. A trace of 4 implies no label noise. A cell in this matrix is read like, “A random 38% of ‘3’ labels were flipped to ‘2’ labels.”
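| p(s\|y) | y=0  | y=1  | y=2  | y=3  |
|---------|------|------|------|------|
| s=0     | 0.55 | 0.01 | 0.07 | 0.06 |
| s=1     | 0.22 | 0.87 | 0.24 | 0.02 |
| s=2     | 0.12 | 0.04 | 0.64 | 0.38 |
| s=3     | 0.11 | 0.08 | 0.05 | 0.54 |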
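The properties stated for this matrix can be checked directly. The short sketch below uses only numpy and the values from the table; the variable names are illustrative.

```python
import numpy as np

# Noise matrix from the table: rows are noisy labels s, columns are true labels y.
noise_matrix = np.array([
    [0.55, 0.01, 0.07, 0.06],
    [0.22, 0.87, 0.24, 0.02],
    [0.12, 0.04, 0.64, 0.38],
    [0.11, 0.08, 0.05, 0.54],
])

# Each column is a probability distribution over s for a fixed true label y.
column_sums = noise_matrix.sum(axis=0)

# The diagonal holds P(s=k | y=k), the probability a label is left unchanged.
trace = np.trace(noise_matrix)

# P(s=2 | y=3): the fraction of true '3' labels flipped to '2'.
flip_3_to_2 = noise_matrix[2, 3]

print(column_sums, round(trace, 2), flip_3_to_2)
```

Every column sums to 1, and the diagonal sums to the trace of 2.6 quoted above.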
Get started with easy, quick examples.
New to cleanlab? Start with:
- Visualizing confident learning
- A simple example of learning with noisy labels on the multi-class Iris dataset.
These examples show how easy it is to characterize label noise in datasets, learn with noisy labels, identify label errors, estimate latent priors and noisy channels, and more.
Use cleanlab with any model (Tensorflow, caffe2, PyTorch, etc.)
All of the features of the cleanlab package work with any model. Yes, any model. Feel free to use PyTorch, Tensorflow, Caffe2, scikit-learn, MXNet, etc. If you use a scikit-learn classifier, all cleanlab methods will work out-of-the-box. It’s also easy to use your favorite model from a non-scikit-learn package: just wrap your model into a Python class that inherits from sklearn.base.BaseEstimator:
    from sklearn.base import BaseEstimator

    class YourFavoriteModel(BaseEstimator):  # Inherits sklearn base classifier
        def __init__(self):
            pass
        def fit(self, X, y, sample_weight=None):
            pass
        def predict(self, X):
            pass
        def predict_proba(self, X):
            pass
        def score(self, X, y, sample_weight=None):
            pass

    # Now you can use your model with `cleanlab`. Here's one example:
    from cleanlab.classification import LearningWithNoisyLabels
    lnl = LearningWithNoisyLabels(clf=YourFavoriteModel())
    lnl.fit(train_data, train_labels_with_errors)
Want to see a working example? Here’s a compliant PyTorch MNIST CNN class.
As you can see here, technically you don’t actually need to inherit from sklearn.base.BaseEstimator, as you can just create a class that defines .fit(), .predict(), and .predict_proba(), but inheriting makes downstream scikit-learn applications like hyperparameter optimization work seamlessly. For example, the LearningWithNoisyLabels() model is fully compliant.
Note that some libraries exist to do this for you. For PyTorch, check out the skorch Python library, which will wrap your PyTorch model into a scikit-learn compliant model.
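To make the required interface concrete, here is a minimal sketch of a toy model that defines `fit`, `predict`, `predict_proba`, and `score` using only numpy. `NearestCentroidModel` is a hypothetical class written for illustration; it does not inherit from `BaseEstimator`, so it would not get the downstream scikit-learn conveniences mentioned above, and it ignores `sample_weight`.

```python
import numpy as np

class NearestCentroidModel:
    """Hypothetical toy classifier exposing fit, predict_proba, predict, and score."""

    def fit(self, X, y, sample_weight=None):
        # sample_weight is accepted for interface compatibility but ignored here.
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid per class: the mean feature vector of that class.
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # Squared distance from every example to every centroid.
        dists = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        # Softmax over negative distances, so closer centroids get higher probability.
        logits = -dists
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        return probs / probs.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]

    def score(self, X, y, sample_weight=None):
        return float(np.mean(self.predict(X) == np.asarray(y)))

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y = np.array([0, 0, 1, 1])
model = NearestCentroidModel().fit(X, y)
probs = model.predict_proba(X)  # one row per example, one column per class
```

The key requirement is that `predict_proba` returns an n x m array whose rows sum to 1, since that is the `psx` shape the rest of the package works with.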
Documentation by Example
cleanlab Core Package Components
- cleanlab/classification.py: The LearningWithNoisyLabels() class for learning with noisy labels.
- cleanlab/latent_algebra.py: Equalities when noise information is known.
- cleanlab/latent_estimation.py: Estimates and fully characterizes all variants of label noise.
- cleanlab/noise_generation.py: Generate mathematically valid synthetic noise matrices.
- cleanlab/polyplex.py: Characterizes joint distribution of label noise EXACTLY from noisy channel.
- cleanlab/pruning.py: Finds the indices of the examples with label errors in a dataset.
Many of these methods have default parameters that won’t be covered here. Check out the method docstrings for full documentation.
Estimate the confident joint, the latent noisy channel matrix P(s | y) and its inverse P(y | s), the latent prior of the unobserved, actual true labels p(y), and the predicted probabilities.
s denotes a random variable that represents the observed, noisy label and y denotes a random variable representing the hidden, actual labels. Both s and y take any of the m classes as values. The cleanlab package supports different levels of granularity for computation depending on the needs of the user. Because of this, we support multiple alternatives, all no more than a few lines, to estimate these latent distribution arrays, enabling the user to reduce computation time by only computing what they need to compute, as seen in the examples below.
Throughout these examples, you’ll see a variable called confident_joint. The confident joint is an m x m matrix (m is the number of classes) that counts, for every observed, noisy class, the number of examples that confidently belong to every latent, hidden class. It counts the number of examples that we are confident are labeled correctly or incorrectly for every pair of observed and unobserved classes. The confident joint is an unnormalized estimate of the complete-information latent joint distribution, P(s, y). Most of the methods in the cleanlab package start by first estimating the confident_joint. You can learn more about this in the confident learning paper.
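As a rough illustration of the counting described above, the sketch below builds a confident joint with numpy and a for-loop. It assumes the thresholding rule from the confident learning paper (an example counts toward class j only if its predicted probability for j is at least the average predicted probability of class j among examples labeled j). `confident_joint_sketch` is a hypothetical helper and not cleanlab's implementation, which handles more cases.

```python
import numpy as np

def confident_joint_sketch(s, psx):
    """Count examples per (noisy label, confidently estimated true label) pair."""
    s = np.asarray(s)
    psx = np.asarray(psx, dtype=float)
    m = psx.shape[1]
    # Per-class threshold: mean predicted probability of class j over examples labeled j.
    thresholds = np.array([psx[s == j, j].mean() for j in range(m)])
    joint = np.zeros((m, m), dtype=int)
    for label, probs in zip(s, psx):
        confident = np.flatnonzero(probs >= thresholds)
        if confident.size == 0:
            continue  # not confidently in any class, so left uncounted
        # If several classes pass their threshold, take the most probable one.
        likely_true = confident[np.argmax(probs[confident])]
        joint[label, likely_true] += 1  # row: noisy label s, column: likely true label y
    return joint

s = np.array([0, 0, 0, 1, 1, 1])
psx = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],  # labeled 0, but confidently looks like class 1
    [0.1, 0.9],
    [0.4, 0.6],  # below both thresholds, so not counted
    [0.2, 0.8],
])
joint = confident_joint_sketch(s, psx)
print(joint)
```

Off-diagonal entries count examples that are confidently mislabeled; here one example labeled 0 is counted as likely belonging to class 1.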
Option 1: Compute the confident joint and predicted probs first. Stop if that’s all you need.
    from sklearn.linear_model import LogisticRegression as logreg
    from cleanlab.latent_estimation import estimate_latent
    from cleanlab.latent_estimation import estimate_confident_joint_and_cv_pred_proba

    # Compute the confident joint and the n x m predicted probabilities matrix (psx),
    # for n examples, m classes. Stop here if all you need is the confident joint.
    confident_joint, psx = estimate_confident_joint_and_cv_pred_proba(
        X=X_train,
        s=train_labels_with_errors,
        clf=logreg(),  # default, you can use any classifier
    )

    # Estimate latent distributions: p(y) as est_py, P(s|y) as est_nm, and P(y|s) as est_inv
    est_py, est_nm, est_inv = estimate_latent(confident_joint, s=train_labels_with_errors)
Option 2: Estimate the latent distribution matrices in a single line of code.
    from cleanlab.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba

    est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba(
        X=X_train,
        s=train_labels_with_errors,
    )
Option 3: Skip computing the predicted probabilities if you already have them.
    # Already have psx? (n x m matrix of predicted probabilities)
    # For example, you might get them from a pre-trained model (like resnet on ImageNet)
    # With the cleanlab package, you estimate directly with psx.
    from cleanlab.latent_estimation import estimate_py_and_noise_matrices_from_probabilities

    est_py, est_nm, est_inv, confident_joint = estimate_py_and_noise_matrices_from_probabilities(
        s=train_labels_with_errors,
        psx=psx,
    )
Completely characterize label noise in a dataset:
The joint probability distribution of noisy and true labels, P(s, y), completely characterizes label noise with a class-conditional m x m matrix.
    from cleanlab.latent_estimation import estimate_joint

    joint = estimate_joint(
        s=noisy_labels,
        psx=probabilities,
        confident_joint=None,  # Provide if you have it already
    )
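To see why the joint characterizes the noise completely, the sketch below derives the other latent quantities from a small joint using only numpy. The 2 x 2 joint is made up, and the layout (rows indexed by s, columns by y, with the inverse matrix indexed [y, s]) is an assumption chosen for this illustration rather than a statement about cleanlab's internals.

```python
import numpy as np

# Hypothetical joint P(s, y) for 2 classes: rows are noisy labels s, columns are true labels y.
joint = np.array([
    [0.30, 0.10],
    [0.05, 0.55],
])

py = joint.sum(axis=0)  # p(y): marginal over true labels (column sums)
ps = joint.sum(axis=1)  # p(s): marginal over noisy labels (row sums)

# Noisy channel P(s | y), indexed [s, y]: divide each column by p(y).
noise_matrix = joint / py

# Inverse P(y | s), indexed [y, s]: divide each row of the joint by p(s), then transpose.
inverse_noise_matrix = (joint / ps[:, None]).T

# Fraction of examples whose noisy label differs from the true label.
fraction_label_errors = 1.0 - np.trace(joint)
```

As a consistency check, multiplying the noise matrix by p(y) recovers p(s), which is the law of total probability.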
Methods to Standardize Research with Noisy Labels
cleanlab supports a number of functions to generate noise for benchmarking and standardization in research. This next example shows how to generate valid, class-conditional, uniformly random noisy channel matrices:
    # Generate a valid (necessary conditions for learnability are met) noise matrix for any trace > 1
    from cleanlab.noise_generation import generate_noise_matrix_from_trace

    noise_matrix = generate_noise_matrix_from_trace(
        K=number_of_classes,
        trace=float_value_greater_than_1_and_leq_K,
        py=prior_of_y_actual_labels_which_is_just_an_array_of_length_K,
        frac_zero_noise_rates=float_from_0_to_1_controlling_sparsity,
    )

    # Check if a noise matrix is valid (necessary conditions for learnability are met)
    from cleanlab.noise_generation import noise_matrix_is_valid

    is_valid = noise_matrix_is_valid(
        noise_matrix,
        prior_of_y_actual_labels_which_is_just_an_array_of_length_K,
    )
For a given noise matrix, this example shows how to generate noisy labels. Methods can be seeded for reproducibility.
    # Generate noisy labels using the noise_matrix. Guarantees exact amount of noise in labels.
    from cleanlab.noise_generation import generate_noisy_labels

    s_noisy_labels = generate_noisy_labels(y_hidden_actual_labels, noise_matrix)

    # This package is full of other useful methods for learning with noisy labels.
    # The tutorial stops here, but you don't have to. Inspect method docstrings for full docs.
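To illustrate what applying a noise matrix to labels means, here is a minimal numpy sketch that samples each noisy label from the column of P(s | y) matching the true label. `sample_noisy_labels` is a hypothetical helper; unlike the generate_noisy_labels call above, which is described as guaranteeing the exact amount of noise, this version only matches the noise rates in expectation.

```python
import numpy as np

def sample_noisy_labels(y, noise_matrix, seed=0):
    """Draw one noisy label per example from column y of noise_matrix, i.e. P(s | y)."""
    rng = np.random.default_rng(seed)  # seeded for reproducibility
    y = np.asarray(y)
    num_classes = noise_matrix.shape[0]
    return np.array([rng.choice(num_classes, p=noise_matrix[:, label]) for label in y])

# Hypothetical 2-class noise matrix: rows are noisy labels s, columns are true labels y.
noise_matrix = np.array([
    [0.8, 0.3],
    [0.2, 0.7],
])
y = np.array([0] * 500 + [1] * 500)
s = sample_noisy_labels(y, noise_matrix, seed=0)

# Empirical flip rate for class 0: close to, but not exactly, P(s=1 | y=0) = 0.2.
flip_rate_0 = np.mean(s[y == 0] == 1)
```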
The Polyplex
The key to learning in the presence of label errors is estimating the joint distribution between the actual, hidden labels ‘y’ and the observed, noisy labels ‘s’. Using cleanlab and the theory of confident learning, we can completely characterize the trace of the latent joint distribution, trace(P(s, y)), given p(y), for any fraction of label errors, i.e. for any trace of the noisy channel, trace(P(s | y)).
You can check out how to do this yourself here:
1. Drawing Polyplices
2. Computing Polyplices
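The relationship between the two traces can be made concrete with a small calculation: each diagonal entry of the joint is P(s=k | y=k) multiplied by p(y=k), so the joint's trace is a weighted sum of the noise matrix diagonal. The sketch below reuses the noise matrix from the example earlier in this document; both priors are assumed values chosen for illustration, and `joint_trace` is a hypothetical helper, not part of cleanlab/polyplex.py.

```python
import numpy as np

# Noise matrix P(s | y) from the example earlier in this document (trace 2.6).
noise_matrix = np.array([
    [0.55, 0.01, 0.07, 0.06],
    [0.22, 0.87, 0.24, 0.02],
    [0.12, 0.04, 0.64, 0.38],
    [0.11, 0.08, 0.05, 0.54],
])

def joint_trace(noise_matrix, py):
    """trace(P(s, y)) = sum over k of P(s=k | y=k) * p(y=k)."""
    return float(np.dot(np.diag(noise_matrix), py))

uniform_py = np.full(4, 0.25)               # assumed prior, for illustration only
skewed_py = np.array([0.1, 0.7, 0.1, 0.1])  # another assumed prior

trace_uniform = joint_trace(noise_matrix, uniform_py)
trace_skewed = joint_trace(noise_matrix, skewed_py)

# One minus the joint's trace is the overall fraction of label errors.
print(1 - trace_uniform, 1 - trace_skewed)
```

The same noisy channel yields different overall error fractions under different priors, which is why the trace of the joint is characterized given p(y).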
License
Copyright (c) 2017-2019 Curtis Northcutt. Released under the MIT License. See LICENSE for details.