Don't Blindly Trust Your Labels

doubtlab

A lab for bad labels. Learn more here.

This repository contains general tricks that may help you find bad, or noisy, labels in your dataset. The hope is that it makes it easier for folks to quickly check their own datasets before they invest too much time and compute in grid search.

Install

You can install the tool via pip or conda.

Install with pip

python -m pip install doubtlab

Install with conda

conda install -c conda-forge doubtlab

Quickstart

Doubtlab allows you to define "reasons" for a row of data to deserve another look. These reasons can form a pipeline which can be used to retrieve a sorted list of examples worth checking again.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason

# Let's say we have some dataset/model already
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1_000)
model.fit(X, y)

# Next we can add reasons for doubt. In this case we're saying
# that examples deserve another look if the associated proba values
# are low or if the model output doesn't match the associated label.
reasons = {
    'proba': ProbaReason(model=model),
    'wrong_pred': WrongPredictionReason(model=model)
}

# Pass these reasons to a doubtlab instance.
doubt = DoubtEnsemble(**reasons)

# Get the ordered indices of examples worth checking again
indices = doubt.get_indices(X, y)
# Get dataframe with "reason"-ing behind the sorting
predicates = doubt.get_predicates(X, y)
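
To make the idea of a "sorted list of examples worth checking" concrete, here is a minimal sketch of one plausible way to combine per-reason flags into a ranking. It is not doubtlab's implementation: the `flags` data and the `rank_doubtful_rows` helper are hypothetical and only illustrate the counting-and-sorting idea with plain NumPy.

```python
import numpy as np

# Hypothetical 0/1 flags per reason for five rows (illustrative data only,
# not output produced by doubtlab).
flags = {
    "proba":      np.array([0, 1, 0, 1, 0]),
    "wrong_pred": np.array([0, 1, 0, 0, 1]),
}


def rank_doubtful_rows(flags):
    """Return indices of rows with at least one flag, most-flagged first."""
    # Count how many reasons flagged each row.
    totals = np.sum(list(flags.values()), axis=0)
    # A stable sort on the negated totals keeps row order among ties.
    order = np.argsort(-totals, kind="stable")
    # Drop rows that no reason flagged.
    return [int(i) for i in order if totals[i] > 0]


ranked = rank_doubtful_rows(flags)
print(ranked)  # [1, 3, 4]
```

Row 1 is flagged by both reasons, so it comes first; rows 0 and 2 are never flagged and are left out.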

Features

The library implements many "reasons" for doubt.

General Reasons

  • RandomReason: assign doubt randomly, just to be sure
  • OutlierReason: assign doubt when the model declares a row an outlier

Classification Reasons

  • ProbaReason: assign doubt when a model's confidence values are low for any label
  • WrongPredictionReason: assign doubt when a model cannot predict the listed label
  • ShortConfidenceReason: assign doubt when the correct label gains too little confidence
  • LongConfidenceReason: assign doubt when a wrong label gains too much confidence
  • DisagreeReason: assign doubt when two models disagree on a prediction
  • CleanlabReason: assign doubt according to cleanlab
  • MarginConfidenceReason: assign doubt when there's a small difference between the top two class confidences
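
The probability-based checks in this list can be sketched with plain NumPy. The snippet below is only an illustration of the ideas, not the library's code: the `proba` matrix, the labels `y`, and every threshold (0.55, 0.2, 0.6) are assumptions chosen for the example.

```python
import numpy as np

# Illustrative class probabilities for four rows and three classes,
# as a classifier's predict_proba might return them.
proba = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.50, 0.45, 0.05],
])
y = np.array([0, 0, 2, 1])  # the listed (possibly noisy) labels

rows = np.arange(len(y))
pred = proba.argmax(axis=1)   # most likely class per row
label_conf = proba[rows, y]   # confidence given to the listed label

# Highest confidence given to any label other than the listed one.
other = proba.copy()
other[rows, y] = 0.0
wrong_conf = other.max(axis=1)

# Gap between the two most confident classes.
top_two = np.sort(proba, axis=1)[:, -2:]
margin = top_two[:, 1] - top_two[:, 0]

# All thresholds below are assumed values for this sketch.
checks = {
    "low_proba": proba.max(axis=1) < 0.55,   # no label is confident
    "wrong_pred": pred != y,                 # prediction differs from label
    "short_conf": label_conf < 0.2,          # listed label gets too little
    "long_conf": wrong_conf > 0.6,           # a wrong label gets too much
    "small_margin": margin < 0.2,            # top two classes are close
}

for name, mask in checks.items():
    print(name, mask.tolist())
```

Row 2 is a typical candidate for relabelling here: the model puts 0.8 on class 1 while the listed label is class 2, so it trips the wrong-prediction, short-confidence, and long-confidence checks at once.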

Regression Reasons

  • AbsoluteDifferenceReason: assign doubt when the absolute difference is too high
  • RelativeDifferenceReason: assign doubt when the relative difference is too high
  • StandardizedErrorReason: assign doubt when the absolute standardized residual is too high
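
The three regression checks differ only in how the residual is scaled. The following is a rough NumPy sketch under assumed data and thresholds (5.0, 0.08, and 2.0 are picked for illustration); in particular, dividing by the sample standard deviation of the residuals is one common way to standardize and may not match the library exactly.

```python
import numpy as np

# Illustrative targets and predictions (not real model output).
y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_pred = np.array([11.0, 19.0, 30.0, 41.0, 80.0])

residual = y_true - y_pred

# Absolute difference: raw size of the error.
abs_diff = np.abs(residual)

# Relative difference: error as a fraction of the target value.
rel_diff = abs_diff / np.abs(y_true)

# Standardized residual: error in units of the residuals' sample
# standard deviation (assumed definition for this sketch).
std_resid = np.abs(residual / residual.std(ddof=1))

# All thresholds are assumptions chosen for the example.
doubt_abs = abs_diff > 5.0
doubt_rel = rel_diff > 0.08
doubt_std = std_resid > 2.0

print(doubt_abs.tolist())
print(doubt_rel.tolist())
print(doubt_std.tolist())
```

The first row shows why the choice matters: an error of 1 on a target of 10 is small in absolute terms but exceeds the assumed relative threshold, so only the relative check flags it.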

Feedback

It is early days for the project. The project should be plenty useful as-is, but we prefer to be honest. Feedback and anecdotes are very welcome!

Related Projects

  • The cleanlab project was an inspiration for this one. They have a great heuristic for bad label detection but I wanted to have a library that implements many. Be sure to check out their work on the labelerrors.com project.
  • My former employer, Rasa, has always had a focus on data quality. Some of that attitude is bound to have seeped in here. Be sure to check the Conversation Driven Development approach and Rasa X if you're working on virtual assistants.
  • My current employer, Explosion, has a neat labelling tool called Prodigy. I'm currently investigating how tools like doubtlab might lead to better labels when combined with this (very likeable) annotation tool.
