The Data Detective, for better Machine Learning & AI

Project description

DataDetective

Install

pip install -U datadetective

How to use

DataDetective works with classifiers. It ranks suspicious labels given the probabilities produced by a classification model. You can pass plain Python lists, NumPy arrays, or Pandas data. Return values are a NumPy array or a Pandas series; the larger the value, the more suspicious the corresponding label.

import pandas as pd
import datadetective
from datadetective import suspect

assert datadetective.__version__ == '0.4.0'
labels = pd.Series(["cat", "dog", "dog", "cat", "cat"])
0    cat
1    dog
2    dog
3    cat
4    cat
dtype: object
probas = pd.DataFrame(dict(
    cat=[0.5, 0.4, 0.3, 0.2, 0.1],
    dog=[0.5, 0.6, 0.7, 0.8, 0.9],
))
   cat  dog
0  0.5  0.5
1  0.4  0.6
2  0.3  0.7
3  0.2  0.8
4  0.1  0.9
suspect(
    probas,
    labels=labels,
)
datadetective.classification.estimate_noise.avg_confidence:35 [0.26666667 0.65      ]
        err  suspected
0  0.000000      False
1  0.183333       True
2  0.000000      False
3  0.216667       True
4  0.416667       True
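The logged `avg_confidence` values and the `suspected` flags above can be reproduced by hand with plain pandas. The sketch below is a reconstruction inferred from the printed numbers, not DataDetective's own code. It assumes the per-class threshold is the mean probability a class receives on the rows carrying that label, and that a row is flagged when the most probable class among those reaching their threshold differs from the observed label. The names `thresholds`, `reaches_threshold`, and `confident_guess` are illustrative only.

```python
import pandas as pd

labels = pd.Series(["cat", "dog", "dog", "cat", "cat"])
probas = pd.DataFrame(dict(
    cat=[0.5, 0.4, 0.3, 0.2, 0.1],
    dog=[0.5, 0.6, 0.7, 0.8, 0.9],
))

# Assumed threshold per class: mean probability of that class
# over the rows whose observed label is that class.
thresholds = pd.Series(
    {c: probas.loc[labels == c, c].mean() for c in probas.columns}
)

# Mark, per row, which classes reach their own threshold.
reaches_threshold = probas.ge(thresholds, axis="columns")

# Among the classes that reach their threshold, take the most probable one.
# (Every row in this toy data has at least one such class.)
confident_guess = probas.where(reaches_threshold).idxmax(axis="columns")

# Assumed rule: a label is suspicious when the confident guess disagrees with it.
suspected = confident_guess != labels

print(thresholds.to_numpy())
print(suspected.tolist())
```

Under these assumptions the thresholds match the logged values and rows 1, 3 and 4 are flagged, as in the table above.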
residual = suspect(
    probas,
    labels=labels,
    rank_method="residual",
    return_non_errors=False,
)
datadetective.classification.estimate_noise.avg_confidence:35 [0.26666667 0.65      ]
   err
1  0.4
3  0.8
4  0.9
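The `residual` scores printed above are consistent with one minus the probability the model assigns to the observed label. The following sketch checks that reading on the toy data. It is an inference from the output rather than the library's documented definition, and the list of suspected row ids is simply copied from the earlier result.

```python
import numpy as np
import pandas as pd

labels = pd.Series(["cat", "dog", "dog", "cat", "cat"])
probas = pd.DataFrame(dict(
    cat=[0.5, 0.4, 0.3, 0.2, 0.1],
    dog=[0.5, 0.6, 0.7, 0.8, 0.9],
))

# Column position of each row's observed label.
label_cols = probas.columns.get_indexer(labels)

# Probability assigned to the observed label, one value per row.
self_confidence = probas.to_numpy()[np.arange(len(labels)), label_cols]

# Assumed definition: residual = 1 - p(observed label).
residual = pd.Series(1.0 - self_confidence, index=labels.index, name="err")

# Row ids flagged as suspected in the output above (copied, not recomputed).
suspected_ids = [1, 3, 4]
print(residual.loc[suspected_ids])
```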
set_logger("INFO")
confidence = suspect(
    probas,
    labels=labels,
    rank_method="confidence",
    return_non_errors=False,
)
datadetective.classification.estimate_noise.avg_confidence:35 [0.26666667 0.65      ]
         err
id
1   0.183333
3   0.216667
4   0.416667
probas.assign(labels=labels, residual=residual, confidence=confidence)
   cat  dog labels  residual  confidence
0  0.5  0.5    cat       NaN         NaN
1  0.4  0.6    dog       0.4    0.183333
2  0.3  0.7    dog       NaN         NaN
3  0.2  0.8    cat       0.8    0.216667
4  0.1  0.9    cat       0.9    0.416667
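The `confidence` column can also be matched numerically. One formula that reproduces all five values is the margin by which the confidently guessed class exceeds its threshold, plus the margin by which the observed label falls short of its own. This is a guess that fits this two-class example, not a statement of how DataDetective computes the score; with more classes the library may behave differently.

```python
import numpy as np
import pandas as pd

labels = pd.Series(["cat", "dog", "dog", "cat", "cat"])
probas = pd.DataFrame(dict(
    cat=[0.5, 0.4, 0.3, 0.2, 0.1],
    dog=[0.5, 0.6, 0.7, 0.8, 0.9],
))

# Assumed per-class thresholds: mean probability of a class on its own rows.
thresholds = pd.Series(
    {c: probas.loc[labels == c, c].mean() for c in probas.columns}
)

# Signed distance of every probability from its class threshold.
margin = probas - thresholds

# Most probable class among those at or above their threshold.
guess = probas.where(margin >= 0).idxmax(axis="columns")

rows = np.arange(len(labels))
m = margin.to_numpy()
guess_margin = m[rows, probas.columns.get_indexer(guess)]
label_margin = m[rows, probas.columns.get_indexer(labels)]

# Hypothesised score: zero when guess and label agree,
# otherwise the gap between the two margins.
err = pd.Series(
    np.where(guess != labels, guess_margin - label_margin, 0.0),
    index=labels.index,
    name="err",
)
print(err.round(6))
```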

docstring

help(suspect)
Help on function suspect in module datadetective.api:

suspect(...)
    Rank the suspicious labels given probas from a classifier.
    Accept Numpy arrays, Pandas dataframes and series.
    Labels may be integers, strings, or even floats, provided that
    the probability matrix's columns are indexed by the same label set.
    
    #### Args
    
    - probas (n x m matrix): probabilities for possible classes.
    
    #### KwArgs
    
    - labels (n x 1 vector): observed class labels
    - rank_method (str): `residual` or `confidence`
    - return_non_errors (bool, default = True): return all rows, including non-errors
    
    #### Returns
    
    a Pandas DataFrame with 1 index and 2 columns:
    
    - id (int): the index, identical to the original data row index
    - err (float): the magnitude of suspiciousness, in the range [0, 1]
    - suspected (bool): whether the data row is suspected of having a label error. This column is returned iff return_non_errors=True.
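Given a result shaped as the docstring describes, a review queue can be built with ordinary pandas operations. The sketch below hard-codes a frame that mirrors the example output shown earlier instead of calling `suspect`, so it runs without the package installed. `review_queue` is an illustrative name.

```python
import pandas as pd

# Stand-in for the DataFrame returned with return_non_errors=True,
# copied from the example output.
result = pd.DataFrame(
    {
        "err": [0.0, 0.183333, 0.0, 0.216667, 0.416667],
        "suspected": [False, True, False, True, True],
    },
    index=pd.Index(range(5), name="id"),
)

# Keep only suspected rows and put the most suspicious first.
review_queue = result[result["suspected"]].sort_values("err", ascending=False)
print(review_queue)
```

Sorting in descending order of `err` puts the rows most worth a manual check at the top.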

Download files

Source Distribution

datadetective-0.4.0.tar.gz (15.6 kB)

Uploaded Source

Built Distribution

datadetective-0.4.0-py3-none-any.whl (25.2 kB)

Uploaded Python 3

File details

Details for the file datadetective-0.4.0.tar.gz.

File metadata

  • Download URL: datadetective-0.4.0.tar.gz
  • Upload date:
  • Size: 15.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for datadetective-0.4.0.tar.gz
Algorithm Hash digest
SHA256 2a091041c323953bdef82cf3971bde896d075a3db60cd168d961c16d1fd77132
MD5 24d138fa67f09480479313ee8bab3f81
BLAKE2b-256 381ae893fe986749bb57cc3f028f168a995c6975ef9c2d496240344380b03781

File details

Details for the file datadetective-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: datadetective-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 25.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for datadetective-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4c484f8184374a4830d1ba72b9b4089e3b602ee3ccd7f6961612e4b358b2a031
MD5 2d84e3666021a1f5665a4c03f13a7043
BLAKE2b-256 a2edeebcfbebab237a3882373745840b68f88b520a9296788a5aa026e50d716c
