DataDetective
Install
pip install -U datadetective
How to use
DataDetective works with classifiers. It ranks suspicious labels
given the class probabilities predicted by a classification model. You can
pass plain Python lists, NumPy arrays, or Pandas data. Return values are in a
NumPy array or a Pandas series; the larger the value, the more suspicious
the corresponding label.
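As a rough sketch of preparing inputs that do not start out as Pandas objects, the snippet below wraps a NumPy probability matrix in a DataFrame whose columns are the class names, so that the columns and the observed labels share one label set. The `classes` list and the variable names are illustrative assumptions, not part of DataDetective.

```python
import numpy as np
import pandas as pd

# Hypothetical class order; it must match the column order of the model output.
classes = ["cat", "dog"]

# Raw probabilities as a NumPy array: one row per sample, one column per class.
raw = np.array([
    [0.5, 0.5],
    [0.4, 0.6],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.1, 0.9],
])
labels = ["cat", "dog", "dog", "cat", "cat"]  # a plain Python list

# Name the columns after the classes so labels and columns use the same set.
probas = pd.DataFrame(raw, columns=classes)

# Sanity checks before ranking: each row is a distribution over the classes,
# every observed label is a known class, and there is one label per row.
assert np.allclose(probas.sum(axis=1), 1.0)
assert set(labels) <= set(probas.columns)
assert len(labels) == len(probas)
```

Checks like these catch a mismatch between labels and probability columns before any ranking is attempted.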
import pandas as pd
import datadetective
assert datadetective.__version__ == '0.4.0'
from datadetective import suspect
labels = pd.Series(["cat", "dog", "dog", "cat", "cat"])
0 cat
1 dog
2 dog
3 cat
4 cat
dtype: object
probas = pd.DataFrame(dict(
    cat=[0.5, 0.4, 0.3, 0.2, 0.1],
    dog=[0.5, 0.6, 0.7, 0.8, 0.9],
))
|   | cat | dog |
|---|-----|-----|
| 0 | 0.5 | 0.5 |
| 1 | 0.4 | 0.6 |
| 2 | 0.3 | 0.7 |
| 3 | 0.2 | 0.8 |
| 4 | 0.1 | 0.9 |
suspect(
    probas,
    labels=labels,
)
datadetective.classification.estimate_noise.avg_confidence:35 [0.26666667 0.65 ]
|   | err      | suspected |
|---|----------|-----------|
| 0 | 0.000000 | False     |
| 1 | 0.183333 | True      |
| 2 | 0.000000 | False     |
| 3 | 0.216667 | True      |
| 4 | 0.416667 | True      |
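The two numbers logged by `avg_confidence` above match the average probability the model assigns to each class among the rows that carry that class as their observed label. The sketch below reproduces them with plain Pandas. The flagging rule it uses (a row is suspected when its own-label probability falls below its class average) is an assumption inferred from this example's output, not a confirmed description of the library's internals.

```python
import pandas as pd

labels = pd.Series(["cat", "dog", "dog", "cat", "cat"])
probas = pd.DataFrame(dict(
    cat=[0.5, 0.4, 0.3, 0.2, 0.1],
    dog=[0.5, 0.6, 0.7, 0.8, 0.9],
))

# Probability the model assigns to each row's observed label.
self_confidence = pd.Series(
    [probas.at[i, label] for i, label in labels.items()],
    index=labels.index,
)

# Average self-confidence per observed class: cat -> 0.2667, dog -> 0.65.
thresholds = self_confidence.groupby(labels).mean()

# Assumed rule: flag a row when its self-confidence is below the
# average for its observed class.
suspected = self_confidence < labels.map(thresholds)
```

With this rule, rows 1, 3 and 4 are flagged, which agrees with the `suspected` column in the table above.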
residual = suspect(
    probas,
    labels=labels,
    rank_method="residual",
    return_non_errors=False,
)
datadetective.classification.estimate_noise.avg_confidence:35 [0.26666667 0.65 ]
set_logger("INFO")
confidence = suspect(
    probas,
    labels=labels,
    rank_method="confidence",
    return_non_errors=False,
)
datadetective.classification.estimate_noise.avg_confidence:35 [0.26666667 0.65 ]
| id | err      |
|----|----------|
| 1  | 0.183333 |
| 3  | 0.216667 |
| 4  | 0.416667 |
probas.assign(labels=labels, residual=residual, confidence=confidence)
|   | cat | dog | labels | residual | confidence |
|---|-----|-----|--------|----------|------------|
| 0 | 0.5 | 0.5 | cat    | NaN      | NaN        |
| 1 | 0.4 | 0.6 | dog    | 0.4      | 0.183333   |
| 2 | 0.3 | 0.7 | dog    | NaN      | NaN        |
| 3 | 0.2 | 0.8 | cat    | 0.8      | 0.216667   |
| 4 | 0.1 | 0.9 | cat    | 0.9      | 0.416667   |
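The `residual` and `confidence` values in the table above can be reproduced by hand, which may help when interpreting the scores. In the sketch below, the residual is taken as one minus the probability of the observed label, and the confidence score as the gap between the best competing class's margin over its class average and the observed class's margin. Both formulas are inferred from the numbers printed here. The extension to more than two classes (taking the maximum over the other classes) is an assumption, and none of this is confirmed as the library's implementation.

```python
import numpy as np
import pandas as pd

labels = pd.Series(["cat", "dog", "dog", "cat", "cat"])
probas = pd.DataFrame(dict(
    cat=[0.5, 0.4, 0.3, 0.2, 0.1],
    dog=[0.5, 0.6, 0.7, 0.8, 0.9],
))

p = probas.to_numpy()
rows = np.arange(len(p))
cols = probas.columns.get_indexer(labels)  # column position of each observed label
self_conf = p[rows, cols]                  # probability of the observed label

# Per-class average self-confidence, in column order (cat, dog).
thresholds = np.array([self_conf[cols == j].mean() for j in range(p.shape[1])])

# How far each class probability sits above its class threshold.
margin = p - thresholds
own_margin = margin[rows, cols]
suspected = own_margin < 0  # assumed flagging rule

# Residual (assumed): 1 - p(observed label), reported only for suspected rows.
residual = np.where(suspected, 1.0 - self_conf, np.nan)

# Confidence (assumed): best competing margin minus the observed label's margin.
other = margin.copy()
other[rows, cols] = -np.inf
confidence = np.where(suspected, other.max(axis=1) - own_margin, np.nan)
```

For row 1, for instance, this gives a residual of 1 - 0.6 = 0.4 and a confidence score of (0.4 - 0.2667) - (0.6 - 0.65) ≈ 0.1833.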
docstring
help(suspect)
Help on function suspect in module datadetective.api:
suspect(...)
Rank the suspicious labels given probas from a classifier.
Accept Numpy arrays, Pandas dataframes and series.
Labels may be integers, strings, or even floats, provided that
the probability matrix's columns are indexed by the same label set.
#### Args
- probas (n x m matrix): probabilities for the possible classes.
#### KwArgs
- labels (n x 1 vector): observed class labels
- rank_method (str): `residual` or `confidence`
- return_non_errors (bool, default = True): return all rows, including non-errors
#### Returns
a Pandas DataFrame with 1 index and up to 2 columns:
- id (int): the index, identical to the row index of the original data
- err (float): the magnitude of suspiciousness, in the range [0, 1]
- suspected (bool): whether the data row is suspected of having a label error. This column is returned if and only if return_non_errors=True.
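Given a result of the documented shape, a typical next step is to review the most suspicious rows first. The sketch below uses a hand-built DataFrame that mirrors the example output in this README as a stand-in for a real `suspect(...)` result. The names `result`, `review_queue` and `top_ids` are illustrative.

```python
import pandas as pd

# Stand-in for the documented return value (index "id", columns "err" and
# "suspected"), filled with the numbers from the example in this README.
result = pd.DataFrame(
    {
        "err": [0.0, 0.183333, 0.0, 0.216667, 0.416667],
        "suspected": [False, True, False, True, True],
    },
    index=pd.Index(range(5), name="id"),
)

# Keep only the suspected rows and put the most suspicious ones first.
review_queue = result[result["suspected"]].sort_values("err", ascending=False)

# Row ids to inspect, in review order.
top_ids = review_queue.index.tolist()
```

Because the `id` index matches the original row index, `top_ids` can be used directly to look up the corresponding rows in the source data.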