# disarray

Calculate confusion matrix metrics from your pandas DataFrame.

`disarray` calculates metrics derived from a confusion matrix and makes them directly accessible from a pandas DataFrame. If you are already using `pandas`, then `disarray` is easy to use: simply import it.

```python
import pandas as pd
import disarray

df = pd.DataFrame([[18, 1], [0, 1]])
df.da.sensitivity
# 0    0.947368
# 1    1.000000
# dtype: float64
```
## Installation

Install using pip:

```shell
$ pip install disarray
```

Or clone from GitHub and install from source:

```shell
$ git clone https://github.com/arvkevi/disarray.git
$ python setup.py install
```
## Usage

The `disarray` package is intended to be used like a `pandas` attribute or method. `disarray` is registered as a `pandas` extension under the namespace `da`. For a DataFrame named `df`, access the library using `df.da`.
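Under the hood, each accessor property is plain arithmetic on the confusion matrix. As an illustrative sketch (not `disarray`'s actual code), per-class sensitivity can be derived by hand with `pandas` alone:

```python
import pandas as pd

# confusion matrix: rows are true labels, columns are predicted labels
df = pd.DataFrame([[18, 1], [0, 1]])

# true positives are the diagonal entries; false negatives are the
# remainder of each row
tp = pd.Series([df.iat[i, i] for i in range(len(df))])
fn = df.sum(axis=1) - tp

# sensitivity (recall) = TP / (TP + FN)
sensitivity = tp / (tp + fn)
print(sensitivity)
# 0    0.947368
# 1    1.000000
# dtype: float64
```

This matches the `df.da.sensitivity` output shown above.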
### Binary Classification

To understand the input and usage for `disarray`, build an example confusion matrix for a binary classification problem from scratch with `scikit-learn`.

(You can install the packages you need to run the demo with `pip install -r requirements.demo.txt`.)

```python
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# generate a random binary classification dataset
X, y = datasets.make_classification(n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit an SVM and predict on the test set
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[13  2]
#  [ 0 10]]
```
Using `disarray` is as easy as importing it and instantiating a DataFrame object from a square array of non-negative integers.

```python
import disarray
import pandas as pd

df = pd.DataFrame(cm)

# access metrics for each class by index
print(df.da.precision[1])
# 0.83
```
### Class Counts

`disarray` stores per-class counts of true positives, false positives, false negatives, and true negatives. Each of these is stored under a capitalized abbreviation: `TP`, `FP`, `FN`, and `TN`.

```python
df.da.TP
# 0    13
# 1    10
# dtype: int64
```
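These counts follow directly from the confusion matrix. A minimal sketch of the arithmetic (illustrative, not `disarray`'s internals), using the binary matrix from above:

```python
import numpy as np

cm = np.array([[13, 2],
               [0, 10]])  # rows: true labels, columns: predictions

tp = np.diag(cm)                 # correctly predicted per class
fp = cm.sum(axis=0) - tp         # predicted as the class, but wrong
fn = cm.sum(axis=1) - tp         # belong to the class, predicted otherwise
tn = cm.sum() - (tp + fp + fn)   # everything else

print(tp, fp, fn, tn)
# [13 10] [0 2] [2 0] [10 13]
```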
### Export Metrics

Use `df.da.export_metrics()` to store and/or visualize many common performance metrics in a new `pandas` DataFrame object. Use the `metrics_to_include=` argument to pass a list of metrics defined in `disarray/metrics.py` (the default is to use `__all_metrics__`).

```python
df.da.export_metrics(metrics_to_include=['precision', 'recall', 'f1'])
```

|           | 0        | 1        | micro-average |
|-----------|----------|----------|---------------|
| precision | 1.0      | 0.833333 | 0.92          |
| recall    | 0.866667 | 1.0      | 0.92          |
| f1        | 0.928571 | 0.909091 | 0.92          |
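Note that all three micro-average values are identical (0.92). This is expected: in single-label classification every false positive for one class is a false negative for another, so micro-averaged precision, recall, and F1 all reduce to overall accuracy. A quick check with `numpy` (illustrative):

```python
import numpy as np

cm = np.array([[13, 2],
               [0, 10]])

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp

# pooled (micro) counts: fp.sum() == fn.sum(), so all three agree
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
accuracy = tp.sum() / cm.sum()
print(micro_precision, micro_recall, accuracy)
# 0.92 0.92 0.92
```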
### Multi-Class Classification

`disarray` also works with multi-class confusion matrices. Try it out on the iris dataset. Notice that the DataFrame is instantiated with an `index` and `columns` here, but this is not required.

```python
# load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# train and fit an SVM
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# instantiate the confusion matrix DataFrame with index and columns
df = pd.DataFrame(cm, index=class_names, columns=class_names)
print(df)
```

|            | setosa | versicolor | virginica |
|------------|--------|------------|-----------|
| setosa     | 13     | 0          | 0         |
| versicolor | 0      | 10         | 6         |
| virginica  | 0      | 0          | 9         |
`disarray` can provide per-class metrics:

```python
df.da.sensitivity
# setosa        1.000
# versicolor    0.625
# virginica     1.000
# dtype: float64
```

In a familiar fashion, one of the classes can be accessed with bracket indexing.

```python
df.da.sensitivity['setosa']
# 1.0
```

Currently, a micro-average is supported for both binary and multi-class confusion matrices (although it only makes sense in the multi-class case).

```python
df.da.micro_sensitivity
# 0.8421052631578947
```
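The micro-average pools per-class counts before dividing: for sensitivity, that is sum(TP) / (sum(TP) + sum(FN)). Sketched with `numpy` against the iris confusion matrix above (illustrative):

```python
import numpy as np

# the iris confusion matrix from above
cm = np.array([[13, 0, 0],
               [0, 10, 6],
               [0, 0, 9]])

tp = np.diag(cm)            # [13, 10, 9]
fn = cm.sum(axis=1) - tp    # [0, 6, 0]

# pool counts across all classes, then divide once
micro_sensitivity = tp.sum() / (tp.sum() + fn.sum())
print(micro_sensitivity)
# 0.8421052631578947
```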
Finally, a DataFrame can be exported with selected metrics.

```python
df.da.export_metrics(metrics_to_include=['sensitivity', 'specificity', 'f1'])
```

|             | setosa | versicolor | virginica | micro-average |
|-------------|--------|------------|-----------|---------------|
| sensitivity | 1.0    | 0.625      | 1.0       | 0.842105      |
| specificity | 1.0    | 1.0        | 0.793103  | 0.921053      |
| f1          | 1.0    | 0.769231   | 0.75      | 0.842105      |
## Supported Metrics

- `accuracy`
- `f1`
- `false_discovery_rate`
- `false_negative_rate`
- `false_positive_rate`
- `negative_predictive_value`
- `positive_predictive_value`
- `precision`
- `recall`
- `sensitivity`
- `specificity`
- `true_negative_rate`
- `true_positive_rate`

Micro-averages are also available for each of these, accessible via `df.da.micro_recall`, for example.
## Why disarray?

Working with a confusion matrix is common in data science projects, and it is useful to have performance metrics available directly from pandas DataFrames. Since `pandas` version 0.23.0, users can easily register custom accessors, which is how `disarray` is implemented.
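For illustration, here is a minimal custom accessor in the same spirit. The class name, the `cm` namespace, and the single `accuracy` property are made up for this sketch; `disarray`'s real implementation differs.

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("cm")
class ConfusionMatrixAccessor:
    """Expose confusion matrix metrics as DataFrame attributes (sketch)."""

    def __init__(self, df):
        # validate once at access time, like a pandas accessor should
        if df.shape[0] != df.shape[1]:
            raise ValueError("expected a square confusion matrix")
        self._df = df

    @property
    def accuracy(self):
        # overall accuracy: correct predictions (diagonal) over all predictions
        values = self._df.to_numpy()
        return float(values.diagonal().sum() / values.sum())

df = pd.DataFrame([[13, 2], [0, 10]])
print(df.cm.accuracy)
# 0.92
```

Registering the accessor once makes `df.cm` available on every DataFrame in the session, which is exactly the mechanism that puts `df.da` at your fingertips.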
## Contributing

Contributions are welcome; please refer to CONTRIBUTING to learn more about how to contribute.