Simple python package to sanitize in a standard way ML-related labels.
Why do I need this?
So you have some kind of plot with ML-related labels. Since I always rename and sanitize them the same way, I have prepared this package to sanitize them in a standard fashion.
How do I install this package?
As usual, just install it using pip:
pip install sanitize_ml_labels
Usage examples
Here are a couple of common examples: a set of metrics to normalize, or a set of model names to normalize.
from sanitize_ml_labels import sanitize_ml_labels
# Example for metrics
labels = [
    "acc",
    "loss",
    "auroc",
    "lr"
]
sanitize_ml_labels(labels)
# ["Accuracy", "Loss", "AUROC", "Learning rate"]
# Example for models
labels = [
    "vanilla mlp",
    "vanilla cnn",
    "vanilla ffnn",
    "vanilla perceptron"
]
sanitize_ml_labels(labels)
# ["MLP", "CNN", "FFNN", "Perceptron"]
Corner cases
In some cases you may have a combination of terms separated by hyphens that must be removed, plus words that are correctly written with hyphens. We approach this problem with a heuristic based on an extended list of over 45K hyphenated English words, originally retrieved from the Metadata consulting website.
From such a word list, we generate an index by running:
# Bucket every hyphenated word by its first character, storing the full
# word alongside its remaining characters, so that lookups only need to
# scan a small bucket instead of the whole list.
index = {}
for word in words:
    word = word.lower()
    index.setdefault(word[0], []).append((word, word[1:]))
And from there the user experience is transparent and looks as follows:
# Running the following
sanitize_ml_labels("non-existent-edges-in-graph")
# will yield the string `Non-existent edges in graph`
The lookup heuristic that quickly finds a hyphenated word in a given label from the large haystack was written by Tommaso Fontana.
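For the curious, here is a minimal sketch of how such an index can be queried. The function find_hyphenated_words below, its signature, and its exact logic are illustrative assumptions, not the package's actual implementation:

def find_hyphenated_words(label: str, index: dict) -> list:
    # Hypothetical sketch: for each starting position in the label we
    # only scan the bucket matching the current character, instead of
    # the full 45K-word haystack.
    label = label.lower()
    found = []
    for position, character in enumerate(label):
        for word, _tail in index.get(character, []):
            # str.startswith accepts a start offset, so no slicing is needed.
            if label.startswith(word, position):
                found.append(word)
    return found

find_hyphenated_words("non-existent-edges-in-graph", index)
# ["non-existent"], assuming "non-existent" appears in the word list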
Extra utilities
Since I always use metric sanitization alongside axis normalization, it is useful to know which axes should be bounded between zero and one to avoid introducing visualization bias for the metrics.
For this reason I have created the method is_normalized_metric, which first sanitizes the given metric name and then validates it against known normalized metrics (metrics bounded between 0 and 1; if there is a better name for these, I could not figure one out). Analogously, I have also created the method is_absolutely_normalized_metric to validate whether a metric falls in the range between -1 and 1.
from sanitize_ml_labels import is_normalized_metric, is_absolutely_normalized_metric
is_normalized_metric("MSE") # False
is_normalized_metric("acc") # True
is_normalized_metric("accuracy") # True
is_normalized_metric("AUROC") # True
is_normalized_metric("auprc") # True
is_absolutely_normalized_metric("auprc") # False
is_absolutely_normalized_metric("MCC") # True
is_absolutely_normalized_metric("Markedness") # True
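As a concrete use case, these checks pair naturally with matplotlib to fix the axis limits of normalized metrics. The plot_metric helper below is an illustrative sketch, not part of this package:

import matplotlib.pyplot as plt

from sanitize_ml_labels import (
    is_absolutely_normalized_metric,
    is_normalized_metric,
    sanitize_ml_labels
)

def plot_metric(values: list, metric: str):
    # Plot a metric over epochs, clamping the y-axis range when the
    # metric is known to be normalized so plots remain comparable.
    fig, ax = plt.subplots()
    ax.plot(values)
    ax.set_xlabel("Epoch")
    ax.set_ylabel(sanitize_ml_labels(metric))
    if is_normalized_metric(metric):
        ax.set_ylim(0, 1)
    elif is_absolutely_normalized_metric(metric):
        ax.set_ylim(-1, 1)
    return fig, ax

plot_metric([0.5, 0.7, 0.8], "acc")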
New features and issues
As always, for new features and issues you can open either a new issue or a pull request. A pull request will always be the quicker way, but I’ll look into the issues when I get the time.
Tests Coverage
I have strived to maintain a 100% code coverage in this project:
Module | statements | missing | excluded | coverage
---|---|---|---|---
Total | 84 | 0 | 0 | 100%
sanitize_ml_labels/__init__.py | 3 | 0 | 0 | 100%
sanitize_ml_labels/__version__.py | 1 | 0 | 0 | 100%
sanitize_ml_labels/is_normalized_metric.py | 10 | 0 | 0 | 100%
sanitize_ml_labels/find_true_hyphenated_words.py | 19 | 0 | 0 | 100%
sanitize_ml_labels/sanitize_ml_labels.py | 70 | 0 | 0 | 100%
You can verify the test coverage of this repository by running in its root:
pytest --cov