Measuring data importance over ML pipelines using the Shapley value.

Project description

Ease.ml/Datascope: Guiding your Data-centric Data Iterations, over End-to-end ML pipelines

Developing ML applications is data-centric: often the quality of your model is a reflection of the quality of your underlying data. In the era of data-centric AI, the fundamental question becomes:

Which training data example is most important to improve the accuracy/fairness of my ML model?

Once you know these importances, you can use them to support a range of applications: cleaning your data and fixing data bugs, data acquisition, data summarization, etc. (e.g., https://arxiv.org/pdf/1911.07128.pdf).

DataScope is a tool for inspecting ML pipelines by measuring how important each training data point is. Its most prominent feature is that it supports not only a single ML model but any sklearn Pipeline. It is also fast: up to four orders of magnitude faster than previous approaches. The secret sauce of DataScope is a collection of new results on computing the Shapley value of a specific family of ML models (K-nearest neighbor classifiers) in PTIME (polynomial time), over relational data provenance. To learn more about how DataScope works, see the main reference, https://arxiv.org/abs/2204.11131; a series of our previous studies on KNN Shapley proxies can be found at https://ease.ml/datascope.
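To give a feel for why K-nearest neighbor classifiers admit fast Shapley computation, here is a minimal sketch of the closed-form recursion from the earlier KNN Shapley line of work, for a plain unweighted KNN classifier and a single validation point. It is an illustration only and not DataScope's implementation: it ignores pipelines and data provenance, and the function name knn_shapley_single, the Euclidean distance, and the requirement k <= n are assumptions made for this example.

```python
import numpy as np


def knn_shapley_single(X_train, y_train, x_val, y_val, k=1):
    """Exact Shapley values of training points for one validation point.

    Assumes an unweighted KNN utility: the fraction of the k nearest
    training points whose label equals y_val. Hypothetical helper,
    not part of the datascope package.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    n = len(X_train)
    if not 1 <= k <= n:
        raise ValueError("this sketch assumes 1 <= k <= number of training points")

    # Sort training points from nearest to farthest from the validation point.
    dist = np.linalg.norm(X_train - np.asarray(x_val, dtype=float), axis=1)
    order = np.argsort(dist, kind="stable")
    match = (y_train[order] == y_val).astype(float)

    # Start at the farthest point and walk back towards the nearest one.
    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank of the point in the sorted order
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank

    # Map the values back to the original order of the training set.
    values = np.zeros(n)
    values[order] = s
    return values
```

After one sort, the loop touches each training point once, so the cost for a validation point is dominated by the O(n log n) sort rather than by enumerating subsets of the training data. Averaging the result over all validation points gives a score per training example.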

In just seconds, you can get the importance score for each of your training examples and start your data-centric cleaning/debugging iterations!

DataScope is part of the Ease.ML data-centric ML DevOps eco-system: https://Ease.ML

References

@inproceedings{karlas2024,
  author = {Bojan Karlaš and David Dao and Matteo Interlandi and Sebastian Schelter and Wentao Wu and Ce Zhang},
  title = {Data Debugging with Shapley Importance over Machine Learning Pipelines},
  booktitle = {The Twelfth International Conference on Learning Representations},
  year = {2024},
  url = {https://openreview.net/forum?id=qxGXjWxabq},
}

Quick Start

Install by running:

pip install datascope

We can compute the Shapley importance scores for a scikit-learn pipeline (here the variable pipeline) using a training dataset (X_train, y_train) and a validation dataset (X_val, y_val) as follows:

from datascope.importance.common import SklearnModelAccuracy
from datascope.importance.shapley import ShapleyImportance

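# Utility: the accuracy that the pipeline achieves on the validation data.
utility = SklearnModelAccuracy(pipeline)
# "neighbor" selects the nearest neighbor method for computing Shapley values.
importance = ShapleyImportance(method="neighbor", utility=utility)
# Fit on the training data, then score it against the validation data.
importances = importance.fit(X_train, y_train).score(X_val, y_val)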

The variable importances contains Shapley values of all data examples in (X_train, y_train) computed using the nearest neighbor method (i.e. "neighbor").
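The snippet above assumes that pipeline, X_train, y_train, X_val and y_val already exist. As a minimal sketch, one way to create them with scikit-learn is shown below. The synthetic dataset, the 80/20 split and the two pipeline steps are assumptions chosen for illustration; any dataset and any scikit-learn Pipeline could take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for your own dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% of the examples as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A small preprocessing + model pipeline (illustrative choice of steps).
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
```

With these variables defined, the quick-start lines can be run as written.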

For a more complete example workflow, see the demo Colab notebook.

Why datascope?

Shapley values help you find faulty data examples much faster than inspecting examples in random order. For example, say you are given a dataset in which 50% of the labels are corrupted, and you want to repair them one by one. Which one should you repair first?

Example data repair workflow using datascope

In the figure above, we compare different methods for prioritizing which data examples to repair: random selection and several methods based on Shapley importance. After each repair, we measure the accuracy of an XGBoost model. The left figure shows that every importance-based method beats random selection. Furthermore, the KNN method (i.e. the "neighbor" method) reaches peak performance after repairing only 50% of the labels.
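The prioritization step of such a repair workflow can be sketched in a few lines. The example assumes that the lowest (most negative) importance scores mark the examples that hurt validation accuracy the most, which is the usual reading of Shapley-based importance, and that an oracle such as a human annotator can supply the correct label on request. The helper name repair_by_importance and the y_clean array standing in for the oracle are hypothetical.

```python
import numpy as np


def repair_by_importance(importances, y_dirty, y_clean, budget):
    """Repair `budget` labels, starting with the lowest importance scores.

    Hypothetical helper: y_clean stands in for an oracle (e.g. a human
    annotator) that returns the correct label of any chosen example.
    """
    # Ascending order: the most harmful examples come first.
    order = np.argsort(importances, kind="stable")
    chosen = order[:budget]

    # Copy so the caller's dirty labels stay untouched.
    y_repaired = np.array(y_dirty, copy=True)
    y_repaired[chosen] = np.asarray(y_clean)[chosen]
    return y_repaired, chosen


# Toy usage: four examples, two of them (indices 1 and 3) mislabeled.
importances = np.array([0.3, -0.2, 0.1, -0.5])
y_dirty = np.array([1, 0, 1, 0])
y_clean = np.array([1, 1, 1, 1])
y_repaired, chosen = repair_by_importance(importances, y_dirty, y_clean, budget=2)
```

In a full workflow you would retrain the model and recompute the importance scores after each batch of repairs, since the scores depend on the current state of the training labels.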

ease.ml/datascope speeds up data debugging by allowing you to focus on the most important data examples first

The right figure compares the running time of three methods: the "neighbor" method, and the "montecarlo" method with 10 and with 100 iterations. Our KNN-based importance computation method is orders of magnitude faster than the state-of-the-art Monte-Carlo method.
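The speed gap is easier to appreciate with the cost of Monte-Carlo estimation in view. Below is a minimal sketch of the standard permutation-sampling estimator of Shapley values. It is a generic textbook version and not the code behind the "montecarlo" method in datascope; the function name montecarlo_shapley and the utility callable, which maps a list of training indices to a score, are assumptions made for this example.

```python
import numpy as np


def montecarlo_shapley(utility, n, iterations=10, seed=0):
    """Estimate Shapley values of n players by sampling permutations.

    `utility` is a hypothetical callable that takes a list of player
    indices and returns a float, e.g. the validation accuracy of a
    model trained on just those training examples.
    """
    rng = np.random.default_rng(seed)
    values = np.zeros(n)
    for _ in range(iterations):
        perm = rng.permutation(n)
        subset = []
        prev = utility(subset)  # utility of the empty set
        for i in perm:
            subset.append(int(i))
            cur = utility(subset)
            # Marginal contribution of player i in this permutation.
            values[i] += cur - prev
            prev = cur
    return values / iterations
```

Each iteration calls the utility n + 1 times, and when the utility means retraining a model on a subset of the data, every one of those calls is a full training run. The total cost therefore grows with iterations times the dataset size times the training time, which is what the nearest neighbor method avoids.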

The "neighbor" method in ease.ml/datascope can compute importances in seconds for datasets of several thousand examples

Project details

Release history

Version	Release date
0.0.32 (this version)	Feb 5, 2025
0.0.31	Sep 30, 2024
0.0.30	Jul 5, 2024
0.0.29	May 25, 2024
0.0.28	May 21, 2024
0.0.27	May 18, 2024
0.0.26	May 2, 2024
0.0.25	Apr 17, 2024
0.0.24	Apr 9, 2024
0.0.23	Mar 2, 2024
0.0.22	Mar 1, 2024
0.0.21	Mar 1, 2024
0.0.20	Mar 1, 2024
0.0.19	Feb 28, 2024
0.0.18	Feb 24, 2024
0.0.17	Feb 23, 2024
0.0.16	Feb 20, 2024
0.0.15	Feb 17, 2024
0.0.14	Feb 17, 2024
0.0.13	Aug 15, 2023
0.0.12	May 15, 2023
0.0.10	Apr 5, 2023
0.0.9	Feb 11, 2023
0.0.8	Dec 27, 2022
0.0.6	Dec 26, 2022
0.0.5	Dec 26, 2022
0.0.4	Dec 23, 2022
0.0.3	Jun 16, 2022
0.0.3a0 (pre-release)	Jun 16, 2022
0.0.2	Apr 23, 2022
0.0.1	Apr 23, 2022

Download files


Source Distribution

datascope-0.0.32.tar.gz (33.9 kB)

Uploaded Feb 5, 2025 Source

File details

Details for the file datascope-0.0.32.tar.gz.

File metadata

Download URL: datascope-0.0.32.tar.gz
Upload date: Feb 5, 2025
Size: 33.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.21

File hashes

Hashes for datascope-0.0.32.tar.gz
Algorithm	Hash digest
SHA256	`8b6ce9028db6ff2f9ee88c50bf634d0835f1194eb087ecee4a860a6b36bcd20b`
MD5	`9c0c82ca60bf491c5acbfbc1c88538f5`
BLAKE2b-256	`35546ccca0b57d48e242e3980676452984b7a9a24a7075a6d0a4a704339eb9e2`

