The Python Data Valuation Library

A library for data valuation.

pyDVL collects algorithms for Data Valuation and Influence Function computation.

Data Valuation is the task of estimating the intrinsic value of a data point with respect to the training set, the model and a scoring function. We currently implement methods from the following papers:

Influence Functions compute the effect that single points have on an estimator / model. We implement methods from the following papers:

Installation

To install the latest release use:

pip install pyDVL

You can also install the latest development version from TestPyPI:

pip install pyDVL --index-url https://test.pypi.org/simple/
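After installing, it can help to confirm which release ended up in the environment. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this illustration and is not part of pyDVL.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper for this example only.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string of pyDVL if it is installed, otherwise None.
print(installed_version("pyDVL"))
```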

For more instructions and information refer to Installing pyDVL in the documentation.

Usage

pyDVL uses Memcached to cache certain results and speed up computation. You can run it either locally or using Docker:

docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest

You can read more in the caching module's documentation.
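To check that the cache is reachable before starting a long computation, one option is to send Memcached's plain-text version command to the port published by the Docker command above. This is a sketch using only the standard library: the helper name memcached_version is an assumption for this example, it is not part of pyDVL, and it does not reflect how pyDVL itself connects to the cache.

```python
import socket


def memcached_version(host="localhost", port=11211, timeout=1.0):
    """Return the Memcached server version, or None if it cannot be reached.

    Hypothetical helper for illustration; it speaks the text protocol directly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"version\r\n")
            reply = conn.recv(1024).decode("ascii", errors="replace").strip()
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None
    prefix = "VERSION "
    return reply[len(prefix):] if reply.startswith(prefix) else None


if __name__ == "__main__":
    found = memcached_version()
    print("Memcached is up, version " + found if found else "Memcached is not reachable")
```

A return value of None suggests that the container is not running or that port 11211 is not published on the host.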

Once that's done, the steps required to compute values for your samples are:

  1. Create a Dataset object with your train and test splits.
  2. Create an instance of a SupervisedModel (basically any scikit-learn compatible predictor).
  3. Create a Utility object to wrap the Dataset, the model and a scoring function.
  4. Use one of the methods defined in the library to compute the values.

This is how it looks for Truncated Monte Carlo Shapley, an efficient method for computing Data Shapley values:

import numpy as np
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy regression data: 50 samples with 2 features each
X, y = np.arange(100).reshape((50, 2)), np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=16
)
# Steps 1 to 3: wrap the data splits and the model in a Utility
dataset = Dataset(X_train, y_train, X_test, y_test)
model = LinearRegression()
utility = Utility(model, dataset)
# Step 4: estimate a value for each training point
values = compute_shapley_values(
    u=utility, max_iterations=100, mode="truncated_montecarlo"
)
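To give some intuition for what the truncated Monte Carlo mode estimates, here is a simplified, library-free sketch of the idea: sample random orderings of the points, accumulate each point's marginal contribution to the utility, and stop walking an ordering once the running utility is already within a tolerance of the utility of the full set. The names tmc_shapley and saturating_utility, the toy utility and all parameters are assumptions made for this illustration; this is not pyDVL's implementation.

```python
import random


def saturating_utility(subset):
    """Toy utility: every point adds 1 until two points are present."""
    return float(min(len(subset), 2))


def tmc_shapley(n, utility, n_permutations=200, tolerance=1e-3, seed=0):
    """Estimate Shapley values of n points by truncated permutation sampling."""
    rng = random.Random(seed)
    full = utility(frozenset(range(n)))
    totals = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = frozenset()
        previous = utility(subset)
        for i in order:
            if abs(full - previous) < tolerance:
                # Truncation: the remaining marginal contributions in this
                # ordering are treated as zero and not evaluated.
                break
            subset = subset | {i}
            current = utility(subset)
            totals[i] += current - previous
            previous = current
    return [total / n_permutations for total in totals]


values = tmc_shapley(3, saturating_utility, n_permutations=300)
print(values)
```

With this toy utility the three points are interchangeable, so the estimates should come out roughly equal, and they add up to the utility of the full set. In a real setting each utility evaluation means retraining the model, which is why skipping the tail of an ordering saves time.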

For more instructions and information refer to Getting Started in the documentation. We provide several examples with details on the algorithms and their applications.

Contributing

Please open new issues for bugs, feature requests and extensions. You can read about the structure of the project, the toolchain and workflow in the guide for contributions.

License

pyDVL is distributed under LGPL-3.0. A complete version can be found in two files: here and here.

All contributions will be distributed under this license.
