The Python Data Valuation Library

A library for data valuation.

pyDVL collects algorithms for Data Valuation and Influence Function computation.

Refer to the Methods page of our documentation for a list of all implemented methods.

Data Valuation for machine learning is the task of assigning a scalar to each element of a training set which reflects its contribution to the final performance or outcome of some model trained on it. Some concepts of value depend on a specific model of interest, while others are model-agnostic. pyDVL focuses on model-dependent methods.

Comparison of different data valuation methods on best sample removal.
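
To make the idea of a per-sample value concrete, here is a minimal sketch of leave-one-out valuation, one of the simplest model-dependent notions of value. It does not use pyDVL: the 1-nearest-neighbour utility, the toy data, and the names `nearest_neighbor_accuracy` and `leave_one_out_values` are assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, int]  # (feature, label)


def nearest_neighbor_accuracy(train: Sequence[Point], test: Sequence[Point]) -> float:
    """Utility: test accuracy of a 1-nearest-neighbour classifier built on `train`."""
    if not train:
        return 0.0
    correct = 0
    for x, y in test:
        _, predicted = min(train, key=lambda point: abs(point[0] - x))
        correct += int(predicted == y)
    return correct / len(test)


def leave_one_out_values(
    train: Sequence[Point],
    test: Sequence[Point],
    utility: Callable[[Sequence[Point], Sequence[Point]], float],
) -> List[float]:
    """Value of sample i = utility(all data) - utility(all data without i)."""
    train = list(train)
    full_score = utility(train, test)
    return [
        full_score - utility(train[:i] + train[i + 1:], test)
        for i in range(len(train))
    ]


# Toy data (hypothetical): the training point at index 2 carries a flipped label.
train = [(0.0, 0), (1.0, 0), (2.0, 1), (4.0, 1), (5.0, 1)]
test = [(0.2, 0), (1.8, 0), (2.2, 0), (4.2, 1), (4.8, 1)]

values = leave_one_out_values(train, test, nearest_neighbor_accuracy)
for index, value in enumerate(values):
    print(f"sample {index}: value {value:+.2f}")
```

On this toy data the mislabeled sample receives a negative value, because removing it raises test accuracy. The remaining samples receive zero: each has a neighbour that makes it redundant. That is a known weakness of leave-one-out and one reason to look at methods that consider many subsets of the data.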

The Influence Function is an infinitesimal measure of the effect that a single training point has on the parameters of a model, or on any function of them. In machine learning, influence functions are also used to estimate the effect of training samples on individual test points.

Influences of input points with corrupted data. Highlighted points have flipped labels.
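
The following sketch works through an influence computation by hand for a one-parameter least-squares model, where the Hessian is a single number. It is not pyDVL code. It assumes the common form "test gradient times inverse Hessian times training gradient", with the sign chosen so that positive means useful, in line with the convention described in the usage section. The `hessian_regularization` argument only mirrors the name used in the example further down and is added to the Hessian before inversion as an assumption.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


def fit_slope(data: Sequence[Sample]) -> float:
    """Least-squares slope theta for the model y ~ theta * x."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


def loss(theta: float, sample: Sample) -> float:
    x, y = sample
    return 0.5 * (theta * x - y) ** 2


def loss_gradient(theta: float, sample: Sample) -> float:
    """Derivative of 0.5 * (theta * x - y)^2 with respect to theta."""
    x, y = sample
    return (theta * x - y) * x


def influences_on_test(
    train: Sequence[Sample],
    test_sample: Sample,
    hessian_regularization: float = 0.0,
) -> List[float]:
    """Influence of each training sample on the loss at `test_sample`."""
    theta = fit_slope(train)
    # Hessian of the mean training loss; a scalar for this one-parameter model.
    hessian = sum(x * x for x, _ in train) / len(train) + hessian_regularization
    test_gradient = loss_gradient(theta, test_sample)
    return [test_gradient * loss_gradient(theta, s) / hessian for s in train]


train = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 6.0)]  # last point is an outlier
test_sample = (2.0, 2.0)
influences = influences_on_test(train, test_sample)

# Sanity check: compare with actually refitting the model without each point.
theta_full = fit_slope(train)
for i, influence in enumerate(influences):
    theta_without_i = fit_slope(train[:i] + train[i + 1:])
    actual_change = loss(theta_without_i, test_sample) - loss(theta_full, test_sample)
    print(f"sample {i}: influence / n = {influence / len(train):+.4f}, "
          f"retraining change = {actual_change:+.4f}")
```

Dividing the influence by the number of training samples gives a first-order estimate of how the test loss changes when that sample is removed. On this toy data the estimate and the retraining result agree in sign for every sample, and the outlier is the only point with negative influence.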

Installation

To install the latest release use:

$ pip install pyDVL

You can also install the latest development version from TestPyPI:

$ pip install pyDVL --index-url https://test.pypi.org/simple/

pyDVL also has extra dependencies for certain functionality. For example, to use influence functions, run:

$ pip install pyDVL[influence]

For more instructions and information refer to Installing pyDVL in the documentation.
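
After installing, a quick way to confirm which release ended up in the active environment is to query the package metadata from Python. This is a small sketch using only the standard library (Python 3.8 or later); the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version("pyDVL") or "pyDVL is not installed in this environment")
```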

Usage

In the following subsections, we will showcase the usage of pyDVL for Data Valuation and Influence Functions using simple examples.

For more instructions and information refer to Getting Started in the documentation. We provide several examples for data valuation (e.g. Shapley Data Valuation) and for influence functions (e.g. Influence Functions for Neural Networks) with details on the algorithms and their applications.

Influence Functions

For influence computation, follow these steps:

  1. Import the necessary packages (the exact packages depend on your specific use case).

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    
    from pydvl.influence.torch import DirectInfluence
    from pydvl.influence.torch.util import NestedTorchCatAggregator, TorchNumpyConverter
    from pydvl.influence import SequentialInfluenceCalculator
    
  2. Create PyTorch data loaders for your train and test splits.

    input_dim = (5, 5, 5)
    output_dim = 3
    train_x = torch.rand((10, *input_dim))
    train_y = torch.rand((10, output_dim))
    test_x = torch.rand((5, *input_dim))
    test_y = torch.rand((5, output_dim))
    
    train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
    test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
    
  3. Instantiate your neural network model.

    nn_architecture = nn.Sequential(
      nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
      nn.Flatten(),
      nn.Linear(27, 3),
    )
    
  4. Define your loss:

    loss = nn.MSELoss()
    
  5. Instantiate an InfluenceFunctionModel and fit it to the training data:

    infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
    infl_model = infl_model.fit(train_data_loader)
    
  6. For small input data, call the influences method on the fitted instance:

    influences = infl_model.influences(test_x, test_y, train_x, train_y)
    

    The result is a tensor of shape (training samples x test samples) that contains at index (i, j) the influence of training sample i on test sample j.

  7. For larger data, wrap the model into a calculator and call methods on the calculator.

    infl_calc = SequentialInfluenceCalculator(infl_model)
    
    # Lazy object providing arrays batch-wise in a sequential manner
    lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
    
    # Trigger computation and pull results to memory
    influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
    
    # Trigger computation and write results batch-wise to disk
    lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
    

    The higher the absolute value of the influence of a training sample on a test sample, the more influential it is for the chosen test sample, model and data loaders. The sign of the influence determines whether it is useful (positive) or harmful (negative).
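
Once the influences are in memory, a common next step is to rank training samples from most harmful to most useful. The sketch below uses plain nested lists and made-up numbers, and assumes the layout described above (rows are training samples, columns are test samples). With a real result you would first verify the shape and then convert it, for instance with the tensor's `tolist()` method. The helper name `rank_by_mean_influence` is hypothetical.

```python
from typing import List


def rank_by_mean_influence(influences: List[List[float]]) -> List[int]:
    """Training indices sorted from most harmful to most useful on average."""
    mean_influence = [sum(row) / len(row) for row in influences]
    return sorted(range(len(influences)), key=lambda i: mean_influence[i])


# Hypothetical values: 4 training samples (rows) x 3 test samples (columns).
influences = [
    [0.30, 0.10, 0.20],
    [-0.50, -0.20, -0.40],
    [0.05, -0.02, 0.01],
    [0.10, 0.40, 0.00],
]

ranking = rank_by_mean_influence(influences)
print(ranking)  # [1, 2, 3, 0]
```

Samples at the front of the ranking are candidates for inspection, for example to look for flipped labels.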

Note: pyDVL currently only supports PyTorch for Influence Functions. We are planning to add support for Jax and perhaps TensorFlow or even Keras.

Data Valuation

The steps required to compute data values for your samples are:

  1. Import the necessary packages (the exact packages depend on your specific use case).

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from pydvl.utils import Dataset, Scorer, Utility
    from pydvl.value import (
        compute_shapley_values,
        ShapleyMode,
        MaxUpdates,
    )
    
  2. Create a Dataset object with your train and test splits.

    data = Dataset.from_sklearn(
        load_breast_cancer(),
        train_size=10,
        stratify_by_target=True,
        random_state=16,
    )
    
  3. Create an instance of a SupervisedModel (basically any sklearn compatible predictor).

    model = LogisticRegression()
    
  4. Create a Utility object to wrap the Dataset, the model and a scoring function.

    u = Utility(
        model,
        data,
        Scorer("accuracy", default=0.0),
    )
    
  5. Use one of the methods defined in the library to compute the values. In our example, we will use Permutation Monte Carlo Shapley, an approximate method for computing Data Shapley values.

    values = compute_shapley_values(
        u,
        mode=ShapleyMode.PermutationMontecarlo,
        done=MaxUpdates(100),
        seed=16,
        progress=True,
    )
    

    The result is an object of type ValuationResult that contains the indices and their values, as well as other attributes.

    The higher the value for an index, the more important it is for the chosen model, dataset and scorer.

  6. (Optional) Convert the valuation result to a dataframe to analyze and visualize the values.

    df = values.to_dataframe(column="data_value")
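
To illustrate what a permutation-based Monte Carlo estimate of Shapley values does, here is a minimal standalone sketch. It is not pyDVL's implementation: the function `permutation_shapley`, the fixed `n_permutations` budget (a crude stand-in for a stopping criterion) and the toy utility are all assumptions for illustration. Each random permutation is traversed from left to right, and every sample is credited with the change in utility observed when it joins the samples before it.

```python
import random
from typing import Callable, FrozenSet, List, Optional


def permutation_shapley(
    utility: Callable[[FrozenSet[int]], float],
    n_samples: int,
    n_permutations: int,
    seed: Optional[int] = None,
) -> List[float]:
    """Average marginal contribution of each index over random permutations."""
    rng = random.Random(seed)
    totals = [0.0] * n_samples
    indices = list(range(n_samples))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        coalition = set()
        previous_score = utility(frozenset(coalition))
        for index in indices:
            coalition.add(index)
            score = utility(frozenset(coalition))
            totals[index] += score - previous_score
            previous_score = score
    return [total / n_permutations for total in totals]


def toy_utility(coalition: FrozenSet[int]) -> float:
    """Hypothetical utility: additive weights plus a bonus when 0 and 1 co-occur."""
    weights = [0.1, 0.1, 0.3]
    score = sum(weights[i] for i in coalition)
    if 0 in coalition and 1 in coalition:
        score += 0.2
    return score


values = permutation_shapley(toy_utility, n_samples=3, n_permutations=2000, seed=16)
print([round(v, 2) for v in values])
```

For this toy utility the exact Shapley values are 0.2, 0.2 and 0.3, so the estimates should land close to those numbers. The estimates always sum to the utility of the full set minus that of the empty set, because the marginal contributions within each permutation telescope. In a real setting every call to the utility retrains and scores a model, which is why the number of permutations has to be budgeted.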
    

Contributing

Please open new issues for bugs, feature requests and extensions. You can read about the structure of the project, the toolchain and workflow in the guide for contributions.

License

pyDVL is distributed under LGPL-3.0. The complete license text is split across two files in the repository.

All contributions will be distributed under this license.
